Phase 2 -- Kubernetes Hardening

Status: In progress

Goal: Close the software gaps that could cause outages or security incidents under normal operation.

2.1 Add ResourceQuotas and LimitRanges

Define a default LimitRange for every namespace (default CPU/memory requests and limits)
Define a ResourceQuota per namespace (hard ceiling on total resource consumption)
Use Goldilocks/VPA recommendations to inform initial values
Deploy via Kustomize components or per-namespace manifests
Start generous, tighten based on observed usage


Why	A single misbehaving pod can consume all memory on a node and cascade-kill its neighbors. With only 56 GB of worker RAM in total, one pod can cause a cluster-wide outage.
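
A minimal sketch of the two objects described above, assuming a hypothetical namespace `example-app`. Every number here is a placeholder, not a recommendation; the real values should come from the Goldilocks/VPA output and start on the generous side.

```yaml
# Hypothetical namespace and values; replace with Goldilocks/VPA-informed numbers.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: example-app
spec:
  limits:
    - type: Container
      # Applied to containers that declare no requests of their own.
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      # Applied to containers that declare no limits of their own.
      default:
        cpu: 500m
        memory: 512Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: namespace-quota
  namespace: example-app
spec:
  # Hard ceiling on the sum of all pods in the namespace.
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

Once a ResourceQuota covers CPU or memory, pods without requests and limits for that resource are rejected at admission, which is why the LimitRange defaults should be in place first.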

Add topologySpreadConstraints to: Authentik, Grafana, Prometheus, ArgoCD, Vault
Use whenUnsatisfiable: ScheduleAnyway (soft constraint) so that pods still schedule on a 3-node cluster when an even spread is not possible
Verify pods distribute across nodes after rollout


Why	Without topology hints, the scheduler may co-locate critical services on one node. A single node failure could take out auth, monitoring, and GitOps simultaneously.
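
As an illustration, the soft constraint could look like the pod-template fragment below. The label `app.kubernetes.io/name: grafana` is an assumption; use whatever labels the actual Deployment or StatefulSet puts on its pods.

```yaml
# Fragment of a Deployment/StatefulSet spec; the label value is assumed.
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          # Spread across nodes (one topology domain per hostname).
          topologyKey: kubernetes.io/hostname
          # Soft constraint: prefer spreading, but never block scheduling.
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
```

After rollout, `kubectl get pods -o wide` with the same label selector shows which node each replica landed on. The constraint only has a visible effect for workloads running more than one replica.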

Create a Kyverno ClusterPolicy restricting image pulls to trusted registries
Allowlist: docker.io, ghcr.io, quay.io, registry.k8s.io, lscr.io, and any others in use
Verify all existing workloads pass the new policy before enforcing


Why	Currently any registry is allowed. A typo or malicious upstream could pull from an untrusted source.
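
A sketch of such a policy, modeled on the commonly used Kyverno "restrict image registries" pattern. It starts in audit mode so existing workloads can be checked before enforcement. The policy name is hypothetical, and the placement of the failure-action field differs between Kyverno versions, so check it against the installed release.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries   # hypothetical name
spec:
  # Audit first; switch to Enforce once existing workloads pass.
  # Newer Kyverno releases move this setting to the rule level.
  validationFailureAction: Audit
  background: true
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must come from an approved registry."
        pattern:
          spec:
            # Optional fields use the =() anchor: checked only if present.
            =(initContainers):
              - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.k8s.io/* | lscr.io/*"
            containers:
              - image: "docker.io/* | ghcr.io/* | quay.io/* | registry.k8s.io/* | lscr.io/*"
```

The pattern matches the image string as written in the manifest, so an unqualified reference such as `nginx:1.27` would not match `docker.io/*`. While the policy is in audit mode, violations show up in the policy reports (`kubectl get policyreport -A`).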

Add PrometheusRules for cert-manager pod readiness
Add PrometheusRules for certificate renewal failures (certmanager_certificate_ready_status{condition="True"} == 0)
Add PrometheusRules for issuer errors


Why	Existing alerts fire when a certificate is 14 days from expiry. But if cert-manager is dead, renewals silently stop and the alert only fires when expiry is imminent.
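
A possible starting point, assuming the Prometheus Operator CRDs, kube-state-metrics, and cert-manager running in a namespace called `cert-manager`. The rule names, `monitoring` namespace, durations, and severities are placeholders. To the best of my knowledge `certmanager_certificate_ready_status` carries a `condition` label, so the expression filters on `condition="True"`; confirm this against the metrics the installed version actually exposes. An issuer-error rule is left out because the relevant metric depends on the cert-manager version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-health      # hypothetical name
  namespace: monitoring          # assumed namespace
spec:
  groups:
    - name: cert-manager
      rules:
        # Fires when a cert-manager pod exists but is not Ready
        # (relies on kube-state-metrics).
        - alert: CertManagerPodNotReady
          expr: kube_pod_status_ready{namespace="cert-manager", condition="true"} == 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "cert-manager pod {{ $labels.pod }} is not ready"
        # Fires when a Certificate has not been Ready for a sustained period.
        - alert: CertificateNotReady
          expr: certmanager_certificate_ready_status{condition="True"} == 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.name }} is not ready"
```

Neither rule fires if the underlying series disappears entirely (for example, if the pod is deleted and never rescheduled), so an `absent()`-based rule on the cert-manager scrape target may be worth adding alongside these.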

Audit current pod egress patterns (DNS, NFS, external APIs, container registries)
Add CiliumNetworkPolicy egressDeny or implicit-deny rules per namespace
Allowlist required destinations: DNS (kube-dns), NFS (192.168.1.158), and service-specific external endpoints
Verify all workloads function after applying policies


Why	A compromised pod can reach any external destination. Restricting egress limits the blast radius of a container breakout or supply chain attack.
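Approach	Use Cilium's implicit-deny model: once a policy with egress rules selects an endpoint, only the explicitly allowed egress is permitted and everything else is dropped. Do not combine `egressDeny` world CIDRs with `egress` allow rules on the same policy. In Cilium, deny rules take precedence over allow rules, so the traffic you meant to allowlist is silently dropped.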
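
The implicit-deny approach could be sketched as follows for a hypothetical namespace `example-app`. The NFS port (2049/TCP, i.e. NFSv4) is an assumption, and service-specific external endpoints would be added as further `egress` entries after the audit.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: restrict-egress          # hypothetical name
  namespace: example-app         # hypothetical namespace
spec:
  # Select every pod in the namespace. Because egress rules are present,
  # anything not listed below is denied by default.
  endpointSelector: {}
  egress:
    # DNS to kube-dns in kube-system.
    - toEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": kube-system
            "k8s:k8s-app": kube-dns
      toPorts:
        - ports:
            - port: "53"
              protocol: UDP
            - port: "53"
              protocol: TCP
    # NFS server (port assumed to be NFSv4).
    - toCIDR:
        - 192.168.1.158/32
      toPorts:
        - ports:
            - port: "2049"
              protocol: TCP
```

Note that the policy contains no `egressDeny` section at all; the deny behavior comes purely from the presence of `egress` allow rules.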