Phase 2 -- Kubernetes Hardening¶
Status: In progress
Goal: Close the software gaps that could cause outages or security incidents under normal operation.
Addresses: K3, K9, K11, K15, N6
2.1 Add ResourceQuotas and LimitRanges¶
- Define a default
LimitRangefor every namespace (default CPU/memory requests and limits) - Define a
ResourceQuotaper namespace (hard ceiling on total resource consumption) - Use Goldilocks/VPA recommendations to inform initial values
- Deploy via Kustomize components or per-namespace manifests
- Start generous, tighten based on observed usage
| Why | A single misbehaving pod can consume all node memory and cascade-kill neighbors. On 56 GB total worker RAM, one pod can cause a cluster-wide outage. |
2.2 Add Pod Topology Spread Constraints¶
- Add
topologySpreadConstraintsto: Authentik, Grafana, Prometheus, ArgoCD, Vault - Use
whenUnsatisfiable: ScheduleAnyway(soft constraint) to avoid blocking on a 3-node cluster - Verify pods distribute across nodes after rollout
| Why | Without topology hints, the scheduler may co-locate critical services on one node. A single node failure could take out auth, monitoring, and GitOps simultaneously. |
2.3 Add Image Registry Allowlist¶
- Create a Kyverno
ClusterPolicyrestricting image pulls to trusted registries - Allowlist:
docker.io,ghcr.io,quay.io,registry.k8s.io,lscr.io, and any others in use - Verify all existing workloads pass the new policy before enforcing
| Why | Currently any registry is allowed. A typo or malicious upstream could pull from an untrusted source. |
2.4 cert-manager Health Alerting¶
- Add PrometheusRules for cert-manager pod readiness
- Add PrometheusRules for certificate renewal failures (
certmanager_certificate_ready_status == 0) - Add PrometheusRules for issuer errors
| Why | Existing alerts fire when a certificate is 14 days from expiry. But if cert-manager is dead, renewals silently stop and the alert only fires when expiry is imminent. |
2.5 Restrict Pod Egress to Known Destinations¶
- Audit current pod egress patterns (DNS, NFS, external APIs, container registries)
- Add CiliumNetworkPolicy
egressDenyor implicit-deny rules per namespace - Allowlist required destinations: DNS (kube-dns), NFS (192.168.1.158), and service-specific external endpoints
- Verify all workloads function after applying policies
| Why | A compromised pod can reach any external destination. Restricting egress limits the blast radius of a container breakout or supply chain attack. |
| Approach | Use Cilium's implicit deny model (allow specific egress, deny all else). Do not combine egressDeny world CIDRs with egress allow rules on the same policy -- this causes silent drops due to Cilium policy evaluation order. |
Definition of Done¶
- Every namespace has a ResourceQuota and LimitRange
- Critical pods spread across nodes
- Only trusted registries allowed
- cert-manager failures trigger alerts
- Pod egress restricted to known destinations per namespace