# Assessment
Analysis of the homelab's current strengths and gaps, used to prioritize the roadmap phases.
## Strengths
Full IaC pipeline. Every layer, from Proxmox host configuration through application deployment, is codified and reproducible; `make k8s-deploy` rebuilds the cluster from zero. This matches how production infrastructure teams operate and is rare in homelabs.
GitOps discipline. ApplicationSet with a Git file generator for automatic app discovery, automated sync with prune and self-heal, and Renovate with digest pinning on a weekly schedule. No manual `kubectl apply` for day-2 operations.
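The ApplicationSet pattern described above typically looks like the sketch below. The repo URL, file path, and namespace layout are illustrative placeholders, not taken from this homelab:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: apps
  namespace: argocd
spec:
  generators:
    # Git file generator: each config.json found under apps/ becomes an Application
    - git:
        repoURL: https://github.com/example/homelab.git   # placeholder
        revision: main
        files:
          - path: "apps/**/config.json"
  template:
    metadata:
      name: "{{path.basename}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/example/homelab.git   # placeholder
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"
      syncPolicy:
        automated:
          prune: true      # delete resources removed from Git
          selfHeal: true   # revert out-of-band changes
```

Adding a new app is then a Git commit that creates a new directory with a `config.json`; Argo CD discovers and deploys it on the next sync.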
Externalized secrets. Vault with AWS KMS auto-unseal and ESO using Kubernetes auth is the industry-standard pattern. No secrets in Git, no static credentials.
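The Vault + ESO pattern is conventionally wired as a `SecretStore` using Kubernetes auth plus an `ExternalSecret` per consumer. A minimal sketch, with namespace, role, and KV paths as placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault
  namespace: example-app            # placeholder namespace
spec:
  provider:
    vault:
      server: http://vault.vault.svc:8200   # placeholder in-cluster address
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: example-app          # Vault role bound to this namespace's ServiceAccount
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
  namespace: example-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: SecretStore
  target:
    name: app-credentials            # Kubernetes Secret that ESO creates and keeps in sync
  data:
    - secretKey: password
      remoteRef:
        key: example-app/config      # placeholder KV v2 path
        property: password
```

The pod authenticates to Vault with its ServiceAccount token, so no static Vault credential ever lands in Git or in the cluster.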
Layered security. Cilium network policies (default-deny per namespace), Kyverno admission policies, non-root security contexts with dropped capabilities, gitleaks + Trivy in CI, Authentik SSO on every service. Multiple independent controls at different layers.
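A per-namespace default-deny baseline like the one described can be expressed with a standard `NetworkPolicy`, which Cilium enforces alongside its own `CiliumNetworkPolicy` CRDs; the actual policies in this lab may use either form. An illustrative sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: example-app   # applied once per namespace
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress               # with no rules listed, all traffic is denied by default
```

Explicit allow policies are then layered on top per service, so any traffic not deliberately permitted is dropped.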
Complete observability. Metrics, logs, alerting, audit logging, capacity planning (VPA/Goldilocks), and synthetic monitoring (Uptime Kuma). Custom PrometheusRules for app health, infrastructure, backups, and node resources.
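A custom PrometheusRule of the kind mentioned is typically a small CRD picked up by the Prometheus Operator. A hedged sketch for backup health (the metric name is Velero's failure counter; the namespace and thresholds are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring          # placeholder
spec:
  groups:
    - name: backups
      rules:
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "A Velero backup has failed in the last hour"
```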
Exceptional documentation. 15 ADRs, architecture docs for every subsystem, runbooks for DR/upgrades/troubleshooting/SSO bypass, auto-published MkDocs site.
Clean operational interface. Makefile targets for every operational task. Reloader for config-driven restarts. Descheduler for pod rebalancing. Low operator toil.
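Reloader's config-driven restarts usually come down to a single annotation on the workload; a sketch using Stakater Reloader's documented opt-in annotation:

```yaml
# Excerpt from a Deployment manifest; the annotation is all Reloader needs
metadata:
  annotations:
    reloader.stakater.com/auto: "true"   # roll pods when a referenced ConfigMap or Secret changes
```

This keeps restarts declarative: updating a ConfigMap in Git triggers a rolling restart with no manual intervention.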
## Gaps
### Physical Layer
| # | Gap | Risk | Severity / Status |
|---|---|---|---|
| P1 | No UPS | Power event corrupts NVMe mid-write, kills NAS mid-IO, or causes unclean Proxmox/etcd shutdown. | Resolved |
| P2 | Single NAS drive | One drive failure loses all NFS-backed data: media, app configs, Prometheus, Loki, Vault, Velero backups. | Critical |
| P3 | Running at GbE when 10G is available | MS-01 has 2× 10G SFP+ ports unused. NFS throughput and future live migration are bottlenecked at 1 Gbps. USW-16-PoE has 1G SFP ports only. | Low |
| P4 | Single compute host | All VMs on one machine. Hardware failure means total cluster loss. | High |
| P5 | Unused PCIe x16 slot | Half-height PCIe 4.0 x16 available for a dedicated GPU, HBA, or NIC. | Informational |
| P6 | No IPMI/remote management | MS-01 supports Intel vPro AMT but it is not configured. Hung host requires physical access. | Resolved |
### Network Layer
| # | Gap | Risk | Severity / Status |
|---|---|---|---|
| N1 | No dedicated management VLAN | Proxmox, switch, PDU, and NAS management share VLANs with production or household traffic. | Resolved |
| N2 | No IoT VLAN | Smart home devices (if any) share the default VLAN with household devices and the NAS. | Low |
| N3 | DNS is manual static entries | Adding a service requires a manual UniFi console edit. | Medium |
| N4 | WireGuard VPN not configured | No way to reach the homelab off-site. | Medium |
| N5 | No external access path | No reverse proxy, Cloudflare Tunnel, or Tailscale Funnel for sharing services externally. | Low |
| N6 | Unrestricted internet egress from Homelab VLAN | A compromised pod can reach any external destination. | Low |
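Closing N4 is mostly a matter of a small WireGuard config on the gateway plus one peer entry per client. An illustrative fragment with placeholder addresses and keys (not this lab's actual values):

```ini
# /etc/wireguard/wg0.conf on the homelab gateway -- illustrative values only
[Interface]
Address = 10.10.0.1/24
ListenPort = 51820
PrivateKey = <server-private-key>

[Peer]
# Road-warrior laptop or phone
PublicKey = <client-public-key>
AllowedIPs = 10.10.0.2/32
```

The client mirrors this with the server as its peer and `AllowedIPs` set to the homelab subnets it should reach.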
### Kubernetes / Software Layer
| # | Gap | Risk | Severity / Status |
|---|---|---|---|
| K1 | Single control plane | API server, etcd, and scheduler are a single point of failure. | High |
| K2 | Kyverno audit-mode policies not enforced | `require-resource-limits`, `require-run-as-nonroot`, `require-readonly-rootfs` only report. | Resolved |
| K3 | No ResourceQuotas or LimitRanges | A runaway pod can OOM an entire node and cascade-kill neighbors. | Medium |
| K4 | Vault standalone, no HA | Single Vault pod on NFS. Pod failure loses secret access cluster-wide. | Medium |
| K5 | No offsite backup copy | Velero backs up to MinIO on the same NAS as production data. | Resolved |
| K6 | Authentik Redis unauthenticated | `auth.enabled: false`. Network policies mitigate exposure, but any pod in the auth namespace has access. | Resolved |
| K7 | Prometheus TSDB on NFS | Heavy random I/O on NFS degrades query performance and risks TSDB corruption. | Resolved |
| K8 | No HPA | Nothing scales horizontally under load. | Low |
| K9 | No pod topology spread constraints | Scheduler may co-locate critical services on one node. | Medium |
| K10 | No distributed tracing | Debugging cross-service request flows requires manual log correlation. | Low |
| K11 | No image registry allowlist | Any registry allowed. No protection against pulls from untrusted sources. | Low |
| K12 | No chaos testing | DR runbooks exist but are never automatically validated. | Low |
| K13 | No supply chain verification | No cosign signature verification or SBOM generation. | Low |
| K14 | Grafana dashboards are click-ops | Dashboards not stored in Git. DR event could lose custom dashboards. | Medium |
| K15 | No cert-manager health alerting | cert-manager pod failures or renewal errors are not monitored. | Low |
| K16 | No etcd snapshot schedule | Single control plane with no dedicated etcd backup. Velero backs up API resources but an etcd corruption or quorum loss requires a snapshot to restore. | Resolved |
| K17 | No Loki retention policy | Logs grow unbounded on NFS. No compaction or retention limits configured. | Resolved |
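K3 is typically closed with a `LimitRange` (per-container defaults) plus a `ResourceQuota` (namespace ceiling). A sketch with assumed values; the namespace and limits are placeholders to be tuned per workload:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: example-app     # placeholder; one per namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied to containers that declare no requests
        cpu: 50m
        memory: 64Mi
      default:               # applied to containers that declare no limits
        cpu: 500m
        memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
  namespace: example-app
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.memory: 8Gi       # caps total memory a runaway namespace can claim
```

With both in place, a runaway pod hits its own limit or the namespace quota before it can OOM the node and cascade into neighbors.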
## Gap-to-Phase Mapping
| Gap | Addressed In |
|---|---|
| P2 | Phase 1 -- Foundations |
| K3, K9, K11, K15, N6 | Phase 2 -- Kubernetes Hardening |
| P3, N3, N4, N5 | Phase 3 -- Network |
| P4, K1, K4 | Phase 4 -- Compute & Storage |
| K8, K10, K14 | Phase 5 -- Observability |
| K12, K13 | Phase 6 -- Platform Engineering |
| P5, N2 | Phase 7 -- Long-Term Vision |