# Assessment
Analysis of the homelab's current strengths and gaps, used to prioritize the roadmap phases.
## Strengths
Full IaC pipeline. Every layer, from Proxmox host configuration through application deployment, is codified and reproducible; `make k8s-deploy` rebuilds the cluster from zero. This matches how production infrastructure teams operate and is rare in homelabs.
GitOps discipline. ApplicationSet with a Git file generator for automatic app discovery, automated sync with prune and self-heal, and Renovate with digest pinning on a weekly schedule. No manual `kubectl apply` for day-2 operations.
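The ApplicationSet pattern described above typically looks like the sketch below. The repo URL, file path, and namespace layout are illustrative placeholders, not taken from this homelab:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: apps
  namespace: argocd
spec:
  generators:
    # Git file generator: each config.json found under apps/ becomes an Application
    - git:
        repoURL: https://github.com/example/homelab.git   # placeholder
        revision: main
        files:
          - path: "apps/**/config.json"
  template:
    metadata:
      name: "{{path.basename}}"
    spec:
      project: default
      source:
        repoURL: https://github.com/example/homelab.git   # placeholder
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"
      syncPolicy:
        automated:
          prune: true      # delete resources removed from Git
          selfHeal: true   # revert out-of-band changes
```

Adding a new app is then a Git commit that creates a new directory with a `config.json`; Argo CD discovers and deploys it on the next sync.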
Externalized secrets. Vault with AWS KMS auto-unseal and ESO using Kubernetes auth is the industry-standard pattern. No secrets in Git, no static credentials.
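The Vault + ESO pattern is conventionally wired as a `SecretStore` using Kubernetes auth plus an `ExternalSecret` per consumer. A minimal sketch, with namespace, role, and KV paths as placeholders:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault
  namespace: example-app            # placeholder namespace
spec:
  provider:
    vault:
      server: http://vault.vault.svc:8200   # placeholder in-cluster address
      path: secret
      version: v2
      auth:
        kubernetes:
          mountPath: kubernetes
          role: example-app          # Vault role bound to this namespace's ServiceAccount
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
  namespace: example-app
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: SecretStore
  target:
    name: app-credentials            # Kubernetes Secret that ESO creates and keeps in sync
  data:
    - secretKey: password
      remoteRef:
        key: example-app/config      # placeholder KV v2 path
        property: password
```

The pod authenticates to Vault with its ServiceAccount token, so no static Vault credential ever lands in Git or in the cluster.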
Layered security. Cilium network policies (default-deny per namespace), Kyverno admission policies, non-root security contexts with dropped capabilities, gitleaks + Trivy in CI, Authentik SSO on every service. Multiple independent controls at different layers.
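A per-namespace default-deny baseline like the one described can be expressed with a standard `NetworkPolicy`, which Cilium enforces alongside its own `CiliumNetworkPolicy` CRDs; the actual policies in this lab may use either form. An illustrative sketch:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: example-app   # applied once per namespace
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress               # with no rules listed, all traffic is denied by default
```

Explicit allow policies are then layered on top per service, so any traffic not deliberately permitted is dropped.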
Complete observability. Metrics, logs, alerting, audit logging, capacity planning (VPA/Goldilocks), and synthetic monitoring (Uptime Kuma). Custom PrometheusRules for app health, infrastructure, backups, and node resources.
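A custom PrometheusRule of the kind mentioned is typically a small CRD picked up by the Prometheus Operator. A hedged sketch for backup health (the metric name is Velero's failure counter; the namespace and thresholds are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring          # placeholder
spec:
  groups:
    - name: backups
      rules:
        - alert: VeleroBackupFailed
          expr: increase(velero_backup_failure_total[1h]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "A Velero backup has failed in the last hour"
```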
Exceptional documentation. 15 ADRs, architecture docs for every subsystem, runbooks for DR/upgrades/troubleshooting/SSO bypass, auto-published MkDocs site.
Clean operational interface. Makefile targets for every operational task. Reloader for config-driven restarts. Descheduler for pod rebalancing. Low operator toil.
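Reloader's config-driven restarts usually come down to a single annotation on the workload; a sketch using Stakater Reloader's documented opt-in annotation:

```yaml
# Excerpt from a Deployment manifest; the annotation is all Reloader needs
metadata:
  annotations:
    reloader.stakater.com/auto: "true"   # roll pods when a referenced ConfigMap or Secret changes
```

This keeps restarts declarative: updating a ConfigMap in Git triggers a rolling restart with no manual intervention.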
## Gaps
### Physical Layer
| # | Gap | Risk | Severity / Status |
|---|---|---|---|
| P1 | No UPS | Power event corrupts NVMe mid-write, kills NAS mid-IO, or causes unclean Proxmox/etcd shutdown. | Resolved |
| P2 | Single NAS drive | One drive failure loses all NFS-backed data: media, app configs, Prometheus, Loki, Vault, Velero backups. | Critical |
| P3 | Running at GbE when 10G is available | MS-01 has 2× 10G SFP+ ports unused. NFS throughput and future live migration are bottlenecked at 1 Gbps. USW-16-PoE has 1G SFP ports only. | Low |
| P4 | Single compute host | All VMs on one machine. Hardware failure means total cluster loss. | High |
| P5 | Unused PCIe x16 slot | Half-height PCIe 4.0 x16 available for a dedicated GPU, HBA, or NIC. | Informational |
| P6 | No IPMI/remote management | MS-01 supports Intel vPro AMT but it is not configured. Hung host requires physical access. | Resolved |
### Network Layer
| # | Gap | Risk | Severity / Status |
|---|---|---|---|
| N1 | No dedicated management VLAN | Proxmox, switch, PDU, and NAS management share VLANs with production or household traffic. | Resolved |
| N2 | No IoT VLAN | Smart home devices (if any) share the default VLAN with household devices and the NAS. | Low |
| N3 | DNS is manual static entries | Adding a service requires a manual UniFi console edit. | Medium |
| N4 | WireGuard VPN not configured | No way to reach the homelab off-site. | Medium |
| N5 | No external access path | No reverse proxy, Cloudflare Tunnel, or Tailscale Funnel for sharing services externally. | Low |
| N6 | Unrestricted internet egress from Homelab VLAN | A compromised pod can reach any external destination. | Low |
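Closing N4 is mostly a matter of a small WireGuard config on the gateway plus one peer entry per client. An illustrative fragment with placeholder addresses and keys (not this lab's actual values):

```ini
# /etc/wireguard/wg0.conf on the homelab gateway -- illustrative values only
[Interface]
Address = 10.10.0.1/24
ListenPort = 51820
PrivateKey = <server-private-key>

[Peer]
# Road-warrior laptop or phone
PublicKey = <client-public-key>
AllowedIPs = 10.10.0.2/32
```

The client mirrors this with the server as its peer and `AllowedIPs` set to the homelab subnets it should reach.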
### Kubernetes / Software Layer
| # | Gap | Risk | Severity / Status |
|---|---|---|---|
| K1 | Single control plane | API server, etcd, and scheduler are a single point of failure. | High |
| K2 | Kyverno audit-mode policies not enforced | `require-resource-limits`, `require-run-as-nonroot`, `require-readonly-rootfs` only report. | Resolved |
| K3 | No ResourceQuotas or LimitRanges | A runaway pod can OOM an entire node and cascade-kill neighbors. | Medium |
| K4 | Vault standalone, no HA | Single Vault pod on NFS. Pod failure loses secret access cluster-wide. | Medium |
| K5 | No offsite backup copy | Velero backs up to MinIO on the same NAS as production data. | Resolved |
| K6 | Authentik Redis unauthenticated | `auth.enabled: false`. Network policies mitigate exposure, but any pod in the auth namespace has access. | Resolved |
| K7 | Prometheus TSDB on NFS | Heavy random I/O on NFS degrades query performance and risks TSDB corruption. | Resolved |
| K8 | No HPA | Nothing scales horizontally under load. | Low |
| K9 | No pod topology spread constraints | Scheduler may co-locate critical services on one node. | Medium |
| K10 | No distributed tracing | Debugging cross-service request flows requires manual log correlation. | Low |
| K11 | No image registry allowlist | Any registry allowed. No protection against pulls from untrusted sources. | Low |
| K12 | No chaos testing | DR runbooks exist but are never automatically validated. | Low |
| K13 | No supply chain verification | No cosign signature verification or SBOM generation. | Low |
| K14 | Grafana dashboards are click-ops | Dashboards not stored in Git. DR event could lose custom dashboards. | Medium |
| K15 | No cert-manager health alerting | cert-manager pod failures or renewal errors are not monitored. | Low |
| K16 | No etcd snapshot schedule | Single control plane with no dedicated etcd backup. Velero backs up API resources but an etcd corruption or quorum loss requires a snapshot to restore. | Resolved |
| K17 | No Loki retention policy | Logs grow unbounded on NFS. No compaction or retention limits configured. | Resolved |
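K3 is typically closed with a `LimitRange` (per-container defaults) plus a `ResourceQuota` (namespace ceiling). A sketch with assumed values; the namespace and limits are placeholders to be tuned per workload:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: example-app     # placeholder; one per namespace
spec:
  limits:
    - type: Container
      defaultRequest:        # applied to containers that declare no requests
        cpu: 50m
        memory: 64Mi
      default:               # applied to containers that declare no limits
        cpu: 500m
        memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
  namespace: example-app
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.memory: 8Gi       # caps total memory a runaway namespace can claim
```

With both in place, a runaway pod hits its own limit or the namespace quota before it can OOM the node and cascade into neighbors.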
## Gap-to-Phase Mapping
| Gap | Addressed In |
|---|---|
| P2 | Phase 1 -- Foundations |
| K3, K9, K11, K15, N6 | Phase 2 -- Kubernetes Hardening |
| P3, N3, N4, N5 | Phase 3 -- Network |
| P4, K1, K4 | Phase 4 -- Compute & Storage |
| K8, K10, K14 | Phase 5 -- Observability |
| K12, K13 | Phase 6 -- Platform Engineering |
| P5, N2 | Phase 7 -- Long-Term Vision |