Phase 4 -- Compute & Storage
Status: Not started
Goal: Eliminate the single-host and single-controller dependencies.
Addresses: P4 (single compute host), K1 (single control plane), K4 (Vault standalone)
4.1 Add a Second Compute Host
| | |
| --- | --- |
| Why | All VMs run on one machine. Hardware failure means total cluster loss with no recovery until hardware is replaced. |
| Unlocks | Proxmox HA (automatic VM failover), rolling Proxmox upgrades, rolling K8s upgrades without downtime, proper pod anti-affinity. |
| Sizing | A matching MS-01 with 64 GB RAM is ideal. A smaller node (32 GB) is sufficient for one worker and one control plane node. |
| IaC | The existing Terraform module and Ansible inventory are parameterized. Adding a host means a new target_node and rebalancing VM placement. |
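
The rebalancing step can be sketched as a small planner that spreads VMs of each role across the available Proxmox nodes. This is an illustration only: the host names (pve1, pve2), VM names, and role labels are hypothetical, and the real placement lives in the Terraform variables, not in Python.

```python
from collections import defaultdict
from itertools import cycle


def plan_placement(vms, hosts):
    """Spread VMs of each role across hosts round-robin.

    vms:   list of (vm_name, role) tuples
    hosts: list of Proxmox node names (candidate target_node values)
    Returns a dict mapping vm_name -> target_node.
    """
    by_role = defaultdict(list)
    for name, role in vms:
        by_role[role].append(name)

    placement = {}
    for names in by_role.values():
        # Restart the rotation for every role so that no single host
        # ends up with all VMs of one role.
        for name, host in zip(names, cycle(hosts)):
            placement[name] = host
    return placement


# Hypothetical inventory: names are placeholders, not the real cluster.
vms = [
    ("k8s-cp-1", "control-plane"),
    ("k8s-worker-1", "worker"),
    ("k8s-worker-2", "worker"),
]
for vm, node in plan_placement(vms, ["pve1", "pve2"]).items():
    print(f"{vm}: target_node = {node}")
```

The resulting mapping is what would be fed into the per-VM target_node parameter. A real plan would also account for RAM per host, which this sketch ignores.
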
4.2 Expand to 3 Control Plane Nodes
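| | |
| --- | --- |
| Why | A single control plane node makes the API server, etcd, and scheduler all single points of failure (SPOFs). 3 nodes survive the loss of any single node. Across only 2 hosts, one host has to carry two of the three etcd members, so losing that host still breaks quorum; full host-level tolerance needs a third failure domain. |
| Prerequisites | A second compute host (4.1) and a load balancer in front of the API server. |
| Resource cost | ~2 vCPU and 4-8 GB RAM per control plane node, so at most 24 GB for three nodes. With 128 GB across 2 hosts (assuming the matching 64 GB host from 4.1), this is easily accommodated. |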
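
The failure tolerance of a given layout follows from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them alive. The sketch below checks, for each host, whether quorum survives if that host goes down. Node and host names are hypothetical placeholders.

```python
from collections import Counter


def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def survives_host_loss(placement: dict[str, str]) -> dict[str, bool]:
    """For each host, report whether etcd keeps quorum if that host fails.

    placement maps control plane node name -> host name.
    """
    total = len(placement)
    members_per_host = Counter(placement.values())
    return {
        host: total - lost >= quorum(total)
        for host, lost in members_per_host.items()
    }


# Three control plane nodes on two hosts (hypothetical names).
two_hosts = {"cp-1": "pve1", "cp-2": "pve1", "cp-3": "pve2"}
print(quorum(3))                      # 2
print(survives_host_loss(two_hosts))  # {'pve1': False, 'pve2': True}
```

With two hosts, whichever host carries two members is the one whose loss stops the control plane. Any single VM failure is still tolerated.
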
4.3 Migrate Vault to HA (Raft)
| | |
| --- | --- |
| Why | Vault is a single pod. If it crashes or NFS stalls, every ExternalSecret stops refreshing. New deployments and secret rotations fail immediately. |
| Approach | Vault's integrated Raft replicates data across replicas without a separate etcd or Consul cluster. All replicas use the same KMS key for auto-unseal. |
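
Once several replicas exist, one way to verify the HA setup is to poll each replica's /v1/sys/health endpoint, which signals the node's role through its HTTP status code (200 active, 429 standby, 503 sealed, by default). The sketch below is a minimal checker under those defaults. The replica URLs a caller would pass in are assumptions, and the "one active plus a majority unsealed" rule is an illustrative health criterion, not something taken from this plan.

```python
import urllib.error
import urllib.request

# Default status codes of Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "active",
    429: "standby",
    472: "dr-secondary",
    473: "performance-standby",
    501: "not-initialized",
    503: "sealed",
}


def classify(status_code: int) -> str:
    """Translate a /v1/sys/health status code into a node state."""
    return HEALTH_STATES.get(status_code, "unknown")


def probe(base_url: str, timeout: float = 3.0) -> str:
    """Query one replica; base_url is e.g. a per-pod service URL (assumed)."""
    request = urllib.request.Request(f"{base_url}/v1/sys/health", method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status)
    except urllib.error.HTTPError as err:
        # Non-2xx codes are raised as HTTPError; the code still names the state.
        return classify(err.code)
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return "unreachable"


def cluster_ok(states: list[str]) -> bool:
    """Exactly one active node and a majority of replicas unsealed."""
    unsealed = [s for s in states if s in ("active", "standby")]
    return states.count("active") == 1 and len(unsealed) >= len(states) // 2 + 1


print(cluster_ok(["active", "standby", "standby"]))  # True
print(cluster_ok(["sealed", "standby", "standby"]))  # False
```

In practice the states list would come from calling probe() against each replica. The function is defined here but not invoked, so the example runs without a cluster.
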
4.4 Expand NAS Storage
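| | |
| --- | --- |
| Why | With the Phase 1.2 mirror, usable capacity is 8 TB. Moving to RAID 10 by adding a second mirrored pair of same-size drives doubles usable space to 16 TB and improves read performance, since reads are striped across both pairs. |
| Timing | Flexible. Monitor with the existing NFSStorageLow PrometheusRule alert and expand when it starts firing. |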
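
The capacity figures can be checked with a little arithmetic. The sketch assumes 8 TB drives (inferred from the 8 TB usable mirror) and a four-drive RAID 10. Both are assumptions about the hardware, not confirmed specifications.

```python
def mirror_usable_tb(drive_tb: float) -> float:
    """A two-way mirror exposes the capacity of a single drive."""
    return drive_tb


def raid10_usable_tb(drive_tb: float, drives: int) -> float:
    """RAID 10 stripes across mirrored pairs, so half the raw space is usable."""
    if drives < 4 or drives % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least 4")
    return drive_tb * drives / 2


print(mirror_usable_tb(8))     # 8
print(raid10_usable_tb(8, 4))  # 16.0
```
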
Definition of Done