Phase 4 -- Compute & Storage¶

Status: Not started

Goal: Eliminate the single-host and single-controller dependencies.

Addresses: P4 (single compute host), K1 (single control plane), K4 (Vault standalone)

4.1 Add a Second Compute Host¶


Why	All VMs run on one machine. Hardware failure means total cluster loss with no recovery until hardware is replaced.
Unlocks	Proxmox HA (automatic VM failover), rolling Proxmox upgrades, rolling K8s upgrades without downtime, proper pod anti-affinity.
Sizing	A matching MS-01 with 64 GB RAM is ideal. A smaller node (32 GB) is sufficient for one worker and one control plane node.
IaC	The existing Terraform module and Ansible inventory are parameterized. Adding a host means a new `target_node` and rebalancing VM placement.


Why	Single control plane means API server, etcd, and scheduler are all SPOFs. 3 nodes across 2 hosts survives any single failure.
Prerequisites	Second compute host (4.1). Load balancer for API server.
Resource cost	~2 vCPU and 4-8 GB RAM per control plane node. With 128 GB across 2 hosts, easily accommodated.

Switch Vault from standalone file storage to integrated Raft storage (3 replicas)
Move Vault PVCs from nfs-client to local-path (Raft needs local disk)
Verify AWS KMS auto-unseal works for all replicas
Verify ESO can reach Vault through a Vault service (not a specific pod)
Test failover: kill the Raft leader, confirm a new leader is elected
Write an ADR


Why	Vault is a single pod. If it crashes or NFS stalls, every ExternalSecret stops refreshing. New deployments and secret rotations fail immediately.
Approach	Vault's integrated Raft replicates data across replicas without a separate etcd or Consul cluster. All replicas use the same KMS key for auto-unseal.


Why	With the Phase 1.2 mirror, usable capacity is 8 TB. RAID 10 doubles usable space and improves read performance.
Timing	Flexible. Monitor with the existing `NFSStorageLow` PrometheusRule alert.