# Disaster Recovery
This runbook covers recovery procedures for the most severe failure scenarios, from a single node going down to a complete cluster rebuild.
## Complete Cluster Rebuild
Use this procedure when the entire Kubernetes cluster is lost and must be rebuilt from scratch.
### Prerequisites
Before a disaster occurs, ensure the following are available outside the cluster:
- Vault root token (stored in password manager after `make vault-init`)
- AWS credentials and KMS key ID for Vault auto-unseal (the `vault-aws-kms` Secret must be recreated after a rebuild)
- Velero backups stored on MinIO (local, running on the NAS) or AWS S3 (offsite, `velero-offsite-homelab` bucket in us-east-1)
- etcd snapshots and PKI tarballs stored on NAS (`etcd-snapshots` PVC) or AWS S3 (`velero-offsite-homelab/etcd-snapshots/`)
- AWS credentials for the `velero-offsite-homelab` IAM user (stored in password manager, also in Vault at `infrastructure/velero-offsite`)
- This Git repository (the single source of truth for all cluster state)
> **Critical:** Vault uses AWS KMS for auto-unseal. The `vault-aws-kms` Kubernetes Secret must be recreated before the Vault pod starts (step 3 below). Store the AWS credentials and KMS key ID in your password manager.
### Procedure

1. Rebuild the cluster infrastructure:

   ```shell
   make k8s-deploy
   ```

   This provisions VMs with Terraform, bootstraps Kubernetes with Ansible, and installs ArgoCD with the ApplicationSet.

2. Retrieve the kubeconfig:

   ```shell
   make k8s-kubeconfig
   export KUBECONFIG=$(pwd)/kubeconfig
   ```

3. Before ArgoCD deploys Vault, create the `vault-aws-kms` Secret so Vault can auto-unseal:

   ```shell
   kubectl create namespace vault
   kubectl create secret generic vault-aws-kms \
     --namespace vault \
     --from-literal=AWS_ACCESS_KEY_ID="<access_key_id>" \
     --from-literal=AWS_SECRET_ACCESS_KEY="<secret_access_key>" \
     --from-literal=AWS_REGION="us-east-1" \
     --from-literal=VAULT_AWSKMS_SEAL_KEY_ID="<kms_key_id>"
   ```

4. Wait for Vault to become available (it will auto-unseal via KMS):

   ```shell
   kubectl -n vault wait --for=condition=ready pod/vault-0 --timeout=300s
   ```

5. Verify ESO is syncing secrets:

   ```shell
   kubectl get externalsecret --all-namespaces
   ```

6. Wait for ArgoCD to sync all applications. Monitor progress in the ArgoCD UI or with:

   ```shell
   kubectl get applications -n argocd
   ```

7. Once Velero is running, restore from the most recent backup. If the NAS is intact, use a local MinIO backup. If the NAS is lost, restore from offsite.

   From local (MinIO):

   ```shell
   velero backup get
   velero restore create --from-backup <backup-name>
   ```

   From offsite (AWS S3) -- use when NAS is unavailable. First, create the offsite credentials Secret (Vault is not yet available to sync it via ESO):

   ```shell
   kubectl create secret generic velero-offsite-credentials \
     --namespace backups \
     --from-file=cloud=<(printf '[default]\naws_access_key_id=<key>\naws_secret_access_key=<secret>')
   ```

   Then restore from the offsite BSL:

   ```shell
   velero backup get --storage-location offsite
   velero restore create --from-backup <backup-name>
   ```

8. Verify the restore completed successfully:

   ```shell
   velero restore get
   kubectl get pods --all-namespaces
   ```
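Rather than re-running `velero restore get` by hand in step 8, the status column can be checked in a loop. A minimal sketch, assuming the default tabular output of `velero restore get` (STATUS in the third column); the function name is illustrative:

```shell
# Count restores that are not in the Completed state, given the tabular
# output of `velero restore get` on stdin (the header row is skipped).
count_incomplete_restores() {
  awk 'NR > 1 && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Against a live cluster, poll until everything has completed:
#   while [ "$(velero restore get | count_incomplete_restores)" -gt 0 ]; do
#     sleep 10
#   done
```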
## etcd Restore (Control Plane Corruption)
Use this procedure when the control plane is unresponsive due to etcd corruption but the underlying node is intact. If the node itself is lost, use the Complete Cluster Rebuild procedure instead, restoring PKI and etcd before running `kubeadm init`.
### Prerequisites
- An etcd snapshot (`snapshot-YYYYMMDD-HHMMSS.db`) from NAS or S3
- The matching PKI tarball (`pki-YYYYMMDD-HHMMSS.tar.gz`) if certs are also lost
- SSH access to the control plane node
### Procedure

1. Retrieve the snapshot and PKI tarball. From S3:

   ```shell
   aws s3 ls s3://velero-offsite-homelab/etcd-snapshots/
   aws s3 cp s3://velero-offsite-homelab/etcd-snapshots/snapshot-YYYYMMDD-HHMMSS.db /tmp/snapshot.db
   aws s3 cp s3://velero-offsite-homelab/etcd-snapshots/pki-YYYYMMDD-HHMMSS.tar.gz /tmp/pki.tar.gz
   ```

2. If PKI certs are lost or corrupted, restore them:

   ```shell
   sudo tar xzf /tmp/pki.tar.gz -C /etc/kubernetes
   ```

3. Stop the API server and etcd by moving their static pod manifests out of the manifest directory:

   ```shell
   sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
   sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
   ```

4. Wait for the containers to stop (the command should eventually return no output):

   ```shell
   sudo crictl ps | grep -E 'etcd|kube-apiserver'
   ```

5. Restore the etcd snapshot:

   ```shell
   sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
     --data-dir=/var/lib/etcd-restore
   sudo rm -rf /var/lib/etcd
   sudo mv /var/lib/etcd-restore /var/lib/etcd
   ```

6. Restart the control plane components by moving the manifests back:

   ```shell
   sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
   sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
   ```

7. Wait for the API server to become available and verify:

   ```shell
   kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
   kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -A
   ```
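The check in step 4 can be repeated until it comes back empty. A small sketch, assuming the container name appears somewhere on each `crictl ps` output line; the helper name is made up for illustration:

```shell
# Succeeds (exit 0) while any etcd or kube-apiserver container still
# appears in the `crictl ps` listing supplied on stdin.
control_plane_still_running() {
  grep -qE 'etcd|kube-apiserver'
}

# On the control plane node:
#   while sudo crictl ps | control_plane_still_running; do sleep 2; done
```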
> **Warning:** Restoring an etcd snapshot replaces the entire cluster state. All changes made after the snapshot was taken (deployments, secret rotations, scaling events) will be lost. ArgoCD will reconcile GitOps-managed resources on its next sync cycle.
## Single Node Failure

If a single Kubernetes node fails (VM crash, disk corruption, etc.), re-provision it with Terraform and re-join it to the cluster with Ansible:

```shell
make k8s-infra && make k8s-configure
```
Terraform will recreate the failed VM and Ansible will configure it and join it back to the cluster. Pods that were scheduled on the failed node will be rescheduled automatically by Kubernetes.
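To confirm which node object corresponds to the failed VM before (and after) re-provisioning, the standard `kubectl get nodes` table can be filtered. A sketch assuming the default column layout (NAME, STATUS, ROLES, ...); the function name is illustrative:

```shell
# Print the names of nodes whose STATUS column is not exactly "Ready"
# (e.g. NotReady), given `kubectl get nodes` output on stdin.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# kubectl get nodes | not_ready_nodes
```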
## NAS Failure
All PersistentVolumeClaim data is stored on NFS shares hosted by the NAS. If the NAS becomes unavailable:
- Pods with NFS-backed volumes will hang on volume mount operations.
- New pods requiring NFS volumes will remain in `ContainerCreating` state.
- Running pods that already have volumes mounted may continue to work temporarily but will fail on any new I/O operations.
Recovery depends entirely on the NAS hardware and its RAID/backup configuration. Once the NAS is restored and NFS exports are available again, pods will recover automatically.
> **Tip:** If the NAS will be down for an extended period, you can delete the hanging pods to prevent them from consuming cluster resources. They will be recreated (and hang again) only if their controllers attempt rescheduling.
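To find the hanging pods, the `kubectl get pods -A` table can be filtered on the STATUS column. A sketch, assuming the default wide table layout (NAMESPACE, NAME, READY, STATUS, ...):

```shell
# Print namespace/name for pods whose STATUS is ContainerCreating, given
# `kubectl get pods -A` output on stdin (header row skipped).
stuck_on_mounts() {
  awk 'NR > 1 && $4 == "ContainerCreating" { print $1 "/" $2 }'
}

# kubectl get pods -A | stuck_on_mounts
# Review the list, then delete individual pods:
#   kubectl delete pod -n <namespace> <name>
```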
## Vault KMS Credential Loss

Vault uses AWS KMS auto-unseal. If the AWS credentials or KMS key ID (stored in the `vault-aws-kms` Secret) are lost, Vault cannot auto-unseal.
- If the KMS key still exists in AWS but the credentials are lost: recreate the IAM user and generate new access keys, then recreate the `vault-aws-kms` Secret and restart the Vault pod.
- If the KMS key is scheduled for deletion: cancel the deletion during the waiting period (minimum 7 days) with `aws kms cancel-key-deletion`. Once the key is actually deleted, it cannot be recovered.
- If both the KMS key and a Velero backup of Vault data are lost: you must re-initialize Vault and repopulate all secrets.
### Recovery Procedure (credentials lost, KMS key intact)

1. Generate new AWS access keys for the Vault IAM user.
2. Recreate the `vault-aws-kms` Secret:

   ```shell
   kubectl delete secret vault-aws-kms -n vault
   kubectl create secret generic vault-aws-kms \
     --namespace vault \
     --from-literal=AWS_ACCESS_KEY_ID="<new_access_key_id>" \
     --from-literal=AWS_SECRET_ACCESS_KEY="<new_secret_access_key>" \
     --from-literal=AWS_REGION="us-east-1" \
     --from-literal=VAULT_AWSKMS_SEAL_KEY_ID="<kms_key_id>"
   ```

3. Restart the Vault pod so it picks up the new credentials:

   ```shell
   kubectl -n vault delete pod vault-0
   ```

4. Vault auto-unseals via KMS. Verify:

   ```shell
   vault status
   ```
### Recovery Procedure (full Vault data loss)

1. Locate all `.example` files in the repository -- these contain the secret structure with placeholder values.
2. Recreate the `vault-aws-kms` Secret (see above).
3. Re-initialize Vault:

   ```shell
   make vault-init
   ```

   Save the root token from the output to your password manager.

4. For each secret, write values to Vault:

   ```shell
   vault kv put homelab/<path> key1=value1 key2=value2
   ```

5. ESO syncs the new secrets from Vault automatically.
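Step 4 can be partly scripted from the `.example` files. A sketch that prints (rather than runs) the `vault kv put` commands for review; the `homelab/` mount path and a flat `key=value` file format are assumptions about the repository layout, so adjust to match your actual files:

```shell
# Turn one .example file into a `vault kv put` command for manual review.
# Assumes the file holds flat key=value lines (lines starting with # are
# skipped) and that the target KV mount is `homelab/`.
example_to_vault_cmd() {
  f="$1"
  path="homelab/$(basename "${f%.example}")"
  printf 'vault kv put %s %s\n' "$path" \
    "$(grep -v '^#' "$f" | tr '\n' ' ' | sed 's/ *$//')"
}

# Print a command for every .example file in the repo, review, then run:
#   for f in $(find . -name '*.example'); do example_to_vault_cmd "$f"; done
```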
## What Is NOT Backed Up
> **Warning:** The following data is not included in Velero backups and will be lost in a disaster:
>
> - Volumes using `emptyDir` (ephemeral per-pod storage)
> - Node-local storage (`hostPath` volumes, local PVs)
> - Data written to container filesystems outside of mounted volumes
> - Kubernetes Secrets that exist only in etcd and have no corresponding secret material in Vault or ExternalSecret manifests in Git
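The last category can be audited before a disaster by diffing Secret names against ExternalSecret names per namespace. A rough heuristic sketch, assuming ESO target Secrets share the name of the ExternalSecret that creates them (verify against your manifests before relying on it):

```shell
# Print Secret names (file $1, one name per line) that have no same-named
# ExternalSecret (file $2, one name per line). These are candidates for
# data that exists only in etcd.
unmanaged_secrets() {
  sort "$1" > /tmp/secrets.$$
  sort "$2" > /tmp/extsecrets.$$
  comm -23 /tmp/secrets.$$ /tmp/extsecrets.$$
  rm -f /tmp/secrets.$$ /tmp/extsecrets.$$
}

# Per namespace:
#   kubectl get secret -n <ns> -o name | cut -d/ -f2 > /tmp/s
#   kubectl get externalsecret -n <ns> -o name | cut -d/ -f2 > /tmp/e
#   unmanaged_secrets /tmp/s /tmp/e
```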