Disaster Recovery

This runbook covers recovery procedures for failure scenarios ranging from a single node going down to a complete cluster rebuild.

Complete Cluster Rebuild

Use this procedure when the entire Kubernetes cluster is lost and must be rebuilt from scratch.

Prerequisites

Before a disaster occurs, ensure the following are available outside the cluster:

  • Vault root token (stored in password manager after make vault-init)
  • AWS credentials and KMS key ID for Vault auto-unseal (the vault-aws-kms Secret must be recreated after a rebuild)
  • Velero backups stored on MinIO (local, running on the NAS) or AWS S3 (offsite, velero-offsite-homelab bucket in us-east-1)
  • etcd snapshots and PKI tarballs stored on NAS (etcd-snapshots PVC) or AWS S3 (velero-offsite-homelab/etcd-snapshots/)
  • AWS credentials for the velero-offsite-homelab IAM user (stored in password manager, also in Vault at infrastructure/velero-offsite)
  • This Git repository (the single source of truth for all cluster state)

Critical

Vault uses AWS KMS for auto-unseal. The vault-aws-kms Kubernetes Secret must be recreated before the Vault pod starts (step 3 below). Store the AWS credentials and KMS key ID in your password manager.

Procedure

  1. Rebuild the cluster infrastructure:

    make k8s-deploy
    

    This provisions VMs with Terraform, bootstraps Kubernetes with Ansible, and installs ArgoCD with the ApplicationSet.

  2. Retrieve the kubeconfig:

    make k8s-kubeconfig
    export KUBECONFIG=$(pwd)/kubeconfig
    
  3. Before ArgoCD deploys Vault, create the vault-aws-kms Secret so Vault can auto-unseal:

    kubectl create namespace vault
    kubectl create secret generic vault-aws-kms \
      --namespace vault \
      --from-literal=AWS_ACCESS_KEY_ID="<access_key_id>" \
      --from-literal=AWS_SECRET_ACCESS_KEY="<secret_access_key>" \
      --from-literal=AWS_REGION="us-east-1" \
      --from-literal=VAULT_AWSKMS_SEAL_KEY_ID="<kms_key_id>"
    
  4. Wait for Vault to become available (it will auto-unseal via KMS):

    kubectl -n vault wait --for=condition=ready pod/vault-0 --timeout=300s
    
  5. Verify ESO is syncing secrets:

    kubectl get externalsecret --all-namespaces
    
  6. Wait for ArgoCD to sync all applications. Monitor progress in the ArgoCD UI or with:

    kubectl get applications -n argocd
    
  7. Once Velero is running, restore from the most recent backup. If the NAS is intact, use a local MinIO backup. If the NAS is lost, restore from offsite:

    From local (MinIO):

    velero backup get
    velero restore create --from-backup <backup-name>
    

    From offsite (AWS S3) -- use when NAS is unavailable:

    First, create the offsite credentials Secret (Vault is not yet available to sync via ESO):

    kubectl create secret generic velero-offsite-credentials \
      --namespace backups \
      --from-file=cloud=<(printf '[default]\naws_access_key_id=<key>\naws_secret_access_key=<secret>')
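The process substitution above (`<(...)`) requires bash. As a sketch of an alternative that works in any POSIX shell and leaves a file you can inspect before creating the Secret (the `<key>`/`<secret>` values remain placeholders for the real IAM credentials; `/tmp/velero-offsite-cloud` is an arbitrary path chosen here):

```shell
# Write the Velero credentials file to disk first, then point
# --from-file at it. The values below are placeholders.
umask 077                               # keep the credentials file private
cat > /tmp/velero-offsite-cloud <<'EOF'
[default]
aws_access_key_id=<key>
aws_secret_access_key=<secret>
EOF
# Then:
# kubectl create secret generic velero-offsite-credentials \
#   --namespace backups --from-file=cloud=/tmp/velero-offsite-cloud
```

Delete the file once the Secret exists.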
    

    Then restore from the offsite BSL:

    velero backup get --storage-location offsite
    velero restore create --from-backup <backup-name>
    
  8. Verify the restore completed successfully:

    velero restore get
    kubectl get pods --all-namespaces
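Scanning the `velero restore get` table by eye is error-prone when several restores ran. A small filter (a sketch; it assumes STATUS is the third column of the table output, which holds for recent Velero releases -- adjust the field index if your version differs) flags anything that did not complete:

```shell
# Flag restores whose STATUS column is not "Completed".
# Reads `velero restore get` table output on stdin, prints offending
# rows, and exits non-zero if any are found.
restore_failures() {
  awk 'NR > 1 && $3 != "Completed" { print; bad = 1 } END { exit bad }'
}
# Usage: velero restore get | restore_failures && echo "all restores completed"
```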
    

etcd Restore (Control Plane Corruption)

Use this procedure when the control plane is unresponsive due to etcd corruption but the underlying node is intact. If the node itself is lost, use the Complete Cluster Rebuild procedure instead, restoring PKI and etcd before running kubeadm init.

Prerequisites

  • An etcd snapshot (snapshot-YYYYMMDD-HHMMSS.db) from NAS or S3
  • The matching PKI tarball (pki-YYYYMMDD-HHMMSS.tar.gz) if certs are also lost
  • SSH access to the control plane node
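Since the snapshot and PKI tarball share a timestamp, a small helper (hypothetical, derived from the naming patterns above) saves retyping it when fetching the matching pair:

```shell
# Derive the matching PKI tarball name from an etcd snapshot filename.
# Assumes both artifacts share the same YYYYMMDD-HHMMSS timestamp, per
# the naming patterns listed in the prerequisites.
pki_for_snapshot() {
  local snap ts
  snap=$(basename "$1")        # e.g. snapshot-20240101-030000.db (illustrative)
  ts=${snap#snapshot-}         # strip the "snapshot-" prefix
  ts=${ts%.db}                 # strip the ".db" suffix
  printf 'pki-%s.tar.gz\n' "$ts"
}
```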

Procedure

  1. Retrieve the snapshot and PKI tarball. From S3:

    aws s3 ls s3://velero-offsite-homelab/etcd-snapshots/
    aws s3 cp s3://velero-offsite-homelab/etcd-snapshots/snapshot-YYYYMMDD-HHMMSS.db /tmp/snapshot.db
    aws s3 cp s3://velero-offsite-homelab/etcd-snapshots/pki-YYYYMMDD-HHMMSS.tar.gz /tmp/pki.tar.gz
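To avoid copying the timestamp by hand, the most recent snapshot can be picked out of the listing mechanically. Because the `YYYYMMDD-HHMMSS` timestamps sort lexicographically in chronological order, a plain sort of the key names is enough (a sketch; it parses the last whitespace-separated field of each `aws s3 ls` line):

```shell
# Pick the most recent snapshot key from an `aws s3 ls` listing.
# The key name is the last field of each line; PKI tarballs and other
# objects are filtered out before sorting.
latest_snapshot() {
  awk '{ print $NF }' | grep '^snapshot-' | grep '\.db$' | sort | tail -n 1
}
# Usage:
# aws s3 ls s3://velero-offsite-homelab/etcd-snapshots/ | latest_snapshot
```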
    
  2. If PKI certs are lost or corrupted, restore them:

    sudo tar xzf /tmp/pki.tar.gz -C /etc/kubernetes
    
  3. Stop the API server and etcd by moving their static pod manifests:

    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
    sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
    
  4. Wait until both containers have stopped; the following should eventually produce no output:

    sudo crictl ps | grep -E 'etcd|kube-apiserver'
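Rather than re-running the check by hand, a small polling helper can wait for the containers to disappear. This is a sketch: the check command is passed as arguments, and `MAX_TRIES` (a name introduced here) caps the wait at roughly `MAX_TRIES * 2` seconds:

```shell
# Poll a check command until it fails (i.e. no matching containers
# remain), giving up after max_tries iterations.
wait_until_stopped() {
  max_tries=${MAX_TRIES:-30}   # MAX_TRIES overrides the default 30 attempts
  tries=0
  while "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      echo "still running after $((max_tries * 2))s" >&2
      return 1
    fi
    sleep 2
  done
}
# On the control plane node:
# wait_until_stopped sh -c "sudo crictl ps | grep -qE 'etcd|kube-apiserver'"
```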
    
  5. Restore the etcd snapshot:

    sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
      --data-dir=/var/lib/etcd-restore
    sudo rm -rf /var/lib/etcd
    sudo mv /var/lib/etcd-restore /var/lib/etcd
    
  6. Restart the control plane components:

    sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
    sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
    
  7. Wait for the API server to become available and verify:

    kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
    kubectl --kubeconfig /etc/kubernetes/admin.conf get pods -A
    

Warning

Restoring an etcd snapshot replaces the entire cluster state. All changes made after the snapshot was taken (deployments, secret rotations, scaling events) will be lost. ArgoCD will reconcile GitOps-managed resources on its next sync cycle.

Single Node Failure

If a single Kubernetes node fails (VM crash, disk corruption, etc.), re-provision it with Terraform and re-join it to the cluster with Ansible:

make k8s-infra && make k8s-configure

Terraform will recreate the failed VM and Ansible will configure it and join it back to the cluster. Pods that were scheduled on the failed node will be rescheduled automatically by Kubernetes.
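To confirm the node rejoined, check that it reports Ready. A small filter (a sketch; it parses the `kubectl get nodes` table, where STATUS is the second column) makes this scriptable:

```shell
# Check that a specific node reports Ready in `kubectl get nodes` output.
# Reads the table on stdin; exits non-zero if the node is absent or not
# Ready. The node name is whatever Terraform/Ansible assigned to the VM.
node_ready() {
  awk -v n="$1" 'NR > 1 && $1 == n && $2 == "Ready" { found = 1 } END { exit !found }'
}
# Usage: kubectl get nodes | node_ready <node-name>
```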

NAS Failure

All PersistentVolumeClaim data is stored on NFS shares hosted by the NAS. If the NAS becomes unavailable:

  • Pods with NFS-backed volumes will hang on volume mount operations.
  • New pods requiring NFS volumes will remain in ContainerCreating state.
  • Running pods that already have volumes mounted may continue to work temporarily but will fail on any new I/O operations.

Recovery depends entirely on the NAS hardware and its RAID/backup configuration. Once the NAS is restored and NFS exports are available again, pods will recover automatically.

Tip

If the NAS will be down for an extended period, you can delete the hanging pods to prevent them from consuming cluster resources. They will be recreated (and hang again) only if their controllers attempt rescheduling.
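Finding the hanging pods across all namespaces can be scripted. This sketch parses the `kubectl get pods -A` table, where the `-A` flag prepends a NAMESPACE column, making STATUS the fourth field:

```shell
# List pods stuck in ContainerCreating, one "namespace/name" per line.
stuck_pods() {
  awk 'NR > 1 && $4 == "ContainerCreating" { print $1 "/" $2 }'
}
# Usage: kubectl get pods -A | stuck_pods
# To delete them:
# kubectl get pods -A | stuck_pods | while IFS=/ read -r ns pod; do
#   kubectl delete pod -n "$ns" "$pod"
# done
```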

Vault KMS Credential Loss

Vault uses AWS KMS auto-unseal. If the AWS credentials or KMS key ID (stored in the vault-aws-kms Secret) are lost, Vault cannot auto-unseal.

  • If the KMS key still exists in AWS but credentials are lost: recreate the IAM user and generate new access keys, then recreate the vault-aws-kms Secret and restart the Vault pod.
  • If the KMS key is scheduled for deletion: cancel the deletion yourself (aws kms cancel-key-deletion). Scheduled deletion has a minimum 7-day waiting period, and it can be cancelled at any point during that window. Once the waiting period elapses and the key is actually deleted, it cannot be recovered -- not even by AWS support.
  • If both the KMS key and a Velero backup of Vault data are lost: you must re-initialize Vault and repopulate all secrets.

Recovery Procedure (credentials lost, KMS key intact)

  1. Generate new AWS access keys for the Vault IAM user.
  2. Recreate the vault-aws-kms Secret:

    kubectl delete secret vault-aws-kms -n vault
    kubectl create secret generic vault-aws-kms \
      --namespace vault \
      --from-literal=AWS_ACCESS_KEY_ID="<new_access_key_id>" \
      --from-literal=AWS_SECRET_ACCESS_KEY="<new_secret_access_key>" \
      --from-literal=AWS_REGION="us-east-1" \
      --from-literal=VAULT_AWSKMS_SEAL_KEY_ID="<kms_key_id>"
    
  3. Restart the Vault pod: kubectl -n vault delete pod vault-0

  4. Vault auto-unseals via KMS. Verify: vault status

Recovery Procedure (full Vault data loss)

  1. Locate all .example files in the repository -- these contain the secret structure with placeholder values.
  2. Recreate the vault-aws-kms Secret (see above).
  3. Re-initialize Vault:

    make vault-init
    

    Save the root token from the output to your password manager.

  4. For each secret, write values to Vault:

    vault kv put homelab/<path> key1=value1 key2=value2
    
  5. ESO syncs the new secrets from Vault automatically.

What Is NOT Backed Up

Warning

The following data is not included in Velero backups and will be lost in a disaster:

  • Volumes using emptyDir (ephemeral per-pod storage)
  • Node-local storage (hostPath volumes, local PVs)
  • Data written to container filesystems outside of mounted volumes
  • Kubernetes secrets that exist only in etcd and have no corresponding secret material in Vault or ExternalSecret manifests in Git