ADR-013: Backup Strategy¶

Status¶

Accepted

Context¶

The cluster needs a backup strategy that protects against accidental deletion, data corruption, and site-level disaster. Backups must cover three distinct layers: Kubernetes resource manifests, persistent volume data, and the etcd database. The 3-2-1 rule requires at least three copies on two media types with one offsite.

These layers require two separate backup mechanisms. Velero operates at the Kubernetes API layer — it can back up manifests and PVC data but requires a running API server and cannot snapshot etcd. An etcd snapshot can bootstrap a cluster from nothing but knows nothing about application data. Every major Kubernetes distribution (OpenShift, Rancher, RKE2, Gardener) maintains separate pipelines for these two concerns. On a single control-plane kubeadm cluster, etcd corruption without a snapshot means a full cluster rebuild.

Decision¶

Two backup pipelines, both storing data locally on the NAS for fast recovery and offsite in AWS S3 for disaster recovery:

Velero handles application-layer backups. It uses the AWS plugin for S3-compatible storage, Kopia for file-system PVC backup, and two storage targets: standalone MinIO on an NFS-backed PVC for local backups, and an S3 bucket (velero-offsite-homelab) in us-east-1 for offsite copies. Three schedules run nightly: daily stateful namespaces (7-day retention), weekly full-cluster (30-day retention), and weekly offsite (30-day retention).

A CronJob handles etcd snapshots. It runs etcdctl snapshot save daily at 2:00 AM on the control plane node, stores snapshots on an NFS-backed PVC with 7-snapshot retention, and uploads each snapshot to the same S3 bucket under an etcd-snapshots/ prefix.

Alternatives Considered¶

NFS snapshots only: Covers data but not Kubernetes resource state (Secrets, ConfigMaps, RBAC). Restoration requires manual re-creation of all cluster resources.
Restic/Kopia standalone: Can back up PVC data but doesn't handle Kubernetes resource backup or integrate with kubectl-style restore workflows.
Backblaze B2: Cheaper storage ($0.006/GB/month) but adds another vendor dependency. AWS was chosen because the account and IAM infrastructure already exist for Vault KMS auto-unseal.
Longhorn/Rook volume snapshots: Require distributed storage (see ADR-006). Not applicable with NFS.
Kasten K10: Can orchestrate both Velero-style and etcd backups via Kanister blueprints. Heavyweight and commercial — overkill for a homelab.
Velero for etcd: Not possible. Velero requires a running API server and cannot snapshot or restore etcd. Confirmed by upstream maintainers.
adfinis/kubernetes-etcd-backup Helm chart: Handles snapshot scheduling and local retention but does not support offsite upload. A custom CronJob provides the same functionality with S3 upload included.

Rationale¶

Two pipelines by design: etcd snapshots solve "the cluster is dead" scenarios; Velero solves "I deleted a namespace" scenarios. These are fundamentally different recovery paths that cannot be unified without fragility.
Every layer local + offsite: Both Velero and etcd snapshots land on the NAS (fast restore) and S3 (disaster recovery). Consistent storage strategy across both pipelines.
In-cluster S3 via MinIO: Provides S3-compatible storage without cloud dependency. Velero's AWS plugin works unmodified against MinIO. NFS-backed PVC with Retain policy ensures backup data survives cluster rebuilds.
Resource + volume data: Velero backs up both Kubernetes manifests and PVC data via Kopia file-system backup, providing a complete application-layer snapshot.
Three Velero schedules: Daily backups for stateful namespaces catch frequent changes. Weekly full-cluster and offsite backups provide broader coverage with longer retention.
AWS S3 Standard-IA lifecycle: Objects transition from S3 Standard to Standard-IA after 30 days. Noncurrent versions expire after 90 days. Cost is ~$1/month.
Selective restore: Velero supports namespace-scoped and resource-scoped restores, allowing targeted recovery without affecting the rest of the cluster.
Reuse existing AWS account: The account, IAM patterns, and Terraform module already exist for Vault KMS auto-unseal. Adding an S3 bucket and IAM user follows the same pattern.

Consequences¶

Two backup mechanisms to operate and monitor. PrometheusRules alert on both Velero schedule failures and etcd snapshot staleness.
Offsite backups add a runtime dependency on AWS S3 availability and internet egress. Cilium network policy allows HTTPS egress (0.0.0.0/0:443) from the backups namespace.
AWS credentials are stored in Vault and synced via ExternalSecret. During a DR rebuild where Vault is not yet available, credentials must be manually created from a password manager.
Velero restore may conflict with ArgoCD's desired state. ArgoCD sync should be verified after any restore.
The vault-aws-kms Secret is not backed up by Velero and must be manually created before Vault can start during disaster recovery.
The etcd-backup CronJob requires hostPath access to /etc/kubernetes/pki/ and connects to etcd via the node IP (downward API).
The CronJob backs up the full control plane PKI alongside each etcd snapshot. Both are required for disaster recovery on a replacement node.