# Backup & Restore
This runbook covers backup schedules, manual backup procedures, and restore operations using Velero with MinIO (local) and AWS S3 (offsite) as storage backends.
## Automated Backup Schedules
Velero runs three automated backup schedules:
| Schedule | Scope | Retention | Time | Target |
|---|---|---|---|---|
| `daily-stateful` | `arr`, `monitoring`, `auth` namespaces | 7 days | 3:00 AM daily | MinIO (local) |
| `weekly-full-cluster` | All namespaces (excluding `kube-system`, `kube-public`) | 30 days | 4:00 AM Sunday | MinIO (local) |
| `weekly-offsite` | All namespaces (excluding `kube-system`, `kube-public`) | 30 days | 5:00 AM Sunday | AWS S3 (offsite) |
Both local schedules back up Kubernetes resources and PVC data using file-system-level backup via Kopia. The offsite schedule mirrors the weekly full-cluster backup to AWS S3 in `us-east-1` for disaster recovery.
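For reference, the three schedules could be (re)created with the Velero CLI along these lines. This is a sketch, not the actual manifests — in a GitOps setup they would more likely live as `Schedule` resources in the repo, and the `offsite` storage-location name is an assumption:

```shell
# Sketch: recreate the three schedules from the table above with the Velero CLI.
# --default-volumes-to-fs-backup requires Velero 1.10+ (earlier: --default-volumes-to-restic).

velero schedule create daily-stateful \
  --schedule="0 3 * * *" \
  --include-namespaces arr,monitoring,auth \
  --default-volumes-to-fs-backup \
  --ttl 168h0m0s        # 7 days

velero schedule create weekly-full-cluster \
  --schedule="0 4 * * 0" \
  --exclude-namespaces kube-system,kube-public \
  --default-volumes-to-fs-backup \
  --ttl 720h0m0s        # 30 days

velero schedule create weekly-offsite \
  --schedule="0 5 * * 0" \
  --exclude-namespaces kube-system,kube-public \
  --default-volumes-to-fs-backup \
  --storage-location offsite \
  --ttl 720h0m0s        # 30 days
```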
!!! note "Offsite backup"

    The `weekly-offsite` schedule writes to an S3 bucket (`velero-offsite-homelab`) in AWS `us-east-1`. Objects are stored in S3 Standard and transitioned to Standard-IA after 30 days via a lifecycle policy. Estimated cost is ~$1/month for a typical homelab backup set.
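The Standard-IA transition corresponds to an S3 lifecycle rule; a minimal sketch using the AWS CLI, with the bucket name from above (the rule ID is a hypothetical example):

```shell
# Sketch: transition all objects in the offsite bucket to Standard-IA after 30 days.
aws s3api put-bucket-lifecycle-configuration \
  --bucket velero-offsite-homelab \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "transition-to-standard-ia",
      "Status": "Enabled",
      "Filter": {},
      "Transitions": [{ "Days": 30, "StorageClass": "STANDARD_IA" }]
    }]
  }'
```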
!!! note

    The `kube-system` and `kube-public` namespaces are excluded from backups because their resources are managed by kubeadm and ArgoCD. These are recreated during a cluster rebuild rather than restored from backup.
## etcd Snapshots
A separate CronJob backs up the etcd database directly. Velero cannot back up or restore etcd — it operates at the Kubernetes API layer and requires a running API server. etcd snapshots are the only way to recover a cluster whose control plane is corrupted or unrecoverable.
| Schedule | Retention | Local Storage | Offsite Storage |
|---|---|---|---|
| 2:00 AM daily | 7 snapshots | NFS PVC (`etcd-snapshots`) | S3 (`velero-offsite-homelab/etcd-snapshots/`) |
The CronJob runs on the control plane node with `hostNetwork: true` to reach the etcd endpoint at `127.0.0.1:2379`. An init container takes the snapshot using `etcdctl`, then the main container uploads it to S3.
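The 7-snapshot retention can be enforced with a small pruning step; a sketch of the idea (the function name and directory layout are assumptions, not the actual CronJob script):

```shell
# Keep only the N newest snapshots in a directory. The timestamped names
# (snapshot-YYYYMMDD-HHMMSS.db) sort chronologically, so a plain sort works:
# everything except the last N lines is old and can be deleted.
prune_snapshots() {
  dir="$1"
  keep="$2"
  ls -1 "$dir"/snapshot-*.db 2>/dev/null | sort | head -n "-$keep" | xargs -r rm -f
}

# Example: prune the snapshot PVC down to the 7 most recent files
# prune_snapshots /snapshots 7
```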
### Checking etcd Backup Status

```bash
kubectl get cronjob -n backups etcd-backup
kubectl get jobs -n backups -l app.kubernetes.io/name=etcd-backup --sort-by=.status.startTime
```
### Manual etcd Snapshot

To trigger an immediate backup:

```bash
kubectl create job -n backups etcd-backup-manual --from=cronjob/etcd-backup
```
### Restoring from etcd Snapshot
!!! warning

    Restoring an etcd snapshot replaces the entire cluster state. All changes made after the snapshot was taken will be lost.
1. Copy the snapshot to the control plane node:

    ```bash
    # From local NFS
    kubectl cp backups/<etcd-backup-pod>:/snapshots/snapshot-YYYYMMDD-HHMMSS.db /tmp/snapshot.db

    # Or from S3
    aws s3 cp s3://velero-offsite-homelab/etcd-snapshots/snapshot-YYYYMMDD-HHMMSS.db /tmp/snapshot.db
    ```

2. Stop the API server and etcd (on the control plane node):

    ```bash
    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
    sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
    ```

3. Restore the snapshot:

    ```bash
    sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
      --data-dir=/var/lib/etcd-restore
    sudo rm -rf /var/lib/etcd
    sudo mv /var/lib/etcd-restore /var/lib/etcd
    ```

4. Restart the control plane:

    ```bash
    sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
    sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
    ```

5. Verify the cluster is healthy:

    ```bash
    kubectl get nodes
    kubectl get pods -A
    ```
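Before running `etcdctl snapshot restore`, it can be worth sanity-checking the snapshot file on the node; a sketch, assuming `etcdctl` is available there:

```shell
# Prints the snapshot's hash, revision, total keys, and size; a truncated or
# corrupt file fails here instead of partway through the restore.
ETCDCTL_API=3 etcdctl snapshot status /tmp/snapshot.db --write-out=table
```

On etcd 3.5 and later, `etcdutl snapshot status` is the non-deprecated equivalent.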
## Manual Backup

### Creating a Backup

```bash
make k8s-backup
```
### Checking Backup Status

```bash
make k8s-backup-status
```

Or use the Velero CLI directly for more detail:

```bash
velero backup get
velero backup describe <backup-name> --details
velero schedule get
```
## Restoring from Backup

### Full Restore
1. List available backups:

    ```bash
    make k8s-restore
    ```

    Or:

    ```bash
    velero backup get
    ```

2. Create a restore from the desired backup:

    ```bash
    velero restore create --from-backup <backup-name>
    ```

3. Monitor the restore progress:

    ```bash
    velero restore get
    velero restore describe <restore-name> --details
    ```

4. Verify pods are running after the restore completes:

    ```bash
    kubectl get pods -n arr
    kubectl get pods -n monitoring
    ```
!!! warning

    A restore does not delete existing resources. If restoring into a cluster that already has running workloads, existing resources that conflict with the backup are skipped. For a clean restore, use a freshly rebuilt cluster.
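If conflicting resources should be updated to match the backup rather than skipped, newer Velero releases (1.9+) accept an `--existing-resource-policy` flag; a sketch:

```shell
# "update" patches existing resources to match the backup instead of skipping
# them; the default behavior ("none") leaves them untouched.
velero restore create --from-backup <backup-name> --existing-resource-policy=update
```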
### Partial Restore

Restore only specific namespaces:

```bash
velero restore create --from-backup <backup-name> --include-namespaces arr
```

Restore only specific resource types:

```bash
velero restore create --from-backup <backup-name> \
  --include-resources persistentvolumeclaims,persistentvolumes
```

Combine both filters:

```bash
velero restore create --from-backup <backup-name> \
  --include-namespaces arr \
  --include-resources deployments,services,persistentvolumeclaims
```
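Restores can also be filtered by label selector; a sketch (the `app=sonarr` label is a hypothetical example, not necessarily one used in this cluster):

```shell
# Restore only resources carrying a specific label
velero restore create --from-backup <backup-name> \
  --selector app=sonarr
```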
## Troubleshooting Backups

### Backup Stuck in InProgress

A backup that remains in `InProgress` for longer than expected may indicate an issue with the Velero server or node agent.
```bash
# Check Velero server logs
kubectl logs -n backups -l app.kubernetes.io/name=velero

# Check for errors in the backup description
velero backup describe <backup-name> --details
```
### Node Agent Issues

The `node-agent` DaemonSet handles file-system-level PVC backups. If PVC data is not being backed up:
```bash
# Verify node-agent pods are running on all nodes
kubectl get pods -n backups -l name=node-agent -o wide

# Check node-agent logs
kubectl logs -n backups -l name=node-agent
```
### S3/MinIO Connectivity
If backups fail with storage-related errors, verify MinIO is running and accessible:
```bash
# Check MinIO pod
kubectl get pods -n backups -l app=minio

# Check MinIO logs
kubectl logs -n backups -l app=minio

# Verify all BackupStorageLocations are available
velero backup-location get
```
A `BackupStorageLocation` in `Unavailable` status indicates that Velero cannot reach the storage endpoint. Check the service, credentials, and network connectivity.
### Offsite (AWS S3) Connectivity

If the offsite `BackupStorageLocation` shows `Unavailable`:
1. Verify the `velero-offsite-credentials` Secret exists and is synced:

    ```bash
    kubectl get externalsecret -n backups velero-offsite-credentials
    ```

2. Verify the Cilium network policy allows egress to S3:

    ```bash
    kubectl get ciliumnetworkpolicy -n backups backups-egress -o yaml
    ```

3. Test S3 connectivity from the Velero pod:

    ```bash
    kubectl exec -n backups -it deploy/velero -- \
      wget -qO- --spider https://s3.us-east-1.amazonaws.com
    ```
### Backup Contains No PVC Data
If a restore completes but PVC data is missing:
- Verify the backup included volume data:

    ```bash
    velero backup describe <backup-name> --details
    ```

- Check that the pod volumes are annotated for backup, or that the `defaultVolumesToFsBackup` flag is set in the Velero schedule
- Confirm that node-agent pods were running and healthy at the time of the backup
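Per-volume results are also recorded as `PodVolumeBackup` resources, which can be inspected directly; a sketch, assuming Velero is installed in the `backups` namespace as elsewhere in this runbook:

```shell
# One PodVolumeBackup per backed-up volume; each should reach phase Completed.
kubectl get podvolumebackups -n backups \
  -l velero.io/backup-name=<backup-name>
```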