Backup & Restore

This runbook covers backup schedules, manual backup procedures, and restore operations using Velero with MinIO (local) and AWS S3 (offsite) as storage backends.

Automated Backup Schedules

Velero runs three automated backup schedules:

| Schedule | Scope | Retention | Time | Target |
|---|---|---|---|---|
| daily-stateful | arr, monitoring, auth namespaces | 7 days | 3:00 AM daily | MinIO (local) |
| weekly-full-cluster | All namespaces (excluding kube-system, kube-public) | 30 days | 4:00 AM Sunday | MinIO (local) |
| weekly-offsite | All namespaces (excluding kube-system, kube-public) | 30 days | 5:00 AM Sunday | AWS S3 (offsite) |

Both local schedules back up Kubernetes resources and PVC data using file-system-level backup via Kopia. The offsite schedule mirrors the weekly full-cluster backup to AWS S3 in us-east-1 for disaster recovery.
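As a rough sketch, a schedule like daily-stateful could be recreated with the Velero CLI. The flags below mirror the table above (7-day retention expressed as a 168h TTL); the actual schedules may be defined declaratively instead.

```shell
# Hypothetical recreation of the daily-stateful schedule.
velero schedule create daily-stateful \
  --schedule="0 3 * * *" \
  --include-namespaces arr,monitoring,auth \
  --default-volumes-to-fs-backup \
  --ttl 168h0m0s
```

`--default-volumes-to-fs-backup` opts every pod volume into the Kopia file-system backup without per-pod annotations.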

Offsite Backup

The weekly-offsite schedule writes to an S3 bucket (velero-offsite-homelab) in AWS us-east-1. Objects are stored in S3 Standard and transitioned to Standard-IA after 30 days via lifecycle policy. Estimated cost is ~$1/month for a typical homelab backup set.
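The 30-day transition could be applied with a lifecycle rule along these lines (the rule ID and empty prefix filter are assumptions; only the bucket name comes from above):

```shell
# Sketch: transition objects in the offsite bucket to STANDARD_IA after 30 days.
aws s3api put-bucket-lifecycle-configuration \
  --bucket velero-offsite-homelab \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "transition-to-ia",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}]
    }]
  }'
```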

Note

The kube-system and kube-public namespaces are excluded from backups because their resources are managed by kubeadm and ArgoCD. These are recreated during a cluster rebuild rather than restored from backup.

etcd Snapshots

A separate CronJob backs up the etcd database directly. Velero cannot back up or restore etcd — it operates at the Kubernetes API layer and requires a running API server. etcd snapshots are the only way to recover a cluster whose control plane is corrupted or unrecoverable.

| Schedule | Retention | Local Storage | Offsite Storage |
|---|---|---|---|
| 2:00 AM daily | 7 snapshots | NFS PVC (etcd-snapshots) | S3 (velero-offsite-homelab/etcd-snapshots/) |
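The 7-snapshot retention amounts to a simple prune of the snapshot directory. This is a self-contained sketch against a demo directory rather than the real NFS mount:

```shell
# Sketch: keep the 7 newest snapshot files, delete the rest.
SNAP_DIR=/tmp/etcd-snapshots-demo          # real job would use the NFS PVC mount
rm -rf "$SNAP_DIR" && mkdir -p "$SNAP_DIR"

# Demo only: create ten dated snapshot files so the prune has something to act on.
for i in 01 02 03 04 05 06 07 08 09 10; do
  touch "$SNAP_DIR/snapshot-202401${i}-020000.db"
done

# Sort by modification time, newest first; remove everything past the 7th.
ls -1t "$SNAP_DIR"/snapshot-*.db | tail -n +8 | xargs -r rm --

ls "$SNAP_DIR" | wc -l    # 7 files remain
```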

The CronJob runs on the control plane node with hostNetwork: true to reach the etcd endpoint at 127.0.0.1:2379. An init container takes the snapshot using etcdctl, then the main container uploads it to S3.
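The snapshot-and-upload steps amount to something like the following. The certificate paths are the kubeadm defaults and are assumptions; the snapshot naming matches the restore examples below.

```shell
# Init container: take the snapshot over the local etcd endpoint.
SNAP="/snapshots/snapshot-$(date +%Y%m%d-%H%M%S).db"
ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Main container: mirror the snapshot offsite.
aws s3 cp "$SNAP" s3://velero-offsite-homelab/etcd-snapshots/
```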

Checking etcd Backup Status

kubectl get cronjob -n backups etcd-backup
kubectl get jobs -n backups -l app.kubernetes.io/name=etcd-backup --sort-by=.status.startTime

Manual etcd Snapshot

To trigger an immediate backup:

kubectl create job -n backups etcd-backup-manual --from=cronjob/etcd-backup
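To confirm the manual job finished before relying on the snapshot (the timeout value here is arbitrary):

```shell
kubectl wait --for=condition=complete -n backups job/etcd-backup-manual --timeout=300s
```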

Restoring from etcd Snapshot

Warning

Restoring an etcd snapshot replaces the entire cluster state. All changes made after the snapshot was taken will be lost.

  1. Copy the snapshot to the control plane node:

    # From local NFS
    kubectl cp backups/<etcd-backup-pod>:/snapshots/snapshot-YYYYMMDD-HHMMSS.db /tmp/snapshot.db
    
    # Or from S3
    aws s3 cp s3://velero-offsite-homelab/etcd-snapshots/snapshot-YYYYMMDD-HHMMSS.db /tmp/snapshot.db
    
  2. Stop the API server and etcd (on the control plane node):

    sudo mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/
    sudo mv /etc/kubernetes/manifests/etcd.yaml /tmp/
    
  3. Restore the snapshot:

    sudo ETCDCTL_API=3 etcdctl snapshot restore /tmp/snapshot.db \
      --data-dir=/var/lib/etcd-restore
    sudo rm -rf /var/lib/etcd
    sudo mv /var/lib/etcd-restore /var/lib/etcd
    
  4. Restart the control plane:

    sudo mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/
    sudo mv /tmp/etcd.yaml /etc/kubernetes/manifests/
    
  5. Verify the cluster is healthy:

    kubectl get nodes
    kubectl get pods -A
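Before running step 3, it is worth sanity-checking the copied snapshot; `etcdctl snapshot status` reports the hash, revision, and key count of the file:

```shell
ETCDCTL_API=3 etcdctl snapshot status /tmp/snapshot.db --write-out=table
```

A corrupt or truncated download fails here, which is far cheaper to discover than after wiping /var/lib/etcd.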
    

Manual Backup

Creating a Backup

make k8s-backup

Checking Backup Status

make k8s-backup-status

Or use the Velero CLI directly for more detail:

velero backup get
velero backup describe <backup-name> --details
velero schedule get

Restoring from Backup

Full Restore

  1. List available backups:

    make k8s-restore
    

    Or:

    velero backup get
    
  2. Create a restore from the desired backup:

    velero restore create --from-backup <backup-name>
    
  3. Monitor the restore progress:

    velero restore get
    velero restore describe <restore-name> --details
    
  4. Verify pods are running after the restore completes:

    kubectl get pods -n arr
    kubectl get pods -n monitoring
    

Warning

A restore does not delete existing resources. If restoring into a cluster that already has running workloads, existing resources that conflict with the backup will be skipped. For a clean restore, use a freshly rebuilt cluster.
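When updating existing resources is actually desired, recent Velero versions accept an `--existing-resource-policy` flag (verify it exists in the installed version before relying on it):

```shell
velero restore create --from-backup <backup-name> --existing-resource-policy=update
```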

Partial Restore

Restore only specific namespaces:

velero restore create --from-backup <backup-name> --include-namespaces arr

Restore only specific resource types:

velero restore create --from-backup <backup-name> \
  --include-resources persistentvolumeclaims,persistentvolumes

Combine both filters:

velero restore create --from-backup <backup-name> \
  --include-namespaces arr \
  --include-resources deployments,services,persistentvolumeclaims

Troubleshooting Backups

Backup Stuck in InProgress

A backup that remains in InProgress for longer than expected may indicate an issue with the Velero server or node agent.

# Check Velero server logs
kubectl logs -n backups -l app.kubernetes.io/name=velero

# Check for errors in the backup description
velero backup describe <backup-name> --details
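If the logs show the backup is genuinely wedged rather than just slow, one recovery path is to delete the stuck backup and restart the server; this is a sketch and is destructive to the stuck backup record:

```shell
velero backup delete <backup-name> --confirm
kubectl rollout restart -n backups deploy/velero
```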

Node Agent Issues

The node-agent DaemonSet handles file-system-level PVC backups. If PVC data is not being backed up:

# Verify node-agent pods are running on all nodes
kubectl get pods -n backups -l name=node-agent -o wide

# Check node-agent logs
kubectl logs -n backups -l name=node-agent

S3/MinIO Connectivity

If backups fail with storage-related errors, verify MinIO is running and accessible:

# Check MinIO pod
kubectl get pods -n backups -l app=minio

# Check MinIO logs
kubectl logs -n backups -l app=minio

# Verify all BackupStorageLocations are available
velero backup-location get

A BackupStorageLocation in Unavailable status indicates that Velero cannot reach the storage endpoint. Check the service, credentials, and network connectivity.

Offsite (AWS S3) Connectivity

If the offsite BackupStorageLocation shows Unavailable:

  1. Verify the velero-offsite-credentials Secret exists and is synced:

    kubectl get externalsecret -n backups velero-offsite-credentials
    
  2. Verify the Cilium network policy allows egress to S3:

    kubectl get ciliumnetworkpolicy -n backups backups-egress -o yaml
    
  3. Test S3 connectivity from the Velero pod:

    kubectl exec -n backups -it deploy/velero -- \
      wget -qO- --spider https://s3.us-east-1.amazonaws.com
    

Backup Contains No PVC Data

If a restore completes but PVC data is missing:

  1. Verify the backup included volume data: velero backup describe <backup-name> --details
  2. Check that the pod volumes are annotated for backup or that the defaultVolumesToFsBackup flag is set in the Velero schedule
  3. Confirm that node-agent pods were running and healthy at the time of the backup
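If the schedule relies on opt-in annotations rather than defaultVolumesToFsBackup, each pod (or its pod template) must list the volumes to back up; `<volume-name>` is the volume name from the pod spec, and the arr namespace is just an example:

```shell
kubectl annotate pod -n arr <pod-name> backup.velero.io/backup-volumes=<volume-name>
```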