# Troubleshooting

This runbook covers common issues and the commands used to diagnose them.
## General Debugging

Start any investigation with these commands:

```bash
# Pod status overview
kubectl get pods -n <namespace>

# Detailed pod information and events
kubectl describe pod -n <namespace> <pod>

# Pod logs (current instance)
kubectl logs -n <namespace> <pod> [-c <container>]

# Pod logs (previous crashed instance)
kubectl logs -n <namespace> <pod> --previous

# Recent events sorted by time
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```
## ArgoCD Sync Issues

### App Stuck in "Progressing"

The application reports `Progressing` but never reaches `Healthy`.

Possible causes:

- Resource quotas preventing pod scheduling
- Missing CRDs that the application depends on (check sync wave ordering)
- Failed health checks (readiness/liveness probes misconfigured)
- Pods stuck in `Pending` due to insufficient node resources

Diagnosis:

```bash
kubectl get application -n argocd <app-name> -o yaml
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=<app-name>
kubectl describe pod -n <namespace> <pod>
```
### App Shows "OutOfSync" but Is Healthy

The application is running correctly but ArgoCD reports it as out of sync.

Possible causes:

- Server-side apply drift: fields managed by controllers (e.g., defaulted values) differ from the Git source
- Missing `ignoreDifferences` configuration for fields that are legitimately modified at runtime

Resolution:

Check which fields are out of sync in the ArgoCD UI (App > Diff). If the drift is expected, add an `ignoreDifferences` block to the Application spec.
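For example, to tolerate replica-count drift caused by an autoscaler, an `ignoreDifferences` block might look like the following sketch (the application name and field path are illustrative; adapt them to the field shown in the diff):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app            # illustrative name
  namespace: argocd
spec:
  # ... source, destination, syncPolicy ...
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # field legitimately modified at runtime (e.g., by an HPA)
```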
### Force Sync

If an application is stuck, force a sync with replacement from the ArgoCD UI:

- Open `https://argocd.homelab.local`
- Select the application
- Click **Sync**, enable **Replace**, and click **Synchronize**
## Pod CrashLoopBackOff

The pod starts, crashes, and Kubernetes keeps restarting it with increasing backoff delays.

Diagnosis:

```bash
kubectl logs -n <namespace> <pod> --previous
kubectl describe pod -n <namespace> <pod>
```

Common causes:

| Cause | Symptoms in Logs |
|---|---|
| Missing secret | `Error: secret "<name>" not found` or env var is empty |
| Wrong image tag | `ImagePullBackOff` or `ErrImagePull` |
| Configuration error | Application-specific error messages at startup |
| Resource limits too low | `OOMKilled` in pod events |
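For the `OOMKilled` case, the usual fix is to raise the container's memory limit. A minimal Deployment fragment, with illustrative names and values:

```yaml
# Deployment fragment (names and values are illustrative)
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "512Mi"  # raise this if the pod is repeatedly OOMKilled
```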
## Volume Mount Issues

### NFS Timeouts

Pods are stuck in `ContainerCreating` with mount errors referencing the NFS server.

Diagnosis:

```bash
# Verify NAS is reachable
ping 192.168.1.158

# Check NFS provisioner pod
kubectl get pods -n nfs-provisioner
kubectl logs -n nfs-provisioner -l app=nfs-subdir-external-provisioner
```
### PVC Stuck in Pending

A PersistentVolumeClaim remains in `Pending` state and is never bound.

Diagnosis:

```bash
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get storageclass
kubectl logs -n nfs-provisioner -l app=nfs-subdir-external-provisioner
```

Common causes:

- NFS provisioner pod is not running
- The `nfs-client` StorageClass does not exist
- NAS NFS exports are misconfigured or unreachable
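For reference, a PVC that binds through the NFS provisioner should name the `nfs-client` StorageClass explicitly. A sketch with illustrative claim name, namespace, and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-claim              # illustrative name
  namespace: arr
spec:
  storageClassName: nfs-client   # must match an existing StorageClass
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```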
### Permission Issues

Applications report permission denied errors when accessing files on NFS volumes.

Resolution:

Check that the `PUID` and `PGID` values in the `arr-env` ConfigMap match the user/group ownership on the NAS share. LinuxServer.io containers use these environment variables to set the runtime user.

```bash
kubectl get configmap -n arr arr-env -o yaml
```
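The ConfigMap might look roughly like this (the UID/GID values are illustrative; use the IDs that actually own the share on your NAS):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: arr-env
  namespace: arr
data:
  PUID: "1000"  # illustrative: must match the share owner's UID on the NAS
  PGID: "1000"  # illustrative: must match the share owner's GID
```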
## Network Issues

### Service Unreachable

An application's service does not respond.

```bash
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace> <service-name>
```

If endpoints are empty, the service selector does not match any running pods.
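For endpoints to populate, the Service selector must equal the labels on the pod template. A minimal matching pair, with illustrative names, labels, and ports:

```yaml
# Service side: the selector...
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app.kubernetes.io/instance: my-app  # must equal the pod labels below
  ports:
    - port: 80
      targetPort: 8080
---
# ...must match the pod template labels on the Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: my-app  # a mismatch here yields empty endpoints
    spec:
      containers:
        - name: app
          image: nginx:1.27  # illustrative image
          ports:
            - containerPort: 8080
```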
### Ingress Not Working

The application is running but not reachable via its hostname.

Diagnosis:

```bash
# Check HTTPRoute resources
kubectl get httproute -n <namespace>

# Check Gateway status
kubectl get gateway homelab-gateway -n default

# Verify DNS resolves to the Cilium L2 VIP
dig <app-name>.homelab.local
```

Common causes:

- DNS does not point to the Cilium L2 LoadBalancer IP
- HTTPRoute `parentRef` does not reference the correct Gateway
- The backend service name or port is misconfigured
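A minimal HTTPRoute that wires a hostname to a backend service might look like this sketch (route name, namespace, hostname, and backend port are illustrative; the Gateway name matches the diagnosis commands above):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app               # illustrative name
  namespace: arr
spec:
  parentRefs:
    - name: homelab-gateway  # must reference the correct Gateway
      namespace: default
  hostnames:
    - my-app.homelab.local
  rules:
    - backendRefs:
        - name: my-app       # must match the Service name...
          port: 80           # ...and its exposed port
```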
### VPN Not Connecting

The Gluetun VPN container fails to establish a connection, blocking qBittorrent.

```bash
kubectl logs -n arr -l app.kubernetes.io/instance=arr-vpn-downloads -c gluetun
```

Common causes:

- VPN credentials secret is missing or contains incorrect values
- VPN provider is experiencing an outage
- Firewall rules are blocking the VPN connection
## Certificate Issues

### TLS Errors in Browser

The browser reports certificate errors beyond the expected self-signed CA warning.

Diagnosis:

```bash
# Check Certificate resources
kubectl get certificates --all-namespaces

# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager

# Check ClusterIssuer
kubectl get clusterissuer
kubectl describe clusterissuer homelab-ca-issuer
```

Common causes:

- The `homelab-ca-issuer` ClusterIssuer does not exist or is not ready
- cert-manager pods are not running
- The Certificate resource failed to issue (check events with `kubectl describe certificate`)
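For reference, a Certificate issued by the cluster CA references the issuer like this (the name, namespace, and DNS name are illustrative):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls           # illustrative name
  namespace: arr
spec:
  secretName: my-app-tls     # Secret where the signed certificate is stored
  dnsNames:
    - my-app.homelab.local
  issuerRef:
    name: homelab-ca-issuer  # must match an existing, ready ClusterIssuer
    kind: ClusterIssuer
```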
!!! note
    Self-signed CA warnings are expected unless you have trusted the homelab CA on your machine.
## Vault Issues

### Vault Pod Fails to Start (CreateContainerConfigError)

The `vault-aws-kms` Secret is missing from the `vault` namespace. Vault requires this Secret to decrypt the master key via AWS KMS before it can start.

Create it before the pod starts:

```bash
kubectl create namespace vault  # if it doesn't exist yet
kubectl create secret generic vault-aws-kms \
  --namespace vault \
  --from-literal=AWS_ACCESS_KEY_ID="<access_key_id>" \
  --from-literal=AWS_SECRET_ACCESS_KEY="<secret_access_key>" \
  --from-literal=AWS_REGION="us-east-1" \
  --from-literal=VAULT_AWSKMS_SEAL_KEY_ID="<kms_key_id>"
```
### Vault Starts but Remains Sealed

Vault is running but KMS connectivity is failing. Check the logs:

```bash
kubectl -n vault logs vault-0
```

Look for `failed to unseal` or `AccessDeniedException`. Verify that the IAM policy attached to the Vault user allows `kms:Decrypt`, `kms:Encrypt`, and `kms:DescribeKey` on the key, and that the credentials in `vault-aws-kms` are correct.
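A minimal IAM policy granting those three actions might look like this (the statement ID is illustrative and the key ARN placeholders must be filled in with your account ID and KMS key ID):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VaultKMSUnseal",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:<account-id>:key/<kms_key_id>"
    }
  ]
}
```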
### ESO Shows SecretSyncError

If Vault is unsealed but ExternalSecrets are failing to sync, the ESO Kubernetes auth token may have expired. Force a resync:

```bash
kubectl annotate externalsecret <name> -n <namespace> \
  force-sync=$(date +%s) --overwrite
```