# Troubleshooting

This runbook covers common issues and the commands used to diagnose them.
## General Debugging

Start any investigation with these commands:

```bash
# Pod status overview
kubectl get pods -n <namespace>

# Detailed pod information and events
kubectl describe pod -n <namespace> <pod>

# Pod logs (current instance)
kubectl logs -n <namespace> <pod> [-c <container>]

# Pod logs (previous crashed instance)
kubectl logs -n <namespace> <pod> --previous

# Recent events sorted by time
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```
## ArgoCD Sync Issues

### App Stuck in "Progressing"

The application reports `Progressing` but never reaches `Healthy`.

Possible causes:

- Resource quotas preventing pod scheduling
- Missing CRDs that the application depends on (check sync wave ordering)
- Failed health checks (readiness/liveness probes misconfigured)
- Pods stuck in `Pending` due to insufficient node resources

Diagnosis:

```bash
kubectl get application -n argocd <app-name> -o yaml
kubectl get pods -n <namespace> -l app.kubernetes.io/instance=<app-name>
kubectl describe pod -n <namespace> <pod>
```
### App Shows "OutOfSync" but Is Healthy

The application is running correctly but ArgoCD reports it as out of sync.

Possible causes:

- Server-side apply drift: fields managed by controllers (e.g., defaulted values) differ from the Git source
- Missing `ignoreDifferences` configuration for fields that are legitimately modified at runtime

Resolution:

Check which fields are out of sync in the ArgoCD UI (App > Diff). If the drift is expected, add an `ignoreDifferences` block to the Application spec.
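For example, to tolerate replica-count drift caused by an autoscaler, an `ignoreDifferences` block might look like the following sketch (the application name and field path are illustrative; adapt them to the field shown in the diff):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app            # illustrative name
  namespace: argocd
spec:
  # ... source, destination, syncPolicy ...
  ignoreDifferences:
    - group: apps
      kind: Deployment
      jsonPointers:
        - /spec/replicas  # field legitimately modified at runtime (e.g., by an HPA)
```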
### Force Sync

If an application is stuck, force a sync with replacement from the ArgoCD UI:

- Open `https://argocd.homelab.local`
- Select the application
- Click **Sync**, enable **Replace**, and click **Synchronize**
## Pod CrashLoopBackOff

The pod starts, crashes, and Kubernetes keeps restarting it with increasing backoff delays.

Diagnosis:

```bash
kubectl logs -n <namespace> <pod> --previous
kubectl describe pod -n <namespace> <pod>
```

Common causes:

| Cause | Symptoms in Logs |
|---|---|
| Missing secret | `Error: secret "<name>" not found` or env var is empty |
| Wrong image tag | `ImagePullBackOff` or `ErrImagePull` |
| Configuration error | Application-specific error messages at startup |
| Resource limits too low | `OOMKilled` in pod events |
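For the `OOMKilled` case, the usual fix is to raise the container's memory limit. A minimal Deployment fragment, with illustrative names and values:

```yaml
# Deployment fragment (names and values are illustrative)
spec:
  template:
    spec:
      containers:
        - name: app
          resources:
            requests:
              memory: "256Mi"
            limits:
              memory: "512Mi"  # raise this if the pod is repeatedly OOMKilled
```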
## Volume Mount Issues

### NFS Timeouts

Pods are stuck in `ContainerCreating` with mount errors referencing the NFS server.

Diagnosis:

```bash
# Verify NAS is reachable
ping 192.168.1.158

# Check NFS provisioner pod
kubectl get pods -n nfs-provisioner
kubectl logs -n nfs-provisioner -l app=nfs-subdir-external-provisioner
```
### PVC Stuck in Pending

A PersistentVolumeClaim remains in `Pending` state and is never bound.

Diagnosis:

```bash
kubectl describe pvc -n <namespace> <pvc-name>
kubectl get storageclass
kubectl logs -n nfs-provisioner -l app=nfs-subdir-external-provisioner
```

Common causes:

- NFS provisioner pod is not running
- The `nfs-client` StorageClass does not exist
- NAS NFS exports are misconfigured or unreachable
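For reference, a PVC that binds through the NFS provisioner should name the `nfs-client` StorageClass explicitly. A sketch with illustrative claim name, namespace, and size:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-claim              # illustrative name
  namespace: arr
spec:
  storageClassName: nfs-client   # must match an existing StorageClass
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```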
### Permission Issues

Applications report permission denied errors when accessing files on NFS volumes.

Resolution:

Check that the `PUID` and `PGID` values in the `arr-env` ConfigMap match the user/group ownership on the NAS share. LinuxServer.io containers use these environment variables to set the runtime user.

```bash
kubectl get configmap -n arr arr-env -o yaml
```
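The ConfigMap might look roughly like this (the UID/GID values are illustrative; use the IDs that actually own the share on your NAS):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: arr-env
  namespace: arr
data:
  PUID: "1000"  # illustrative: must match the share owner's UID on the NAS
  PGID: "1000"  # illustrative: must match the share owner's GID
```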
## Network Issues

### Service Unreachable

An application's service does not respond.

```bash
kubectl get svc -n <namespace>
kubectl get endpoints -n <namespace> <service-name>
```

If endpoints are empty, the service selector does not match any running pods.
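For endpoints to populate, the Service selector must equal the labels on the pod template. A minimal matching pair, with illustrative names, labels, and ports:

```yaml
# Service side: the selector...
apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app.kubernetes.io/instance: my-app  # must equal the pod labels below
  ports:
    - port: 80
      targetPort: 8080
---
# ...must match the pod template labels on the Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: my-app  # a mismatch here yields empty endpoints
    spec:
      containers:
        - name: app
          image: nginx:1.27  # illustrative image
          ports:
            - containerPort: 8080
```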
### Ingress Not Working

The application is running but not reachable via its hostname.

Diagnosis:

```bash
# Check HTTPRoute resources
kubectl get httproute -n <namespace>

# Check Gateway status
kubectl get gateway homelab-gateway -n default

# Verify DNS resolves to the Cilium L2 VIP
dig <app-name>.homelab.local
```

Common causes:

- DNS does not point to the Cilium L2 LoadBalancer IP
- HTTPRoute `parentRef` does not reference the correct Gateway
- The backend service name or port is misconfigured
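A minimal HTTPRoute that wires a hostname to a backend service might look like this sketch (route name, namespace, hostname, and backend port are illustrative; the Gateway name matches the diagnosis commands above):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app               # illustrative name
  namespace: arr
spec:
  parentRefs:
    - name: homelab-gateway  # must reference the correct Gateway
      namespace: default
  hostnames:
    - my-app.homelab.local
  rules:
    - backendRefs:
        - name: my-app       # must match the Service name...
          port: 80           # ...and its exposed port
```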
### VPN Not Connecting

The Gluetun VPN container fails to establish a connection, blocking qBittorrent.

```bash
kubectl logs -n arr -l app.kubernetes.io/instance=arr-vpn-downloads -c gluetun
```

Common causes:

- VPN credentials secret is missing or contains incorrect values
- VPN provider is experiencing an outage
- Firewall rules are blocking the VPN connection
## Certificate Issues

### TLS Errors in Browser

The browser reports certificate errors beyond the expected self-signed CA warning.

Diagnosis:

```bash
# Check Certificate resources
kubectl get certificates --all-namespaces

# Check cert-manager logs
kubectl logs -n cert-manager -l app.kubernetes.io/name=cert-manager

# Check ClusterIssuer
kubectl get clusterissuer
kubectl describe clusterissuer homelab-ca-issuer
```

Common causes:

- The `homelab-ca-issuer` ClusterIssuer does not exist or is not ready
- cert-manager pods are not running
- The Certificate resource failed to issue (check events with `kubectl describe certificate`)
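For reference, a Certificate issued by the cluster CA references the issuer like this (the name, namespace, and DNS name are illustrative):

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-app-tls           # illustrative name
  namespace: arr
spec:
  secretName: my-app-tls     # Secret where the signed certificate is stored
  dnsNames:
    - my-app.homelab.local
  issuerRef:
    name: homelab-ca-issuer  # must match an existing, ready ClusterIssuer
    kind: ClusterIssuer
```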
!!! note
    Self-signed CA warnings are expected unless you have trusted the homelab CA on your machine.
## Vault Issues

### Vault Pod Fails to Start (CreateContainerConfigError)

The `vault-aws-kms` Secret is missing from the `vault` namespace. Vault requires this Secret to decrypt the master key via AWS KMS before it can start.

Create it before the pod starts:

```bash
kubectl create namespace vault  # if it doesn't exist yet
kubectl create secret generic vault-aws-kms \
  --namespace vault \
  --from-literal=AWS_ACCESS_KEY_ID="<access_key_id>" \
  --from-literal=AWS_SECRET_ACCESS_KEY="<secret_access_key>" \
  --from-literal=AWS_REGION="us-east-1" \
  --from-literal=VAULT_AWSKMS_SEAL_KEY_ID="<kms_key_id>"
```
### Vault Starts but Remains Sealed

Vault is running but KMS connectivity is failing. Check the logs:

```bash
kubectl -n vault logs vault-0
```

Look for `failed to unseal` or `AccessDeniedException`. Verify that the IAM policy attached to the Vault user allows `kms:Decrypt`, `kms:Encrypt`, and `kms:DescribeKey` on the key, and that the credentials in `vault-aws-kms` are correct.
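A minimal IAM policy granting those three actions might look like this (the statement ID is illustrative and the key ARN placeholders must be filled in with your account ID and KMS key ID):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VaultKMSUnseal",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:us-east-1:<account-id>:key/<kms_key_id>"
    }
  ]
}
```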
### ESO Shows SecretSyncError

If Vault is unsealed but ExternalSecrets are failing to sync, the ESO Kubernetes auth token may have expired. Force a resync:

```bash
kubectl annotate externalsecret <name> -n <namespace> \
  force-sync=$(date +%s) --overwrite
```