Alerting¶

Alertmanager routes alerts from Prometheus to a Slack #alerts channel. The kube-prometheus-stack ships roughly 100 default Kubernetes alerting rules; custom homelab-specific rules supplement them with checks for app health, infrastructure health, backup health, and node health.

Architecture¶

Prometheus ──► Alertmanager ──► Slack #alerts
   ▲
   │
   ├── Default kube-prometheus rules (~100)
   ├── Custom homelab rules (homelab-rules.yml)
   └── Velero metrics (ServiceMonitor)

Slack Integration¶

Alertmanager sends notifications to a single #alerts channel via an Incoming Webhook.

Setup¶

1. Create a Slack App at https://api.slack.com/apps
2. Enable Incoming Webhooks and add one to your #alerts channel
3. Write the webhook URL to Vault:

vault kv put homelab/infrastructure/alertmanager-slack \
  url=https://hooks.slack.com/services/T.../B.../xxx

An ExternalSecret syncs the webhook URL from Vault into a Secret named alertmanager-slack-webhook in the monitoring namespace. Alertmanager mounts that Secret via alertmanagerSpec.secrets and reads the URL from /etc/alertmanager/secrets/alertmanager-slack-webhook/url.
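A minimal sketch of what that ExternalSecret might look like. The store name (`vault-backend`), store kind, refresh interval, and the assumption that the store is rooted at the `homelab` KV mount are all hypothetical; the API version may also be `external-secrets.io/v1` depending on the operator release.

```yaml
# Sketch only: store name, kind, and refresh interval are assumptions.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-slack-webhook
  namespace: monitoring
spec:
  refreshInterval: 1h                    # assumed value
  secretStoreRef:
    kind: ClusterSecretStore             # assumed; could be a namespaced SecretStore
    name: vault-backend                  # hypothetical store name
  target:
    name: alertmanager-slack-webhook     # Secret that Alertmanager mounts
  data:
    - secretKey: url                     # becomes the file .../alertmanager-slack-webhook/url
      remoteRef:
        key: infrastructure/alertmanager-slack   # assumes the store points at the `homelab` mount
        property: url
```

The matching kube-prometheus-stack values fragment would then list the Secret so that it is mounted under /etc/alertmanager/secrets/:

```yaml
alertmanager:
  alertmanagerSpec:
    secrets:
      - alertmanager-slack-webhook
```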

Routing¶

Behavior	Value
Group by	`alertname`, `namespace`
Group wait	30s
Group interval	5m
Repeat interval	4h
Inhibition	A critical alert suppresses warning alerts with the same `alertname` and `namespace`

The Watchdog alert (a dead-man's-switch from the default rules) is routed to a null receiver to avoid noise.
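The routing behavior above could be expressed in the kube-prometheus-stack values roughly as follows. This is a sketch, not the actual configuration: the receiver name `slack-alerts` and the `send_resolved` setting are assumptions, while the timings, grouping labels, and secret path come from this page.

```yaml
alertmanager:
  config:
    route:
      receiver: slack-alerts             # hypothetical receiver name
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # Watchdog fires continuously by design, so drop it.
        - receiver: "null"
          matchers:
            - 'alertname = "Watchdog"'
    inhibit_rules:
      # A firing critical alert mutes the warning with the same
      # alertname and namespace.
      - source_matchers:
          - 'severity = "critical"'
        target_matchers:
          - 'severity = "warning"'
        equal: ["alertname", "namespace"]
    receivers:
      - name: "null"
      - name: slack-alerts
        slack_configs:
          - channel: "#alerts"
            # Read the webhook URL from the mounted Secret.
            api_url_file: /etc/alertmanager/secrets/alertmanager-slack-webhook/url
            send_resolved: true          # assumption
```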

Custom Homelab Rules¶

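The custom rules are defined in kube-prometheus-stack/homelab-rules.yml as a standalone PrometheusRule resource. Prometheus discovers it because ruleSelectorNilUsesHelmValues is set to false, so the rule selector is not restricted to rules carrying the Helm release's labels.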
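For reference, the setting lives under `prometheus.prometheusSpec` in the kube-prometheus-stack Helm values. The fragment below shows only that key and assumes nothing else about the values file:

```yaml
prometheus:
  prometheusSpec:
    # With this set to false and no explicit ruleSelector, Prometheus
    # selects PrometheusRule objects regardless of their labels.
    ruleSelectorNilUsesHelmValues: false
```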

App Health¶

Alert	Severity	For	Condition
`ArrAppDown`	critical	5m	Any arr deployment has 0 available replicas
`GluetunVPNDown`	warning	10m	Gluetun VPN sidecar container not ready
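As an illustration, these two alerts could be written against kube-state-metrics as below. The namespace `arr` and the container name `gluetun` are assumptions, and the real expressions in homelab-rules.yml may differ.

```yaml
# Rule group fragment; it would sit under spec.groups of a PrometheusRule.
- name: app-health
  rules:
    - alert: ArrAppDown
      # Assumes the arr apps run in a namespace called "arr".
      expr: kube_deployment_status_replicas_available{namespace="arr"} == 0
      for: 5m
      labels:
        severity: critical
    - alert: GluetunVPNDown
      # Assumes the VPN sidecar container is named "gluetun".
      expr: kube_pod_container_status_ready{container="gluetun"} == 0
      for: 10m
      labels:
        severity: warning
```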

Infrastructure Health¶

Alert	Severity	For	Condition
`AuthentikDown`	critical	5m	Authentik server deployment has 0 available replicas
`LokiDown`	critical	5m	Loki deployment has 0 available replicas
`NFSStorageLow`	warning	15m	Any PVC usage above 85%
`CertificateExpiringSoon`	warning	1h	cert-manager certificate expires within 14 days
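The storage and certificate conditions might be sketched as follows, using the kubelet volume metrics and the cert-manager expiry metric. The expressions are an assumption about how the thresholds are computed, not a copy of the actual rules; note that some NFS provisioners do not report kubelet volume stats.

```yaml
# Rule group fragment; it would sit under spec.groups of a PrometheusRule.
- name: infrastructure-health
  rules:
    - alert: NFSStorageLow
      # Fraction of PVC capacity in use, above 85%.
      expr: |
        kubelet_volume_stats_used_bytes
          / kubelet_volume_stats_capacity_bytes > 0.85
      for: 15m
      labels:
        severity: warning
    - alert: CertificateExpiringSoon
      # Seconds until expiry, below 14 days.
      expr: |
        certmanager_certificate_expiration_timestamp_seconds - time()
          < 14 * 24 * 3600
      for: 1h
      labels:
        severity: warning
```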

Backup Health¶

Alert	Severity	For	Condition
`VeleroBackupFailed`	critical	--	Backup failure in the last 24h
`VeleroBackupMissing`	warning	1h	No successful backup for a schedule in 25h
`VeleroBackupPartialFailure`	warning	--	Partial failure in the last 24h

These rules require Velero metrics to be enabled (metrics.serviceMonitor.enabled: true in the Velero Helm values).
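A possible shape for the backup rules, based on the counters and timestamps Velero exports. The exact expressions and label filters are assumptions; the windows and severities follow the table above.

```yaml
# Rule group fragment; it would sit under spec.groups of a PrometheusRule.
- name: backup-health
  rules:
    - alert: VeleroBackupFailed
      # Fires as soon as the failure counter grew in the last 24h.
      expr: increase(velero_backup_failure_total[24h]) > 0
      labels:
        severity: critical
    - alert: VeleroBackupPartialFailure
      expr: increase(velero_backup_partial_failure_total[24h]) > 0
      labels:
        severity: warning
    - alert: VeleroBackupMissing
      # Last successful backup for a schedule is older than 25h.
      expr: |
        time() - velero_backup_last_successful_timestamp{schedule!=""}
          > 25 * 3600
      for: 1h
      labels:
        severity: warning
```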

Node Health¶

Alert	Severity	For	Condition
`HighNodeCPU`	warning	15m	Sustained CPU above 85%
`HighNodeMemory`	warning	15m	Memory usage above 90%
`NodeDiskPressure`	critical	5m	Root filesystem above 90% full

These supplement the default kube-prometheus-stack node alerts, several of which (for example NodeFilesystemSpaceFillingUp) predict exhaustion from recent trends rather than comparing against a static threshold.
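Static thresholds like these are commonly written against node_exporter metrics. The following is a sketch under that assumption; the rate window and the `mountpoint="/"` filter are illustrative choices rather than the actual rule definitions.

```yaml
# Rule group fragment; it would sit under spec.groups of a PrometheusRule.
- name: node-health
  rules:
    - alert: HighNodeCPU
      # 100 minus the average idle percentage per node.
      expr: |
        100 - avg by (instance) (
          rate(node_cpu_seconds_total{mode="idle"}[5m])
        ) * 100 > 85
      for: 15m
      labels:
        severity: warning
    - alert: HighNodeMemory
      expr: |
        (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
          * 100 > 90
      for: 15m
      labels:
        severity: warning
    - alert: NodeDiskPressure
      # Root filesystem usage, assuming it is mounted at "/".
      expr: |
        (1 - node_filesystem_avail_bytes{mountpoint="/"}
           / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 90
      for: 5m
      labels:
        severity: critical
```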

Adding New Rules¶

Add new PrometheusRule resources in the monitoring namespace. With ruleSelectorNilUsesHelmValues: false, Prometheus picks up all rules regardless of labels.

Example rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-custom-rules
  namespace: monitoring
spec:
  groups:
    - name: my-group
      rules:
        - alert: MyAlert
          expr: some_metric > threshold  # placeholder; replace with a real PromQL expression
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Short description"
            description: "Detailed description with {{ $labels.instance }}."

Silencing Alerts¶

To temporarily silence an alert, use the Alertmanager UI at alertmanager.homelab.local:

1. Navigate to Silences > New Silence
2. Add a matcher (e.g., alertname = HighNodeCPU)
3. Set duration and comment

Silences are ephemeral: they expire when their duration ends and are not stored in Git.