Alerting¶
Alertmanager routes alerts from Prometheus to a Slack #alerts channel. The kube-prometheus-stack ships ~100 default Kubernetes alerting rules, and custom homelab-specific rules supplement them for app health, backup health, and node-level monitoring.
Architecture¶
Prometheus ──► Alertmanager ──► Slack #alerts
▲
│
├── Default kube-prometheus rules (~100)
├── Custom homelab rules (homelab-rules.yml)
└── Velero metrics (ServiceMonitor)
Slack Integration¶
Alertmanager sends notifications to a single #alerts channel via an Incoming Webhook.
Setup¶
- Create a Slack App at https://api.slack.com/apps
- Enable Incoming Webhooks and add one to your #alerts channel
- Write the webhook URL to Vault:
```sh
vault kv put homelab/infrastructure/alertmanager-slack \
  url=https://hooks.slack.com/services/T.../B.../xxx
```
The ExternalSecret syncs alertmanager-slack-webhook from Vault into the monitoring namespace. Alertmanager mounts it via alertmanagerSpec.secrets and reads the URL from /etc/alertmanager/secrets/alertmanager-slack-webhook/url.
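The sync described above can be sketched as an ExternalSecret manifest like the following. The secret store name (`vault-backend`) and its kind are assumptions; adjust them to match your External Secrets Operator setup:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: alertmanager-slack-webhook
  namespace: monitoring
spec:
  secretStoreRef:
    name: vault-backend        # assumed store name; match your setup
    kind: ClusterSecretStore
  target:
    name: alertmanager-slack-webhook   # Secret mounted by Alertmanager
  data:
    - secretKey: url
      remoteRef:
        key: homelab/infrastructure/alertmanager-slack
        property: url
```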
Routing¶
| Behavior | Value |
|---|---|
| Group by | alertname, namespace |
| Group wait | 30s |
| Group interval | 5m |
| Repeat interval | 4h |
| Inhibition | Critical suppresses warning for same alert+namespace |
The Watchdog alert (a dead-man's-switch from the default rules) is routed to a null receiver to avoid noise.
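The routing table above corresponds roughly to an Alertmanager configuration like this sketch (receiver names are illustrative; in kube-prometheus-stack this lives under `alertmanager.config` in the Helm values):

```yaml
route:
  group_by: ["alertname", "namespace"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack
  routes:
    - matchers:
        - alertname = "Watchdog"
      receiver: "null"        # dead-man's-switch goes nowhere
receivers:
  - name: slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/alertmanager-slack-webhook/url
        channel: "#alerts"
  - name: "null"
inhibit_rules:
  - source_matchers: [severity = "critical"]
    target_matchers: [severity = "warning"]
    equal: ["alertname", "namespace"]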
Custom Homelab Rules¶
Defined in kube-prometheus-stack/homelab-rules.yml as a standalone PrometheusRule resource. Prometheus discovers it because ruleSelectorNilUsesHelmValues is set to false.
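In the kube-prometheus-stack Helm values, that setting sits under the Prometheus spec:

```yaml
prometheus:
  prometheusSpec:
    # Discover all PrometheusRule resources, not just Helm-labeled ones
    ruleSelectorNilUsesHelmValues: false
```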
App Health¶
| Alert | Severity | For | Condition |
|---|---|---|---|
| ArrAppDown | critical | 5m | Any arr deployment has 0 available replicas |
| GluetunVPNDown | warning | 10m | Gluetun VPN sidecar container not ready |
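As an illustration of how such a rule might be expressed, here is a sketch of the `ArrAppDown` alert using the kube-state-metrics `kube_deployment_status_replicas_available` metric. The namespace and deployment names are assumptions, not the actual rule file:

```yaml
- alert: ArrAppDown
  # namespace/deployment selectors are hypothetical examples
  expr: kube_deployment_status_replicas_available{namespace="media", deployment=~"sonarr|radarr|prowlarr"} == 0
  for: 5m
  labels:
    severity: critical
```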
Infrastructure Health¶
| Alert | Severity | For | Condition |
|---|---|---|---|
| AuthentikDown | critical | 5m | Authentik server deployment has 0 replicas |
| LokiDown | critical | 5m | Loki deployment has 0 available replicas |
| NFSStorageLow | warning | 15m | Any PVC usage above 85% |
| CertificateExpiringSoon | warning | 1h | cert-manager certificate expires within 14 days |
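For example, the certificate-expiry check could be written against cert-manager's `certmanager_certificate_expiration_timestamp_seconds` metric. This is a plausible sketch, not necessarily the exact expression in `homelab-rules.yml`:

```yaml
- alert: CertificateExpiringSoon
  # fires when any certificate expires within 14 days
  expr: certmanager_certificate_expiration_timestamp_seconds - time() < 14 * 24 * 3600
  for: 1h
  labels:
    severity: warning
```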
Backup Health¶
| Alert | Severity | For | Condition |
|---|---|---|---|
| VeleroBackupFailed | critical | -- | Backup failure in the last 24h |
| VeleroBackupMissing | warning | 1h | No successful backup for a schedule in 25h |
| VeleroBackupPartialFailure | warning | -- | Partial failure in the last 24h |
Requires Velero metrics to be enabled (metrics.serviceMonitor.enabled: true in the Velero Helm values).
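In the Velero Helm chart values, enabling the ServiceMonitor looks like this:

```yaml
metrics:
  enabled: true
  serviceMonitor:
    enabled: true   # creates a ServiceMonitor so Prometheus scrapes Velero
```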
Node Health¶
| Alert | Severity | For | Condition |
|---|---|---|---|
| HighNodeCPU | warning | 15m | Sustained CPU above 85% |
| HighNodeMemory | warning | 15m | Memory usage above 90% |
| NodeDiskPressure | critical | 5m | Root filesystem above 90% full |
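The CPU alert, for instance, can be expressed with the standard node_exporter idle-CPU idiom. Again a sketch of one plausible formulation, not a copy of the rule file:

```yaml
- alert: HighNodeCPU
  # 100 minus average idle percentage = busy percentage per node
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
  for: 15m
  labels:
    severity: warning
```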
These supplement the default kube-prometheus-stack node alerts, which use predictive thresholds rather than static ones.
Adding New Rules¶
Add new PrometheusRule resources in the monitoring namespace. With ruleSelectorNilUsesHelmValues: false, Prometheus picks up all rules regardless of labels.
Example rule:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-custom-rules
  namespace: monitoring
spec:
  groups:
    - name: my-group
      rules:
        - alert: MyAlert
          expr: some_metric > threshold
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Short description"
            description: "Detailed description with {{ $labels.instance }}."
```
Silencing Alerts¶
To temporarily silence an alert, use the Alertmanager UI at alertmanager.homelab.local:
- Navigate to Silences > New Silence
- Add a matcher (e.g., alertname = HighNodeCPU)
- Set duration and comment
Silences are ephemeral and not stored in Git.
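Silences can also be created from the command line with `amtool`, Alertmanager's companion CLI. The URL below assumes the UI address mentioned above:

```sh
amtool silence add alertname=HighNodeCPU \
  --duration=2h \
  --comment="node maintenance" \
  --alertmanager.url=http://alertmanager.homelab.local
```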