Phase 5 -- Observability¶
Status: Not started
Goal: Complete the observability trifecta (metrics, logs, traces) and shift from threshold-based alerting to SLO-driven operations.
Addresses: K10, K14, K8
5.1 Add Distributed Tracing¶
- Deploy OpenTelemetry Collector as a DaemonSet (OTLP receiver)
- Deploy Grafana Tempo for trace storage
- Add Tempo as a Grafana datasource
- Instrument the request path: Cilium Gateway, Authentik forward-auth, application backends
- Verify traces appear in Grafana and correlate with metrics and logs
- Write an ADR
| Why | Metrics tell you what is broken. Logs tell you where. Traces tell you why by showing the full request lifecycle across services. Debugging slow Jellyfin loads or intermittent auth failures currently requires manually correlating timestamps. |
| Stack | OpenTelemetry Collector (DaemonSet) → Grafana Tempo (storage) → Grafana (visualization). |
5.2 Grafana Dashboards-as-Code¶
- Export existing Grafana dashboards to JSON
- Store in Git under the kube-prometheus-stack component
- Enable the Grafana sidecar to load dashboards from labeled ConfigMaps
- Deploy via ArgoCD
- Verify dashboards survive a full Grafana PVC wipe
| Why | Dashboards are created in the Grafana UI and stored in the PVC. A DR event loses dashboards created between the last backup and the failure. Dashboards-as-code makes them reproducible and reviewable. |
| Approach | kube-prometheus-stack already supports sidecar.dashboards.enabled. ConfigMaps with a specific label are auto-loaded. |
5.3 SLO-Based Alerting¶
- Choose a tool: Pyrra or Sloth
- Define SLOs for critical services (see below)
- Generate Prometheus recording rules and multi-window burn rate alerts
- Add SLO dashboards to Grafana
- Write an ADR
| Why | Current alerts fire on fixed thresholds (CPU > 85%, restarts > 5/hr). These are guesses that cause alert fatigue or fire too late. SLO-based alerting fires when users are impacted, measured by error budget burn rate. |
Example SLOs:
| Service | SLO | Error Budget |
|---|---|---|
| Jellyfin | 99.5% availability | ~3.6 hours/month |
| Authentik | 99.9% availability | ~43 minutes/month |
| ArgoCD | 99% sync success rate | ~7.3 hours/month |
5.4 Upgrade Synthetic Monitoring to Prometheus-Native Probes¶
Uptime Kuma already provides synthetic monitoring and a status page. This task upgrades to Blackbox Exporter for tighter Prometheus integration and SLO-compatible metrics.
- Deploy Blackbox Exporter
- Configure probes for every HTTPRoute endpoint
- Add PrometheusRules for probe failure and response time thresholds
- Add a Grafana dashboard for probe status
- Evaluate whether Uptime Kuma remains valuable alongside Blackbox Exporter (status page, external notifications) or should be retired
| Why | Uptime Kuma validates endpoint reachability but its metrics are not in Prometheus. Blackbox Exporter tests the full request path (DNS → Gateway → TLS → Authentik forward-auth → backend) and feeds directly into SLO burn-rate alerts (5.3). |
Definition of Done¶
- Request traces visible in Grafana for auth-gated flows
- All Grafana dashboards versioned in Git, deployed via ArgoCD
- SLOs defined for Jellyfin, Authentik, and ArgoCD with burn-rate alerts
- Synthetic probes testing every HTTPS endpoint