# Monitoring & Observability
This document covers the full monitoring and observability stack, including metrics collection, log aggregation, visualization, and alerting.
## Observability Stack Overview
The monitoring stack provides two parallel data pipelines -- one for metrics and one for logs -- converging in Grafana for unified visualization.
```mermaid
flowchart LR
    subgraph metricsPath["Metrics Pipeline"]
        nodeExp["Node Exporter"] --> prometheus["Prometheus"]
        ksm["kube-state-metrics"] --> prometheus
        kubelet["Kubelet /metrics"] --> prometheus
    end
    subgraph logsPath["Logs Pipeline"]
        pods["Application Pods"] --> alloy["Alloy\n(DaemonSet)"]
        auditLog["API Server\nAudit Log"] --> alloy
        alloy --> loki["Loki\n(single-binary)"]
    end
    subgraph visualization["Visualization & Alerting"]
        prometheus --> grafana["Grafana"]
        loki --> grafana
        prometheus --> alertmanager["Alertmanager"]
    end
```
## Component Inventory
| Component | Purpose | Deployment | Storage |
|---|---|---|---|
| Prometheus | Metrics collection, storage, and rule evaluation | StatefulSet | 20Gi PVC (local-path), 15d retention |
| Grafana | Dashboards and visualization for metrics and logs | Deployment | Persistent PVC (nfs-client) |
| Alertmanager | Alert routing, grouping, and notification | StatefulSet | Ephemeral |
| Node Exporter | Host-level hardware and OS metrics | DaemonSet | None |
| kube-state-metrics | Kubernetes object state metrics | Deployment | None |
| Loki | Log aggregation and querying | StatefulSet (single-binary) | 10Gi PVC (nfs-client), 168h retention |
| Alloy | Pod log collection and shipping | DaemonSet | None (streams to Loki) |
| Exportarr | Prometheus metrics exporter for *arr apps | Multi-Deployment (one per target) | None |
| Uptime Kuma | Synthetic HTTP/TCP/DNS monitoring | Deployment | 1Gi PVC (nfs-client) |
| VPA Recommender | Per-workload CPU/memory right-sizing recommendations | Deployment | None |
| Goldilocks | Auto-creates VPA CRs, provides recommendation dashboard | Deployment (controller + dashboard) | None |
## Metrics Pipeline

### Prometheus
Prometheus is deployed via the kube-prometheus-stack Helm chart (sync wave -1) and serves as the central metrics store.
| Setting | Value |
|---|---|
| Retention | 15 days |
| Storage | 20Gi PVC (nfs-client) |
| Access | prometheus.homelab.local |
| Sync Wave | -1 |
Prometheus scrapes metrics from:
- Node Exporter -- CPU, memory, disk, network, and other host-level metrics from every node
- kube-state-metrics -- Kubernetes object states (pod status, deployment replicas, node conditions)
- Kubelet metrics -- Container resource usage and pod lifecycle events
- Exportarr -- *arr application metrics (queue depth, library size, missing episodes) via ServiceMonitors
- Application metrics -- Any pods with Prometheus scrape annotations
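For the last category, pods typically opt in via the conventional Prometheus scrape annotations. A minimal sketch (the pod name, image, and port are illustrative; note that kube-prometheus-stack only honors these annotations if an additional scrape config for them has been provided):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # illustrative name
  annotations:
    prometheus.io/scrape: "true"    # opt this pod into scraping
    prometheus.io/port: "8080"      # port serving the metrics endpoint
    prometheus.io/path: "/metrics"  # default path, shown for clarity
spec:
  containers:
    - name: app
      image: example/app:latest     # placeholder image
      ports:
        - containerPort: 8080
```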
### Alertmanager
Alertmanager receives alerts from Prometheus and handles deduplication, grouping, silencing, and routing.
| Setting | Value |
|---|---|
| Access | alertmanager.homelab.local |
| Deployment | Part of kube-prometheus-stack |
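A minimal routing tree, as it might appear under the chart's Alertmanager config values (the receiver name and intervals here are illustrative, not this cluster's actual configuration):

```yaml
route:
  receiver: default                    # fallback receiver (name is illustrative)
  group_by: ["alertname", "namespace"] # alerts sharing these labels notify together
  group_wait: 30s                      # wait before the first notification for a group
  group_interval: 5m                   # wait before notifying about new alerts in a group
  repeat_interval: 12h                 # re-notify for still-firing alerts
receivers:
  - name: default
    # notification integrations (e.g. webhook, email) would go here
```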
### Node Exporter
Node Exporter runs as a DaemonSet, ensuring one instance per node. It exposes host-level metrics including:
- CPU utilization and load averages
- Memory and swap usage
- Disk I/O and filesystem capacity
- Network interface statistics
- System temperature (where available)
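A couple of standard PromQL queries over these metrics (the metric names are Node Exporter defaults):

```promql
# Per-node CPU utilization (%), derived from time spent idle
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Filesystem space remaining (%), excluding tmpfs mounts
100 * node_filesystem_avail_bytes{fstype!="tmpfs"}
    / node_filesystem_size_bytes{fstype!="tmpfs"}
```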
### kube-state-metrics
kube-state-metrics generates metrics about Kubernetes object states by listening to the Kubernetes API server. Key metrics include:
- Pod phase and container status
- Deployment and StatefulSet replica counts
- PersistentVolume and PVC status
- Node conditions and resource capacity
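These object-state metrics are what alerting rules are usually built on; for example:

```promql
# Pods stuck outside the Running phase, by namespace
sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"})

# Deployments with fewer available replicas than desired
kube_deployment_status_replicas_available < kube_deployment_spec_replicas
```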
## Logs Pipeline

### Loki
Loki runs in single-binary mode, combining all Loki components (distributor, ingester, querier, compactor) in a single process. This simplifies deployment for a homelab-scale cluster.
| Setting | Value |
|---|---|
| Mode | Single-binary (monolithic) |
| Retention | 168 hours (7 days) |
| Storage | 10Gi PVC (nfs-client), filesystem backend |
| Access | Via Grafana data source |
| Sync Wave | -1 |
**Filesystem Backend.** Loki uses the local filesystem (backed by an NFS PVC) for chunk and index storage. This avoids the need for an external object store while still providing persistence across pod restarts.
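The filesystem backend and retention settings correspond roughly to a Loki config fragment like the following (a sketch; the directory paths are assumptions, and the actual values live in the Helm chart):

```yaml
storage_config:
  filesystem:
    directory: /loki/chunks          # chunk storage on the PVC (path is illustrative)
compactor:
  working_directory: /loki/compactor # compactor scratch space (path is illustrative)
  retention_enabled: true            # the compactor enforces retention
limits_config:
  retention_period: 168h             # matches the 7-day retention above
```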
### Alloy

Alloy is Grafana's OpenTelemetry-compatible collector, deployed as a DaemonSet so that logs are collected on every node. It replaces Promtail as Grafana's recommended log collector.
| Setting | Value |
|---|---|
| Deployment | DaemonSet |
| Source | Pod logs via Kubernetes discovery |
| Destination | Loki |
| Sync Wave | 0 |
#### Alloy Pipeline Configuration
Alloy uses a pipeline-based configuration that discovers, filters, relabels, and ships logs:
```mermaid
flowchart LR
    discovery["Kubernetes\nPod Discovery"] --> relabel["Relabel\n(namespace, pod,\ncontainer, node)"]
    relabel --> logScrape["Log Scrape\n(pod stdout/stderr)"]
    logScrape --> lokiWrite["Loki Write\nEndpoint"]
    auditFile["/var/log/kubernetes/\naudit/audit.log"] --> auditProcess["JSON Parse\n(level, verb, user)"]
    auditProcess --> lokiWrite
```
- Discovery: Alloy uses the Kubernetes discovery mechanism to find all running pods on the node
- Relabeling: Extracts and attaches metadata labels:
    - `namespace` -- Kubernetes namespace
    - `pod` -- Pod name
    - `container` -- Container name
    - `node` -- Node the pod is running on
- Log scrape: Reads `stdout`/`stderr` log streams from discovered pods
- Audit log collection: Tails the kube-apiserver audit log from the host filesystem and parses JSON fields into labels (see Kubernetes Audit Logging)
- Loki write: Ships labeled log entries to Loki's push API
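The pod-log half of this pipeline can be sketched in Alloy's configuration language roughly as follows (component labels, the relabel rules shown, and the Loki service URL are assumptions, not the cluster's actual config):

```alloy
// Discover all pods running on this node
discovery.kubernetes "pods" {
  role = "pod"
}

// Attach namespace/pod metadata as Loki labels
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
}

// Tail stdout/stderr of the discovered pods
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.write.default.receiver]
}

// Ship labeled entries to Loki's push API
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"  // service URL is an assumption
  }
}
```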
## Kubernetes Audit Logging
The Kubernetes API server is configured to write structured audit events to a log file on the control plane node. Alloy tails this file and ships the events to Loki.
### Audit Policy
The audit policy (/etc/kubernetes/audit/audit-policy.yml) defines what gets logged:
| Event Category | Audit Level | Examples |
|---|---|---|
| Secret mutations | RequestResponse | create, update, delete, patch on secrets |
| RBAC changes | RequestResponse | Changes to roles, clusterroles, and bindings |
| Auth events | RequestResponse | Token reviews, certificate signing requests |
| Infrastructure mutations | RequestResponse | Namespace, node, and PV changes |
| All other mutations | Metadata | Any create, update, delete, patch |
| Read operations | Metadata | Remaining get requests |
| Noise (filtered out) | None | Health checks, watches, lists, lease heartbeats |
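The table maps onto audit policy rules roughly like this (an abbreviated sketch using the `audit.k8s.io/v1` API, not the full policy):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Secret mutations: capture full request and response bodies
  - level: RequestResponse
    verbs: ["create", "update", "delete", "patch"]
    resources:
      - group: ""
        resources: ["secrets"]
  # Health checks and similar noise: drop entirely
  - level: None
    nonResourceURLs: ["/healthz*", "/readyz*", "/livez*"]
  # Everything else: record request metadata only
  - level: Metadata
```

Rules are evaluated in order, so the narrow `RequestResponse` matches must precede the catch-all `Metadata` rule.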
### Log Rotation
Audit logs are rotated by the API server:
- Max age: 7 days
- Max backups: 3 files
- Max size: 100 MB per file
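These limits correspond to the standard kube-apiserver audit flags, as they would appear in the API server's static pod manifest:

```yaml
# kube-apiserver command-line flags (fragment of the static pod manifest)
- --audit-policy-file=/etc/kubernetes/audit/audit-policy.yml
- --audit-log-path=/var/log/kubernetes/audit/audit.log
- --audit-log-maxage=7      # days to retain old files
- --audit-log-maxbackup=3   # rotated files to keep
- --audit-log-maxsize=100   # MB per file before rotation
```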
### Querying Audit Logs
In Grafana Explore, query Loki with:
- `{job="kubernetes-audit"}` -- all audit events
- `{job="kubernetes-audit", verb="delete"}` -- all delete operations
- `{job="kubernetes-audit", user="system:serviceaccount:argocd:argocd-server"}` -- events from a specific service account
- `{job="kubernetes-audit", level="RequestResponse"} |= "secrets"` -- secret access with full request/response bodies
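LogQL also supports metric-style aggregations over these streams, which is useful for spotting unusual activity, e.g. a breakdown of audit events by verb over the last hour:

```logql
sum by (verb) (count_over_time({job="kubernetes-audit"}[1h]))
```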
## Grafana
Grafana provides a unified interface for exploring both metrics (Prometheus) and logs (Loki).
| Setting | Value |
|---|---|
| Access | grafana.homelab.local |
| Storage | Persistent PVC (nfs-client) |
| Admin Credentials | ExternalSecret (grafana-admin, synced from Vault) |
### Pre-Configured Data Sources
| Data Source | Type | Purpose |
|---|---|---|
| Prometheus | Metrics | Default metrics data source for dashboards |
| Loki | Logs | Log exploration and correlation with metrics |
Grafana is deployed with both data sources pre-configured via Helm values, eliminating manual setup after deployment.
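In kube-prometheus-stack, the Prometheus data source is provisioned automatically and extra sources go under `grafana.additionalDataSources`; a sketch of the relevant values fragment (the Loki service URL is an assumption about this cluster):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki:3100  # in-cluster service URL (assumed)
      access: proxy          # Grafana backend proxies the queries
```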
**Log Correlation.** Use Grafana's split view to correlate metric spikes with log entries: select a time range on a Prometheus dashboard panel, then switch to the Explore view with Loki to see logs from the same period.
## Capacity Planning
The cluster includes a capacity planning layer that surfaces resource right-sizing data and cluster-wide headroom visibility.
### VPA + Goldilocks
The Vertical Pod Autoscaler (VPA) recommender analyzes historical CPU and memory usage for each workload and computes right-sizing recommendations. Goldilocks automatically creates a VPA CR for every Deployment and StatefulSet in the cluster, eliminating manual VPA object management.
- VPA Recommender: Runs in `kube-system`, computes target/lower-bound/upper-bound recommendations, and exposes `vpa_status_recommendation` Prometheus metrics
- Goldilocks Controller: Watches all namespaces (except `kube-node-lease` and `kube-public`) and creates `VerticalPodAutoscaler` CRs with `updateMode: "Off"`
- Goldilocks Dashboard: Web UI for browsing per-workload recommendations (access via `kubectl port-forward`)
VPA runs in recommend-only mode -- no pods are ever mutated. Recommendations are advisory and applied through manual manifest updates.
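The CRs Goldilocks creates look roughly like this (the workload name and namespace are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: goldilocks-grafana   # Goldilocks prefixes the CRs it owns
  namespace: monitoring
spec:
  targetRef:                 # the workload being analyzed
    apiVersion: apps/v1
    kind: Deployment
    name: grafana
  updatePolicy:
    updateMode: "Off"        # recommend only; never evict or mutate pods
```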
### Capacity Dashboards
Two custom Grafana dashboards (provisioned via sidecar ConfigMaps) complement the Goldilocks per-workload view with cluster-wide capacity data:
| Dashboard | Purpose |
|---|---|
| Cluster Capacity Overview | CPU/memory requested vs allocatable vs used (timeseries + gauges), pod count, namespace breakdown pie charts |
| Namespace Resource Breakdown | Per-namespace tables with requests, limits, usage, and efficiency %, stacked usage timeseries |
These dashboards answer "how much headroom does the cluster have?" while Goldilocks answers "what should each workload request?"
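The headroom panels boil down to PromQL over kube-state-metrics and cAdvisor series; representative queries (metric names are standard, though exact dashboard queries may differ):

```promql
# Fraction of allocatable CPU already requested, cluster-wide
sum(kube_pod_container_resource_requests{resource="cpu"})
  / sum(kube_node_status_allocatable{resource="cpu"})

# Per-namespace memory efficiency: actual usage vs. requests
sum by (namespace) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})
```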
## Data Flow Summary
```mermaid
flowchart TD
    subgraph nodes["Cluster Nodes"]
        ne["Node Exporter"]
        alloyDs["Alloy DaemonSet"]
        appPods["Application Pods"]
        auditLog["API Server\nAudit Log"]
    end
    subgraph monitoring["monitoring Namespace"]
        prom["Prometheus\n15d retention\n20Gi"]
        loki["Loki\n168h retention\n10Gi"]
        grafana["Grafana"]
        am["Alertmanager"]
        ksm["kube-state-metrics"]
    end
    ne -->|"host metrics"| prom
    ksm -->|"k8s object metrics"| prom
    appPods -->|"scrape annotations"| prom
    appPods -->|"stdout/stderr"| alloyDs
    auditLog -->|"audit events"| alloyDs
    alloyDs -->|"labeled logs"| loki
    prom -->|"metrics queries"| grafana
    loki -->|"log queries"| grafana
    prom -->|"alerting rules"| am
    subgraph arr["arr Namespace"]
        exportarr["Exportarr\n(ServiceMonitors)"]
        arrApps["Sonarr, Radarr,\nProwlarr, Bazarr"]
    end
    arrApps -->|"API"| exportarr
    exportarr -->|"arr metrics"| prom
```