# Monitoring & Observability
This document covers the full monitoring and observability stack, including metrics collection, log aggregation, visualization, and alerting.
## Observability Stack Overview
The monitoring stack provides two parallel data pipelines -- one for metrics and one for logs -- converging in Grafana for unified visualization.
```mermaid
flowchart LR
    subgraph metricsPath["Metrics Pipeline"]
        nodeExp["Node Exporter"] --> prometheus["Prometheus"]
        ksm["kube-state-metrics"] --> prometheus
        kubelet["Kubelet /metrics"] --> prometheus
    end
    subgraph logsPath["Logs Pipeline"]
        pods["Application Pods"] --> alloy["Alloy\n(DaemonSet)"]
        auditLog["API Server\nAudit Log"] --> alloy
        alloy --> loki["Loki\n(single-binary)"]
    end
    subgraph visualization["Visualization & Alerting"]
        prometheus --> grafana["Grafana"]
        loki --> grafana
        prometheus --> alertmanager["Alertmanager"]
    end
```
## Component Inventory
| Component | Purpose | Deployment | Storage |
|---|---|---|---|
| Prometheus | Metrics collection, storage, and rule evaluation | StatefulSet | 20Gi PVC (local-path), 15d retention |
| Grafana | Dashboards and visualization for metrics and logs | Deployment | Persistent PVC (nfs-client) |
| Alertmanager | Alert routing, grouping, and notification | StatefulSet | Ephemeral |
| Node Exporter | Host-level hardware and OS metrics | DaemonSet | None |
| kube-state-metrics | Kubernetes object state metrics | Deployment | None |
| Loki | Log aggregation and querying | StatefulSet (single-binary) | 10Gi PVC (nfs-client), 168h retention |
| Alloy | Pod log collection and shipping | DaemonSet | None (streams to Loki) |
| Exportarr | Prometheus metrics exporter for *arr apps | Multi-Deployment (one per target) | None |
| Uptime Kuma | Synthetic HTTP/TCP/DNS monitoring | Deployment | 1Gi PVC (nfs-client) |
| VPA Recommender | Per-workload CPU/memory right-sizing recommendations | Deployment | None |
| Goldilocks | Auto-creates VPA CRs, provides recommendation dashboard | Deployment (controller + dashboard) | None |
## Metrics Pipeline

### Prometheus
Prometheus is deployed via the kube-prometheus-stack Helm chart (sync wave -1) and serves as the central metrics store.
| Setting | Value |
|---|---|
| Retention | 15 days |
| Storage | 20Gi PVC (nfs-client) |
| Access | prometheus.homelab.local |
| Sync Wave | -1 |
Prometheus scrapes metrics from:
- Node Exporter -- CPU, memory, disk, network, and other host-level metrics from every node
- kube-state-metrics -- Kubernetes object states (pod status, deployment replicas, node conditions)
- Kubelet metrics -- Container resource usage and pod lifecycle events
- Exportarr -- *arr application metrics (queue depth, library size, missing episodes) via ServiceMonitors
- Application metrics -- Any pods with Prometheus scrape annotations
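For the last category, pods typically opt in via the conventional Prometheus scrape annotations. A minimal sketch (the pod name, image, and port are illustrative; note that kube-prometheus-stack only honors these annotations if an additional scrape config for them has been provided):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # illustrative name
  annotations:
    prometheus.io/scrape: "true"    # opt this pod into scraping
    prometheus.io/port: "8080"      # port serving the metrics endpoint
    prometheus.io/path: "/metrics"  # default path, shown for clarity
spec:
  containers:
    - name: app
      image: example/app:latest     # placeholder image
      ports:
        - containerPort: 8080
```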
### Alertmanager
Alertmanager receives alerts from Prometheus and handles deduplication, grouping, silencing, and routing.
| Setting | Value |
|---|---|
| Access | alertmanager.homelab.local |
| Deployment | Part of kube-prometheus-stack |
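A minimal routing tree, as it might appear under the chart's Alertmanager config values (the receiver name and intervals here are illustrative, not this cluster's actual configuration):

```yaml
route:
  receiver: default                    # fallback receiver (name is illustrative)
  group_by: ["alertname", "namespace"] # alerts sharing these labels notify together
  group_wait: 30s                      # wait before the first notification for a group
  group_interval: 5m                   # wait before notifying about new alerts in a group
  repeat_interval: 12h                 # re-notify for still-firing alerts
receivers:
  - name: default
    # notification integrations (e.g. webhook, email) would go here
```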
### Node Exporter
Node Exporter runs as a DaemonSet, ensuring one instance per node. It exposes host-level metrics including:
- CPU utilization and load averages
- Memory and swap usage
- Disk I/O and filesystem capacity
- Network interface statistics
- System temperature (where available)
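A couple of standard PromQL queries over these metrics (the metric names are Node Exporter defaults):

```promql
# Per-node CPU utilization (%), derived from time spent idle
100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100

# Filesystem space remaining (%), excluding tmpfs mounts
100 * node_filesystem_avail_bytes{fstype!="tmpfs"}
    / node_filesystem_size_bytes{fstype!="tmpfs"}
```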
### kube-state-metrics
kube-state-metrics generates metrics about Kubernetes object states by listening to the Kubernetes API server. Key metrics include:
- Pod phase and container status
- Deployment and StatefulSet replica counts
- PersistentVolume and PVC status
- Node conditions and resource capacity
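These object-state metrics are what alerting rules are usually built on; for example:

```promql
# Pods stuck outside the Running phase, by namespace
sum by (namespace, phase) (kube_pod_status_phase{phase!="Running"})

# Deployments with fewer available replicas than desired
kube_deployment_status_replicas_available < kube_deployment_spec_replicas
```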
## Logs Pipeline

### Loki
Loki runs in single-binary mode, combining all Loki components (distributor, ingester, querier, compactor) in a single process. This simplifies deployment for a homelab-scale cluster.
| Setting | Value |
|---|---|
| Mode | Single-binary (monolithic) |
| Retention | 168 hours (7 days) |
| Storage | 10Gi PVC (nfs-client), filesystem backend |
| Access | Via Grafana data source |
| Sync Wave | -1 |
**Filesystem Backend.** Loki uses the local filesystem (backed by an NFS PVC) for chunk and index storage. This avoids the need for an external object store while still providing persistence across pod restarts.
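The filesystem backend and retention settings correspond roughly to a Loki config fragment like the following (a sketch; the directory paths are assumptions, and the actual values live in the Helm chart):

```yaml
storage_config:
  filesystem:
    directory: /loki/chunks          # chunk storage on the PVC (path is illustrative)
compactor:
  working_directory: /loki/compactor # compactor scratch space (path is illustrative)
  retention_enabled: true            # the compactor enforces retention
limits_config:
  retention_period: 168h             # matches the 7-day retention above
```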
### Alloy

Alloy is Grafana's OpenTelemetry-compatible collector, deployed as a DaemonSet so that logs are collected on every node. It replaces Promtail as Grafana's recommended log collector.
| Setting | Value |
|---|---|
| Deployment | DaemonSet |
| Source | Pod logs via Kubernetes discovery |
| Destination | Loki |
| Sync Wave | 0 |
#### Alloy Pipeline Configuration
Alloy uses a pipeline-based configuration that discovers, filters, relabels, and ships logs:
```mermaid
flowchart LR
    discovery["Kubernetes\nPod Discovery"] --> relabel["Relabel\n(namespace, pod,\ncontainer, node)"]
    relabel --> logScrape["Log Scrape\n(pod stdout/stderr)"]
    logScrape --> lokiWrite["Loki Write\nEndpoint"]
    auditFile["/var/log/kubernetes/\naudit/audit.log"] --> auditProcess["JSON Parse\n(level, verb, user)"]
    auditProcess --> lokiWrite
```
- Discovery: Alloy uses the Kubernetes discovery mechanism to find all running pods on the node
- Relabeling: Extracts and attaches metadata labels:
    - `namespace` -- Kubernetes namespace
    - `pod` -- Pod name
    - `container` -- Container name
    - `node` -- Node the pod is running on
- Log scrape: Reads `stdout`/`stderr` log streams from discovered pods
- Audit log collection: Tails the kube-apiserver audit log from the host filesystem and parses JSON fields into labels (see Kubernetes Audit Logging)
- Loki write: Ships labeled log entries to Loki's push API
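The pod-log half of this pipeline can be sketched in Alloy's configuration language roughly as follows (component labels, the relabel rules shown, and the Loki service URL are assumptions, not the cluster's actual config):

```alloy
// Discover all pods running on this node
discovery.kubernetes "pods" {
  role = "pod"
}

// Attach namespace/pod metadata as Loki labels
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
}

// Tail stdout/stderr of the discovered pods
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.write.default.receiver]
}

// Ship labeled entries to Loki's push API
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"  // service URL is an assumption
  }
}
```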
## Kubernetes Audit Logging
The Kubernetes API server is configured to write structured audit events to a log file on the control plane node. Alloy tails this file and ships the events to Loki.
### Audit Policy
The audit policy (/etc/kubernetes/audit/audit-policy.yml) defines what gets logged:
| Event Category | Audit Level | Examples |
|---|---|---|
| Secret mutations | RequestResponse | create, update, delete, patch on secrets |
| RBAC changes | RequestResponse | Changes to roles, clusterroles, and bindings |
| Auth events | RequestResponse | Token reviews, certificate signing requests |
| Infrastructure mutations | RequestResponse | Namespace, node, and PV changes |
| All other mutations | Metadata | Any create, update, delete, patch |
| Read operations | Metadata | Remaining get requests |
| Noise (filtered out) | None | Health checks, watches, lists, lease heartbeats |
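The table maps onto audit policy rules roughly like this (an abbreviated sketch using the `audit.k8s.io/v1` API, not the full policy):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Secret mutations: capture full request and response bodies
  - level: RequestResponse
    verbs: ["create", "update", "delete", "patch"]
    resources:
      - group: ""
        resources: ["secrets"]
  # Health checks and similar noise: drop entirely
  - level: None
    nonResourceURLs: ["/healthz*", "/readyz*", "/livez*"]
  # Everything else: record request metadata only
  - level: Metadata
```

Rules are evaluated in order, so the narrow `RequestResponse` matches must precede the catch-all `Metadata` rule.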
### Log Rotation
Audit logs are rotated by the API server:
- Max age: 7 days
- Max backups: 3 files
- Max size: 100 MB per file
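These limits correspond to the standard kube-apiserver audit flags, as they would appear in the API server's static pod manifest:

```yaml
# kube-apiserver command-line flags (fragment of the static pod manifest)
- --audit-policy-file=/etc/kubernetes/audit/audit-policy.yml
- --audit-log-path=/var/log/kubernetes/audit/audit.log
- --audit-log-maxage=7      # days to retain old files
- --audit-log-maxbackup=3   # rotated files to keep
- --audit-log-maxsize=100   # MB per file before rotation
```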
### Querying Audit Logs
In Grafana Explore, query Loki with:
- `{job="kubernetes-audit"}` -- all audit events
- `{job="kubernetes-audit", verb="delete"}` -- all delete operations
- `{job="kubernetes-audit", user="system:serviceaccount:argocd:argocd-server"}` -- events from a specific service account
- `{job="kubernetes-audit", level="RequestResponse"} |= "secrets"` -- secret access with full request/response bodies
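LogQL also supports metric-style aggregations over these streams, which is useful for spotting unusual activity, e.g. a breakdown of audit events by verb over the last hour:

```logql
sum by (verb) (count_over_time({job="kubernetes-audit"}[1h]))
```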
## Grafana
Grafana provides a unified interface for exploring both metrics (Prometheus) and logs (Loki).
| Setting | Value |
|---|---|
| Access | grafana.homelab.local |
| Storage | Persistent PVC (nfs-client) |
| Admin Credentials | ExternalSecret (grafana-admin, synced from Vault) |
### Pre-Configured Data Sources
| Data Source | Type | Purpose |
|---|---|---|
| Prometheus | Metrics | Default metrics data source for dashboards |
| Loki | Logs | Log exploration and correlation with metrics |
Grafana is deployed with both data sources pre-configured via Helm values, eliminating manual setup after deployment.
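In kube-prometheus-stack, the Prometheus data source is provisioned automatically and extra sources go under `grafana.additionalDataSources`; a sketch of the relevant values fragment (the Loki service URL is an assumption about this cluster):

```yaml
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      url: http://loki:3100  # in-cluster service URL (assumed)
      access: proxy          # Grafana backend proxies the queries
```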
**Log Correlation.** Use Grafana's split view to correlate metric spikes with log entries: select a time range on a Prometheus dashboard panel, then switch to the Explore view with Loki to see logs from the same period.
## Capacity Planning
The cluster includes a capacity planning layer that surfaces resource right-sizing data and cluster-wide headroom visibility.
### VPA + Goldilocks
The Vertical Pod Autoscaler (VPA) recommender analyzes historical CPU and memory usage for each workload and computes right-sizing recommendations. Goldilocks automatically creates a VPA CR for every Deployment and StatefulSet in the cluster, eliminating manual VPA object management.
- VPA Recommender: Runs in `kube-system`, computes target/lower-bound/upper-bound recommendations, and exposes `vpa_status_recommendation` Prometheus metrics
- Goldilocks Controller: Watches all namespaces (except `kube-node-lease` and `kube-public`) and creates `VerticalPodAutoscaler` CRs with `updateMode: "Off"`
- Goldilocks Dashboard: Web UI for browsing per-workload recommendations (access via `kubectl port-forward`)
VPA runs in recommend-only mode -- no pods are ever mutated. Recommendations are advisory and applied through manual manifest updates.
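The CRs Goldilocks creates look roughly like this (the workload name and namespace are illustrative):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: goldilocks-grafana   # Goldilocks prefixes the CRs it owns
  namespace: monitoring
spec:
  targetRef:                 # the workload being analyzed
    apiVersion: apps/v1
    kind: Deployment
    name: grafana
  updatePolicy:
    updateMode: "Off"        # recommend only; never evict or mutate pods
```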
### Capacity Dashboards
Two custom Grafana dashboards (provisioned via sidecar ConfigMaps) complement the Goldilocks per-workload view with cluster-wide capacity data:
| Dashboard | Purpose |
|---|---|
| Cluster Capacity Overview | CPU/memory requested vs allocatable vs used (timeseries + gauges), pod count, namespace breakdown pie charts |
| Namespace Resource Breakdown | Per-namespace tables with requests, limits, usage, and efficiency %, stacked usage timeseries |
These dashboards answer "how much headroom does the cluster have?" while Goldilocks answers "what should each workload request?"
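The headroom panels boil down to PromQL over kube-state-metrics and cAdvisor series; representative queries (metric names are standard, though exact dashboard queries may differ):

```promql
# Fraction of allocatable CPU already requested, cluster-wide
sum(kube_pod_container_resource_requests{resource="cpu"})
  / sum(kube_node_status_allocatable{resource="cpu"})

# Per-namespace memory efficiency: actual usage vs. requests
sum by (namespace) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace) (kube_pod_container_resource_requests{resource="memory"})
```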
## Data Flow Summary
```mermaid
flowchart TD
    subgraph nodes["Cluster Nodes"]
        ne["Node Exporter"]
        alloyDs["Alloy DaemonSet"]
        appPods["Application Pods"]
        auditLog["API Server\nAudit Log"]
    end
    subgraph monitoring["monitoring Namespace"]
        prom["Prometheus\n15d retention\n20Gi"]
        loki["Loki\n168h retention\n10Gi"]
        grafana["Grafana"]
        am["Alertmanager"]
        ksm["kube-state-metrics"]
    end
    ne -->|"host metrics"| prom
    ksm -->|"k8s object metrics"| prom
    appPods -->|"scrape annotations"| prom
    appPods -->|"stdout/stderr"| alloyDs
    auditLog -->|"audit events"| alloyDs
    alloyDs -->|"labeled logs"| loki
    prom -->|"metrics queries"| grafana
    loki -->|"log queries"| grafana
    prom -->|"alerting rules"| am
    subgraph arr["arr Namespace"]
        exportarr["Exportarr\n(ServiceMonitors)"]
        arrApps["Sonarr, Radarr,\nProwlarr, Bazarr"]
    end
    arrApps -->|"API"| exportarr
    exportarr -->|"arr metrics"| prom
```