Observability

The observability stack lives in the observability namespace.


Applications

App                                 Purpose                            URL
Prometheus (kube-prometheus-stack)  Metrics collection + Alertmanager  internal
Grafana (grafana-operator)          Dashboards                         internal
Victoria Logs                       Log aggregation                    internal
Fluent Bit                          Log shipping to Victoria Logs
Gatus                               Uptime / endpoint monitoring       https://status.dcunha.io
Kromgo                              Prometheus badge endpoint          https://kromgo.dcunha.io
Blackbox Exporter                   HTTP/TCP probing for Gatus
KEDA                                Event-driven autoscaling
UniFi Poller                        UniFi metrics → Prometheus

Prometheus (kube-prometheus-stack)

Full kube-prometheus-stack including:

  • Prometheus server
  • Alertmanager
  • Node exporter
  • kube-state-metrics

Alertmanager

Alert routing is configured in kubernetes/components/alerts/alertmanager/. The number of active alerts is surfaced as a badge in the README.

If the Prometheus WAL is corrupted after a node crash, wipe it as follows:

# Scale down
kubectl scale -n observability statefulset prometheus-kube-prometheus-stack-prometheus --replicas=0

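# Wipe WAL only (compacted blocks are left untouched)
# Note: kubectl exec needs a running pod, and at zero replicas there is none;
# in that case run the same rm from a temporary pod that mounts the Prometheus PVC.
kubectl -n observability exec <prometheus-pod> -- rm -rf /prometheus/prometheus-db/wal/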

# Scale up
kubectl scale -n observability statefulset prometheus-kube-prometheus-stack-prometheus --replicas=1

Do NOT delete individual WAL segments: this leaves a gap in the segment numbering, and Prometheus then fails to start.
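To illustrate what a gap in the segment numbering looks like, here is a minimal sketch of a check over a WAL directory listing. It assumes segments are plain zero-padded numeric file names (00000000, 00000001, ...) and ignores everything else, such as checkpoint.* directories. wal_segments_contiguous is a hypothetical helper, not part of Prometheus or this repository.

```shell
# Hypothetical helper: reads file names (one per line) on stdin and reports
# whether the numeric WAL segment names form an unbroken sequence.
wal_segments_contiguous() {
  # Keep purely numeric names only, sort them numerically, then compare neighbours.
  grep -E '^[0-9]+$' | sort -n | awk '
    BEGIN { bad = 0 }
    NR > 1 && $1 + 0 != prev + 1 {
      printf "gap between segment %d and %d\n", prev, $1
      bad = 1
      exit
    }
    { prev = $1 + 0 }
    END { if (!bad) print "contiguous"; exit bad }
  '
}
```

While the Prometheus pod is still running, the listing could be piped in with something like: kubectl -n observability exec <prometheus-pod> -- ls /prometheus/prometheus-db/wal/ | wal_segments_contiguous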


Grafana

Deployed via the grafana-operator. The operator manages a Grafana CR with:

  • Datasources: Prometheus, Victoria Logs
  • Dashboards: imported from app-specific GrafanaDashboard resources and JSON ConfigMaps

Apps that ship dashboards (Flux, Envoy Gateway, Cloudflare Tunnel, etc.) create GrafanaDashboard resources in their own namespaces, which the operator picks up automatically.
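To see which dashboard resources exist across namespaces, a command along these lines may help. This is a sketch that assumes grafana-operator v5, where the GrafanaDashboard CRD belongs to the grafana.integreatly.org API group; list_grafana_dashboards is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: list GrafanaDashboard resources in every namespace.
# Extra arguments (for example -o name) are passed through to kubectl.
list_grafana_dashboards() {
  kubectl get grafanadashboards.grafana.integreatly.org --all-namespaces "$@"
}
```

Running list_grafana_dashboards -o wide should show one row per dashboard together with its namespace, which makes it easier to spot an app whose dashboard was never created.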


Victoria Logs

Replaces Loki for log aggregation. Fluent Bit ships logs from all pods to Victoria Logs.
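One way to confirm that logs are arriving is to query the Victoria Logs HTTP API directly. The sketch below uses the /select/logsql/query endpoint with a LogsQL time filter. The base URL is an assumption (for example a local port-forward to the default port 9428), and vlogs_query is a hypothetical helper name.

```shell
# Hypothetical helper: fetch a handful of recent log entries from Victoria Logs.
# $1: base URL of the Victoria Logs instance (assumed, e.g. http://localhost:9428)
# $2: LogsQL query, defaulting to "everything from the last 5 minutes"
vlogs_query() {
  base_url="$1"
  query="${2:-_time:5m}"
  curl --silent --fail "${base_url}/select/logsql/query" \
    --data-urlencode "query=${query}" \
    --data-urlencode "limit=5"
}
```

An empty response for vlogs_query http://localhost:9428 would suggest that Fluent Bit is not shipping anything, or that the query window is too narrow.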


Gatus

Endpoint monitoring with status badges. Endpoints are defined in kubernetes/apps/observability/gatus/app/resources/cluster-endpoints.yaml. Gatus also reads gatus.home-operations.com/endpoint annotations from HTTPRoute resources on the gateways.

Groups:

  • core — Ping, Status Page, Heartbeat (Alertmanager watchdog)
  • external — externally-accessible services (checked via 1.1.1.1 DNS)
  • internal — LAN-only services

Kromgo

Exposes Prometheus queries as shields.io-compatible badge endpoints for the README.
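As a sketch of how a README badge URL can be assembled, the helper below wraps a Kromgo endpoint in a shields.io endpoint-badge URL. It assumes each metric is served at <base URL>/<metric name>; kromgo_badge_url is a hypothetical name, and the encoder only handles the few reserved characters that such URLs normally contain.

```shell
# Hypothetical helper: build a shields.io endpoint-badge URL for a Kromgo metric.
# $1: Kromgo base URL, $2: metric name
kromgo_badge_url() {
  endpoint="$1/$2"
  # Minimal percent-encoding: covers % : / ? & = only ("%" must be handled first).
  encoded=$(printf '%s' "$endpoint" | sed \
    -e 's/%/%25/g' -e 's/:/%3A/g' -e 's,/,%2F,g' \
    -e 's/?/%3F/g' -e 's/&/%26/g' -e 's/=/%3D/g')
  printf 'https://img.shields.io/endpoint?url=%s\n' "$encoded"
}
```

For example, kromgo_badge_url https://kromgo.dcunha.io talos_version yields a URL that can be used as an image source in the README.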

Current metrics:

Metric                Query
talos_version         node_os_info{name="Talos"}
kubernetes_version    kubernetes_build_info
flux_version          flux_instance_info
cluster_node_count    count(kube_node_status_condition{condition="Ready"})
cluster_pod_count     sum(kube_pod_status_phase{phase="Running"})
cluster_cpu_usage     avg(instance:node_cpu_utilisation:rate5m) * 100
cluster_memory_usage  Node memory utilisation %
cluster_age_days      (time() - min(kube_node_created)) / 86400
cluster_uptime_days   Average node uptime
cluster_alert_count   alertmanager_alerts{state="active"} - 1 (excludes Watchdog)
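The arithmetic behind cluster_age_days and cluster_alert_count can be reproduced by hand. The sketch below uses hypothetical helper names and made-up input values. Shell arithmetic is integer-only, so the age is truncated to whole days, whereas the PromQL expression returns a fractional value.

```shell
# Hypothetical helpers mirroring the two queries with integer arithmetic.

# $1: current Unix time, $2: creation time of the oldest node (both in seconds)
cluster_age_days() {
  echo $(( ($1 - $2) / 86400 ))   # 86400 seconds per day
}

# $1: number of active alerts reported by Alertmanager
cluster_alert_count() {
  echo $(( $1 - 1 ))              # subtract the always-firing Watchdog alert
}
```

With a made-up oldest node creation time of 1700000000 and a current time of 1700864100, cluster_age_days 1700864100 1700000000 gives 10 whole days. If only the Watchdog alert is active, cluster_alert_count 1 gives 0.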

The cluster_power_usage metric is defined but disabled — it requires a UPS SNMP exporter which is not running (Eaton UPS batteries are dead).


UniFi Poller

Scrapes metrics from the UCG-Max (UniFi controller) and exposes them to Prometheus. Provides network device health, client counts, and traffic metrics in Grafana.