Introduction
The Artemis Cluster! :octocat:
... where YAML is law, Renovate never sleeps, and 2am is just debugging hours.
📖 Overview
This repository manages my homelab Kubernetes cluster built on TalosOS, following Infrastructure as Code (IaC) and GitOps practices. The setup consists of three bare-metal control plane nodes and three VM workers (including one GPU worker), with all configurations version-controlled and automatically deployed via FluxCD.
I didn't start from a cluster template — this was built from the ground up, learning as I went. Over time I've gradually aligned the structure and conventions with what the Home Operations community has collectively settled on, borrowing ideas and patterns from repos I admire rather than forking from any single starting point.
⛵ Kubernetes
Components Explained
The cluster is organized into logical namespaces for maintainability and separation of concerns:
- kube-system: The foundation layer — cluster networking (Cilium), core DNS (CoreDNS), multi-network (Multus), GPU support (intel-gpu-resource-driver), and cluster utilities (reloader, reflector, descheduler, spegel).
- network: Ingress via Envoy Gateway, DNS automation via ExternalDNS (Cloudflare + UniFi), and Cloudflare Tunnel.
- cert-manager: Automated TLS certificates via Let's Encrypt.
- observability: Full monitoring stack — Prometheus, Grafana, Victoria Logs, Fluent Bit, Gatus, Kromgo, KEDA, and UniFi Poller.
- rook-ceph / openebs-system / volsync-system: Block storage, local storage, and PVC backup/restore.
- home-automation: Home Assistant, Frigate, ESPHome, Zigbee2MQTT, Mosquitto, Matter Server, Homebridge, Node-RED.
- media: Full arr stack, Jellyfin, download clients, and supporting tooling.
- external-secrets: Secrets from 1Password Connect, plus age-encrypted bootstrap secrets.
Directories
This Git repository contains the following directories under `kubernetes`.
📁 kubernetes
├── 📁 apps
│ ├── 📁 actions-runner-system # Self-hosted GitHub runners
│ ├── 📁 cert-manager # TLS certificate management
│ ├── 📁 external-endpoints # ExternalName services for off-cluster resources
│ ├── 📁 external-secrets # 1Password Connect secrets provider
│ ├── 📁 flux-system # Flux Operator + FluxInstance
│ ├── 📁 home-automation # Home Assistant, Frigate, ESPHome, Zigbee, etc.
│ ├── 📁 kube-system # Cilium, CoreDNS, Multus, GPU driver, utilities
│ ├── 📁 media # Arr stack, Jellyfin, download clients
│ ├── 📁 network # Envoy Gateway, ExternalDNS, Cloudflare Tunnel
│ ├── 📁 observability # Prometheus, Grafana, Victoria Logs, Gatus, Kromgo
│ ├── 📁 openebs-system # Local storage provisioner
│ ├── 📁 rook-ceph # Distributed block storage
│ ├── 📁 system-upgrade # Tuppr (Talos/K8s automated upgrades)
│ └── 📁 volsync-system # PVC backup/restore (Kopia)
├── 📁 components # Reusable Kustomize components
└── 📁 flux # Flux sync entrypoint → kubernetes/apps
🔧 Hardware
| Device | Count | Disk | RAM | OS | Purpose |
|---|---|---|---|---|---|
| Lenovo M710q (talos-cp-01/02/03) | 3 | 256 GB NVMe (boot) + 256 GB SATA SSD (Ceph OSD) | 16 GB | Talos Linux | Kubernetes Control Plane |
| Proxmox VM on pantheon (talos-w-01/02) | 2 | Virtualized | 32 GB | Talos Linux | Kubernetes Worker |
| Proxmox VM on pantheon (talos-gpu-01) | 1 | Virtualized | 32 GB | Talos Linux | Kubernetes GPU Worker (ASRock Arc A380 6 GB passthrough) |
| HPE ML150 G9 (pantheon) | 1 | T-FORCE 1 TB SSD | 192 GB | Proxmox | Virtualization Host |
| Supermicro (atlas) | 1 | 3× RAIDZ2 6-wide (~41 TB usable) | 94.3 GB ECC | TrueNAS SCALE | NAS / Media Storage |
🌐 Networking
| Device | Role |
|---|---|
| UniFi Cloud Gateway Max | WAN/NAT, L3 gateway, DHCP, BGP (FRR), DNS, UniFi controller |
| Mikrotik CRS309-1G-8S+ | L2 switch only — downstream of UCG-Max on VLAN 1099 (LAB) |
| UniFi US-48 PoE 500W | L2 switch (upstream: UCG-Max) |
| UniFi US-16 PoE 150W | L2 switch (upstream: US-48) |
Kubernetes nodes run on VLAN 1099 (LAB, 10.10.99.0/24). Home-automation pods attach a secondary interface to VLAN 1152 (IOT, 10.10.152.0/24) via Multus for direct device access (Frigate, Home Assistant, Zigbee2MQTT).
BGP peers between UCG-Max (AS 64533) and all six Talos nodes distribute LoadBalancer service IPs into the LAB routing table.
🤝 Acknowledgments
A huge thanks to the following people whose work has been an invaluable reference:
- onedr0p/home-ops
- bjw-s-labs/home-ops
- joryirving/home-ops
- Christian Lempa — whose YouTube content helped demystify a lot of the early infrastructure concepts
- TechnoTim — for countless practical homelab guides that made the learning curve far less steep
And to the broader Home Operations Discord community — thanks to everyone openly sharing their setups and knowledge.
📝 License
This repository is available under the WTFPL. See LICENSE for details.
Nodes
Control Planes
Three bare-metal Lenovo M710q mini PCs running Talos Linux as Kubernetes control plane nodes. Workloads are permitted to schedule on control planes (allowSchedulingOnControlPlanes: true).
| Hostname | IP | Boot Disk | Ceph OSD |
|---|---|---|---|
| talos-cp-01 | 10.10.99.101 | 256 GB NVMe (Samsung MZVLW256) | 256 GB SATA SSD |
| talos-cp-02 | 10.10.99.102 | 256 GB NVMe (Samsung MZVLW256) | 256 GB SATA SSD |
| talos-cp-03 | 10.10.99.103 | 256 GB NVMe (Samsung MZVLW256) | 256 GB SATA SSD |
- RAM: 16 GB each
- Network: Physical NIC (MAC prefix `6c:4b:90`), bonded as `bond0`, VLAN 1099 (LAB) tagged as `bond0.1099`
- Secure Boot: Enabled
- Talos schematic extensions: `i915`, `intel-ucode`, `mei`, `nfsrahead`, `util-linux-tools`
Workers
Three Talos Linux VMs on Proxmox host pantheon (HPE ML150 G9).
| Hostname | IP | VM ID | Disk | GPU |
|---|---|---|---|---|
| talos-w-01 | 10.10.99.201 | 101 | /dev/sda (virtualized) | — |
| talos-w-02 | 10.10.99.202 | 102 | /dev/sda (virtualized) | — |
| talos-gpu-01 | 10.10.99.203 | 104 | /dev/sda (virtualized) | ASRock Arc A380 6 GB (passthrough) |
- RAM: 32 GB each
- vCPUs: 6 (sockets=1, cores=6, NUMA enabled)
- Network: QEMU NIC (MAC prefix `bc:24:11`), bonded as `bond0`, VLAN 1099 LAB (`bond0.1099`) + VLAN 1152 IOT (`bond0.1152`)
- Secure Boot: Enabled (UKI cmdline via `grubUseUKICmdline: true`)
- Talos schematic extensions: `i915`, `intel-ucode`, `mei`, `nfsrahead`, `qemu-guest-agent`, `util-linux-tools`
- GPU schematic adds: `intel_iommu=on`, `iommu=pt`, `i915.enable_guc=3`, `pcie_aspm=off`
GPU Worker Notes
- Small BAR detected — HPE ML150 G9 does not support Resizable BAR (ReBAR). This is not fixable at the firmware level. VAAPI transcoding is unaffected.
- xpu-smi / Level Zero error `zeInit: 78000001` — the Level Zero compute API is unavailable inside VMs (expected). VAAPI/DRM still works correctly.
- `model: Unknown`, `memory: "0"` in ResourceSlice — cosmetic result of the xpu-smi failure above; no functional impact.
Proxmox Host (pantheon)
The virtualization host for all three worker VMs.
| Field | Value |
|---|---|
| Hostname | pantheon |
| IP | 10.10.99.104 |
| Hardware | HPE ML150 G9 |
| CPU | 2× Intel Xeon E5-2620 v3 (12 c/24 t total) |
| RAM | 192 GB |
| OS | Proxmox VE (Debian Trixie) |
| Boot disk | T-FORCE 1 TB SSD |
SSH access: root@10.10.99.104
Worker VMs are managed via the Proxmox web UI or CLI (qm). The Arc A380 GPU is passed through to talos-gpu-01 (VM 104) via VFIO.
VM Management Quick Reference
# List VMs
qm list
# Start/stop a VM
qm start 104
qm stop 104
# Hard reset (use when talosctl reboot hangs)
qm reset 104
# Console access
qm terminal 101
Storage
The cluster uses three distinct storage tiers: distributed block storage (Rook-Ceph), local host-path storage (OpenEBS), and network-attached bulk storage (TrueNAS).
Rook-Ceph (Block Storage)
Three OSDs — one per control plane node — provide replicated block storage for stateful apps.
| Node | OSD Device |
|---|---|
| talos-cp-01 | 256 GB SATA SSD |
| talos-cp-02 | 256 GB SATA SSD |
| talos-cp-03 | 256 GB SATA SSD |

Usable capacity: ~85 GB (3× replication).
- Failure domain: host
- Default StorageClass: `ceph-blockpool` (RWO, replicated ×3, volume expansion enabled)
- Filesystem StorageClass: `ceph-filesystem` (RWX, CephFS)
- `useAllNodes: false` — nodes are explicitly listed; do not change to `useAllNodes: true`
- `pg_autoscaler` is enabled but capped at `mon_max_pg_per_osd=250`
Rule: Rook-Ceph block storage is for app config, databases, and PVCs that need replication. Bulk media lives on TrueNAS NFS — never on Ceph.
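For reference, a PVC that should land on Ceph simply requests the default StorageClass. A minimal sketch (app name and size are illustrative, not from the repo):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-config   # illustrative
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteOnce     # RWO, per the default StorageClass
  storageClassName: ceph-blockpool
  resources:
    requests:
      storage: 5Gi
```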
OpenEBS (Local Storage)
OpenEBS provides hostpath local PVCs for workloads that need fast local storage without replication. Uses a bind mount at /var/local/openebs (configured in the Talos kubelet extraMounts).
- StorageClass: `openebs-hostpath`
- Used for: scratch space, cache, temporary data
- No replication — data is lost if the node is destroyed
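A hostpath claim looks the same as any other PVC but names the `openebs-hostpath` class; a sketch (names and size illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-cache    # illustrative
  namespace: my-namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: openebs-hostpath
  resources:
    requests:
      storage: 2Gi
```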
TrueNAS (atlas)
The NAS hosts all bulk media and is the backing store for Jellyfin, the arr stack, and download clients.
| Field | Value |
|---|---|
| Hostname | atlas |
| IP | 10.10.99.100 |
| Hardware | Supermicro, Xeon E5-2643 v0 |
| RAM | 94.3 GB ECC |
| OS | TrueNAS SCALE |
| Pool | 3× RAIDZ2 6-wide of 3.49 TB drives + 1 TB mirror metadata vdev (~41 TB usable) |
NFS Mount
The export /mnt/atlas/media is mounted into pods at /media.
# Example NFS PersistentVolume (name and size are illustrative)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: media
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.10.99.100
    path: /mnt/atlas/media
NFS version 4.2 is enforced cluster-wide via /etc/nfsmount.conf on all Talos nodes:
[ NFSMount_Global_Options ]
nfsvers=4.2
hard=True
noatime=True
SMB
SMB shares use force user = apps / force group = apps (UID/GID 1000) for read/write access from management machines.
VolSync (Backup & Restore)
VolSync provides automated PVC backup and restore using Kopia as the backend. Backups are stored in an S3-compatible bucket, and VolSync itself runs in the `volsync-system` namespace.
See the VolSync operations runbook for backup and restore procedures.
VLANs & Routing
Network Devices
| Device | Role |
|---|---|
| UniFi Cloud Gateway Max (UCG-Max) | WAN/NAT, L3 gateway for all VLANs, DHCP server, BGP (FRR), DNS, UniFi controller |
| Mikrotik CRS309-1G-8S+ | L2 switch only — no routing, no BGP, no IPs |
| UniFi US-48 PoE 500W | L2 switch (upstream: UCG-Max port 4) |
| UniFi US-16 PoE 150W | L2 switch (upstream: US-48 port 13) |
The UCG-Max replaced pfSense as the network gateway. The Mikrotik is now a pure L2 switch downstream of the UCG-Max on VLAN 1099 (LAB).
VLANs
| Name | VLAN ID | Subnet | Gateway | DHCP Range | Purpose |
|---|---|---|---|---|---|
| LAN | 1 | 192.168.1.0/24 | 192.168.1.1 | .50–.200 | Legacy/default |
| HME | 1001 | 10.10.1.0/24 | 10.10.1.1 | .50–.200 | Trusted home users |
| TST | 1088 | 192.168.88.0/24 | 192.168.88.1 | .50–.200 | Testing |
| LAB | 1099 | 10.10.99.0/24 | 10.10.99.1 | .50–.70 | Servers, K8s nodes |
| GST | 1151 | 10.10.151.0/24 | 10.10.151.1 | .50–.200 | Guest |
| IOT | 1152 | 10.10.152.0/24 | 10.10.152.1 | .50–.200 | IoT devices |
| TRANSIT | 99 | 172.16.99.0/30 | — | None | UCG-Max ↔ Mikrotik link |
Key Static IPs (LAB — 10.10.99.0/24)
| Host | IP | Notes |
|---|---|---|
| UCG-Max | 10.10.99.1 | Gateway, DNS, BGP peer |
| talos-cp-01 | 10.10.99.101 | Control plane |
| talos-cp-02 | 10.10.99.102 | Control plane |
| talos-cp-03 | 10.10.99.103 | Control plane |
| pantheon | 10.10.99.104 | Proxmox host |
| talos-w-01 | 10.10.99.201 | Worker |
| talos-w-02 | 10.10.99.202 | Worker |
| talos-gpu-01 | 10.10.99.203 | GPU worker |
| atlas (TrueNAS) | 10.10.99.100 | NFS: /mnt/atlas/media |
| kube-api VIP | 10.10.99.99 | Kubernetes API server (L2 via Cilium) |
| Internal gateway | 10.10.99.98 | Envoy internal-gateway LoadBalancer IP |
| External gateway | 10.10.99.97 | Envoy external-gateway LoadBalancer IP |
| LB pool | 10.10.99.71–.96 | Available for additional LoadBalancer services |
Multi-Network (Multus + IOT VLAN)
Home-automation pods (Frigate, Home Assistant, Zigbee2MQTT, etc.) attach a secondary interface to VLAN 1152 (IOT) via Multus. This gives them a direct L2 presence on the IOT network for device discovery and communication without going through NAT.
The Multus NetworkAttachmentDefinition for IOT is defined in kubernetes/apps/kube-system/multus/networks/iot.yaml.
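The definition follows the standard Multus shape; a sketch, where the CNI plugin (`macvlan`), bridge mode, and DHCP IPAM are assumptions rather than values confirmed from `iot.yaml`:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: iot
  namespace: kube-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "bond0.1152",
      "mode": "bridge",
      "ipam": { "type": "dhcp" }
    }
```

Pods then request the attachment with the `k8s.v1.cni.cncf.io/networks` annotation.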
UCG-Max Management
# SSH
ssh root@10.10.99.1
# BGP status
vtysh -c 'show bgp summary'
# UniFi admin
# Web UI: https://10.10.99.1 (or unifi.ui.com)
# MongoDB (for advanced debugging)
mongo --port 27117 ace
BGP
Cilium BGP distributes Kubernetes LoadBalancer service IPs into the LAB routing table. This replaces L2 announcements for the LB IP pool — devices on other VLANs (HME, LAN) reach LoadBalancer IPs by routing through the UCG-Max, which learns the routes via BGP.
Architecture
UCG-Max (AS 64533)
├── peer: talos-cp-01 10.10.99.101
├── peer: talos-cp-02 10.10.99.102
├── peer: talos-cp-03 10.10.99.103
├── peer: talos-w-01 10.10.99.201
├── peer: talos-w-02 10.10.99.202
└── peer: talos-gpu-01 10.10.99.203
All 6 nodes peer with the UCG-Max. Nodes advertise the LB IP pool (10.10.99.71–10.10.99.99) via BGP. The UCG-Max installs these routes and distributes them to other VLANs.
Cilium BGP Configuration
BGP is configured via CiliumBGPClusterConfig and CiliumBGPPeerConfig resources in kubernetes/apps/kube-system/cilium/app/networking.yaml. Nodes participating in BGP must have the label bgppolicy: enabled, which is applied to all nodes via the Talos node config.
Known Behaviour
Devices on the LAB subnet (10.10.99.0/24) cannot reach LB IPs directly. The LB IP pool is within the LAB subnet range but the UCG-Max does not L2-proxy ARP for these addresses. Devices on HME, LAN, and other VLANs route through the UCG-Max and work fine.
This is an intentional BGP-only design.
Checking BGP Status
# On UCG-Max
ssh root@10.10.99.1
vtysh -c 'show bgp summary'
vtysh -c 'show ip route bgp'
# From a cluster node
kubectl -n kube-system exec ds/cilium -- cilium bgp peers
kubectl -n kube-system exec ds/cilium -- cilium bgp routes
Adding a New LoadBalancer IP
Add the IP to the CiliumLoadBalancerIPPool resource in the Cilium networking manifest. Cilium will advertise it to all BGP peers automatically once a service claims it.
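The pool itself is a small CRD; a sketch using the documented LB range (field names per Cilium's `v2alpha1` API, pool name illustrative):

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: lab-pool        # illustrative name
spec:
  blocks:
    - start: 10.10.99.71
      stop: 10.10.99.96
```

A specific service can pin an address from the pool with the `lbipam.cilium.io/ips` annotation, as the gateways do.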
DNS & Split-Horizon
The cluster uses split-horizon DNS so that services resolve to internal IPs from inside the home network and are accessible externally via Cloudflare.
Architecture
External clients
└── Cloudflare DNS (proxied) → Cloudflare Tunnel → external-gateway pod (10.10.99.97)
Internal clients (HME, LAN, LAB)
└── UCG-Max DNS → internal A record → internal-gateway (10.10.99.98)
OR
→ external A record → external-gateway (10.10.99.97)
UCG-Max DNS
The UCG-Max acts as the recursive DNS resolver for all VLANs. It is configured with:
- Split-horizon domain: `dcunha.io` — internal A records override Cloudflare for this domain
- Upstream forwarders: Cloudflare (`1.1.1.1`, `1.0.0.1`)
Two ExternalDNS instances write records automatically:
| Controller | Provider | Watches | Writes |
|---|---|---|---|
| external-dns-unifi | UCG-Max webhook (kashalls) | internal-gateway HTTPRoutes | A records → 10.10.99.98 |
| external-dns-cloudflare | Cloudflare API | external-gateway HTTPRoutes | CNAME → external.dcunha.io, proxied |
Two Gateways
| Gateway | IP | Purpose |
|---|---|---|
| external-gateway | 10.10.99.97 | Internet-facing services, Cloudflare Tunnel entry point |
| internal-gateway | 10.10.99.98 | LAN-only services (Grafana, Prometheus, etc.) |
The gateways are annotated with external-dns.alpha.kubernetes.io/target:
- `external-gateway` → target `external.dcunha.io` — Cloudflare DNS record + `lbipam.cilium.io/ips: 10.10.99.97`
- `internal-gateway` → target `internal.dcunha.io` — UCG-Max DNS record + `lbipam.cilium.io/ips: 10.10.99.98`
Cloudflare Tunnel
External traffic flows through Cloudflare Zero Trust Network Access rather than a port-forwarded IP:
1. Cloudflare receives a request for `*.dcunha.io`
2. The tunnel routes it to the `cloudflare-tunnel` pod (2 replicas, PodDisruptionBudget min 1)
3. The pod forwards directly to `https://external-gateway.network.svc.cluster.local:443`
4. Envoy routes to the matching HTTPRoute
The tunnel bypasses the external gateway's LoadBalancer IP entirely for inbound public traffic. The external-gateway IP (10.10.99.97) is still used for internal split-horizon access to externally-annotated services.
Tunnel config (kubernetes/apps/network/cloudflare-tunnel/app/helmrelease.yaml):
ingress:
  - hostname: "*.dcunha.io"
    originRequest:
      http2Origin: true
      originServerName: external.dcunha.io
    service: https://external-gateway.network.svc.cluster.local:443
  - service: http_status:404
Deploying a New Service
Internal only (LAN-accessible, no internet)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: my-namespace
spec:
  parentRefs:
    - name: internal-gateway
      namespace: network
  hostnames:
    - my-app.dcunha.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: my-app
          port: 8080
external-dns-unifi auto-detects this and writes my-app.dcunha.io → 10.10.99.98 to UCG-Max DNS.
External (internet-accessible via Cloudflare Tunnel)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: my-namespace
spec:
  parentRefs:
    - name: external-gateway
      namespace: network
  hostnames:
    - my-app.dcunha.io
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: my-app
          port: 8080
external-dns-cloudflare auto-detects this and writes a proxied CNAME in Cloudflare.
Troubleshooting
| Symptom | Check |
|---|---|
| Name not resolving from home network | dig my-app.dcunha.io @10.10.99.1 — check UCG-Max has the A record |
| Name not resolving externally | Check Cloudflare DNS dashboard for the CNAME record |
| `kubectl get httproute -A` shows no routes | Check the parentRefs gateway name and namespace |
| ExternalDNS not writing records | kubectl logs -n network deploy/external-dns-unifi or deploy/external-dns-cloudflare |
Config Management
Talos machine configs are managed via a render-config workflow using MiniJinja templates and talosctl machineconfig patch. This replaced the previous talhelper approach.
File Layout
talos/
├── machineconfig.yaml.j2 # Base machine config template (shared by all nodes)
├── nodes/
│ ├── talos-cp-01.yaml.j2 # Per-node patch + type declaration
│ ├── talos-cp-02.yaml.j2
│ ├── talos-cp-03.yaml.j2
│ ├── talos-w-01.yaml.j2
│ ├── talos-w-02.yaml.j2
│ └── talos-gpu-01.yaml.j2
├── schematics/
│ ├── controlplane.yaml # Schematic for CP nodes
│ ├── worker.yaml # Schematic for standard workers
│ └── gpu.yaml # Schematic for GPU worker
├── talconfig.yaml # Node inventory/reference (not used for config generation)
└── mod.just # Just tasks
How It Works
1. Template rendering
machineconfig.yaml.j2 is a MiniJinja template for the base config shared by all nodes. The IS_CONTROLLER environment variable controls control-plane-specific blocks (etcd CA keys, API server config, kubernetesTalosAPIAccess, etc.).
Per-node patches (nodes/<node>.yaml.j2) declare the machine.type, install.image, install.disk, node labels, and hostname.
2. Rendering a config
just talos render-config talos-cp-01
This sets IS_CONTROLLER by inspecting the node's type, renders machineconfig.yaml.j2, then patches it with nodes/talos-cp-01.yaml.j2.
3. Applying a config
just talos apply-node talos-cp-01
Pipes render-config directly into talosctl apply-config. No intermediate files are written to disk.
For initial (insecure) apply during bootstrap:
just talos apply-node talos-cp-01 --insecure
For a change that requires a reboot:
just talos apply-node talos-w-01 --mode=reboot
Schematics
Schematics define the kernel args and system extensions for each node type. They are submitted to factory.talos.dev to generate a unique schematic ID, which becomes the installer image URL.
| Schematic | Extensions | Extra kernel args |
|---|---|---|
| controlplane | i915, intel-ucode, mei, nfsrahead, util-linux-tools | lockdown=integrity, mitigations=off |
| worker | + qemu-guest-agent | same as controlplane |
| gpu | + qemu-guest-agent | + intel_iommu=on, iommu=pt, i915.enable_guc=3, pcie_aspm=off |
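A schematic file is plain YAML submitted to the image factory. The `controlplane` variant would look roughly like this (extension paths assume the usual `siderolabs/` prefix; the authoritative file is `talos/schematics/controlplane.yaml`):

```yaml
customization:
  extraKernelArgs:
    - lockdown=integrity
    - mitigations=off
  systemExtensions:
    officialExtensions:
      - siderolabs/i915
      - siderolabs/intel-ucode
      - siderolabs/mei
      - siderolabs/nfsrahead
      - siderolabs/util-linux-tools
```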
Generating a schematic ID
just talos gen-schematic-id controlplane
# → 4ba058235b9a91962983fdb0a4e04979567495c7dea6dd5ec3f7d1e337f8ee7b
Downloading an image
just talos download-image v1.12.6 controlplane
Downloads a secureboot ISO to talos/talos-v1.12.6-controlplane.iso.
Updating extensions on a node
1. Edit the relevant schematic file in `talos/schematics/`
2. Run `just talos gen-schematic-id <schematic>` to get the new hash
3. Update `machine.install.image` in the node's `.yaml.j2` file
4. Apply and reboot: `just talos apply-node <node> --mode=reboot`
Secrets
All sensitive values in machineconfig.yaml.j2 use 1Password op:// references (e.g. op://kubernetes/talos/MACHINE_TOKEN). These are resolved at render time by op CLI before the config is applied.
No SOPS encryption is used for Talos configs.
talconfig.yaml
talconfig.yaml is retained as a human-readable node inventory (IPs, disk selectors, types). It is not used for config generation — the genconfig task was removed. Treat it as documentation.
Bootstrap
Full procedure to bring up the cluster from scratch. Run just from the repo root — the bootstrap mod.just orchestrates all stages.
Prerequisites
- All 6 nodes booted into Talos maintenance mode (USB or netboot)
- `talosctl`, `kubectl`, `helm`, `helmfile`, `op` (1Password CLI) all installed and in PATH
- Active 1Password session (`op signin`)
- Talosconfig pointed at the control plane nodes
Stage Overview
The default bootstrap target runs all stages in order:
just (runs bootstrap/mod.just default)
talos → apply Talos config to all nodes
kube → bootstrap Kubernetes (etcd init)
kubeconfig → fetch kubeconfig via node IP
wait → wait for nodes to become not-ready (CNI not installed yet)
namespaces → create all app namespaces from kubernetes/apps/
resources → apply bootstrap secrets (1Password Connect, Cloudflare Tunnel ID)
crds → apply CRDs via helmfile (00-crds.yaml)
apps → install bootstrap apps via helmfile (01-apps.yaml)
kubeconfig → re-fetch kubeconfig now using Cilium LB
You can run any stage individually:
just bootstrap talos
just bootstrap kube
just bootstrap apps
# etc.
Stage Details
talos — Apply Talos Config
Iterates all nodes from talosctl config info and applies the rendered config. Skips nodes that are already configured (detects "certificate required" error).
just bootstrap talos
# or apply a single node
just talos apply-node talos-cp-01 --insecure
Use `--insecure` for nodes that have never been configured (no client cert yet).
kube — Bootstrap Kubernetes
Runs talosctl bootstrap on the first control plane. Retries until etcd reports AlreadyExists (idempotent).
just bootstrap kube
kubeconfig — Fetch Kubeconfig
Fetches kubeconfig from the control plane and saves it to the repo root. Run twice — once early (using node IP) and once after Cilium is running (using LB VIP).
just bootstrap kubeconfig
namespaces — Create Namespaces
Extracts Namespace resources from each app directory's kustomization and applies them with --server-side. This ensures namespaces exist before Flux tries to deploy into them.
resources — Bootstrap Secrets
Renders bootstrap/resources.yaml.j2 via the op CLI to resolve op:// references and applies the result. This creates:
- `onepassword-connect-credentials-secret` in `external-secrets` (1Password Connect JSON credentials)
- `onepassword-connect-vault-secret` in `external-secrets` (Connect API token)
- `cloudflare-tunnel-id-secret` in `network` (Cloudflare Tunnel ID)
These secrets must exist before the helmfile apps can start.
crds — Install CRDs
Applies CRDs from bootstrap/helmfile.d/00-crds.yaml using helmfile template | kubectl apply. This pre-installs CRDs for:
- `cloudflare-dns` (ExternalDNS)
- `envoy-gateway`
- `grafana-operator`
- `keda`
- `kube-prometheus-stack`
apps — Install Bootstrap Apps
Runs helmfile sync on bootstrap/helmfile.d/01-apps.yaml. Install order (respecting needs: dependencies):
cilium
→ coredns
→ spegel
→ cert-manager
→ external-secrets
→ onepassword-connect (+ ClusterSecretStore)
→ flux-operator
→ flux-instance (starts Flux GitOps sync)
Once flux-instance is installed, Flux takes over and reconciles kubernetes/apps/.
Post-Bootstrap Verification
# Nodes ready
kubectl get nodes -o wide
# All system pods running
kubectl get pods -n kube-system
kubectl get pods -n flux-system
# Flux reconciling
flux get kustomizations
# Cilium healthy
kubectl -n kube-system exec ds/cilium -- cilium status --brief
# BGP peers established
kubectl -n kube-system exec ds/cilium -- cilium bgp peers
API Server Endpoint
The Kubernetes API server is accessed via:
- VIP: `https://10.10.99.99:6443` (L2 via Cilium, active once Cilium is running)
- DNS: `https://artemis.dcunha.io:6443` (resolves to 10.10.99.99 via split-horizon)
- KubePrism (local proxy): `127.0.0.1:7445` on each node (used by Cilium internally)
certSANs include `127.0.0.1`, `10.10.99.99`, and `artemis.dcunha.io`.
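In Talos terms the SAN list maps to the API server config; a sketch of the standard field (not copied from the repo's template):

```yaml
cluster:
  apiServer:
    certSANs:
      - 127.0.0.1
      - 10.10.99.99
      - artemis.dcunha.io
```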
Upgrades (tuppr)
tuppr automates Talos and Kubernetes upgrades via GitOps. It is deployed in the system-upgrade namespace and managed by Flux.
How It Works
tuppr watches TalosUpgrade and KubernetesUpgrade CRDs. When Renovate bumps the version in those resources and Flux reconciles, tuppr performs the upgrade automatically — draining nodes, upgrading, and continuing without manual intervention.
Renovate picks up new versions via # renovate: annotations on the CRDs.
Current Versions
Managed in kubernetes/apps/system-upgrade/tuppr/upgrades/:
| Resource | Kind | Current Version |
|---|---|---|
| talos | TalosUpgrade | v1.12.6 |
| kubernetes | KubernetesUpgrade | v1.35.4 |
TalosUpgrade
The TalosUpgrade resource specifies the installer image per node schematic. Workers and GPU nodes use a different schematic hash than control planes (different extensions).
Renovate manages the version via a datasource=docker annotation pointing to ghcr.io/siderolabs/installer.
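The annotation is the usual Renovate comment-above-a-version pattern; sketched below (the surrounding TalosUpgrade fields are illustrative, not the exact tuppr schema):

```yaml
spec:
  talos:
    # renovate: datasource=docker depName=ghcr.io/siderolabs/installer
    version: v1.12.6
```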
Before upgrading Talos, verify the schematic IDs are still valid:
just talos gen-schematic-id controlplane
just talos gen-schematic-id worker
just talos gen-schematic-id gpu
If the schematic hashes change (e.g. after adding extensions), update the node .yaml.j2 files and re-apply before triggering a tuppr upgrade.
KubernetesUpgrade
The KubernetesUpgrade resource specifies the target Kubernetes version. tuppr runs talosctl upgrade-k8s internally.
Manual Upgrade (without tuppr)
If you need to upgrade outside of tuppr:
# Upgrade Talos on a single node
just talos upgrade-node talos-cp-01
# Upgrade Kubernetes
just talos upgrade-k8s v1.36.0
upgrade-node reads the install image from the node's .yaml.j2 file automatically.
Prometheus Alerts
tuppr ships PrometheusRules for upgrade job status:
- `tuppr.talosupgrade` — TalosUpgrade job failures
- `tuppr.kubernetesupgrade` — KubernetesUpgrade job failures
- `tuppr.jobs` — generic job failure alert
KubernetesTalosAPIAccess
The system-upgrade namespace is granted os:admin access to the Talos API via kubernetesTalosAPIAccess on all control plane nodes. This allows tuppr to call talosctl against nodes from within the cluster.
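In the Talos machine config this corresponds to a `machine.features` block shaped roughly like:

```yaml
machine:
  features:
    kubernetesTalosAPIAccess:
      enabled: true
      allowedRoles:
        - os:admin
      allowedKubernetesNamespaces:
        - system-upgrade
```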
Node Reset
Procedures for resetting individual nodes or the entire cluster.
Reset a Single Node
Use just talos reset-node — it prompts for confirmation before executing.
just talos reset-node talos-w-01
This runs talosctl reset --system-labels-to-wipe STATE --system-labels-to-wipe EPHEMERAL --graceful=false. Only STATE and EPHEMERAL partitions are wiped; the OS installation remains. The node reboots into a clean state and can be re-configured with apply-node.
For worker VMs that hang during reboot (kernel RBD stall), hard-reset via Proxmox:
qm reset <vmid>
Full Cluster Reset
Wipes all nodes completely (OS disk included). Use this only when rebuilding from scratch.
Step 1: Reset all nodes
# Control planes
talosctl -n 10.10.99.101 reset --graceful=false --reboot
talosctl -n 10.10.99.102 reset --graceful=false --reboot
talosctl -n 10.10.99.103 reset --graceful=false --reboot
# Workers
talosctl -n 10.10.99.201 reset --graceful=false --reboot
talosctl -n 10.10.99.202 reset --graceful=false --reboot
talosctl -n 10.10.99.203 reset --graceful=false --reboot
After reset, each node's disk is completely wiped. Nodes will reboot but cannot boot from disk.
Step 2: Boot nodes from Talos USB/ISO
Each node must boot from a Talos installation ISO to get back to maintenance mode. Download the correct image for each schematic:
just talos download-image v1.12.6 controlplane # for CPs
just talos download-image v1.12.6 worker # for workers
just talos download-image v1.12.6 gpu # for talos-gpu-01
Flash to USB and boot each node. For worker VMs on Proxmox, attach the ISO in the VM's CD drive and set boot order to CD first.
Step 3: Re-bootstrap
Once all nodes are in maintenance mode, run the full bootstrap:
just
See Bootstrap for full stage details.
Rebooting a Node
just talos reboot-node talos-cp-01
Uses powercycle mode (graceful shutdown + power cycle) with a confirmation prompt.
Shutting Down a Node
just talos shutdown-node talos-cp-01
Health Check
just talos check-cluster-health
# Or directly
talosctl health --nodes 10.10.99.101
Dashboard
just talos open-dashboard
Opens an interactive Talos dashboard for the first control plane node.
Flux & GitOps
The cluster is managed entirely via GitOps using Flux Operator + FluxInstance.
Architecture
flux-operator → manages Flux lifecycle (install, upgrade, health)
└── flux-instance → defines sync config (repo, branch, path)
└── GitRepository (flux-system/flux-system) → github.com/Exikle/Artemis-Cluster
└── Kustomization (artemis-cluster) → ./kubernetes/apps
└── per-namespace Kustomizations → HelmReleases
Sync entrypoint: kubernetes/flux/sync/cluster.yaml — one root Kustomization pointing to kubernetes/apps, syncing every hour.
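That root Kustomization is a standard Flux resource; a minimal sketch consistent with the sync settings above (`prune` is an assumption; the authoritative file is `kubernetes/flux/sync/cluster.yaml`):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: artemis-cluster
  namespace: flux-system
spec:
  interval: 1h
  path: ./kubernetes/apps
  prune: true   # assumed
  sourceRef:
    kind: GitRepository
    name: flux-system
```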
Key Behaviours
All child Kustomizations and HelmReleases inherit these defaults (patched by the root Kustomization):
- CRD strategy: `CreateReplace` on install and upgrade
- Upgrade remediation: retry 2×, remediate last failure
- Rollback: `cleanupOnFail: true`, `recreate: true`
- Deletion policy: `WaitForTermination`
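On a HelmRelease these defaults correspond to standard Flux v2 fields; a sketch (not copied from the repo's patch):

```yaml
spec:
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
    cleanupOnFail: true
    remediation:
      retries: 2
      strategy: rollback
  rollback:
    recreate: true
    cleanupOnFail: true
```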
The `flux-system` Kustomization has `prune: false` — Flux will never delete itself.
Repo Structure
kubernetes/
├── apps/ # All namespaced app resources
│ ├── <namespace>/
│ │ ├── <app>/
│ │ │ ├── ks.yaml # Flux Kustomization
│ │ │ └── app/ # HelmRelease, secrets, config
│ │ └── kustomization.yaml
│ └── kustomization.yaml
├── components/ # Shared Kustomize components
│ ├── alerts/ # Alertmanager + GitHub status providers
│ ├── nfs-scaler/ # KEDA ScaledObject for NFS
│ └── volsync/ # VolSync PVC/ReplicationSource templates
└── flux/
└── sync/
├── cluster.yaml # Root Kustomization
└── kustomization.yaml
Upgrading Flux
Change the version in flux-operator or flux-instance HelmRelease — the operator handles the rolling update. Renovate manages version bumps automatically.
Flux CLI Quick Reference
# Check all Kustomizations
flux get kustomizations -A
# Check all HelmReleases
flux get helmreleases -A
# Force reconcile a specific app
flux reconcile kustomization <name> -n flux-system --with-source
# Force reconcile all
flux reconcile source git flux-system
# Suspend a HelmRelease (stop auto-sync)
flux suspend helmrelease <name> -n <namespace>
# Resume
flux resume helmrelease <name> -n <namespace>
# Check events
kubectl get events -n flux-system --sort-by='.lastTimestamp'
Self-hosted GitHub Runners (actions-runner-system)
The actions-runner-controller runs self-hosted GitHub Actions runners in the cluster, used for Renovate automation workflows. Managed by the runner HelmRelease in kubernetes/apps/actions-runner-system/.
Secrets
All secrets are managed via External Secrets Operator (ESO) backed by 1Password Connect. There is no SOPS encryption at runtime.
Architecture
1Password vault ("kubernetes")
└── 1Password Connect server (in-cluster, external-secrets namespace)
└── ClusterSecretStore "onepassword-connect"
└── ExternalSecret resources → Kubernetes Secrets
Components
| Component | Namespace | Purpose |
|---|---|---|
| external-secrets | external-secrets | ESO operator |
| onepassword-connect | external-secrets | 1Password Connect server |
| ClusterSecretStore/onepassword-connect | cluster-scoped | Provider config pointing to Connect |
ClusterSecretStore
The onepassword-connect ClusterSecretStore is the single provider used by all ExternalSecrets in the cluster. It connects to the in-cluster Connect server:
spec:
provider:
onepassword:
connectHost: http://onepassword-connect.external-secrets.svc.cluster.local
vaults:
kubernetes: 1
Bootstrap Secrets
Three secrets must exist before ESO or 1Password Connect are installed. They are created by the `just bootstrap` resources stage from `bootstrap/resources.yaml.j2` (rendered with the `op` CLI):
| Secret | Namespace | Contains |
|---|---|---|
| `onepassword-connect-credentials-secret` | `external-secrets` | `1password-credentials.json` |
| `onepassword-connect-vault-secret` | `external-secrets` | Connect API token |
| `cloudflare-tunnel-id-secret` | `network` | `CLOUDFLARE_TUNNEL_ID` |
All values are sourced from op://kubernetes/1password/* and op://kubernetes/cloudflare/*.
Using ExternalSecrets in Apps
Reference the ClusterSecretStore in any namespace:
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: my-app-secret
namespace: my-namespace
spec:
refreshInterval: 1h
secretStoreRef:
kind: ClusterSecretStore
name: onepassword-connect
target:
name: my-app-secret
creationPolicy: Owner
data:
- secretKey: MY_API_KEY
remoteRef:
key: my-app # 1Password item name in "kubernetes" vault
property: MY_API_KEY # 1Password field name
Troubleshooting
# Check ESO is running
kubectl get pods -n external-secrets
# Check a specific ExternalSecret status
kubectl describe externalsecret <name> -n <namespace>
# Check ClusterSecretStore connectivity
kubectl describe clustersecretstore onepassword-connect
# Check Connect server logs
kubectl logs -n external-secrets deploy/onepassword-connect
If Connect cannot reach 1Password servers, check that the onepassword-connect-credentials-secret JSON is valid and the Connect token has access to the kubernetes vault.
Ingress & Gateways
The cluster uses Envoy Gateway (Gateway API) for all ingress, with Cloudflare Tunnel for external access and ExternalDNS for automatic DNS record management.
See DNS & Split-Horizon for the full DNS flow.
Gateways
Two Gateway objects are defined in kubernetes/apps/network/envoy-gateway/app/envoy.yaml:
| Gateway | IP | Listeners | Purpose |
|---|---|---|---|
| `external-gateway` | 10.10.99.97 | HTTP :80, HTTPS :443 | Internet-facing services (via Cloudflare Tunnel) |
| `internal-gateway` | 10.10.99.98 | HTTP :80, HTTPS :443 | LAN-only services |
Both gateways share the same wildcard TLS certificate (dcunha-io-tls Secret in network namespace). HTTP traffic on port 80 is redirected to HTTPS via an https-redirect HTTPRoute.
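The redirect route follows the standard Gateway API `RequestRedirect` filter pattern — roughly like this (names and status code illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: https-redirect
  namespace: network
spec:
  parentRefs:
    - name: internal-gateway   # repeated for external-gateway
      namespace: network
      sectionName: http        # attach to the :80 listener only
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301
```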
Envoy Deployment
The EnvoyProxy resource configures the backing Envoy deployment:
- Replicas: 2
- PodDisruptionBudget: 1 minimum available
- Compression: Zstd, Brotli, Gzip (backend), with HTTP/2 and HTTP/3 support
- TLS minimum version: 1.2, ALPN: `h2,http/1.1`
- Drain timeout: 180 s
- Metrics: Prometheus endpoint (gzip compressed)
Cloudflare Tunnel
The cloudflare-tunnel deployment (2 replicas) connects to Cloudflare's network and forwards *.dcunha.io traffic directly to the external-gateway pod:
ingress:
- hostname: "*.dcunha.io"
originRequest:
http2Origin: true
originServerName: external.dcunha.io
service: https://external-gateway.network.svc.cluster.local:443
- service: http_status:404
The tunnel bypasses the LoadBalancer IP — traffic comes in through Cloudflare's edge and is injected directly into the pod, which hands it to Envoy. The external-gateway IP (10.10.99.97) is only used for internal split-horizon access.
ExternalDNS
Two ExternalDNS instances watch different gateways and write to different DNS providers:
| Instance | Watches | Writes to |
|---|---|---|
| `external-dns-cloudflare` | `external-gateway` HTTPRoutes | Cloudflare DNS (proxied CNAME → `external.dcunha.io`) |
| `external-dns-unifi` | `internal-gateway` HTTPRoutes | UCG-Max DNS via kashalls webhook (A record → 10.10.99.98) |
TXT ownership records are prefixed with k8s. in both cases. external-dns-cloudflare uses txtOwnerId: artemis-cluster, external-dns-unifi uses txtOwnerId: k8s-internal.
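In Helm values terms, the ownership settings map to something like the following excerpt (values taken from the table above; the surrounding chart structure is assumed):

```yaml
# external-dns-cloudflare values (illustrative excerpt)
txtPrefix: k8s.
txtOwnerId: artemis-cluster   # external-dns-unifi uses txtOwnerId: k8s-internal
sources:
  - gateway-httproute
```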
Certificates
A single wildcard certificate covers all services:
- Cert: `dcunha-io-tls` (Secret in `network` namespace)
- Issuer: Let's Encrypt production via cert-manager
- DNS names: `dcunha.io`, `*.dcunha.io`
The certificate is issued by cert-manager and referenced by both gateways. See Certificates.
Adding a New HTTPRoute
Internal service
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: my-app
namespace: my-namespace
spec:
parentRefs:
- name: internal-gateway
namespace: network
hostnames:
- my-app.dcunha.io
rules:
- matches:
- path:
type: PathPrefix
value: /
backendRefs:
- name: my-app-svc
port: 8080
External service (internet-accessible)
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: my-app
namespace: my-namespace
spec:
parentRefs:
- name: external-gateway
namespace: network
hostnames:
- my-app.dcunha.io
rules:
- matches:
- path:
type: PathPrefix
value: /
backendRefs:
- name: my-app-svc
port: 8080
Troubleshooting
# Check gateway status
kubectl get gateway -n network
# List all HTTPRoutes
kubectl get httproute -A
# Check Envoy proxy pods
kubectl get pods -n network -l gateway.envoyproxy.io/owning-gateway-name
# Check ExternalDNS logs
kubectl logs -n network deploy/external-dns-cloudflare
kubectl logs -n network deploy/external-dns-unifi
# Check tunnel connectivity
kubectl logs -n network deploy/cloudflare-tunnel
Storage
See also Hardware → Storage for the physical/NAS tier details.
Storage Classes
| StorageClass | Provisioner | Access Mode | Use Case |
|---|---|---|---|
| `ceph-blockpool` (default) | Rook-Ceph RBD | RWO | App databases, stateful services |
| `ceph-filesystem` | Rook-Ceph CephFS | RWX | Shared config across pods |
| `openebs-hostpath` | OpenEBS | RWO | Local scratch/cache, single-node only |
Rook-Ceph
Deployed in rook-ceph namespace. The cluster consists of:
- 3 OSDs — one per control plane node (256 GB SATA SSD each)
- 3 MONs / 1 MGR — on control plane nodes
- `ceph-blockpool` — replicated ×3, host-level failure domain
- `ceph-filesystem` — CephFS for RWX workloads
Common Commands
# Check Ceph cluster health
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd status
# Check OSD usage
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph df
# Check PG status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg stat
# Get pool list
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd lspools
Known Limits
- `pg_autoscaler` enabled but capped at `mon_max_pg_per_osd=250` — cannot scale past this without adding OSDs or reducing pool count
- Adding a new OSD requires adding the node to the explicit node list in the `CephCluster` resource (`useAllNodes: false`)
OpenEBS
Deployed in openebs-system. Provides local hostpath PVCs for workloads that don't need replication. The mount point /var/local/openebs is configured as a bind mount in Talos kubelet extraMounts.
Used for: download client incomplete dirs (SABnzbd), cache volumes, scratch space.
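A hostpath claim is just a normal PVC pointing at the `openebs-hostpath` class — a minimal sketch (claim name and size hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sabnzbd-incomplete   # hypothetical claim name
  namespace: media
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: openebs-hostpath
  resources:
    requests:
      storage: 50Gi
```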
VolSync (PVC Backup/Restore)
VolSync automates PVC backups using Kopia. Deployed in volsync-system.
Components
- `volsync` — operator
- `kopia` — backup engine (S3-compatible backend)
Shared Components
Reusable Kustomize components in kubernetes/components/volsync/:
| File | Purpose |
|---|---|
| `pvc.yaml` | PVC template |
| `replicationsource.yaml` | Backup schedule + Kopia config |
| `replicationdestination.yaml` | Restore destination config |
| `externalsecret.yaml` | S3 credentials from 1Password |
Key Settings Applied
- `fsGroupChangePolicy: OnRootMismatch` — prevents slow recursive chown on every backup (critical for Jellyfin with 21k+ files)
- `moverAffinity` podAntiAffinity — spreads backup pods across nodes to avoid RBD mount storms on a single worker
See VolSync Operations for backup and restore procedures.
NFS (TrueNAS)
Bulk media storage is served via NFS from atlas (10.10.99.100). All media pods mount /mnt/atlas/media as /media.
NFS v4.2 is enforced via /etc/nfsmount.conf on all Talos nodes (configured in machineconfig.yaml.j2).
KEDA NFS Scaler
A KEDA ScaledObject in kubernetes/components/nfs-scaler/ can scale deployments based on NFS availability. Used to gate pods that depend on the NFS mount being healthy.
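The shape of such a gate, assuming a Prometheus trigger on a blackbox probe metric (the query, names, and addresses are illustrative, not the actual component contents):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-app-nfs-gate        # hypothetical
spec:
  scaleTargetRef:
    name: my-app               # deployment to gate
  minReplicaCount: 0           # scale to zero while NFS is down
  maxReplicaCount: 1
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.observability.svc.cluster.local:9090
        query: probe_success{instance="10.10.99.100"}   # assumed blackbox probe of atlas
        threshold: "1"
```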
Certificates
TLS certificates are managed by cert-manager using Let's Encrypt with DNS-01 challenge via Cloudflare.
Wildcard Certificate
A single wildcard certificate covers all services in the cluster:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: dcunha-io
namespace: network
spec:
secretName: dcunha-io-tls
issuerRef:
name: letsencrypt-production
kind: ClusterIssuer
dnsNames:
- dcunha.io
- "*.dcunha.io"
The resulting Secret dcunha-io-tls in the network namespace is referenced by both Envoy gateways (external-gateway and internal-gateway).
Certificate Export (Reflector)
The network/certificates kustomization handles syncing the wildcard cert to other namespaces via Reflector. The certificate is also exported to 1Password via a PushSecret for use outside the cluster (e.g. UCG-Max TLS).
cert-manager
Deployed in the cert-manager namespace via Helm. Bootstrapped early in the helmfile chain (before ESO/1Password).
# Check certificate status
kubectl get certificates -A
kubectl describe certificate dcunha-io -n network
# Check cert-manager logs
kubectl logs -n cert-manager deploy/cert-manager
# Force certificate renewal
kubectl delete secret dcunha-io-tls -n network
# cert-manager will automatically re-issue
Reflector
Reflector (kube-system namespace) mirrors Secrets and ConfigMaps across namespaces. Used to replicate dcunha-io-tls to namespaces that need TLS.
Annotate a Secret to enable reflection:
metadata:
annotations:
reflector.v1.k8s.emberstack.com/reflection-allowed: "true"
reflector.v1.k8s.emberstack.com/reflection-auto-enabled: "true"
reflector.v1.k8s.emberstack.com/reflection-allowed-namespaces: "media,home-automation"
Media Stack
The media stack lives in the media namespace. All apps share the TrueNAS NFS mount at /media.
Applications
| App | Purpose | URL |
|---|---|---|
| Jellyfin | Media server | https://jellyfin.dcunha.io |
| Jellyseerr | Request management | https://requests.dcunha.io |
| Sonarr (TV) | TV series management | internal |
| Sonarr (K-Drama) | K-Drama library | internal |
| Sonarr (Anime) | Anime library | internal |
| Radarr | Movie management | internal |
| Bazarr | Subtitle management | internal |
| Prowlarr | Central indexer manager | internal |
| SABnzbd | Usenet download client | internal |
| qBittorrent + Gluetun | Torrent client (VPN) | internal |
| qui | qBittorrent web UI + automation | internal |
| autobrr | IRC-based release automation | internal |
| Dispatcharr | IPTV management | internal |
| Recyclarr | Quality profile sync (CronJob) | — |
| FlareSolverr | Cloudflare bypass proxy | internal |
| TheLounge | Web IRC client | internal |
| cross-seed | Cross-seeding (built into qui) | — |
Jellyfin
- Trickplay: enabled
- Streamyfin plugin: installed — users connect via Streamyfin app for push notifications, casting, and TV login
- AnilistSync (Fallenbagel's plugin): per-user AniList scrobbling
If Trickplay stops working:
kubectl rollout restart deployment jellyfin -n media
Jellyseerr
- Tag Requests enabled — passes tags to Sonarr/Radarr for Kodi metadata, visible in Jellyfin
- Streamyfin webhook for user-targeted push notifications:
{ "title": "{{subject}}", "body": "{{message}}", "username": "{{requestedBy_username}}" }
Arr Stack
Three Sonarr instances manage separate libraries. All connect to Prowlarr as the single indexer source of truth.
Rule: Never add indexer API keys directly to Sonarr/Radarr. All indexers are managed in Prowlarr and synced automatically. Indexer configs live in Prowlarr's internal SQLite DB (stateful PVC), not in Git.
Internal cluster routing uses <app>.media.svc.cluster.local:
- Sonarr TV: `http://sonarr.media.svc.cluster.local:8989`
- Prowlarr: `http://prowlarr.media.svc.cluster.local:9696`
SABnzbd (Usenet)
SABnzbd incomplete dir must be on Rook-Ceph block storage (not TrueNAS NFS) — NFS cannot handle the random IOPS of RAR unpacking.
Server configuration:
| Priority | Server | Host | Connections | Notes |
|---|---|---|---|---|
| P0 | Frugal US (Omicron) | news.frugalusenet.com | 50 | ~3000 day retention |
| P1 | Frugal EU | eunews.frugalusenet.com | 30 | EU/NTD fallback |
| P2 | Frugal Bonus (Usenet.Farm EU) | bonus.frugalusenet.com | 50 | 1 TB/month cap |
| P3 | NGD 1 TB block | us.newsgroupdirect.com | 20 | UsenetExpress backbone |
| P4 | Blocknews 300 GB | us.blocknews.net | 10 | 6000+ day retention |
Config: article_cache=2G, receive_threads=4, SSL port 563, ciphers CHACHA20.
qBittorrent + Gluetun
qBittorrent and the Gluetun VPN sidecar run in the same pod (shared network namespace). All torrent traffic is tunnelled through Gluetun.
- Port forwarded: configured in qBittorrent Connection settings (UPnP disabled)
- DHT/PeX/Local Peer Discovery: disabled (private trackers only)
- Torrent queueing: disabled (all torrents active 24/7)
- Global share limits: disabled — handled by qui Automation
qui Seeding Automation
qui manages qBittorrent with AND-logic seeding rules (qBittorrent native is OR-only):
- Condition: ratio ≥ 1.1 AND seeding time ≥ 259,200 s (3 days)
- Action: Pause
Minimum tracker requirements apply — check each tracker's rules for ratio and seed time.
autobrr
autobrr monitors IRC announcers for private torrent trackers and Prowlarr feeds. Used primarily for ratio racing — grabbing releases the moment they're announced.
Connected to:
- Prowlarr (for indexer feeds)
- NZBGeek (as Newznab feed, secondary)
The AutobrrNetworkUnmonitored PrometheusRule fires if an IRC channel goes unmonitored for more than 1 hour. If it fires for a specific network, restart autobrr:
kubectl rollout restart deployment autobrr -n media
Cross-seeding
Cross-seeding is built into qui — there is no separate cross-seed deployment.
Critical: Never enable "Remove Completed" in Sonarr/Radarr download client settings. Enabling it deletes source files that cross-seed depends on.
Recyclarr
Runs as a CronJob to sync quality profiles from TRaSH Guides to Sonarr and Radarr.
# Force a manual run
kubectl create job --from=cronjob/recyclarr recyclarr-manual -n media
Home Automation
All home-automation apps live in the home-automation namespace. Pods that need direct L2 access to IoT devices attach a secondary interface to VLAN 1152 (IOT) via Multus.
Applications
| App | Purpose |
|---|---|
| Home Assistant | Central home automation hub |
| Frigate | NVR / AI camera monitoring |
| ESPHome | ESP8266/ESP32 device firmware management |
| Zigbee2MQTT | Zigbee coordinator → MQTT bridge |
| Mosquitto | MQTT broker |
| Matter Server | Matter/Thread protocol support |
| Homebridge | HomeKit bridge for non-native devices |
| Node-RED | Visual automation flows |
Network Architecture
Pods requiring IoT network access use a Multus NetworkAttachmentDefinition (iot in kube-system) to attach a secondary NIC on VLAN 1152. This allows:
- Frigate to discover and stream RTSP/ONVIF cameras on the IOT subnet
- Home Assistant to communicate directly with devices
- Zigbee2MQTT to reach the Zigbee coordinator USB dongle (passed through to the pod)
The primary pod interface remains on the cluster overlay network (VLAN 1099).
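A `NetworkAttachmentDefinition` for this looks roughly like the following — the CNI type, master interface, and IPAM are assumptions, not the actual cluster config:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: iot
  namespace: kube-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth0.1152",
      "ipam": { "type": "dhcp" }
    }
```

Pods opt in with the annotation `k8s.v1.cni.cncf.io/networks: kube-system/iot`.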
Home Assistant
Central hub connecting all other automation apps. Integrations include Zigbee (via Zigbee2MQTT + MQTT), ESPHome devices, Frigate (via MQTT + API), Matter devices, and Homebridge.
Frigate
AI-based NVR. Runs on the talos-gpu-01 node for hardware-accelerated object detection via the Intel Arc A380 GPU (VAAPI).
Mosquitto
MQTT broker used by Zigbee2MQTT, Frigate, ESPHome, and Home Assistant as the messaging backbone.
Zigbee2MQTT
Bridges Zigbee devices to MQTT. Requires a USB Zigbee coordinator passed through to the pod.
Node-RED
Visual flow editor for automation logic. Runs as a companion to Home Assistant for complex automations.
Observability
The observability stack lives in the observability namespace.
Applications
| App | Purpose | URL |
|---|---|---|
| Prometheus (kube-prometheus-stack) | Metrics collection + Alertmanager | internal |
| Grafana (grafana-operator) | Dashboards | internal |
| Victoria Logs | Log aggregation | internal |
| Fluent Bit | Log shipping to Victoria Logs | — |
| Gatus | Uptime / endpoint monitoring | https://status.dcunha.io |
| Kromgo | Prometheus badge endpoint | https://kromgo.dcunha.io |
| Blackbox Exporter | HTTP/TCP probing for Gatus | — |
| KEDA | Event-driven autoscaling | — |
| UniFi Poller | UniFi metrics → Prometheus | — |
Prometheus (kube-prometheus-stack)
Full kube-prometheus-stack including:
- Prometheus server
- Alertmanager
- Node exporter
- kube-state-metrics
Alertmanager
Alert routing is configured in kubernetes/components/alerts/alertmanager/. Active alerts are surfaced in the README badge.
If Prometheus WAL is corrupted after a node crash:
# Scale down
kubectl scale -n observability statefulset prometheus-kube-prometheus-stack-prometheus --replicas=0
# Wipe WAL only (compacted blocks are safe)
kubectl -n observability exec <prometheus-pod> -- rm -rf /prometheus/prometheus-db/wal/
# Scale up
kubectl scale -n observability statefulset prometheus-kube-prometheus-stack-prometheus --replicas=1
Do NOT delete individual WAL segments — this creates a non-sequential gap and causes a startup failure.
Grafana
Deployed via the grafana-operator. The operator manages a Grafana CR with:
- Datasources: Prometheus, Victoria Logs
- Dashboards: imported from app-specific `GrafanaDashboard` resources and JSON ConfigMaps
Apps that ship dashboards (Flux, Envoy Gateway, Cloudflare Tunnel, etc.) create GrafanaDashboard resources in their own namespaces, which the operator picks up automatically.
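A minimal `GrafanaDashboard` CR sketch — the selector labels are an assumption and must match the labels on the Grafana CR:

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: my-app             # hypothetical
  namespace: my-namespace
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana  # assumed label on the Grafana CR
  json: |
    { "title": "My App", "panels": [] }
```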
Victoria Logs
Replaces Loki for log aggregation. Fluent Bit ships logs from all pods to Victoria Logs.
Gatus
Endpoint monitoring with status badges. Endpoints are defined in kubernetes/apps/observability/gatus/app/resources/cluster-endpoints.yaml. Gatus also reads endpoint annotations from HTTPRoute resources (via gatus.home-operations.com/endpoint annotations on gateways).
Groups:
- `core` — Ping, Status Page, Heartbeat (Alertmanager watchdog)
- `external` — externally-accessible services (checked via `1.1.1.1` DNS)
- `internal` — LAN-only services
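An entry in `cluster-endpoints.yaml` follows the standard Gatus endpoint shape — for example (endpoint hypothetical):

```yaml
endpoints:
  - name: my-app            # hypothetical
    group: internal
    url: https://my-app.dcunha.io
    interval: 1m
    conditions:
      - "[STATUS] == 200"
```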
Kromgo
Exposes Prometheus queries as shields.io-compatible badge endpoints for the README.
Current metrics:
| Metric | Query |
|---|---|
| `talos_version` | `node_os_info{name="Talos"}` |
| `kubernetes_version` | `kubernetes_build_info` |
| `flux_version` | `flux_instance_info` |
| `cluster_node_count` | `count(kube_node_status_condition{condition="Ready"})` |
| `cluster_pod_count` | `sum(kube_pod_status_phase{phase="Running"})` |
| `cluster_cpu_usage` | `avg(instance:node_cpu_utilisation:rate5m) * 100` |
| `cluster_memory_usage` | Node memory utilisation % |
| `cluster_age_days` | `(time() - min(kube_node_created)) / 86400` |
| `cluster_uptime_days` | Average node uptime |
| `cluster_alert_count` | `alertmanager_alerts{state="active"} - 1` (excludes Watchdog) |
The cluster_power_usage metric is defined but disabled — it requires a UPS SNMP exporter which is not running (Eaton UPS batteries are dead).
UniFi Poller
Scrapes metrics from the UCG-Max (UniFi controller) and exposes them to Prometheus. Provides network device health, client counts, and traffic metrics in Grafana.
Adding an App
Most apps use the bjw-s app-template Helm chart. This is the standard pattern for adding a new app to the cluster.
Directory Structure
kubernetes/apps/<namespace>/<app-name>/
├── ks.yaml # Flux Kustomization
└── app/
├── kustomization.yaml
├── helmrelease.yaml
├── externalsecret.yaml # (if secrets needed)
└── httproute.yaml # (if ingress needed)
Step 1: Create the Flux Kustomization (ks.yaml)
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/kustomization_v1.json
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
name: &app my-app
namespace: flux-system
spec:
targetNamespace: my-namespace
commonMetadata:
labels:
app.kubernetes.io/name: *app
path: ./kubernetes/apps/my-namespace/my-app/app
prune: true
sourceRef:
kind: GitRepository
name: flux-system
dependsOn:
- name: external-secrets-stores # if using ExternalSecrets
- name: rook-ceph-cluster # if using Ceph PVCs
- name: volsync # if using VolSync backups
Step 2: Add to Namespace Kustomization
Add the app to kubernetes/apps/<namespace>/kustomization.yaml:
resources:
- ./existing-app
- ./my-app # add this line
Step 3: Create the HelmRelease (app/helmrelease.yaml)
# yaml-language-server: $schema=https://raw.githubusercontent.com/bjw-s-labs/helm-charts/main/charts/other/app-template/schemas/helmrelease-helm-v2.schema.json
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: my-app
spec:
interval: 1h
chartRef:
kind: OCIRepository
name: app-template
namespace: flux-system
values:
controllers:
my-app:
containers:
app:
image:
repository: ghcr.io/example/my-app
tag: 1.0.0
env:
TZ: America/Toronto
service:
app:
ports:
http:
port: 8080
persistence:
data:
existingClaim: my-app-data
globalMounts:
- path: /data
Step 4: Add Secrets (if needed)
# app/externalsecret.yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: my-app
spec:
refreshInterval: 1h
secretStoreRef:
kind: ClusterSecretStore
name: onepassword-connect
target:
name: my-app
creationPolicy: Owner
data:
- secretKey: API_KEY
remoteRef:
key: my-app
property: API_KEY
Reference in HelmRelease:
containers:
app:
envFrom:
- secretRef:
name: my-app
Step 5: Add Ingress (if needed)
Internal only
# app/httproute.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: my-app
spec:
parentRefs:
- name: internal-gateway
namespace: network
hostnames:
- my-app.dcunha.io
rules:
- matches:
- path:
type: PathPrefix
value: /
backendRefs:
- name: my-app
port: 8080
External (internet-accessible)
Change internal-gateway to external-gateway.
Step 6: Add VolSync Backup (if PVC needs backup)
Add the volsync component to app/kustomization.yaml:
components:
- ../../../components/volsync
Create a ClaimName annotation and ensure the ReplicationSource is configured with the correct PVC name and schedule in a patch.
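A sketch of such a patch, assuming the component's `ReplicationSource` exposes `sourcePVC` and a cron trigger (names and schedule illustrative):

```yaml
# app/kustomization.yaml (illustrative)
patches:
  - target:
      kind: ReplicationSource
    patch: |-
      apiVersion: volsync.backube/v1alpha1
      kind: ReplicationSource
      metadata:
        name: my-app
      spec:
        sourcePVC: my-app-data
        trigger:
          schedule: "0 3 * * *"   # nightly backup
```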
Step 7: Add Config Map (if needed)
Use a Kustomize ConfigMap generator to bundle config files:
# app/kustomization.yaml
configMapGenerator:
- name: my-app-config
files:
- config.yaml
Add reloader.stakater.com/auto: "true" to the controller annotation to restart pods when the ConfigMap changes.
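In app-template terms, the annotation lands on the controller like this (controller name illustrative):

```yaml
# HelmRelease values (illustrative excerpt)
controllers:
  my-app:
    annotations:
      reloader.stakater.com/auto: "true"   # restart pods when referenced ConfigMaps/Secrets change
```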
Conventions
- Use `TZ: America/Toronto` for timezone-sensitive apps
- Use `10.10.99.1` as the DNS resolver (UCG-Max), not 8.8.8.8
- Internal cluster routing: `<app>.<namespace>.svc.cluster.local` — never external DNS for pod-to-pod
- Reloader annotation on controllers that need a restart on config/secret change
- Prowlarr is the single indexer source — never add indexer keys directly to arr apps
Cluster Reset
Full procedure to destroy and rebuild the cluster. See Talos → Node Reset for single-node reset.
Before You Start
- Ensure any critical PVC data has been backed up (VolSync or manual snapshot)
- This is irreversible — all data on node disks is permanently deleted
Phase 1: Reset All Nodes
# Control planes
talosctl -n 10.10.99.101 reset --graceful=false --reboot
talosctl -n 10.10.99.102 reset --graceful=false --reboot
talosctl -n 10.10.99.103 reset --graceful=false --reboot
# Workers
talosctl -n 10.10.99.201 reset --graceful=false --reboot
talosctl -n 10.10.99.202 reset --graceful=false --reboot
talosctl -n 10.10.99.203 reset --graceful=false --reboot
Nodes reboot after wiping. Because the OS disk is wiped, they cannot boot from disk.
Phase 2: Boot from Talos ISO
Each node needs to boot into Talos maintenance mode from an ISO.
Download ISOs
just talos download-image v1.12.6 controlplane
just talos download-image v1.12.6 worker
just talos download-image v1.12.6 gpu
Physical nodes (talos-cp-01/02/03)
- Flash the ISO to USB (`dd if=talos-v1.12.6-controlplane.iso of=/dev/sdX bs=4M status=progress`)
- Insert the USB and boot each node — select USB from the boot menu (F10/F12)
Proxmox VMs (talos-w-01/02, talos-gpu-01)
- Upload the worker/gpu ISO to Proxmox storage
- Attach the ISO to each VM's CD drive: `qm set <vmid> -ide2 local:iso/talos-v1.12.6-worker.iso,media=cdrom`
- Set boot order to CD first: `qm set <vmid> -boot order='ide2;scsi0'`
- Start the VMs: `qm start 101; qm start 102; qm start 104`
Phase 3: Re-Bootstrap
Once all nodes are in maintenance mode:
# Verify nodes are reachable
ping 10.10.99.101
ping 10.10.99.201
# Run full bootstrap
just
See Bootstrap for stage details.
Phase 4: Restore PVC Data
After Flux has reconciled all apps, restore PVC data from VolSync backups (see VolSync Backup & Restore).
Post-Reset Checklist
# Nodes ready
kubectl get nodes -o wide
# Flux reconciling
flux get kustomizations -A
# Rook-Ceph healthy
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
# BGP peers established
kubectl -n kube-system exec ds/cilium -- cilium bgp peers
# Check all pods
kubectl get pods -A | grep -v Running | grep -v Completed
RBD CSI Recovery
When worker VMs experience storage I/O errors, the RBD kernel driver can enter a broken state causing cascading pod failures across the node.
Symptoms
- Pods stuck in `ContainerCreating` with `input/output error` on mounts
- CSI node plugin logs: `operation already exists` or `Cannot send after transport endpoint shutdown`
- `MountVolume.SetUp failed` with `lstat ... input/output error`
- VolSync jobs stuck in `Init:0/1`
- `rbd: map failed: (108) Cannot send after transport endpoint shutdown`
Recovery (in order)
Step 1: Restart the RBD CSI node plugin on the affected node
# Find the CSI node plugin pod on the affected node
kubectl get pods -n rook-ceph -l app=csi-rbdplugin --field-selector spec.nodeName=talos-w-01
# Delete it (it will restart automatically)
kubectl delete pod -n rook-ceph <csi-nodeplugin-pod>
If the pod restarts and errors clear, you're done.
Step 2: If CSI restart doesn't help — reboot the worker node
The kernel RBD module may have lost network transport. A reboot is required:
just talos reboot-node talos-w-01
If the node hangs during reboot (kernel stalls on RBD unmount):
# Hard reset via Proxmox
qm reset 101 # talos-w-01
qm reset 102 # talos-w-02
qm reset 104 # talos-gpu-01
Step 3: After reboot — clean up stale resources
# Force-delete pods stuck in Error or ContainerStatusUnknown
kubectl delete pod <pod> -n <namespace> --force --grace-period=0
# Find stale VolumeAttachments for the rebooted node
kubectl get volumeattachment | grep talos-w-01
# Delete stale VolumeAttachments
kubectl delete volumeattachment <name>
Step 4: If a VolSync PVC has XFS corruption
If VolSync reports `mount failed: exit status 32` on a snapshot PVC:
# Delete the volsync source PVC — it will be recreated fresh on the next backup run
kubectl delete pvc volsync-<app>-src -n <namespace>
Stale VolumeAttachment with Stuck Finalizers
Some PVs (notably Mosquitto) have VolumeAttachments that re-appear after deletion due to stuck finalizers:
# Find the PV for the stuck VA
kubectl get volumeattachment <name> -o jsonpath='{.spec.source.persistentVolumeName}'
# Remove finalizers from the PV
kubectl patch pv <pv-name> --type=json \
-p='[{"op":"remove","path":"/metadata/finalizers"}]'
Root Cause
The RBD kernel module (rbd: map failed: (108) Cannot send after transport endpoint shutdown) loses its network transport to the Ceph cluster when the Proxmox host disk experiences I/O errors. Worker VMs freeze and the kernel RBD state becomes irrecoverable without a node reboot.
Prevention: The Proxmox OS disk was replaced (T-FORCE 1 TB SSD) after the WD Blue SSD that caused this reached 85% wear. VolSync moverAffinity podAntiAffinity was added to spread backup jobs across nodes, reducing the chance of a concurrent RBD mount storm.
Prometheus WAL Corruption (After Node Crash)
If Prometheus fails to start after a crash with segments are not sequential errors:
# Scale down Prometheus
kubectl scale -n observability statefulset prometheus-kube-prometheus-stack-prometheus --replicas=0
# Get a shell (pod must exist — scale to 1 with a sleep command if needed, or use a debug pod)
# Wipe the entire WAL directory (NOT individual segments)
kubectl -n observability exec <prometheus-pod> -- rm -rf /prometheus/prometheus-db/wal/
# Scale back up
kubectl scale -n observability statefulset prometheus-kube-prometheus-stack-prometheus --replicas=1
This loses ~2 hours of uncompacted metrics only. Compacted TSDB blocks on disk are untouched.
VolSync Backup & Restore
VolSync provides automated PVC backup and restore using Kopia. All backup configuration is templated via shared Kustomize components.
Architecture
ReplicationSource (per app)
└── VolSync operator → Kopia mover pod → S3 backup repository
ReplicationDestination (per app)
└── VolSync operator → Kopia mover pod → restores to new PVC
Credentials (S3 endpoint, bucket, keys) are synced from 1Password via ExternalSecret in kubernetes/components/volsync/externalsecret.yaml.
Shared Components
Located in kubernetes/components/volsync/. Apps include them via:
# app/kustomization.yaml
components:
- ../../../components/volsync
| File | Purpose |
|---|---|
| `pvc.yaml` | PVC definition |
| `replicationsource.yaml` | Backup schedule + Kopia config |
| `replicationdestination.yaml` | Restore destination |
| `externalsecret.yaml` | S3 credentials from 1Password |
Key Settings
- `fsGroupChangePolicy: OnRootMismatch` — prevents recursive chown on every backup. Critical for apps with large filesystems (e.g. Jellyfin with 21k+ trickplay files — without this, backups take 1 hour+ just on chown).
- `moverAffinity` podAntiAffinity — spreads mover pods across nodes. Without this, all backup jobs land on a single node, causing concurrent RBD mount storms and CSI failures.
Note: the descheduler cannot help here — it excludes Job-owned pods from eviction. Anti-affinity must be set at scheduling time.
Triggering a Manual Backup
# Annotate the ReplicationSource to trigger an immediate backup
kubectl annotate replicationsource <app> \
volsync.backube/trigger-immediate-backup="$(date +%s)" \
-n <namespace>
# Watch the backup job
kubectl get jobs -n <namespace> -w
kubectl logs -n <namespace> job/volsync-src-<app> -f
Restoring a PVC
Method 1: Restore to existing app (rolling restore)
1. Scale down the app: `kubectl scale deploy/<app> -n <namespace> --replicas=0`
2. Delete the existing PVC: `kubectl delete pvc <app-data-pvc> -n <namespace>`
3. Apply or annotate the `ReplicationDestination` to trigger a restore:
   kubectl annotate replicationdestination <app> \
     volsync.backube/trigger-immediate-restore="$(date +%s)" \
     -n <namespace>
4. Wait for the restore job to complete: `kubectl get replicationdestination <app> -n <namespace> -w`
5. The restored PVC is now bound; scale the app back up: `kubectl scale deploy/<app> -n <namespace> --replicas=1`
Method 2: Restore to a new namespace (disaster recovery)
Create a ReplicationDestination in the target namespace pointing to the same Kopia repository. The mover will pull the latest snapshot.
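A rough sketch of such a destination — the mover fields are assumed to mirror the component's Kopia config, so verify against the actual `replicationdestination.yaml`:

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: my-app-restore      # hypothetical
  namespace: target-namespace
spec:
  trigger:
    manual: restore-once    # one-shot restore
  kopia:
    repository: my-app-volsync-secret   # same S3 repo credentials as the source
    destinationPVC: my-app-data
    copyMethod: Direct
```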
Checking Backup Status
# List all ReplicationSources and their last sync time
kubectl get replicationsource -A
# Check a specific source
kubectl describe replicationsource <app> -n <namespace>
# Check mover pod logs for a running backup
kubectl logs -n <namespace> -l app.kubernetes.io/component=replication-source -f
VolSync Maintenance
The volsync-system/volsync/maintenance/ kustomization applies:
- `MutatingAdmissionPolicy` for default settings
- Kopia repository maintenance schedule (prune old snapshots)
- ExternalSecret for S3 credentials
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| Job stuck in `Init:0/1` | RBD mount failure on mover node | See RBD CSI Recovery |
| `mount failed: exit status 32` | XFS corruption on snapshot PVC | Delete `volsync-<app>-src` PVC |
| All movers on same node | Missing moverAffinity | Already applied in component; check patch is included |
| Backup taking 1h+ | fsGroupChangePolicy not set | Already set in component; check patch is included |
Links
Tools & Projects
- Talos Linux — immutable Kubernetes OS
- Flux Operator — manages Flux lifecycle
- bjw-s app-template — Helm chart used by nearly all apps
- Cilium — CNI, BGP, Gateway API
- Envoy Gateway — Kubernetes Gateway API implementation
- External Secrets Operator — syncs secrets from 1Password
- 1Password Connect — self-hosted 1Password API server
- Rook-Ceph — distributed block storage operator
- VolSync — PVC backup/restore with Kopia
- Kromgo — Prometheus badge endpoint
- external-dns-unifi-webhook — ExternalDNS provider for UniFi
- Gatus — endpoint uptime monitoring
- Spegel — peer-to-peer OCI registry mirror
- tuppr — automated Talos/Kubernetes upgrades
- Multus — multi-network CNI plugin
- Reflector — Secret/ConfigMap mirror across namespaces
- Reloader — rolling restart on Secret/ConfigMap changes
Community
Reference Reading
- TechnoTim — practical homelab guides
- Christian Lempa YouTube — infrastructure concepts
- Christian Lempa Cheat Sheets
Repo References
Homelab repos that have been referenced, borrowed from, or used as inspiration for the Artemis Cluster: