GitOps-managed homelab: Kubernetes cluster (Talos + Flux) + VPS services + Infrastructure agent.
┌─────────────────────────────────────────────────────────┐
│  VPS (Oracle/Hetzner)          cloudlab-infrastructure/ │
│  ├── Traefik (reverse proxy + Cloudflare Tunnel)        │
│  ├── Pi-hole (DNS)                                      │
│  ├── Portainer EE (container management)                │
│  ├── Homepage (dashboard)                               │
│  ├── Joplin Server + Postgres (notes)                   │
│  ├── Uptime Kuma (monitoring)                           │
│  ├── Guacamole (remote desktop gateway)                 │
│  ├── Glances (system monitoring)                        │
│  └── Garage S3 (Longhorn backup target)                 │
└────────────────────┬────────────────────────────────────┘
                     │ Tailscale mesh VPN
┌────────────────────▼────────────────────────────────────┐
│  Kubernetes Cluster (Talos Linux + Flux)                │
│  ├── Cilium (CNI + Gateway API)                         │
│  ├── Longhorn (storage → backs up to Garage S3)         │
│  ├── cert-manager, external-dns, k8s-gateway            │
│  └── Apps: see kubernetes/apps/                         │
└────────────────────┬────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────┐
│  OpenClaw (openclaw.ai)                                 │
│  Telegram bot → Claude API → kubectl/docker             │
│  Config: agent/openclaw.json  Skill: agent/skills/infra │
└─────────────────────────────────────────────────────────┘
| Device | Role | Specs |
|---|---|---|
| Dell OptiPlex 3050 #1 | K8s node (Proxmox VM) | i5-6500T, 16GB, 128GB NVMe |
| Dell OptiPlex 3050 #2 | K8s node (Proxmox VM) | i5-6500T, 16GB, 128GB NVMe |
| Beelink GTi 13 Pro | K8s node (Proxmox VM) | i9-13900H, 64GB, 2x2TB NVMe |
| Dell PowerEdge R720 | Proxmox Backup Server | 2x Xeon E5-2697v2, 192GB |
| Synology DS223+ | NAS / NFS + Backup | 2x2TB HDD RAID1 |
| XCY X44 | pfSense Firewall | N100, 8GB |
| Oracle Cloud ARM VPS | Off-site services (primary) | 4 vCPU ARM, 24GB RAM, 200GB |
| Hetzner CAX21 (DR) | Standby (make dr-full) | 4 vCPU ARM, 8GB RAM, €5.39/mo |
infrastructure/
├── cloudlab-infrastructure/   # Ansible: VPS provisioning
├── kubernetes/
│   ├── apps/                  # Flux app manifests (namespaced)
│   ├── flux/                  # Flux bootstrap + HelmRepositories
│   └── components/            # Shared Kustomize components (common, repos)
├── talos/                     # Talos node configs + patches
├── bootstrap/                 # Cluster bootstrap vars
├── agent/                     # OpenClaw config template + infra skill
│   ├── openclaw.json          # Gateway config (no secrets; use ~/.openclaw/.env)
│   └── skills/infra/          # kubectl/docker skill context
├── DEPLOY.md                  # Full rebuild + DR guide
└── Taskfile.yaml              # Task runner (talosctl, flux, etc.)
Full step-by-step rebuild guide: DEPLOY.md
| Scenario | Where to look |
|---|---|
| VPS lost (Oracle reclaimed / provider down) | cd cloudlab-infrastructure && make dr-full (provisions Hetzner server + deploys all services in ~15 min) |
| Full rebuild (new server + new cluster) | DEPLOY.md: Phase 1 (VPS) → Phase 2 (K8s) → Phase 3 (Agent) |
| Restore Longhorn volumes from S3 backup | DEPLOY.md, Phase 2, step 7: task restore:longhorn |
| New hardware (different IPs/disks) | DEPLOY.md, Phase 2, step 3: update talos/talconfig.yaml, cluster-vars.yaml, cilium/networks.yaml |
| Intel iGPU absent on new hardware | Remove gpu.intel.com/i915 from kubernetes/apps/default/jellyfin/app/helmrelease.yaml and disable intel-device-plugin-operator |
| Jellyfin restored but streaming slow / Tailscale broken | See docs/jellyfin-post-restore.md; manual UI steps required after every restore |
| Re-install OpenClaw agent only | DEPLOY.md, Phase 3 |
The two things to back up before decommissioning a server:
- age.key: losing this means losing all SOPS-encrypted secrets
- ~/.openclaw/.env: Anthropic API key, Telegram tokens
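
The backup of those two files can be scripted. The following is a minimal sketch, not part of this repository: the function name `backup_secrets` is hypothetical, and all three paths are supplied by the caller because this README does not state where `age.key` lives.

```shell
#!/bin/sh
# Sketch: copy the two irreplaceable files into a private backup directory.
# backup_secrets is a hypothetical helper; every path is an assumption
# passed in by the caller.
backup_secrets() {
    age_key=$1      # path to age.key (location not stated in this README)
    env_file=$2     # e.g. "$HOME/.openclaw/.env"
    dest=$3         # backup directory, ideally on separate encrypted media

    # Refuse to continue if either source file is missing.
    for f in "$age_key" "$env_file"; do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    done

    # Create the target with owner-only permissions, then copy both files.
    mkdir -p "$dest" || return 1
    chmod 700 "$dest" || return 1
    cp -p "$age_key" "$dest/age.key" || return 1
    cp -p "$env_file" "$dest/openclaw.env" || return 1
    chmod 600 "$dest/age.key" "$dest/openclaw.env"
}
```

A call might look like `backup_secrets ./age.key "$HOME/.openclaw/.env" /mnt/usb/homelab-backup`, where the destination is only an example.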
# Check overall health
kubectl get nodes
kubectl get kustomizations -A
kubectl get helmreleases -A
cilium status
# Force Flux sync
task reconcile
# Regenerate Talos config (after editing talconfig.yaml)
task talos:generate-config
# Apply updated config to a node
task talos:apply-node IP=10.57.57.80
# Upgrade Talos on a node (update talenv.yaml version first)
task talos:upgrade-node IP=10.57.57.80
task talos:upgrade-k8s
# Reset entire cluster (destructive)
task talos:reset

cd cloudlab-infrastructure/
make health-check # Verify all services running
make setup # Full redeploy (idempotent)
make update # OS package updates only
make check # Dry-run (--check --diff)
make check-resources # Disk, memory, Docker usage
make cleanup # Remove unused Docker images/volumes
# Disaster recovery: provision on Hetzner + deploy everything
make terraform-init # first time only
make dr-full # ~15 min: new server + full stack
make terraform-plan # preview what Terraform will create

flux get sources git -A # Check git source is reachable
flux get kustomizations -A # Find which ks is failing
flux logs --level=error # See error messages
flux reconcile kustomization cluster-apps --with-source # Force sync

kubectl get helmreleases -A | grep -v True
flux logs --kind HelmRelease --name <name> -n <namespace>
flux reconcile helmrelease <name> -n <namespace> --with-source
# If values changed and Helm refuses to upgrade, suspend + resume:
flux suspend helmrelease <name> -n <namespace>
flux resume helmrelease <name> -n <namespace>

kubectl -n <namespace> get pods -o wide
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> -f
kubectl -n <namespace> logs <pod> --previous # crashed container
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'

# Volume / replica status
kubectl -n longhorn-system get volumes
kubectl -n longhorn-system get nodes.longhorn.io
# Orphaned replicas (safe to delete)
kubectl get orphan -n longhorn-system -o name | \
xargs kubectl delete -n longhorn-system
# Trigger a backup manually
# Longhorn UI → Volume → Create Backup
# Old snapshots cleanup
kubectl get snapshots -n longhorn-system -o json | \
jq -r '.items[] | select(.status.creationTime < "2025-01-01") | .metadata.name' | \
xargs kubectl delete snapshot -n longhorn-system

# 1. Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# 2. In Proxmox: shutdown VM, swap physical disk, boot VM
# 3. Regenerate and re-apply Talos config
task talos:generate-config
talosctl apply-config --insecure --nodes <ip> \
--file talos/clusterconfig/<node>.yaml
# 4. Uncordon
kubectl uncordon <node>
# 5. If Longhorn disk UUID changed, evict replicas then re-add disk:
kubectl -n longhorn-system patch node.longhorn.io <node> \
--type merge -p '{"spec":{"evictionRequested":true}}'
# Wait for replicas to evacuate (~20-60 min), then remove old disk
# and add new disk via Longhorn UI

Wait 1-2 hours between disk swaps to allow replica rebuild.
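
Instead of relying on the clock alone, the wait can be gated on volume state. The sketch below is an illustration, not part of this repository: the helper name `all_volumes_healthy` is hypothetical, and it assumes the Longhorn volume field `.status.robustness` reports `healthy` once replicas are rebuilt. Detached volumes report `unknown`, so this check is deliberately strict and will fail while any volume is detached.

```shell
#!/bin/sh
# Sketch: succeed only if every Longhorn volume reports "healthy".
# Reads one robustness value per line on stdin; fails on any other value
# (degraded, faulted, unknown) or if no volumes were seen at all.
all_volumes_healthy() {
    seen=0
    while IFS= read -r state; do
        [ -n "$state" ] || continue        # skip blank lines
        seen=1
        [ "$state" = "healthy" ] || return 1
    done
    [ "$seen" -eq 1 ]
}

# Usage against a live cluster (assumes kubectl access to longhorn-system):
#   kubectl -n longhorn-system get volumes.longhorn.io \
#     -o jsonpath='{range .items[*]}{.status.robustness}{"\n"}{end}' \
#     | all_volumes_healthy && echo "safe to swap the next disk"
```

Keeping the check as a pure stdin filter makes it easy to try with sample input before pointing it at the cluster.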
talosctl -n <node-ip> health
talosctl -n <node-ip> dmesg
talosctl -n <node-ip> services
kubectl describe node <node-name>

ssh root@<vps-ip> "docker exec garage /garage status"
ssh root@<vps-ip> "docker exec garage /garage bucket list"
# Verify Longhorn can reach it:
kubectl -n longhorn-system get secret minio-secret

Keep an odd number of control plane nodes (1, 3, 5) for quorum.
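
The reason for odd counts follows from majority-quorum arithmetic: with n members, quorum is floor(n/2) + 1, so an even member adds no extra failure tolerance. The following sketch prints the numbers; the function names are illustrative only.

```shell
#!/bin/sh
# Majority quorum for n control plane (etcd) members: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Members that can fail while a majority is still reachable: floor((n-1)/2).
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 4 5; do
    echo "nodes=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
# nodes=3 and nodes=4 both tolerate exactly one failure,
# so the fourth node adds cost without adding resilience.
```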
# 1. Boot new node from Talos ISO (same schematic ID as existing nodes)
# Get disk and MAC from the node in maintenance mode:
talosctl get disks -n <new-node-ip> --insecure
talosctl get links -n <new-node-ip> --insecure
# 2. Add node entry to talos/talconfig.yaml with the disk and MAC above
# 3. Regenerate config and apply to new node
task talos:generate-config
task talos:apply-node IP=<new-node-ip>
# 4. Node joins automatically; watch it become Ready:
kubectl get nodes --watch

# After editing talos/talconfig.yaml or any patch:
task talos:generate-config
task talos:apply-node IP=<node-ip> MODE=auto
# MODE=auto applies without reboot if possible, reboots if required

Renovate runs every weekend and opens PRs automatically for:
- Helm chart versions (all HelmReleases)
- Container image tags (annotated with # renovate:)
- Talos / Kubernetes versions (.mise.toml)
Config: .renovaterc.json5
# Edit any encrypted secret
sops kubernetes/apps/<namespace>/<app>/app/secret.sops.yaml
# Re-encrypt all secrets after AGE key rotation
find . -name "*.sops.*" -exec sops updatekeys {} \;

- Kubernetes secrets encrypted with SOPS (AGE key; back up manually)
- Ansible secrets in encrypted Vault (cloudlab-infrastructure/)
- All traffic via Tailscale mesh or Cloudflare Tunnel (no open ports)
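
To guard against committing a secret in plaintext, files matching `*.sops.*` can be scanned for the SOPS value marker. This is a rough sketch under stated assumptions, not part of this repository: both function names are hypothetical, and the check assumes the default SOPS cipher, whose encrypted values begin with `ENC[AES256_GCM,`.

```shell
#!/bin/sh
# Sketch: detect *.sops.* files that do not appear to be encrypted.
# Assumes SOPS's default AES256_GCM value format: ENC[AES256_GCM,data:...]

# Succeeds if the file contains at least one SOPS-encrypted value.
is_sops_encrypted() {
    grep -q 'ENC\[AES256_GCM,' "$1"
}

# Prints every *.sops.* file under the given directory that looks plaintext.
find_plaintext_secrets() {
    find "$1" -type f -name '*.sops.*' | while IFS= read -r f; do
        is_sops_encrypted "$f" || echo "$f"
    done
}
```

Running `find_plaintext_secrets kubernetes` and treating any output as a failure would fit a pre-commit hook or CI step.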