merox.dev Infrastructure

GitOps-managed homelab: a Kubernetes cluster (Talos + Flux), VPS services, and an infrastructure agent (OpenClaw).

Architecture

┌─────────────────────────────────────────────────────────┐
│  VPS (Oracle/Hetzner)   cloudlab-infrastructure/        │
│  ├── Traefik (reverse proxy + Cloudflare Tunnel)        │
│  ├── Pi-hole (DNS)                                      │
│  ├── Portainer EE (container management)                │
│  ├── Homepage (dashboard)                               │
│  ├── Joplin Server + Postgres (notes)                   │
│  ├── Uptime Kuma (monitoring)                           │
│  ├── Guacamole (remote desktop gateway)                 │
│  ├── Glances (system monitoring)                        │
│  └── Garage S3 (Longhorn backup target)                 │
└────────────────────┬────────────────────────────────────┘
                     │ Tailscale mesh VPN
┌────────────────────▼────────────────────────────────────┐
│  Kubernetes Cluster (Talos Linux + Flux)                │
│  ├── Cilium (CNI + Gateway API)                         │
│  ├── Longhorn (storage → backs up to Garage S3)         │
│  ├── cert-manager, external-dns, k8s-gateway            │
│  └── Apps: see kubernetes/apps/                         │
└─────────────────────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────┐
│  OpenClaw — openclaw.ai                                 │
│  Telegram bot → Claude API → kubectl/docker             │
│  Config: agent/openclaw.json  Skill: agent/skills/infra │
└─────────────────────────────────────────────────────────┘

Hardware

Device	Role	Specs
Dell OptiPlex 3050 #1	K8s node (Proxmox VM)	i5-6500T, 16GB, 128GB NVMe
Dell OptiPlex 3050 #2	K8s node (Proxmox VM)	i5-6500T, 16GB, 128GB NVMe
Beelink GTi 13 Pro	K8s node (Proxmox VM)	i9-13900H, 64GB, 2x2TB NVMe
Dell PowerEdge R720	Proxmox Backup Server	2x Xeon E5-2697v2, 192GB
Synology DS223+	NAS / NFS + Backup	2x2TB HDD RAID1
XCY X44	pfSense Firewall	N100, 8GB
Oracle Cloud ARM VPS	Off-site services (primary)	4 vCPU ARM, 24GB RAM, 200GB
Hetzner CAX21 (DR)	Standby — `make dr-full`	4 vCPU ARM, 8GB RAM, €5.39/mo

Repository Layout

infrastructure/
├── cloudlab-infrastructure/    # Ansible — VPS provisioning
├── kubernetes/
│   ├── apps/                   # Flux app manifests (namespaced)
│   ├── flux/                   # Flux bootstrap + HelmRepositories
│   └── components/             # Shared Kustomize components (common, repos)
├── talos/                      # Talos node configs + patches
├── bootstrap/                  # Cluster bootstrap vars
├── agent/                      # OpenClaw config template + infra skill
│   ├── openclaw.json           # Gateway config (no secrets — use ~/.openclaw/.env)
│   └── skills/infra/           # kubectl/docker skill context
├── DEPLOY.md                   # Full rebuild + DR guide
└── Taskfile.yaml               # Task runner (talosctl, flux, etc.)

Disaster Recovery

Full step-by-step rebuild guide: DEPLOY.md

Scenario	Where to look
VPS lost (Oracle reclaimed / provider down)	`cd cloudlab-infrastructure && make dr-full` — provisions Hetzner server + deploys all services in ~15 min
Full rebuild (new server + new cluster)	DEPLOY.md — Phase 1 (VPS) → Phase 2 (K8s) → Phase 3 (Agent)
Restore Longhorn volumes from S3 backup	DEPLOY.md — Phase 2, step 7: `task restore:longhorn`
New hardware (different IPs/disks)	DEPLOY.md — Phase 2, step 3: update `talos/talconfig.yaml`, `cluster-vars.yaml`, `cilium/networks.yaml`
Intel iGPU absent on new hardware	Remove `gpu.intel.com/i915` from `kubernetes/apps/default/jellyfin/app/helmrelease.yaml` and disable `intel-device-plugin-operator`
Jellyfin restored but streaming slow / Tailscale broken	See docs/jellyfin-post-restore.md — manual UI steps required after every restore
Re-install OpenClaw agent only	DEPLOY.md — Phase 3

The two things to back up before decommissioning a server:

age.key — losing this = losing all SOPS-encrypted secrets
~/.openclaw/.env — Anthropic API key, Telegram tokens
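
Both files can be bundled into one archive before the server is wiped. The following is a minimal sketch, not part of the repository: `backup_secrets` is a hypothetical helper, and the location of `age.key` is an assumption that the caller supplies as the first argument.

```shell
# Hypothetical helper: bundle the two irreplaceable files into one archive.
# Assumption: the caller knows where age.key lives and passes it as $1.
backup_secrets() {
  age_key="$1"                          # e.g. ./age.key (assumed location)
  env_file="$2"                         # e.g. ~/.openclaw/.env
  out="$3"                              # destination .tar.gz
  for f in "$age_key" "$env_file"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      return 1
    fi
  done
  # The archive is NOT encrypted; umask keeps it readable by the owner only.
  ( umask 077 && tar -czf "$out" "$age_key" "$env_file" )
}
```

Example call: `backup_secrets ./age.key ~/.openclaw/.env ~/infra-secrets.tar.gz`. Because the archive holds plaintext keys, it should go to storage that is at least as well protected as the originals.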

Day-to-Day Operations

Cluster

# Check overall health
kubectl get nodes
kubectl get kustomizations -A
kubectl get helmreleases -A
cilium status

# Force Flux sync
task reconcile

# Regenerate Talos config (after editing talconfig.yaml)
task talos:generate-config

# Apply updated config to a node
task talos:apply-node IP=10.57.57.80

# Upgrade Talos on a node (update talenv.yaml version first)
task talos:upgrade-node IP=10.57.57.80

# Upgrade Kubernetes
task talos:upgrade-k8s

# Reset entire cluster (destructive)
task talos:reset

VPS

cd cloudlab-infrastructure/

make health-check       # Verify all services running
make setup              # Full redeploy (idempotent)
make update             # OS package updates only
make check              # Dry-run (--check --diff)
make check-resources    # Disk, memory, Docker usage
make cleanup            # Remove unused Docker images/volumes

# Disaster recovery: provision on Hetzner + deploy everything
make terraform-init     # first time only
make terraform-plan     # optional: preview what Terraform will create
make dr-full            # ~15 min: new server + full stack

Troubleshooting

Flux not reconciling

flux get sources git -A                              # Check git source is reachable
flux get kustomizations -A                           # Find which ks is failing
flux logs --level=error                              # See error messages
flux reconcile kustomization cluster-apps --with-source  # Force sync

HelmRelease stuck / failed

kubectl get helmreleases -A | grep -v True
flux logs --kind HelmRelease --name <name> -n <namespace>
flux reconcile helmrelease <name> -n <namespace> --with-source
# If values changed and Helm refuses — suspend + resume:
flux suspend helmrelease <name> -n <namespace>
flux resume helmrelease <name> -n <namespace>
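
`grep -v True` matches anywhere on the line, so it keeps the header row and would hide a failing release whose status message happens to contain "True". A minimal sketch of a stricter filter follows; `not_ready` is a hypothetical helper that looks up the READY column from the header rather than assuming its position.

```shell
# Print namespace/name for rows whose READY column is not "True".
# Reads `kubectl get <kind> -A` output on stdin (first two columns are
# assumed to be NAMESPACE and NAME, which is what -A produces).
not_ready() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "READY") col = i
      if (!col) { print "no READY column in header" > "/dev/stderr"; exit 2 }
      next
    }
    $col != "True" { print $1 "/" $2 }
  '
}
```

Usage: `kubectl get helmreleases -A | not_ready`. The same filter should work for `kubectl get kustomizations -A`, since it only relies on the header naming a READY column.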

Pod issues

kubectl -n <namespace> get pods -o wide
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> -f
kubectl -n <namespace> logs <pod> --previous            # crashed container
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'

Longhorn storage

# Volume / replica status
kubectl -n longhorn-system get volumes
kubectl -n longhorn-system get nodes.longhorn.io

# Orphaned replicas (safe to delete)
kubectl get orphan -n longhorn-system -o name | \
  xargs -r kubectl delete -n longhorn-system   # -r: do nothing when there are no orphans

# Trigger a backup manually
# Longhorn UI → Volume → Create Backup

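# Old snapshots cleanup (adjust the cutoff date; snapshots without a creationTime are skipped)
kubectl get snapshots.longhorn.io -n longhorn-system -o json | \
  jq -r '.items[] | select((.status.creationTime // "") != "" and .status.creationTime < "2025-01-01") | .metadata.name' | \
  xargs -r kubectl delete snapshots.longhorn.io -n longhorn-system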
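
The cleanup above hard-codes its cutoff date and deletes immediately. A preview step with the cutoff as a parameter could look like the sketch below. `snapshots_older_than` is a hypothetical helper; it assumes `.status.creationTime` is an ISO 8601 string, the same assumption the command above relies on when comparing timestamps as text.

```shell
# Hypothetical helper: read "name<TAB>creationTime" lines on stdin and print
# the names whose timestamp sorts before the cutoff given as $1.
# ISO 8601 timestamps order correctly under plain string comparison.
snapshots_older_than() {
  awk -F '\t' -v cutoff="$1" '
    # Appending "" forces a string comparison; rows with no timestamp are skipped.
    $2 != "" && ($2 "") < (cutoff "") { print $1 }
  '
}

# Preview only; nothing is deleted here:
# kubectl -n longhorn-system get snapshots.longhorn.io \
#   -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.creationTime}{"\n"}{end}' \
#   | snapshots_older_than 2025-01-01
```

Reviewing the printed names before piping them into a delete keeps a mistyped cutoff from removing recent snapshots.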

Replacing a disk on a K8s node

# 1. Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 2. In Proxmox: shutdown VM, swap physical disk, boot VM

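# 3. Regenerate and re-apply Talos config
#    (--insecure only works while the node is in maintenance mode, e.g. after
#    the install disk was replaced; drop it if the node kept its config)
task talos:generate-config
talosctl apply-config --insecure --nodes <ip> \
  --file talos/clusterconfig/<node>.yaml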

# 4. Uncordon
kubectl uncordon <node>

# 5. If Longhorn disk UUID changed — evict replicas then re-add disk:
kubectl -n longhorn-system patch node.longhorn.io <node> \
  --type merge -p '{"spec":{"evictionRequested":true}}'
# Wait for replicas to evacuate (~20-60 min), then remove old disk
# and add new disk via Longhorn UI

Wait 1-2 hours between disk swaps to allow replica rebuild.
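
Instead of waiting a fixed time, the rebuild can be polled. This is a sketch under one assumption: each Longhorn Volume resource reports `.status.robustness` with values such as `healthy`, `degraded`, or `faulted`. `wait_volumes_healthy` is a hypothetical helper name.

```shell
# Hypothetical helper: block until no Longhorn volume is degraded or faulted.
# Assumes Volume resources expose .status.robustness.
wait_volumes_healthy() {
  interval="${1:-60}"                   # seconds between polls
  while :; do
    state=$(kubectl -n longhorn-system get volumes.longhorn.io \
      -o jsonpath='{range .items[*]}{.metadata.name}{"="}{.status.robustness}{"\n"}{end}') \
      || return 1                       # do not mistake a kubectl failure for "healthy"
    bad=$(printf '%s\n' "$state" | grep -E '=(degraded|faulted)$' || true)
    if [ -z "$bad" ]; then
      return 0
    fi
    echo "still rebuilding: $(echo $bad)" >&2
    sleep "$interval"
  done
}
```

Running `wait_volumes_healthy 120` before starting the next disk swap returns once every volume has left the degraded and faulted states. Detached volumes are ignored by this check.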

Node unreachable

talosctl -n <node-ip> health
talosctl -n <node-ip> dmesg
talosctl -n <node-ip> services
kubectl describe node <node-name>

Garage S3 (Longhorn backup target)

ssh root@<vps-ip> "docker exec garage /garage status"
ssh root@<vps-ip> "docker exec garage /garage bucket list"
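# Check that the backup credentials secret exists in the cluster
# (this confirms the secret is present, not that Garage is reachable):
kubectl -n longhorn-system get secret minio-secret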

Maintenance

Adding a node

Keep an odd number of control plane nodes (1, 3, 5) for quorum.
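
The reason is majority quorum in etcd, which the control plane relies on: a cluster of n members needs floor(n/2) + 1 of them available. The short sketch below works through the arithmetic; the function names are illustrative only.

```shell
# Quorum is a strict majority of control plane members: floor(n/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }
# Members that can fail while the remaining ones still form a quorum.
fault_tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "nodes=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

Running it shows that 4 nodes need 3 for quorum and still tolerate only one failure, the same as 3 nodes, so an even count adds a machine without adding resilience.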

# 1. Boot new node from Talos ISO — same schematic ID as existing nodes
#    Get disk and MAC from the node in maintenance mode:
talosctl get disks -n <new-node-ip> --insecure
talosctl get links -n <new-node-ip> --insecure

# 2. Add node entry to talos/talconfig.yaml with the disk and MAC above

# 3. Regenerate config and apply to new node
task talos:generate-config
task talos:apply-node IP=<new-node-ip>

# 4. Node joins automatically — watch it become Ready:
kubectl get nodes --watch

Talos config changes

# After editing talos/talconfig.yaml or any patch:
task talos:generate-config
task talos:apply-node IP=<node-ip> MODE=auto
# MODE=auto applies without reboot if possible, reboots if required

Dependency updates

Renovate runs every weekend and opens PRs automatically for:

Helm chart versions (all HelmReleases)
Container image tags (annotated with # renovate:)
Talos / Kubernetes versions (.mise.toml)

Config: .renovaterc.json5

SOPS secret rotation

# Edit any encrypted secret
sops kubernetes/apps/<namespace>/<app>/app/secret.sops.yaml

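# Re-encrypt all secrets after AGE key rotation
# (update the recipients in .sops.yaml first; -y skips the per-file prompt;
# .sops.yaml itself matches the pattern, so it is excluded)
find . -type f -name "*.sops.*" ! -name ".sops.yaml" -exec sops updatekeys -y {} \;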
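
After editing or rotating, it is worth confirming that no secret file was left in plaintext. The sketch below is a heuristic only: `find_unencrypted` is a hypothetical helper that lists `*.sops.*` files with no `ENC[` marker, the prefix SOPS gives to encrypted values. A file that is only partly encrypted would still pass.

```shell
# Hypothetical check: list *.sops.* files that contain no "ENC[" marker,
# i.e. files that look like they were saved in plaintext.
# The .sops.yaml config file matches the pattern but is never encrypted,
# so it is excluded.
find_unencrypted() {
  find "${1:-.}" -type f -name '*.sops.*' ! -name '.sops.yaml' \
    -exec grep -L 'ENC\[' {} +
}
```

Usage: `find_unencrypted kubernetes/` prints nothing when every matching file contains at least one encrypted value.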

Security

Kubernetes secrets encrypted with SOPS (AGE key — back up manually)
Ansible secrets in encrypted Vault (cloudlab-infrastructure/)
All traffic via Tailscale mesh or Cloudflare Tunnel (no open ports)
