meroxdotdev/infrastructure

merox.dev Infrastructure

GitOps-managed homelab: Kubernetes cluster (Talos + Flux) + VPS services + Infrastructure agent.


Architecture

┌─────────────────────────────────────────────────────────┐
│  VPS (Oracle/Hetzner)   cloudlab-infrastructure/        │
│  ├── Traefik (reverse proxy + Cloudflare Tunnel)        │
│  ├── Pi-hole (DNS)                                      │
│  ├── Portainer EE (container management)                │
│  ├── Homepage (dashboard)                               │
│  ├── Joplin Server + Postgres (notes)                   │
│  ├── Uptime Kuma (monitoring)                           │
│  ├── Guacamole (remote desktop gateway)                 │
│  ├── Glances (system monitoring)                        │
│  └── Garage S3 (Longhorn backup target)                 │
└────────────────────┬────────────────────────────────────┘
                     │ Tailscale mesh VPN
┌────────────────────▼────────────────────────────────────┐
│  Kubernetes Cluster (Talos Linux + Flux)                │
│  ├── Cilium (CNI + Gateway API)                         │
│  ├── Longhorn (storage → backs up to Garage S3)         │
│  ├── cert-manager, external-dns, k8s-gateway            │
│  └── Apps: see kubernetes/apps/                         │
└─────────────────────────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────┐
│  OpenClaw - openclaw.ai                                 │
│  Telegram bot → Claude API → kubectl/docker             │
│  Config: agent/openclaw.json  Skill: agent/skills/infra │
└─────────────────────────────────────────────────────────┘

Hardware

| Device | Role | Specs |
|---|---|---|
| Dell OptiPlex 3050 #1 | K8s node (Proxmox VM) | i5-6500T, 16GB, 128GB NVMe |
| Dell OptiPlex 3050 #2 | K8s node (Proxmox VM) | i5-6500T, 16GB, 128GB NVMe |
| Beelink GTi 13 Pro | K8s node (Proxmox VM) | i9-13900H, 64GB, 2x2TB NVMe |
| Dell PowerEdge R720 | Proxmox Backup Server | 2x Xeon E5-2697v2, 192GB |
| Synology DS223+ | NAS / NFS + Backup | 2x2TB HDD RAID1 |
| XCY X44 | pfSense Firewall | N100, 8GB |
| Oracle Cloud ARM VPS | Off-site services (primary) | 4 vCPU ARM, 24GB RAM, 200GB |
| Hetzner CAX21 (DR) | Standby: make dr-full | 4 vCPU ARM, 8GB RAM, €5.39/mo |

Repository Layout

infrastructure/
├── cloudlab-infrastructure/    # Ansible: VPS provisioning
├── kubernetes/
│   ├── apps/                   # Flux app manifests (namespaced)
│   ├── flux/                   # Flux bootstrap + HelmRepositories
│   └── components/             # Shared Kustomize components (common, repos)
├── talos/                      # Talos node configs + patches
├── bootstrap/                  # Cluster bootstrap vars
├── agent/                      # OpenClaw config template + infra skill
│   ├── openclaw.json           # Gateway config (no secrets; use ~/.openclaw/.env)
│   └── skills/infra/           # kubectl/docker skill context
├── DEPLOY.md                   # Full rebuild + DR guide
└── Taskfile.yaml               # Task runner (talosctl, flux, etc.)

Disaster Recovery

Full step-by-step rebuild guide: DEPLOY.md

| Scenario | Where to look |
|---|---|
| VPS lost (Oracle reclaimed / provider down) | cd cloudlab-infrastructure && make dr-full (provisions Hetzner server + deploys all services in ~15 min) |
| Full rebuild (new server + new cluster) | DEPLOY.md, Phase 1 (VPS) → Phase 2 (K8s) → Phase 3 (Agent) |
| Restore Longhorn volumes from S3 backup | DEPLOY.md, Phase 2, step 7: task restore:longhorn |
| New hardware (different IPs/disks) | DEPLOY.md, Phase 2, step 3: update talos/talconfig.yaml, cluster-vars.yaml, cilium/networks.yaml |
| Intel iGPU absent on new hardware | Remove gpu.intel.com/i915 from kubernetes/apps/default/jellyfin/app/helmrelease.yaml and disable intel-device-plugin-operator |
| Jellyfin restored but streaming slow / Tailscale broken | See docs/jellyfin-post-restore.md; manual UI steps are required after every restore |
| Re-install OpenClaw agent only | DEPLOY.md, Phase 3 |

The two things to back up before decommissioning a server:

  1. age.key: losing this means losing all SOPS-encrypted secrets
  2. ~/.openclaw/.env: Anthropic API key, Telegram tokens
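
The two-file backup above can be scripted so that neither file is forgotten. The following is a minimal sketch, not part of the repository: the function name `backup_critical_files` is hypothetical, and the paths are passed as arguments because the location of age.key on a given machine is an assumption.

```shell
#!/bin/sh
# Sketch: archive the two irreplaceable files before wiping a server.
# The function name and all paths are illustrative assumptions.
backup_critical_files() (
  age_key=$1    # e.g. ./age.key (location is an assumption)
  env_file=$2   # e.g. "$HOME/.openclaw/.env"
  dest=$3       # output archive path

  # Fail early if either file is missing or unreadable.
  for f in "$age_key" "$env_file"; do
    if [ ! -r "$f" ]; then
      echo "missing or unreadable: $f" >&2
      exit 1
    fi
  done

  tar -cf "$dest" "$age_key" "$env_file" || exit 1
  # List the archive so both entries can be confirmed by eye.
  tar -tf "$dest"
)

# Example call (paths are placeholders):
#   backup_critical_files ./age.key "$HOME/.openclaw/.env" /path/to/offline/keys.tar
```

The resulting archive holds plaintext secrets, so it belongs on offline or otherwise protected storage rather than on the server being decommissioned.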

Day-to-Day Operations

Cluster

# Check overall health
kubectl get nodes
kubectl get kustomizations -A
kubectl get helmreleases -A
cilium status

# Force Flux sync
task reconcile

# Regenerate Talos config (after editing talconfig.yaml)
task talos:generate-config

# Apply updated config to a node
task talos:apply-node IP=10.57.57.80

# Upgrade Talos on a node (update talenv.yaml version first)
task talos:upgrade-node IP=10.57.57.80
task talos:upgrade-k8s

# Reset entire cluster (destructive)
task talos:reset

VPS

cd cloudlab-infrastructure/

make health-check       # Verify all services running
make setup              # Full redeploy (idempotent)
make update             # OS package updates only
make check              # Dry-run (--check --diff)
make check-resources    # Disk, memory, Docker usage
make cleanup            # Remove unused Docker images/volumes

# Disaster recovery: provision on Hetzner + deploy everything
make terraform-init     # first time only
make terraform-plan     # preview what Terraform will create
make dr-full            # ~15 min: new server + full stack

Troubleshooting

Flux not reconciling

flux get sources git -A                              # Check git source is reachable
flux get kustomizations -A                           # Find which ks is failing
flux logs --level=error                              # See error messages
flux reconcile kustomization cluster-apps --with-source  # Force sync

HelmRelease stuck / failed

kubectl get helmreleases -A | grep -v True
flux logs --kind HelmRelease --name <name> -n <namespace>
flux reconcile helmrelease <name> -n <namespace> --with-source
# If values changed and Helm refuses, suspend + resume:
flux suspend helmrelease <name> -n <namespace>
flux resume helmrelease <name> -n <namespace>
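
The `grep -v True` filter above also keeps the header row and matches "True" anywhere on the line. As a possible alternative, this sketch checks only the READY column and prints `namespace/name` pairs. It assumes the default `kubectl get helmreleases -A` column order (NAMESPACE, NAME, AGE, READY, STATUS); the function name `not_ready_helmreleases` is hypothetical.

```shell
#!/bin/sh
# Reads `kubectl get helmreleases -A` output on stdin and prints
# namespace/name for every release whose READY column is not "True".
# Assumes the columns are: NAMESPACE NAME AGE READY STATUS.
not_ready_helmreleases() {
  awk 'NR > 1 && $4 != "True" { print $1 "/" $2 }'
}

# Example: reconcile every release that is not ready.
#   kubectl get helmreleases -A | not_ready_helmreleases |
#     while IFS=/ read -r ns name; do
#       flux reconcile helmrelease "$name" -n "$ns" --with-source
#     done
```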

Pod issues

kubectl -n <namespace> get pods -o wide
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> -f
kubectl -n <namespace> logs <pod> --previous            # crashed container
kubectl -n <namespace> get events --sort-by='.metadata.creationTimestamp'

Longhorn storage

# Volume / replica status
kubectl -n longhorn-system get volumes
kubectl -n longhorn-system get nodes.longhorn.io

# Orphaned replicas (safe to delete)
kubectl get orphan -n longhorn-system -o name | \
  xargs kubectl delete -n longhorn-system

# Trigger a backup manually
# Longhorn UI → Volume → Create Backup

# Old snapshots cleanup
kubectl get snapshots -n longhorn-system -o json | \
  jq -r '.items[] | select(.status.creationTime < "2025-01-01") | .metadata.name' | \
  xargs kubectl delete snapshot -n longhorn-system
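
The snapshot cleanup above hard-codes its cutoff date. One way to compute a rolling cutoff instead is sketched below. The helper name `days_ago` is hypothetical, and the sketch assumes either GNU `date` (`-d`) or BSD/macOS `date` (`-v`) is available; other implementations such as BusyBox may not accept either form.

```shell
#!/bin/sh
# Print the UTC date N days ago as YYYY-MM-DD.
# Tries GNU date syntax first, then falls back to the BSD/macOS form.
days_ago() {
  date -u -d "$1 days ago" +%Y-%m-%d 2>/dev/null ||
    date -u -v-"$1"d +%Y-%m-%d
}

# Example: list snapshots older than 30 days. Review the list before
# piping it into `xargs kubectl delete snapshot -n longhorn-system`.
#   cutoff=$(days_ago 30)
#   kubectl get snapshots -n longhorn-system -o json |
#     jq -r --arg cutoff "$cutoff" \
#       '.items[] | select(.status.creationTime < $cutoff) | .metadata.name'
```

Passing the date through `jq --arg` avoids splicing shell variables into the jq program text.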

Replacing a disk on a K8s node

# 1. Drain node
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

# 2. In Proxmox: shutdown VM, swap physical disk, boot VM

# 3. Regenerate and re-apply Talos config
task talos:generate-config
talosctl apply-config --insecure --nodes <ip> \
  --file talos/clusterconfig/<node>.yaml

# 4. Uncordon
kubectl uncordon <node>

# 5. If Longhorn disk UUID changed, evict replicas then re-add disk:
kubectl -n longhorn-system patch node.longhorn.io <node> \
  --type merge -p '{"spec":{"evictionRequested":true}}'
# Wait for replicas to evacuate (~20-60 min), then remove old disk
# and add new disk via Longhorn UI

Wait 1-2 hours between disk swaps to allow replica rebuild.
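
The waiting periods above (replica evacuation, then rebuild) can be polled rather than timed by hand. This is a generic retry sketch, not something the repository provides: `wait_until` is a hypothetical helper, and the readiness check it runs (shown as `node_has_no_replicas` in the comment) is a placeholder the reader would have to write.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempt budget runs out.
# usage: wait_until ATTEMPTS SLEEP_SECONDS command [args...]
wait_until() (
  attempts=$1
  delay=$2
  shift 2
  i=1
  while :; do
    # Success: stop polling.
    "$@" && exit 0
    # Budget exhausted: stop without a trailing sleep.
    [ "$i" -ge "$attempts" ] && break
    i=$((i + 1))
    sleep "$delay"
  done
  echo "gave up after $attempts attempts: $*" >&2
  exit 1
)

# Example: poll once a minute for up to about two hours.
# node_has_no_replicas is a placeholder check, not an existing command.
#   wait_until 120 60 node_has_no_replicas <node>
```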

Node unreachable

talosctl -n <node-ip> health
talosctl -n <node-ip> dmesg
talosctl -n <node-ip> services
kubectl describe node <node-name>

Garage S3 (Longhorn backup target)

ssh root@<vps-ip> "docker exec garage /garage status"
ssh root@<vps-ip> "docker exec garage /garage bucket list"
# Check that the secret Longhorn uses to reach it exists (this does not test connectivity):
kubectl -n longhorn-system get secret minio-secret

Maintenance

Adding a node

Keep an odd number of control plane nodes (1, 3, 5) for quorum.

# 1. Boot new node from Talos ISO (same schematic ID as existing nodes)
#    Get disk and MAC from the node in maintenance mode:
talosctl get disks -n <new-node-ip> --insecure
talosctl get links -n <new-node-ip> --insecure

# 2. Add node entry to talos/talconfig.yaml with the disk and MAC above

# 3. Regenerate config and apply to new node
task talos:generate-config
task talos:apply-node IP=<new-node-ip>

# 4. Node joins automatically; watch it become Ready:
kubectl get nodes --watch
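
The odd-number guidance for control plane nodes follows from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 of them reachable. The arithmetic below is a small illustration (the function names are made up for this sketch) showing that an even member count adds no extra failure tolerance over the odd count below it.

```shell
#!/bin/sh
# Majority needed for etcd to accept writes with N members.
quorum_size() { echo $(( $1 / 2 + 1 )); }

# Members that can be lost while a majority is still reachable.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5; do
  echo "$n nodes: quorum $(quorum_size "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Three and four control plane nodes both tolerate a single failure, so the fourth node adds load without adding resilience.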

Talos config changes

# After editing talos/talconfig.yaml or any patch:
task talos:generate-config
task talos:apply-node IP=<node-ip> MODE=auto
# MODE=auto applies without reboot if possible, reboots if required
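
When a change affects several nodes, applying it one node at a time limits the blast radius of a bad patch. A minimal sketch, assuming the `task talos:apply-node` target shown above; the wrapper name `apply_nodes` is hypothetical and the node IPs are supplied by the caller.

```shell
#!/bin/sh
# Apply the regenerated Talos config to nodes sequentially,
# stopping at the first node where the apply fails.
apply_nodes() (
  for ip in "$@"; do
    echo "applying config to $ip"
    task talos:apply-node IP="$ip" MODE=auto || {
      echo "apply failed on $ip; remaining nodes left untouched" >&2
      exit 1
    }
  done
)

# Example:
#   task talos:generate-config && apply_nodes 10.57.57.80 <other-node-ip>
```

Checking `kubectl get nodes` between runs, or adding a pause inside the loop, would give each node time to return to Ready before the next one is touched.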

Dependency updates

Renovate runs every weekend and opens PRs automatically for:

  • Helm chart versions (all HelmReleases)
  • Container image tags (annotated with # renovate:)
  • Talos / Kubernetes versions (.mise.toml)

Config: .renovaterc.json5

SOPS secret rotation

# Edit any encrypted secret
sops kubernetes/apps/<namespace>/<app>/app/secret.sops.yaml

# Re-encrypt all secrets after AGE key rotation
find . -name "*.sops.*" -exec sops updatekeys {} \;
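
With `find ... -exec ... \;`, a failing `sops updatekeys` on one file does not stop the run or change find's exit status, so a partial rotation can go unnoticed. The sketch below is one possible stricter variant; the function names are hypothetical, and `--yes` is used to skip the interactive confirmation that `sops updatekeys` otherwise asks for.

```shell
#!/bin/sh
# Print every SOPS-encrypted file under a directory, one per line.
list_sops_files() {
  find "${1:-.}" -type f -name '*.sops.*' | sort
}

# Update recipients on each file, stopping at the first failure.
# Runs in a subshell so `exit` only ends this function.
rotate_sops_keys() (
  list_sops_files "$1" | while IFS= read -r f; do
    echo "updating keys: $f"
    sops updatekeys --yes "$f" || exit 1
  done
)

# Example:
#   list_sops_files .      # dry run: see what would be touched
#   rotate_sops_keys .
```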

Security

  • Kubernetes secrets encrypted with SOPS (AGE key; back up manually)
  • Ansible secrets in encrypted Vault (cloudlab-infrastructure/)
  • All traffic via Tailscale mesh or Cloudflare Tunnel (no open ports)

About

Personal homelab infrastructure: a production-ready Kubernetes homelab with Talos Linux and GitOps automation. Multi-node setup with automated deployments via Flux and comprehensive infrastructure management.


Generated from onedr0p/cluster-template