AWS-specific Kubernetes deployment overlay for Thyme high-throughput log collection benchmarks on EKS.
This directory contains Kustomize overlays that adapt the base Kubernetes deployment for AWS EKS:
- Grafana LoadBalancer: Exposes Grafana via AWS Network Load Balancer
- Log-Generator Pod Affinity: Co-locates all 20 log-gen pods on a single node
| Component | Base (k3d) | AWS Overlay |
|---|---|---|
| Grafana Access | NodePort (30000) | LoadBalancer (NLB) |
| Log-Gen Placement | Distributed | Co-located on single node |
| Image Registry | Local or GHCR | GHCR (or ECR) |
- EKS cluster provisioned via `infrastructure/aws/`
- kubectl configured to access the cluster:

  ```shell
  aws eks update-kubeconfig --region eu-central-1 --name thyme-benchmark
  ```

- Thyme image pushed to GHCR or ECR:

  ```shell
  make docker-build
  make docker-push  # Requires GHCR authentication
  ```
```shell
# From repository root
kubectl apply -k deployment/aws/
```

This creates:
- `thyme-benchmark` namespace with the thyme DaemonSet, nop-collector, and log-generators
- `lgtm` namespace with the Grafana LGTM stack
- LoadBalancer service for Grafana
- Pod affinity rules for log-generators
```shell
# Check all pods running
kubectl get pods -n thyme-benchmark
kubectl get pods -n lgtm

# Verify log-generator pod co-location (all should be on the same node)
kubectl get pods -n thyme-benchmark -l app=log-generator -o wide

# Check services
kubectl get svc -n lgtm grafana
```

Expected output for log-generator pods:
```text
NAME                             READY   STATUS    NODE
log-generator-xxxxxxxxxx-xxxxx   1/1     Running   ip-10-0-11-123.eu-central-1.compute.internal
log-generator-xxxxxxxxxx-xxxxx   1/1     Running   ip-10-0-11-123.eu-central-1.compute.internal
log-generator-xxxxxxxxxx-xxxxx   1/1     Running   ip-10-0-11-123.eu-central-1.compute.internal
...
```

(All 20 pods on the SAME node.)
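Co-location can also be checked mechanically. The sketch below exercises the count-per-node logic on placeholder node names standing in for real `kubectl` output (the `ip-10-0-11-123` values are made up):

```shell
# Placeholder pod-to-node assignments, standing in for
# `kubectl get pods ... -o custom-columns=NODE:.spec.nodeName --no-headers` output
nodes="ip-10-0-11-123
ip-10-0-11-123
ip-10-0-11-123"

# Count pods per node; co-location means exactly one distinct node
echo "$nodes" | sort | uniq -c
echo "distinct nodes: $(echo "$nodes" | sort -u | wc -l | tr -d ' ')"
```

With real output piped in instead of the placeholder list, "distinct nodes: 1" confirms the affinity rule held.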
```shell
# Get LoadBalancer URL
LB_URL=$(kubectl get svc grafana -n lgtm -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Grafana: http://$LB_URL:3000"

# Open in browser (takes ~2-3 minutes for DNS propagation)
# Login: admin / admin
```

Note: LoadBalancer provisioning takes 2-3 minutes. Check status with:

```shell
kubectl describe svc grafana -n lgtm
```

If using AWS ECR (uncommented in `infrastructure/aws/ecr.tf`):
1. Get the ECR repository URL:

   ```shell
   cd infrastructure/aws
   tofu output ecr_repository_url
   ```

2. Update `kustomization.yaml`:

   ```yaml
   images:
     - name: ghcr.io/ollygarden/thyme
       newName: YOUR_ACCOUNT_ID.dkr.ecr.eu-central-1.amazonaws.com/thyme
       newTag: latest
   ```

3. Authenticate and push:

   ```shell
   aws ecr get-login-password --region eu-central-1 | \
     docker login --username AWS --password-stdin \
     "$(aws sts get-caller-identity --query Account --output text).dkr.ecr.eu-central-1.amazonaws.com"
   make docker-build
   docker tag ghcr.io/ollygarden/thyme:latest YOUR_ACCOUNT_ID.dkr.ecr.eu-central-1.amazonaws.com/thyme:latest
   docker push YOUR_ACCOUNT_ID.dkr.ecr.eu-central-1.amazonaws.com/thyme:latest
   ```
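The registry hostname in the commands above follows a fixed pattern, `ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com`, so it can be assembled once and reused. A small sketch (the account ID below is a placeholder):

```shell
ACCOUNT_ID=123456789012   # placeholder; use `aws sts get-caller-identity` for the real value
REGION=eu-central-1
ECR_REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

echo "${ECR_REGISTRY}/thyme:latest"
# → 123456789012.dkr.ecr.eu-central-1.amazonaws.com/thyme:latest
```

Reusing `$ECR_REGISTRY` in the `docker tag` and `docker push` commands avoids repeating the account ID by hand.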
To limit LoadBalancer access to specific IPs:

1. Edit `grafana-loadbalancer.yaml`:

   ```yaml
   spec:
     loadBalancerSourceRanges:
       - 1.2.3.4/32  # Your office IP
       - 5.6.7.8/32  # VPN IP
   ```

2. Reapply:

   ```shell
   kubectl apply -k deployment/aws/
   ```
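`loadBalancerSourceRanges` entries must be in CIDR notation (a bare IP such as `1.2.3.4` fails API-server validation). A rough pre-check, using a simplified regex that does not validate octet ranges:

```shell
# Rough CIDR shape check (sketch; accepts e.g. 999.0.0.0/32, so not full validation)
for cidr in 1.2.3.4/32 5.6.7.8/32 10.0.0.0; do
  if echo "$cidr" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$'; then
    echo "$cidr: ok"
  else
    echo "$cidr: not CIDR notation"
  fi
done
```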
The log-generator pod affinity ensures all 20 log-gen pods run on a single node:
- Reason: Test thyme DaemonSet at maximum node capacity (50k logs/sec)
- Strategy: First pod schedules freely, subsequent pods must co-locate with it
- Result: One "hot" node with 20 log-gens + thyme DaemonSet pod
```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app
            operator: In
            values:
              - log-generator
      topologyKey: kubernetes.io/hostname
```

- `requiredDuringScheduling`: Hard constraint; a pod won't schedule if it would be violated
- `IgnoredDuringExecution`: If the node fails, pods can be rescheduled elsewhere
- `topologyKey: kubernetes.io/hostname`: Pods must share the same hostname (i.e., the same node)
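If the hard requirement proves too brittle (for example, the chosen instance type genuinely cannot fit all 20 pods), the same selector could be expressed as a soft preference instead. This is a sketch of an alternative, not part of this overlay:

```yaml
# Soft variant (sketch): the scheduler tries to co-locate, but will still
# place pods on other nodes when the preferred node is full
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - log-generator
        topologyKey: kubernetes.io/hostname
```

The trade-off: soft affinity can silently spread pods across nodes, which would invalidate the single-hot-node premise of the benchmark, so the hard rule is the right default here.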
```shell
# Should show all pods on the same node
kubectl get pods -n thyme-benchmark -l app=log-generator -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

# Count pods per node (should be 20 on one node, 0 on others)
kubectl get pods -n thyme-benchmark -l app=log-generator -o json | \
  jq -r '.items[].spec.nodeName' | sort | uniq -c
```

Access Grafana at the LoadBalancer URL, then use Explore → Prometheus:
Throughput (logs/sec):

```promql
rate(otelcol_receiver_accepted_log_records_total{service_name="nop-collector"}[1m])
```

Export failures:

```promql
rate(otelcol_exporter_send_failed_log_records_total[1m])
```

CPU usage per node:

```promql
sum by (instance) (rate(otelcol_process_cpu_seconds_total[1m]))
```

Memory usage:

```promql
otelcol_process_runtime_heap_alloc_bytes
```
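`otelcol_process_runtime_heap_alloc_bytes` reports raw bytes; a quick conversion to MiB when eyeballing values outside Grafana (the reading below is made up, not a live metric):

```shell
heap_bytes=268435456   # sample reading, not a live metric
echo "$((heap_bytes / 1024 / 1024)) MiB"
# → 256 MiB
```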
Thyme logs (DaemonSet):

```shell
kubectl logs -n thyme-benchmark -l app=thyme --tail=100
```

Nop-collector logs:

```shell
kubectl logs -n thyme-benchmark -l app=nop-collector --tail=100
```

Log-generator logs (verify 2,500 logs/sec per pod):

```shell
kubectl logs -n thyme-benchmark -l app=log-generator --tail=20
```

Symptom: Log-gen pods spread across multiple nodes
Diagnosis:

```shell
kubectl get pods -n thyme-benchmark -l app=log-generator -o wide
```

Cause: Insufficient resources on a single node for 20 pods

Solution:

- Check node capacity:

  ```shell
  kubectl describe nodes | grep -A5 "Allocated resources"
  ```

- Reduce log-gen replicas or resource requests
- Use a larger instance type (e.g., m6i.4xlarge)
Symptom: `kubectl get svc grafana -n lgtm` shows `<pending>`

Diagnosis:

```shell
kubectl describe svc grafana -n lgtm
```

Common causes:

- Subnets missing the `kubernetes.io/role/elb` tag (check `infrastructure/aws/vpc.tf`)
- Service quotas exceeded (check AWS Service Quotas)
- Security groups blocking traffic

Solution:

```shell
# Verify subnet tags
aws ec2 describe-subnets --filters "Name=tag:Name,Values=*thyme-benchmark*" \
  --query 'Subnets[*].[SubnetId,Tags[?Key==`kubernetes.io/role/elb`].Value]'

# Check events
kubectl get events -n lgtm --sort-by='.lastTimestamp'
```

Symptom: `otelcol_exporter_send_failed_log_records_total` increasing
Diagnosis:

```shell
kubectl logs -n thyme-benchmark -l app=thyme | grep -i error
kubectl logs -n thyme-benchmark -l app=nop-collector | grep -i error
```

Common causes:

- Nop-collector overwhelmed (increase resources)
- Network issues between nodes
- gRPC message size limit (check `max_recv_msg_size_mib`)
Symptom: Pods stuck in Pending, node showing high utilization

Diagnosis:

```shell
kubectl describe node <node-with-log-gens>
kubectl top node
kubectl top pod -n thyme-benchmark
```

Solution:

- Reduce log-gen replicas:

  ```shell
  kubectl scale deployment log-generator -n thyme-benchmark --replicas=15
  ```

- Increase the node size in `infrastructure/aws/variables.tf` and reapply
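Since each log-generator pod emits 2,500 logs/sec, scaling the replica count changes the aggregate load proportionally; a quick way to see the effect of the suggested reduction:

```shell
per_pod=2500   # logs/sec per log-generator pod
for replicas in 20 15; do
  echo "$replicas replicas -> $((replicas * per_pod)) logs/sec"
done
# → 20 replicas -> 50000 logs/sec
# → 15 replicas -> 37500 logs/sec
```

Note that dropping to 15 replicas means the benchmark no longer exercises the full 50k logs/sec target.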
The `run-benchmark-aws.sh` script handles cleanup automatically:

```shell
# Auto-cleanup enabled by default
./scripts/run-benchmark-aws.sh

# Disable auto-cleanup to keep infrastructure running
AUTO_CLEANUP=false ./scripts/run-benchmark-aws.sh
```

What the automated cleanup does:

- Deletes all Kubernetes resources
- Waits for LoadBalancer deletion (3 minutes)
- Deletes ECR images (if ECR is used)
- Destroys all infrastructure

Total cleanup time: ~10 minutes
If you need to clean up manually:

```shell
# Step 1: Delete Kubernetes resources
kubectl delete -k deployment/aws/

# Step 2: Wait for LoadBalancer deletion
sleep 180

# Step 3: Destroy infrastructure
cd infrastructure/aws
tofu destroy
```

Important: Always delete Kubernetes resources first and wait 3 minutes. This prevents `tofu destroy` from hanging for 15-20 minutes while AWS cleans up the LoadBalancer.
The automated script handles this, but for manual cleanup:

```shell
aws ecr list-images --repository-name thyme --region eu-central-1 \
  --query 'imageIds[*]' --output json | \
  jq -r '.[] | @json' | \
  xargs -I {} aws ecr batch-delete-image \
    --repository-name thyme --region eu-central-1 --image-ids '{}'
```

```shell
# Verify all resources destroyed
cd infrastructure/aws && tofu show

# Check for orphaned load balancers (describe-load-balancers does not return tags;
# look up the Project=Thyme tag per ARN with `aws elbv2 describe-tags`)
aws elbv2 describe-load-balancers --region eu-central-1 \
  --query 'LoadBalancers[*].[LoadBalancerName,LoadBalancerArn]' --output table
```

Use the automated script for the complete benchmark workflow:
```shell
# From repository root
./scripts/run-benchmark-aws.sh [active_duration_minutes] [cluster_name]

# Examples:
./scripts/run-benchmark-aws.sh                       # 30-min active, auto-cleanup
./scripts/run-benchmark-aws.sh 60                    # 60-min active, auto-cleanup
AUTO_CLEANUP=false ./scripts/run-benchmark-aws.sh    # Keep cluster after
```

The script handles:
- Infrastructure provisioning (15 min)
- Deployment (5 min)
- Ramp-up phase (5 min)
- Active benchmark (configurable, default 30 min)
- Cool-down phase (10 min)
- Metrics collection via LoadBalancer
- Report generation
- Infrastructure cleanup (optional, 10 min)
Total runtime: ~45 minutes of benchmark phases (5 min ramp-up + 30 min active + 10 min cool-down) for the default run, plus provisioning and optional cleanup
Monitor costs during benchmarks:
```shell
# Real-time cost estimate
aws ce get-cost-and-usage \
  --time-period Start=2026-02-04,End=2026-02-05 \
  --granularity HOURLY \
  --metrics UnblendedCost \
  --filter file://<(echo '{"Tags":{"Key":"Project","Values":["Thyme"]}}')
```

Expected: ~$2.50/hour during the benchmark
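From the ~$2.50/hour figure, a rough per-run cost can be estimated from wall-clock minutes. The sketch uses integer cents to stay within POSIX shell arithmetic; the 75-minute figure is an assumption (provisioning through cleanup for a default run, summing the phase durations listed above):

```shell
rate_cents_per_hour=250   # ~$2.50/hour, from the estimate above
minutes=75                # assumption: provisioning through cleanup for a default run
cents=$(( rate_cents_per_hour * minutes / 60 ))
echo "~\$$((cents / 100)).$((cents % 100)) for a ${minutes}-minute run"
# → ~$3.12 for a 75-minute run
```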