AWS-specific Kubernetes deployment overlay for Thyme high-throughput log collection benchmarks on EKS.
This directory contains Kustomize overlays that adapt the base Kubernetes deployment for AWS EKS:
- Grafana LoadBalancer: Exposes Grafana via AWS Network Load Balancer
- Log-Generator Pod Affinity: Co-locates all 20 log-gen pods on a single node
| Component | Base (k3d) | AWS Overlay |
|---|---|---|
| Grafana Access | NodePort (30000) | LoadBalancer (NLB) |
| Log-Gen Placement | Distributed | Co-located on single node |
| Image Registry | Local or GHCR | GHCR (or ECR) |
- EKS cluster provisioned via `infrastructure/aws/`
- kubectl configured to access the cluster:

  ```shell
  aws eks update-kubeconfig --region eu-central-1 --name thyme-benchmark
  ```

- Thyme image pushed to GHCR or ECR:

  ```shell
  make docker-build
  make docker-push  # Requires GHCR authentication
  ```
```shell
# From repository root
kubectl apply -k deployment/aws/
```

This creates:
- `thyme-benchmark` namespace with the thyme DaemonSet, nop-collector, and log-generators
- `lgtm` namespace with the Grafana LGTM stack
- LoadBalancer service for Grafana
- Pod affinity rules for log-generators
```shell
# Check all pods running
kubectl get pods -n thyme-benchmark
kubectl get pods -n lgtm

# Verify log-generator pod co-location (all should be on the same node)
kubectl get pods -n thyme-benchmark -l app=log-generator -o wide

# Check services
kubectl get svc -n lgtm grafana
```

Expected output for log-generator pods:
```text
NAME                             READY   STATUS    NODE
log-generator-xxxxxxxxxx-xxxxx   1/1     Running   ip-10-0-11-123.eu-central-1.compute.internal
log-generator-xxxxxxxxxx-xxxxx   1/1     Running   ip-10-0-11-123.eu-central-1.compute.internal
log-generator-xxxxxxxxxx-xxxxx   1/1     Running   ip-10-0-11-123.eu-central-1.compute.internal
...
```

(All 20 pods on the SAME node.)
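Co-location can also be checked mechanically. The sketch below exercises the count-per-node logic on placeholder node names standing in for real `kubectl` output (the `ip-10-0-11-123` values are made up):

```shell
# Placeholder pod-to-node assignments, standing in for
# `kubectl get pods ... -o custom-columns=NODE:.spec.nodeName --no-headers` output
nodes="ip-10-0-11-123
ip-10-0-11-123
ip-10-0-11-123"

# Count pods per node; co-location means exactly one distinct node
echo "$nodes" | sort | uniq -c
echo "distinct nodes: $(echo "$nodes" | sort -u | wc -l | tr -d ' ')"
```

With real output piped in instead of the placeholder list, "distinct nodes: 1" confirms the affinity rule held.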
```shell
# Get LoadBalancer URL
LB_URL=$(kubectl get svc grafana -n lgtm -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Grafana: http://$LB_URL:3000"

# Open in browser (takes ~2-3 minutes for DNS propagation)
# Login: admin / admin
```

Note: LoadBalancer provisioning takes 2-3 minutes. Check status with:

```shell
kubectl describe svc grafana -n lgtm
```

If using AWS ECR (uncommented in `infrastructure/aws/ecr.tf`):
1. Get the ECR repository URL:

   ```shell
   cd infrastructure/aws
   tofu output ecr_repository_url
   ```

2. Update `kustomization.yaml`:

   ```yaml
   images:
     - name: ghcr.io/ollygarden/thyme
       newName: YOUR_ACCOUNT_ID.dkr.ecr.eu-central-1.amazonaws.com/thyme
       newTag: latest
   ```

3. Authenticate and push:

   ```shell
   aws ecr get-login-password --region eu-central-1 | \
     docker login --username AWS --password-stdin \
     "$(aws sts get-caller-identity --query Account --output text).dkr.ecr.eu-central-1.amazonaws.com"
   make docker-build
   docker tag ghcr.io/ollygarden/thyme:latest YOUR_ACCOUNT_ID.dkr.ecr.eu-central-1.amazonaws.com/thyme:latest
   docker push YOUR_ACCOUNT_ID.dkr.ecr.eu-central-1.amazonaws.com/thyme:latest
   ```
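The registry hostname in the commands above follows a fixed pattern, `ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com`, so it can be assembled once and reused. A small sketch (the account ID below is a placeholder):

```shell
ACCOUNT_ID=123456789012   # placeholder; use `aws sts get-caller-identity` for the real value
REGION=eu-central-1
ECR_REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"

echo "${ECR_REGISTRY}/thyme:latest"
# → 123456789012.dkr.ecr.eu-central-1.amazonaws.com/thyme:latest
```

Reusing `$ECR_REGISTRY` in the `docker tag` and `docker push` commands avoids repeating the account ID by hand.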
To limit LoadBalancer access to specific IPs:

1. Edit `grafana-loadbalancer.yaml`:

   ```yaml
   spec:
     loadBalancerSourceRanges:
       - 1.2.3.4/32  # Your office IP
       - 5.6.7.8/32  # VPN IP
   ```

2. Reapply:

   ```shell
   kubectl apply -k deployment/aws/
   ```
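`loadBalancerSourceRanges` entries must be in CIDR notation (a bare IP such as `1.2.3.4` fails API-server validation). A rough pre-check, using a simplified regex that does not validate octet ranges:

```shell
# Rough CIDR shape check (sketch; accepts e.g. 999.0.0.0/32, so not full validation)
for cidr in 1.2.3.4/32 5.6.7.8/32 10.0.0.0; do
  if echo "$cidr" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}/[0-9]{1,2}$'; then
    echo "$cidr: ok"
  else
    echo "$cidr: not CIDR notation"
  fi
done
```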
The log-generator pod affinity ensures all 20 log-gen pods run on a single node:
- Reason: Test thyme DaemonSet at maximum node capacity (50k logs/sec)
- Strategy: First pod schedules freely, subsequent pods must co-locate with it
- Result: One "hot" node with 20 log-gens + thyme DaemonSet pod
```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app
            operator: In
            values:
              - log-generator
      topologyKey: kubernetes.io/hostname
```

- `requiredDuringScheduling`: Hard constraint; a pod won't schedule if it would be violated
- `IgnoredDuringExecution`: If the node fails, pods can be rescheduled elsewhere
- `topologyKey: kubernetes.io/hostname`: Pods must share the same hostname (i.e., the same node)
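If the hard requirement proves too brittle (for example, the chosen instance type genuinely cannot fit all 20 pods), the same selector could be expressed as a soft preference instead. This is a sketch of an alternative, not part of this overlay:

```yaml
# Soft variant (sketch): the scheduler tries to co-locate, but will still
# place pods on other nodes when the preferred node is full
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
            - key: app
              operator: In
              values:
                - log-generator
        topologyKey: kubernetes.io/hostname
```

The trade-off: soft affinity can silently spread pods across nodes, which would invalidate the single-hot-node premise of the benchmark, so the hard rule is the right default here.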
```shell
# Should show all pods on the same node
kubectl get pods -n thyme-benchmark -l app=log-generator -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName

# Count pods per node (should be 20 on one node, 0 on others)
kubectl get pods -n thyme-benchmark -l app=log-generator -o json | \
  jq -r '.items[].spec.nodeName' | sort | uniq -c
```

Access Grafana at the LoadBalancer URL, then use Explore → Prometheus:
Throughput (logs/sec):

```promql
rate(otelcol_receiver_accepted_log_records_total{service_name="nop-collector"}[1m])
```

Export failures:

```promql
rate(otelcol_exporter_send_failed_log_records_total[1m])
```

CPU usage per node:

```promql
sum by (instance) (rate(otelcol_process_cpu_seconds_total[1m]))
```

Memory usage:

```promql
otelcol_process_runtime_heap_alloc_bytes
```
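`otelcol_process_runtime_heap_alloc_bytes` reports raw bytes; a quick conversion to MiB when eyeballing values outside Grafana (the reading below is made up, not a live metric):

```shell
heap_bytes=268435456   # sample reading, not a live metric
echo "$((heap_bytes / 1024 / 1024)) MiB"
# → 256 MiB
```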
Thyme logs (DaemonSet):

```shell
kubectl logs -n thyme-benchmark -l app=thyme --tail=100
```

Nop-collector logs:

```shell
kubectl logs -n thyme-benchmark -l app=nop-collector --tail=100
```

Log-generator logs (verify 2,500 logs/sec per pod):

```shell
kubectl logs -n thyme-benchmark -l app=log-generator --tail=20
```

Symptom: Log-gen pods spread across multiple nodes
Diagnosis:

```shell
kubectl get pods -n thyme-benchmark -l app=log-generator -o wide
```

Cause: Insufficient resources on a single node for 20 pods

Solution:

- Check node capacity:

  ```shell
  kubectl describe nodes | grep -A5 "Allocated resources"
  ```

- Reduce log-gen replicas or resource requests
- Use a larger instance type (e.g., m6i.4xlarge)
Symptom: `kubectl get svc grafana -n lgtm` shows `<pending>`

Diagnosis:

```shell
kubectl describe svc grafana -n lgtm
```

Common causes:

- Subnets missing the `kubernetes.io/role/elb` tag (check `infrastructure/aws/vpc.tf`)
- Service quotas exceeded (check AWS Service Quotas)
- Security groups blocking traffic

Solution:

```shell
# Verify subnet tags
aws ec2 describe-subnets --filters "Name=tag:Name,Values=*thyme-benchmark*" \
  --query 'Subnets[*].[SubnetId,Tags[?Key==`kubernetes.io/role/elb`].Value]'

# Check events
kubectl get events -n lgtm --sort-by='.lastTimestamp'
```

Symptom: `otelcol_exporter_send_failed_log_records_total` increasing
Diagnosis:

```shell
kubectl logs -n thyme-benchmark -l app=thyme | grep -i error
kubectl logs -n thyme-benchmark -l app=nop-collector | grep -i error
```

Common causes:

- Nop-collector overwhelmed (increase resources)
- Network issues between nodes
- gRPC message size limit (check `max_recv_msg_size_mib`)
Symptom: Pods stuck in Pending, node showing high utilization

Diagnosis:

```shell
kubectl describe node <node-with-log-gens>
kubectl top node
kubectl top pod -n thyme-benchmark
```

Solution:

- Reduce log-gen replicas:

  ```shell
  kubectl scale deployment log-generator -n thyme-benchmark --replicas=15
  ```

- Increase the node size in `infrastructure/aws/variables.tf` and reapply
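Since each log-generator pod emits 2,500 logs/sec, scaling the replica count changes the aggregate load proportionally; a quick way to see the effect of the suggested reduction:

```shell
per_pod=2500   # logs/sec per log-generator pod
for replicas in 20 15; do
  echo "$replicas replicas -> $((replicas * per_pod)) logs/sec"
done
# → 20 replicas -> 50000 logs/sec
# → 15 replicas -> 37500 logs/sec
```

Note that dropping to 15 replicas means the benchmark no longer exercises the full 50k logs/sec target.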
The `run-benchmark-aws.sh` script handles cleanup automatically:

```shell
# Auto-cleanup enabled by default
./scripts/run-benchmark-aws.sh

# Disable auto-cleanup to keep infrastructure running
AUTO_CLEANUP=false ./scripts/run-benchmark-aws.sh
```

What the automated cleanup does:

- Deletes all Kubernetes resources
- Waits for LoadBalancer deletion (3 minutes)
- Deletes ECR images (if ECR is used)
- Destroys all infrastructure

Total cleanup time: ~10 minutes
If you need to clean up manually:

```shell
# Step 1: Delete Kubernetes resources
kubectl delete -k deployment/aws/

# Step 2: Wait for LoadBalancer deletion
sleep 180

# Step 3: Destroy infrastructure
cd infrastructure/aws
tofu destroy
```

Important: Always delete Kubernetes resources first and wait 3 minutes. This prevents `tofu destroy` from hanging for 15-20 minutes while AWS cleans up the LoadBalancer.
The automated script handles this, but for manual cleanup:

```shell
aws ecr list-images --repository-name thyme --region eu-central-1 \
  --query 'imageIds[*]' --output json | \
  jq -r '.[] | @json' | \
  xargs -I {} aws ecr batch-delete-image \
    --repository-name thyme --region eu-central-1 --image-ids '{}'
```

```shell
# Verify all resources destroyed
cd infrastructure/aws && tofu show

# Check for orphaned load balancers (describe-load-balancers does not return tags;
# look up the Project=Thyme tag per ARN with `aws elbv2 describe-tags`)
aws elbv2 describe-load-balancers --region eu-central-1 \
  --query 'LoadBalancers[*].[LoadBalancerName,LoadBalancerArn]' --output table
```

Use the automated script for the complete benchmark workflow:
```shell
# From repository root
./scripts/run-benchmark-aws.sh [active_duration_minutes] [cluster_name]

# Examples:
./scripts/run-benchmark-aws.sh                       # 30-min active, auto-cleanup
./scripts/run-benchmark-aws.sh 60                    # 60-min active, auto-cleanup
AUTO_CLEANUP=false ./scripts/run-benchmark-aws.sh    # Keep cluster after
```

The script handles:
- Infrastructure provisioning (15 min)
- Deployment (5 min)
- Ramp-up phase (5 min)
- Active benchmark (configurable, default 30 min)
- Cool-down phase (10 min)
- Metrics collection via LoadBalancer
- Report generation
- Infrastructure cleanup (optional, 10 min)
Total runtime: ~45 minutes of benchmark phases (5 min ramp-up + 30 min active + 10 min cool-down) for the default run, plus provisioning and optional cleanup
Monitor costs during benchmarks:
```shell
# Real-time cost estimate
aws ce get-cost-and-usage \
  --time-period Start=2026-02-04,End=2026-02-05 \
  --granularity HOURLY \
  --metrics UnblendedCost \
  --filter file://<(echo '{"Tags":{"Key":"Project","Values":["Thyme"]}}')
```

Expected: ~$2.50/hour during the benchmark
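From the ~$2.50/hour figure, a rough per-run cost can be estimated from wall-clock minutes. The sketch uses integer cents to stay within POSIX shell arithmetic; the 75-minute figure is an assumption (provisioning through cleanup for a default run, summing the phase durations listed above):

```shell
rate_cents_per_hour=250   # ~$2.50/hour, from the estimate above
minutes=75                # assumption: provisioning through cleanup for a default run
cents=$(( rate_cents_per_hour * minutes / 60 ))
echo "~\$$((cents / 100)).$((cents % 100)) for a ${minutes}-minute run"
# → ~$3.12 for a 75-minute run
```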