This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Thyme is a downstream distribution of Tulip (OllyGarden's OpenTelemetry Collector), specialized for high-throughput log collection benchmarking. It is NOT a production collector - it's a performance testing tool that validates collector performance at 50-100k+ logs/sec.
Relationship to Tulip: Thyme is built using the same OCB (OpenTelemetry Collector Builder) approach as Tulip but with a focused component set optimized for log ingestion benchmarking.
The project uses a two-level Makefile structure:
- Root Makefile (`./Makefile`) - Entry point that delegates to the distribution
- Distribution Makefile (`distributions/thyme/Makefile`) - Actual build logic
All build commands at the root simply forward to distributions/thyme/.
Thyme is built using OCB v0.144.0, which:
- Downloads automatically on first build to `distributions/thyme/bin/builder`
- Reads `distributions/thyme/manifest.yaml` to know which OTel components to include
- Generates Go source code in `distributions/thyme/build/`
- Compiles to the `distributions/thyme/build/thyme` binary
Key principle: The manifest.yaml defines the distribution. To add/remove components, edit the manifest and rebuild.
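For orientation, a minimal sketch of the manifest shape (illustrative only — the exact component set and versions here are assumptions; `distributions/thyme/manifest.yaml` is authoritative):

```yaml
dist:
  name: thyme
  output_path: ./build
  version: 0.144.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.144.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/filelogreceiver v0.144.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.144.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor v0.144.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.144.0
  - gomod: go.opentelemetry.io/collector/exporter/nopexporter v0.144.0

providers:
  - gomod: go.opentelemetry.io/collector/confmap/provider/fileprovider v1.50.0
```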
```bash
# Build the collector binary
make build

# Run locally with config.yaml
make run

# Validate configuration without running
make validate

# Clean all build artifacts
make clean

# Build Docker image
make docker-build

# Build and load into k3d cluster
make k3d-load K3D_CLUSTER=thyme-test
```

Two configurations exist in `distributions/thyme/`:
- `config.yaml` - Production config with filelog receiver + k8sattributes processor (for DaemonSet deployments)
- `config-local.yaml` - Local testing config with only the OTLP receiver (simpler, no K8s dependencies)
When modifying configurations, remember (see the fragment after this list):

- Extensions must bind to `0.0.0.0`, not `localhost` (for K8s health checks)
- The batch processor's `send_batch_size: 10000` creates ~4.6MB messages
- gRPC receivers need `max_recv_msg_size_mib: 16` to handle large batches
- Internal telemetry sends to LGTM via `http://lgtm.lgtm.svc.cluster.local:4318`
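A fragment tying these rules together (a sketch, not the shipped config — `distributions/thyme/config*.yaml` is authoritative, and the internal-telemetry block assumes the collector's current declarative telemetry format):

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133          # 0.0.0.0, not localhost, so K8s probes can connect

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 16    # ~4.6MB batches exceed the 4MiB gRPC default

processors:
  batch:
    send_batch_size: 10000           # produces the ~4.6MB messages noted above

service:
  extensions: [health_check]
  telemetry:
    metrics:
      readers:
        - periodic:
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: http://lgtm.lgtm.svc.cluster.local:4318
```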
```
log-generator pods (100 × 1,000 lines/sec = 100k logs/sec)
        ↓
/var/log/pods/* (Kubernetes host filesystem)
        ↓
thyme DaemonSet (filelog receiver → k8sattributes → batch → OTLP exporter)
        ↓
nop-collector Deployment (OTLP receiver → batch → nop exporter)
        ↓
[discarded]
```
Both collectors export internal telemetry to the LGTM stack.

Why two stages? This simulates realistic edge collection (DaemonSet) → aggregation (Deployment) patterns while isolating the performance of the edge collector.
- Docker Compose (`deployment/compose/`) - Local development, single-host testing
- Kubernetes with k3d (`deployment/kubernetes/`) - Local Kubernetes testing
- AWS EKS (`deployment/aws/` + `infrastructure/aws/`) - Production-scale cloud benchmarking
All Kubernetes resources deploy via `kubectl apply -k deployment/kubernetes/`, which includes:

- `thyme-benchmark` namespace: Contains the thyme DaemonSet, nop-collector, and log-generators
- `lgtm` namespace: Grafana LGTM stack for observability (Prometheus, Loki, Tempo)

Resources use kustomize with `labels` (not the deprecated `commonLabels`), as shown in the sketch below.
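For reference, the `labels` form looks roughly like this (a hypothetical sketch of `deployment/kubernetes/kustomization.yaml`; resource paths and label pairs are assumptions):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  - thyme-benchmark/
  - lgtm/

# "labels" replaces the deprecated "commonLabels" field
labels:
  - pairs:
      app.kubernetes.io/part-of: thyme-benchmark
    includeSelectors: false   # don't rewrite selectors on existing workloads
```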
When querying metrics in Grafana, use these correct metric names (v0.144.0+):
- `otelcol_receiver_accepted_log_records_total` - Logs received
- `otelcol_exporter_sent_log_records_total` - Logs exported
- `otelcol_exporter_send_failed_log_records_total` - Export failures
- `otelcol_processor_refused_log_records_total` - Memory limiter refusals
- `otelcol_process_cpu_seconds_total` - CPU usage (rate this to get cores)
- `otelcol_process_runtime_heap_alloc_bytes` - Memory usage
- `otelcol_processor_batch_batch_send_size_*` - Batch size stats

Important: Process metrics have the `otelcol_` prefix. Go runtime metrics like `process_runtime_go_goroutines` are NOT exported.
When updating OTel Collector versions:
- Update `distributions/thyme/manifest.yaml`:
  - Collector version: `dist.version` and all component versions (e.g., `v0.144.0`)
  - Provider versions: `providers[*].gomod` (e.g., `v1.50.0`)
- Update `distributions/thyme/Makefile`: the `OCB_VERSION` variable
- Update `Dockerfile`: `ARG OCB_VERSION=...`
- Rebuild: `make clean && make build`
Symptom: `grpc: received message larger than max (4630191 vs. 4194304)`

Fix: Increase `max_recv_msg_size_mib` on the receiving collector's OTLP receiver:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 16
```

Symptom: Readiness probes fail with "connection refused"

Fix: Extensions must bind to `0.0.0.0`, not `localhost`:

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
```

Filelog receiver parses containerd/docker/cri-o formats. Simplify operators with `on_error: drop` to avoid pipeline failures on malformed logs.
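A sketch of that pattern (hedged — the include path and operator choice are assumptions; recent filelog receivers ship a `container` operator that covers all three formats):

```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    operators:
      - type: container   # auto-detects containerd / docker / cri-o line formats
        on_error: drop    # drop malformed lines instead of stalling the pipeline
```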
For local development and quick testing:
- Build and load into k3d:

  ```bash
  k3d cluster create thyme-test --agents 2
  make k3d-load K3D_CLUSTER=thyme-test
  ```

- Deploy everything:

  ```bash
  kubectl apply -k deployment/kubernetes/
  kubectl wait --for=condition=ready pod -l app=lgtm -n lgtm --timeout=120s
  ```

- Access Grafana:

  ```bash
  kubectl port-forward -n lgtm service/grafana 3000:3000
  ```

- Monitor throughput (Explore → Prometheus):

  ```promql
  rate(otelcol_receiver_accepted_log_records_total{service_name="nop-collector"}[1m])
  ```

- Cleanup:

  ```bash
  kubectl delete -k deployment/kubernetes/
  k3d cluster delete thyme-test
  ```
Thyme supports production-scale benchmarking on AWS EKS with automated infrastructure provisioning, deployment, and teardown.
The easiest way to run an AWS benchmark:
```bash
# Full automated benchmark (provision → deploy → run → collect → cleanup)
./scripts/run-benchmark-aws.sh

# Custom active duration (default 30 minutes)
./scripts/run-benchmark-aws.sh 60

# Keep infrastructure running after benchmark
AUTO_CLEANUP=false ./scripts/run-benchmark-aws.sh
```

What the script does:
- Provisions EKS cluster with OpenTofu (~15 min)
- Deploys thyme + LGTM stack (~5 min)
- Runs benchmark with 3 phases:
- Ramp-up: 5 minutes (system stabilization)
- Active: 30 minutes (performance measurement)
- Cool-down: 10 minutes (tail behavior observation)
- Collects metrics via LoadBalancer
- Generates a comprehensive report in `local/reports/YYYY-MM-DD-NN-aws/`
- Destroys infrastructure (~10 min, optional)
Total runtime: ~75 minutes for the default 30-minute active benchmark
Estimated cost: $2.50/hour (~$3.00 for a full run)
For interactive testing or development:
```bash
cd infrastructure/aws

# Initialize OpenTofu (first time only)
tofu init

# Provision EKS cluster (~15 minutes)
tofu apply
```

Infrastructure includes:
- EKS cluster (Kubernetes 1.34)
- 3× m6i.2xlarge nodes (8 vCPU, 32GB RAM each)
- VPC with public/private subnets across 3 AZs
- NAT gateway for private subnet egress
- ECR repository (optional)
- IAM roles and security groups
```bash
aws eks update-kubeconfig --region eu-central-1 --name thyme-benchmark-<timestamp>

# Verify nodes ready
kubectl wait --for=condition=ready node --all --timeout=300s
kubectl get nodes
```

```bash
# Deploy thyme + LGTM stack
kubectl apply -k deployment/aws/

# Wait for LGTM
kubectl wait --for=condition=ready pod -l app=lgtm -n lgtm --timeout=180s

# Wait for collectors
kubectl wait --for=condition=ready pod -l app=thyme -n thyme-benchmark --timeout=120s
kubectl wait --for=condition=ready pod -l app=nop-collector -n thyme-benchmark --timeout=120s

# Wait for log generators (co-located on a single node)
kubectl wait --for=condition=ready pod -l app=log-generator -n thyme-benchmark --timeout=180s
```

```bash
# Get LoadBalancer URL (takes ~2-3 minutes to provision)
LB_URL=$(kubectl get svc grafana -n lgtm -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Grafana: http://$LB_URL:3000"
# Login: admin / admin
```

Query throughput in Grafana (Explore → Prometheus):

```promql
# Throughput (logs/sec)
rate(otelcol_receiver_accepted_log_records_total{service_name="nop-collector"}[1m])

# CPU usage per thyme pod
rate(otelcol_process_cpu_seconds_total{service_name="thyme"}[1m])

# Memory usage (MiB)
otelcol_process_runtime_heap_alloc_bytes{service_name="thyme"} / 1024 / 1024
```
IMPORTANT: Always delete Kubernetes resources first, then wait before destroying infrastructure to avoid 15-20 minute hangs.
```bash
# Delete Kubernetes resources
kubectl delete -k deployment/aws/

# Wait for LoadBalancer deletion (critical!)
sleep 180

# Destroy infrastructure
cd infrastructure/aws
tofu destroy
```

| Aspect | k3d | AWS EKS |
|---|---|---|
| Infrastructure | Local Docker | Real cloud VMs |
| Grafana Access | NodePort 30000 | LoadBalancer (NLB) |
| Log-Gen Placement | Distributed | Co-located (single node) |
| Setup Time | ~5 minutes | ~20 minutes |
| Cost | Free | ~$2.50/hour |
| Use Case | Development, quick tests | Production-scale validation |
Why co-locate log generators on AWS? Tests thyme DaemonSet at maximum single-node capacity (100k logs/sec from one node), simulating realistic "hot node" scenarios.
Infrastructure configuration: `infrastructure/aws/terraform.tfvars`

```hcl
cluster_name          = "thyme-benchmark-<timestamp>"
aws_region            = "eu-central-1"
kubernetes_version    = "1.34"
node_instance_type    = "m6i.2xlarge"
node_desired_capacity = 3

# Cost optimization
enable_cluster_logging = false
enable_nat_gateway_ha  = false
```

Customization options:
- Instance types: `m6i.xlarge` (cheaper), `m6i.4xlarge` (more capacity)
- Node count: Scale up for higher-throughput tests
- Spot instances: Set in `eks-node-group.tf` for ~70% cost savings (see the sketch below)
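For the spot option, EKS managed node groups accept a capacity type (a hypothetical sketch; the actual resource name and variable wiring in `eks-node-group.tf` may differ):

```hcl
resource "aws_eks_node_group" "benchmark" {
  # cluster_name, node_role_arn, and subnet_ids elided

  capacity_type  = "SPOT"   # default is "ON_DEMAND"
  instance_types = ["m6i.2xlarge"]

  scaling_config {
    desired_size = 3
    min_size     = 3
    max_size     = 3
  }
}
```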
LoadBalancer stuck in `<pending>`:

- Check subnet tags: `aws ec2 describe-subnets --filters "Name=tag:Name,Values=*thyme-benchmark*"`
- Verify subnets have the `kubernetes.io/role/elb` tag
- Check events: `kubectl get events -n lgtm --sort-by='.lastTimestamp'`
Pods not co-locating:
- Verify: `kubectl get pods -n thyme-benchmark -l app=log-generator -o wide`
- All 100 log-gen pods should be on the same node
- If spread: increase instance type or reduce replicas
`tofu destroy` hangs:
- You deleted K8s resources first: ✓
- You waited 3+ minutes after deletion: ✓
- If still hanging: manually delete LoadBalancer in AWS console
```bash
# Real-time cost estimate
aws ce get-cost-and-usage \
  --time-period Start=2026-02-04,End=2026-02-05 \
  --granularity HOURLY \
  --metrics UnblendedCost \
  --filter file://<(echo '{"Tags":{"Key":"Project","Values":["Thyme"]}}')
```

Hourly breakdown:
- EKS control plane: $0.10/hour
- 3× m6i.2xlarge: $1.152/hour
- EBS + NAT + NLB + transfer: ~$0.25/hour
- Total: ~$2.50/hour
See detailed AWS documentation:
- Deployment guide: `deployment/aws/README.md` (400+ lines)
- Infrastructure overview: `infrastructure/aws/README.md`
- Automated script: `scripts/run-benchmark-aws.sh` (670+ lines)
- Manifest: `distributions/thyme/manifest.yaml` - Component definitions
- Configs: `distributions/thyme/config*.yaml` - Collector configurations
- K8s Manifests: `deployment/kubernetes/` - k3d Kubernetes resources
- AWS Manifests: `deployment/aws/` - EKS-specific overlays (LoadBalancer, affinity)
- AWS Infrastructure: `infrastructure/aws/` - OpenTofu/Terraform for EKS provisioning
- Benchmark Script: `scripts/run-benchmark-aws.sh` - Automated AWS benchmarking
- Kustomization: `deployment/kubernetes/kustomization.yaml` - Includes both thyme-benchmark and LGTM
- Binary Output: `distributions/thyme/build/thyme`
- Generated Source: `distributions/thyme/build/*.go` (gitignored)
- Benchmark Reports: `local/reports/YYYY-MM-DD-NN-aws/` - Generated after AWS runs
- Export Format Benchmark: `tools/exportbench/` - Compares OTLP/STEF serialization formats
- No production use - Thyme is for benchmarking only; use Tulip for production
- Realistic simulation - Two-stage architecture mimics real edge collection patterns
- Observable - Both collectors export internal telemetry to LGTM for performance analysis
- Multi-environment - Supports local development (k3d) and production-scale validation (AWS EKS)
- High throughput - Configured for 50k-100k logs/sec with large batches (10k records)
- Automated - Full benchmark lifecycle (provision → run → report → cleanup) via single script
Problem: A 10s batch timeout caused massive throughput problems.
Solution: Use 200ms (the OTel default). At 50k+ logs/sec with batch size 10000, batches fill in ~200ms anyway. Long timeouts create unnecessary pipeline stalls.
```yaml
batch:
  send_batch_size: 10000
  timeout: 200ms  # NOT 10s
```

Problem: The k8sattributes processor queries the Kubernetes API and maintains in-memory caches, adding CPU overhead even when metadata is already extracted from file paths.
Solution: For pure throughput benchmarks, remove k8sattributes if you're already extracting namespace/pod/container from file paths via filelog operators. Only use k8sattributes when you need owner references (deployment, statefulset names) or pod labels.
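A sketch of the trimmed pipeline (illustrative; component names are assumptions based on the architecture above):

```yaml
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [batch]   # k8sattributes removed for pure-throughput runs
      exporters: [otlp]
```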
Finding: Reducing `poll_interval` from 200ms to 100ms improved throughput in high-volume scenarios. For 100k+ logs/sec, consider 50-100ms polling.

Trade-off: Lower intervals mean higher CPU usage for file polling.
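Where the knob lives (a minimal sketch; the filelog receiver's `poll_interval` defaults to 200ms):

```yaml
receivers:
  filelog:
    poll_interval: 100ms   # lower = fresher reads at higher CPU cost
```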
Problem: 100GB disks filled within 30 minutes at 100k logs/sec (~50MB/sec of log data). Kubernetes applies disk-pressure taint, evicting pods.
Solutions:
- Use 500GB+ disks for extended benchmarks
- Add a toleration for `node.kubernetes.io/disk-pressure` on log generators
- Clean up old log files between runs
Problem: With `start_at: beginning`, thyme reads ALL accumulated log files on startup, not just new data. This makes throughput measurements unreliable during catch-up.

Solution: For steady-state benchmarking:

- Use `start_at: end` to only read new logs (see the sketch below)
- Or wait for backlog processing to complete before measuring
- Watch for throughput to stabilize at the expected input rate
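A sketch of the first option:

```yaml
receivers:
  filelog:
    # start_at: beginning   # replays the whole backlog and skews measurements
    start_at: end           # tail only new data; throughput reflects the live input rate
```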
Approach: 100 pods × 1,000 lines/sec (co-located on one node via pod affinity) simulates a busy node with many workloads writing to `/var/log/pods/`. This stresses the filelog receiver's ability to watch many files simultaneously, which is more realistic than a few high-throughput pods.
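The co-location itself is typically expressed as pod affinity; a hypothetical sketch of the log-generator pod spec (label names are assumptions):

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: log-generator
        topologyKey: kubernetes.io/hostname   # all matching pods schedule onto one node
```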
Constraint: EKS max pods per node is ~110 (m6i.2xlarge with VPC CNI prefix delegation). With ~10 system pods, ~100 log-gen pods is the practical maximum per node.
Problem: A hardcoded `nodeSelector` for co-locating log generators failed when that node got a disk-pressure taint.
Solution: Use tolerations instead of (or in addition to) `nodeSelector`:
```yaml
tolerations:
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
```

Problem: `tofu destroy` fails if the ECR repository has images.

Solution: Force delete before destroy:

```bash
aws ecr delete-repository --repository-name thyme --force --region eu-central-1
```