## Problem

Current `--dry-run` benchmarks are fake. They run against hardcoded static mock state (3 colonists, 8000 wealth, never changes), and the mock LLM returns `no_action` every tick, so the metrics are flat lines. This tests JSON parse rate and pipeline plumbing, not colony management.

Real benchmarks require a live RimWorld instance, and the leaderboard (#8) requires running 6 scenarios × N models × 4+ runs = hundreds of game sessions with statistical rigor. That is not feasible manually.
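To make the problem concrete, here is a sketch of what the `--dry-run` path amounts to today. Field and function names are hypothetical; the values (3 colonists, 8000 wealth, `no_action`) are from the description above:

```python
# Illustrative only: names are hypothetical, but this is the shape of what
# --dry-run exercises. The state never changes between ticks, so every
# metric derived from it is a flat line.
MOCK_STATE = {
    "colonists": 3,   # hardcoded, never grows or shrinks
    "wealth": 8000,   # hardcoded, never changes
}

def mock_llm(prompt: str) -> dict:
    # The mock model ignores the prompt entirely.
    return {"action": "no_action"}
```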
## Solution
HeadlessRim + HeadlessRimPatch + RIMAPI in Docker. Ilya confirmed RIMAPI endpoints work headlessly (HeadlessRim#1) and published HeadlessRimPatch the same day. Save files load via RIMAPI endpoints + Docker mount.
Our existing scoring, evaluation, and paired comparison code works unchanged — just swap mock transport for real HTTP to the containerized game.
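The transport swap is small in code terms. A minimal sketch, assuming the mock transport exposes a `get_state`/`send_action` interface; the `/state` and `/action` routes are placeholders, not confirmed RIMAPI endpoints:

```python
import requests

RIMAPI_URL = "http://localhost:8765"  # RIMAPI port from the architecture below

class HttpTransport:
    """Same interface as the mock transport, but backed by the containerized
    game. Endpoint paths here are assumptions, not confirmed RIMAPI routes."""

    def get_state(self) -> dict:
        resp = requests.get(f"{RIMAPI_URL}/state", timeout=10)
        resp.raise_for_status()
        return resp.json()

    def send_action(self, action: dict) -> None:
        resp = requests.post(f"{RIMAPI_URL}/action", json=action, timeout=10)
        resp.raise_for_status()
```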
## Why this is the leaderboard blocker
The leaderboard vision from #8 ("Chatbot Arena for RimWorld") requires:
| Requirement | Without Docker | With Docker |
| --- | --- | --- |
| Run 96+ game sessions (6 scenarios × 4 models × 4 runs) | Human babysitting a PC for hours | Scripted, unattended |
| Statistical significance (N=4+ per model) | Impractical manually | Automated with parallel containers |
| Reproducibility | Different random events per manual session | Same Docker image + save = same start |
| New model drops, update leaderboard | Redo everything manually | CI triggers full matrix automatically |
| Community submissions | "Trust us, we ran it" | Publish Docker image, anyone can replicate |
Without HeadlessRim, the leaderboard is a one-off manual effort. With it, it's a living automated pipeline.
## Architecture

```
Docker container (HeadlessRim + HeadlessRimPatch + RIMAPI on :8765)
        ↕ REST API + SSE
RLE Python orchestrator (run_benchmark.py)
        ↕ hub-spoke
7 agents (Felix Agent SDK) + any LLM provider
        ↕
Score → paired stats → leaderboard update
```
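One session of the container lifecycle the diagram implies, as a sketch. The image tag, host save path, and in-container mount point are assumptions; only the :8765 port and the save mount come from this issue:

```python
import subprocess

IMAGE = "rimworld-headless:latest"  # hypothetical image tag
SAVE_DIR = "/srv/rimworld/saves"    # hypothetical host path with scenario saves

def start_game() -> str:
    """Start one containerized session, exposing RIMAPI on :8765 and
    mounting the save directory read-only. Returns the container id."""
    result = subprocess.run(
        ["docker", "run", "-d", "--rm",
         "-p", "8765:8765",
         "-v", f"{SAVE_DIR}:/saves:ro",
         IMAGE],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

def stop_game(container_id: str) -> None:
    subprocess.run(["docker", "stop", container_id], check=True)
```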
Leaderboard pipeline:

```
for model in [claude, gpt-4o, nemotron, llama, ...]:
    for scenario in [crashlanded, first_winter, ...]:
        for run in range(4):
            docker start → load save via RIMAPI
            run_benchmark.py --model $model
            docker stop
compute paired stats (agent vs baseline, Cohen's d, p-value)
publish leaderboard
```
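A runnable version of that loop, under stated assumptions: `start_game`/`stop_game` come from the lifecycle sketch above, `run_benchmark` is a stand-in for invoking run_benchmark.py and returning a final score, and the stats use scipy. Cohen's d here is the paired-samples variant (d_z):

```python
from itertools import product
from statistics import mean, stdev
from scipy.stats import ttest_rel  # paired t-test: agent vs baseline

MODELS = ["claude", "gpt-4o", "nemotron", "llama"]
SCENARIOS = ["crashlanded", "first_winter"]  # 6 scenarios in the full matrix
RUNS = 4

def run_matrix(run_benchmark) -> dict:
    """Run every (model, scenario, run) cell against a fresh container."""
    scores = {}
    for model, scenario, run in product(MODELS, SCENARIOS, range(RUNS)):
        cid = start_game()
        try:
            scores[(model, scenario, run)] = run_benchmark(model, scenario)
        finally:
            stop_game(cid)  # always reclaim the container
    return scores

def paired_stats(agent: list, baseline: list):
    """Cohen's d on the paired differences (d_z) plus a paired t-test p-value."""
    diffs = [a - b for a, b in zip(agent, baseline)]
    d = mean(diffs) / stdev(diffs)
    _, p = ttest_rel(agent, baseline)
    return d, p
```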
## Tasks

### Docker image

### Benchmark tooling

- [ ] Rename `--dry-run` to `--smoke-test` (honest about what it measures)
- [ ] Add a `--docker` flag to run_benchmark.py (connect to containerized game)

### CI / leaderboard
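For the CI / leaderboard step, a sketch of the final "publish leaderboard" stage: turning aggregated stats into a markdown table. The input shape and column set are assumptions about what the leaderboard will track:

```python
def publish_leaderboard(results: dict) -> str:
    """Render per-model results as a markdown table for the leaderboard page.
    `results` maps model -> (mean_score, cohens_d, p_value)."""
    rows = ["| Model | Mean score | Cohen's d | p-value |",
            "| --- | --- | --- | --- |"]
    for model, (score, d, p) in sorted(results.items(),
                                       key=lambda kv: kv[1][0], reverse=True):
        rows.append(f"| {model} | {score:.1f} | {d:.2f} | {p:.3g} |")
    return "\n".join(rows)
```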
## References