A reproducible, local experiment framework for evaluating five different planning patterns on Aider-based code-editing tasks using single local LLM models (Ollama).
Quick facts:
- Platform: macOS / Linux with Docker & Ollama
- Models: Any Ollama model (default: Qwen 2.5 Coder 7B)
- Patterns: Baseline, Decomposition, MultiPlan, Reflection, RAG Memory
- Scale: Full factorial experiments (5 patterns × N models × K tasks)
- Status: Task-level tracking with incremental result writing
- Reproducibility: All results trackable in CSV with timestamps
If you already have Ollama running with a model ready:
# 1. Install dependencies
bash shared/scripts/setup_env.sh
# 2. Setup benchmark repos (one time)
bash shared/scripts/setup_benchmark.sh
# 3. Run full experiment: 5 patterns × 2 models × 10 tasks
bash shared/scripts/run_experiment_orchestrator.sh \
--patterns baseline,decomposition,multiplan,reflection,rag \
--models qwen2.5-coder:7b-instruct,qwen2.5-coder:32b-instruct \
--tasks 10
# 4. Check results
cat experiments/run_table.csvThis section walks through setting up on a brand new machine with nothing installed.
# 1a. Install Homebrew (if not already installed)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# 1b. Install required tools
brew install python@3.11 git docker
# 1c. Install Ollama (for local LLM)
# Download from: https://ollama.ai
# OR via Homebrew:
brew install ollama# 1a. Install required tools
sudo apt-get update
sudo apt-get install -y python3 python3-pip git docker.io
# 1b. Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# 1c. Start Docker (may need sudo)
sudo systemctl start docker
sudo usermod -aG docker $USER # Add user to docker group
newgrp docker # Apply group membershipThis framework is optimized for macOS/Linux.
For Windows, use WSL2 (Windows Subsystem for Linux) and follow Linux instructions.
# Run the prerequisite check
bash shared/scripts/check_prereqs.shThis verifies:
- ✓ Python 3 installed
- ✓ Git available
- ✓ Docker available
- ✓ Ollama available
Fix any missing tools before proceeding.
# Clone this repository
git clone <REPO_URL> aider-planning-experiments
cd aider-planning-experiments/Aider_Planning_Patterns
# Verify folder structure
ls -la
# Should show: README.md, Baseline/, Reflection/, MultiPlan/, shared/, etc.In a separate terminal (this runs continuously):
# Start Ollama server
ollama serveThe server will listen on http://127.0.0.1:11434 by default.
In your main terminal, verify connectivity:
# Test Ollama is responding (may take 10-20 seconds)
curl http://127.0.0.1:11434/api/tagsExpected output: {"models":[...]}
While keeping Ollama server running, in another terminal:
# Pull the default model (Qwen 2.5 Coder 7B - ~4.7GB)
ollama pull qwen2.5-coder:7b-instruct
# (Optional) Pull additional model for multi-model experiments
ollama pull qwen2.5-coder:32b-instructVerify models are available:
ollama list
# Should show: qwen2.5-coder:7b-instructBack in your main terminal:
# Install Python dependencies
bash shared/scripts/setup_env.sh
# This creates:
# - .venv/ (virtual environment)
# - installs: pandas, csvq, other analysis tools# Clone benchmark repos (this fetches ~500MB)
bash shared/scripts/setup_benchmark.sh
# This clones:
# - Aider-AI/aider (the benchmark harness)
# - Aider-AI/polyglot-benchmark (exercise corpus)
# Into: benchmark/repos/Verify:
ls benchmark/repos/
# Should show: aider/, polyglot-benchmark/Create .env for custom settings:
# Copy template
cp .env.example .env
# Edit .env to customize:
# - OLLAMA_MODEL (default: qwen2.5-coder:7b-instruct)
# - OLLAMA_API_BASE (default: http://127.0.0.1:11434)
# - TASK_TIMEOUT (default: 900 seconds = 15 minutes)
# - AIDER_BENCH_SHUFFLE_TASKS (default: 1 = shuffled)Run one quick test to verify everything works:
# Run baseline on 3 tasks (should complete in ~5-10 minutes)
AIDER_BENCH_NUM_TESTS=3 bash shared/scripts/run_baseline.sh
# Watch for output:
# ✓ Ollama connectivity confirmed
# ✓ Benchmark repos found
# ✓ Running baseline variant...
# ✓ CompletedCheck results:
ls -lah Baseline/results/
# Should show: TIMESTAMP--baseline--qwen2.5-coder-7b-instruct.tasks.csv# Baseline only
bash shared/scripts/run_baseline.sh
# Or any other pattern:
bash shared/scripts/run_reflection.sh
bash shared/scripts/run_decomposition.sh
bash shared/scripts/run_multiplan.sh
bash shared/scripts/run_rag.shExpected time: ~30-50 minutes (depends on model speed)
Run all 5 patterns, 2 model sizes, 10 tasks each = 100 tracked tasks:
bash shared/scripts/run_experiment_orchestrator.sh \
--patterns baseline,decomposition,multiplan,reflection,rag \
--models qwen2.5-coder:7b-instruct,qwen2.5-coder:32b-instruct \
--tasks 10Expected time: ~8-12 hours (5 patterns × 2 models × ~50 min per model)
What happens:
- Iterates through each (pattern, model) pair
- Switches Ollama model with proper cooldowns
- Runs benchmark with 10 tasks per pattern
- Extracts individual task results from harness output
- Logs each task as separate row in
experiments/run_table.csv(100 total rows) - Skips any runs already completed (resumable on interruption)
Small test (2 patterns, 1 model, 5 tasks = 10 tracked tasks):
bash shared/scripts/run_experiment_orchestrator.sh \
--patterns baseline,reflection \
--models qwen2.5-coder:7b-instruct \
--tasks 5Larger experiment (5 patterns, 3 models, 20 tasks each = 300 tracked tasks):
bash shared/scripts/run_experiment_orchestrator.sh \
--patterns baseline,decomposition,multiplan,reflection,rag \
--models qwen2.5-coder:7b-instruct,qwen2.5-coder:32b-instruct,deepseek-coder:7b \
--tasks 20All patterns, single model (useful for model comparison later):
bash shared/scripts/run_experiment_orchestrator.sh \
--patterns baseline,decomposition,multiplan,reflection,rag \
--models qwen2.5-coder:7b-instruct \
--tasks 10All experiments write individual task results to:
experiments/run_table.csv
Structure:
run_id,experiment_id,pattern,model,task_name,status,energy_kwh,duration_seconds,pass_rate,timestamp
20260413-120500--baseline--qwen2.5-coder-7b-instruct,20260413-120500,baseline,qwen2.5-coder:7b-instruct,accumulate,COMPLETED,0.000206,14.596,1,2026-04-13T12:05:00Z
20260413-120500--baseline--qwen2.5-coder-7b-instruct,20260413-120500,baseline,qwen2.5-coder:7b-instruct,acronym,FAILED,0.000350,27.394,0,2026-04-13T12:06:00Z
...Columns:
run_id: Unique identifier for this pattern/model combinationexperiment_id: Groups multiple runs from same experiment sessionpattern: Which planning pattern (baseline, reflection, etc.)model: Which Ollama model usedtask_name: Individual task name (accumulate, acronym, etc.)status: COMPLETED, FAILED_SETUP, FAILED_EXECUTION, etc.energy_kwh: Energy consumed by this task (from CodeCarbon)duration_seconds: Wall-clock time for this taskpass_rate: 1 if passed, 0 if failedtimestamp: UTC timestamp when logged
Each pattern also writes detailed results:
Baseline/results/TIMESTAMP--baseline--MODEL.tasks.csv
Decomposition/results/TIMESTAMP--decomposition--MODEL.tasks.csv
MultiPlan/results/TIMESTAMP--multiplan--MODEL.tasks.csv
Reflection/results/TIMESTAMP--reflection--MODEL.tasks.csv
Memory/RAG/results/TIMESTAMP--rag--MODEL.tasks.csv
Each .tasks.csv contains:
- Task name, pass/fail status
- Duration, token counts
- CodeCarbon energy metrics
- Per-LLM-call breakdown (pattern-specific)
View experiment summary:
# Show all completed runs
head experiments/run_table.csv
# Count passes per pattern
tail -n +2 experiments/run_table.csv | cut -d',' -f3,9 | sort | uniq -c
# Pass rate by pattern+model
tail -n +2 experiments/run_table.csv | \
awk -F',' '{print $3"/"$4": "$9}' | sort | paste -sd, - | column -t -s','Using csvq for advanced queries:
# Total pass rate
csvq "SELECT SUM(pass_rate) / COUNT(*) as pass_rate FROM experiments/run_table.csv"
# Per-pattern performance
csvq "SELECT pattern, SUM(pass_rate) as passed, COUNT(*) as total FROM experiments/run_table.csv GROUP BY pattern"
# Hardest tasks
csvq "SELECT task_name, COUNT(*) as total, SUM(pass_rate) as passed FROM experiments/run_table.csv GROUP BY task_name ORDER BY passed ASC"
# Energy by pattern (only if > 0)
csvq "SELECT pattern, AVG(energy_kwh) as avg_energy FROM experiments/run_table.csv WHERE energy_kwh > 0 GROUP BY pattern"All five patterns use AIDER's SINGLE-MODEL CODE MODE (not architect/editor dual-model) to isolate planning effects and ensure fair comparison.
| Pattern | Purpose | Key Technique | Model Calls | Token Cost | Implementation |
|---|---|---|---|---|---|
| Baseline | Error-driven baseline | Plain instructions + test errors | 1-2 per task | 1× | Baseline/scripts/baseline_harness.py |
| Decomposition | Incremental sub-planning | "Plan next step" prompts interleaved with execution | 2-4 per cycle | 2-3× | Decomposition/scripts/decomposition_harness.py |
| MultiPlan | Self-consistent selection | Generate 4 plans (temp: 0.3, 0.7, 1.0, 1.5), majority vote | 4 planning + evals | 3-4× | MultiPlan/scripts/multiplan_harness.py |
| Reflection | Error analysis + refinement | Generate → Reflect on errors → Refine fixes → Retry | 3 per cycle (gen + reflect + refine) | 2-5× | Reflection/scripts/reflection_harness.py |
| RAG Memory | In-context learning | Retrieve 3 similar solutions, inject as examples | 1 + retrieval | 1.1× | Memory/RAG/scripts/rag_harness.py |
operationalization: Pure error-driven retry. No reflection, decomposition, or memory.
- Prompt 1 (Initial): Task instructions + file list
- Prompt 2+ (Errors): Previous code + test error output
Example flow:
try 1: Generate code → Test fails (syntax error)
Receive error: "NameError: unfined variable X"
try 2: Refine code using error feedback → Test fails (logic error)
try 3: Final refinement attempt
Config: max 2 attempts (1 initial + 1 retry)
Operationalization: Interleaved planning and execution. Breaks task into steps.
- Planning phase: "Plan the next step to "
- Execution phase: "Implement the plan: <step_description>"
- Cycle: Generate plan → test → on error, generate new plan
Example flow:
cycle 1: Plan: "Parse input with regex"
Execute: "Write parser code"
Test → fails
cycle 2: Analyze error → Plan: "Add error handling"
Execute: "Refactor parser"
Test → passes
Config: max 4 planning/execution cycles
Operationalization: Generate multiple candidate plans with temperature sampling, select best via majority vote.
- Temperature 0.3 (deterministic): Conservative plan
- Temperature 0.7 (creative)
- Temperature 1.0 (baseline)
- Temperature 1.5 (diverse)
Each plan executed independently, winner = passes most tests
Example flow:
plan_0.3: Conservative approach → tests: 7/10 pass
plan_0.7: Try clever optimization → tests: 5/10 pass
plan_1.0: Balanced → tests: 7/10 pass
plan_1.5: Creative → tests: 3/10 pass
Result: Majority vote favors 0.3 or 1.0
Config: 4 temperature levels, all executed
Operationalization: Failed attempts trigger reflection phase. Analyze failures, generate fixes.
- Phase 1: Generate initial code
- Phase 2 (on error): "Reflect: Why did these tests fail? Root causes?"
- Phase 3: "Refine: What specific fixes address each root cause?"
- Phase 4: Implement refined version
Example flow:
try 1: Generate → Test fails: "Error in line 23"
Reflect: "Type mismatch between input parsing and calculation"
Refine: "Add type casting before calculation"
try 2: Code with reflection → Test fails: "Edge case not handled"
try 3: Code with 2 reflections → Test passes
Config: max 3 attempts (allows 2 reflection cycles)
Operationalization: Retrieve similar solved tasks, inject as in-context examples.
- Retrieval: Find 3 most similar past solutions (by code structure/task description)
- Injection: Add as "Here are similar solved tasks:" examples
- Execution: Same as baseline, but with augmented context
Example flow:
Task: "Implement accumulate function"
Retrieve: Similar functions from past tasks
Inject: "Here's how sum() was implemented, pattern is similar"
Generate: Code learned from examples
Config: Top-3 retrieval, no additional model calls beyond generation
All patterns deliberately use AIDER's single-model code mode (not architect/editor dual-model):
- Isolates planning effects: Pattern innovation is the independent variable
- Fair comparison: Same model, same task scope across all patterns
- Token budget control: Can't blame model size for differences
- Reproducibility: No hidden reasoning from architect model
# Check if Ollama service is running
curl http://127.0.0.1:11434/api/tags
# If hung or slow:
# 1. Kill the process
pkill -f "ollama serve"
# 2. Wait 10 seconds
sleep 10
# 3. Restart
ollama serve# Re-setup benchmark repos
bash shared/scripts/setup_benchmark.sh
# Verify
ls benchmark/repos/aider/benchmark/
ls benchmark/repos/polyglot-benchmark/# Ensure Docker is running
docker ps # Should not error
# If error "Cannot connect to Docker daemon":
# macOS: Open Docker application
# Linux: sudo systemctl start docker
# If permission denied:
# Linux: sudo usermod -aG docker $USER && newgrp docker# Check model pull
ollama list
# If pull interrupted, resume:
ollama pull qwen2.5-coder:7b-instruct
# Try alternative model (faster):
ollama pull mistral:7b# Increase timeout in .env:
TASK_TIMEOUT=1800 # 30 minutes instead of 15
# Rerun
bash shared/scripts/run_baseline.sh# Check if harness created output CSV
ls -la Baseline/results/
# If missing, check logs:
tail -50 benchmark/runs/*/run.log
# Ensure pattern-specific harness has write permissions:
chmod +x Baseline/scripts/baseline_harness.py
chmod +x Reflection/scripts/reflection_harness.py# Current model may be stuck
# Kill Ollama and restart:
pkill -f ollama
sleep 5
ollama serve # In separate terminal
# Resume experiment (it will skip completed runs)
bash shared/scripts/run_experiment_orchestrator.sh ...Aider_Planning_Patterns/
├── README.md ← This file
├── EXPERIMENT_ORCHESTRATOR.md ← Detailed orchestrator reference
├── ORCHESTRATOR_REDESIGN_SUMMARY.md ← Design notes (task-level tracking)
│
├── shared/ ← Common utilities
│ ├── scripts/
│ │ ├── run_experiment_orchestrator.sh ← Main entry point
│ │ ├── check_prereqs.sh ← Verify dependencies
│ │ ├── setup_env.sh ← Setup .venv
│ │ ├── setup_benchmark.sh ← Clone benchmark repos
│ │ └── lib/common.sh ← Shared functions
│ └── config/
│ ├── defaults.env ← Default settings
│ └── profiles/ ← (Future) Named configs
│
├── Baseline/
│ ├── scripts/baseline_harness.py ← Pattern implementation
│ ├── scripts/collect_results.py ← Result aggregation
│ ├── README.md ← Detailed pattern docs
│ └── results/ ← Output CSVs/JSONs
│
├── Reflection/
│ ├── scripts/reflection_harness.py ← Reflection + refinement impl.
│ ├── README.md
│ └── results/
│
├── MultiPlan/
│ ├── scripts/multiplan_harness.py ← Multi-plan selection impl.
│ ├── README.md
│ └── results/
│
├── Decomposition/
│ ├── scripts/decomposition_harness.py ← Interleaved planning impl.
│ ├── README.md
│ └── results/
│
├── Memory/ (RAG)
│ ├── RAG/
│ │ ├── scripts/rag_harness.py ← RAG memory impl.
│ │ ├── README.md
│ │ └── results/
│
├── benchmark/
│ ├── repos/ ← Aider + polyglot-benchmark
│ │ ├── aider/
│ │ └── polyglot-benchmark/
│ └── runs/ ← Run artifacts per execution
│
├── experiments/
│ └── run_table.csv ← Master tracking (all tasks)
│
└── .env.example ← Configuration template
Q: Can I run multiple patterns in parallel? A: No, the orchestrator runs sequentially to keep energy tracking clean. This also prevents Ollama resource contention.
Q: What if a run fails midway? A: The orchestrator tracks completed (pattern, model) pairs. Re-run the same command and it will skip completed runs and resume from the failure point.
Q: How long does each pattern take? A: Roughly:
- Baseline: 5-10 min per 10 tasks (simple loop)
- Reflection: 15-25 min per 10 tasks (3 attempts with reflection)
- Decomposition: 20-30 min per 10 tasks (multiple planning cycles)
- MultiPlan: 30-40 min per 10 tasks (4 plans × execution)
- RAG: 5-15 min per 10 tasks (baseline + retrieval overhead)
Q: Can I use different models? A: Yes! Any Ollama model. Best paired models:
- Small:
mistral:7b,qwen2.5-coder:7b - Large:
llama2-uncensored:13b-q5_K_M,qwen2.5-coder:32b
Q: Are results reproducible?
A: Mostly. Same model + same task order = mostly deterministic, but temperature-based sampling (multiplan) and reflection logic introduce minor variation. For true reproducibility, set AIDER_BENCH_SHUFFLE_TASKS=0.
Q: What's the energy tracking?
A: CodeCarbon integration (placeholder values for now). When fully integrated, energy_kwh column shows real consumption. Use for comparing pattern efficiency.
- Detailed Orchestrator Reference: EXPERIMENT_ORCHESTRATOR.md
- Task-Level Design Notes: ORCHESTRATOR_REDESIGN_SUMMARY.md
- Pattern-Specific Details: See
[Pattern]/README.mdfor each planning pattern - Baseline/Reflection/MultiPlan: Original implementations documented in each pattern folder
To add a new planning pattern:
- Create
NewPattern/scripts/newpattern_harness.py(see baseline as template) - Add result aggregation to pattern harness
- Create
NewPattern/README.mdwith operationalization details - Add run script:
shared/scripts/run_newpattern.sh - Update orchestrator: Add pattern to
get_run_script()andget_harness_script()functions - Test:
bash shared/scripts/run_experiment_orchestrator.sh --patterns baseline,newpattern --models qwen2.5-coder:7b-instruct --tasks 3
[Add your license here]
Last Updated: April 13, 2026
Status: Stable (task-level tracking, 5 patterns operationalized, full orchestrator)
Maintainer: [Team Name]