feat: gate pilot — LLM at batch decision boundaries #1

Draft
logpie wants to merge 2 commits into main from worktree-gate-pilot

logpie (Owner) commented Mar 30, 2026

Summary

  • Adds a gate pilot that replaces the simple replan() call after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendations.
  • Stateless design: the pilot reads disk artifacts and returns structured JSON, which the orchestrator validates and applies.
  • Falls back to replan() if the pilot fails. Set pilot: false in otto.yaml to disable it. Zero overhead when there are no failures.
  • Codex-reviewed (3 rounds, APPROVED). 484 unit tests pass. 53 tasks across 18 e2e runs, 0 regressions.
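
To illustrate the "orchestrator validates and applies" step and the fallback to replan(), here is a minimal sketch. It is not the code in otto/pilot.py: the key names in ALLOWED_KEYS, the parse_decision and decide helpers, and the shape of the replan callback are all assumptions made for the example.

```python
import json
from typing import Any, Callable

# Hypothetical decision keys; the real schema is defined by the pilot module.
ALLOWED_KEYS = {"failure_analysis", "retry_strategies", "routed_context", "skip_tasks"}


def parse_decision(raw: str, known_task_ids: set[str]) -> dict[str, Any]:
    """Parse pilot output and reject anything the orchestrator cannot apply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"pilot output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("pilot output must be a JSON object")
    unknown = set(data) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    skip = data.get("skip_tasks", [])
    if not isinstance(skip, list) or not all(isinstance(t, str) for t in skip):
        raise ValueError("skip_tasks must be a list of task ids")
    missing = [t for t in skip if t not in known_task_ids]
    if missing:
        raise ValueError(f"skip_tasks references unknown tasks: {missing}")
    return data


def decide(
    raw: str,
    known_task_ids: set[str],
    replan: Callable[[], dict[str, Any]],
) -> dict[str, Any]:
    """Use the pilot decision when it validates, otherwise fall back to replan()."""
    try:
        return parse_decision(raw, known_task_ids)
    except ValueError:
        return replan()
```

Validating before applying means a malformed or hallucinated decision can only ever degrade to the pre-pilot behaviour, which is what makes the change a safe upgrade.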

Status: NOT validated on real failures

The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.

What's new

| File | What |
| --- | --- |
| otto/pilot.py | Gate pilot module: context assembly, LLM call, decision parsing |
| otto/orchestrator.py | Pilot at batch boundaries, fallback to replan, config flag |
| otto/runner.py | Pilot guidance separated in retry prompts |
| tests/test_pilot.py | 22 unit tests |
| tests/test_pilot_benchmark.py | 6 scenario tests |
| bench/pilot-benchmark.sh | A/B benchmark runner |
| bench/pressure/projects/pilot-test-* | 3 synthetic test projects |

Design docs

  • Spec: docs/superpowers/specs/2026-03-29-gate-pilot.md
  • Plan: docs/superpowers/plans/2026-03-29-gate-pilot-stage1.md
  • i2p spec: docs/superpowers/specs/2026-03-26-otto-intent-to-product.md

Test plan

  • 484 unit tests pass (0 new failures)
  • Codex adversarial review: 3 rounds, APPROVED
  • 18 e2e runs (6 projects × baseline/pilot): zero overhead, zero regressions
  • 4 real-world combined runs (ufo, humanize, camelcase, pre-commit)
  • 5-task greenfield run with merge conflicts
  • Pending: real-world pilot invocation — needs a run where batch has mixed pass/fail results with remaining tasks. Monitor pilot.log on next failure.

🤖 Generated with Claude Code

Adds a gate pilot that replaces the simple replan() call after batch
failures. The pilot reads disk artifacts (verify logs, QA verdicts,
task summaries, learnings) and returns structured decisions: failure
analysis, retry strategies, routed context for upcoming tasks, skip
recommendations, and re-batching.
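
As a rough sketch of what reading disk artifacts on each invocation could look like (the artifact file names, the per-task directory layout, and the assemble_context helper are hypothetical, not taken from otto/pilot.py):

```python
from pathlib import Path

# Hypothetical artifact file names; the real layout may differ.
ARTIFACTS = ("verify.log", "qa_verdict.json", "summary.md", "learnings.md")


def assemble_context(task_dir: Path, max_chars: int = 4000) -> dict[str, str]:
    """Rebuild pilot context from files only, keeping no state between calls."""
    context: dict[str, str] = {}
    for name in ARTIFACTS:
        path = task_dir / name
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            # Keep the tail of each file, where the most recent output is.
            context[name] = text[-max_chars:]
    return context
```

Because nothing is cached in memory, the same call after a crash or restart yields the same context as long as the files are unchanged.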

Key design:
- Stateless: reconstructs context from files each invocation
- No telephone game: pilot makes system-level decisions, coding agents
  interpret their own errors directly
- Structured JSON output, orchestrator validates and applies
- Same model as planner (configurable via planner_model)
- Falls back to replan() on parse failure
- Config flag: pilot: false in otto.yaml to disable
- Zero overhead when no failures (pilot only invoked at batch boundary
  with failures + remaining tasks)
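
The invocation condition in the last two bullets can be sketched as a single predicate. The BatchResult type, the should_invoke_pilot name, and the default of pilot being enabled when the key is absent are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    """Hypothetical summary of one finished batch."""
    failed: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)


def should_invoke_pilot(config: dict, result: BatchResult) -> bool:
    """Call the pilot only when it is enabled and has something to decide."""
    # Assumption: the pilot is on unless otto.yaml sets pilot: false.
    if not config.get("pilot", True):
        return False
    # No failures, or no tasks left to steer: skip the LLM call entirely.
    return bool(result.failed) and bool(result.remaining)
```

With this shape, a fully passing run never reaches the LLM call, which matches the zero-overhead claim.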

Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED.
Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead.
Pilot not yet validated on real failures — shipping as safe no-op upgrade
for i2p readiness. Will prove value at scale (5+ tasks, multiple batches).

New files:
- otto/pilot.py — context assembly, LLM invocation, decision parsing
- tests/test_pilot.py — 22 unit tests
- tests/test_pilot_benchmark.py — 6 scenario benchmark tests
- bench/pilot-benchmark.sh — A/B benchmark runner
- bench/pressure/projects/pilot-test-* — 3 synthetic test projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai (bot) commented Mar 30, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6de94498-333a-48d3-b2f0-f8f8313d2328

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Supersedes the gates + gate pilot approach. Simplified to 5 steps:
classify → plan → execute → verify → fix-or-replan.

Key decisions:
- Single-task is a valid plan (no forced decomposition)
- Product artifacts at project root (not otto_arch/)
- Persistent context.md accumulates across tasks
- Vertical slices over horizontal layers
- User journeys from user's perspective, not feature list
- Fix rounds continue while making progress, replan on planning failures
- Codex-reviewed design
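
One possible reading of "fix rounds continue while making progress" is sketched below. It assumes progress is measured as a falling count of failing checks and treats a stalled round as the signal to replan; both are assumptions, and the fix_or_replan helper and its callbacks are hypothetical.

```python
from typing import Callable


def fix_or_replan(
    verify: Callable[[], int],
    fix: Callable[[], None],
    max_rounds: int = 5,
) -> str:
    """Return "passed", or "replan" once a fix round stops reducing failures.

    verify() returns the number of failing checks (assumed progress metric).
    """
    failures = verify()
    for _ in range(max_rounds):
        if failures == 0:
            return "passed"
        fix()
        remaining = verify()
        if remaining >= failures:
            # No progress this round: stop fixing and hand back to planning.
            return "replan"
        failures = remaining
    return "passed" if failures == 0 else "replan"
```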

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>