feat: gate pilot — LLM at batch decision boundaries#1
Draft
Conversation
Adds a gate pilot that replaces the simple replan() call after batch failures. The pilot reads disk artifacts (verify logs, QA verdicts, task summaries, learnings) and returns structured decisions: failure analysis, retry strategies, routed context for upcoming tasks, skip recommendations, and re-batching. Key design: - Stateless: reconstructs context from files each invocation - No telephone game: pilot makes system-level decisions, coding agents interpret their own errors directly - Structured JSON output, orchestrator validates and applies - Same model as planner (configurable via planner_model) - Falls back to replan() on parse failure - Config flag: pilot: false in otto.yaml to disable - Zero overhead when no failures (pilot only invoked at batch boundary with failures + remaining tasks) Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED. Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead. Pilot not yet validated on real failures — shipping as safe no-op upgrade for i2p readiness. Will prove value at scale (5+ tasks, multiple batches). New files: - otto/pilot.py — context assembly, LLM invocation, decision parsing - tests/test_pilot.py — 22 unit tests - tests/test_pilot_benchmark.py — 6 scenario benchmark tests - bench/pilot-benchmark.sh — A/B benchmark runner - bench/pressure/projects/pilot-test-* — 3 synthetic test projects Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Supersedes the gates + gate pilot approach. Simplified to 5 steps: classify → plan → execute → verify → fix-or-replan. Key decisions: - Single-task is a valid plan (no forced decomposition) - Product artifacts at project root (not otto_arch/) - Persistent context.md accumulates across tasks - Vertical slices over horizontal layers - User journeys from user's perspective, not feature list - Fix rounds continue while making progress, replan on planning failures - Codex-reviewed design Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
replan()after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendationsreplan()on failure. Config flagpilot: falseto disable. Zero overhead when no failures.Status: NOT validated on real failures
The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.
What's new
otto/pilot.pyotto/orchestrator.pyotto/runner.pytests/test_pilot.pytests/test_pilot_benchmark.pybench/pilot-benchmark.shbench/pressure/projects/pilot-test-*Design docs
docs/superpowers/specs/2026-03-29-gate-pilot.mddocs/superpowers/plans/2026-03-29-gate-pilot-stage1.mddocs/superpowers/specs/2026-03-26-otto-intent-to-product.mdTest plan
pilot.logon next failure.🤖 Generated with Claude Code