Multi-agent development pipeline orchestrator. Turns a one-line task description into shipped, reviewed, documented code via 27 specialized AI agents running through a deterministic Python pipeline.
Status: Pre-1.0. Battle-tested on a personal Python project (Claude CLI integration). Schema designed to be project-agnostic — should work on any git repo with appropriate config; cross-project validation is in progress.
You hand code-loops a task (free-text or markdown file). It runs that
task through a 12-stage pipeline of specialized agents:
PRD → Business Analyst writes a structured product brief with NFR gate
Research → 5 specialists scan codebase / prompts / incidents / data / AI surface in parallel
Design → Software Architect drafts an RFC; perspective lenses critique;
debate-arbiter judges convergence (theme-based, not bug-count)
Design Review → Safety + Elegance + Hallucination + AI critics review;
Architect responds; Review-Arbiter emits verdict
(approved / needs_revision / redesign_needed)
Impl Plan → Tech Lead decomposes design into atomic, file-disjoint subtasks
(TDD-ordered, dependency-aware, optional wave grouping)
Implementation → For each subtask: optional Prompt Engineer / Dataset Curator / Eval Engineer →
QA Engineer writes failing tests (locked chmod 444) →
Software Engineer implements →
Code Reviewer audits diff →
Triage Engineer routes failures (max 3 attempts/target)
Validation → Programmatic gate: pytest, ruff, file-coverage check (no LLM)
Regression → Conditional eval-bench gate (no LLM). Off by default; opt-in
via project.yaml. Runs the project's bench, compares each
metric vs saved baseline, fails if any drops > threshold_pct.
First run captures baseline.
Release Review → Release Manager gates semantic compliance vs PRD/RFC.
Can issue corrective_subtasks → engine re-enters Implementation
Release Docs → Tech Writer produces changelog + ADR + maintenance notes
(flags brief.md staleness)
Auto Resurvey → If maintenance notes flagged staleness, project-surveyor
regenerates projects/<name>/brief.md automatically (else skip)
At 4 stages you get a human-review checkpoint (approve / abort / revise with comment). Auto-loops handle two failure modes: critique detecting a patching anti-pattern bubbles back to design with a redesign signal; release-review detecting missing implementation appends corrective subtasks and re-enters Implementation.
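The Regression gate in the stage list above boils down to a per-metric comparison against a saved baseline. A minimal sketch of that logic follows; it is not the project's actual implementation. It assumes that every metric is "higher is better" and that the drop is measured as a relative percentage of the baseline. The function names are hypothetical; only `threshold_pct` comes from the description above.

```python
from __future__ import annotations


def regression_failures(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold_pct: float,
) -> list[str]:
    """Return the metrics whose relative drop vs baseline exceeds threshold_pct.

    Assumption of this sketch: higher values are better for every metric.
    """
    failures = []
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # nothing meaningful to compare; skipped in this sketch
        drop_pct = (base - current[name]) / abs(base) * 100
        if drop_pct > threshold_pct:
            failures.append(name)
    return failures


def run_gate(baseline, current, threshold_pct):
    """Return (baseline_to_keep, failing_metrics)."""
    if baseline is None:
        # First run: no saved baseline yet, so capture it and pass.
        return current, []
    return baseline, regression_failures(baseline, current, threshold_pct)
```

With a 5% threshold, a drop from 0.90 to 0.80 (about 11%) fails the gate, while a drop from 0.80 to 0.79 (1.25%) passes.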
⚠️ Python-only today. The validation stage and several agent prompts hardcode pytest + ruff. The orchestrator itself, project config, worktree management, and TDD loop are language-agnostic in shape, but a Node.js / Go / Rust user would hit failures in the validation gate. Multi-language support (configurable test_command / lint_command per project) is the top roadmap item; see Roadmap.
- Python ≥ 3.11
- uv (Python package manager)
- Anthropic Claude CLI on your $PATH, authenticated. The orchestrator shells out to claude --print for every agent invocation.
- git (worktrees require ≥ 2.5)
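Before a first run it can help to confirm that the tools listed above are actually on `$PATH`. The snippet below is a hypothetical preflight helper, not part of code-loops. It only checks that the executables can be found; it does not verify versions or that the Claude CLI is authenticated.

```python
import shutil


def missing_tools(required=("claude", "git", "uv")):
    """Return the required executables that cannot be found on $PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from $PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```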
uv tool install git+https://github.com/icyberdeveloper/code-loops.git
code-loops --help
Now code-loops is on your $PATH everywhere. pipeline.yaml and the
27 agent prompts ship inside the wheel as package data.
git clone https://github.com/icyberdeveloper/code-loops.git
cd code-loops
uv sync
uv run code-loops --help
Use this if you want to modify agent prompts or pipeline.yaml.
code-loops creates tasks/ and projects/ subdirectories in your
current working directory. Run it from a directory you have write
access to (e.g. ~/code-loops-workspace). Override via
$CODE_LOOPS_WORKSPACE env var.
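The workspace rule above (environment override first, current directory otherwise) can be sketched as follows. This is an illustration of the described behavior, not the project's code; `resolve_workspace` and `workspace_dirs` are hypothetical names.

```python
import os
from pathlib import Path


def resolve_workspace(env=None, cwd=None) -> Path:
    """Pick the workspace root: $CODE_LOOPS_WORKSPACE if set, else the CWD."""
    env = os.environ if env is None else env
    override = env.get("CODE_LOOPS_WORKSPACE")
    if override:
        return Path(override).expanduser()
    return Path(cwd) if cwd is not None else Path.cwd()


def workspace_dirs(root: Path):
    """The two subdirectories created inside the workspace."""
    return root / "tasks", root / "projects"
```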
That's it for the orchestrator itself. Real cost lives in Anthropic API usage during pipeline runs (see Costs below).
code-loops operates on a target project — the codebase you want
the pipeline to evolve. You bootstrap it once:
uv run code-loops init /absolute/path/to/your/project
This:
- Creates projects/<name>/project.yaml with name + base_repo.
- Invokes the project-surveyor agent against your repo (~$0.30–3.00, 1–6 min, depending on project size). The surveyor scans README / CLAUDE.md / source tree and writes projects/<name>/brief.md documenting architecture, layout, key modules, storage layer, RAG/vector search (if any), conventions, domain glossary, external integrations, and rules every downstream agent should follow.
The brief is auto-loaded into every code-loops agent via the
{PROJECT_BRIEF} placeholder — agents work with full project context
without you having to feed it manually.
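The placeholder substitution described above can be pictured as a plain string replacement at prompt-load time. This is a sketch only; the real loader may differ, and `inject_brief` is a hypothetical name. It uses `str.replace` rather than `str.format`, on the assumption that prompts may contain other literal braces (for example JSON output schemas) that must be left alone.

```python
PLACEHOLDER = "{PROJECT_BRIEF}"


def inject_brief(prompt_text: str, brief_text: str) -> str:
    """Substitute the project brief into an agent prompt.

    str.replace leaves every other brace in the prompt untouched.
    """
    return prompt_text.replace(PLACEHOLDER, brief_text)
```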
If you have multiple projects, pass --project <name> on subsequent
commands; with one project it's auto-selected.
To skip the surveyor LLM call (dev/cheap path):
uv run code-loops init /path/to/project --no-survey
Brief gets a placeholder. You can edit it manually or run
uv run code-loops resurvey <name> later.
Pass either a free-text description or a path to a .md file. The CLI
auto-detects:
uv run code-loops new "Add /export-data command for weekly meeting export"
uv run code-loops new path/to/postmortems/2026-05-06_timeout_incident.md
uv run code-loops new ~/notes/feature_idea.md
Mode (feature vs from_problem) is auto-detected from path keywords
(problem, postmortem, incident) and content markers (## Postmortem,
## Incident, ## Problem, ## Symptoms, etc.; Russian equivalents are
also recognized for bilingual input).
Output: tasks/<NNNN>_<slug>/ with task.md and meta.yaml.
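The mode auto-detection described above could look roughly like the sketch below. It covers only the English keywords and markers named in the text; the Russian equivalents and any markers hidden behind "etc." are left out. Case-insensitive matching and the function name are assumptions of this sketch.

```python
from __future__ import annotations

PATH_KEYWORDS = ("problem", "postmortem", "incident")
CONTENT_MARKERS = ("## postmortem", "## incident", "## problem", "## symptoms")


def detect_mode(path: str | None, content: str) -> str:
    """Classify a task as 'from_problem' or 'feature'."""
    # 1. Path keywords (only relevant when the input was a file path).
    if path and any(keyword in path.lower() for keyword in PATH_KEYWORDS):
        return "from_problem"
    # 2. Content markers: a heading line starting with one of the markers.
    for line in content.splitlines():
        if line.strip().lower().startswith(CONTENT_MARKERS):
            return "from_problem"
    return "feature"
```

Under these assumptions, the postmortem path from the example commands above resolves to from_problem, while the free-text "Add /export-data command..." description resolves to feature.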
uv run code-loops run <task_id>
Or with explicit project: --project <name>. Pipeline streams progress
to your terminal and pauses at human-review checkpoints.
uv run code-loops projects # list configured projects
uv run code-loops list # list all tasks (status + cost)
uv run code-loops status <task_id> # single-task progress + cost
uv run code-loops commit <task_id> # print branch + push instructions for a completed task
uv run code-loops cancel <task_id> # mark task cancelled (artifacts preserved)
uv run code-loops resurvey <name> # refresh brief.md after material project changes
uv run code-loops eval # pipeline-evaluator: meta-analysis over recent runs
code-loops/
├── pyproject.toml # package metadata + `code-loops` entry point
├── src/code_loops/ # the package (ships in the wheel as installed data)
│ ├── pipeline.yaml # ⭐ stage definitions (12 stages, types, role bindings)
│ ├── agents/ # 27 agent prompts in 6 family folders:
│ │ ├── strategy/ # business-analyst, tech-lead
│ │ ├── research/ # research-lead + 5 researchers
│ │ ├── architects/ # software-architect + 7 architect-* (perspective, arbiters, 4 critics)
│ │ ├── engineering/ # qa, software, code-reviewer, triage, prompt, eval engineers
│ │ ├── release/ # release-manager, tech-writer
│ │ └── meta/ # pipeline-evaluator, project-surveyor
│ ├── engine.py # orchestrator: loads pipeline, dispatches by stage type,
│ │ # handles auto-loops (redesign_needed, needs_more_work)
│ ├── runner.py # ClaudeRunner — claude --print subprocess wrapper
│ ├── project_loader.py # project.yaml loader + brief injection
│ ├── meta.py # per-task meta.yaml (status, cost, durations)
│ ├── worktree.py # git worktree mgmt + configurable test-file protection
│ ├── isolation.py # research-question slicing (one researcher = one tag)
│ ├── human_review.py # checkpoint UI (approve/abort/revise)
│ ├── eval_aggregator.py # cross-run aggregation for pipeline-evaluator
│ ├── cli.py # typer entry point (exposed as `code-loops` command)
│ └── stages/ # stage handlers (one per `type:` in pipeline.yaml):
│ ├── prompt.py, parallel.py, debate_writer.py, debate_critique.py,
│ ├── impl_planner.py, subtask_iterator.py, action.py,
│ ├── final_validation.py, regression_check.py, final_review.py,
│ ├── tech_writer.py,
│ └── auto_resurvey.py # Stage 12 — conditional brief.md refresh
├── examples/ # starter templates (project.yaml)
├── scripts/ # CI helpers (e.g. check_no_leakage.sh)
└── tests/ # 214 pytest tests (orchestrator only)
# WORKSPACE (created in user's CWD when running code-loops):
<workspace>/
├── projects/<name>/ # per-project config + auto-generated brief
│ ├── project.yaml # name + base_repo + optional test_infrastructure
│ └── brief.md # auto-generated project knowledge
├── tasks/<NNNN>_<slug>/ # per-task workspaces
│ ├── task.md, meta.yaml
│ ├── prd/, research_plan/, research/, design/, design_review/,
│ │ impl_plan/, implementation/, validation/, release_review/, docs/
│ └── worktree/wt/ # git worktree off base_repo
└── _eval/ # pipeline-evaluator reports
Each stage declares name (semantic id), type (engine handler),
prompts, inputs, outputs, and optional human_review: true,
max_rounds: N. See pipeline.yaml for the full 12-stage definition;
top of file documents the schema. A defaults: block sets
model + effort for all stages (override per-role if needed).
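To make the stage schema concrete, here is a hedged sketch of how stage entries and the defaults block might be merged once the YAML has been parsed into a dict. The field names come from the description above. The top-level `stages` key, the default values for `human_review` and `max_rounds`, and the per-entry override of `model` are assumptions of this sketch; the schema documented at the top of pipeline.yaml is authoritative.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str                      # semantic id
    type: str                      # selects the engine handler
    prompts: list = field(default_factory=list)
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    human_review: bool = False     # assumed default
    max_rounds: int = 1            # assumed default
    model: str = ""
    effort: str = ""


def load_stages(config: dict) -> list:
    """Build Stage objects, letting per-stage keys override the defaults block."""
    defaults = config.get("defaults", {})
    return [Stage(**{**defaults, **raw}) for raw in config["stages"]]


# Hypothetical config, shaped like a parsed pipeline.yaml.
example = {
    "defaults": {"model": "opus", "effort": "max"},
    "stages": [
        {"name": "prd", "type": "prompt", "human_review": True},
        {"name": "research", "type": "parallel", "model": "sonnet"},
    ],
}
stages = load_stages(example)
```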
Minimal schema (defaults preserve sensible behavior):
project:
name: my-project
base_repo: /absolute/path/to/your/project
# Optional — markdown file with project-specific architecture/conventions.
# Auto-generated by `code-loops resurvey` (project-surveyor agent).
brief_file: brief.md
# Optional — test infrastructure config (defaults below preserve prior behavior).
test_infrastructure:
enabled: true # false → skip test_writer entirely (manual-QA projects)
test_paths: [tests] # dirs the coder MUST NOT touch
lock_strategy: chmod_444_dir # | none (no chmod, only git-diff guard)
See examples/project.yaml for full annotated template.
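As an illustration of "defaults preserve sensible behavior", the sketch below fills in the test_infrastructure defaults shown in the template and validates the two required fields. It works on an already-parsed, flat dict of project fields and makes no assumption about how those fields nest in the YAML file. `normalize_project` is a hypothetical helper, not the project's loader.

```python
TEST_INFRA_DEFAULTS = {
    "enabled": True,
    "test_paths": ["tests"],
    "lock_strategy": "chmod_444_dir",
}
LOCK_STRATEGIES = ("chmod_444_dir", "none")


def normalize_project(fields: dict) -> dict:
    """Validate required fields and apply test_infrastructure defaults."""
    for key in ("name", "base_repo"):
        if not fields.get(key):
            raise ValueError(f"{key} is required")
    infra = {**TEST_INFRA_DEFAULTS, **fields.get("test_infrastructure", {})}
    infra["test_paths"] = list(infra["test_paths"])  # never share the default list
    if infra["lock_strategy"] not in LOCK_STRATEGIES:
        raise ValueError(f"unknown lock_strategy: {infra['lock_strategy']}")
    return {**fields, "test_infrastructure": infra}
```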
27 markdown files, one per agent. Each has:
- Role identity opening line
- ## Project context block with {PROJECT_BRIEF} placeholder (auto-substituted at load time)
- Domain-specific scan plan / output schema / rules
Customize freely: agents are markdown, not code. pipeline.yaml binds agents to roles by file path.
Edit projects/<name>/project.yaml:
test_infrastructure:
enabled: true
test_paths: [src/test, e2e] # multiple test dirs
lock_strategy: chmod_444_dir # | none
For projects with embedded tests (Rust #[cfg(test)], Go *_test.go
colocated): set lock_strategy: none. The git-diff guard remains active
as a safety net even without chmod. Glob-based locking
(chmod_444_glob) and embedded-test detection are deferred until real
non-Python projects exercise the pipeline.
After material project changes (new modules, new dependencies, new conventions, renamed dirs), regenerate the brief:
uv run code-loops resurvey <name>
The tech-writer stage automatically flags resurvey need in
tasks/<id>/docs/maintenance_notes.md after each task ships.
Every agent prompt is a markdown file under agents/. To change agent
behavior: edit the file, run a task, observe. No engine restart needed.
For systematic A/B comparison run code-loops eval — pipeline-evaluator
detects prompt diffs in git and computes Cohen's d / p-values across
recent runs.
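For readers unfamiliar with the statistics named above, here is a small standard-library sketch of Cohen's d (pooled standard deviation) and the Welch t statistic for a before/after comparison of some per-run metric. It shows the textbook formulas, not how pipeline-evaluator computes them. The p-value step is omitted because it needs the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance


def cohens_d(before, after):
    """Effect size: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(before), len(after)
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    return (mean(after) - mean(before)) / math.sqrt(pooled_var)


def welch_t(before, after):
    """Welch's t statistic (does not assume equal variances)."""
    se = math.sqrt(variance(before) / len(before) + variance(after) / len(after))
    return (mean(after) - mean(before)) / se


# Hypothetical per-run scores before and after a prompt edit.
before = [1, 2, 3, 4, 5]
after = [2, 3, 4, 5, 6]
d = cohens_d(before, after)   # about 0.63
t = welch_t(before, after)    # 1.0
```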
uv run code-loops init /path/to/another/project --name backend-api
Each project lives in its own projects/<name>/ dir. Use
--project backend-api on subsequent commands to disambiguate.
Typical per-task spend (Opus-4.7 at max effort, 2026 pricing):
| Stage | Cost (typical) | Notes |
|---|---|---|
| PRD | $0.05–0.15 | Single Opus call |
| Research plan | $0.05–0.10 | Single Opus call |
| Research (5 parallel) | $0.30–1.00 | 5 specialists, deep scans |
| Design (RFC debate) | $1–4 | 2–5 rounds × (writer + perspectives + arbiter) |
| Design Review (critique) | $0.60–2.50 | 1–3 rounds × (4 critics + responder + arbiter) |
| Impl Plan | $0.20–0.50 | Tech-lead decomposition |
| Implementation | $1–5 | per subtask × subtask count, depends on fix-loop iterations |
| Validation | $0.00 | Programmatic only |
| Regression | $0.00 | Programmatic; off by default. When on, runs project's eval bench. |
| Release Review | $0.30–1 | Single release-manager call |
| Release Docs | $0.05–0.20 | tech-writer |
| Auto Resurvey | $0–3 | $0 if brief stays accurate (typical); $0.30–3 if tech-writer flagged staleness |
| Total per task | $3–18 | Varies hugely with task complexity |
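As a sanity check on the total row, the per-stage ranges in the table can be summed directly. The sketch treats each range as the stage's whole-task spend, which is an assumption for the Implementation row, whose note suggests the figure scales with subtask count.

```python
# (low, high) in USD, copied from the table above.
STAGE_COST_USD = {
    "PRD": (0.05, 0.15),
    "Research plan": (0.05, 0.10),
    "Research": (0.30, 1.00),
    "Design": (1.00, 4.00),
    "Design Review": (0.60, 2.50),
    "Impl Plan": (0.20, 0.50),
    "Implementation": (1.00, 5.00),
    "Validation": (0.00, 0.00),
    "Regression": (0.00, 0.00),
    "Release Review": (0.30, 1.00),
    "Release Docs": (0.05, 0.20),
    "Auto Resurvey": (0.00, 3.00),
}


def total_range(costs):
    """Sum the low and high ends of every stage range, rounded to cents."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return round(low, 2), round(high, 2)


print(total_range(STAGE_COST_USD))  # (3.55, 17.45)
```

The sums land at $3.55 and $17.45, in line with the $3–18 quoted in the table.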
Project-surveyor on init: $0.30–3.00 once per project (re-runs only
when Stage 12 auto-fires or you call code-loops resurvey manually).
To reduce cost: override model to claude-sonnet-4-6 per-stage in
pipeline.yaml for stages where Opus is overkill (research / debate
critics / facilitator). The pipeline-evaluator (Mode B) helps
identify which stages tolerate downgrade.
Every run writes tasks/<id>/meta.yaml with per-stage cost, duration,
attempts, and verdicts. Run:
uv run code-loops eval --last 20
…to invoke the pipeline-evaluator agent (Mode B). It aggregates
recent runs and produces a report covering:
- Convergence rate: % of runs reaching release_review.approved first-pass
- Per-stage retry rate: debate rounds, fix-router bounces
- Code-quality scorecard trend: 5-axis weighted (correctness / maintainability / performance / security / best-practices)
- A/B prompt comparison: when agents/<role>.md changed in git, computes χ² + Welch's t-test + Cohen's d across before/after runs
- Hallucination rate per agent: scans cited file paths and grep-verifies they exist in the target project
- Context-length × degradation tracking: flags stages exceeding 70% of model's safe context range (RULER thresholds for Claude Opus 4.5 ~100K, Sonnet 4.5 ~80K)
- Pass@k tracking: for AI-touching subtasks with golden eval files
Reports land at _eval/report_<timestamp>.md.
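One of the report metrics above, hallucination rate, reduces to checking whether cited paths exist in the target project. The sketch below shows that check under simple assumptions: cited paths are already extracted and repo-relative, and an empty citation list counts as a rate of 0. `hallucination_rate` is a hypothetical name, not the evaluator's code.

```python
from pathlib import Path


def hallucination_rate(cited_paths, project_root) -> float:
    """Fraction of cited paths that do not exist under project_root."""
    if not cited_paths:
        return 0.0
    root = Path(project_root)
    missing = [p for p in cited_paths if not (root / p).exists()]
    return len(missing) / len(cited_paths)
```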
Done:
- Full 12-stage pipeline with auto-loops (redesign_needed, needs_more_work, auto-resurvey) and conditional regression gate
- 27 agents in 6 families with {PROJECT_BRIEF} injection
- Configurable test infrastructure (Python tests/ default; pluggable)
- init / resurvey / projects / eval CLI commands
- 195 pytest tests covering orchestrator + worktree + agents
Next:
- Multi-language support: today the validation stage and several agent prompts hardcode pytest + ruff. Move test/lint/typecheck commands into project.yaml so Node.js / Go / Rust / etc. projects can configure their own. (#1)
- chmod_444_glob (Go-style *_test.go), git_diff_only (Rust embedded tests), surveyor auto-detect of test paths into project.yaml
- Real parallel subtask execution (currently sequential even for wave-marked subtasks; deferred until measured wall-clock pain)
- Shared prompt blocks (language rule, Iron Law, revision mode) — deferred until first production runs measure real drift
Not planned:
- Built-in support for non-Anthropic LLM providers — we use Claude CLI exclusively. PRs welcome if there's interest.
This is a personal tool that grew into something potentially useful. PRs welcome but expect:
- Strict pre-commit gate: uv run pytest && uv run ruff check .
- New stage handlers / agent prompts must include tests + a rationale in the PR description (why this addition vs simpler alternatives).
- Backward-compat preserved by default — additive changes only unless there's a clear migration path.
MIT — see LICENSE.
Pipeline shape inspired by patterns from obra/superpowers, awesome-ai-dev-prompts, xfstudio/skills, Fandry96/k3-agentic-skills, and Agent Skills for Context Engineering.