Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -13,18 +13,88 @@ created before this contract run unchanged (the conductor falls back to its prio

```jsonc
{
// — sidecar metadata (Codex-convention-aligned; deliberately lighter — no createdAt/updatedAt, kept deterministic) —
"schemaVersion": 1,
"schemaName": "run-ledger",
"schemaUrl": ".claude/skills/deep-plan-pipeline/references/run-ledger.schema.md",
"artifactKind": "deep-plan-run-ledger",
"producer": { "skill": "deep-plan", "skillVersion": "1.0.0", "module": "run-ledger.mjs" },

"legs": {
"deepPlan": { "status": "<status>", "verdict": "sound|minor-only|needs-critical-fixes|has-major-gaps", "degraded": false, "at": "YYYY-MM-DD" },
"review": { "status": "<status>", "verdict": "<review verdict>", "counts": { /* … */ }, "at": "YYYY-MM-DD" },
"revise": { "status": "revised", "findingsDischarged": 3, "at": "YYYY-MM-DD" }, // informational
"implementation": { "status": "<status>", "rawStatus": "<leg-3 return status>", "branch": "…", "at": "YYYY-MM-DD" }
},
"loopbackCount": 0, // bumped by the revise leg on each revise→re-review loop
"lastEscalation": null // conductor-managed
"lastEscalation": null, // conductor-managed

// — Phase B: governable-autonomy fields (all OPTIONAL; set via writeMeta; absent ⇒ prior behaviour) —
"authority": "triage", // what the NEXT stage MAY do: triage<implement<push<merge (fail-closed: absent/unknown ⇒ nothing)
"budget": { "capTokens": 800000, "spent": 0 }, // (reserved) loop spend-tracking — NOT yet wired; the active cap is goal.bounds.capTokens
"goal": { // the quantified completion condition
"objective": "<human-readable>",
"metric": "<the measured success criterion>",
"predicates": [ { "label": "parity", "kind": "command", "cmd": "node …/X-golden-parity.test.js" } ],
"met": false, // = AND(predicate exit codes); empty predicates ⇒ false (fail-closed)
"bounds": { "maxIterations": 3, "capTokens": 800000, "noProgressRounds": 2 }
},

// — Phase C: loop-engine resume state (written by run-loop.mjs; absent until the loop runs) —
"loop": {
"completed": ["item-id"], // queue items whose action + goal passed (the resume key set)
"escalated": { "item": "…", "reason": "goal-not-met|goal-absent|action-failed|invalid-queue|authority-denied|maxIterations|capTokens|noProgressRounds", "failing": ["<predicate label>"] } // null when complete
}
}
```

## Governable-autonomy fields (Phases B–C)

These make the ledger ready for the autonomous loop engine (`run-loop.mjs`, Phase C — **built but
un-wired** to real domains). Each is optional and inert until a consumer reads it (the loop engine, once
its real-domain config is wired; the conductor for resume).

- **`authority`** — the granted permission tier (`triage < implement < push < merge`), checked by
`authorityAllows(ledger, action)`. **Fail-closed**: absent/unknown authority permits nothing, so a
handoff can never silently escalate from triage to merge.
- **`goal.bounds`** — the ACTIVE loop boundary, read by `boundExceeded(bounds, {iterations, tokens,
noProgress})`, which returns the tripped bound name (a stop reason) or null. The boundary that prevents
runaway (the 6M-token failure had none). Bound fields are positive integers; `0`/negative trip
immediately and should be treated as misconfiguration.
- **`budget`** (reserved) — a top-level `{capTokens, spent}` field for the loop's own spend-tracking;
NOT yet read by any helper or by the Phase-C loop engine (which today tracks completion in
`loop.completed`, not token spend). The **active** token cap is `goal.bounds.capTokens`; the budget
field is wired when the loop's real-domain config tracks spend.
- **`loop`** — the loop engine's resume state (`{completed, escalated}`), written by `run-loop.mjs` and
read back by `firstPendingItem` to resume mid-queue. Disjoint from `legs` (a real pipeline's per-leg
state) — the two coexist in one ledger without collision.
- **`goal`** — the quantified "done", evaluated by `evaluateGoal(goal, {cwd})`:
`met = AND(predicate.cmd exit 0)`. Every predicate is a **falsifiable command** (deterministic-first;
a metric/verdict predicate is just a command that exits 0/1). **Empty predicates ⇒ not met**
(fail-closed: a loop needs something that can say *no*). The loop runs
`until met ∨ boundExceeded(goal.bounds) → else escalate`.

## Sibling: `run-events.jsonl` (append-only transition log)

Written alongside the ledger by `appendEvent()` (called write-ahead from every `writeLeg`/`writeMeta`).
The Claude-side machine transition trail — the lighter counterpart to the Codex runtime's
`deep-plan-events.jsonl` (same `.claude`/`.agents` runtime split). One JSON object per line, never
rewritten; each carries a **monotonic 1-based `seq`** for ordering + replay/gap detection.

```jsonc
{ "seq": 1, "type": "leg", "leg": "deepPlan", "status": "done-advance", "verdict": "sound", "at": "YYYY-MM-DD" }
{ "seq": 2, "type": "meta", "keys": ["loopbackCount"], "at": null }
{ "seq": 3, "type": "loop", "item": "…", "actionOk": true, "actionSkipped": false, "met": true } // Phase C: one per loop item
{ "seq": 4, "type": "loop-end", "status": "complete|escalated", "reason": null, "completed": 2 } // Phase C: loop terminal
```

Write-ahead ordering (event appended *before* the ledger write) means a crash over-counts (an event
with no committed state — auditable) rather than silently under-counting; a torn-write guard in
`appendEvent` separates a crashed partial line so `seq` stays sane. Reader (intended): **observability**
— the events log is NOT yet consumed by any code; the conductor resumes from `run-ledger.json` via
`firstPendingLeg`, not from this file. It is *unit-tested*, not JSON-schema-validated (the deliberate lean-Claude
posture; the Codex `deep-plan-events.jsonl` carries the heavier registered-schema discipline).

## Status enum (the single point that preserves verdict semantics)

A leg's `status` is **derived from its verdict** — never collapsed to a binary done/not-done — via the
Expand Down
47 changes: 47 additions & 0 deletions .claude/skills/deep-plan/DECISIONS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# DECISIONS — deep-plan skill family

Human-legible decision journal: **why**, not what. Append dated, high-level entries for meaningful
decisions only (not routine actions, never secrets). Distinct from `run-events.jsonl` (machine
transition log) and `CHANGELOG.md` (releases). Purpose: a person who stays out of the per-step loop
can re-enter with full understanding. *(Steinberger's persistent log; Karpathy: "outsource your
thinking, never your understanding.")*

---

## 2026-06-15 — Run-state ledger as the integration seam (not a monolith)
**Decision:** integrate the legs through ONE shared `run-ledger.json` (state) + `signal-map.md`
(routing), with the conductor as a thin reader — rather than welding the pipeline into a single
workflow. **Why:** every source (Anthropic skills, Spec Kit, orchestrator-worker) and our own
6M-token convergence failure say composed-with-contracts beats monolith; "seamless handoff" comes
from the shared state contract, not a shared container.

## 2026-06-15 — Build the harness lean + test-gated, NOT via /deep-plan on itself
**Decision:** harness/tooling changes are built directly, gated by `node --test`, using `/deep-plan`
at most ONCE to extract invariants. **Why:** two `/deep-plan` runs to plan the ledger cost ~6M tokens
and never converged (category error: app-change planner aimed at meta-tooling; self-referential;
additive-by-design amplifies complexity). The test suite is the verifier, not another audit loop.
Captured in memory: `deep-plan-harness-self-modification`.

## 2026-06-15 — Status derives from verdict; never collapse to binary done/not-done
**Decision:** a leg's ledger status is *derived* from its verdict via maps mirroring `signal-map.md`,
fail-closed. **Why:** the binary-flatten defect (a `has-major-gaps` leg-1 auto-advancing to review)
was the exact resume-correctness bug both audits kept flagging; deriving status preserves the
conductor's routing semantics so a resumed run can never skip a certification gate.

## 2026-06-15 — Governable-autonomy essentials = authority, completion condition, decision journal
**Decision:** prioritize three of the four "final" handoff components — `authority` tier, quantified
`goal`/completion condition, and this journal — as additive ledger fields + one doc. **Why:**
Karpathy's rule for autonomy is "set the boundaries first" (objective, metric, permissions) and keep
a record you can re-enter through. Budget boundary + the rollout loop are deferred until a consumer
(the loop) exists — don't build machinery ahead of its substrate.

## 2026-06-16 — Built the loop ENGINE (Phase C) but kept it UN-WIRED
**Decision:** build a generic, domain-agnostic loop engine (`run-loop.mjs`) now — validated against a
synthetic harness — but inject `actionFor`/`goalFor` and leave it un-wired to real domains. **Why:** the
"don't build machinery before its substrate" rule applies at the *wiring* boundary, not the *engine*
boundary. The engine's mechanics (advance / gate / escalate / bound / resume) are knowable from first
principles + the orchestrator-worker research and are fully testable against mocks — so a synthetic
harness *can* say "no" to the engine. What's genuinely unknowable without Stage 1 is the *fit* to real
migration (the real queue, the real parity tests, the escalation UX), so the **wiring** stays deferred.
A synthetic test proves "the engine does what it's coded to," not "this design fits real migration" — the
latter still wants Stage 1 to prove one migration by hand first.
84 changes: 84 additions & 0 deletions .claude/skills/deep-plan/VISION.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
# VISION — The deep-plan skill family as a governable-autonomy foundation

**North star.** A composed system of single-purpose skills that plans, reviews, implements, and
audits changes — integrated through *shared contracts* (the run-ledger state + the signal-map
routing law), not a monolith — so work can be tracked, handed off, resumed, and (eventually) looped
**without a human in every step, yet never without a human's understanding.**

This file is the constitution: it states intent + boundaries + the extension contract, so each
addition (and each loop tick, later) does not re-derive intent from scratch.

## The 7-layer reference architecture (industry-convergent: Anthropic skills, Spec Kit, orchestrator-worker)

| Layer | Role | Where it lives | Status |
|---|---|---|---|
| 1. Constitution / VISION | durable intent + boundaries | this file | ✅ |
| 2. Staged artifact pipeline | spec → plan → review → implement → audit | the deep-plan skills | ✅ |
| 3. Control-plane / worker | orchestrator routes; workers execute (tool-enforced) | `deep-plan-pipeline` conductor | ⚠️ prose-enforced |
| 4. Durable checkpoint | resume from state N, not scratch | `run-ledger.json` | ✅ |
| 5. Validator that says "no" | tests / audit refute the work | `node --test` + adversarial audit | ✅ |
| 6. Bounded routing | verdict → action, caps, no improvisation | `signal-map.md` + bounds | ✅ |
| 7. Decision-ready HITL escalation | opinionated brief, not a raw question | conductor escalation block | ⚠️ format |

## Governable-autonomy components (the tracking surface)

Every component is a **field on the run-ledger or a sibling file** — never a new monolith.

| Component | Form | Purpose | Status |
|---|---|---|---|
| Tracking | `run-ledger.json` (`legs`, status) | resume at the right leg | ✅ shipped |
| Handoff | the ledger contract | leg→leg carries the verdict | ✅ shipped |
| Transcript | platform `subagents/workflows/*` | full-fidelity replay | ✅ platform |
| Logging | `run-events.jsonl` (append-only, `seq`-stamped, write-ahead) | machine transition trail; reader = conductor resume/observability | ✅ Phase A — Claude-side counterpart to Codex `deep-plan-events.jsonl`, deliberately lighter (unit-tested, not JSON-schema-registered) |
| Sidecar conformance | `schemaName`/`schemaUrl`/`artifactKind`/`producer` | cross-side consistency | ✅ Phase A — Codex-convention-aligned, kept deterministic (omits `createdAt`/`updatedAt`) |
| **Authority** | `run-ledger.authority` + `authorityAllows()` | what the next stage MAY do (triage<implement<push<merge, fail-closed) | ✅ Phase B |
| **Completion condition** | `run-ledger.goal` + `evaluateGoal()` | quantified, falsifiable "done" (AND of command predicates) | ✅ Phase B |
| Decision journal | `DECISIONS.md` | human rationale (why, not what) | ✅ Phase A |
| Budget boundary | `boundExceeded()` over `goal.bounds.capTokens` | token cap; stops runaway | ✅ Phase B (bound helper + active bound). The top-level `budget` field is *reserved* for the loop's spend-tracking — not yet wired. |
| **Loop engine** | `run-loop.mjs` + synthetic harness | advance / gate / escalate / bound / authority / **resume** over the ledger | ✅ Phase C — **UN-WIRED**: mechanics proven against mocks; real-domain config (the migration skill + parity tests) deferred to Stage 1 |

### Completion condition — quantified (Layer-6 core)
`done = (P₁ ∧ … ∧ Pₙ)` where each predicate is a **falsifiable command (exit 0 = pass)**,
deterministic-first, threshold-and-judge only where irreducible, measured against a captured
baseline. The loop runs `until done ∨ bound (iterations / budget / no-progress) → else escalate`.
A predicate that cannot fail is "the agent agreeing with itself." (Internalized the hard way as
audit findings F9/F10: *prove by command, never by prose.*)

## The extension contract (what makes this an evolutionary foundation)

Adding any component is **always** the same four-step, additive move — never a breaking change:

```
1. add an OPTIONAL field/leg to run-ledger (readers tolerate its absence — readLedger→null seam)
2. wire its WRITER into one skill's persist script
3. ship its VALIDATOR (node --test) + update the drift guard
4. land as its OWN small PR
```

Properties this guarantees: backward-compatible (inert on pre-contract folders), fail-closed
(unknown verdict/status → escalate, never silent mis-route), validated-by-construction (no
component ships without its own test), and incremental (Tier-0 useful, every tier additive).

**Hold this line and the foundation evolves seamlessly; break it (a required field, a breaking
schema change, an untested component) and you forfeit the property.**

## Backlog (build order)

- **Phase A** — `run-events.jsonl` + sidecar metadata + this `VISION.md` + `DECISIONS.md`. *(built ✅)*
- **Phase B** — `authority` tier + `goal`/`evaluateGoal` completion condition + `goal.bounds`/`boundExceeded` loop bounds. *(built ✅)* — the top-level `budget` field stays reserved (see Later).
- **Phase C** — generic loop ENGINE (`run-loop.mjs`): advance/gate/escalate/bound/authority/resume over the ledger, validated by a synthetic harness. *(built ✅ — UN-WIRED: the engine exists; the real-domain config is deferred, see Later.)*
- **Later (consumer-gated — build when its reader exists):**
- Tool-enforced control plane (strip `Edit` from the conductor) — Layer 3.
- Decision-ready escalation brief format — Layer 7.
- Budget boundary field — when a loop reads it.
- **Wire the loop to real domains** — point the Phase-C engine's `actionFor`/`goalFor` at the Stage-3+
per-domain MCP→Python migration skill + its real parity tests. The engine is built; the wiring + the
real validators come from Stage 1 (which proves one migration by hand first).
migrations — the repetitive substrate where automation pays off. Do NOT build before that
substrate exists (the "build machinery before there is work for it" trap).

## Non-negotiables (the boundaries before autonomy)
- A loop needs **something that can say no** (a test / type-check / real error). Never loop without one.
- **Constitution first**: intent is durable here, not re-derived per tick.
- Don't build a consumer-less component. Don't merge components into a monolith. Don't deep-plan the
deep-plan harness (lean + test-gated only).
Loading