Number531 · Number531 · Jun 15, 2026 · Jun 16, 2026 · Jun 16, 2026
diff --git a/.claude/skills/deep-plan-pipeline/references/run-ledger.schema.md b/.claude/skills/deep-plan-pipeline/references/run-ledger.schema.md
@@ -13,18 +13,88 @@ created before this contract run unchanged (the conductor falls back to its prio
 
 ```jsonc
 {
+  // — sidecar metadata (Codex-convention-aligned; deliberately lighter — no createdAt/updatedAt, kept deterministic) —
   "schemaVersion": 1,
+  "schemaName": "run-ledger",
+  "schemaUrl": ".claude/skills/deep-plan-pipeline/references/run-ledger.schema.md",
+  "artifactKind": "deep-plan-run-ledger",
+  "producer": { "skill": "deep-plan", "skillVersion": "1.0.0", "module": "run-ledger.mjs" },
+
   "legs": {
     "deepPlan":       { "status": "<status>", "verdict": "sound|minor-only|needs-critical-fixes|has-major-gaps", "degraded": false, "at": "YYYY-MM-DD" },
     "review":         { "status": "<status>", "verdict": "<review verdict>", "counts": { /* … */ }, "at": "YYYY-MM-DD" },
     "revise":         { "status": "revised", "findingsDischarged": 3, "at": "YYYY-MM-DD" },   // informational
     "implementation": { "status": "<status>", "rawStatus": "<leg-3 return status>", "branch": "…", "at": "YYYY-MM-DD" }
   },
   "loopbackCount": 0,        // bumped by the revise leg on each revise→re-review loop
-  "lastEscalation": null     // conductor-managed
+  "lastEscalation": null,    // conductor-managed
+
+  // — Phase B: governable-autonomy fields (all OPTIONAL; set via writeMeta; absent ⇒ prior behaviour) —
+  "authority": "triage",     // what the NEXT stage MAY do: triage<implement<push<merge (fail-closed: absent/unknown ⇒ nothing)
+  "budget": { "capTokens": 800000, "spent": 0 },   // (reserved) loop spend-tracking — NOT yet wired; the active cap is goal.bounds.capTokens
+  "goal": {                  // the quantified completion condition
+    "objective": "<human-readable>",
+    "metric": "<the measured success criterion>",
+    "predicates": [ { "label": "parity", "kind": "command", "cmd": "node …/X-golden-parity.test.js" } ],
+    "met": false,            // = AND(predicate exit codes); empty predicates ⇒ false (fail-closed)
+    "bounds": { "maxIterations": 3, "capTokens": 800000, "noProgressRounds": 2 }
+  },
+
+  // — Phase C: loop-engine resume state (written by run-loop.mjs; absent until the loop runs) —
+  "loop": {
+    "completed": ["item-id"],                 // queue items whose action + goal passed (the resume key set)
+    "escalated": { "item": "…", "reason": "goal-not-met|goal-absent|action-failed|invalid-queue|authority-denied|maxIterations|capTokens|noProgressRounds", "failing": ["<predicate label>"] }  // null when complete
+  }
 }
 ```
 
+## Governable-autonomy fields (Phases B–C)
+
+These make the ledger ready for the autonomous loop engine (`run-loop.mjs`, Phase C — **built but
+un-wired** to real domains). Each is optional and inert until a consumer reads it (the loop engine, once
+its real-domain config is wired; the conductor for resume).
+
+- **`authority`** — the granted permission tier (`triage < implement < push < merge`), checked by
+  `authorityAllows(ledger, action)`. **Fail-closed**: absent/unknown authority permits nothing, so a
+  handoff can never silently escalate from triage to merge.
+- **`goal.bounds`** — the ACTIVE loop boundary, read by `boundExceeded(bounds, {iterations, tokens,
+  noProgress})`, which returns the tripped bound name (a stop reason) or null. The boundary that prevents
+  runaway (the 6M-token failure had none). Bound fields are positive integers; `0`/negative trip
+  immediately and should be treated as misconfiguration.
+- **`budget`** (reserved) — a top-level `{capTokens, spent}` field for the loop's own spend-tracking;
+  NOT yet read by any helper or by the Phase-C loop engine (which today tracks completion in
+  `loop.completed`, not token spend). The **active** token cap is `goal.bounds.capTokens`; the budget
+  field is wired when the loop's real-domain config tracks spend.
+- **`loop`** — the loop engine's resume state (`{completed, escalated}`), written by `run-loop.mjs` and
+  read back by `firstPendingItem` to resume mid-queue. Disjoint from `legs` (a real pipeline's per-leg
+  state) — the two coexist in one ledger without collision.
+- **`goal`** — the quantified "done", evaluated by `evaluateGoal(goal, {cwd})`:
+  `met = AND(predicate.cmd exit 0)`. Every predicate is a **falsifiable command** (deterministic-first;
+  a metric/verdict predicate is just a command that exits 0/1). **Empty predicates ⇒ not met**
+  (fail-closed: a loop needs something that can say *no*). The loop runs
+  `until met ∨ boundExceeded(goal.bounds) → else escalate`.
+
+## Sibling: `run-events.jsonl` (append-only transition log)
+
+Written alongside the ledger by `appendEvent()` (called write-ahead from every `writeLeg`/`writeMeta`).
+The Claude-side machine transition trail — the lighter counterpart to the Codex runtime's
+`deep-plan-events.jsonl` (same `.claude`/`.agents` runtime split). One JSON object per line, never
+rewritten; each carries a **monotonic 1-based `seq`** for ordering + replay/gap detection.
+
+```jsonc
+{ "seq": 1, "type": "leg",      "leg": "deepPlan", "status": "done-advance", "verdict": "sound", "at": "YYYY-MM-DD" }
+{ "seq": 2, "type": "meta",     "keys": ["loopbackCount"], "at": null }
+{ "seq": 3, "type": "loop",     "item": "…", "actionOk": true, "actionSkipped": false, "met": true }   // Phase C: one per loop item
+{ "seq": 4, "type": "loop-end", "status": "complete|escalated", "reason": null, "completed": 2 }        // Phase C: loop terminal
+```
+
+Write-ahead ordering (event appended *before* the ledger write) means a crash over-counts (an event
+with no committed state — auditable) rather than silently under-counting; a torn-write guard in
+`appendEvent` separates a crashed partial line so `seq` stays sane. Reader (intended): **observability**
+— the events log is NOT yet consumed by any code; the conductor resumes from `run-ledger.json` via
+`firstPendingLeg`, not from this file. It is *unit-tested*, not JSON-schema-validated (the deliberate lean-Claude
+posture; the Codex `deep-plan-events.jsonl` carries the heavier registered-schema discipline).
+
 ## Status enum (the single point that preserves verdict semantics)
 
 A leg's `status` is **derived from its verdict** — never collapsed to a binary done/not-done — via the

diff --git a/.claude/skills/deep-plan/DECISIONS.md b/.claude/skills/deep-plan/DECISIONS.md
@@ -0,0 +1,47 @@
+# DECISIONS — deep-plan skill family
+
+Human-legible decision journal: **why**, not what. Append dated, high-level entries for meaningful
+decisions only (not routine actions, never secrets). Distinct from `run-events.jsonl` (machine
+transition log) and `CHANGELOG.md` (releases). Purpose: a person who stays out of the per-step loop
+can re-enter with full understanding. *(Steinberger's persistent log; Karpathy: "outsource your
+thinking, never your understanding.")*
+
+---
+
+## 2026-06-15 — Run-state ledger as the integration seam (not a monolith)
+**Decision:** integrate the legs through ONE shared `run-ledger.json` (state) + `signal-map.md`
+(routing), with the conductor as a thin reader — rather than welding the pipeline into a single
+workflow. **Why:** every source (Anthropic skills, Spec Kit, orchestrator-worker) and our own
+6M-token convergence failure say composed-with-contracts beats monolith; "seamless handoff" comes
+from the shared state contract, not a shared container.
+
+## 2026-06-15 — Build the harness lean + test-gated, NOT via /deep-plan on itself
+**Decision:** harness/tooling changes are built directly, gated by `node --test`, using `/deep-plan`
+at most ONCE to extract invariants. **Why:** two `/deep-plan` runs to plan the ledger cost ~6M tokens
+and never converged (category error: app-change planner aimed at meta-tooling; self-referential;
+additive-by-design amplifies complexity). The test suite is the verifier, not another audit loop.
+Captured in memory: `deep-plan-harness-self-modification`.
+
+## 2026-06-15 — Status derives from verdict; never collapse to binary done/not-done
+**Decision:** a leg's ledger status is *derived* from its verdict via maps mirroring `signal-map.md`,
+fail-closed. **Why:** the binary-flatten defect (a `has-major-gaps` leg-1 auto-advancing to review)
+was the exact resume-correctness bug both audits kept flagging; deriving status preserves the
+conductor's routing semantics so a resumed run can never skip a certification gate.
+
+## 2026-06-15 — Governable-autonomy essentials = authority, completion condition, decision journal
+**Decision:** prioritize three of the four "final" handoff components — `authority` tier, quantified
+`goal`/completion condition, and this journal — as additive ledger fields + one doc. **Why:**
+Karpathy's rule for autonomy is "set the boundaries first" (objective, metric, permissions) and keep
+a record you can re-enter through. Budget boundary + the rollout loop are deferred until a consumer
+(the loop) exists — don't build machinery ahead of its substrate.
+
+## 2026-06-16 — Built the loop ENGINE (Phase C) but kept it UN-WIRED
+**Decision:** build a generic, domain-agnostic loop engine (`run-loop.mjs`) now — validated against a
+synthetic harness — but inject `actionFor`/`goalFor` and leave it un-wired to real domains. **Why:** the
+"don't build machinery before its substrate" rule applies at the *wiring* boundary, not the *engine*
+boundary. The engine's mechanics (advance / gate / escalate / bound / resume) are knowable from first
+principles + the orchestrator-worker research and are fully testable against mocks — so a synthetic
+harness *can* say "no" to the engine. What's genuinely unknowable without Stage 1 is the *fit* to real
+migration (the real queue, the real parity tests, the escalation UX), so the **wiring** stays deferred.
+A synthetic test proves "the engine does what it's coded to," not "this design fits real migration" — the
+latter still wants Stage 1 to prove one migration by hand first.
diff --git a/.claude/skills/deep-plan/VISION.md b/.claude/skills/deep-plan/VISION.md
@@ -0,0 +1,84 @@
+# VISION — The deep-plan skill family as a governable-autonomy foundation
+
+**North star.** A composed system of single-purpose skills that plans, reviews, implements, and
+audits changes — integrated through *shared contracts* (the run-ledger state + the signal-map
+routing law), not a monolith — so work can be tracked, handed off, resumed, and (eventually) looped
+**without a human in every step, yet never without a human's understanding.**
+
+This file is the constitution: it states intent + boundaries + the extension contract, so each
+addition (and each loop tick, later) does not re-derive intent from scratch.
+
+## The 7-layer reference architecture (industry-convergent: Anthropic skills, Spec Kit, orchestrator-worker)
+
+| Layer | Role | Where it lives | Status |
+|---|---|---|---|
+| 1. Constitution / VISION | durable intent + boundaries | this file | ✅ |
+| 2. Staged artifact pipeline | spec → plan → review → implement → audit | the deep-plan skills | ✅ |
+| 3. Control-plane / worker | orchestrator routes; workers execute (tool-enforced) | `deep-plan-pipeline` conductor | ⚠️ prose-enforced |
+| 4. Durable checkpoint | resume from state N, not scratch | `run-ledger.json` | ✅ |
+| 5. Validator that says "no" | tests / audit refute the work | `node --test` + adversarial audit | ✅ |
+| 6. Bounded routing | verdict → action, caps, no improvisation | `signal-map.md` + bounds | ✅ |
+| 7. Decision-ready HITL escalation | opinionated brief, not a raw question | conductor escalation block | ⚠️ format |
+
+## Governable-autonomy components (the tracking surface)
+
+Every component is a **field on the run-ledger or a sibling file** — never a new monolith.
+
+| Component | Form | Purpose | Status |
+|---|---|---|---|
+| Tracking | `run-ledger.json` (`legs`, status) | resume at the right leg | ✅ shipped |
+| Handoff | the ledger contract | leg→leg carries the verdict | ✅ shipped |
+| Transcript | platform `subagents/workflows/*` | full-fidelity replay | ✅ platform |
+| Logging | `run-events.jsonl` (append-only, `seq`-stamped, write-ahead) | machine transition trail; reader = conductor resume/observability | ✅ Phase A — Claude-side counterpart to Codex `deep-plan-events.jsonl`, deliberately lighter (unit-tested, not JSON-schema-registered) |
+| Sidecar conformance | `schemaName`/`schemaUrl`/`artifactKind`/`producer` | cross-side consistency | ✅ Phase A — Codex-convention-aligned, kept deterministic (omits `createdAt`/`updatedAt`) |
+| **Authority** | `run-ledger.authority` + `authorityAllows()` | what the next stage MAY do (triage<implement<push<merge, fail-closed) | ✅ Phase B |
+| **Completion condition** | `run-ledger.goal` + `evaluateGoal()` | quantified, falsifiable "done" (AND of command predicates) | ✅ Phase B |
+| Decision journal | `DECISIONS.md` | human rationale (why, not what) | ✅ Phase A |
+| Budget boundary | `boundExceeded()` over `goal.bounds.capTokens` | token cap; stops runaway | ✅ Phase B (bound helper + active bound). The top-level `budget` field is *reserved* for the loop's spend-tracking — not yet wired. |
+| **Loop engine** | `run-loop.mjs` + synthetic harness | advance / gate / escalate / bound / authority / **resume** over the ledger | ✅ Phase C — **UN-WIRED**: mechanics proven against mocks; real-domain config (the migration skill + parity tests) deferred to Stage 1 |
+
+### Completion condition — quantified (Layer-6 core)
+`done = (P₁ ∧ … ∧ Pₙ)` where each predicate is a **falsifiable command (exit 0 = pass)**,
+deterministic-first, threshold-and-judge only where irreducible, measured against a captured
+baseline. The loop runs `until done ∨ bound (iterations / budget / no-progress) → else escalate`.
+A predicate that cannot fail is "the agent agreeing with itself." (Internalized the hard way as
+audit findings F9/F10: *prove by command, never by prose.*)
+
+## The extension contract (what makes this an evolutionary foundation)
+
+Adding any component is **always** the same four-step, additive move — never a breaking change:
+
+```
+1. add an OPTIONAL field/leg to run-ledger (readers tolerate its absence — readLedger→null seam)
+2. wire its WRITER into one skill's persist script
+3. ship its VALIDATOR (node --test) + update the drift guard
+4. land as its OWN small PR
+```
+
+Properties this guarantees: backward-compatible (inert on pre-contract folders), fail-closed
+(unknown verdict/status → escalate, never silent mis-route), validated-by-construction (no
+component ships without its own test), and incremental (Tier-0 useful, every tier additive).
+
+**Hold this line and the foundation evolves seamlessly; break it (a required field, a breaking
+schema change, an untested component) and you forfeit the property.**
+
+## Backlog (build order)
+
+- **Phase A** — `run-events.jsonl` + sidecar metadata + this `VISION.md` + `DECISIONS.md`. *(built ✅)*
+- **Phase B** — `authority` tier + `goal`/`evaluateGoal` completion condition + `goal.bounds`/`boundExceeded` loop bounds. *(built ✅)* — the top-level `budget` field stays reserved (see Later).
+- **Phase C** — generic loop ENGINE (`run-loop.mjs`): advance/gate/escalate/bound/authority/resume over the ledger, validated by a synthetic harness. *(built ✅ — UN-WIRED: the engine exists; the real-domain config is deferred, see Later.)*
+- **Later (consumer-gated — build when its reader exists):**
+  - Tool-enforced control plane (strip `Edit` from the conductor) — Layer 3.
+  - Decision-ready escalation brief format — Layer 7.
+  - Budget boundary field — when a loop reads it.
+  - **Wire the loop to real domains** — point the Phase-C engine's `actionFor`/`goalFor` at the Stage-3+
+    per-domain MCP→Python migration skill + its real parity tests. The engine is built; the wiring + the
+    real validators come from Stage 1 (which proves one migration by hand first).
+    migrations — the repetitive substrate where automation pays off. Do NOT build before that
+    substrate exists (the "build machinery before there is work for it" trap).
+
+## Non-negotiables (the boundaries before autonomy)
+- A loop needs **something that can say no** (a test / type-check / real error). Never loop without one.
+- **Constitution first**: intent is durable here, not re-derived per tick.
+- Don't build a consumer-less component. Don't merge components into a monolith. Don't deep-plan the
+  deep-plan harness (lean + test-gated only).