Factory-SDK Graduation — Runbook + Status (2026-06-14): machinery graduated & validated; autonomous merge-flip honestly pending

# Factory-SDK Graduation Runbook + Status — 2026-06-14

> Status: **MACHINERY GRADUATED & VALIDATED.** Autonomous **flip (mergePolicy never→on-green-with-review) HONESTLY PENDING** the real-PR leg.
> This is the disciplined outcome: every gate that could be proven was proven and mutation-cert'd; the one gate tonight's real backlog could not honestly supply (≥2-3 real PRs landed clean) is reported as pending rather than manufactured.

---

## 1. TL;DR

The factory-sdk autonomous loop (discovery → triage → dispatch → PR → PR-state completion → merge-gate → close → release) is **built, merged to main, unit-cert'd RED-on-revert, and live-validated** on a continuous synthetic soak. The 2-process production model (live daemon + external reaper) is **PROVEN non-vacuous over an ~8h continuous soak** (465 samples, single pinned pid 35382, clean-ended at 8h by intentional SIGTERM): **rss bounded 20–70MB, +3.8MB drift over 8h** (well under the 8MB no-leak threshold; the half-vs-half offset stayed stable at +4–6MB across the 4h/6h/7h/end checks, i.e. NOT growing, so no slow leak), **fd flat (31.0→30.6)**, **badReap=0 across the whole run**. The leak test under load is the busy ~1.5h heavy-cycling window (completions 1→29, ≈6×/key, mean 39MB, bounded); rss also stayed bounded through the idle stall tail (mean 46MB).

The autonomous **merge flip is intentionally NOT thrown tonight.** The flip-gate requires ≥2-3 *real* PRs landed clean, and tonight's real backlog yielded **zero clean autonomous-handleable candidates** (see §5). We did **not** enrich/manufacture a candidate — a manufactured proof isn't a proof. `mergePolicy` stays `'never'`; the flip waits for genuine candidates (or the LLM triage tier).

---

## 2. What graduated — machinery ledger (cert facts, authoritative via sb-review2)

All merged to `main`, each with mutation tests proven RED-on-revert:

| Fix | PR (SHA) | What | Key MUTs proven RED |
|-----|----------|------|---------------------|
| **A** close-probe | #280 (7c4ab03) | markerless synthetic-close + exact-number-bounded issue-key match + over-match guard | closer-relax, resolver-relax, exact-bound `-(?!\d)`, issue-gate, operator-wiring (caught the M2 operator-flag vacuity on re-cert) |
| **B** heartbeat-starvation | #281 (c652420) | false-reap-via-event-loop-starvation fix (enqueue→bounded drain + yield) | MUT-B1 (revert→void #handleLiveChange) RED; :2433 replay-suppression stabilized (MUT-S RED) |
| **C** completion sweep | #287 (96e6324) | PR-state completion sweep (the WEDGE fix) + gh-resolver fold | wedge, draft-skip, coalesce-idempotent, gh-resolver over-match-catastrophic (rejects its own fix PRs #287/#280), fail-closed, draft-via-gh, gh-not-found-backoff; bot drive-by 1f478a0 caught+verified-safe |
| **H** teardown-race | #290 (ae6486c) | refresh 'stopping' heartbeat during teardown | progressing-protected (M-H1) AND wedged-still-reaped (M-H2 — wedge-detection preserved) |
| **I** terminal-clear-on-reopen | #294 (a913949) | re-cycle enabler: clear dispatch-terminal on canonical Done→Ready | clear-enables-redispatch (M-I1) AND canonical-only safety (M-I2 — by-state alias flap does NOT clear) |
| **F** bounded-parallel drain | #299 (cc346eb) | dispatch throughput recovery under replay backlog | throughput, dedupe-under-K (serial-prep-before-parallel → no concurrent double-dispatch), dedupe-identity, :2433, MUT-B1-yield stays RED |
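
Fix A's "exact-number-bounded issue-key match" can be illustrated with a small sketch. The ledger only names the `-(?!\d)` bound; where that lookahead sits in the real resolver is not shown here, so the helper name `mentionsIssueKey`, the key shape (`LETTERS-NUMBER`), and the left-hand boundary are all assumptions made for illustration.

```typescript
// Hypothetical helper (not the factory-sdk resolver): does `text` mention
// exactly the issue `key`, e.g. "AR-23", and not a longer number like "AR-237"?
function mentionsIssueKey(text: string, key: string): boolean {
  const parsed = /^([A-Za-z]+)-(\d+)$/.exec(key);
  if (parsed === null) return false; // fail closed on a malformed key
  const [, prefix, num] = parsed;
  // (?<![A-Za-z0-9]) : the key must not be the tail of a longer token (assumption).
  // (?!\d)           : the number must end here, so "AR-23" never matches "AR-237".
  const pattern = new RegExp(`(?<![A-Za-z0-9])${prefix}-${num}(?!\\d)`, "i");
  return pattern.test(text);
}

console.log(mentionsIssueKey("factory/ar-23-fix-close-probe", "AR-23"));
console.log(mentionsIssueKey("factory/ar-237-creds-unify", "AR-23"));
```

Without the digit bound, a resolver looking for AR-23 would also claim AR-237's branch, which is the over-match class the MUTs in the table guard against.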

**Closed failure classes (tagged by proof level — every "proven" maps to its real proof):**
- **[live]** false-reap (B) · wedge (C re-cycle) · resolver-closes-its-own-fix-PR (C catastrophic-miss control PASSED) · terminal-block-no-recycle (I — the soak re-cycle IS the live proof).
- **[unit]** teardown-race (H) · dedupe-under-K (F):
  - **teardown-race (H):** unit-cert'd (M-H1/M-H2 RED). Live-teardown is a BONUS, NOT a flip-gate; taken only from a GENUINE clean multi-pair opportunity, never engineered (a fresh-seed-to-force-the-condition plan was raised + withdrawn as mild-manufacturing). The overnight broker-name stall left no clean ≥3-pair window, and the natural clean-end SIGTERM gave **NO H-direction live evidence**: the lone hung AR-237 pair SURVIVED the daemon's clean-stop (endSurv=2; abandoned/hung pairs are OUTSIDE the daemon's teardown path, which handles TRACKED in-flight pairs; the pair was cleaned by the external killtree backstop). So the clean-end exercised neither M-H1 (no tracked progressing pair) nor M-H2 (no hung-teardown-step). **H stays [unit]-cert'd (M-H1/M-H2 RED).** _(An earlier "M-H1-adjacent" framing was retracted on the endSurv=2 data: the pair didn't terminate via the daemon teardown, it survived it.)_
  - **dedupe-under-K (F):** unit-only **by design** (a 2nd dispatching daemon would contaminate the soak); F's live axis that WAS spot-checked is hb-freshness (maxHbAge 2s, staleHits 0), not dedupe.

_§2 SHAs are the squash-merge commits on `main` (correct for "merged to main"); each merge GO was bound to the PR-HEAD SHA and head-SHA-gated (e.g. F head 0e11632 → merged-as cc346eb)._

**Binding standing rule established:** wedge-signal is **progress-based** (completions↑/dispatches↑ over a window), **count-independent** (pids>0 false-clears the real dead-PID-full-slots wedge; raw regAgents false-trips on deregister-lag).
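
A minimal sketch of the progress-based rule, assuming the loop is sampled periodically. The `LoopSample` shape and the function name `isWedged` are hypothetical; only the rule itself (compare completions/dispatches across a window, ignore raw counts) comes from the text above.

```typescript
// Hypothetical sample shape; `livePids` and `regAgents` are carried only to
// show that the signal deliberately ignores them.
interface LoopSample {
  completions: number;
  dispatches: number;
  livePids?: number;
  regAgents?: number;
}

// Wedged = neither counter advanced between the first and last sample.
function isWedged(window: readonly LoopSample[]): boolean {
  if (window.length < 2) return false; // not enough evidence to call a wedge
  const first = window[0]!;
  const last = window[window.length - 1]!;
  const progressed =
    last.completions > first.completions || last.dispatches > first.dispatches;
  return !progressed; // count-independent: pids/regAgents are never consulted
}

// Slots full of pids but no progress: still a wedge.
console.log(isWedged([
  { completions: 5, dispatches: 7, livePids: 3 },
  { completions: 5, dispatches: 7, livePids: 3 },
]));
```

A check written as `livePids > 0` would report healthy in exactly this case, which is the false-clear the standing rule describes.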

---

## 3. Live validation (non-vacuous, fv2)

- False-reap-dead 2-process cert + soak `badReap=0`/tick over live in-flight pairs.
- gh-resolver autonomous cycle end-to-end: mount-miss → gh-resolve → #completeIssue → auto-close fresh OPEN PR → slot-free → next-dispatch.
- #287/#280 UNTOUCHED on the real adversarial set (resolver does not close its own fix PRs).
- Soak-core re-cycle: **SUSTAINED ~1.5h** (completions 1→29 ≈6×/key past the 5-key pool, live pid-rotation, mem/fd bounded) **THEN STALLED** on a broker agent-name-reuse collision (relay#1116-family: the broker doesn't release agent names on completion → a fixed 5-key pool jams once all 5 names are stuck-registered → re-dispatch fails "agent already exists"). NOT indefinitely sustained — the stall is a real finding (see §8), not a leak (mem/fd stayed bounded through it).
- F heartbeat-freshness spot-check: maxHbAge 2s, staleHits 0.
- **Endurance soak (pinned daemon 35382 @ 00:11:29 → clean-end @ 28832s / ~8h SIGTERM):**
  - ★ **(D) PROVEN: NO LEAK over ~8h, 465 samples, single continuous pid.** rss min=20 max=70 mean=45MB; first-half-mean 43.1 → last-half-mean 46.8 = **+3.8MB drift** (under the 8MB no-leak threshold; offset stable +4–6MB across 4h/6h/7h/end = not growing). fd min=30 max=38; first-half 31.0 → last-half 30.6 = **flat**. Honest scope: bounded NO-climb over the busy ~1.5h leak-test-under-load (mean 39MB) AND bounded through the idle stall tail (mean 46MB).
  - badReap=0 every tick (coupled reaper stale=False reaped=0). Clean-end hygiene: daemon clean-stopped, broker 68009 untouched, 0 strays.
  - **(E) clean-SIGTERM = 3184ms, rc=0** (clean exit, NOT SIGKILL/137; the #273 force-exit-on-clean-stop working even with an abandoned pair lingering) → NOT slow, **resolves the (E) watch**.
  - ★ **badReap=0 over 477 reaper-alive-ticks (the whole 8h):** the coupled reaper NEVER false-reaped a healthy pair = strong reaper-safety live datapoint.
  - ★ **endSurv=2:** the abandoned hung AR-237 pair SURVIVED the daemon's clean-stop and required the external backstop to clean; abandoned pairs are outside the daemon's teardown path (see §8 (K) process-side twin).
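
The half-vs-half drift check used for the leak verdict can be sketched as follows. This assumes the halves are split by sample count (the soak tooling may split by wall-clock time instead); the function names and the sample values are illustrative, and only the 8MB threshold is taken from the report.

```typescript
function mean(xs: readonly number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Drift = mean of the second half minus mean of the first half.
function halfDrift(samples: readonly number[]): number {
  if (samples.length < 2) throw new Error("need at least two samples");
  const mid = Math.floor(samples.length / 2);
  return mean(samples.slice(mid)) - mean(samples.slice(0, mid));
}

// A run passes when the drift stays under the threshold (8MB in this report).
function passesNoLeak(rssMb: readonly number[], thresholdMb = 8): boolean {
  return halfDrift(rssMb) < thresholdMb;
}

// Synthetic rss samples in MB, not the soak data.
const rss = [40, 42, 44, 46];
console.log(halfDrift(rss), passesNoLeak(rss));
```

A single drift number can hide a slow climb, which is why the report also checks that the offset stayed in the same band at the 4h, 6h, 7h, and end checkpoints rather than growing.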

---

## 4. The 2-process production model

1. **Live daemon:** `factory start --mode live --config <live>` — subscription-driven; writes + refreshes a loop heartbeat (#276; freshness via the #281 queue+yield fix).
2. **External reaper (crash-backstop):** scheduled `fleet factory reap-orphans` — MUST run with **the same `--config` as the live daemon** (coupling vacuity: a mismatched config reaps nothing AND leaves the backstop broken). Gated on staleMs.
3. **Coupling proof:** coupled-tick stale=False over live pairs; crash-inverse SIGKILL→reaped=25/survivors=0.
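
The config-coupling precondition in step 2 lends itself to a pre-flight check. The sketch below compares the `--config` value of the two command lines as plain strings; the helper names and the example path are hypothetical, and it only handles the `--config <value>` form shown above.

```typescript
// Extract the value following `--config`, or undefined when absent.
function configArg(argv: readonly string[]): string | undefined {
  const i = argv.indexOf("--config");
  if (i === -1 || i + 1 >= argv.length) return undefined;
  return argv[i + 1];
}

// Coupled only when both sides name a config and the values are identical.
// Fails closed: a missing --config on either side counts as uncoupled.
function reaperIsCoupled(
  daemonArgv: readonly string[],
  reaperArgv: readonly string[],
): boolean {
  const daemonConfig = configArg(daemonArgv);
  const reaperConfig = configArg(reaperArgv);
  return daemonConfig !== undefined && daemonConfig === reaperConfig;
}

// Hypothetical path, for illustration only.
const daemon = ["factory", "start", "--mode", "live", "--config", "/etc/factory/live.json"];
const reaper = ["fleet", "factory", "reap-orphans", "--config", "/etc/factory/live.json"];
console.log(reaperIsCoupled(daemon, reaper));
```

String equality is deliberately strict: two different spellings of the same file would be rejected, which errs on the side of flagging a possibly vacuous reaper rather than trusting it.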

---

## 5. Graduation boundary (clarified, not a failure)

**Autonomous on WELL-SPECIFIED issues; thin/underspecified issues ESCALATE (no auto-dispatch).** This is a deliberate **safety quality gate**: `isThinIssue` (desc<140 OR no acceptance-signal) → TieredTriage escalates; with no LLM tier configured → escalate, no dispatch.
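
A rough sketch of the thin-issue gate, under stated assumptions: `desc<140` is read as a character count, and the acceptance-signal pattern below is invented for illustration (the real `isThinIssue` heuristic is not shown in this report). The function is named `isThinIssueSketch` to keep it distinct from the real one.

```typescript
// Hypothetical acceptance-signal detector: an "acceptance criteria" phrase
// or a markdown task-list checkbox. The real signal set is an assumption.
const ACCEPTANCE_SIGNAL = /acceptance criteria|- \[[ x]\]/i;

// Thin = short description OR no acceptance signal (the OR from the rule above).
function isThinIssueSketch(description: string): boolean {
  const body = description.trim();
  return body.length < 140 || !ACCEPTANCE_SIGNAL.test(body);
}

type GateResult = "escalate" | "pass";

// Thin issues escalate and are never auto-dispatched; passing this gate is
// necessary but not sufficient (repo scope and reserved-design checks follow).
function thinGate(description: string): GateResult {
  return isThinIssueSketch(description) ? "escalate" : "pass";
}

console.log(thinGate("Fix the login bug"));
```

Because the two conditions are joined with OR, a long description with no acceptance signal still escalates, so length alone never buys an auto-dispatch.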

**The autonomous-handleable class = non-thin AND pear-scoped AND no reserved-for-human design decision.**

**EXHAUSTIVE classification of all 98 non-thin issues → 0 clean candidates** (hand-verified, incl. the 13 the heuristic flagged "unknown"):
- **45** wrong-repo: relay/broker/relaycast/swift/PTY internals.
- **40** demo-app: JSON-CSV/tulip/rose-count (35) + the 5 "Haaland fan-site" pages AR-242–246 — all well-specified + tractable but **misroute to pear** via `repos.default` (their work belongs in a demo/fan-site repo).
- **5** cloud/github-app (AR-106 workspace-normalization etc.).
- **3** auto-generated test artifacts (AR-111/139/140) — not real coding tasks.
- **2** harness CLI-registry (AR-56 open-ended research; AR-57 refresh-CLI-versions = the v8 version-skew class).
- **1** relayfile-tests (AR-16).
- **2** merge-policy/behavioral-sensitive: **AR-99 "Reviewer: merge on green"** + AR-95 integration-behavior (reserved/risk-flagged).
- **1** genuinely pear-scoped non-thin: **AR-239** (factory relayfile creds-unify), but it is **auth-critical** AND carries a **reserved-for-Khaliq design decision** (the fix approach is the owner's call). An autonomous agent would have to guess the reserved design, and the reasoned prediction is that the result would not land clean: sb-review2 would correctness-hold the guessed design. This is a predicted hold, not a tested outcome.
- **genuinely-pear-clean: 0.**

★ **AR-99 boundary illustration (worth surfacing):** the backlog literally asks for "Reviewer: merge on green" — that IS the auto-merge feature we built (#19 / the merge-gate / the flip itself). It's correctly classified reserved + behavioral-sensitive → out-of-autonomous-scope. So the factory built the exact feature its own backlog requested, and that feature is precisely the reserved-sensitive class an autonomous agent must NOT self-implement. The most safety-critical backlog items are the ones correctly NOT autonomously handled — the boundary working as designed.

★ **Pattern (twice):** the lone "pear-scoped non-thin" candidate hid a reserved cross-cutting design call (AR-82→v8/cloud#1932 migration; AR-239→creds-unify approach). The well-specified pear issues that remain are *exactly* the ones a human deliberately reserved the design on. This is the honest truth about the backlog, not a survey gap — and it's why escalate-don't-guess is correct (a naive "latest agent-relay" would hit the v8/cloud#1932 landmine).

**We did NOT enrich-to-manufacture a PR.** `mergePolicy` stays `'never'`; the FLIP WAITS on the real-PR leg.

---

## 6. Flip conditions (when to throw never→on-green-with-review)

ALL of:
- [x] F + H + I merged & cert'd.
- [ ] Sustained endurance soak GREEN (mem/fd-no-leak over hours + sustained re-cycle + reaped=0-coupled + gh-auth-live) — _[morning tally]_.
- [ ] ≥2-3 REAL PRs landed clean (correctness-reviewed) — **PENDING** (no clean candidate tonight).
- [ ] Lead's counting call on what satisfies the real-PR leg.

Flip mechanism (when thrown): head-SHA-guarded + all-required-CI-green + reviewDecision=APPROVED; the synthetic [factory-e2e] discriminator is preserved (synthetic = close-never-merge, even post-flip).
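
The flip mechanism reads as a conjunction of guards, sketched below. The `PrSnapshot` shape, the `"SUCCESS"` conclusion string, the choice to look for `[factory-e2e]` in the PR title, and failing closed on an empty required-check list are all assumptions for illustration; the guard list itself comes from the line above.

```typescript
type MergePolicy = "never" | "on-green-with-review";

// Hypothetical snapshot of the PR state the gate would inspect.
interface PrSnapshot {
  title: string;
  headSha: string;
  requiredChecks: readonly { name: string; conclusion: string }[];
  reviewDecision: string;
}

function mayMerge(policy: MergePolicy, pr: PrSnapshot, reviewedHeadSha: string): boolean {
  if (policy === "never") return false;                    // pre-flip: nothing merges
  if (pr.title.includes("[factory-e2e]")) return false;    // synthetic = close-never-merge
  if (pr.headSha !== reviewedHeadSha) return false;        // head-SHA guard
  if (pr.requiredChecks.length === 0) return false;        // assumption: fail closed
  if (!pr.requiredChecks.every((c) => c.conclusion === "SUCCESS")) return false;
  return pr.reviewDecision === "APPROVED";
}

const pr: PrSnapshot = {
  title: "Fix close-probe over-match",
  headSha: "abc1234",
  requiredChecks: [{ name: "ci", conclusion: "SUCCESS" }],
  reviewDecision: "APPROVED",
};
console.log(mayMerge("never", pr, "abc1234"), mayMerge("on-green-with-review", pr, "abc1234"));
```

Ordering the policy and synthetic checks first means those two refusals hold regardless of CI or review state, matching the "even post-flip" wording.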

---

## 7. Operating preconditions (HARD)

- Scheduled-reaper `--config` == live-daemon `--config` (else vacuous reap + broken backstop).
- **gh-authed operator environment** — the gh-resolver is **completion-load-bearing** while the cloud GitHub→mount PR-sync is stuck (cloud#2108-family). A gh-auth drop halts completion.
- Verify the operator artifact (the exact `node bin/fleet.mjs --config <live>` path), not a shim.

---

## 8. Tracked follow-ups (non-blocking, documented)

- **(c) LLM triage tier:** wire an LLM into TieredTriage so thin issues get specced/escalated-with-context instead of dropped. This is the real fix for the THIN-issue class and a parallel next step, NOT "the" single gate: it handles thin issues, while the multi-repo router below handles the larger demo-app-misroute class. The router is #1-by-volume; the LLM tier is the thin-issue handler. (port-injection, real change, needs review.)
- **(J) registry deregister-on-completion** — transient/bounded tonight (5-key re-cycle overwrites); a GATE before a multi-distinct-real-issue batch (unbounded entry growth there). *Process/registry hygiene.*
- **(K) issue-state-reset-on-abandoned-dispatch** — agent dies/exits BEFORE opening a PR → issue stuck "Agent Implementing" forever (the (C) completion-sweep only resets state when a non-draft PR exists). Fix: on agent-death-without-PR, reset issue state (→Ready/Backlog) + release slot. *Real-tracker issue-state hygiene; distinct from (J).* A GATE before a large real run. Evidence: AR-82 found pre-existing stuck in this state; the hung AR-237 soak pair (84181/84306, ~6.5h, no PR) = a live (K)-instance. ★ **Process-side twin (proven at the 8h clean-end, endSurv=2):** an abandoned/hung pair is OUTSIDE the daemon's own clean-stop teardown path (which handles tracked in-flight pairs) → it SURVIVES a clean daemon SIGTERM and requires the EXTERNAL reaper/backstop to clean. So the external reaper is load-bearing for abandoned-pair cleanup, and the (K) fix should also remove the abandoned agent from the daemon's teardown-tracked set.
- ★ **broker agent-name-not-released-on-completion (relay#1116-family — relay-side sibling of (J)/(K))** — the broker doesn't free an agent's NAME on completion → a fixed-key re-dispatch pool jams ("agent already exists", 68 errors across the 5 keys) once all names are stuck-registered → unattended re-cycle STALLED at ~1.5h on the 5-key soak. A real run with DISTINCT issue names delays it, but the name-leak still accumulates over a long run. **Bounds unattended fixed-key operation** (~1.5h before needing broker-name cleanup). Surface as a relay#1116-family issue ([[surface-relay-layer-fixes]]). NB: the (D) leak cert is INTACT — captured on the busy 1.5h window before the stall; the stall is a workload change, not a leak.
- ★ **multi-repo router (the demo-app-misroute ROOT — the #1 unlock)** — the STATE.md "designed-but-UNPROVEN multi-repo router" manifesting: no working per-issue→repo routing → everything defaults to pear → misroute. **This, not more machinery, is the #1 enabler for a scaled real run:** proving the router (route demo-app/relay/cloud issues to THEIR repos) makes the ~40 well-specified demo-app tasks legitimately dispatchable in-repo — the largest clean-candidate class. ⚠️ Building it requires a new target repo (gh-auth/branch/CI verification) + resolving "do demo-app PRs count as the graduation real-PR proof" — a Khaliq/Lead decision, deliberately NOT built at 3am (over-reach trap).
- **(F) dispatch-throughput + high-watermark-404 cloud-delivery-lag** — cloud#2108-family.
- **cloud#2108** the underlying GitHub→mount PR-sync stuck (routed-around via gh-resolver).

---

## 9. Honest residual & items for Khaliq (full disclosure — nothing hidden)

**Crossed-message race (disclosed, harm = zero):** during the real-candidate evaluation, a crossed-message race seeded AR-239→Ready before a retraction propagated. The dual-catch (reviewer STOP + lead STEER-HALT) caught it **PRE-SPAWN** → **0 agents / 0 PR / 0 auth-design-guess**; the pinned soak daemon (35382) + broker (68009) were untouched. AR-82 was restored to its verified as-found state; AR-239 was left flagged (original genuinely unverifiable → no guess-restore, per the honest-known-altered > confidently-wrong rule), logged to `/tmp/factory-run/KNOWN-ALTERED.md`. **Net residual trace = AR-239 in "Ready for Agent" (inert, flagged) + logged seed events.** This is disclosed both as honesty and as a demonstration that the safety net (steer-halt + reviewer backstop + never-policy + isolation) works under a real race.

**Items needing Khaliq's reset/verification:**
1. **AR-239** = "Ready for Agent" — known-altered by the crossed-race, original state unverifiable. Please reset to intended.
2. **AR-82** = "Agent Implementing": ⚠️ PRE-EXISTING orphaned in-progress state (a prior-session leftover, **not** created this run; restored to as-found, not ours to alter). A real-tracker observation: AR-82 sits in Agent-Implementing with no active agent. Please verify/reset if stale. (Possible orphaned-state-cleanup gap on a prior run; this matches follow-up (K), which cites AR-82 as evidence.)

## 10. Honest morning posture (one line)

GRADUATED + safe + autonomous-self-completing + machinery-validated + endurance-PROVEN (~8h soak, +3.8MB rss drift / fd-flat / badReap=0, clean-ended); the "≥2-3 real PRs landed clean" leg is honestly PENDING because tonight's real backlog contains no clean autonomous-handleable candidate, and we refused to manufacture one. The flip waits. That integrity — applied to the proof itself, not just the certs — is the deliverable.

