You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Factory-SDK Graduation Runbook + Status — 2026-06-14
Status: MACHINERY GRADUATED & VALIDATED. Autonomous flip (mergePolicy never→on-green-with-review) HONESTLY PENDING the real-PR leg.
This is the disciplined outcome: every gate that could be proven was proven and mutation-cert'd; the one gate tonight's real backlog could not honestly supply (≥2-3 real PRs landed clean) is reported as pending rather than manufactured.
1. TL;DR
The factory-sdk autonomous loop (discovery → triage → dispatch → PR → PR-state completion → merge-gate → close → release) is built, merged to main, unit-cert'd RED-on-revert, and live-validated on a continuous synthetic soak. The 2-process production model (live daemon + external reaper) is PROVEN non-vacuous over an ~8h continuous soak (465 samples, single pinned pid 35382, clean-ended at 8h by intentional SIGTERM): rss bounded 20–70MB, +3.8MB drift over 8h (well under the 8MB no-leak threshold; the half-vs-half offset stayed stable +4–6MB across the 4h/6h/7h/end checks = NOT growing = no slow leak), fd flat (31.0→30.6), badReap=0 across the whole run. Leak-test-under-load = the busy ~1.5h heavy-cycling window (completions 1→29 ≈6×/key, mean 39MB bounded); bounded through the idle stall tail (mean 46MB).
The autonomous merge flip is intentionally NOT thrown tonight. The flip-gate requires ≥2-3 real PRs landed clean, and tonight's real backlog yielded zero clean autonomous-handleable candidates (see §5). We did not enrich/manufacture a candidate — a manufactured proof isn't a proof. mergePolicy stays 'never'; the flip waits for genuine candidates (or the LLM triage tier).
2. What graduated — machinery ledger (cert facts, authoritative via sb-review2)
All merged to main, each with mutation tests proven RED-on-revert:
throughput, dedupe-under-K (serial-prep-before-parallel → no concurrent double-dispatch), dedupe-identity, :2433, MUT-B1-yield stays RED
Closed failure classes (tagged by proof level — every "proven" maps to its real proof):
[live] false-reap (B) · wedge (C re-cycle) · resolver-closes-its-own-fix-PR (C catastrophic-miss control PASSED) · terminal-block-no-recycle (I — the soak re-cycle IS the live proof).
[unit] teardown-race (H) — unit-cert'd (M-H1/M-H2 RED). Live-teardown is a BONUS, NOT a flip-gate; taken only from a GENUINE clean multi-pair opportunity, never engineered (a fresh-seed-to-force-the-condition plan was raised + withdrawn as mild-manufacturing). The overnight broker-name stall left no clean ≥3-pair window; and the natural clean-end SIGTERM gave NO H-direction live evidence — the lone hung AR-237 pair SURVIVED the daemon's clean-stop (endSurv=2: abandoned/hung pairs are OUTSIDE the daemon's teardown path, which handles TRACKED in-flight pairs; the pair was cleaned by the external killtree backstop). So the clean-end exercised neither M-H1 (no tracked progressing pair) nor M-H2 (no hung-teardown-step). H stays [unit]-cert'd (M-H1/M-H2 RED).(An earlier "M-H1-adjacent" framing was retracted on the endSurv=2 data — the pair didn't terminate via the daemon teardown, it survived it.) · dedupe-under-K (F) — unit-only by design (a 2nd dispatching daemon would contaminate the soak); F's live axis that WAS spot-checked is hb-freshness (maxHbAge 2s, staleHits 0), not dedupe.
§2 SHAs are the squash-merge commits on main (correct for "merged to main"); each merge GO was bound to the PR-HEAD SHA and head-SHA-gated (e.g. F head 0e11632 → merged-as cc346eb).
Binding standing rule established: wedge-signal is progress-based (completions↑/dispatches↑ over a window), count-independent (pids>0 false-clears the real dead-PID-full-slots wedge; raw regAgents false-trips on deregister-lag).
3. Live validation (non-vacuous, fv2)
False-reap-dead 2-process cert + soak badReap=0/tick over live in-flight pairs.
Soak-core re-cycle: SUSTAINED ~1.5h (completions 1→29 ≈6×/key past the 5-key pool, live pid-rotation, mem/fd bounded) THEN STALLED on a broker agent-name-reuse collision (relay#1116-family: the broker doesn't release agent names on completion → a fixed 5-key pool jams once all 5 names are stuck-registered → re-dispatch fails "agent already exists"). NOT indefinitely sustained — the stall is a real finding (see §8), not a leak (mem/fd stayed bounded through it).
F heartbeat-freshness spot-check: maxHbAge 2s, staleHits 0.
Endurance soak (pinned daemon 35382 @ 00:11:29 → clean-end @ 28832s / ~8h SIGTERM): ★ (D) PROVEN — NO LEAK over ~8h, 465 samples, single continuous pid. rss min=20 max=70 mean=45MB; first-half-mean 43.1 → last-half-mean 46.8 = +3.8MB drift (under the 8MB no-leak threshold; offset stable +4–6MB across 4h/6h/7h/end = not growing). fd min=30 max=38; first-half 31.0 → last-half 30.6 = flat. Honest scope: bounded NO-climb over the busy ~1.5h leak-test-under-load (mean 39MB) AND bounded through the idle stall tail (mean 46MB). badReap=0 every tick (coupled reaper stale=False reaped=0). Clean-end hygiene: daemon clean-stopped, broker 68009 untouched, 0 strays. (E) clean-SIGTERM = 3184ms, rc=0 (clean exit, NOT SIGKILL/137 — the Force factory daemon exit after graceful stop #273 force-exit-on-clean-stop working even with an abandoned pair lingering) → NOT slow, resolves the (E) watch. ★ badReap=0 over 477 reaper-alive-ticks (the whole 8h) — the coupled reaper NEVER false-reaped a healthy pair = strong reaper-safety live datapoint. ★ endSurv=2: the abandoned hung AR-237 pair SURVIVED the daemon's clean-stop and required the external backstop to clean — abandoned pairs are outside the daemon's teardown path (see §8 (K) process-side twin).
External reaper (crash-backstop): scheduled fleet factory reap-orphans — MUST run with the same --config as the live daemon (coupling vacuity: a mismatched config reaps nothing AND leaves the backstop broken). Gated on staleMs.
Coupling proof: coupled-tick stale=False over live pairs; crash-inverse SIGKILL→reaped=25/survivors=0.
5. Graduation boundary (clarified, not a failure)
Autonomous on WELL-SPECIFIED issues; thin/underspecified issues ESCALATE (no auto-dispatch). This is a deliberate safety quality gate: isThinIssue (desc<140 OR no acceptance-signal) → TieredTriage escalates; with no LLM tier configured → escalate, no dispatch.
The autonomous-handleable class = non-thin AND pear-scoped AND no reserved-for-human design decision.
EXHAUSTIVE classification of all 98 non-thin issues → 0 clean candidates (hand-verified, incl. the 13 the heuristic flagged "unknown"):
40 demo-app: JSON-CSV/tulip/rose-count (35) + the 5 "Haaland fan-site" pages AR-242–246 — all well-specified + tractable but misroute to pear via repos.default (their work belongs in a demo/fan-site repo).
1 genuinely pear-scoped non-thin: AR-239 (factory relayfile creds-unify) — but auth-critical AND carries a reserved-for-Khaliq design decision (fix approach is the owner's call) → an autonomous agent would guess the reserved design → (reasoned prediction) would not land clean — sb-review2 would correctness-hold the guessed reserved-design; a predicted hold, not a tested outcome.
genuinely-pear-clean: 0.
★ AR-99 boundary illustration (worth surfacing): the backlog literally asks for "Reviewer: merge on green" — that IS the auto-merge feature we built (#19 / the merge-gate / the flip itself). It's correctly classified reserved + behavioral-sensitive → out-of-autonomous-scope. So the factory built the exact feature its own backlog requested, and that feature is precisely the reserved-sensitive class an autonomous agent must NOT self-implement. The most safety-critical backlog items are the ones correctly NOT autonomously handled — the boundary working as designed.
★ Pattern (twice): the lone "pear-scoped non-thin" candidate hid a reserved cross-cutting design call (AR-82→v8/cloud#1932 migration; AR-239→creds-unify approach). The well-specified pear issues that remain are exactly the ones a human deliberately reserved the design on. This is the honest truth about the backlog, not a survey gap — and it's why escalate-don't-guess is correct (a naive "latest agent-relay" would hit the v8/cloud#1932 landmine).
We did NOT enrich-to-manufacture a PR.mergePolicy stays 'never'; the FLIP WAITS on the real-PR leg.
6. Flip conditions (when to throw never→on-green-with-review)
ALL of:
F + H + I merged & cert'd.
Sustained endurance soak GREEN (mem/fd-no-leak over hours + sustained re-cycle + reaped=0-coupled + gh-auth-live) — [morning tally].
≥2-3 REAL PRs landed clean (correctness-reviewed) — PENDING (no clean candidate tonight).
Lead's counting call on what satisfies the real-PR leg.
Flip mechanism (when thrown): head-SHA-guarded + all-required-CI-green + reviewDecision=APPROVED; the synthetic [factory-e2e] discriminator is preserved (synthetic = close-never-merge, even post-flip).
gh-authed operator environment — the gh-resolver is completion-load-bearing while the cloud GitHub→mount PR-sync is stuck (cloud#2108-family). A gh-auth drop halts completion.
Verify the operator artifact (the exact node bin/fleet.mjs --config <live> path), not a shim.
8. Tracked follow-ups (non-blocking, documented)
(c) LLM triage tier — wire an LLM into TieredTriage so thin issues get specced/escalated-with-context instead of dropped. The real fix for the THIN-issue class. = a parallel next-step (NOT "the" single gate — it handles thin issues; the multi-repo router below handles the larger demo-app-misroute class). Router is feat: add pear logo in sidebar header #1-by-volume; LLM-tier is the thin-issue handler. (port-injection, real change, needs review.)
(J) registry deregister-on-completion — transient/bounded tonight (5-key re-cycle overwrites); a GATE before a multi-distinct-real-issue batch (unbounded entry growth there). Process/registry hygiene.
(K) issue-state-reset-on-abandoned-dispatch — agent dies/exits BEFORE opening a PR → issue stuck "Agent Implementing" forever (the (C) completion-sweep only resets state when a non-draft PR exists). Fix: on agent-death-without-PR, reset issue state (→Ready/Backlog) + release slot. Real-tracker issue-state hygiene; distinct from (J). A GATE before a large real run. Evidence: AR-82 found pre-existing stuck in this state; the hung AR-237 soak pair (84181/84306, ~6.5h, no PR) = a live (K)-instance. ★ Process-side twin (proven at the 8h clean-end, endSurv=2): an abandoned/hung pair is OUTSIDE the daemon's own clean-stop teardown path (which handles tracked in-flight pairs) → it SURVIVES a clean daemon SIGTERM and requires the EXTERNAL reaper/backstop to clean. So the external reaper is load-bearing for abandoned-pair cleanup, and the (K) fix should also remove the abandoned agent from the daemon's teardown-tracked set.
★ broker agent-name-not-released-on-completion (relay#1116-family — relay-side sibling of (J)/(K)) — the broker doesn't free an agent's NAME on completion → a fixed-key re-dispatch pool jams ("agent already exists", 68 errors across the 5 keys) once all names are stuck-registered → unattended re-cycle STALLED at ~1.5h on the 5-key soak. A real run with DISTINCT issue names delays it, but the name-leak still accumulates over a long run. Bounds unattended fixed-key operation (~1.5h before needing broker-name cleanup). Surface as a relay#1116-family issue ([[surface-relay-layer-fixes]]). NB: the (D) leak cert is INTACT — captured on the busy 1.5h window before the stall; the stall is a workload change, not a leak.
★ multi-repo router (the demo-app-misroute ROOT — the feat: add pear logo in sidebar header #1 unlock) — the STATE.md "designed-but-UNPROVEN multi-repo router" manifesting: no working per-issue→repo routing → everything defaults to pear → misroute. This, not more machinery, is the feat: add pear logo in sidebar header #1 enabler for a scaled real run: proving the router (route demo-app/relay/cloud issues to THEIR repos) makes the ~40 well-specified demo-app tasks legitimately dispatchable in-repo — the largest clean-candidate class. ⚠️ Building it requires a new target repo (gh-auth/branch/CI verification) + resolving "do demo-app PRs count as the graduation real-PR proof" — a Khaliq/Lead decision, deliberately NOT built at 3am (over-reach trap).
Crossed-message race (disclosed, harm = zero): during the real-candidate evaluation, a crossed-message race seeded AR-239→Ready before a retraction propagated. The dual-catch (reviewer STOP + lead STEER-HALT) caught it PRE-SPAWN → 0 agents / 0 PR / 0 auth-design-guess; the pinned soak daemon (35382) + broker (68009) were untouched. AR-82 was restored to its verified as-found state; AR-239 was left flagged (original genuinely unverifiable → no guess-restore, per the honest-known-altered > confidently-wrong rule), logged to /tmp/factory-run/KNOWN-ALTERED.md. Net residual trace = AR-239 in "Ready for Agent" (inert, flagged) + logged seed events. This is disclosed both as honesty and as a demonstration that the safety net (steer-halt + reviewer backstop + never-policy + isolation) works under a real race.
Items needing Khaliq's reset/verification:
AR-239 = "Ready for Agent" — known-altered by the crossed-race, original state unverifiable. Please reset to intended.
AR-82 = "Agent Implementing" — ⚠️ PRE-EXISTING orphaned in-progress state (a prior-session leftover, not created this run; restored to as-found, not ours to alter). A real-tracker observation: AR-82 sits in Agent-Implementing with no active agent. Please verify/reset if stale. (Possible orphaned-state-cleanup gap on a prior run — sibling to follow-up (J).)
10. Honest morning posture (one line)
GRADUATED + safe + autonomous-self-completing + machinery-validated + endurance-PROVEN (~8h soak, +3.8MB rss drift / fd-flat / badReap=0, clean-ended); the "≥2-3 real PRs landed clean" leg is honestly PENDING because tonight's real backlog contains no clean autonomous-handleable candidate, and we refused to manufacture one. The flip waits. That integrity — applied to the proof itself, not just the certs — is the deliverable.
Factory-SDK Graduation Runbook + Status — 2026-06-14
1. TL;DR
The factory-sdk autonomous loop (discovery → triage → dispatch → PR → PR-state completion → merge-gate → close → release) is built, merged to main, unit-cert'd RED-on-revert, and live-validated on a continuous synthetic soak. The 2-process production model (live daemon + external reaper) is PROVEN non-vacuous over an ~8h continuous soak (465 samples, single pinned pid 35382, clean-ended at 8h by intentional SIGTERM): rss bounded 20–70MB, +3.8MB drift over 8h (well under the 8MB no-leak threshold; the half-vs-half offset stayed stable +4–6MB across the 4h/6h/7h/end checks = NOT growing = no slow leak), fd flat (31.0→30.6), badReap=0 across the whole run. Leak-test-under-load = the busy ~1.5h heavy-cycling window (completions 1→29 ≈6×/key, mean 39MB bounded); bounded through the idle stall tail (mean 46MB).
The autonomous merge flip is intentionally NOT thrown tonight. The flip-gate requires ≥2-3 real PRs landed clean, and tonight's real backlog yielded zero clean autonomous-handleable candidates (see §5). We did not enrich/manufacture a candidate — a manufactured proof isn't a proof.
mergePolicystays'never'; the flip waits for genuine candidates (or the LLM triage tier).2. What graduated — machinery ledger (cert facts, authoritative via sb-review2)
All merged to
main, each with mutation tests proven RED-on-revert:-(?!\d), issue-gate, operator-wiring (caught the M2 operator-flag vacuity on re-cert)Closed failure classes (tagged by proof level — every "proven" maps to its real proof):
§2 SHAs are the squash-merge commits on
main(correct for "merged to main"); each merge GO was bound to the PR-HEAD SHA and head-SHA-gated (e.g. F head 0e11632 → merged-as cc346eb).Binding standing rule established: wedge-signal is progress-based (completions↑/dispatches↑ over a window), count-independent (pids>0 false-clears the real dead-PID-full-slots wedge; raw regAgents false-trips on deregister-lag).
3. Live validation (non-vacuous, fv2)
badReap=0/tick over live in-flight pairs.4. The 2-process production model
factory start --mode live --config <live>— subscription-driven; writes + refreshes a loop heartbeat (Write live daemon heartbeat #276; freshness via the Yield live event drains to protect heartbeat #281 queue+yield fix).fleet factory reap-orphans— MUST run with the same--configas the live daemon (coupling vacuity: a mismatched config reaps nothing AND leaves the backstop broken). Gated on staleMs.5. Graduation boundary (clarified, not a failure)
Autonomous on WELL-SPECIFIED issues; thin/underspecified issues ESCALATE (no auto-dispatch). This is a deliberate safety quality gate:
isThinIssue(desc<140 OR no acceptance-signal) → TieredTriage escalates; with no LLM tier configured → escalate, no dispatch.The autonomous-handleable class = non-thin AND pear-scoped AND no reserved-for-human design decision.
EXHAUSTIVE classification of all 98 non-thin issues → 0 clean candidates (hand-verified, incl. the 13 the heuristic flagged "unknown"):
repos.default(their work belongs in a demo/fan-site repo).★ AR-99 boundary illustration (worth surfacing): the backlog literally asks for "Reviewer: merge on green" — that IS the auto-merge feature we built (#19 / the merge-gate / the flip itself). It's correctly classified reserved + behavioral-sensitive → out-of-autonomous-scope. So the factory built the exact feature its own backlog requested, and that feature is precisely the reserved-sensitive class an autonomous agent must NOT self-implement. The most safety-critical backlog items are the ones correctly NOT autonomously handled — the boundary working as designed.
★ Pattern (twice): the lone "pear-scoped non-thin" candidate hid a reserved cross-cutting design call (AR-82→v8/cloud#1932 migration; AR-239→creds-unify approach). The well-specified pear issues that remain are exactly the ones a human deliberately reserved the design on. This is the honest truth about the backlog, not a survey gap — and it's why escalate-don't-guess is correct (a naive "latest agent-relay" would hit the v8/cloud#1932 landmine).
We did NOT enrich-to-manufacture a PR.
mergePolicystays'never'; the FLIP WAITS on the real-PR leg.6. Flip conditions (when to throw never→on-green-with-review)
ALL of:
Flip mechanism (when thrown): head-SHA-guarded + all-required-CI-green + reviewDecision=APPROVED; the synthetic [factory-e2e] discriminator is preserved (synthetic = close-never-merge, even post-flip).
7. Operating preconditions (HARD)
--config== live-daemon--config(else vacuous reap + broken backstop).node bin/fleet.mjs --config <live>path), not a shim.8. Tracked follow-ups (non-blocking, documented)
9. Honest residual & items for Khaliq (full disclosure — nothing hidden)
Crossed-message race (disclosed, harm = zero): during the real-candidate evaluation, a crossed-message race seeded AR-239→Ready before a retraction propagated. The dual-catch (reviewer STOP + lead STEER-HALT) caught it PRE-SPAWN → 0 agents / 0 PR / 0 auth-design-guess; the pinned soak daemon (35382) + broker (68009) were untouched. AR-82 was restored to its verified as-found state; AR-239 was left flagged (original genuinely unverifiable → no guess-restore, per the honest-known-altered > confidently-wrong rule), logged to
/tmp/factory-run/KNOWN-ALTERED.md. Net residual trace = AR-239 in "Ready for Agent" (inert, flagged) + logged seed events. This is disclosed both as honesty and as a demonstration that the safety net (steer-halt + reviewer backstop + never-policy + isolation) works under a real race.Items needing Khaliq's reset/verification:
10. Honest morning posture (one line)
GRADUATED + safe + autonomous-self-completing + machinery-validated + endurance-PROVEN (~8h soak, +3.8MB rss drift / fd-flat / badReap=0, clean-ended); the "≥2-3 real PRs landed clean" leg is honestly PENDING because tonight's real backlog contains no clean autonomous-handleable candidate, and we refused to manufacture one. The flip waits. That integrity — applied to the proof itself, not just the certs — is the deliverable.