
Factory-SDK Graduation — Runbook + Status (2026-06-14): machinery graduated & validated; autonomous merge-flip honestly pending #321


@kjgbot

Factory-SDK Graduation Runbook + Status — 2026-06-14

Status: MACHINERY GRADUATED & VALIDATED. Autonomous flip (mergePolicy never→on-green-with-review) HONESTLY PENDING the real-PR leg.
This is the disciplined outcome: every gate that could be proven was proven and mutation-cert'd; the one gate tonight's real backlog could not honestly supply (≥2-3 real PRs landed clean) is reported as pending rather than manufactured.


1. TL;DR

The factory-sdk autonomous loop (discovery → triage → dispatch → PR → PR-state completion → merge-gate → close → release) is built, merged to main, unit-cert'd RED-on-revert, and live-validated on a continuous synthetic soak. The 2-process production model (live daemon + external reaper) is PROVEN non-vacuous over an ~8h continuous soak (465 samples, single pinned pid 35382, clean-ended at 8h by intentional SIGTERM): rss bounded 20–70MB, +3.8MB drift over 8h (well under the 8MB no-leak threshold; the half-vs-half offset stayed stable +4–6MB across the 4h/6h/7h/end checks = NOT growing = no slow leak), fd flat (31.0→30.6), badReap=0 across the whole run. Leak-test-under-load = the busy ~1.5h heavy-cycling window (completions 1→29 ≈6×/key, mean 39MB bounded); bounded through the idle stall tail (mean 46MB).

The autonomous merge flip is intentionally NOT thrown tonight. The flip-gate requires ≥2-3 real PRs landed clean, and tonight's real backlog yielded zero clean autonomous-handleable candidates (see §5). We did not enrich/manufacture a candidate — a manufactured proof isn't a proof. mergePolicy stays 'never'; the flip waits for genuine candidates (or the LLM triage tier).


2. What graduated — machinery ledger (cert facts, authoritative via sb-review2)

All merged to main, each with mutation tests proven RED-on-revert:

| Fix | PR (SHA) | What | Key MUTs proven RED |
| --- | --- | --- | --- |
| A close-probe | #280 (7c4ab03) | markerless synthetic-close + exact-number-bounded issue-key match + over-match guard | closer-relax, resolver-relax, exact-bound -(?!\d), issue-gate, operator-wiring (caught the M2 operator-flag vacuity on re-cert) |
| B heartbeat-starvation | #281 (c652420) | false-reap-via-event-loop-starvation fix (enqueue→bounded drain + yield) | MUT-B1 (revert→void #handleLiveChange) RED; :2433 replay-suppression stabilized (MUT-S RED) |
| C completion sweep | #287 (96e6324) | PR-state completion sweep (the WEDGE fix) + gh-resolver fold | wedge, draft-skip, coalesce-idempotent, gh-resolver over-match-catastrophic (rejects its own fix PRs #287/#280), fail-closed, draft-via-gh, gh-not-found-backoff; bot drive-by 1f478a0 caught+verified-safe |
| H teardown-race | #290 (ae6486c) | refresh 'stopping' heartbeat during teardown | progressing-protected (M-H1) AND wedged-still-reaped (M-H2: wedge-detection preserved) |
| I terminal-clear-on-reopen | #294 (a913949) | re-cycle enabler: clear dispatch-terminal on canonical Done→Ready | clear-enables-redispatch (M-I1) AND canonical-only safety (M-I2: by-state alias flap does NOT clear) |
| F bounded-parallel drain | #299 (cc346eb) | dispatch throughput recovery under replay backlog | throughput, dedupe-under-K (serial-prep-before-parallel → no concurrent double-dispatch), dedupe-identity, :2433, MUT-B1-yield stays RED |
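
The "exact-number-bounded issue-key match" in row A guards against a short key matching inside a longer one (AR-23 inside AR-237). A minimal sketch of that kind of bound, assuming keys have the form PREFIX-NUMBER and that the searched text is a branch name or PR title. `matchesIssueKey` and `escapeRegExp` are hypothetical names, and the pattern only illustrates the negative-lookahead idea behind `(?!\d)`; it is not the factory's actual matcher.

```typescript
// Escape regex metacharacters so the issue key is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: true only when `text` mentions exactly `key`.
// - left bound: start of text or a non-alphanumeric character
// - right bound: (?!\d) rejects a trailing digit, so AR-23 never matches AR-237
function matchesIssueKey(text: string, key: string): boolean {
  const pattern = new RegExp(
    "(^|[^A-Za-z0-9])" + escapeRegExp(key) + "(?!\\d)",
    "i",
  );
  return pattern.test(text);
}

// matchesIssueKey("factory/ar-23-fix", "AR-23")     // true
// matchesIssueKey("factory/ar-237-creds", "AR-23")  // false
```

Without the right-hand bound, a closer keyed on a short issue number could over-match PRs that belong to a longer one, which is the failure the over-match guard is meant to rule out.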

Closed failure classes (tagged by proof level — every "proven" maps to its real proof):

  • [live] false-reap (B) · wedge (C re-cycle) · resolver-closes-its-own-fix-PR (C catastrophic-miss control PASSED) · terminal-block-no-recycle (I — the soak re-cycle IS the live proof).
  • [unit] teardown-race (H): unit-cert'd (M-H1/M-H2 RED). Live teardown is a BONUS, NOT a flip-gate; it is taken only from a GENUINE clean multi-pair opportunity, never engineered (a fresh-seed-to-force-the-condition plan was raised and withdrawn as mild manufacturing). The overnight broker-name stall left no clean ≥3-pair window, and the natural clean-end SIGTERM gave NO H-direction live evidence: the lone hung AR-237 pair SURVIVED the daemon's clean-stop (endSurv=2). Abandoned/hung pairs are OUTSIDE the daemon's teardown path, which handles TRACKED in-flight pairs; the pair was cleaned by the external killtree backstop. So the clean-end exercised neither M-H1 (no tracked progressing pair) nor M-H2 (no hung teardown step), and H stays [unit]-cert'd. (An earlier "M-H1-adjacent" framing was retracted on the endSurv=2 data: the pair did not terminate via the daemon teardown, it survived it.)
  • [unit] dedupe-under-K (F): unit-only by design (a second dispatching daemon would contaminate the soak). F's live axis that WAS spot-checked is hb-freshness (maxHbAge 2s, staleHits 0), not dedupe.

§2 SHAs are the squash-merge commits on main (correct for "merged to main"); each merge GO was bound to the PR-HEAD SHA and head-SHA-gated (e.g. F head 0e11632 → merged-as cc346eb).

Binding standing rule established: wedge-signal is progress-based (completions↑/dispatches↑ over a window), count-independent (pids>0 false-clears the real dead-PID-full-slots wedge; raw regAgents false-trips on deregister-lag).
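
A minimal sketch of a progress-based wedge signal, under the assumption that the loop exposes monotonically increasing `completions` and `dispatches` counters sampled over a window. `ProgressSample` and `isWedged` are hypothetical names. Pid counts and registered-agent counts are deliberately not inputs, per the rule above.

```typescript
// One observation of the loop's cumulative counters (hypothetical shape).
interface ProgressSample {
  completions: number;
  dispatches: number;
}

// Wedged = neither counter advanced between the first and last sample.
// Fewer than two samples is treated as "not enough data", not as a wedge.
function isWedged(samples: ProgressSample[]): boolean {
  if (samples.length < 2) return false;
  const first = samples[0];
  const last = samples[samples.length - 1];
  if (!first || !last) return false;
  const progressed =
    last.completions > first.completions ||
    last.dispatches > first.dispatches;
  return !progressed;
}
```

Because only counter movement is inspected, a full set of dead-PID slots still reads as wedged, and deregister lag in the registry cannot trip the signal on its own.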


3. Live validation (non-vacuous, fv2)

  • False-reap-dead 2-process cert + soak badReap=0/tick over live in-flight pairs.
  • gh-resolver autonomous cycle end-to-end: mount-miss → gh-resolve → #completeIssue → auto-close fresh OPEN PR → slot-free → next-dispatch.
  • PRs #287 ("Add PR-state completion sweep") and #280 ("Fix synthetic probe PR close matching") left UNTOUCHED on the real adversarial set (the resolver does not close its own fix PRs).
  • Soak-core re-cycle: SUSTAINED ~1.5h (completions 1→29 ≈6×/key past the 5-key pool, live pid-rotation, mem/fd bounded) THEN STALLED on a broker agent-name-reuse collision (relay#1116-family: the broker doesn't release agent names on completion → a fixed 5-key pool jams once all 5 names are stuck-registered → re-dispatch fails "agent already exists"). NOT indefinitely sustained — the stall is a real finding (see §8), not a leak (mem/fd stayed bounded through it).
  • F heartbeat-freshness spot-check: maxHbAge 2s, staleHits 0.
  • Endurance soak (pinned daemon 35382 @ 00:11:29 → clean-end @ 28832s / ~8h by SIGTERM):
      (D) PROVEN: NO LEAK over ~8h, 465 samples, single continuous pid.
          rss: min=20, max=70, mean=45MB; first-half mean 43.1 → last-half mean 46.8 = +3.8MB drift (under the 8MB no-leak threshold; offset stable at +4–6MB across the 4h/6h/7h/end checks = not growing).
          fd: min=30, max=38; first-half 31.0 → last-half 30.6 = flat.
          Honest scope: bounded NO-climb over the busy ~1.5h leak-test-under-load (mean 39MB) AND bounded through the idle stall tail (mean 46MB).
          badReap=0 every tick (coupled reaper stale=False, reaped=0).
          Clean-end hygiene: daemon clean-stopped, broker 68009 untouched, 0 strays.
      (E) clean SIGTERM = 3184ms, rc=0 (clean exit, NOT SIGKILL/137; the #273 "Force factory daemon exit after graceful stop" force-exit-on-clean-stop works even with an abandoned pair lingering) → NOT slow; resolves the (E) watch.
      ★ badReap=0 over 477 reaper-alive-ticks (the whole 8h): the coupled reaper NEVER false-reaped a healthy pair = strong reaper-safety live datapoint.
      ★ endSurv=2: the abandoned hung AR-237 pair SURVIVED the daemon's clean-stop and required the external backstop to clean; abandoned pairs are outside the daemon's teardown path (see §8 (K) process-side twin).
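
The no-leak verdict rests on a half-vs-half comparison: the mean of the first half of the samples against the mean of the second half, checked against the 8MB threshold. A minimal sketch of that calculation, assuming the samples are rss readings in MB in time order; `halfVsHalfDrift` and `DriftResult` are hypothetical names, and the numbers in the usage comment are illustrative, not soak data.

```typescript
function mean(xs: number[]): number {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum / xs.length;
}

interface DriftResult {
  firstHalfMean: number;
  lastHalfMean: number;
  driftMb: number;
  leakSuspected: boolean;
}

// Compare the mean of the first half of the series with the mean of the rest.
// A drift at or above `thresholdMb` is flagged; below it, the series counts as bounded.
function halfVsHalfDrift(rssMb: number[], thresholdMb: number): DriftResult {
  if (rssMb.length < 2) throw new Error("need at least two samples");
  const mid = Math.floor(rssMb.length / 2);
  const firstHalfMean = mean(rssMb.slice(0, mid));
  const lastHalfMean = mean(rssMb.slice(mid));
  const driftMb = lastHalfMean - firstHalfMean;
  return {
    firstHalfMean,
    lastHalfMean,
    driftMb,
    leakSuspected: driftMb >= thresholdMb,
  };
}

// Illustrative series only: halfVsHalfDrift([40, 42, 44, 46], 8).driftMb === 4
```

A single drift figure cannot distinguish a one-time step from a slow climb, which is why the report also tracks the offset at several checkpoints (4h/6h/7h/end) and requires it to stay stable.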

4. The 2-process production model

  1. Live daemon: factory start --mode live --config <live>; subscription-driven; writes + refreshes a loop heartbeat (#276, "Write live daemon heartbeat"; freshness via the #281 "Yield live event drains to protect heartbeat" queue+yield fix).
  2. External reaper (crash-backstop): scheduled fleet factory reap-orphans — MUST run with the same --config as the live daemon (coupling vacuity: a mismatched config reaps nothing AND leaves the backstop broken). Gated on staleMs.
  3. Coupling proof: coupled-tick stale=False over live pairs; crash-inverse SIGKILL→reaped=25/survivors=0.
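
Heartbeat freshness in item 1 depends on event handling never monopolizing the event loop. The enqueue → bounded drain + yield shape can be sketched as below, assuming a single-threaded loop where the heartbeat runs on a timer. `BoundedDrain`, `maxPerTick`, and the injected `schedule` callback are hypothetical; in Node the scheduler would plausibly wrap `setImmediate`, but that wiring is an assumption, not the #281 implementation.

```typescript
type Schedule = (fn: () => void) => void;

class BoundedDrain<T> {
  private queue: T[] = [];
  private scheduled = false;
  private handle: (item: T) => void;
  private maxPerTick: number;
  private schedule: Schedule;

  constructor(handle: (item: T) => void, maxPerTick: number, schedule: Schedule) {
    this.handle = handle;
    this.maxPerTick = maxPerTick;
    this.schedule = schedule; // e.g. (fn) => setImmediate(fn) in Node (assumption)
  }

  // Enqueue never processes inline; it only requests a drain turn.
  enqueue(item: T): void {
    this.queue.push(item);
    this.requestDrain();
  }

  pending(): number {
    return this.queue.length;
  }

  private requestDrain(): void {
    if (this.scheduled) return; // at most one drain turn outstanding
    this.scheduled = true;
    this.schedule(() => this.drainOnce());
  }

  private drainOnce(): void {
    this.scheduled = false;
    const batch = this.queue.splice(0, this.maxPerTick);
    for (const item of batch) this.handle(item);
    // Yield: leftover items wait for a later turn so timers can run in between.
    if (this.queue.length > 0) this.requestDrain();
  }
}
```

Capping each turn at `maxPerTick` items bounds how long a replay backlog can delay the heartbeat timer, which is the starvation path that led to false reaps.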

5. Graduation boundary (clarified, not a failure)

Autonomous on WELL-SPECIFIED issues; thin/underspecified issues ESCALATE (no auto-dispatch). This is a deliberate safety quality gate: isThinIssue (desc<140 OR no acceptance-signal) → TieredTriage escalates; with no LLM tier configured → escalate, no dispatch.

The autonomous-handleable class = non-thin AND pear-scoped AND no reserved-for-human design decision.
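
The two predicates above can be sketched as follows. The 140-character bound and the three-part handleable rule come from the text; everything else is an assumption: the `Issue` shape, the `repoScope` and `reservedForHuman` fields, and in particular `ACCEPTANCE_MARKERS`, which stands in for an acceptance-signal heuristic the report does not spell out.

```typescript
interface Issue {
  key: string;
  description: string;
  repoScope: string;          // hypothetical: which repo the work belongs to
  reservedForHuman: boolean;  // hypothetical: owner reserved a design decision
}

const MIN_DESCRIPTION_CHARS = 140; // from "desc<140"

// Assumption: a few marker phrases approximate an "acceptance signal".
const ACCEPTANCE_MARKERS = ["acceptance criteria", "expected:", "done when"];

function hasAcceptanceSignal(description: string): boolean {
  const lower = description.toLowerCase();
  for (const marker of ACCEPTANCE_MARKERS) {
    if (lower.indexOf(marker) !== -1) return true;
  }
  return false;
}

// Thin = too short OR no acceptance signal. Thin issues escalate;
// with no LLM tier configured they are never dispatched.
function isThinIssue(issue: Issue): boolean {
  return (
    issue.description.length < MIN_DESCRIPTION_CHARS ||
    !hasAcceptanceSignal(issue.description)
  );
}

// Autonomous-handleable = non-thin AND pear-scoped AND no reserved design decision.
function isAutonomousHandleable(issue: Issue): boolean {
  return (
    !isThinIssue(issue) &&
    issue.repoScope === "pear" &&
    !issue.reservedForHuman
  );
}
```

This is a classification predicate only. In the run reported here the repo-scope and reserved-decision checks were made by hand, not by code.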

EXHAUSTIVE classification of all 98 non-thin issues → 0 clean candidates (hand-verified, incl. the 13 the heuristic flagged "unknown"):

  • 45 wrong-repo: relay/broker/relaycast/swift/PTY internals.
  • 40 demo-app: JSON-CSV/tulip/rose-count (35) + the 5 "Haaland fan-site" pages AR-242–246 — all well-specified + tractable but misroute to pear via repos.default (their work belongs in a demo/fan-site repo).
  • 5 cloud/github-app (AR-106 workspace-normalization etc.).
  • 3 auto-generated test artifacts (AR-111/139/140) — not real coding tasks.
  • 2 harness CLI-registry (AR-56 open-ended research; AR-57 refresh-CLI-versions = the v8 version-skew class).
  • 1 relayfile-tests (AR-16).
  • 2 merge-policy/behavioral-sensitive: AR-99 "Reviewer: merge on green" + AR-95 integration-behavior (reserved/risk-flagged).
  • 1 genuinely pear-scoped non-thin: AR-239 (factory relayfile creds-unify) — but auth-critical AND carries a reserved-for-Khaliq design decision (fix approach is the owner's call) → an autonomous agent would guess the reserved design → (reasoned prediction) would not land clean — sb-review2 would correctness-hold the guessed reserved-design; a predicted hold, not a tested outcome.
  • genuinely-pear-clean: 0.

AR-99 boundary illustration (worth surfacing): the backlog literally asks for "Reviewer: merge on green" — that IS the auto-merge feature we built (#19 / the merge-gate / the flip itself). It's correctly classified reserved + behavioral-sensitive → out-of-autonomous-scope. So the factory built the exact feature its own backlog requested, and that feature is precisely the reserved-sensitive class an autonomous agent must NOT self-implement. The most safety-critical backlog items are the ones correctly NOT autonomously handled — the boundary working as designed.

Pattern (twice): the lone "pear-scoped non-thin" candidate hid a reserved cross-cutting design call (AR-82→v8/cloud#1932 migration; AR-239→creds-unify approach). The well-specified pear issues that remain are exactly the ones a human deliberately reserved the design on. This is the honest truth about the backlog, not a survey gap — and it's why escalate-don't-guess is correct (a naive "latest agent-relay" would hit the v8/cloud#1932 landmine).

We did NOT enrich-to-manufacture a PR. mergePolicy stays 'never'; the FLIP WAITS on the real-PR leg.


6. Flip conditions (when to throw never→on-green-with-review)

ALL of:

  • F + H + I merged & cert'd.
  • Sustained endurance soak GREEN (mem/fd-no-leak over hours + sustained re-cycle + reaped=0-coupled + gh-auth-live) — [morning tally].
  • ≥2-3 REAL PRs landed clean (correctness-reviewed) — PENDING (no clean candidate tonight).
  • Lead's counting call on what satisfies the real-PR leg.

Flip mechanism (when thrown): head-SHA-guarded + all-required-CI-green + reviewDecision=APPROVED; the synthetic [factory-e2e] discriminator is preserved (synthetic = close-never-merge, even post-flip).
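
A minimal sketch of the merge gate described above, assuming the PR state has already been fetched. The three conditions, the policy values, and the `[factory-e2e]` marker come from the text; the `PullRequestState` shape, the `mayMerge` name, and the assumption that the marker appears in the PR title are hypothetical.

```typescript
type MergePolicy = "never" | "on-green-with-review";

interface PullRequestState {
  title: string;
  headSha: string;
  requiredChecksGreen: boolean; // all required CI checks passed
  reviewDecision: string;       // e.g. "APPROVED"
}

const SYNTHETIC_MARKER = "[factory-e2e]";

// `goHeadSha` is the head SHA the merge GO was bound to.
function mayMerge(
  policy: MergePolicy,
  pr: PullRequestState,
  goHeadSha: string,
): boolean {
  // Synthetic PRs are close-never-merge, even after the flip.
  // Assumption: the discriminator is found in the title.
  if (pr.title.indexOf(SYNTHETIC_MARKER) !== -1) return false;
  if (policy === "never") return false;
  return (
    pr.headSha === goHeadSha &&
    pr.requiredChecksGreen &&
    pr.reviewDecision === "APPROVED"
  );
}
```

Checking the synthetic marker before the policy keeps soak traffic unmergeable regardless of how the policy is later set, and the head-SHA comparison invalidates an approval if new commits land after the GO.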


7. Operating preconditions (HARD)

  • Scheduled-reaper --config == live-daemon --config (else vacuous reap + broken backstop).
  • gh-authed operator environment — the gh-resolver is completion-load-bearing while the cloud GitHub→mount PR-sync is stuck (cloud#2108-family). A gh-auth drop halts completion.
  • Verify the operator artifact (the exact node bin/fleet.mjs --config <live> path), not a shim.
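
The first precondition lends itself to a preflight check before the reaper is scheduled. A minimal sketch, assuming both invocations are available as argument arrays and pass the path as `--config <path>`. `configArg` and `sameConfig` are hypothetical helpers, the paths in the usage comment are placeholders, and the comparison is on the literal string (it does not resolve relative paths or symlinks).

```typescript
// Return the value following "--config", or undefined if absent.
function configArg(argv: string[]): string | undefined {
  const i = argv.indexOf("--config");
  if (i === -1 || i + 1 >= argv.length) return undefined;
  return argv[i + 1];
}

// True only when both invocations name the same config.
// A missing --config on either side counts as a mismatch.
function sameConfig(daemonArgv: string[], reaperArgv: string[]): boolean {
  const daemonConfig = configArg(daemonArgv);
  const reaperConfig = configArg(reaperArgv);
  return daemonConfig !== undefined && daemonConfig === reaperConfig;
}

// Placeholder paths:
// sameConfig(
//   ["factory", "start", "--mode", "live", "--config", "/path/to/live.json"],
//   ["fleet", "factory", "reap-orphans", "--config", "/path/to/live.json"],
// ) // true
```

Failing closed on a missing or different path matters here because a mismatched reaper does not error: it silently reaps nothing.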

8. Tracked follow-ups (non-blocking, documented)

  • (c) LLM triage tier: wire an LLM into TieredTriage so thin issues get specced/escalated-with-context instead of dropped. This is the real fix for the THIN-issue class and a parallel next step (NOT "the" single gate: it handles thin issues; the multi-repo router below handles the larger demo-app-misroute class). The router is #1 by volume; the LLM tier is the thin-issue handler. (port-injection, real change, needs review.)
  • (J) registry deregister-on-completion — transient/bounded tonight (5-key re-cycle overwrites); a GATE before a multi-distinct-real-issue batch (unbounded entry growth there). Process/registry hygiene.
  • (K) issue-state-reset-on-abandoned-dispatch — agent dies/exits BEFORE opening a PR → issue stuck "Agent Implementing" forever (the (C) completion-sweep only resets state when a non-draft PR exists). Fix: on agent-death-without-PR, reset issue state (→Ready/Backlog) + release slot. Real-tracker issue-state hygiene; distinct from (J). A GATE before a large real run. Evidence: AR-82 found pre-existing stuck in this state; the hung AR-237 soak pair (84181/84306, ~6.5h, no PR) = a live (K)-instance. ★ Process-side twin (proven at the 8h clean-end, endSurv=2): an abandoned/hung pair is OUTSIDE the daemon's own clean-stop teardown path (which handles tracked in-flight pairs) → it SURVIVES a clean daemon SIGTERM and requires the EXTERNAL reaper/backstop to clean. So the external reaper is load-bearing for abandoned-pair cleanup, and the (K) fix should also remove the abandoned agent from the daemon's teardown-tracked set.
  • broker agent-name-not-released-on-completion (relay#1116-family — relay-side sibling of (J)/(K)) — the broker doesn't free an agent's NAME on completion → a fixed-key re-dispatch pool jams ("agent already exists", 68 errors across the 5 keys) once all names are stuck-registered → unattended re-cycle STALLED at ~1.5h on the 5-key soak. A real run with DISTINCT issue names delays it, but the name-leak still accumulates over a long run. Bounds unattended fixed-key operation (~1.5h before needing broker-name cleanup). Surface as a relay#1116-family issue ([[surface-relay-layer-fixes]]). NB: the (D) leak cert is INTACT — captured on the busy 1.5h window before the stall; the stall is a workload change, not a leak.
  • multi-repo router (the demo-app-misroute ROOT; the #1 unlock): the STATE.md "designed-but-UNPROVEN multi-repo router" manifesting: no working per-issue→repo routing → everything defaults to pear → misroute. This, not more machinery, is the #1 enabler for a scaled real run: proving the router (route demo-app/relay/cloud issues to THEIR repos) makes the ~40 well-specified demo-app tasks legitimately dispatchable in-repo, which is the largest clean-candidate class. ⚠️ Building it requires a new target repo (gh-auth/branch/CI verification) + resolving "do demo-app PRs count as the graduation real-PR proof", a Khaliq/Lead decision, deliberately NOT built at 3am (over-reach trap).
  • (F) dispatch-throughput + high-watermark-404 cloud-delivery-lag — cloud#2108-family.
  • cloud#2108 the underlying GitHub→mount PR-sync stuck (routed-around via gh-resolver).
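
The fix proposed in (K) can be sketched as a pure state transition: when an agent exits without having opened a PR, reset the issue, release the slot, and drop the pair from the teardown-tracked set. All names here (`FactoryState`, `onAgentExit`, the `"Ready"` state label) are hypothetical and the shape is an assumption; this is not the factory's data model.

```typescript
interface FactoryState {
  issueState: Record<string, string>; // issue key → tracker state
  freeSlots: number;
  trackedPairs: string[];             // issue keys with in-flight agent pairs
}

// Called when an agent pair for `issueKey` has died or exited.
function onAgentExit(
  state: FactoryState,
  issueKey: string,
  prOpened: boolean,
): FactoryState {
  // A PR exists: the PR-state completion sweep owns this path.
  if (prOpened) return state;
  return {
    // Reset so the issue can be picked up again ("Ready" is an assumed label).
    issueState: { ...state.issueState, [issueKey]: "Ready" },
    // Release the slot the dead agent was holding.
    freeSlots: state.freeSlots + 1,
    // Stop tracking the pair so teardown does not wait on it.
    trackedPairs: state.trackedPairs.filter((k) => k !== issueKey),
  };
}
```

Returning a new object rather than mutating in place makes the reset easy to test against the stuck "Agent Implementing" case without a live tracker.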

9. Honest residual & items for Khaliq (full disclosure — nothing hidden)

Crossed-message race (disclosed, harm = zero): during the real-candidate evaluation, a crossed-message race seeded AR-239→Ready before a retraction propagated. The dual-catch (reviewer STOP + lead STEER-HALT) caught it PRE-SPAWN: 0 agents / 0 PR / 0 auth-design-guess; the pinned soak daemon (35382) + broker (68009) were untouched. AR-82 was restored to its verified as-found state; AR-239 was left flagged (original genuinely unverifiable → no guess-restore, per the honest-known-altered > confidently-wrong rule), logged to /tmp/factory-run/KNOWN-ALTERED.md. Net residual trace = AR-239 in "Ready for Agent" (inert, flagged) + logged seed events. This is disclosed both as honesty and as a demonstration that the safety net (steer-halt + reviewer backstop + never-policy + isolation) works under a real race.

Items needing Khaliq's reset/verification:

  1. AR-239 = "Ready for Agent" — known-altered by the crossed-race, original state unverifiable. Please reset to intended.
  2. AR-82 = "Agent Implementing": ⚠️ PRE-EXISTING orphaned in-progress state (a prior-session leftover, not created this run; restored to as-found, not ours to alter). A real-tracker observation: AR-82 sits in Agent-Implementing with no active agent. Please verify/reset if stale. (Possible orphaned-state-cleanup gap on a prior run; see follow-up (K), which cites AR-82 as evidence.)

10. Honest morning posture (one line)

GRADUATED + safe + autonomous-self-completing + machinery-validated + endurance-PROVEN (~8h soak, +3.8MB rss drift / fd-flat / badReap=0, clean-ended); the "≥2-3 real PRs landed clean" leg is honestly PENDING because tonight's real backlog contains no clean autonomous-handleable candidate, and we refused to manufacture one. The flip waits. That integrity — applied to the proof itself, not just the certs — is the deliverable.
