
v1.16.0.0 feat: tunnel allowlist 17→26 + canDispatchOverTunnel pure function #1253

Merged
garrytan merged 6 commits into main from garrytan/gbrowser-unleashed
Apr 28, 2026


Conversation

@garrytan garrytan commented Apr 28, 2026

Summary

The visible bug: a paired remote agent over the ngrok tunnel hit 403s on newtab, tabs, goto-on-existing-tab, and a chain of other commands the operator docs claimed worked. The hidden bug: the v1.6.0.0 TUNNEL_COMMANDS allowlist sat at 17 entries while docs/REMOTE_BROWSER_ACCESS.md, browse/src/cli.ts:546-586, and operator-facing instruction blocks all documented 26. The drift shipped silently across multiple releases.

This PR closes the gap.

Code changes

  • browse/src/server.ts:111-120: TUNNEL_COMMANDS extended from 17 → 26 commands. Added: newtab, tabs, back, forward, reload, snapshot, fill, url, closetab. Each is bounded by the existing per-tab ownership check at server.ts:613-624 — scoped tokens default to tabPolicy: 'own-only', so paired agents still can't operate on tabs they don't own.
  • browse/src/server.ts: extracted the inline gate check into a pure exported canDispatchOverTunnel(command) function. Same behavior; the difference is unit-testability without HTTP.
  • browse/src/server.ts:2080-2104: added BROWSE_TUNNEL_LOCAL_ONLY=1 test-mode flag. Binds the second Bun.serve listener via makeFetchHandler('tunnel') on 127.0.0.1 without invoking ngrok. Production tunnel still requires BROWSE_TUNNEL=1 + valid NGROK_AUTHTOKEN.
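
A minimal sketch of what a pure gate function of this shape could look like, to show why it is testable without HTTP. This is not the code in browse/src/server.ts: the set below holds only the nine commands this PR adds (the full 26-entry literal is not reproduced in this description), and the `ALIASES` map is a hypothetical stand-in built around the one alias the PR mentions ('set-content' resolving to 'load-html').

```typescript
// Illustrative subset only: the nine commands this PR adds.
// The real TUNNEL_COMMANDS literal has 26 entries and is not reproduced here.
const TUNNEL_COMMANDS_SUBSET: ReadonlySet<string> = new Set([
  "newtab", "tabs", "back", "forward", "reload",
  "snapshot", "fill", "url", "closetab",
]);

// Hypothetical alias table; only this one mapping is mentioned in the PR.
const ALIASES: ReadonlyMap<string, string> = new Map([
  ["set-content", "load-html"],
]);

// Pure: no HTTP, no I/O. Anything that is not a non-empty string is denied.
function canDispatchOverTunnel(command: unknown): boolean {
  if (typeof command !== "string" || command.length === 0) return false;
  // Canonicalize first, so an alias cannot bypass the allowlist.
  const canonical = ALIASES.get(command) ?? command;
  return TUNNEL_COMMANDS_SUBSET.has(canonical);
}
```

Because the function takes a value and returns a boolean, a unit test can enumerate allowed, blocked, and malformed inputs directly instead of standing up a listener.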

Test changes

  • browse/test/dual-listener.test.ts: replaced must-include + must-exclude assertions with exact-set equality on the 26-command literal. Added command !== 'newtab' ownership-exemption regex assertion (catches refactors that re-introduce the catch-22 from the ownership side).
  • browse/test/tunnel-gate-unit.test.ts (new): 53 expects covering all 26 allowed, 20 blocked, null/undefined/empty/non-string defensive handling, alias canonicalization.
  • browse/test/pair-agent-tunnel-eval.test.ts (new): 4 behavioral tests with both listeners locally (no ngrok). Asserts the gate fires on tunnel surface only, denial-log writes, and the catch-22 regression (newtab → goto without ownership 403).
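
To illustrate why exact-set equality catches drift that must-include plus must-exclude assertions miss, here is a small sketch. The helper names (`diffCommandSets`, `isExactSet`, `passesIncludeExclude`) and the `sneaky-cmd` entry are hypothetical and are not taken from dual-listener.test.ts.

```typescript
interface SetDiff {
  missing: string[];    // expected but absent from the actual set
  unexpected: string[]; // present in the actual set but not expected
}

function diffCommandSets(expected: Iterable<string>, actual: Iterable<string>): SetDiff {
  const exp = new Set(expected);
  const act = new Set(actual);
  return {
    missing: Array.from(exp).filter((c) => !act.has(c)).sort(),
    unexpected: Array.from(act).filter((c) => !exp.has(c)).sort(),
  };
}

// Exact-set style: any addition or removal in the source fails the check.
function isExactSet(expected: Iterable<string>, actual: Iterable<string>): boolean {
  const d = diffCommandSets(expected, actual);
  return d.missing.length === 0 && d.unexpected.length === 0;
}

// Must-include / must-exclude style: only the sampled names are checked,
// so a command nobody thought to list can slip through unnoticed.
function passesIncludeExclude(
  actual: ReadonlySet<string>,
  mustInclude: readonly string[],
  mustExclude: readonly string[],
): boolean {
  return mustInclude.every((c) => actual.has(c)) && mustExclude.every((c) => !actual.has(c));
}
```

With an actual set of `newtab`, `tabs`, and an extra `sneaky-cmd`, the include/exclude check still passes while the exact-set check fails and names the extra entry.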

Doc changes

  • docs/REMOTE_BROWSER_ACCESS.md:35,168: bumped "17-command allowlist" → "26-command". Removed eval from the (incorrect) denied-commands list.
  • CLAUDE.md: same count bump in transport-layer security section.

Test Coverage

TUNNEL ALLOWLIST GATE
[+] browse/src/server.ts (TUNNEL_COMMANDS Set + canDispatchOverTunnel)
  ├── [★★★ TESTED] exact-set equality on 26 commands — dual-listener.test.ts
  ├── [★★★ TESTED] ownership-exemption regex — dual-listener.test.ts
  ├── [★★★ TESTED] /command handler delegates to canDispatchOverTunnel — dual-listener.test.ts
  └── [★★★ TESTED] all 26 allowed, 20 blocked, null/empty/non-string, aliases — tunnel-gate-unit.test.ts (53 expects)

[+] browse/src/server.ts (BROWSE_TUNNEL_LOCAL_ONLY=1)
  └── [★★★ TESTED] both listeners bind, gate fires on tunnel surface only — pair-agent-tunnel-eval.test.ts

USER FLOWS (paired agent over tunnel)
[+] Catch-22 regression
  └── [★★★ TESTED] newtab → goto on owned tab does not 403 with "Tab not owned" — pair-agent-tunnel-eval.test.ts

COVERAGE: 6/6 paths tested (100%)  |  GAPS: 0
QUALITY: ★★★:6

Tests: 69 pass / 0 fail across the four touched test files. Full free bun test suite green.

Pre-Landing Review

No issues found. Plan was reviewed under /plan-eng-review and went through 2 sequential codex outside-voice passes during plan mode. 6 of 7 substantive findings landed in the implementation; the 7th (a pre-existing /pair-agent /health probe mismatch at cli.ts:656-668) is logged out-of-scope.

Eval Results

No prompt-related files changed — evals skipped.

Plan Completion

All implementation items DONE. Plan: ~/.claude/plans/zippy-churning-boot.md.

| Item | Status | Evidence |
|---|---|---|
| Extend TUNNEL_COMMANDS to 26 commands | DONE | server.ts:111-120 |
| Export TUNNEL_COMMANDS for tests | DONE | server.ts:111 |
| Extract canDispatchOverTunnel() pure function | DONE | server.ts near literal |
| Refactor handler to call extracted function | DONE | server.ts:1773-1781 |
| Add BROWSE_TUNNEL_LOCAL_ONLY=1 test mode | DONE | server.ts:2080-2104 |
| Source-level guards (exact-set + exemption regex) | DONE | dual-listener.test.ts |
| Pure-function unit test | DONE | tunnel-gate-unit.test.ts (new) |
| Behavioral eval (both listeners locally) | DONE | pair-agent-tunnel-eval.test.ts (new) |
| Doc count bump (REMOTE_BROWSER_ACCESS + CLAUDE) | DONE | both files |
| bun run build | DONE | binary recompiled |

One known accepted risk: tabs over the tunnel returns metadata for ALL tabs in the browser, not just tabs the agent owns. Codex flagged that tightening this requires touching the ownership gate itself (server.ts:603-614 falls back to getActiveTabId() BEFORE dispatch), which is out of scope for a catch-22 fix. Logged in the plan failure-mode table; user accepted.

Test plan

  • bun test browse/test/dual-listener.test.ts browse/test/tunnel-gate-unit.test.ts browse/test/pair-agent-tunnel-eval.test.ts browse/test/pair-agent-e2e.test.ts → 69 pass, 0 fail
  • Full free suite (bun test) exit code 0
  • bun run build → binaries recompiled, all SKILL.md regenerated

🤖 Generated with Claude Code



garrytan and others added 5 commits April 27, 2026 23:50
…rTunnel

Adds newtab, tabs, back, forward, reload, snapshot, fill, url, closetab to
TUNNEL_COMMANDS (matching what cli.ts and REMOTE_BROWSER_ACCESS.md already
documented). Each new command is bounded by the existing per-tab ownership
check at server.ts:613-624 — scoped tokens default to tabPolicy: 'own-only'
so paired agents still can't operate on tabs they don't own.
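
The ownership bound and its `newtab` exemption can be sketched roughly as follows. This is an assumption-laden illustration, not the check at server.ts:613-624: the `ScopedToken` shape, the `tabOwners` map, and the `"any"` policy value are hypothetical; only `tabPolicy: 'own-only'`, the `command !== 'newtab'` exemption, and the "Tab not owned by your agent" message come from this PR.

```typescript
type TabPolicy = "own-only" | "any"; // "any" is hypothetical; the PR only names "own-only"

interface ScopedToken {
  agentId: string;
  tabPolicy: TabPolicy;
}

type OwnershipResult =
  | { ok: true }
  | { ok: false; status: 403; error: string };

// tabOwners maps a tab id to the agent id that owns it (hypothetical bookkeeping).
function checkTabOwnership(
  command: string,
  token: ScopedToken,
  targetTabId: number,
  tabOwners: ReadonlyMap<number, string>,
): OwnershipResult {
  // newtab is exempt: it creates the tab the agent will own, so requiring
  // ownership up front would leave the agent unable to ever acquire a tab.
  if (command !== "newtab" && token.tabPolicy === "own-only") {
    if (tabOwners.get(targetTabId) !== token.agentId) {
      return { ok: false, status: 403, error: "Tab not owned by your agent" };
    }
  }
  return { ok: true };
}
```

Under this shape, widening the tunnel allowlist does not widen what a paired agent can touch: every command other than `newtab` still has to target a tab the agent owns.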

Refactors the inline gate check at server.ts:1771-1783 into a pure exported
function canDispatchOverTunnel(command). Same behavior as the inline check;
the difference is unit-testability without HTTP.

Adds BROWSE_TUNNEL_LOCAL_ONLY=1 test-mode flag that binds the second Bun.serve
listener with makeFetchHandler('tunnel') on 127.0.0.1 — no ngrok needed.
Production tunnel still requires BROWSE_TUNNEL=1 + valid NGROK_AUTHTOKEN.
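
One way to picture the three startup modes is a pure function over the environment. This is a sketch under stated assumptions: the `resolveTunnelMode` name, the `TunnelMode` type, the precedence of the local-only flag over `BROWSE_TUNNEL`, and the fallback to "off" when the token is missing are all hypothetical; only the three environment variable names and the 127.0.0.1 bind come from this commit.

```typescript
type TunnelMode =
  | { kind: "off" }
  | { kind: "local-only"; hostname: "127.0.0.1" }
  | { kind: "ngrok"; authtoken: string };

type Env = Readonly<Record<string, string | undefined>>;

function resolveTunnelMode(env: Env): TunnelMode {
  // Test mode: bind the tunnel-surface listener locally, never call ngrok.
  if (env["BROWSE_TUNNEL_LOCAL_ONLY"] === "1") {
    return { kind: "local-only", hostname: "127.0.0.1" };
  }
  // Production mode: both the opt-in flag and a non-empty token are required.
  const token = env["NGROK_AUTHTOKEN"];
  if (env["BROWSE_TUNNEL"] === "1" && token !== undefined && token.length > 0) {
    return { kind: "ngrok", authtoken: token };
  }
  // Assumption: anything else leaves the tunnel listener disabled.
  return { kind: "off" };
}
```

Taking the environment as a parameter, rather than reading `process.env` inside the function, keeps the decision testable without spawning the daemon.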

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ehavioral eval

Three layers of regression coverage for the tunnel allowlist:

1. dual-listener.test.ts: replaces must-include/must-exclude with exact-set
   equality on the 26-command literal (the prior intersection-only style let
   new commands sneak into the source without test updates). Adds a regex
   assertion that the `command !== 'newtab'` ownership exemption at
   server.ts:613 still exists — catches refactors that re-introduce the
   catch-22 from the other side. Updates the /command handler test to look
   for canDispatchOverTunnel(body?.command) instead of the inline check.

2. tunnel-gate-unit.test.ts (new): 53 expects covering all 26 allowed,
   20 blocked, null/undefined/empty/non-string defensive handling, and alias
   canonicalization (e.g. 'set-content' resolves to 'load-html' which is
   correctly rejected since 'load-html' isn't tunnel-allowed).

3. pair-agent-tunnel-eval.test.ts (new): 4 behavioral tests that spawn the
   daemon under BROWSE_HEADLESS_SKIP=1 BROWSE_TUNNEL_LOCAL_ONLY=1, bind both
   listeners on 127.0.0.1, mint a scoped token via /pair → /connect, and
   assert: (a) newtab over tunnel passes the gate; (b) pair over tunnel
   403s with disallowed_command:pair AND writes a denial-log entry;
   (c) pair over local does NOT trigger the tunnel gate (proves the gate
   is surface-scoped); (d) regression for the catch-22 — newtab + goto on
   the resulting tab does not 403 with "Tab not owned by your agent".

All four tests run free under bun test (no API spend, no ngrok).
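
The surface-scoped behavior those assertions describe can be sketched without any HTTP machinery. This is not `makeFetchHandler` itself: `makeSketchHandler`, `DenialEntry`, the in-memory `denialLog` array, and the "passed-gate" body are hypothetical; the `disallowed_command:pair` reason string and the 403 status come from the test description above.

```typescript
type Surface = "local" | "tunnel";

interface GateResult {
  status: number;
  body: string;
}

interface DenialEntry {
  surface: Surface;
  reason: string;
}

// Builds a per-surface command handler. The allowlist gate is only consulted
// when the handler was created for the tunnel surface.
function makeSketchHandler(
  surface: Surface,
  isAllowed: (command: string) => boolean,
  denialLog: DenialEntry[],
): (command: string) => GateResult {
  return (command: string): GateResult => {
    if (surface === "tunnel" && !isAllowed(command)) {
      const reason = `disallowed_command:${command}`;
      denialLog.push({ surface, reason }); // every denial leaves a log entry
      return { status: 403, body: reason };
    }
    // "passed-gate" only means the tunnel gate did not fire; later checks
    // (auth, tab ownership) are outside this sketch.
    return { status: 200, body: "passed-gate" };
  };
}
```

Binding the surface at construction time is what lets one process serve both listeners while applying the gate to only one of them.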

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…OWSER_ACCESS.md

Both docs already named the 9 new commands as remote-accessible (the operator
guide's per-command sections at lines 86-119 and 168, plus cli.ts:546-586's
instruction blocks). The allowlist count was the only place the drift was
visible. Also corrected REMOTE_BROWSER_ACCESS.md's denied-commands list:
'eval' is in the allowlist, not the denied list — prior doc was wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous bump landed at v1.21.0.0 because gstack-next-version
advances past the highest claimed slot (v1.20.0.0 from #1252) rather
than picking the lowest unclaimed. v1.16-v1.18 are unclaimed and
v1.16.0.0 preserves monotonic version ordering on main once #1234
(v1.17), #1233 (v1.19), and #1252 (v1.20) merge after us.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan changed the title from "v1.21.0.0 feat: tunnel allowlist 17→26 + canDispatchOverTunnel pure function" to "v1.16.0.0 feat: tunnel allowlist 17→26 + canDispatchOverTunnel pure function" Apr 28, 2026

github-actions Bot commented Apr 28, 2026

E2E Evals: ✅ PASS

8/8 tests passed | $1.22 total cost | 12 parallel runners

| Suite | Result | Status | Cost |
|---|---|---|---|
| e2e-browse | 2/2 | | $0.13 |
| e2e-deploy | 2/2 | | $0.3 |
| e2e-qa-workflow | 1/1 | | $0.47 |
| llm-judge | 1/1 | | $0.02 |
| e2e-deploy | 2/2 | | $0.3 |

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

garrytan added a commit that referenced this pull request Apr 28, 2026
Version-gate workflow rejected v1.20.0.0 because the queue moved during
the windows-free-tests fix loop:

  v1.16.0.0 → garrytan/gbrowser-unleashed (PR #1253)  [new since last bump]
  v1.17.0.0 → garrytan/setup-gbrain-run    (PR #1234)
  v1.19.0.0 → garrytan/browserharness       (PR #1233)
  v1.21.1.0 → garrytan/pty-plan-mode-e2e    (PR #1255)  [new since last bump]

Two new sibling PRs landed slot claims while we iterated on Windows.
Next free MINOR slot is v1.22.0.0.

Updated VERSION, package.json, CHANGELOG header + body. Also pushing the
round-13 windows-fix in parallel (test invokes bin/gstack-paths via bash
to handle Windows shebang).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… slots

The gate was rejecting any PR VERSION below the util's next-slot
recommendation, even when the lower slot was unclaimed. This blocked
PRs that legitimately want to land at an unclaimed slot below the queue
max — which is what /ship should pick when the goal is monotonic version
ordering on main (lower-numbered PRs landing first preserves order; the
util's "advance past max claimed" semantics only optimizes for fresh
runs picking unique slots, not for queue ordering on merge).

New gate logic:

1. Hard-fail if PR VERSION <= base VERSION (no actual bump).
2. Hard-fail if PR VERSION exactly matches another open PR's VERSION
   (real collision).
3. Pass otherwise. If the PR is below the util's suggestion, emit an
   informational ::notice:: explaining the slot is unclaimed.

The util's output stays informational — it tells fresh /ship runs what
the next-up slot should be, but the gate only blocks actual conflicts.
This is a strict relaxation: every PR that passed the old gate also
passes the new one.

Confirmed by dry-run against the current queue (4 open PRs claiming
1.17.0.0, 1.19.0.0, 1.21.1.0, 1.22.0.0):
  - v1.16.0.0  → pass with informational notice (unclaimed)
  - v1.17.0.0  → fail (collision with #1234)
  - v1.15.0.0  → fail (no bump from base)
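
The three rules can be sketched as a pure function and replayed against the dry-run queue. This is an illustration, not the workflow's actual script: the function names are hypothetical, and the base version (1.15.0.0) and suggested slot (1.23.0.0) passed in the usage note are assumptions chosen to be consistent with the dry-run results above.

```typescript
type VersionGateResult =
  | { pass: true; notice?: string }
  | { pass: false; reason: string };

// Parses "v1.16.0.0" or "1.16.0.0" into [1, 16, 0, 0].
function parseVersion(v: string): number[] {
  return v.replace(/^v/, "").split(".").map((part) => Number.parseInt(part, 10));
}

// Returns -1, 0, or 1, comparing segment by segment (missing segments count as 0).
function compareVersions(a: string, b: string): number {
  const pa = parseVersion(a);
  const pb = parseVersion(b);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
  return 0;
}

function versionGate(
  prVersion: string,
  baseVersion: string,
  otherOpenPrVersions: readonly string[],
  suggestedVersion: string,
): VersionGateResult {
  // Rule 1: the PR must actually bump past the base branch.
  if (compareVersions(prVersion, baseVersion) <= 0) {
    return { pass: false, reason: "no bump from base" };
  }
  // Rule 2: an exact match with another open PR is a real collision.
  if (otherOpenPrVersions.some((v) => compareVersions(v, prVersion) === 0)) {
    return { pass: false, reason: "collision with another open PR" };
  }
  // Rule 3: pass; a lower-but-unclaimed slot only earns an informational notice.
  if (compareVersions(prVersion, suggestedVersion) < 0) {
    return { pass: true, notice: "below suggested slot, but unclaimed" };
  }
  return { pass: true };
}
```

Calling `versionGate("v1.16.0.0", "1.15.0.0", ["1.17.0.0", "1.19.0.0", "1.21.1.0", "1.22.0.0"], "1.23.0.0")` passes with a notice, while `v1.17.0.0` fails as a collision and `v1.15.0.0` fails as no bump.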

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 8f3701b into main Apr 28, 2026
22 checks passed
anbangr added a commit to anbangr/gstack that referenced this pull request Apr 28, 2026
…(gstack-build v1.15.0)

* feat(dual-impl): Phase 1 — types, worktree, parser dualImpl stamp

- types.ts: 6 new PhaseStatus values (dual_impl_running → dual_winner_pending);
  DualImplState + DualImplTestResult interfaces; dualImpl? on Phase + PhaseState
- parser.ts: accepts ParseOpts { dualImpl? }; stamps dualImpl=true on all phases
  when flag is set; backward compat — defaults to false
- worktree.ts: createWorktrees (two isolated git worktrees + branches),
  teardownWorktrees (idempotent git worktree remove + branch -D),
  applyWinner (cherry-pick with patch fallback)
- __tests__/worktree.test.ts: 3 tests against real temp git repo (green)
- __tests__/parser.test.ts: 2 new dualImpl stamping tests (green)

110 tests pass, 0 fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dual-impl): Phase 1 post-review fixes — align WorktreePair field names + os.tmpdir + commit exit codes

- WorktreePair: geminiPath→geminiWorktreePath, codexPath→codexWorktreePath
  (aligns with DualImplState so callers can spread directly)
- worktree.ts: use os.tmpdir() instead of hardcoded /tmp
- applyWinner patch fallback: check exit codes of git add + git commit;
  return { ok: false } instead of silently returning ok:true on commit failure
- worktree.test.ts: update all field references to new names

110 tests pass, 0 fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(dual-impl): Phase 2 — phase-runner state machine + ApplyResultExtra

- 4 new Action types: RUN_DUAL_IMPL, RUN_DUAL_TESTS, RUN_JUDGE_OPUS, APPLY_WINNER
- decideNextAction:
  * tests_red + phase.dualImpl=true → RUN_DUAL_IMPL (single-impl unchanged otherwise)
  * dual_impl_running → RUN_DUAL_IMPL (crash recovery)
  * dual_impl_done → RUN_DUAL_TESTS
  * dual_tests_running → RUN_DUAL_TESTS (crash recovery)
  * dual_judge_pending / dual_judge_running → RUN_JUDGE_OPUS
  * dual_winner_pending → APPLY_WINNER (winner from selectedImplementor)
- applyResult: new optional 4th param ApplyResultExtra carries dual-impl
  data (worktree init, test results, judge verdict) that won't fit a
  single SubAgentResult
- applyResult handlers:
  * RUN_DUAL_IMPL → dual_impl_done (stamps worktree paths/branches)
  * RUN_DUAL_TESTS → dual_judge_pending (both pass) | dual_winner_pending
    with auto-select (one passes / both fail → fewer-failures winner)
  * RUN_JUDGE_OPUS → dual_winner_pending with selectedBy='judge'
  * APPLY_WINNER → gemini_done (handoff to existing pipeline)
- 8 new state-machine tests covering all dual-impl transitions
- Existing tddPhase/legacyPhase fixtures updated with dualImpl: false

118 tests pass, 0 fail. Exhaustiveness guard preserved.
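
The transitions listed above can be sketched as a single switch with an exhaustiveness guard. This is a reduced illustration, not the phase-runner source: the `PhaseSnapshot` shape and the `RUN_SINGLE_IMPL` placeholder are hypothetical, and the `FAIL` branch for a missing winner follows the fail-closed behavior described in the follow-up fix rather than anything stated in this commit.

```typescript
type Implementor = "gemini" | "codex";

type PhaseStatus =
  | "tests_red"
  | "dual_impl_running"
  | "dual_impl_done"
  | "dual_tests_running"
  | "dual_judge_pending"
  | "dual_judge_running"
  | "dual_winner_pending";

type Action =
  | { type: "RUN_SINGLE_IMPL" } // hypothetical placeholder for the unchanged single-impl path
  | { type: "RUN_DUAL_IMPL" }
  | { type: "RUN_DUAL_TESTS" }
  | { type: "RUN_JUDGE_OPUS" }
  | { type: "APPLY_WINNER"; winner: Implementor }
  | { type: "FAIL"; reason: string };

interface PhaseSnapshot {
  status: PhaseStatus;
  dualImpl: boolean;
  selectedImplementor?: Implementor;
}

function decideNextAction(phase: PhaseSnapshot): Action {
  switch (phase.status) {
    case "tests_red":
      return phase.dualImpl ? { type: "RUN_DUAL_IMPL" } : { type: "RUN_SINGLE_IMPL" };
    case "dual_impl_running": // crash recovery: re-run the step that was in flight
      return { type: "RUN_DUAL_IMPL" };
    case "dual_impl_done":
    case "dual_tests_running": // crash recovery
      return { type: "RUN_DUAL_TESTS" };
    case "dual_judge_pending":
    case "dual_judge_running":
      return { type: "RUN_JUDGE_OPUS" };
    case "dual_winner_pending":
      return phase.selectedImplementor === undefined
        ? { type: "FAIL", reason: "no selectedImplementor recorded" }
        : { type: "APPLY_WINNER", winner: phase.selectedImplementor };
    default: {
      // Exhaustiveness guard: adding a status without a case fails to compile.
      const unreachable: never = phase.status;
      throw new Error(`unhandled status: ${String(unreachable)}`);
    }
  }
}
```

Mapping each "running" status back to its own action is what makes a crashed run resumable: the persisted status alone is enough to decide what to redo.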

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dual-impl): Phase 2 post-review HIGH fixes — fail-closed on missing signal

Four fail-closed paths added (Codex review HIGH findings):

1. dual_winner_pending without selectedImplementor → FAIL
   Was silently defaulting to 'gemini' which could apply unverified code if
   state was corrupted between persistence and resume.

2. RUN_DUAL_IMPL without dualImplInit in extra → status failed
   Was silently transitioning to dual_impl_done without recording worktree
   paths, making downstream tests/judge/apply impossible.

3. Both dual-impl test runs timed out → status failed
   Was selecting 'gemini' via the both-fail/MAX_SAFE_INTEGER tie path —
   applying unverified code with no test evidence at all.

4. Both dual-impl tests failed with missing failureCount on both → failed
   Same rationale as (3): no signal to choose a winner.

4 new tests cover the fail-closed paths. 122 tests pass, 0 fail.
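
A rough sketch of fail-closed winner selection after the dual test runs, combining the auto-select rules from Phase 2 with the timeout and missing-count paths above. The `TestRun` and `DualTestOutcome` shapes and the `selectedBy: "auto"` value are hypothetical. Treating an exact tie between known failure counts as a failure is also an assumption of this sketch; the commit does not say how such ties resolve.

```typescript
type Implementor = "gemini" | "codex";

interface TestRun {
  passed: boolean;
  timedOut: boolean;
  failureCount?: number;
}

type DualTestOutcome =
  | { next: "dual_judge_pending" }
  | { next: "dual_winner_pending"; winner: Implementor; selectedBy: "auto" }
  | { next: "failed"; reason: string };

function resolveDualTests(gemini: TestRun, codex: TestRun): DualTestOutcome {
  // No test evidence at all: refuse to pick a winner.
  if (gemini.timedOut && codex.timedOut) {
    return { next: "failed", reason: "both test runs timed out" };
  }
  // Both green: a judge has to choose between two working implementations.
  if (gemini.passed && codex.passed) return { next: "dual_judge_pending" };
  // Exactly one green: it wins automatically.
  if (gemini.passed) return { next: "dual_winner_pending", winner: "gemini", selectedBy: "auto" };
  if (codex.passed) return { next: "dual_winner_pending", winner: "codex", selectedBy: "auto" };
  // Both red: fewer failures wins, but only if there is a signal to compare.
  if (gemini.failureCount === undefined && codex.failureCount === undefined) {
    return { next: "failed", reason: "both failed with no failure counts" };
  }
  const g = gemini.failureCount ?? Number.MAX_SAFE_INTEGER;
  const c = codex.failureCount ?? Number.MAX_SAFE_INTEGER;
  if (g === c) {
    // Assumption of this sketch: an exact tie is not a usable signal either.
    return { next: "failed", reason: "both failed with equal failure counts" };
  }
  return { next: "dual_winner_pending", winner: g < c ? "gemini" : "codex", selectedBy: "auto" };
}
```

The ordering matters: the no-evidence checks run before any default is applied, so a corrupted or empty result can never fall through to a silently chosen implementor.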

CRITICAL finding (cli.ts not handling dual actions) is BY-DESIGN — Phase 4
of the plan wires up the CLI dispatch. Phase 2 scope is the pure state machine.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* v1.16.0.0 feat: tunnel allowlist 17→26 + canDispatchOverTunnel pure function (garrytan#1253)

* feat: extend tunnel allowlist to 26 commands + extract canDispatchOverTunnel

Adds newtab, tabs, back, forward, reload, snapshot, fill, url, closetab to
TUNNEL_COMMANDS (matching what cli.ts and REMOTE_BROWSER_ACCESS.md already
documented). Each new command is bounded by the existing per-tab ownership
check at server.ts:613-624 — scoped tokens default to tabPolicy: 'own-only'
so paired agents still can't operate on tabs they don't own.

Refactors the inline gate check at server.ts:1771-1783 into a pure exported
function canDispatchOverTunnel(command). Same behavior as the inline check;
the difference is unit-testability without HTTP.

Adds BROWSE_TUNNEL_LOCAL_ONLY=1 test-mode flag that binds the second Bun.serve
listener with makeFetchHandler('tunnel') on 127.0.0.1 — no ngrok needed.
Production tunnel still requires BROWSE_TUNNEL=1 + valid NGROK_AUTHTOKEN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: source-level guards + pure-function unit test + dual-listener behavioral eval

Three layers of regression coverage for the tunnel allowlist:

1. dual-listener.test.ts: replaces must-include/must-exclude with exact-set
   equality on the 26-command literal (the prior intersection-only style let
   new commands sneak into the source without test updates). Adds a regex
   assertion that the `command !== 'newtab'` ownership exemption at
   server.ts:613 still exists — catches refactors that re-introduce the
   catch-22 from the other side. Updates the /command handler test to look
   for canDispatchOverTunnel(body?.command) instead of the inline check.

2. tunnel-gate-unit.test.ts (new): 53 expects covering all 26 allowed,
   20 blocked, null/undefined/empty/non-string defensive handling, and alias
   canonicalization (e.g. 'set-content' resolves to 'load-html' which is
   correctly rejected since 'load-html' isn't tunnel-allowed).

3. pair-agent-tunnel-eval.test.ts (new): 4 behavioral tests that spawn the
   daemon under BROWSE_HEADLESS_SKIP=1 BROWSE_TUNNEL_LOCAL_ONLY=1, bind both
   listeners on 127.0.0.1, mint a scoped token via /pair → /connect, and
   assert: (a) newtab over tunnel passes the gate; (b) pair over tunnel
   403s with disallowed_command:pair AND writes a denial-log entry;
   (c) pair over local does NOT trigger the tunnel gate (proves the gate
   is surface-scoped); (d) regression for the catch-22 — newtab + goto on
   the resulting tab does not 403 with "Tab not owned by your agent".

All four tests run free under bun test (no API spend, no ngrok).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: bump tunnel allowlist count 17 -> 26 in CLAUDE.md and REMOTE_BROWSER_ACCESS.md

Both docs already named the 9 new commands as remote-accessible (the operator
guide's per-command sections at lines 86-119 and 168, plus cli.ts:546-586's
instruction blocks). The allowlist count was the only place the drift was
visible. Also corrected REMOTE_BROWSER_ACCESS.md's denied-commands list:
'eval' is in the allowlist, not the denied list — prior doc was wrong.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.21.0.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: re-version v1.21.0.0 -> v1.16.0.0 (lowest unclaimed slot)

The previous bump landed at v1.21.0.0 because gstack-next-version
advances past the highest claimed slot (v1.20.0.0 from garrytan#1252) rather
than picking the lowest unclaimed. v1.16-v1.18 are unclaimed and
v1.16.0.0 preserves monotonic version ordering on main once garrytan#1234
(v1.17), garrytan#1233 (v1.19), and garrytan#1252 (v1.20) merge after us.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): version-gate enforces collisions, allows lower-but-unclaimed slots

The gate was rejecting any PR VERSION below the util's next-slot
recommendation, even when the lower slot was unclaimed. This blocked
PRs that legitimately want to land at an unclaimed slot below the queue
max — which is what /ship should pick when the goal is monotonic version
ordering on main (lower-numbered PRs landing first preserves order; the
util's "advance past max claimed" semantics only optimizes for fresh
runs picking unique slots, not for queue ordering on merge).

New gate logic:

1. Hard-fail if PR VERSION <= base VERSION (no actual bump).
2. Hard-fail if PR VERSION exactly matches another open PR's VERSION
   (real collision).
3. Pass otherwise. If the PR is below the util's suggestion, emit an
   informational ::notice:: explaining the slot is unclaimed.

The util's output stays informational — it tells fresh /ship runs what
the next-up slot should be, but the gate only blocks actual conflicts.
This is a strict relaxation: every PR that passed the old gate also
passes the new one.

Confirmed by dry-run against the current queue (4 open PRs claiming
1.17.0.0, 1.19.0.0, 1.21.1.0, 1.22.0.0):
  - v1.16.0.0  → pass with informational notice (unclaimed)
  - v1.17.0.0  → fail (collision with garrytan#1234)
  - v1.15.0.0  → fail (no bump from base)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.17.0.0: setup-gbrain wireup ships the gbrain federation surface (garrytan#1234)

* feat: gstack-gbrain-source-wireup helper + 13 unit tests

The new bin/gstack-gbrain-source-wireup is the single helper that registers
the gstack brain repo as a gbrain federated source via `git worktree`, runs
incremental sync, and supports --uninstall + --probe + --strict modes.

Replaces the dead `consumers.json + ingest_url + /ingest-repo` HTTP wireup
introduced in v1.12.0.0 — that endpoint never shipped on the gbrain side.
The federation surface (`gbrain sources` / `gbrain sync`) shipped in gbrain
v0.18.0; this helper adapts to its actual semantics (no `sources update`, so
path drift recovery is `remove + re-add`; no `--install-cron` either, so
freshness rides on the existing skill-end push hook).

Source-id derivation is multi-fallback: ~/.gstack/.git origin URL →
~/.gstack-brain-remote.txt → --source-id flag. This makes `--uninstall`
work even after `~/.gstack/.git` is destroyed by the parent uninstall script.

Worktree is `--detach`ed at $GSTACK_HOME's HEAD because main is already
checked out there; advance is a re-checkout of the parent's current HEAD,
not a `git pull`. Divergence recovery removes + re-adds the worktree.

Test suite covers 13 cases: fresh-state registration, idempotent re-runs,
drift recovery, --strict failure modes, source-id fallback chain, --probe
non-mutation, sync errors, and --uninstall. Fake gbrain on $PATH, real git
ops at GSTACK_HOME tmp dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: wire setup-gbrain + brain-restore + brain-uninstall to use the helper

setup-gbrain Step 7 now invokes gstack-gbrain-source-wireup --strict after
gstack-brain-init + gbrain_sync_mode is set. Strict mode means the user sees
the failure rather than silently ending up with an unwired brain.

bin/gstack-brain-init drops 60 lines of dead code: the HTTP POST to
${GBRAIN_URL}/ingest-repo, the GBRAIN_URL_VAL/GBRAIN_TOKEN_VAL probes, the
consumers.json writer, and the chore commit step. CONSUMERS_FILE variable
declaration removed. The closing message no longer points at the dead
gstack-brain-consumer add path.

bin/gstack-brain-restore drops the 18-line consumers.json token-rehydration
block (was a no-op for the only consumer that ever existed). Adds a
best-effort wireup invocation after the brain-repo clone so 2nd-Mac restore
gets gbrain federation automatically. Failure prints a stderr WARNING but
does not abort the restore — restore's primary job is the git clone.

bin/gstack-brain-uninstall calls the helper's --uninstall mode (which
removes the gbrain source registration, the git worktree, and the
future-launchd-plist stub) before the existing legacy consumers.json
removal. Ordering is fragile-by-design: helper derives source-id via
multi-fallback so it works even after .git is destroyed.

bin/gstack-brain-consumer gets a DEPRECATED header note. Stays in the tree
for one cycle of grace; removal in v1.13.0.0.

setup-gbrain/SKILL.md is regenerated from the .tmpl via gen:skill-docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: v1.12.3.0 migration — wire existing brain-sync repos into gbrain

Idempotent migration script. For users who already opted into brain-sync
before this release (gbrain_sync_mode != off, ~/.gstack/.git exists), runs
the new gstack-gbrain-source-wireup helper so their existing brain repo
becomes searchable via gbrain immediately on /gstack-upgrade.

Skip conditions (each ends with exit 0):
  - HOME unset or empty (defensive)
  - gbrain_sync_mode = off or empty (user opted out)
  - no ~/.gstack/.git (brain-init never ran)
  - helper missing on disk (broken install)

No --strict on the helper invocation: missing or old gbrain is a benign
skip during a batch upgrade rather than a blocker.
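
The skip conditions can be read as a small decision function. The migration itself is a shell script; this TypeScript sketch only mirrors its logic, and the `MigrationInputs` shape and `decideMigration` name are hypothetical. The whitespace stripping reflects a later review fix in this same series.

```typescript
interface MigrationInputs {
  home: string | undefined;
  syncMode: string;         // raw output of `gstack-config get gbrain_sync_mode`
  gstackGitExists: boolean; // whether ~/.gstack/.git is present
  helperExists: boolean;    // whether gstack-gbrain-source-wireup is on disk
}

type MigrationDecision =
  | { run: false; reason: string }      // every skip still exits 0
  | { run: true; strict: false };       // the helper is invoked without --strict

function decideMigration(inputs: MigrationInputs): MigrationDecision {
  if (inputs.home === undefined || inputs.home === "") {
    return { run: false, reason: "HOME unset or empty" };
  }
  // Strip all whitespace so "off\n" is classified as "off".
  const mode = inputs.syncMode.replace(/\s+/g, "");
  if (mode === "" || mode === "off") {
    return { run: false, reason: "user opted out of brain sync" };
  }
  if (!inputs.gstackGitExists) {
    return { run: false, reason: "brain-init never ran" };
  }
  if (!inputs.helperExists) {
    return { run: false, reason: "helper missing on disk" };
  }
  return { run: true, strict: false };
}
```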

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.12.3.0: setup-gbrain wireup ships the gbrain federation surface

Bumps VERSION 1.12.2.0 → 1.12.3.0 with a release-notes-format entry in
CHANGELOG.md. After upgrade, the placeholder consumers.json wireup is gone,
gbrain sources + sync + skill-end hook is the new path, your gstack memory
is actually searchable in gbrain.

The CHANGELOG entry follows the release-summary format from CLAUDE.md:
two-line bold headline, lead paragraph naming what shipped, "verify after
upgrade" command block readers can run on their own brain to see the
delta, then the standard Itemized changes / What this means / For
contributors sections.

Three pre-existing test failures on this branch are flagged in the
contributor section: the GSTACK_HOME isolation test (reads Garry's actual
~/.gstack/config.yaml), the 2MB tracked-binary test (security-bench
fixtures > 2MB), and the Opus 4.7 pacing-directive test (overlay text
drifted). All three were verified to fail on the base branch too — out
of scope for this PR, follow-up needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: helper locks GBRAIN_DATABASE_URL at startup, defends against config rewrites

The wireup helper previously read ~/.gbrain/config.json on every gbrain
subprocess invocation. On Garry's Mac, multiple concurrent test runs and
agent integrations were rewriting that file mid-sync, redirecting the
wireup at the wrong brain partway through a 4-min initial import.

This commit adds a `--database-url <url>` flag to the helper and locks
the URL at startup. Precedence:
  1. --database-url flag                       (explicit caller intent)
  2. GBRAIN_DATABASE_URL / DATABASE_URL env    (CI / manual override)
  3. read once from ~/.gbrain/config.json      (default)

Whichever wins gets exported as GBRAIN_DATABASE_URL for every child
`gbrain` invocation. Per gbrain's loadConfig at src/core/config.ts:53,
env-var URLs override the file URL — so a process that flips config.json
between two of our gbrain calls can't redirect us. Defense-in-depth:
once the URL is locked, the wireup completes against the original brain
even under hostile filesystem conditions.
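
The precedence chain can be sketched as a resolve-once function. The helper is a shell script, so this TypeScript version only illustrates the ordering. The `lockDatabaseUrl` name and `UrlSources` shape are hypothetical, and two details are assumptions of this sketch: `GBRAIN_DATABASE_URL` is checked before `DATABASE_URL` (the commit lists them together without ranking them), and empty strings are treated as unset.

```typescript
interface UrlSources {
  flag?: string;                                      // value of --database-url, if given
  env: Readonly<Record<string, string | undefined>>;  // process environment snapshot
  configFileUrl?: string;                             // URL read once from ~/.gbrain/config.json
}

interface LockedUrl {
  url: string;
  source: "flag" | "env" | "config";
}

function nonEmpty(value: string | undefined): string | undefined {
  return value !== undefined && value.length > 0 ? value : undefined;
}

// Resolves the URL a single time. The caller is expected to export the result
// as GBRAIN_DATABASE_URL for every child process, so later edits to the config
// file cannot redirect work that is already in flight.
function lockDatabaseUrl(sources: UrlSources): LockedUrl | undefined {
  const fromFlag = nonEmpty(sources.flag);
  if (fromFlag !== undefined) return { url: fromFlag, source: "flag" };

  const fromEnv = nonEmpty(sources.env["GBRAIN_DATABASE_URL"]) ?? nonEmpty(sources.env["DATABASE_URL"]);
  if (fromEnv !== undefined) return { url: fromEnv, source: "env" };

  const fromConfig = nonEmpty(sources.configFileUrl);
  if (fromConfig !== undefined) return { url: fromConfig, source: "config" };

  return undefined; // nothing to lock
}
```

Returning the winning source alongside the URL makes it easy for a test, or a log line, to confirm which layer of the precedence chain was used.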

setup-gbrain/SKILL.md.tmpl Step 7 now reads the URL out of config.json
once (via python3 inline) and passes it explicitly with --database-url,
so even the very first wireup call is decoupled from config.json mutability.

Three new test cases cover the lock behavior:
  - --database-url flag is exported to child gbrain calls
  - falls back to ~/.gbrain/config.json when no flag and no env
  - flag overrides env GBRAIN_DATABASE_URL and config.json values

The fake gbrain in the test suite now records GBRAIN_DATABASE_URL alongside
each call so tests can assert the helper exported the locked URL.

Total test count: 13 → 16 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump v1.12.3.0 references to v1.15.1.0 to match merged-with-main release

Internal-only renames after merging origin/main bumped this branch's release
target from v1.12.3.0 → v1.15.1.0:

- gstack-upgrade/migrations/v1.12.3.0.sh → v1.15.1.0.sh (rename + log-prefix
  bump from "[v1.12.3.0]" to "[v1.15.1.0]")
- bin/gstack-brain-consumer header: "DEPRECATED in v1.12.3.0" → "DEPRECATED in
  v1.15.1.0"; removal target bumped from v1.13.0.0 → v1.16.0.0 (next minor
  after v1.15.1.0).
- bin/gstack-brain-uninstall: "no longer written ... since v1.12.3.0" →
  "since v1.15.1.0".

No behavior change. Test suite still 16/16 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: 10 new cases close coverage gaps (helper defensive paths + migration)

/ship Step 7 coverage audit reported 48% (22/46 branches). Added 10 cases
covering the highest-impact gaps:

Helper (test/gstack-gbrain-source-wireup.test.ts, +3 cases → 19 total):
- --uninstall when gbrain is missing: best-effort exit 0, worktree still cleaned
- --no-pull skips HEAD advance on existing worktree (was untested)
- Stray non-git directory at worktree path is cleaned up + worktree created

Migration (test/gstack-upgrade-migration-v1_15_1_0.test.ts, NEW, 7 cases):
- HOME unset → defensive exit 0
- gbrain_sync_mode=off → exit 0 silently
- gbrain_sync_mode unset → exit 0 silently
- no ~/.gstack/.git → exit 0 silently
- helper missing on PATH → warning + exit 0
- happy path → invokes helper without --strict
- helper exits non-zero → migration prints retry hint, still exits 0 (non-blocking)

Also syncs package.json version from 1.15.0.0 → 1.15.1.0 to match VERSION
file (DRIFT_STALE_PKG repair from /ship Step 12 idempotency check; was a
manual-edit-bypass artifact from the merge step).

Coverage estimate: 48% → ~75%. Mainline + migration script + key defensive
paths all exercised. 26 tests total covering the new code surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: pre-landing review auto-fixes (5 correctness + observability)

/ship Step 9 review surfaced 9 INFORMATIONAL findings on the new helper +
migration. Five auto-fixed with no behavior regression (26/26 tests pass):

bin/gstack-gbrain-source-wireup:
- Version compare: put floor "0.18.0" first in `sort -V` stdin so equal-or-
  greater $v always sorts to position 2. Stable across sort implementations.
- _worktree_add_detached: drop `2>/dev/null` on the `worktree add`, surface
  git's stderr through `prefix` so users see WHY adds fail (disk, perms).
- ensure_worktree: same observability fix on the `git checkout --detach` path
  during HEAD-advance, so users see the actual git error before recovery.
- do_probe: replace `[ -d X ] || [ -f X ] && set=present` (precedence trap —
  the `&&` short-circuits when the dir branch fails) with explicit if-block.
- do_probe: capture `check_source_state`'s return code explicitly via
  `set +e; ...; rc=$?; set -e`. `$?` after an `if`/`elif` chain is fragile
  under set -e and may not reach the elif under some shell versions.
- do_wireup: same explicit return-code capture for `ensure_worktree`. The
  prior `ensure_worktree || { if [ $? = 2 ]; ...` pattern relied on `$?`
  reflecting the function's return after `||`, which is implementation-defined.

gstack-upgrade/migrations/v1.15.1.0.sh:
- Trim whitespace from `gstack-config get gbrain_sync_mode` output via
  `tr -d '[:space:]'`. Trailing newlines would mis-classify "off\n" as a
  non-empty non-off mode and incorrectly invoke the helper.

Skipped findings (cosmetic / out of scope):
- `python3 -c` reads `~/.gbrain/config.json` via `expanduser` instead of
  the helper's `$GBRAIN_CONFIG` variable (cosmetic; HONORS HOME override).
- Long sync-failure error message could truncate to last N lines (cosmetic
  log readability).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: adversarial review hardening (rm safety, jq probe, secret redaction, multi-Mac)

/ship Step 11 adversarial review surfaced 7 CRITICAL issues. Five fixed
inline (no behavior regression, 26/26 tests still pass):

bin/gstack-gbrain-source-wireup:

1. **rm -rf path validation** (was: F-c-CRITICAL 9/10).
   Added `safe_rm_worktree` helper that refuses any path not strictly under
   $HOME/, plus dangerous-path denylist for /, /Users, $HOME root. Replaces
   raw `rm -rf "$WORKTREE"` calls (lines 161, 169 originally). If user sets
   GSTACK_BRAIN_WORKTREE="" or "/", the helper now dies cleanly instead of
   nuking the home dir or root.

2. **jq dependency probe** (was: F-c-CRITICAL 9/10).
   `check_source_state` now hard-fails with a clear message if jq is missing,
   instead of silently returning "absent" → re-add → die-on-duplicate. Plus
   trims whitespace from jq output (`tr -d '[:space:]'`) to defend against
   gbrain emitting `\n` for missing fields. Header comment claimed jq was a
   transitive dep; now we enforce it.

3. **Python heredoc warns on JSON parse failure** (was: F-c-CRITICAL 8/10).
   Previously `except Exception: pass` silently swallowed malformed JSON,
   leaving _locked_url empty and defeating the URL-lock defense. Now writes
   the parse error to a temp file and warns the user that the URL was not
   locked. Also passes the config path via env var (GBRAIN_CONFIG_PATH)
   instead of hardcoded `~/.gbrain/config.json`, respecting any HOME override.

4. **Multi-Mac source-id collision fix** (was: F-c-CRITICAL 9/10).
   When `check_source_state` returns 1 (source exists at different path), the
   helper used to remove + re-add. Two Macs sharing one Supabase brain would
   ping-pong the local_path metadata on every sync. Now: if the existing
   path's basename matches the local worktree's basename (likely another
   machine's local copy of the SAME brain repo), skip re-registration and
   sync against the local worktree. gbrain stores pages by content; metadata
   is informational. No more ping-pong.

5. **Redact DB URL from sync-failure error message** (was: F-c-CRITICAL 7/10).
   `gbrain sync` failures used to echo the full stderr (which can contain
   the postgres connection string with password) into the user's terminal
   and any log redirect. Now we sed-replace any `postgres://...` with
   `postgres://***REDACTED***` before the die() call, and only show the
   last 10 lines.
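
Fixes 1 and 5 above can be sketched as pure functions. This is a TypeScript analogue of the bash helper, not its actual code: `isSafeWorktreePath` and `redactSyncError` are hypothetical names, and the guard does not resolve symlinks.

```typescript
/** Refuse anything that is not strictly under `home`, plus a few known-dangerous paths. */
function isSafeWorktreePath(path: string, home: string): boolean {
  // Drop trailing slashes so "/Users/me/" and "/Users/me" compare equal.
  const strip = (p: string): string => (p.length > 1 ? p.replace(/\/+$/, "") : p);
  const target = strip(path);
  const root = strip(home);
  if (target === "" || target === "/" || target === "/Users" || target === root) {
    return false;
  }
  // Reject ".." segments instead of trying to normalize them.
  if (target.split("/").includes("..")) return false;
  return target.startsWith(root + "/");
}

/** Mask postgres connection strings, then keep only the last `keepLines` lines. */
function redactSyncError(stderr: string, keepLines = 10): string {
  const redacted = stderr.replace(/postgres:\/\/\S+/g, "postgres://***REDACTED***");
  return redacted.split("\n").slice(-keepLines).join("\n");
}
```

Redacting before truncating matters: a connection string on an early line would otherwise survive in any code path that logs the full text.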

Bonus minor fix: `die()` now uses `$1` instead of `$*` for the warn
message, so the exit-code arg ($2) doesn't get appended to the warning text.

Acknowledged-but-deferred:
- GBRAIN_DATABASE_URL env exposure on Linux via /proc/$PID/environ. This is
  a Linux-only concern; gstack is Mac-targeted today and macOS restricts
  process env reads. Document as a follow-up if Linux support lands.
- gbrain version parser brittleness if gbrain switches to "v0.18.0" prefix.
  Defensive only; current gbrain output matches `gbrain X.Y.Z` exactly.
- bash 3.2 PIPESTATUS reliability. Tests pass on the host bash version (3.2+
  via macOS); modern bash 5.x is widely available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync gbrain-source-wireup helper into USING_GBRAIN + gbrain-sync

USING_GBRAIN_WITH_GSTACK.md: add gstack-gbrain-source-wireup row to the bin
helpers table — describes federation registration via `gbrain sources add` +
worktree, lists flags, calls out it replaces the dead consumers.json/ingest-repo
HTTP wireup.

docs/gbrain-sync.md: replace the `gstack-brain-reader add --ingest-url` step
in gstack-brain-init's flow (which targeted the never-shipped /ingest-repo
endpoint) with the real flow — federate via gbrain sources + worktree, point
to bin/gstack-gbrain-source-wireup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* v1.16.1.0: rebump after queue-collision (PR garrytan#1233 took v1.16.0.0)

CI's "Check VERSION is not stale vs queue" job (job 73105686380) failed
with: "VERSION drift: PR garrytan#1234 claims v1.15.1.0 but the queue has moved —
next free slot is v1.16.1.0." PR garrytan#1233 (garrytan/browserharness) entered
the queue claiming v1.16.0.0 between when this branch's prior /ship ran
and when CI evaluated, so v1.15.1.0 is stale. Rebumping on top.

Files updated:
- VERSION                                                     1.15.1.0 → 1.16.1.0
- package.json                                                1.15.1.0 → 1.16.1.0
- CHANGELOG.md heading + Before/After columns                 1.15.1.0 → 1.16.1.0
- CHANGELOG removal target (consumers.json + config keys)     1.16.0.0 → 1.17.0.0
- gstack-upgrade/migrations/v1.15.1.0.sh                      → renamed v1.16.1.0.sh + log prefix
- bin/gstack-brain-consumer "DEPRECATED in" + "removal in"    1.15.1.0/1.16.0.0 → 1.16.1.0/1.17.0.0
- bin/gstack-brain-uninstall "since vX.Y.Z.W"                 1.15.1.0 → 1.16.1.0
- test/gstack-upgrade-migration-v1_15_1_0.test.ts             → renamed v1_16_1_0.test.ts

No behavior change. 26/26 wireup + migration tests still pass on the rename.
Full bun test suite: exit 0, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v1.17.0.0: rebump again — bump-detection now classifies branch as MINOR

CI's version-stale check (job 73106360896) failed: PR garrytan#1234 claims v1.16.1.0
but the queue moved to v1.17.0.0. Root cause: bumping 1.15.1.0 → 1.16.1.0
to dodge the prior collision turned the branch's diff classification from
PATCH (1.15.0 → 1.15.1) into MINOR (1.15.0 → 1.16.x). detect-bump.ts now
sees MINOR, gstack-next-version walks the MINOR lane past garrytan#1233's
v1.16.0.0 claim, and the next free slot is v1.17.0.0.
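
The slot arithmetic in the two CI failures can be reproduced with a small model. This is a sketch inferred from those messages only, not the actual logic of detect-bump.ts or gstack-next-version; `classifyBump`, `nextFreeSlot`, and the level names are assumptions.

```typescript
type BumpLevel = "major" | "minor" | "patch" | "micro";
const LEVELS: BumpLevel[] = ["major", "minor", "patch", "micro"];

function parseVersion(v: string): number[] {
  return v.split(".").map(Number);
}

function compareParts(a: number[], b: number[]): number {
  for (let i = 0; i < LEVELS.length; i++) {
    const d = (a[i] ?? 0) - (b[i] ?? 0);
    if (d !== 0) return d;
  }
  return 0;
}

/** The first component that differs between base and claim decides the level. */
function classifyBump(base: string, claimed: string): BumpLevel | null {
  const b = parseVersion(base);
  const c = parseVersion(claimed);
  for (let i = 0; i < LEVELS.length; i++) {
    if ((c[i] ?? 0) !== (b[i] ?? 0)) return LEVELS[i] ?? null;
  }
  return null;
}

/** Assumed rule: bump the highest version already claimed (or the base) at `level`. */
function nextFreeSlot(base: string, queuedClaims: string[], level: BumpLevel): string {
  const head = [base, ...queuedClaims]
    .map(parseVersion)
    .reduce((max, v) => (compareParts(v, max) > 0 ? v : max));
  const idx = LEVELS.indexOf(level);
  return [0, 1, 2, 3]
    .map((i) => (i < idx ? head[i] ?? 0 : i === idx ? (head[i] ?? 0) + 1 : 0))
    .join(".");
}
```

Under this model a PATCH branch behind the v1.16.0.0 claim lands on 1.16.1.0, and the same branch reclassified as MINOR lands on 1.17.0.0, matching both CI messages.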

This matches CLAUDE.md's scale-aware bump rules: this branch IS a
MINOR ("substantial new capability shipped — skill, harness, command,
big refactor"). The new helper + migration + integration totals ~1200
lines added across 11 files with 26 new tests. PATCH was never the right
classification; the queue collision forced the correct one.

Files updated:
- VERSION                                                     1.16.1.0 → 1.17.0.0
- package.json                                                1.16.1.0 → 1.17.0.0
- CHANGELOG.md heading + After column                         1.16.1.0 → 1.17.0.0
- CHANGELOG removal targets                                   1.17.0.0 → 1.18.0.0
- gstack-upgrade/migrations/v1.16.1.0.sh                      → renamed v1.17.0.0.sh + log prefix
- bin/gstack-brain-consumer "DEPRECATED in" + "removal in"    1.16.1.0/1.17.0.0 → 1.17.0.0/1.18.0.0
- bin/gstack-brain-uninstall "since vX.Y.Z.W"                 1.16.1.0 → 1.17.0.0
- test/gstack-upgrade-migration-v1_16_1_0.test.ts             → renamed v1_17_0_0.test.ts

26/26 tests still pass. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dual-impl): /review pass — maxBuffer 50MB + cleaner squashed-commit message

Two informational findings from /review pre-landing pass:

1. spawnSync default maxBuffer is 1MB. A large cumulative diff (e.g., 10k+
   line refactor squashed across multiple commits) would silently truncate
   when piped to `git apply -3 -` in the cherry-pick fallback path. Set
   maxBuffer to 50 MB on every git invocation in worktree.ts.

2. Patch-fallback commit message used `git log --format=%s` across N commits,
   producing N subject lines in one ugly -m string. Now: single-commit case
   uses the original subject; multi-commit case uses
   "Apply <winner> implementation (N commits squashed)".

Both BY-DESIGN risk (latent dualImpl undefined spread) and repo hygiene
(untracked junk files predating this branch) deferred — not actionable here.
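
The two fixes above might look roughly like this. `runGit` and `squashedCommitMessage` are hypothetical names, not the functions in worktree.ts; `spawnSync` and its `maxBuffer` option are the real Node API. The empty-list error is an assumption.

```typescript
import { spawnSync } from "node:child_process";

// 50 MB, so a large cumulative diff is not cut off at the 1 MB default.
const GIT_MAX_BUFFER = 50 * 1024 * 1024;

/** Run git with the raised buffer; `input` is piped to stdin (e.g. for `git apply -3 -`). */
function runGit(args: string[], cwd: string, input?: string) {
  return spawnSync("git", args, {
    cwd,
    input,
    encoding: "utf8",
    maxBuffer: GIT_MAX_BUFFER,
  });
}

/** One commit keeps its own subject; several collapse into a single summary line. */
function squashedCommitMessage(winner: string, subjects: string[]): string {
  if (subjects.length === 0) throw new Error("No commits found");
  if (subjects.length === 1) return subjects[0]!;
  return `Apply ${winner} implementation (${subjects.length} commits squashed)`;
}
```

Setting `maxBuffer` in one wrapper keeps every git call on the same limit instead of relying on each call site to remember it.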

122 tests pass, 0 fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(dual-impl): Phase 3 — sub-agents.ts (runCodexImpl, runJudgeOpus, parseFailureCount)

Five new exports for the dual-implementor tournament:

- parseFailureCount(output): counts ✗ markers (bun) or ^FAIL lines (jest/pytest);
  returns max of the two so different runners report comparable signal.
- parseJudgeVerdict(output): extracts WINNER: gemini|codex + REASONING from
  Opus output. Falls back to verdict='gemini' with explanatory reasoning if
  WINNER line is missing — better to ship one impl than fail on a parse quirk.
- buildCodexImplArgv(opts): pure helper exposing the codex exec argv shape
  (exec + danger-full-access + -C cwd + reasoning=high). Extracted so tests
  can assert the invocation without spawning the binary.
- runCodexImpl(opts): mirrors runGemini structure — file-path I/O, captured
  output, single retry on timeout. Operates inside an isolated worktree so
  danger-full-access is safe (no leakage to main cwd).
- runJudgeOpus(opts): spawns claude --model claude-opus-4-7 -p with file-path
  I/O. Caller invokes parseJudgeVerdict on result.stdout to extract verdict.
  GSTACK_BUILD_JUDGE_TIMEOUT env var (default 10 min).

12 new tests cover parseFailureCount (5), parseJudgeVerdict (5), and
buildCodexImplArgv (2). 134 tests pass, 0 fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dual-impl): Phase 3 post-review HIGH+MEDIUM+LOW fixes

Codex review surfaced four issues. All fixed:

1. HIGH — parseJudgeVerdict silently fell back to 'gemini' when WINNER line
   was missing. That defeats Phase 2's fail-closed semantics (dual_winner_pending
   without selectedImplementor → FAIL). Now returns verdict=null on malformed
   output; Phase 4 caller MUST treat null as hard failure. WINNER pattern is
   also now anchored to ^ so it doesn't match prose like "the WINNER: gemini
   is better here".

2. HIGH — runCodexImpl defaulted to 'danger-full-access', which is unsafe in
   linked git worktrees (shared .git, remotes, credentials with main cwd).
   A bad command could push --delete origin main from inside the worktree.
   Default is now 'workspace-write'; opts.sandbox or
   GSTACK_BUILD_CODEX_IMPL_SANDBOX env var allows opt-in to looser sandboxes.

3. MEDIUM — parseFailureCount returned 0 when no signal was detectable,
   making "could not parse failures" beat "1 real failure" in tie-breaking.
   Now returns `number | undefined`; phase-runner already fails closed when
   both impls have undefined failureCount. Also added priority-1 summary-line
   parsing ("3 failed" anchored to ^) for better cross-runner accuracy.

4. LOW — judge model was hardcoded 'claude-opus-4-7'. Now overridable via
   GSTACK_BUILD_JUDGE_MODEL env var.
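
A minimal sketch of the fail-closed parse from item 1. The `JudgeVerdict` shape and the fallback reasoning text are assumptions; the real sub-agents.ts signature may differ.

```typescript
type Implementor = "gemini" | "codex";

interface JudgeVerdict {
  verdict: Implementor | null; // null = malformed output, caller must fail
  reasoning: string;
}

function parseJudgeVerdict(output: string): JudgeVerdict {
  // `^` with the m flag only matches at the start of a line, so prose such
  // as "the WINNER: gemini is better here" is not accepted.
  const winner = /^WINNER:\s*(gemini|codex)\b/m.exec(output);
  if (!winner) {
    return { verdict: null, reasoning: "no WINNER line found at the start of a line" };
  }
  const reasoning = /^REASONING:\s*([\s\S]*)/m.exec(output);
  return {
    verdict: winner[1] as Implementor,
    reasoning: (reasoning?.[1] ?? "").trim(),
  };
}
```

Returning null rather than a default implementor leaves the decision to the caller, which is what keeps the Phase 2 fail-closed path reachable.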

Tests updated accordingly: parseJudgeVerdict tests now check null fallback +
mid-sentence rejection; parseFailureCount tests check undefined + summary-line
priority; buildCodexImplArgv tests check workspace-write default + sandbox
override.

137 tests pass, 0 fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(dual-impl): Phase 4 — cli.ts dispatch handlers + --dual-impl flag

- Args.dualImpl: boolean field; --dual-impl CLI flag wired through parseArgs
  (now exported); HELP_TEXT exported and documents the flag.
- parsePlan(content, { dualImpl }) stamps dualImpl=true on every parsed phase
  when the flag is set — single-impl plans are unchanged.
- buildCodexImplPromptBody(phase, planFile): tournament-mode Codex prompt
  ("competing against Gemini, do NOT change test assertions, write minimal
  correct code").
- buildJudgePrompt({ phase, geminiDiff, codexDiff, geminiTestResult,
  codexTestResult }): Opus judge prompt with anchored WINNER:/REASONING:
  format and 5KB-trimmed diffs.
- runPhase handlers for the 4 new actions:
  * RUN_DUAL_IMPL  — createWorktrees + Promise.all([runGemini, runCodexImpl]);
                     teardown + fail-closed if either impl crashes.
  * RUN_DUAL_TESTS — Promise.all([runTests(gemini), runTests(codex)]);
                     parses failureCount from each; passes both into ApplyResultExtra.
  * RUN_JUDGE_OPUS — reads worktree diffs, runJudgeOpus with file-path I/O;
                     parseJudgeVerdict; null verdict → fail-closed + teardown.
  * APPLY_WINNER   — applyWinner cherry-pick; ALWAYS tears down worktrees
                     (even on cherry-pick failure — Phase 4 invariant).
- readWorktreeDiff helper: git diff baseCommit..HEAD with 50MB maxBuffer.
- Exhaustiveness guard preserved (no _never violation on new actions).
- 9 new tests cover --help text, parseArgs flag, and both new prompt bodies.
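
For readers unfamiliar with the exhaustiveness guard mentioned above, the pattern looks roughly like this. The four action names come from this commit; `DualAction`, `assertNever`, `announce`, and the label mapping are illustrative assumptions, not the code in cli.ts.

```typescript
type DualAction =
  | "RUN_DUAL_IMPL"
  | "RUN_DUAL_TESTS"
  | "RUN_JUDGE_OPUS"
  | "APPLY_WINNER";

// Only callable when every union member has already been handled.
function assertNever(value: never): never {
  throw new Error(`Unhandled action: ${String(value)}`);
}

function announce(action: DualAction): string {
  switch (action) {
    case "RUN_DUAL_IMPL":
      return "Dual Impl";
    case "RUN_DUAL_TESTS":
      return "Dual Tests";
    case "RUN_JUDGE_OPUS":
      return "Judge Opus";
    case "APPLY_WINNER":
      return "Apply Winner";
    default:
      // Adding a fifth action without a case turns this line into a compile error.
      return assertNever(action);
  }
}
```

The guard costs nothing at runtime on the happy path and moves "forgot a handler" from a production failure to a type error.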

146 tests pass, 0 fail.
bun build build/orchestrator/cli.ts → clean.
gstack-build --help shows --dual-impl.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dual-impl): Phase 4 post-review HIGH+MEDIUM fixes

Codex review surfaced four issues. All fixed:

1. HIGH — readWorktreeDiff returned '' on git failure, letting the judge see
   empty evidence and pick arbitrarily. Now returns string|null; RUN_JUDGE_OPUS
   handler fails closed (teardown + status=failed) when either diff is null.

2. HIGH — implementations could pass tests with uncommitted edits, but
   applyWinner has nothing to cherry-pick. New countCommitsSinceBase helper +
   RUN_DUAL_IMPL now treats "neither implementor committed anything" as a
   catastrophic failure alongside timeouts and double-non-zero-exits.
   Single-implementor commit failures still let the test phase auto-select.

3. MEDIUM — RUN_DUAL_IMPL post-createWorktrees block had no cleanup guard.
   A throw from writeFileSync or unexpected Promise.all rejection would leak
   worktrees + branches. Now wrapped in try/catch/finally with teardown on
   any failure path; dualImplOk flag suppresses teardown on the success path
   (downstream phases own cleanup).

4. MEDIUM — APPLY_WINNER unconditionally tore down worktrees, including on
   apply failure — destroying the only copy of the winner's code. Now
   preserves worktrees on cherry-pick failure and surfaces paths/branches +
   manual-cleanup commands in the error message. Teardown only happens after
   a successful apply.
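
Item 4's preserve-on-failure rule, sketched with injected callbacks so it runs without git. `Worktree`, `ApplyOutcome`, and `applyWinnerStep` are hypothetical names, and the cleanup commands shown are one plausible form of the manual-cleanup hint.

```typescript
interface Worktree {
  implementor: "gemini" | "codex";
  path: string;
  branch: string;
}

interface ApplyOutcome {
  ok: boolean;
  tornDown: boolean;
  message: string;
}

function applyWinnerStep(
  worktrees: Worktree[],
  apply: () => boolean,
  teardown: (w: Worktree[]) => void,
): ApplyOutcome {
  if (apply()) {
    // Only a successful apply makes the worktrees disposable.
    teardown(worktrees);
    return { ok: true, tornDown: true, message: "winner applied" };
  }
  // On failure the worktrees hold the only copy of the winner's code: keep
  // them and tell the user how to clean up by hand later.
  const cleanup = worktrees
    .map((w) => `  git worktree remove --force ${w.path} && git branch -D ${w.branch}`)
    .join("\n");
  return {
    ok: false,
    tornDown: false,
    message: `apply failed; worktrees preserved. Manual cleanup:\n${cleanup}`,
  };
}
```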

146 tests pass, 0 fail. bun build clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(dual-impl): Phase 5 — README + SKILL.md.tmpl v1.15.0 + integration test

- README: new "Dual Implementor Mode" section (workflow, auto-select rules,
  worktree isolation, recovery semantics, env vars).
- SKILL.md.tmpl: version 1.14.0 → 1.15.0 in frontmatter + announce-version line.
- bun run gen:skill-docs --host claude → regenerated build/SKILL.md.
- skill-md.test.ts pinned to v1.15.0.
- integration.test.ts adds a second dry-run that asserts --dual-impl announces
  "Dual Impl", "Dual Tests", "Judge Opus", and "Apply Winner" — and that the
  TDD steps (Test Specification, Verify Red) still run after handoff.
- CHANGELOG: full Unreleased entry covering new flag, state machine extension,
  fail-closed paths, recovery semantics, and 42-test coverage delta (105→147).

Verified:
  - 147 tests pass, 0 fail.
  - bun build build/orchestrator/cli.ts → clean.
  - gstack-build --help shows --dual-impl.
  - bun run gen:skill-docs regen → SKILL.md frontmatter version: 1.15.0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(dual-impl): Phase 5 post-review LOW + MEDIUM fixes

- Clarify "each TDD phase" upfront (legacy 2-checkbox plans skip dual-impl
  silently — Phase 5 review LOW).
- Document required CLIs (gemini, codex, claude) for --dual-impl with explicit
  note that orchestrator does NOT preflight check; missing Codex degrades into
  one-sided tournament. (Phase 5 review MEDIUM.)
- Update stale "105 tests across 9 files" to "147 tests across 10 files" with
  full coverage breakdown including dual-impl primitives and integration tests.

DEFERRED (Phase 5 review MEDIUM #1): hermetic non-dry-run integration test
with fake GEMINI_BIN/CODEX_BIN/CLAUDE_BIN. Real handler paths (createWorktrees,
Promise.all dispatch, applyWinner cherry-pick, teardown invariants) are exercised
only through unit tests, not end-to-end. Acceptable for v1; landed feature is
opt-in and small-blast-radius.

147 tests pass, 0 fail.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(dual-impl): Codex /review pass — 3 P2/P3 findings fixed

Codex structured review (gpt-5.5, --base main, full diff) surfaced 3 valid
correctness issues in the dual-implementor flow. All fixed; no P1 findings.
GATE: PASS.

[P2] cli.ts:739-741 — Zero-commit implementor still advanced to test/judge
  Old logic: only fail if BOTH sides committed nothing. If gemini committed
  but codex didn't (or vice versa), the no-commit side could pass tests on
  uncommitted edits and win auto-select, then applyWinner would fail with
  "No commits found".
  Fix: when EXACTLY ONE side committed, short-circuit dual-impl: skip
  RUN_DUAL_TESTS + RUN_JUDGE_OPUS, auto-select the committed side, jump
  straight to dual_winner_pending. Logs the warning so the user sees which
  implementor failed to commit. Both-failed and neither-committed paths
  unchanged (still fail-closed).

[P2] sub-agents.ts parseFailureCount — pytest summary not matched
  Old regex: `^\s*(\d+)\s+fail` failed on pytest's `===== 2 failed in 0.10s =====`
  because of the leading `=====` decoration. Pytest projects would return
  undefined → fail-closed even when signal was present.
  Fix: priority-1 pytest pattern `^=+\s*(\d+)\s+failed\b` matches the
  decorated summary; priority-2 keeps the bare-line pattern for bun/jest/cargo;
  priority-3 marker count fixed from `^FAILED?\b` (which matched FAILE/FAILED)
  to `^FAIL(?:ED)?\b` (matches both FAIL and FAILED). 3 new pytest tests added.

[P3] cli.ts:806-808 — Parallel dual-test logs collide
  Both runTests calls used `iteration: 1`, racing for the same log file
  `phase-N-tests-1.log`. testLogPath fields would point to one overwritten log.
  Fix: extended runTests with optional `logSuffix` param ('gemini'/'codex' for
  dual mode); resulting logs are `phase-N-tests-1-gemini.log` and
  `phase-N-tests-1-codex.log`. Default behavior unchanged when suffix omitted.
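
The three-priority failure count described in the second finding could be sketched as follows. The regexes are the ones quoted above; the function body is an assumption about how they are combined, not a copy of sub-agents.ts.

```typescript
function parseFailureCount(output: string): number | undefined {
  // Priority 1: pytest's decorated summary, e.g. "===== 2 failed in 0.10s =====".
  const pytest = /^=+\s*(\d+)\s+failed\b/m.exec(output);
  if (pytest) return Number(pytest[1]);

  // Priority 2: bare summary line such as " 3 fail" (bun/jest/cargo style).
  const bare = /^\s*(\d+)\s+fail/m.exec(output);
  if (bare) return Number(bare[1]);

  // Priority 3: count per-test markers and take the larger of the two counts.
  const crosses = (output.match(/✗/g) ?? []).length;
  const failLines = (output.match(/^FAIL(?:ED)?\b/gm) ?? []).length;
  const markers = Math.max(crosses, failLines);

  // No detectable signal: undefined, so "could not parse" never beats a real count.
  return markers > 0 ? markers : undefined;
}
```

A summary line wins over marker counting because it is the runner's own total, while markers can be duplicated or truncated in captured logs.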

150 tests pass, 0 fail. bun build clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(sub-agents): mergeOutputFile empty-fallback — preserve verdict stream when output file is empty

When Codex applies edits inline but skips writing the report file, the output
file is left empty. Without this fix mergeOutputFile replaces stdout with ''
and parseVerdict returns 'unclear' — the review loop never converges.

Fix: detect empty fileContent and fall through to merging stderr+stdout so the
GATE PASS / GATE FAIL signal is preserved for the verdict scan.
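
A sketch of the empty-file fallback, assuming a simplified signature: the real mergeOutputFile and parseVerdict in sub-agents.ts may take different arguments, and only the 'unclear' return value is taken from this commit.

```typescript
/** Pick the text the verdict scan should see. */
function mergeOutputFile(fileContent: string, stdout: string, stderr: string): string {
  if (fileContent.trim() === "") {
    // The report file was never written: fall back to the captured streams
    // so a GATE line printed there is not lost.
    return [stderr, stdout].filter((s) => s.length > 0).join("\n");
  }
  return fileContent;
}

/** Simplified verdict scan; "pass" and "fail" are assumed return values. */
function parseVerdict(text: string): "pass" | "fail" | "unclear" {
  if (/GATE:?\s*FAIL/.test(text)) return "fail";
  if (/GATE:?\s*PASS/.test(text)) return "pass";
  return "unclear";
}
```

With the fallback, an empty report file plus `GATE: PASS` on stdout parses as a pass instead of 'unclear'.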

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Garry Tan <garrytan@gmail.com>