v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) by garrytan · Pull Request #1255 · garrytan/gstack

garrytan · 2026-04-28T07:03:26Z

Summary

Tightens the gate-tier /plan-ceo-review plan-mode smoke so the regression where the agent skips Step 0 entirely and goes straight to ExitPlanMode now FAILS the gate. Refactors the PTY harness so the contract is testable in 14ms of free unit tests instead of $0.50 of stochastic PTY.

PTY runner refactor: extracts classifyVisible() pure function (silent_write → plan_ready → asked → null). Permission dialogs filtered out of 'asked' so a permission prompt cannot pose as a Step 0 skill question. Adds TAIL_SCAN_BYTES = 1500 shared constant + env? passthrough on runPlanSkillObservation.
Tightened smoke: assertion narrows from ['asked', 'plan_ready'] to 'asked' only for plan-ceo (sibling smokes for plan-eng / plan-design / plan-devex stay loose — they legitimately reach plan_ready on certain branches without asking).
24 new unit tests: deterministic coverage of the classifier, permission-dialog co-trigger, env passthrough surface. Run free on every bun test in 14ms.
Co-trigger contract: bare Do you want to proceed? no longer triggers permission detection on its own — needs a file-edit context (Edit to <path> / Write to <path>).
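
The branch order and the permission-dialog filter described above can be sketched roughly as follows. This is a minimal illustration, not the code in test/helpers/claude-pty-runner.ts: the function names come from this PR, but every regex, the "Ready to code?" marker, the Write(<path>) / Edit(<path>) rendering, and the allow-list entry are assumptions chosen so the example runs on its own.

```typescript
// Sketch only: names follow the PR text, detector bodies are assumptions.
type Classification = "silent_write" | "plan_ready" | "asked" | null;

// Assumed allow-list entry; the real SANCTIONED_WRITE_SUBSTRINGS is not shown in the PR.
const SANCTIONED_WRITE_SUBSTRINGS: string[] = ["/.claude/plans/"];

// Assumed rendering of a file-edit context ("Edit to <path>" / "Write to <path>").
const FILE_EDIT_CONTEXT = /\b(?:Edit|Write) to \S+/;

function isPermissionDialogVisible(visible: string): boolean {
  // Co-trigger: the bare phrase only counts alongside a file-edit context.
  return /Do you want to proceed\?/.test(visible) && FILE_EDIT_CONTEXT.test(visible);
}

function isNumberedOptionListVisible(visible: string): boolean {
  // Assumed shape: a cursor on option 1 and an option 2 on its own line.
  return /❯\s*1\./.test(visible) && /^\s*2\./m.test(visible);
}

function isPlanReadyVisible(visible: string): boolean {
  // Placeholder marker (assumption) for the plan-approval screen.
  return visible.includes("Ready to code?");
}

function isUnsanctionedWriteVisible(visible: string): boolean {
  // Assumed tool-call rendering: Write(<path>) or Edit(<path>).
  const toolCall = /\b(?:Write|Edit)\(([^)]+)\)/g;
  let match: RegExpExecArray | null;
  while ((match = toolCall.exec(visible)) !== null) {
    const path = match[1];
    if (!SANCTIONED_WRITE_SUBSTRINGS.some((s) => path.includes(s))) return true;
  }
  return false;
}

// Pure classifier: silent_write → plan_ready → asked → null.
function classifyVisible(visible: string): Classification {
  if (isUnsanctionedWriteVisible(visible)) return "silent_write";
  if (isPlanReadyVisible(visible)) return "plan_ready";
  // A permission prompt also renders a numbered list, so filter it out here.
  if (isNumberedOptionListVisible(visible) && !isPermissionDialogVisible(visible)) return "asked";
  return null;
}
```

Because the function takes a string and returns a value, each branch can be exercised with a synthetic terminal snapshot and no PTY session.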

Test Coverage

runPlanSkillObservation('plan-ceo-review', plan)
  └── while polling:
      ├── isPlanReadyVisible       → return 'plan_ready'   [GAP→COVERED] FAIL: agent skipped Step 0
      ├── isNumberedOptionListVisible → return 'asked'      [★★★ TESTED] PASS: Step 0 fired
      ├── Write/Edit unsanctioned   → return 'silent_write' [★★  TESTED] FAIL
      ├── proc.exited               → return 'exited'        [★★  TESTED] FAIL
      └── budget elapsed            → return 'timeout'       [★★  TESTED] FAIL

COVERAGE: 5/5 outcomes asserted (was 3/5 — 'plan_ready' previously treated as PASS).

Tests: 0 → 24 unit tests added (test/helpers/claude-pty-runner.unit.test.ts).

Pre-Landing Review

13 informational findings (testing + maintainability specialists), 0 critical. 8 auto-fixed, 5 escalated to user (D5, D7, D8, D9, plus the env-passthrough honesty fix). All resolved.

Adversarial review surfaced one real issue I owe transparency on: the env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' } passthrough is plumbing-only on current code — gstack-config reads ~/.gstack/config.yaml and not env vars. The test docstring now documents this honestly. P3 TODO captured to make gstack-config env-aware so the wiring is real, not advisory.

Plan Completion

D1 (tighten existing) ✓ · D3 (durable section names) ✓ · D4 (env passthrough) ✓ · D5 (classifier filter) ✓ · D6 (deterministic unit test) ✓ · D7 (regex co-trigger) ✓ · D8 (extract classifyVisible) ✓ · D9 (TAIL_SCAN_BYTES constant) ✓. D2 deferred to TODOS as planned (V2 per-finding count assertion).

TODOS

3 captured for v1.21.1.0:

P2: per-finding AskUserQuestion count assertion (V2)
P3: honor env vars in gstack-config (so QUESTION_TUNING=false actually isolates the test)
P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

Test plan

24/24 unit tests pass in 14ms (free, deterministic)
PTY gate test passes 4/4 runs (3 lock-in + 1 post-refactor verification, ~$2.00 spent)
339 helper + skill-validation tests pass post-commit
No collateral on sibling plan-*-plan-mode smokes (their loose assertion is unchanged)

🤖 Generated with Claude Code


Pure classifier extracted from runPlanSkillObservation's polling loop so unit tests can exercise the actual branch order with synthetic input strings.

Runner gains:

- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty). gstack-config does not yet honor env overrides; plumbing is in place for a future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now requires a file-edit context co-trigger. Other clauses unchanged. Skill questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write → plan_ready → asked → null. Permission dialogs filtered out of the 'asked' classification so a permission prompt cannot pose as a Step 0 skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the co-trigger contract.

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching plan_ready first means the agent skipped Step 0 entirely and went straight to ExitPlanMode, which is the regression we want to catch.

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex (whose smokes legitimately reach plan_ready on certain branches without asking), plan-ceo-review's template mandates Step 0A premise challenge plus Step 0F mode selection BEFORE any plan write. There is no legitimate path to plan_ready that does not first emit a skill-question numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs silent_write) with a tailored diagnosis line per case. References the skill template by section name ("Step 0 STOP rules", "One issue = one AskUserQuestion call") instead of line numbers, so it survives template edits.

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' } through the runner. Today this is advisory (gstack-config reads only ~/.gstack/config.yaml, not env vars), but the wiring is in place for a future change. Documented honestly in the docstring.

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.
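
To make the narrowed assertion and the per-outcome failure message concrete, here is a small sketch. The outcome names and skill names come from this PR; passingOutcomes, diagnose, and evaluateSmoke are hypothetical helpers, and the diagnosis wording is invented for illustration rather than copied from the test file.

```typescript
// Sketch only: helper names and message wording are assumptions.
type Outcome = "asked" | "plan_ready" | "silent_write" | "exited" | "timeout";
type PlanSkill =
  | "plan-ceo-review"
  | "plan-eng-review"
  | "plan-design-review"
  | "plan-devex-review";

// plan-ceo-review must ask first; the sibling smokes keep the loose assertion.
function passingOutcomes(skill: PlanSkill): Outcome[] {
  return skill === "plan-ceo-review" ? ["asked"] : ["asked", "plan_ready"];
}

// One tailored diagnosis line per outcome, citing template sections by name.
function diagnose(outcome: Outcome): string {
  switch (outcome) {
    case "asked":
      return "Step 0 fired: a skill question was shown.";
    case "plan_ready":
      return 'Agent reached ExitPlanMode without asking. See "Step 0 STOP rules".';
    case "silent_write":
      return "Agent wrote or edited an unsanctioned file before asking.";
    case "exited":
      return "Process exited before any skill question appeared.";
    case "timeout":
      return "Budget elapsed with no skill question and no plan.";
  }
}

function evaluateSmoke(skill: PlanSkill, outcome: Outcome): { pass: boolean; message: string } {
  const pass = passingOutcomes(skill).includes(outcome);
  const message = pass ? "PASS" : `FAIL (${outcome}): ${diagnose(outcome)}`;
  return { pass, message };
}
```

Keeping the allowed outcomes in one lookup makes it explicit that only the plan-ceo smoke was tightened.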

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial review passes. Captured here so the design intent persists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-28T07:08:30Z

E2E Evals: ✅ PASS

0/0 tests passed | $0 total cost | 12 parallel runners

Suite	Result	Status	Cost

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

Version-gate workflow rejected v1.20.0.0 because the queue moved during the windows-free-tests fix loop:

v1.16.0.0 → garrytan/gbrowser-unleashed (PR #1253) [new since last bump]
v1.17.0.0 → garrytan/setup-gbrain-run (PR #1234)
v1.19.0.0 → garrytan/browserharness (PR #1233)
v1.21.1.0 → garrytan/pty-plan-mode-e2e (PR #1255) [new since last bump]

Two new sibling PRs landed slot claims while we iterated on Windows. Next free MINOR slot is v1.22.0.0. Updated VERSION, package.json, CHANGELOG header + body.

Also pushing the round-13 windows-fix in parallel (test invokes bin/gstack-paths via bash to handle Windows shebang).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Refactor prep for the upcoming per-finding AskUserQuestion count test across plan-{ceo,eng,design,devex}-review. Both new tests and the existing mode-routing test need the same mode regex and the same option-list fingerprint dedupe — pulling them into one source of truth in test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the fingerprint shape) updates everywhere instead of drifting per-test. Mechanical: no behavior change in the mode-routing test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure helpers landing ahead of runPlanSkillCounting:

- parseQuestionPrompt(visible): extract the 1-3 line prompt above the latest "❯ 1." cursor, normalize to a 240-char snippet
- auqFingerprint(prompt, opts): Bun.hash of normalized prompt + sorted options signature; distinct prompts with shared option labels (the generic A/B/C TODO menu) get distinct fingerprints
- COMPLETION_SUMMARY_RE: terminal-signal regex matching all four plan-review skills' completion / verdict markers
- assertReviewReportAtBottom(content): checks "## GSTACK REVIEW REPORT" is present and is the last "## " heading in a plan file
- Step0BoundaryPredicate type + four per-skill predicates (ceo / eng / design / devex): fire on the answered AUQ's fingerprint, marking the end of Step 0 deterministically (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision regression, prompt extraction edge cases, predicate positive AND negative cases, and review-report-at-bottom triple-check (missing / mid-file / multiple trailing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
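
Two of these helpers are simple enough to sketch under stated assumptions. In the version below, a 32-bit FNV-1a hash stands in for Bun.hash so the example also runs outside Bun, the normalization (whitespace collapse plus a 240-character cap) is a guess at what "normalized prompt" means, and the error messages are invented. Only the function names and the "## GSTACK REVIEW REPORT" heading come from the commit message.

```typescript
// Sketch only: hash choice, normalization, and error text are assumptions.

// Stand-in for Bun.hash: 32-bit FNV-1a over UTF-16 code units.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

function normalizePrompt(prompt: string): string {
  return prompt.replace(/\s+/g, " ").trim().slice(0, 240);
}

// The prompt is part of the hashed key, so two findings that share the
// generic A/B/C menu still get different fingerprints.
function auqFingerprint(prompt: string, opts: string[]): string {
  const optionsSignature = [...opts].map((o) => o.trim()).sort().join("|");
  return fnv1a(`${normalizePrompt(prompt)}\n${optionsSignature}`);
}

const REPORT_HEADING = "## GSTACK REVIEW REPORT";

// Throws unless the report heading exists and is the last "## " heading.
function assertReviewReportAtBottom(content: string): void {
  const headings = content
    .split("\n")
    .filter((line) => line.startsWith("## "))
    .map((line) => line.trim());
  if (!headings.includes(REPORT_HEADING)) {
    throw new Error(`missing "${REPORT_HEADING}" heading`);
  }
  if (headings[headings.length - 1] !== REPORT_HEADING) {
    throw new Error(`"${REPORT_HEADING}" is not the last "## " heading`);
  }
}
```

Sorting the options before hashing makes the fingerprint independent of render order, while hashing the prompt is what prevents the option-label collision.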

Drives a plan-* skill end-to-end and counts distinct review-phase AskUserQuestions. Composes the primitives from the previous commit:

- Boot + auto-trust handler (existing launchClaudePty)
- Send slash command alone, sleep 3s, send plan content as follow-up message (proven pattern from skill-e2e-plan-design-with-ui)
- Poll loop with permission-dialog auto-grant, same-redraw skip, empty-prompt re-poll
- Event-based Step-0 boundary via isLastStep0AUQ predicate fired on the answered AUQ's fingerprint (Codex F7: boundary is an observed event, not later rendered content)
- Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE, plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in auqFingerprint's unit tests; fingerprinting them would re-introduce the option-label collision regression Codex F1 caught.

No E2E tests yet; those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
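
The counting rules in this commit (skip empty prompts, skip redraws of a question already seen, start counting only after the Step 0 boundary fires on an answered question) can be illustrated without a PTY by running them over a list of already-observed questions. This is a sketch under assumptions: ObservedAuq, dedupeKey, and countReviewPhaseAuqs are hypothetical names, a plain string key stands in for auqFingerprint, and the isLastStep0AUQ signature used here is a guess.

```typescript
// Sketch only: types, helper names, and the predicate signature are assumptions.
interface ObservedAuq {
  prompt: string;
  options: string[];
}

// Plain string key standing in for auqFingerprint.
function dedupeKey(auq: ObservedAuq): string {
  const prompt = auq.prompt.replace(/\s+/g, " ").trim();
  return `${prompt}\n${[...auq.options].sort().join("|")}`;
}

function countReviewPhaseAuqs(
  answered: ObservedAuq[],
  isLastStep0AUQ: (auq: ObservedAuq) => boolean,
): number {
  const seen = new Set<string>();
  let pastStep0 = false;
  let count = 0;
  for (const auq of answered) {
    // Empty prompt: the real loop re-polls instead of fingerprinting.
    if (auq.prompt.trim() === "") continue;
    // Same question redrawn: already accounted for.
    const key = dedupeKey(auq);
    if (seen.has(key)) continue;
    seen.add(key);
    // Step 0 questions are not review-phase questions; the boundary is the
    // event of answering the last Step 0 question.
    if (!pastStep0) {
      pastStep0 = isLastStep0AUQ(auq);
      continue;
    }
    count += 1;
  }
  return count;
}
```

Skipping empty prompts matters because every finding may share the same option menu: with no prompt text, two different findings would collapse to one key.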

garrytan and others added 4 commits April 28, 2026 00:00

chore: bump version and changelog (v1.21.1.0)

8bbe74f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan and others added 3 commits April 28, 2026 09:05

garrytan wants to merge 7 commits into main from garrytan/pty-plan-mode-e2e
