v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) #1255

Open
garrytan wants to merge 7 commits into main from garrytan/pty-plan-mode-e2e

Conversation

@garrytan
Owner

@garrytan garrytan commented Apr 28, 2026

Summary

Tightens the gate-tier /plan-ceo-review plan-mode smoke so the regression where the agent skips Step 0 entirely and goes straight to ExitPlanMode now FAILS the gate. Refactors the PTY harness so the contract is testable in 14ms of free unit tests instead of $0.50 of stochastic PTY.

  • PTY runner refactor: extracts classifyVisible() pure function (silent_write → plan_ready → asked → null). Permission dialogs filtered out of 'asked' so a permission prompt cannot pose as a Step 0 skill question. Adds TAIL_SCAN_BYTES = 1500 shared constant + env? passthrough on runPlanSkillObservation.
  • Tightened smoke: assertion narrows from ['asked', 'plan_ready'] to 'asked' only for plan-ceo (sibling smokes for plan-eng / plan-design / plan-devex stay loose — they legitimately reach plan_ready on certain branches without asking).
  • 24 new unit tests: deterministic coverage of the classifier, permission-dialog co-trigger, env passthrough surface. Run free on every bun test in 14ms.
  • Co-trigger contract: bare Do you want to proceed? no longer triggers permission detection on its own — needs a file-edit context (Edit to <path> / Write to <path>).
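The branch order described above can be sketched as a pure function over the visible terminal text. This is a minimal illustration, not the code in test/helpers/claude-pty-runner.ts: the four detector regexes are hypothetical stand-ins (the PR does not show the real patterns), and only the ordering and the permission-dialog filter are taken from the description.

```typescript
// Sketch of the classifier's branch order: silent_write → plan_ready → asked → null.
// ASSUMPTION: every regex below is a hypothetical placeholder, not the runner's real pattern.
type Classification = "silent_write" | "plan_ready" | "asked" | null;

const isSilentWriteVisible = (v: string): boolean => /\b(?:Write|Edit)\(/.test(v);
const isPlanReadyVisible = (v: string): boolean => /Ready to code\?/i.test(v);
const isNumberedOptionListVisible = (v: string): boolean => /❯\s*1\./.test(v);
// A bare "Do you want to proceed?" only counts with a file-edit context next to it.
const isPermissionDialogVisible = (v: string): boolean =>
  /Do you want to proceed\?/.test(v) && /\b(?:Edit|Write) to \S+/.test(v);

function classifyVisible(visible: string): Classification {
  if (isSilentWriteVisible(visible)) return "silent_write";
  if (isPlanReadyVisible(visible)) return "plan_ready";
  // A numbered list that belongs to a permission dialog is not a skill question.
  if (isNumberedOptionListVisible(visible) && !isPermissionDialogVisible(visible)) {
    return "asked";
  }
  return null;
}
```

Because the function takes a string and returns a label, each branch can be exercised with a synthetic screen and no PTY session.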

Test Coverage

runPlanSkillObservation('plan-ceo-review', plan)
  └── while polling:
      ├── isPlanReadyVisible       → return 'plan_ready'   [GAP→COVERED] FAIL: agent skipped Step 0
      ├── isNumberedOptionListVisible → return 'asked'      [★★★ TESTED] PASS: Step 0 fired
      ├── Write/Edit unsanctioned   → return 'silent_write' [★★  TESTED] FAIL
      ├── proc.exited               → return 'exited'        [★★  TESTED] FAIL
      └── budget elapsed            → return 'timeout'       [★★  TESTED] FAIL

COVERAGE: 5/5 outcomes asserted (was 3/5 — 'plan_ready' previously treated as PASS).
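One way to read the outcome tree is as a per-skill table of passing outcomes. The sketch below assumes the sibling skills are named plan-eng-review, plan-design-review and plan-devex-review and that they keep the loose ['asked', 'plan_ready'] assertion; the table and the verdict helper are hypothetical names for illustration, not part of the runner.

```typescript
// The five outcomes named in the coverage tree.
type Outcome = "asked" | "plan_ready" | "silent_write" | "exited" | "timeout";

// ASSUMPTION: hypothetical table; only the plan-ceo-review row is tightened by this PR.
const PASSING_OUTCOMES: Record<string, readonly Outcome[]> = {
  "plan-ceo-review": ["asked"],
  "plan-eng-review": ["asked", "plan_ready"],
  "plan-design-review": ["asked", "plan_ready"],
  "plan-devex-review": ["asked", "plan_ready"],
};

function verdict(skill: string, outcome: Outcome): "PASS" | "FAIL" {
  const passing = PASSING_OUTCOMES[skill] ?? [];
  return passing.includes(outcome) ? "PASS" : "FAIL";
}
```

With this shape, plan_ready is a FAIL for plan-ceo-review and still a PASS for the three sibling smokes.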

Tests: 0 → 24 unit tests added (test/helpers/claude-pty-runner.unit.test.ts).

Pre-Landing Review

13 informational findings (testing + maintainability specialists), 0 critical. 8 auto-fixed, 5 escalated to user (D5, D7, D8, D9, plus the env-passthrough honesty fix). All resolved.

Adversarial review surfaced one real issue I owe transparency on: the env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' } passthrough is plumbing-only on current code — gstack-config reads ~/.gstack/config.yaml and not env vars. The test docstring now documents this honestly. P3 TODO captured to make gstack-config env-aware so the wiring is real, not advisory.

Plan Completion

D1 (tighten existing) ✓ · D3 (durable section names) ✓ · D4 (env passthrough) ✓ · D5 (classifier filter) ✓ · D6 (deterministic unit test) ✓ · D7 (regex co-trigger) ✓ · D8 (extract classifyVisible) ✓ · D9 (TAIL_SCAN_BYTES constant) ✓. D2 deferred to TODOS as planned (V2 per-finding count assertion).

TODOS

3 captured for v1.21.1.0:

  • P2: per-finding AskUserQuestion count assertion (V2)
  • P3: honor env vars in gstack-config (so QUESTION_TUNING=false actually isolates the test)
  • P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

Test plan

  • 24/24 unit tests pass in 14ms (free, deterministic)
  • PTY gate test passes 4/4 runs (3 lock-in + 1 post-refactor verification, ~$2.00 spent)
  • 339 helper + skill-validation tests pass post-commit
  • No collateral on sibling plan-*-plan-mode smokes (their loose assertion is unchanged)

🤖 Generated with Claude Code

garrytan and others added 4 commits April 28, 2026 00:00
Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:

- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
  gstack-config does not yet honor env overrides; plumbing is in place for a
  future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
  number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
  in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
  requires a file-edit context co-trigger. Other clauses unchanged. Skill
  questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
  plan_ready → asked → null. Permission dialogs filtered out of the
  'asked' classification so a permission prompt cannot pose as a Step 0
  skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.
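A minimal sketch of the co-trigger contract, assuming the file-edit context looks like "Edit to <path>" or "Write to <path>" as the PR summary states. The regex constants and the tailOf helper are hypothetical names, and the detector's other clauses (described only as "unchanged") are omitted because the PR does not show them.

```typescript
// Value from the commit message; the constant is shared by the runner and the routing test.
const TAIL_SCAN_BYTES = 1500;

// ASSUMPTION: hypothetical patterns illustrating the co-trigger only.
const BARE_PROCEED_RE = /Do you want to proceed\?/;
const FILE_EDIT_CONTEXT_RE = /\b(?:Edit|Write) to \S+/;

// Keep only the most recent part of the screen. slice() counts UTF-16 code
// units, so this approximates a byte budget rather than enforcing one.
function tailOf(visible: string): string {
  return visible.slice(-TAIL_SCAN_BYTES);
}

// The bare phrase alone is not enough: a file-edit context must co-occur.
function isPermissionDialogVisible(visible: string): boolean {
  return BARE_PROCEED_RE.test(visible) && FILE_EDIT_CONTEXT_RE.test(visible);
}
```

A skill question that happens to end with "Do you want to proceed?" no longer matches, while a real edit prompt still does.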

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.
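The per-outcome failure message might look like the sketch below. The wording of each diagnosis line is invented for illustration; only the branching on plan_ready, timeout and silent_write and the two quoted section names come from the commit message.

```typescript
type FailingOutcome = "plan_ready" | "timeout" | "silent_write" | "exited";

// ASSUMPTION: hypothetical helper and message text. Template sections are
// referenced by name, not by line number, so the message survives template edits.
function diagnose(outcome: FailingOutcome): string {
  switch (outcome) {
    case "plan_ready":
      return 'Agent reached plan_ready without asking. Check "Step 0 STOP rules" in the skill template.';
    case "timeout":
      return 'No skill question appeared within the budget. Check "One issue = one AskUserQuestion call".';
    case "silent_write":
      return "Agent wrote or edited a file before asking any Step 0 question.";
    default:
      return `Unexpected outcome: ${outcome}`;
  }
}
```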

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.
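The passthrough itself amounts to merging caller overrides over the base environment before the child process is launched. The helper below is a hypothetical sketch of that merge (buildChildEnv is not a name from the PR). As stated above, gstack-config does not read these variables today, so the merged values reach the child without changing its behavior.

```typescript
type Env = Record<string, string | undefined>;

// ASSUMPTION: hypothetical helper. Overrides win over the base environment,
// and the base object is left untouched.
function buildChildEnv(base: Env, overrides?: Record<string, string>): Env {
  return { ...base, ...(overrides ?? {}) };
}

// Usage with the values the smoke test passes through the runner.
const childEnv = buildChildEnv(
  { PATH: "/usr/bin", QUESTION_TUNING: "true" },
  { QUESTION_TUNING: "false", EXPLAIN_LEVEL: "default" },
);
```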

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

E2E Evals: ✅ PASS

0/0 tests passed | $0 total cost | 12 parallel runners

Suite | Result | Status | Cost

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

garrytan added a commit that referenced this pull request Apr 28, 2026
Version-gate workflow rejected v1.20.0.0 because the queue moved during
the windows-free-tests fix loop:

  v1.16.0.0 → garrytan/gbrowser-unleashed (PR #1253)  [new since last bump]
  v1.17.0.0 → garrytan/setup-gbrain-run    (PR #1234)
  v1.19.0.0 → garrytan/browserharness       (PR #1233)
  v1.21.1.0 → garrytan/pty-plan-mode-e2e    (PR #1255)  [new since last bump]

Two new sibling PRs landed slot claims while we iterated on Windows.
Next free MINOR slot is v1.22.0.0.
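The slot arithmetic can be sketched as follows, under the assumption that the next free MINOR slot is one above the highest claimed MAJOR.MINOR pair. The actual rule enforced by the version-gate workflow is not shown in this PR, so both function names and the rule itself are hypothetical.

```typescript
// Parse "v1.21.1.0" into [1, 21, 1, 0].
function parseVersion(version: string): number[] {
  return version.replace(/^v/, "").split(".").map(Number);
}

// ASSUMPTION: "next free MINOR" means highest claimed MAJOR.MINOR, plus one MINOR.
function nextFreeMinor(claimed: string[]): string {
  let major = 0;
  let minor = -1;
  for (const version of claimed) {
    const parts = parseVersion(version);
    const maj = parts[0] ?? 0;
    const min = parts[1] ?? 0;
    if (maj > major || (maj === major && min > minor)) {
      major = maj;
      minor = min;
    }
  }
  return `v${major}.${minor + 1}.0.0`;
}
```

Applied to the four claims listed above, the highest claimed pair is 1.21, so the sketch yields v1.22.0.0, which matches the commit message.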

Updated VERSION, package.json, CHANGELOG header + body. Also pushing the
round-13 windows-fix in parallel (test invokes bin/gstack-paths via bash
to handle Windows shebang).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan and others added 3 commits April 28, 2026 09:05
Refactor prep for the upcoming per-finding AskUserQuestion count test
across plan-{ceo,eng,design,devex}-review. Both new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe — pulling them into one source of truth in
test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the
fingerprint shape) updates everywhere instead of drifting per-test.

Mechanical: no behavior change in the mode-routing test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pure helpers landing ahead of runPlanSkillCounting:

  - parseQuestionPrompt(visible) — extract the 1-3 line prompt above
    the latest "❯ 1." cursor, normalize to a 240-char snippet
  - auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
    options signature; distinct prompts with shared option labels
    (the generic A/B/C TODO menu) get distinct fingerprints
  - COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
    plan-review skills' completion / verdict markers
  - assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
    REPORT" is present and is the last "## " heading in a plan file
  - Step0BoundaryPredicate type + four per-skill predicates
    (ceo / eng / design / devex) — fire on the answered AUQ's
    fingerprint, marking the end of Step 0 deterministically
    (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).
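Two of these helpers can be sketched from the descriptions alone. The commit says auqFingerprint uses Bun.hash; the sketch substitutes a small FNV-1a hash so it runs outside Bun, and the normalization rules (whitespace collapse, lowercasing) are assumptions. The review-report check follows the stated contract literally: the heading must exist and must be the last "## " heading.

```typescript
// Portable stand-in for Bun.hash (ASSUMPTION: the real helper uses Bun.hash).
function fnv1a(input: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// ASSUMPTION: hypothetical normalization; the 240-char cap mirrors the prompt snippet length.
const normalize = (s: string): string =>
  s.replace(/\s+/g, " ").trim().toLowerCase().slice(0, 240);

// The prompt is part of the hash input, so two questions that share a generic
// A/B/C menu still get different fingerprints.
function auqFingerprint(prompt: string, options: string[]): string {
  const optionsSignature = options.map(normalize).sort().join("|");
  return fnv1a(`${normalize(prompt)}\n${optionsSignature}`);
}

const REPORT_HEADING = "## GSTACK REVIEW REPORT";

// Throws unless the report heading is present and is the last "## " heading.
function assertReviewReportAtBottom(content: string): void {
  const headings = content.split("\n").filter((line) => line.startsWith("## "));
  if (!headings.some((h) => h.trim() === REPORT_HEADING)) {
    throw new Error("review report heading is missing");
  }
  const last = headings[headings.length - 1] ?? "";
  if (last.trim() !== REPORT_HEADING) {
    throw new Error("review report heading is not the last '## ' heading");
  }
}
```

Sorting the option labels makes the fingerprint independent of option order, which is one plausible reading of "sorted options signature".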

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:

  - Boot + auto-trust handler (existing launchClaudePty)
  - Send slash command alone, sleep 3s, send plan content as follow-up
    message (proven pattern from skill-e2e-plan-design-with-ui)
  - Poll loop with permission-dialog auto-grant, same-redraw skip,
    empty-prompt re-poll
  - Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
    the answered AUQ's fingerprint (Codex F7 — boundary is observed
    event, not later rendered content)
  - Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
    plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.
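The counting rules above can be illustrated on a recorded sequence of observations instead of a live PTY. This is a sketch under assumptions: the AuqObservation shape and countReviewPhaseAuqs are hypothetical names, and the real loop also answers prompts and handles the terminal signals, which are left out here.

```typescript
// One polled observation of an AskUserQuestion on screen (hypothetical shape).
interface AuqObservation {
  prompt: string;
  fingerprint: string;
}

// Counts distinct AUQs seen after the Step 0 boundary event.
function countReviewPhaseAuqs(
  observations: AuqObservation[],
  isLastStep0AUQ: (fingerprint: string) => boolean,
): number {
  const answered = new Set<string>();
  let pastStep0 = false;
  let reviewCount = 0;
  for (const obs of observations) {
    if (obs.prompt.trim() === "") continue; // empty prompt: skip and re-poll
    if (answered.has(obs.fingerprint)) continue; // same question redrawn
    answered.add(obs.fingerprint);
    if (!pastStep0) {
      // The boundary is the answered AUQ itself, not content rendered later.
      if (isLastStep0AUQ(obs.fingerprint)) pastStep0 = true;
      continue;
    }
    reviewCount += 1;
  }
  return reviewCount;
}
```

Tracking every answered fingerprint, including the boundary question, keeps a redraw of the last Step 0 prompt from being counted as a review-phase question.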

No E2E tests yet — those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>