v1.15.4.0 feat(ship): test-promise audit in Plan Completion (#1070) by gregario · Pull Request #1227 · garrytan/gstack

gregario · 2026-04-26T20:01:46Z

Summary

Closes #1070. /plan-eng-review produces a Test review section that becomes TEST-category plan items. /ship's Plan Completion Audit was rolling those into the generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step. The original retro that motivated the issue:

Metric	Count	Share
Total commits	172	—
`feat:`	67	34%
`fix:`	56	28%
`test:`	6	3%

Every plan listed specific tests. Most just didn't land.

What this PR does

This is the "small" scope the issue author asked for: pure observability, no new gate. The user already sees NOT DONE items via the existing deferred flow — this PR just makes the test gap explicit instead of hiding it inside an aggregate count.

New "Test-Promise Audit" section in scripts/resolvers/review.ts (generatePlanCompletionAuditInner) instructs the audit subagent to:
- For each TEST-category plan item, scan git diff origin/<base>...HEAD for evidence
- Per-language test patterns enumerated inline: JS/TS (*.test.ts, *.spec.ts, __tests__/*), Python (test_*.py, *_test.py), Go (*_test.go), Ruby (*_spec.rb, spec/*), Rust (#[test]/#[cfg(test)]), Java/Kotlin (*Test.java/kt under src/test/), shell (*.bats, *.test.sh)
- Heuristic match — "Add test for UserService" pairs with new user_service.test.ts or new it('UserService...') block
- Classify each as landed or missing
Three new fields on the JSON contract: tests_promised, tests_landed, tests_missing (list of item texts).
Parent processing in ship/SKILL.md.tmpl appends one informational line to the Plan Completion summary when tests_missing.length > 0:
> ⚠ Plan promised N tests, M landed. Missing: ...

Issue author's questions, answered

The issue closed with 4 questions. Answering inline:

Small first or soft gate? Small. Issue author's own gut, and the right call for a behavior change before we have data.
Keyword heuristic or formalize a ## Tests heading? Heuristic. Less invasive, no plan-format change, no breaking change for existing plans. Issue mentions formalizing as a follow-up.
Per-language patterns in bin/ or inline in resolver? Inline. Keeps the diff narrow, no new public API, no new file beyond the regression test.
Missing existing effort? I grepped "test promise", "test gate", "plan completion", "tests promised" — nothing. PR fix(review): add PR type triage for non-application PRs #645 (mine) touches the review skill but is unrelated.

What's NOT in this PR (deferred per issue's "Big" scope)

Soft gate via AskUserQuestion when tests_missing > 0
Formal ## Tests heading in plan files with JSONL artifact-passing between /plan-eng-review and /ship
Adding test_items_promised/test_items_verified to the existing JSONL (schema change)

If the visibility produces measurable behavior change in upcoming retros, the next iteration can add the soft gate.

Test plan

bun test test/resolver-test-promise-audit.test.ts — 4 cases pass
- Ship variant emits Test-Promise Audit section
- Review variant emits it too (delivery integrity is review's whole job)
- Item Extraction still produces TEST category (the audit depends on it)
- Cross-Reference still classifies DONE/PARTIAL/NOT DONE/CHANGED (the audit's landed/missing piggybacks on it)
bun run gen:skill-docs — regenerates ship/SKILL.md and review/SKILL.md cleanly

Tests are gate-tier (free). Resolver runs in <200ms. No E2E added — the subagent behavior is hard to test end-to-end without a real plan + diff fixture, and the resolver-shape test catches regressions on the prompt that drives it.

Bisected commits

feat(ship): test-promise audit in Plan Completion (informational) — resolver + template + test
chore(ship,review): regenerate SKILL.md for #1070 resolver change — generated files only

Files

scripts/resolvers/review.ts — Test-Promise Audit section (~30 lines)
ship/SKILL.md.tmpl — JSON contract extension + parent processing line
ship/SKILL.md, review/SKILL.md — regenerated
test/resolver-test-promise-audit.test.ts — 4 unit tests pinning the contract
VERSION, CHANGELOG.md — release v1.15.4.0

🤖 Generated with Claude Code

Closes garrytan#1070. /plan-eng-review extracts a Test review section that produces TEST-category plan items. /ship Step 8 already categorizes plan items as CODE/TEST/MIGRATION/CONFIG/DOCS and tracks DONE/PARTIAL/NOT DONE/CHANGED against the diff. But TEST items weren't called out specifically — they just rolled up into a generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step and nobody would notice until production. The original retro that motivated this issue: 172 commits over 7 days, 67 feat: (34%), 56 fix: (28%), and 6 test: (3%). Every plan was reviewed by /plan-eng-review and listed specific tests. Most just didn't land. Add a Test-Promise Audit step to the resolver that runs in both ship and review modes: - For each TEST-category item, scan `git diff origin/<base>...HEAD` for evidence: new test files (per-language patterns enumerated for JS/TS, Python, Go, Ruby, Rust, Java/Kotlin, shell), modified test files, new test functions/cases. - Classify each as `landed` or `missing`. - Aggregate into JSON: `tests_promised`, `tests_landed`, `tests_missing`. Extend the Step 8 JSON contract with those three fields and add parent processing that appends one informational line to the PR body's Plan Completion summary when `tests_missing.length > 0`: ⚠ Plan promised N tests, M landed. Missing: … This is the "small" scope from the issue: pure observability, no new gate, no JSONL schema change (deferred to follow-up). The user already sees NOT DONE items via the existing `deferred` flow — the test-gap line just makes the test gap explicit instead of hiding it inside a generic NOT DONE count. If the visibility produces measurable behavior change, the next iteration could add a soft gate (issue mentions this path). Per-language test patterns are enumerated inline in the resolver rather than punted to a new bin/ helper — keeps the diff narrow, no new files beyond the regression test, no new public API. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ange bun run gen:skill-docs Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gregario and others added 3 commits April 26, 2026 21:00

chore(ship,review): regenerate SKILL.md for garrytan#1070 resolver ch…

7991b20

…ange bun run gen:skill-docs Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chore: release v1.15.4.0 (ship test-promise audit)

8742a81

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.15.4.0 feat(ship): test-promise audit in Plan Completion (#1070)#1227

v1.15.4.0 feat(ship): test-promise audit in Plan Completion (#1070)#1227
gregario wants to merge 3 commits intogarrytan:mainfrom
gregario:feat/1070-ship-test-promise-audit

gregario commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

gregario commented Apr 26, 2026

Summary

What this PR does

Issue author's questions, answered

What's NOT in this PR (deferred per issue's "Big" scope)

Test plan

Bisected commits

Files

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant