v1.15.4.0 feat(ship): test-promise audit in Plan Completion (#1070) #1227
Open
gregario wants to merge 3 commits into garrytan:main
Closes garrytan#1070.

/plan-eng-review extracts a Test review section that produces TEST-category plan items. /ship Step 8 already categorizes plan items as CODE/TEST/MIGRATION/CONFIG/DOCS and tracks DONE/PARTIAL/NOT DONE/CHANGED against the diff. But TEST items weren't called out specifically; they just rolled up into a generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step and nobody would notice until production.

The original retro that motivated this issue: 172 commits over 7 days, 67 feat: (34%), 56 fix: (28%), and 6 test: (3%). Every plan was reviewed by /plan-eng-review and listed specific tests. Most just didn't land.

Add a Test-Promise Audit step to the resolver that runs in both ship and review modes:
- For each TEST-category item, scan `git diff origin/<base>...HEAD` for evidence: new test files (per-language patterns enumerated for JS/TS, Python, Go, Ruby, Rust, Java/Kotlin, shell), modified test files, new test functions/cases.
- Classify each as `landed` or `missing`.
- Aggregate into JSON: `tests_promised`, `tests_landed`, `tests_missing`.

Extend the Step 8 JSON contract with those three fields and add parent processing that appends one informational line to the PR body's Plan Completion summary when `tests_missing.length > 0`:

⚠ Plan promised N tests, M landed. Missing: …

This is the "small" scope from the issue: pure observability, no new gate, no JSONL schema change (deferred to follow-up). The user already sees NOT DONE items via the existing `deferred` flow; the test-gap line just makes the test gap explicit instead of hiding it inside a generic NOT DONE count. If the visibility produces measurable behavior change, the next iteration could add a soft gate (the issue mentions this path).

Per-language test patterns are enumerated inline in the resolver rather than punted to a new bin/ helper. This keeps the diff narrow: no new files beyond the regression test, no new public API.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ange bun run gen:skill-docs Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Closes #1070.
`/plan-eng-review` produces a Test review section that becomes TEST-category plan items. `/ship`'s Plan Completion Audit was rolling those into the generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step.

The original retro that motivated the issue covered 172 commits over 7 days: 67 `feat:` (34%), 56 `fix:` (28%), and 6 `test:` (3%). Every plan listed specific tests. Most just didn't land.
What this PR does
This is the "small" scope the issue author asked for: pure observability, no new gate. The user already sees NOT DONE items via the existing `deferred` flow; this PR just makes the test gap explicit instead of hiding it inside an aggregate count.

`scripts/resolvers/review.ts` (`generatePlanCompletionAuditInner`) instructs the audit subagent to:
- Scan `git diff origin/<base>...HEAD` for evidence
- Recognize per-language test patterns: JS/TS (`*.test.ts`, `*.spec.ts`, `__tests__/*`), Python (`test_*.py`, `*_test.py`), Go (`*_test.go`), Ruby (`*_spec.rb`, `spec/*`), Rust (`#[test]` / `#[cfg(test)]`), Java/Kotlin (`*Test.java/kt` under `src/test/`), shell (`*.bats`, `*.test.sh`)
- Look for evidence such as a new `user_service.test.ts` or a new `it('UserService...')` block
- Classify each item as `landed` or `missing`
- Aggregate into JSON: `tests_promised`, `tests_landed`, `tests_missing` (list of item texts).

`ship/SKILL.md.tmpl` appends one informational line to the Plan Completion summary when `tests_missing.length > 0`:

> ⚠ Plan promised N tests, M landed. Missing: ...

Issue author's questions, answered
The issue closed with 4 questions. Answering inline:
- `## Tests` heading? Heuristic. Less invasive, no plan-format change, no breaking change for existing plans. Issue mentions formalizing as a follow-up.
- `bin/` or inline in resolver? Inline. Keeps the diff narrow, no new public API, no new file beyond the regression test.

What's NOT in this PR (deferred per issue's "Big" scope)
- `## Tests` heading in plan files with JSONL artifact-passing between `/plan-eng-review` and `/ship`
- Adding `test_items_promised` / `test_items_verified` to the existing JSONL (schema change)

If the visibility produces measurable behavior change in upcoming retros, the next iteration can add the soft gate.
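For orientation, the informational line described under "What this PR does" could be derived from the three JSON fields roughly as follows. This is a minimal sketch, not the PR's implementation: the `TestPromiseAudit` type and `testGapLine` function are hypothetical names, treating `tests_promised` and `tests_landed` as counts is an assumption, and so is the `; ` separator used to join the missing items.

```typescript
// Hypothetical shape: only the three field names come from the PR description.
// Treating the first two as counts is an assumption.
interface TestPromiseAudit {
  tests_promised: number;
  tests_landed: number;
  tests_missing: string[]; // item texts of promised tests with no evidence in the diff
}

// Returns the informational line, or null when no promised test is missing.
// Purely observational: nothing here blocks or gates the ship flow.
function testGapLine(audit: TestPromiseAudit): string | null {
  if (audit.tests_missing.length === 0) return null;
  const missing = audit.tests_missing.join("; "); // separator is an assumption
  return `⚠ Plan promised ${audit.tests_promised} tests, ${audit.tests_landed} landed. Missing: ${missing}`;
}
```

Returning `null` when nothing is missing mirrors the `tests_missing.length > 0` condition: a plan whose promised tests all landed adds no line to the summary.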
Test plan
- `bun test test/resolver-test-promise-audit.test.ts`: 4 cases pass
- … `landed`/`missing` piggybacks on it)
- `bun run gen:skill-docs`: regenerates `ship/SKILL.md` and `review/SKILL.md` cleanly

Tests are gate-tier (free). Resolver runs in <200ms. No E2E added: the subagent behavior is hard to test end-to-end without a real plan + diff fixture, and the resolver-shape test catches regressions on the prompt that drives it.
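In this PR the per-language patterns are prompt instructions for the audit subagent, not code. To make concrete what the filename patterns listed above would match (for example, when building a diff fixture for a future E2E test), they can be written as regular expressions. The names `TEST_FILE_PATTERNS` and `matchTestLanguage` are hypothetical, and the exact regex translations of the globs are an assumption.

```typescript
// Filename patterns from the PR description, translated to regular expressions.
// Rust is omitted: its evidence (#[test] / #[cfg(test)]) lives in file content,
// not in the filename.
const TEST_FILE_PATTERNS: Record<string, RegExp[]> = {
  "JS/TS": [/\.test\.ts$/, /\.spec\.ts$/, /(^|\/)__tests__\//],
  Python: [/(^|\/)test_[^/]*\.py$/, /_test\.py$/],
  Go: [/_test\.go$/],
  Ruby: [/_spec\.rb$/, /(^|\/)spec\//],
  "Java/Kotlin": [/(^|\/)src\/test\/.*Test\.(java|kt)$/],
  shell: [/\.bats$/, /\.test\.sh$/],
};

// Returns the first language whose pattern matches the path, or null if the
// path does not look like a test file under any listed convention.
function matchTestLanguage(path: string): string | null {
  for (const [lang, patterns] of Object.entries(TEST_FILE_PATTERNS)) {
    if (patterns.some((re) => re.test(path))) return lang;
  }
  return null;
}
```

A path match only shows that a test file changed; deciding whether it satisfies a specific plan item still needs the item-to-evidence judgment the subagent performs.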
Bisected commits
- `feat(ship): test-promise audit in Plan Completion (informational)`: resolver + template + test
- `chore(ship,review): regenerate SKILL.md for #1070 resolver change`: generated files only

Files
- `scripts/resolvers/review.ts`: Test-Promise Audit section (~30 lines)
- `ship/SKILL.md.tmpl`: JSON contract extension + parent processing line
- `ship/SKILL.md`, `review/SKILL.md`: regenerated
- `test/resolver-test-promise-audit.test.ts`: 4 unit tests pinning the contract
- `VERSION`, `CHANGELOG.md`: release v1.15.4.0

🤖 Generated with Claude Code