Skip to content

v1.15.4.0 feat(ship): test-promise audit in Plan Completion (#1070)#1227

Open
gregario wants to merge 3 commits intogarrytan:mainfrom
gregario:feat/1070-ship-test-promise-audit
Open

v1.15.4.0 feat(ship): test-promise audit in Plan Completion (#1070)#1227
gregario wants to merge 3 commits intogarrytan:mainfrom
gregario:feat/1070-ship-test-promise-audit

Conversation

@gregario
Copy link
Copy Markdown
Contributor

Summary

Closes #1070. /plan-eng-review produces a Test review section that becomes TEST-category plan items. /ship's Plan Completion Audit was rolling those into the generic NOT DONE count, so a disciplined review→implement→ship loop could silently drop the test step. The original retro that motivated the issue:

Metric Count Share
Total commits 172
feat: 67 34%
fix: 56 28%
test: 6 3%

Every plan listed specific tests. Most just didn't land.

What this PR does

This is the "small" scope the issue author asked for: pure observability, no new gate. The user already sees NOT DONE items via the existing deferred flow — this PR just makes the test gap explicit instead of hiding it inside an aggregate count.

  1. New "Test-Promise Audit" section in scripts/resolvers/review.ts (generatePlanCompletionAuditInner) instructs the audit subagent to:
    • For each TEST-category plan item, scan git diff origin/<base>...HEAD for evidence
    • Per-language test patterns enumerated inline: JS/TS (*.test.ts, *.spec.ts, __tests__/*), Python (test_*.py, *_test.py), Go (*_test.go), Ruby (*_spec.rb, spec/*), Rust (#[test]/#[cfg(test)]), Java/Kotlin (*Test.java/kt under src/test/), shell (*.bats, *.test.sh)
    • Heuristic match — "Add test for UserService" pairs with new user_service.test.ts or new it('UserService...') block
    • Classify each as landed or missing
  2. Three new fields on the JSON contract: tests_promised, tests_landed, tests_missing (list of item texts).
  3. Parent processing in ship/SKILL.md.tmpl appends one informational line to the Plan Completion summary when tests_missing.length > 0:
    > ⚠ Plan promised N tests, M landed. Missing: ...

Issue author's questions, answered

The issue closed with 4 questions. Answering inline:

  1. Small first or soft gate? Small. Issue author's own gut, and the right call for a behavior change before we have data.
  2. Keyword heuristic or formalize a ## Tests heading? Heuristic. Less invasive, no plan-format change, no breaking change for existing plans. Issue mentions formalizing as a follow-up.
  3. Per-language patterns in bin/ or inline in resolver? Inline. Keeps the diff narrow, no new public API, no new file beyond the regression test.
  4. Missing existing effort? I grepped "test promise", "test gate", "plan completion", "tests promised" — nothing. PR fix(review): add PR type triage for non-application PRs #645 (mine) touches the review skill but is unrelated.

What's NOT in this PR (deferred per issue's "Big" scope)

  • Soft gate via AskUserQuestion when tests_missing > 0
  • Formal ## Tests heading in plan files with JSONL artifact-passing between /plan-eng-review and /ship
  • Adding test_items_promised/test_items_verified to the existing JSONL (schema change)

If the visibility produces measurable behavior change in upcoming retros, the next iteration can add the soft gate.

Test plan

  • bun test test/resolver-test-promise-audit.test.ts — 4 cases pass
    • Ship variant emits Test-Promise Audit section
    • Review variant emits it too (delivery integrity is review's whole job)
    • Item Extraction still produces TEST category (the audit depends on it)
    • Cross-Reference still classifies DONE/PARTIAL/NOT DONE/CHANGED (the audit's landed/missing piggybacks on it)
  • bun run gen:skill-docs — regenerates ship/SKILL.md and review/SKILL.md cleanly

Tests are gate-tier (free). Resolver runs in <200ms. No E2E added — the subagent behavior is hard to test end-to-end without a real plan + diff fixture, and the resolver-shape test catches regressions on the prompt that drives it.

Bisected commits

  1. feat(ship): test-promise audit in Plan Completion (informational) — resolver + template + test
  2. chore(ship,review): regenerate SKILL.md for #1070 resolver change — generated files only

Files

  • scripts/resolvers/review.ts — Test-Promise Audit section (~30 lines)
  • ship/SKILL.md.tmpl — JSON contract extension + parent processing line
  • ship/SKILL.md, review/SKILL.md — regenerated
  • test/resolver-test-promise-audit.test.ts — 4 unit tests pinning the contract
  • VERSION, CHANGELOG.md — release v1.15.4.0

🤖 Generated with Claude Code

gregario and others added 3 commits April 26, 2026 21:00
Closes garrytan#1070.

/plan-eng-review extracts a Test review section that produces TEST-category
plan items. /ship Step 8 already categorizes plan items as
CODE/TEST/MIGRATION/CONFIG/DOCS and tracks DONE/PARTIAL/NOT DONE/CHANGED
against the diff. But TEST items weren't called out specifically — they
just rolled up into a generic NOT DONE count, so a disciplined
review→implement→ship loop could silently drop the test step and nobody
would notice until production.

The original retro that motivated this issue: 172 commits over 7 days,
67 feat: (34%), 56 fix: (28%), and 6 test: (3%). Every plan was reviewed
by /plan-eng-review and listed specific tests. Most just didn't land.

Add a Test-Promise Audit step to the resolver that runs in both ship
and review modes:

  - For each TEST-category item, scan `git diff origin/<base>...HEAD`
    for evidence: new test files (per-language patterns enumerated for
    JS/TS, Python, Go, Ruby, Rust, Java/Kotlin, shell), modified test
    files, new test functions/cases.
  - Classify each as `landed` or `missing`.
  - Aggregate into JSON: `tests_promised`, `tests_landed`, `tests_missing`.

Extend the Step 8 JSON contract with those three fields and add parent
processing that appends one informational line to the PR body's Plan
Completion summary when `tests_missing.length > 0`:

  ⚠ Plan promised N tests, M landed. Missing: …

This is the "small" scope from the issue: pure observability, no new
gate, no JSONL schema change (deferred to follow-up). The user already
sees NOT DONE items via the existing `deferred` flow — the test-gap line
just makes the test gap explicit instead of hiding it inside a generic
NOT DONE count. If the visibility produces measurable behavior change,
the next iteration could add a soft gate (issue mentions this path).

Per-language test patterns are enumerated inline in the resolver rather
than punted to a new bin/ helper — keeps the diff narrow, no new files
beyond the regression test, no new public API.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ange

bun run gen:skill-docs

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ship: verify promised tests landed in diff before push

1 participant