Tighten acceptance evidence discipline across agents by shinpr · Pull Request #128 · shinpr/claude-code-workflows

shinpr · 2026-05-26T13:57:50Z

Summary

Introduce acceptance evidence discipline so test results cited as evidence reflect real behavior — not zero-match selectors, skipped tests, TODO/placeholder bodies, or always-passing assertions. The rule is scoped to test evidence so non-test verification (build, typecheck, CLI) and refactor tasks without test evidence are unaffected.
Extend the task-executor frontend with sections that previously existed only on the backend variant: Test Environment Check, Reference Representativeness (with a concrete "feature area" definition), and the four-bullet Iron Rule. Add Dependency Version Uncertain (2-4) and Test Environment Not Ready (2-7) escalation schemas, and register test_environment_not_ready in the orchestration guide's allowed escalation_type list so the workflow contract stays consistent.
Compress prompt files by ~280 lines without losing behavior-binding signals: collapse repeated preambles, deadwood "Specific Boundary" recaps, illustrative bullet lists, verbose JSON examples, and ENFORCEMENT boilerplate; rewrite negative-form instructions in positive form; restore anchor values and disambiguation phrasing where the first pass over-compressed.
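The evidence rule in the first summary point can be read as a small gate. The sketch below is illustrative only: the PR names a `runnableCheck.result` field but does not show its shape, so the `RunnableCheckResult` class, its field names, and the `kind` labels are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunnableCheckResult:
    """Hypothetical shape; the PR only names a `runnableCheck.result` field."""
    kind: str                       # assumed labels: "test", "build", "typecheck", "cli"
    matched: int = 0                # tests selected by the run
    skipped: int = 0                # tests skipped among those selected
    placeholder_only: bool = False  # test bodies are TODO/placeholder
    always_passing: bool = False    # assertions cannot fail, e.g. `assert True`


def evidence_problem(result: RunnableCheckResult) -> Optional[str]:
    """Return why a cited result is not substantive, or None if it is acceptable."""
    if result.kind != "test":
        return None  # non-test verification is outside the rule's scope
    if result.matched == 0:
        return "zero-match selector"
    if result.skipped >= result.matched:
        return "all matched tests skipped"
    if result.placeholder_only:
        return "TODO/placeholder body"
    if result.always_passing:
        return "always-passing assertion"
    return None
```

A build or typecheck result passes through untouched, which mirrors the scoping described above; only results of kind `"test"` are inspected.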

What changes per agent

`task-executor.md` / `task-executor-frontend.md`

Add runnableCheck.result field spec and an Exit Gate item that count only test runs cited as evidence; non-test verification is exempt.
Add Test Environment Check (frontend), Reference Representativeness (frontend), and expand the Iron Rule to four bullets (frontend). Both files now share the same structural sections, while domain vocabulary stays specific to their stack.
Add Escalation Response 2-4 Dependency Version Uncertain (frontend) and 2-7 Test Environment Not Ready (both files), with structured missingComponent + description fields so the orchestrator can parse the cause.
Compress: collapse File Scope Constraint to direct rules, fold Similar Function/Component Duplication Check indicators into one inline list with explicit "Exactly the pair (a+c) or (b+c) → Escalation" thresholds, drop deadwood preambles and the redundant Execution Principles footer.
Add Change Category-style stop conditions to the Iron Rule's escalation paths in positive form: "Route any new library/pattern decision through Escalation Response 2-4" instead of "Do not introduce a new library or pattern".
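The two-indicator threshold in the Duplication Check ("Exactly the pair (a+c) or (b+c) → Escalation", with any other two-indicator combination continuing) could be expressed roughly as follows. This is a sketch under assumptions: the indicators are referred to only by their letters, the function name is hypothetical, and the PR text does not say what happens for zero, one, or three matched indicators, so those cases return `None` here.

```python
from typing import Iterable, Optional

# The only two-indicator pairs the PR text says must escalate.
ESCALATING_PAIRS = {frozenset("ac"), frozenset("bc")}


def two_indicator_decision(indicators: Iterable[str]) -> Optional[str]:
    """Decide the outcome when exactly two indicators matched.

    Returns "escalate" or "continue" for the two-indicator case, and None
    for any other count, which this PR description does not specify.
    """
    matched = frozenset(indicators)
    if len(matched) != 2:
        return None
    return "escalate" if matched in ESCALATING_PAIRS else "continue"
```

Spelling the pairs out as a set makes the disambiguation explicit: a reader cannot mistake "2 indicators" for "any 2 indicators".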

`quality-fixer.md` / `quality-fixer-frontend.md`

Add a conditional Substance check in Step 3 and an approved criterion that only apply when a test run is cited as evidence for the task's intended behavior.
Split the blocked JSON example into two per-variant blocks (specification conflict / missing prerequisites) so the LLM does not produce hybrid responses, and keep concrete sample values as anchors ("500 error" / "Button disabled" etc.).
Frontend: rewrite Frontend-Specific Quality Criteria to be repository-local (prefer dominant local patterns, mock at network/API boundaries, allow browser primitive doubles, exercise the component under test through real renders). The "new library/pattern" rule routes through the agent's actual blocked output contract.
Compress: drop low-binding Important Principles, generic Debugging Hints, the Fix Continuation Determination Conditions duplicate of Step 5, and a 17-line mermaid diagram that restated text rules.

`integration-test-reviewer.md`

Add a substantive-assertion criterion to Quality Assessment and a Hollow or Placeholder Assertion entry to Common Issues, limited to text-inspectable patterns.
Tighten the JSON schema example while keeping meaningful placeholder text.

`code-reviewer.md`

Extend Test Coverage for Acceptance Criteria to flag tests that grep finds but do not exercise the AC (skipped, TODO-only, or always-true assertions), with explicit examples and an absence-verification clarification.
Compress the schema and example sections; drop the generic Special Considerations block.
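To make "text-inspectable patterns" concrete, here is a rough scanner for the three cases named above (skipped, TODO-only, always-true). It is not taken from the agent prompts: the regexes assume Jest/Vitest-style test files, the function name is hypothetical, and the TODO check is deliberately coarse.

```python
import re
from typing import List

# Assumed Jest/Vitest-style syntax; other frameworks would need other patterns.
HOLLOW_PATTERNS = {
    "skipped": re.compile(r"\b(?:it|test|describe)\.skip\s*\(|\bx(?:it|test)\s*\("),
    # Coarse: flags any TODO marker; a real check would also confirm that the
    # test body contains no assertion.
    "todo-only": re.compile(r"\b(?:it|test)\.todo\s*\(|//\s*TODO"),
    "always-true": re.compile(r"expect\(\s*true\s*\)\.toBe\(\s*true\s*\)"),
}


def hollow_findings(test_source: str) -> List[str]:
    """Return the labels of hollow-test patterns found in the given source text."""
    return [label for label, pattern in HOLLOW_PATTERNS.items()
            if pattern.search(test_source)]
```

A test that asserts intentional absence (for example, that an element is not rendered) matches none of these patterns, which is consistent with the absence-verification clarification.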

`skills/subagents-orchestration-guide/SKILL.md`

Add test_environment_not_ready to the task-executor escalation_type list so the new typed escalation has a documented downstream handler.
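As an illustration of the contract, the sketch below builds a Test Environment Not Ready escalation and checks it against an allowed list. Only `escalation_type`, `test_environment_not_ready`, `missingComponent`, and `description` come from this PR; the function names and the second list entry are hypothetical stand-ins, and the real schema may carry more fields.

```python
# Only "test_environment_not_ready" is named in this PR; the other entry is a
# hypothetical placeholder for the types the guide already lists.
ALLOWED_ESCALATION_TYPES = {"test_environment_not_ready", "example_existing_type"}


def build_test_env_escalation(missing_component: str, description: str) -> dict:
    """Build the structured payload an executor might return (assumed shape)."""
    return {
        "escalation_type": "test_environment_not_ready",
        "missingComponent": missing_component,
        "description": description,
    }


def is_recognized(payload: dict) -> bool:
    """Orchestrator-side check: is the escalation type in the allowed list?"""
    return payload.get("escalation_type") in ALLOWED_ESCALATION_TYPES
```

If the type were emitted by the agents but missing from the allowed list, `is_recognized` would return `False`, which is the inconsistency this registration avoids.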

Line-count impact

| File | Before | After | Δ |
| --- | ---: | ---: | ---: |
| `agents/task-executor.md` | 444 | 406 | -38 |
| `agents/task-executor-frontend.md` | 395 | 411 | +16 |
| `agents/quality-fixer.md` | 330 | 230 | -100 |
| `agents/quality-fixer-frontend.md` | 440 | 328 | -112 |
| `agents/integration-test-reviewer.md` | 146 | 137 | -9 |
| `agents/code-reviewer.md` | 275 | 243 | -32 |
| agents/ total | 2030 | 1755 | -275 |

All target files now sit below 450 lines. quality-fixer (230) and quality-fixer-frontend (328) are well under the 400-line target, while task-executor (406) and task-executor-frontend (411) remain slightly above it.
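The totals can be re-derived from the per-file numbers in the table. The snippet is only a cross-check; the figures are copied from the table, and the variable names are illustrative.

```python
# (before, after) line counts per file, copied from the table above.
LINE_COUNTS = {
    "agents/task-executor.md": (444, 406),
    "agents/task-executor-frontend.md": (395, 411),
    "agents/quality-fixer.md": (330, 230),
    "agents/quality-fixer-frontend.md": (440, 328),
    "agents/integration-test-reviewer.md": (146, 137),
    "agents/code-reviewer.md": (275, 243),
}

total_before = sum(before for before, _ in LINE_COUNTS.values())
total_after = sum(after for _, after in LINE_COUNTS.values())
delta = total_after - total_before

# Files whose post-change length is still above 400 lines.
over_400 = sorted(name for name, (_, after) in LINE_COUNTS.items() if after > 400)

print(total_before, total_after, delta)  # 2030 1755 -275
print(over_400)
```

The net change of -275 lines matches the table's total row and is what the summary rounds to "~280 lines".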

Test plan

`node scripts/sync-plugins.mjs` regenerates the three plugin directories cleanly (handled by the pre-commit hook).
`claude plugin validate` passes for `marketplace.json` and each plugin manifest (handled by the pre-commit hook).
`node scripts/check-skills-index.mjs` reports all 11 skills consistent.
Run a sample task through task-executor and confirm the new Exit Gate item rejects zero-match / skipped / placeholder evidence while leaving non-test verification alone.
Run a sample frontend refactor task (no test evidence cited) and confirm the conditional substance check in quality-fixer-frontend does not falsely block it.
Trigger a missing-test-environment condition and confirm the agent returns `escalation_type: "test_environment_not_ready"` and that the orchestration guide recognizes it.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add substance requirements so test results cited as evidence reflect real behavior rather than zero-match selectors, skipped tests, TODO/placeholder bodies, or always-passing assertions. Scope the rule to test evidence so non-test verification (build, typecheck, CLI) and refactor tasks without test evidence are unaffected. Include concrete examples and clarify that intentional absence verification counts as substantive.

- task-executor / task-executor-frontend: add runnableCheck.result field spec and an Exit Gate item scoped to test evidence.
- quality-fixer / quality-fixer-frontend: add a conditional substance check in Step 3 and an approved criterion that applies only when test runs are cited as evidence.
- integration-test-reviewer: add a substantive-assertion criterion to Quality Assessment and a Hollow or Placeholder Assertion entry to Common Issues, limited to text-inspectable patterns.
- code-reviewer: extend Test Coverage for Acceptance Criteria to flag tests that grep finds but do not exercise the AC.

Bump version to 0.20.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-ups on top of the acceptance evidence discipline work:

Cross-platform parity (backend → frontend task-executor):
- Add Test Environment Check with project-configured toolchain detection, scoped to test toolchains the changed behavior depends on.
- Add Reference Representativeness with repository-local choice discipline and "feature area" definition (surrounding feature folder, or the nearest parent directory with siblings using the same concern).
- Expand Iron Rule from one line back to four bullets with frontend-flavored examples (Props shape, component placement, state location).

New typed escalations:
- Add Escalation Response 2-4 (Dependency Version Uncertain) to the frontend task-executor so the new Reference Representativeness reference is no longer dangling.
- Add Escalation Response 2-7 (Test Environment Not Ready) to both backend and frontend task-executor, and add test_environment_not_ready to the orchestration guide's allowed escalation_type list so the contract is consistent across the workflow.

Prompt compression (~280 lines removed across agents/) without losing behavior-binding signals:
- task-executor / task-executor-frontend: collapse File Scope Constraint, Similar Function/Component Duplication Check, Specific Utilization, Pre-implementation Verification, and Operation Verification while keeping Mandatory Judgment Criteria, Iron Rule, BLOCKING gates, and all escalation schemas intact.
- quality-fixer / quality-fixer-frontend: replace verbose JSON examples with compact one-liners and the Intermediate Progress Report emoji template with a single prose line; restore the blocked response as two separate per-variant JSON blocks with anchor sample values to keep LLM field-filling reliable.
- integration-test-reviewer / code-reviewer: tighten JSON examples and drop Special Considerations (low-binding generic guidance).

Prompt-quality polish:
- Reword negative-form instructions into positive form ("Route any new library/pattern decision through ..." instead of "Do not introduce a new library or pattern ...").
- Frame quality-fixer-frontend's "new library/pattern" rule against the agent's actual blocked output contract rather than a non-existent escalation path.
- Disambiguate the Step 3 "2 indicators" rule to "Exactly the pair (a+c) or (b+c) → Escalation; any other 2-indicator combination → Continue".
- Restore concrete placeholder examples ("500 error" / "Button disabled" etc.) and a 1-line "placeholder-only or empty Investigation Targets do not trigger this gate" note that the compression had dropped.
- Restore frontend Test Environment Check to the four-line Before/Check method/Available/Unavailable structure for parity with the backend agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sweep the touched agents for behavior-directing negative-form instructions and rewrite them positively while keeping line counts unchanged.

- task-executor / task-executor-frontend: restate the Step 2 Completion Gate trigger as a single positive condition ("triggers only when the Investigation Targets section lists at least one concrete file path") in place of the "do not trigger" carve-out.
- quality-fixer / quality-fixer-frontend: rename the Step 1 exception list from "NOT considered incomplete (do not flag)" to "Legitimate patterns (treat as complete; proceed to Step 2)" so the same items appear as a positive definition.
- quality-fixer-frontend: drop the "instead of adopting it directly" tail on the new-library/pattern rule; the route-through-`blocked` directive already carries the intent.
- code-reviewer: replace "Check error responses do not leak internal details" with "Check that error responses redact internal details (stack traces, internal paths, PII)": positive verb plus concrete examples that also tighten BP-002 specificity.

Definitional and contract-shaped negatives are kept intact (e.g., "always-passing assertions ... do not count as substantive evidence", "non-test verification is not subject to this check") because they read as "positive lead + reasoned NG", which is the project's accepted shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shinpr and others added 3 commits May 26, 2026 21:38

chore: ignore local planning docs

71a8138

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shinpr self-assigned this May 26, 2026

shinpr merged commit f196d37 into main May 26, 2026
1 check passed

shinpr deleted the refine/acceptance-evidence-discipline branch May 26, 2026 14:09

shinpr merged 4 commits into main from refine/acceptance-evidence-discipline
