Tighten acceptance evidence discipline across agents #128

Merged
shinpr merged 4 commits into main from refine/acceptance-evidence-discipline
May 26, 2026
@shinpr shinpr commented May 26, 2026

Summary

  • Introduce acceptance evidence discipline so test results cited as evidence reflect real behavior — not zero-match selectors, skipped tests, TODO/placeholder bodies, or always-passing assertions. The rule is scoped to test evidence so non-test verification (build, typecheck, CLI) and refactor tasks without test evidence are unaffected.
  • Extend the task-executor frontend with sections that previously existed only on the backend variant: Test Environment Check, Reference Representativeness (with a concrete "feature area" definition), and the four-bullet Iron Rule. Add Dependency Version Uncertain (2-4) and Test Environment Not Ready (2-7) escalation schemas, and register test_environment_not_ready in the orchestration guide's allowed escalation_type list so the workflow contract stays consistent.
  • Compress prompt files by ~280 lines without losing behavior-binding signals: collapse repeated preambles, deadwood "Specific Boundary" recaps, illustrative bullet lists, verbose JSON examples, and ENFORCEMENT boilerplate; rewrite negative-form instructions in positive form; restore anchor values and disambiguation phrasing where the first pass over-compressed.
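
As a rough illustration of the first bullet, the substance rule could be sketched as below. This is a minimal sketch, not the agents' actual logic: the `EvidenceRun` fields and the `kind` labels are hypothetical names invented for this example, and the real rule lives in prompt text rather than code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvidenceRun:
    """Hypothetical summary of one test run cited as evidence."""
    selected: int                   # tests matched by the selector
    passed: int                     # tests that executed and passed
    skipped: int                    # tests that were skipped
    placeholder_only: bool = False  # bodies are TODO/placeholder only
    always_passing: bool = False    # assertions that cannot fail


def evidence_problems(run: EvidenceRun) -> List[str]:
    """Return the reasons a run does not count as substantive evidence."""
    problems = []
    if run.selected == 0:
        problems.append("zero-match selector")
    if run.skipped > 0 and run.passed == 0:
        problems.append("only skipped tests")
    if run.placeholder_only:
        problems.append("TODO/placeholder bodies")
    if run.always_passing:
        problems.append("always-passing assertions")
    return problems


def is_acceptable(kind: str, run: Optional[EvidenceRun]) -> bool:
    """Apply the rule only to test evidence (assumed label: kind == "test")."""
    if kind != "test":
        # Non-test verification (build, typecheck, CLI) is out of scope.
        return True
    return run is not None and not evidence_problems(run)
```

Under these assumptions, `is_acceptable("build", None)` passes untouched, while a test run whose selector matched nothing is rejected with a named reason.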

What changes per agent

task-executor.md / task-executor-frontend.md

  • Add runnableCheck.result field spec and an Exit Gate item, both of which count only test runs cited as evidence; non-test verification is exempt.
  • Add Test Environment Check (frontend), Reference Representativeness (frontend), and expand the Iron Rule to four bullets (frontend). Both files now share the same structural sections, while domain vocabulary stays specific to their stack.
  • Add Escalation Response 2-4 Dependency Version Uncertain (frontend) and 2-7 Test Environment Not Ready (both files), with structured missingComponent + description fields so the orchestrator can parse the cause.
  • Compress: collapse File Scope Constraint to direct rules, fold Similar Function/Component Duplication Check indicators into one inline list with explicit "Exactly the pair (a+c) or (b+c) → Escalation" thresholds, drop deadwood preambles and the redundant Execution Principles footer.
  • Add Change Category-style stop conditions to the Iron Rule's escalation paths in positive form: "Route any new library/pattern decision through Escalation Response 2-4" instead of "Do not introduce a new library or pattern".
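
The two-indicator threshold from the duplication check can be read as a small decision table. The sketch below assumes the indicators are labeled `a`, `b`, `c` as in the prompt text and covers only the two-indicator case spelled out in the commit message ("any other 2-indicator combination → Continue"); the function name and the `"escalate"` / `"continue"` return values are hypothetical.

```python
# Pairs that trigger escalation, per "Exactly the pair (a+c) or (b+c)".
ESCALATING_PAIRS = {frozenset({"a", "c"}), frozenset({"b", "c"})}


def decide_two_indicators(matched: set) -> str:
    """Decide the outcome when exactly two indicators matched."""
    if len(matched) != 2:
        # Other counts are governed by rules not shown in this PR text.
        raise ValueError("this sketch covers only the two-indicator case")
    if frozenset(matched) in ESCALATING_PAIRS:
        return "escalate"
    return "continue"
```

For example, `decide_two_indicators({"a", "c"})` yields `"escalate"`, while `decide_two_indicators({"a", "b"})` yields `"continue"`.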

quality-fixer.md / quality-fixer-frontend.md

  • Add a conditional Substance check in Step 3 and an approved criterion that only apply when a test run is cited as evidence for the task's intended behavior.
  • Split the blocked JSON example into two per-variant blocks (specification conflict / missing prerequisites) so the LLM does not produce hybrid responses, and keep concrete sample values as anchors ("500 error" / "Button disabled" etc.).
  • Frontend: rewrite Frontend-Specific Quality Criteria to be repository-local (prefer dominant local patterns, mock at network/API boundaries, allow browser primitive doubles, exercise the component under test through real renders). The "new library/pattern" rule routes through the agent's actual blocked output contract.
  • Compress: drop low-binding Important Principles, generic Debugging Hints, the Fix Continuation Determination Conditions duplicate of Step 5, and a 17-line mermaid diagram that restated text rules.

integration-test-reviewer.md

  • Add a substantive-assertion criterion to Quality Assessment and a Hollow or Placeholder Assertion entry to Common Issues, limited to text-inspectable patterns.
  • Tighten the JSON schema example while keeping meaningful placeholder text.
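
To make "text-inspectable patterns" concrete, here is a hedged sketch of a line-based scan. It assumes Jest/Vitest-style test syntax; the pattern set is illustrative and is not the list the reviewer prompt actually uses.

```python
import re

# Illustrative patterns only; Jest/Vitest-style syntax is an assumption.
HOLLOW_PATTERNS = {
    "skipped test": re.compile(r"\b(?:it|test|describe)\.skip\s*\(|\bxit\s*\("),
    "todo test": re.compile(r"\b(?:it|test)\.todo\s*\("),
    "always-true assertion": re.compile(
        r"expect\(\s*true\s*\)\.(?:toBe|toEqual)\(\s*true\s*\)"
    ),
    "placeholder body": re.compile(r"//\s*TODO", re.IGNORECASE),
}


def find_hollow_assertions(source: str) -> list:
    """Return (line number, label) pairs for lines matching a hollow pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in HOLLOW_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings
```

A scan like this only flags what is visible in the text. An assertion such as `expect(screen.getByRole('button')).toBeDisabled()` is left alone, as is an intentional absence check, which the PR treats as substantive.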

code-reviewer.md

  • Extend Test Coverage for Acceptance Criteria to flag tests that grep finds but do not exercise the AC (skipped, TODO-only, or always-true assertions), with explicit examples and an absence-verification clarification.
  • Compress the schema and example sections; drop the generic Special Considerations block.

skills/subagents-orchestration-guide/SKILL.md

  • Add test_environment_not_ready to the task-executor escalation_type list so the new typed escalation has a documented downstream handler.
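
A sketch of how an orchestrator might consume the new typed escalation. Only `escalation_type`, `missingComponent`, and `description` come from this PR; the flat JSON layout, the function name, and the single-entry allow-list are assumptions (the real list in SKILL.md contains other types not reproduced here).

```python
import json

# Assumption: only the type added by this PR is listed here.
ALLOWED_ESCALATION_TYPES = {"test_environment_not_ready"}


def parse_escalation(raw: str) -> dict:
    """Validate a task-executor escalation payload (hypothetical flat layout)."""
    payload = json.loads(raw)
    etype = payload.get("escalation_type")
    if etype not in ALLOWED_ESCALATION_TYPES:
        raise ValueError(f"unrecognized escalation_type: {etype!r}")
    if etype == "test_environment_not_ready":
        # Structured fields let the orchestrator parse the cause.
        for field in ("missingComponent", "description"):
            if not payload.get(field):
                raise ValueError(f"missing field: {field}")
    return payload
```

Registering the type in the allow-list is what keeps the contract consistent: without it, a payload like the one above would be rejected as unrecognized even though the agent produced it correctly.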

Line-count impact

| File | Before | After | Δ |
| --- | ---: | ---: | ---: |
| agents/task-executor.md | 444 | 406 | -38 |
| agents/task-executor-frontend.md | 395 | 411 | +16 |
| agents/quality-fixer.md | 330 | 230 | -100 |
| agents/quality-fixer-frontend.md | 440 | 328 | -112 |
| agents/integration-test-reviewer.md | 146 | 137 | -9 |
| agents/code-reviewer.md | 275 | 243 | -32 |
| agents/ total | 2030 | 1755 | -275 |
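
The totals can be re-derived from the per-file rows. The snippet below simply recomputes them from the figures in the table above.

```python
# (before, after) line counts copied from the table above.
counts = {
    "agents/task-executor.md": (444, 406),
    "agents/task-executor-frontend.md": (395, 411),
    "agents/quality-fixer.md": (330, 230),
    "agents/quality-fixer-frontend.md": (440, 328),
    "agents/integration-test-reviewer.md": (146, 137),
    "agents/code-reviewer.md": (275, 243),
}

deltas = {name: after - before for name, (before, after) in counts.items()}
total_before = sum(before for before, _ in counts.values())
total_after = sum(after for _, after in counts.values())

print(total_before, total_after, total_after - total_before)  # 2030 1755 -275
```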

All target files now sit at or below 450 lines; quality-fixer and quality-fixer-frontend are both well under the 400-line target, while task-executor (406) and task-executor-frontend (411) sit just above it.

Test plan

  • node scripts/sync-plugins.mjs regenerates the three plugin directories cleanly (handled by the pre-commit hook).
  • claude plugin validate passes for marketplace.json and each plugin manifest (handled by the pre-commit hook).
  • node scripts/check-skills-index.mjs reports all 11 skills consistent.
  • Run a sample task through task-executor and confirm the new Exit Gate item rejects zero-match / skipped / placeholder evidence while leaving non-test verification alone.
  • Run a sample frontend refactor task (no test evidence cited) and confirm the conditional substance check in quality-fixer-frontend does not falsely block it.
  • Trigger a missing-test-environment condition and confirm the agent returns escalation_type: "test_environment_not_ready" and that the orchestration guide recognizes it.

🤖 Generated with Claude Code

shinpr and others added 3 commits May 26, 2026 21:38
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add substance requirements so test results cited as evidence reflect
real behavior rather than zero-match selectors, skipped tests, TODO/
placeholder bodies, or always-passing assertions. Scope the rule to
test evidence so non-test verification (build, typecheck, CLI) and
refactor tasks without test evidence are unaffected. Include concrete
examples and clarify that intentional absence verification counts as
substantive.

- task-executor / task-executor-frontend: add runnableCheck.result
  field spec and an Exit Gate item scoped to test evidence.
- quality-fixer / quality-fixer-frontend: add a conditional substance
  check in Step 3 and an approved criterion that applies only when
  test runs are cited as evidence.
- integration-test-reviewer: add a substantive-assertion criterion to
  Quality Assessment and a Hollow or Placeholder Assertion entry to
  Common Issues, limited to text-inspectable patterns.
- code-reviewer: extend Test Coverage for Acceptance Criteria to flag
  tests that grep finds but do not exercise the AC.

Bump version to 0.20.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-ups on top of the acceptance evidence discipline work:

Cross-platform parity (backend → frontend task-executor):
- Add Test Environment Check with project-configured toolchain detection,
  scoped to test toolchains the changed behavior depends on.
- Add Reference Representativeness with repository-local choice discipline
  and "feature area" definition (surrounding feature folder, or the nearest
  parent directory with siblings using the same concern).
- Expand Iron Rule from one line back to four bullets with frontend-flavored
  examples (Props shape, component placement, state location).

New typed escalations:
- Add Escalation Response 2-4 (Dependency Version Uncertain) to the frontend
  task-executor so the new Reference Representativeness reference is no
  longer dangling.
- Add Escalation Response 2-7 (Test Environment Not Ready) to both backend
  and frontend task-executor, and add test_environment_not_ready to the
  orchestration guide's allowed escalation_type list so the contract is
  consistent across the workflow.

Prompt compression (~280 lines removed across agents/) without losing
behavior-binding signals:
- task-executor / task-executor-frontend: collapse File Scope Constraint,
  Similar Function/Component Duplication Check, Specific Utilization,
  Pre-implementation Verification, and Operation Verification while keeping
  Mandatory Judgment Criteria, Iron Rule, BLOCKING gates, and all
  escalation schemas intact.
- quality-fixer / quality-fixer-frontend: replace verbose JSON examples
  with compact one-liners and the Intermediate Progress Report emoji
  template with a single prose line; restore the blocked response as two
  separate per-variant JSON blocks with anchor sample values to keep LLM
  field-filling reliable.
- integration-test-reviewer / code-reviewer: tighten JSON examples and
  drop Special Considerations (low-binding generic guidance).

Prompt-quality polish:
- Reword negative-form instructions into positive form
  ("Route any new library/pattern decision through ..." instead of
  "Do not introduce a new library or pattern ...").
- Frame quality-fixer-frontend's "new library/pattern" rule against the
  agent's actual blocked output contract rather than a non-existent
  escalation path.
- Disambiguate the Step 3 "2 indicators" rule to "Exactly the pair (a+c)
  or (b+c) → Escalation; any other 2-indicator combination → Continue".
- Restore concrete placeholder examples ("500 error" / "Button disabled"
  etc.) and a 1-line "placeholder-only or empty Investigation Targets
  do not trigger this gate" note that the compression had dropped.
- Restore frontend Test Environment Check to the four-line
  Before/Check method/Available/Unavailable structure for parity with
  the backend agent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shinpr shinpr self-assigned this May 26, 2026
Sweep the touched agents for behavior-directing negative-form instructions
and rewrite them positively while keeping line counts unchanged.

- task-executor / task-executor-frontend: restate the Step 2 Completion
  Gate trigger as a single positive condition ("triggers only when the
  Investigation Targets section lists at least one concrete file path")
  in place of the "do not trigger" carve-out.
- quality-fixer / quality-fixer-frontend: rename the Step 1 exception list
  from "NOT considered incomplete (do not flag)" to "Legitimate patterns
  (treat as complete; proceed to Step 2)" so the same items appear as a
  positive definition.
- quality-fixer-frontend: drop the "instead of adopting it directly" tail
  on the new-library/pattern rule; the route-through-`blocked` directive
  already carries the intent.
- code-reviewer: replace "Check error responses do not leak internal
  details" with "Check that error responses redact internal details
  (stack traces, internal paths, PII)" — positive verb plus concrete
  examples that also tighten BP-002 specificity.

Definitional and contract-shaped negatives are kept intact (e.g.,
"always-passing assertions ... do not count as substantive evidence",
"non-test verification is not subject to this check") because they read
as "positive lead + reasoned NG", which is the project's accepted shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shinpr shinpr merged commit f196d37 into main May 26, 2026
1 check passed
@shinpr shinpr deleted the refine/acceptance-evidence-discipline branch May 26, 2026 14:09