Tighten acceptance evidence discipline across agents #128
Merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add substance requirements so test results cited as evidence reflect real
behavior rather than zero-match selectors, skipped tests, TODO/placeholder
bodies, or always-passing assertions. Scope the rule to test evidence so
non-test verification (build, typecheck, CLI) and refactor tasks without
test evidence are unaffected. Include concrete examples and clarify that
intentional absence verification counts as substantive.
- task-executor / task-executor-frontend: add runnableCheck.result field
spec and an Exit Gate item scoped to test evidence.
- quality-fixer / quality-fixer-frontend: add a conditional substance check
in Step 3 and an approved criterion that applies only when test runs are
cited as evidence.
- integration-test-reviewer: add a substantive-assertion criterion to
Quality Assessment and a Hollow or Placeholder Assertion entry to Common
Issues, limited to text-inspectable patterns.
- code-reviewer: extend Test Coverage for Acceptance Criteria to flag tests
that grep finds but do not exercise the AC.
Bump version to 0.20.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
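The substance rule above can be sketched as a small check. This is an illustration only, not the agents' actual logic (the agents are prompt files): `CitedTestRun`, its fields, and the problem strings are hypothetical names, and the sketch assumes the skipped, placeholder, and always-passing counts do not overlap.

```python
from dataclasses import dataclass


@dataclass
class CitedTestRun:
    """Hypothetical summary of one test run cited as acceptance evidence."""
    matched: int             # tests selected by the selector/filter
    skipped: int = 0         # matched tests that were skipped
    placeholder: int = 0     # matched tests with TODO/placeholder bodies
    always_passing: int = 0  # matched tests whose assertions cannot fail


def substantive_count(run: CitedTestRun) -> int:
    """Tests left after discounting the hollow categories (assumed disjoint)."""
    hollow = run.skipped + run.placeholder + run.always_passing
    return max(run.matched - hollow, 0)


def evidence_problems(run: CitedTestRun) -> list[str]:
    """Return reasons the run should not count as substantive evidence."""
    if run.matched == 0:
        return ["selector matched zero tests"]
    if substantive_count(run) == 0:
        return ["no matched test exercises real behavior"]
    return []
```

A test that intentionally asserts the absence of something would count toward `matched` like any other real assertion here, in line with the commit's note that absence verification is substantive.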
Follow-ups on top of the acceptance evidence discipline work:
Cross-platform parity (backend → frontend task-executor):
- Add Test Environment Check with project-configured toolchain detection,
scoped to test toolchains the changed behavior depends on.
- Add Reference Representativeness with repository-local choice discipline
and "feature area" definition (surrounding feature folder, or the nearest
parent directory with siblings using the same concern).
- Expand Iron Rule from one line back to four bullets with frontend-flavored
examples (Props shape, component placement, state location).
New typed escalations:
- Add Escalation Response 2-4 (Dependency Version Uncertain) to the frontend
task-executor so the new Reference Representativeness reference is no
longer dangling.
- Add Escalation Response 2-7 (Test Environment Not Ready) to both backend
and frontend task-executor, and add test_environment_not_ready to the
orchestration guide's allowed escalation_type list so the contract is
consistent across the workflow.
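To illustrate why the allowed list matters, here is a minimal sketch of an orchestrator-side check. Only the `test_environment_not_ready` value and the `escalation_type` key come from this PR; the `missingComponent` and `description` fields are named in the PR summary, and attaching them to this particular escalation is an assumption. The function name and sample payload are hypothetical.

```python
# Assumption: the real allowed list contains more entries; only this one
# is named in the PR.
ALLOWED_ESCALATION_TYPES = {"test_environment_not_ready"}

# Assumption: these fields belong to the Test Environment Not Ready schema.
REQUIRED_FIELDS = {
    "test_environment_not_ready": ("missingComponent", "description"),
}


def validate_escalation(payload: dict) -> list[str]:
    """Return a list of contract violations for one escalation payload."""
    errors = []
    etype = payload.get("escalation_type")
    if etype not in ALLOWED_ESCALATION_TYPES:
        errors.append(f"unknown escalation_type: {etype!r}")
        return errors
    for field in REQUIRED_FIELDS.get(etype, ()):
        if not payload.get(field):
            errors.append(f"missing field: {field}")
    return errors


# Hypothetical payload an executor might emit.
sample = {
    "escalation_type": "test_environment_not_ready",
    "missingComponent": "browser test runner",
    "description": "runner binaries are not installed",
}
```

If an agent emitted a type that the guide's list did not register, a check like this would report it as unknown, which is the inconsistency the commit closes.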
Prompt compression (~280 lines removed across agents/) without losing
behavior-binding signals:
- task-executor / task-executor-frontend: collapse File Scope Constraint,
Similar Function/Component Duplication Check, Specific Utilization,
Pre-implementation Verification, and Operation Verification while keeping
Mandatory Judgment Criteria, Iron Rule, BLOCKING gates, and all
escalation schemas intact.
- quality-fixer / quality-fixer-frontend: replace verbose JSON examples
with compact one-liners and the Intermediate Progress Report emoji
template with a single prose line; restore the blocked response as two
separate per-variant JSON blocks with anchor sample values to keep LLM
field-filling reliable.
- integration-test-reviewer / code-reviewer: tighten JSON examples and
drop Special Considerations (low-binding generic guidance).
Prompt-quality polish:
- Reword negative-form instructions into positive form
("Route any new library/pattern decision through ..." instead of
"Do not introduce a new library or pattern ...").
- Frame quality-fixer-frontend's "new library/pattern" rule against the
agent's actual blocked output contract rather than a non-existent
escalation path.
- Disambiguate the Step 3 "2 indicators" rule to "Exactly the pair (a+c)
or (b+c) → Escalation; any other 2-indicator combination → Continue".
- Restore concrete placeholder examples ("500 error" / "Button disabled"
etc.) and a 1-line "placeholder-only or empty Investigation Targets
do not trigger this gate" note that the compression had dropped.
- Restore frontend Test Environment Check to the four-line
Before/Check method/Available/Unavailable structure for parity with
the backend agent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
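The disambiguated "2 indicators" rule from the commit above reads as a set-membership test. The sketch below covers only the exactly-two case that the commit message describes. The indicator labels a, b, and c are taken from the message; their meanings are not given there, and the function name and return strings are hypothetical.

```python
# The two pairs the commit message names as escalation triggers.
ESCALATING_PAIRS = {frozenset({"a", "c"}), frozenset({"b", "c"})}


def decide_two_indicators(indicators) -> str:
    """Decide the outcome when exactly two indicators are present."""
    found = frozenset(indicators)
    if len(found) != 2:
        # Other counts are governed by rules not quoted in this PR.
        raise ValueError("sketch covers only the exactly-two-indicator case")
    return "escalate" if found in ESCALATING_PAIRS else "continue"
```

Using unordered sets means (c+a) and (a+c) are treated the same, which matches the "exactly the pair" wording.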
Sweep the touched agents for behavior-directing negative-form instructions
and rewrite them positively while keeping line counts unchanged.
- task-executor / task-executor-frontend: restate the Step 2 Completion
Gate trigger as a single positive condition ("triggers only when the
Investigation Targets section lists at least one concrete file path")
in place of the "do not trigger" carve-out.
- quality-fixer / quality-fixer-frontend: rename the Step 1 exception list
from "NOT considered incomplete (do not flag)" to "Legitimate patterns
(treat as complete; proceed to Step 2)" so the same items appear as a
positive definition.
- quality-fixer-frontend: drop the "instead of adopting it directly" tail
on the new-library/pattern rule; the route-through-`blocked` directive
already carries the intent.
- code-reviewer: replace "Check error responses do not leak internal
details" with "Check that error responses redact internal details
(stack traces, internal paths, PII)" — positive verb plus concrete
examples that also tighten BP-002 specificity.
Definitional and contract-shaped negatives are kept intact (e.g.,
"always-passing assertions ... do not count as substantive evidence",
"non-test verification is not subject to this check") because they read
as "positive lead + reasoned NG", which is the project's accepted shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
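The positive Step 2 Completion Gate trigger ("at least one concrete file path") can be pictured as follows. This is a sketch under a stated assumption: the PR does not define "concrete file path", so the regular expression below (path characters ending in a file extension) is a stand-in, and the function name and sample paths are hypothetical.

```python
import re

# Assumption: a concrete path uses only word characters, dots, slashes, and
# hyphens, and ends in an extension. Placeholders such as "<path/to/file>",
# "TBD", or "..." fail this pattern.
PATH_LIKE = re.compile(r"^[\w./-]+\.\w+$")


def gate_triggers(investigation_targets: list[str]) -> bool:
    """True when at least one listed target looks like a concrete file path."""
    return any(PATH_LIKE.match(entry.strip()) for entry in investigation_targets)
```

A single positive condition like this also covers the empty and placeholder-only cases without a separate carve-out, which is the point of the rewording. A heuristic this simple would still accept a generic string such as `path/to/file.ext`.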
Summary
- […] `Dependency Version Uncertain` (2-4) and `Test Environment Not Ready` (2-7) escalation schemas, and register `test_environment_not_ready` in the orchestration guide's allowed `escalation_type` list so the workflow contract stays consistent.
What changes per agent
`task-executor.md` / `task-executor-frontend.md`
- […] `runnableCheck.result` field spec and an Exit Gate item that count only test runs cited as evidence; non-test verification is exempt.
- […] `Test Environment Check` (frontend), `Reference Representativeness` (frontend), and expand the Iron Rule to four bullets (frontend). Both files now share the same structural sections, while domain vocabulary stays specific to their stack.
- […] `missingComponent` + `description` fields so the orchestrator can parse the cause.
- Change Category-style stop conditions to the Iron Rule's escalation paths in positive form: "Route any new library/pattern decision through Escalation Response 2-4" instead of "Do not introduce a new library or pattern".
`quality-fixer.md` / `quality-fixer-frontend.md`
- […] `approved` criterion that only apply when a test run is cited as evidence for the task's intended behavior.
- […] `blocked` JSON example into two per-variant blocks (specification conflict / missing prerequisites) so the LLM does not produce hybrid responses, and keep concrete sample values as anchors ("500 error" / "Button disabled" etc.).
- […] `blocked` output contract.
`integration-test-reviewer.md`
`code-reviewer.md`
`skills/subagents-orchestration-guide/SKILL.md`
- […] `test_environment_not_ready` to the task-executor escalation_type list so the new typed escalation has a documented downstream handler.
Line-count impact
- agents/task-executor.md
- agents/task-executor-frontend.md
- agents/quality-fixer.md
- agents/quality-fixer-frontend.md
- agents/integration-test-reviewer.md
- agents/code-reviewer.md
All target files now sit at or below 450 lines; `task-executor` and `quality-fixer` are both well under the 400-line target.
Test plan
- `node scripts/sync-plugins.mjs` regenerates the three plugin directories cleanly (handled by the pre-commit hook).
- `claude plugin validate` passes for `marketplace.json` and each plugin manifest (handled by the pre-commit hook).
- `node scripts/check-skills-index.mjs` reports all 11 skills consistent.
- […] `task-executor` and confirm the new Exit Gate item rejects zero-match / skipped / placeholder evidence while leaving non-test verification alone.
- […] `quality-fixer-frontend` does not falsely block it.
- […] `escalation_type: "test_environment_not_ready"` and that the orchestration guide recognizes it.
🤖 Generated with Claude Code