feat(validator): add SOFT eval-coverage check (check #8) by justinmclean · Pull Request #481 · apache/magpie

justinmclean · 2026-06-10T06:53:55Z

Summary

Every skill under skills/ must ship a matching behavioural eval suite under tools/skill-evals/evals/<slug>/. The new validate_eval_coverage function flags missing suites as SOFT advisory violations so that in-flight eval PRs do not fail the gate while their branches are pending review.

Against the live repo, the check correctly flags the two skills that currently have in-flight eval branches (pr-management-quick-merge and setup-status) and is silent on all others. Eight new test cases cover the happy path, the missing-eval path, the missing-both-dirs paths, the soft-category membership, and the non-directory skip.

Addresses the Known Gap in specs/meta-and-quality-tooling.md: "Eval coverage is incomplete — skills added before the per-skill-eval convention have no suite." The check prevents future regressions.
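
As a rough illustration of the check described in this summary, the sketch below compares directory names under a skills root with those under an evals root and reports each unmatched skill as a SOFT violation. It is not the code from this PR: the `Violation` dataclass, the function name `sketch_eval_coverage`, the category string `eval-coverage`, and the behaviour when either root is missing are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Violation:
    """Hypothetical result record; the real validator's type may differ."""

    category: str
    severity: str
    message: str


def sketch_eval_coverage(skills_base: Path, evals_base: Path) -> list[Violation]:
    """Report every skill directory that has no same-named eval directory."""
    # Assumption: a missing skills root means there is nothing to check.
    if not skills_base.is_dir():
        return []
    # Assumption: a missing evals root is treated as "no suites exist yet".
    if evals_base.is_dir():
        eval_slugs = {p.name for p in evals_base.iterdir() if p.is_dir()}
    else:
        eval_slugs = set()

    violations = []
    for skill_dir in sorted(skills_base.iterdir()):
        if not skill_dir.is_dir():
            continue  # plain files next to the skill directories are skipped
        if skill_dir.name not in eval_slugs:
            violations.append(
                Violation(
                    category="eval-coverage",  # hypothetical category name
                    severity="SOFT",  # advisory only, does not fail the gate
                    message=f"skill '{skill_dir.name}' has no eval suite "
                    f"under {evals_base}/{skill_dir.name}/",
                )
            )
    return violations
```

Keeping the severity on the record rather than raising lets the caller decide whether a SOFT finding is printed as advice or promoted to a failure later.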

Type of change

Skill change (.claude/skills/<name>/) — eval fixtures updated below
Tool / bridge contract (tools/<system>/*.md)
Python package (tools/*/ with pyproject.toml)
Groovy reference impl
Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
Documentation (docs/, README.md, CONTRIBUTING.md)
Project template (projects/_template/)
CI / dev loop (prek, workflows, validators)
Other:

Test plan

prek run --all-files passes
For Python packages touched: uv run pytest / ruff check / mypy passes
For Groovy bridges touched: command-line invocation tested end-to-end
For skill changes: eval suite passes for the affected skill
(PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/)
For skill behaviour changes: a new or updated eval fixture is included in this PR
(a regression test for the bug fixed / the behaviour added — see CONTRIBUTING.md)
Other:

RFC-AI-0004 compliance

HITL — any new mutation is gated on explicit user confirmation
Sandbox — no new unrestricted host access; network reach declared in the adapter
Vendor neutrality — placeholders (<PROJECT>, <tracker>, <upstream>, <security-list>) used in all skill / tool prose (the check-placeholders prek hook is the mechanical gate)
Conversational + correctable — agentic-override path documented if behaviour is adopter-tunable
Write-access discipline — no autonomous outbound messages; drafts only, sent on confirmation
Privacy LLM — private content does not reach a non-approved LLM; redactor invoked where needed

Linked issues

Notes for reviewers (optional)

potiuk

cool :)

Every skill under skills/ must ship a matching behavioural eval suite under tools/skill-evals/evals/<slug>/. The new validate_eval_coverage function surfaces missing suites as SOFT advisory violations so that in-flight eval PRs do not fail the gate while their branches are pending review.

Against the live repo the check correctly flags the two skills that currently have in-flight eval branches (pr-management-quick-merge and setup-status) and is silent on all others. 8 new test cases cover the happy path, the missing-eval path, missing-both-dirs paths, the soft-category membership, and the non-directory skip.

Addresses the Known Gap in specs/meta-and-quality-tooling.md: "Eval coverage is incomplete — skills added before the per-skill-eval convention have no suite." The check prevents future regressions.

Generated-by: Claude (Opus 4.7)

justinmclean · 2026-06-11T08:35:57Z

Correctness

[advisory] Check-number inconsistency: docstring:47 + test say "#8"; section comment:1660 says "check #9". Reconcile (see cross-PR note).
[advisory] validate_eval_coverage — evals_base.iterdir() / skills_base.iterdir() are unguarded against OSError/PermissionError; a restrictive CI runner would get an unhandled exception instead of a graceful violation. Wrap in try/except like collect_tool_python_files does.
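
One way the second advisory could be addressed is sketched below: the directory listing is wrapped so that an unreadable or missing directory yields an error string the caller can turn into a violation, instead of an unhandled exception. The helper name `safe_subdir_names` and its return shape are hypothetical, and this does not claim to mirror what `collect_tool_python_files` actually does.

```python
from __future__ import annotations

from pathlib import Path


def safe_subdir_names(base: Path) -> tuple[list[str], str | None]:
    """Return (sorted subdirectory names, error message or None)."""
    try:
        # sorted() consumes the iterator inside the try block, so errors
        # raised lazily during iteration are caught as well.
        names = sorted(p.name for p in base.iterdir() if p.is_dir())
    except OSError as exc:
        # PermissionError and FileNotFoundError are both OSError subclasses.
        return [], f"cannot list {base}: {exc}"
    return names, None
```

A caller would check the second element first and, when it is not None, record it as a violation rather than continuing with an empty listing.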

Security / Conventions

No findings

potiuk · 2026-06-11T08:42:30Z

Rebasing it :)

potiuk · 2026-06-11T08:43:17Z

The cool thing with agents is that they even replace "eight" with "nine" while resolving this conflict :)

justinmclean self-assigned this Jun 10, 2026

potiuk approved these changes Jun 11, 2026

justinmclean mentioned this pull request Jun 11, 2026

feat(validator): enforce license headers on tool Python files (check #8) #474

Merged

15 tasks

potiuk force-pushed the check-eval-coverage branch from 3af6556 to b571a3a Compare June 11, 2026 08:44

potiuk merged commit 6866070 into apache:main Jun 11, 2026
26 checks passed

potiuk merged 1 commit into apache:main from justinmclean:check-eval-coverage
