feat(validator): add SOFT eval-coverage check (check #8) #481

Merged
potiuk merged 1 commit into apache:main from justinmclean:check-eval-coverage
Jun 11, 2026

@justinmclean

Member

Summary

Every skill under skills/ must ship a matching behavioural eval suite under tools/skill-evals/evals/<slug>/. The new validate_eval_coverage function flags missing suites as SOFT advisory violations so that in-flight eval PRs do not fail the gate while their branches are pending review.

Against the live repo, the check flags the two skills that currently have in-flight eval branches (pr-management-quick-merge and setup-status) and is silent on all others. Eight new test cases cover the happy path, the missing-eval path, the missing-both-dirs paths, the soft-category membership, and the non-directory skip.

Addresses the Known Gap in specs/meta-and-quality-tooling.md: "Eval coverage is incomplete — skills added before the per-skill-eval convention have no suite." The check prevents future regressions.
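
For readers who want a concrete picture of the check described above, here is a minimal sketch. It is not the merged implementation: only the name validate_eval_coverage comes from this PR. The two-argument signature, the `Violation` record, and the behaviour when either directory is missing are assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Violation:
    """Hypothetical result record; the real validator's type may differ."""

    severity: str  # "SOFT" means advisory: reported, but does not fail the gate
    check: str
    message: str


def validate_eval_coverage(skills_base: Path, evals_base: Path) -> list[Violation]:
    """Return one SOFT violation per skill directory that has no eval suite."""
    # Assumption: with no skills directory there is nothing to check.
    if not skills_base.is_dir():
        return []

    # Assumption: with no evals directory every skill counts as uncovered.
    covered = (
        {p.name for p in evals_base.iterdir() if p.is_dir()}
        if evals_base.is_dir()
        else set()
    )

    violations = []
    for entry in sorted(skills_base.iterdir()):
        if not entry.is_dir():
            continue  # stray files under skills/ are not skills
        if entry.name not in covered:
            violations.append(
                Violation(
                    severity="SOFT",
                    check="eval-coverage",
                    message=f"skill '{entry.name}' has no eval suite "
                    f"under {evals_base / entry.name}",
                )
            )
    return violations
```

Because every result carries the SOFT severity, a caller can print the findings as advisories and still exit successfully, which is what lets in-flight eval branches pass the gate.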

Type of change

  • Skill change (.claude/skills/<name>/) — eval fixtures updated below
  • Tool / bridge contract (tools/<system>/*.md)
  • Python package (tools/*/ with pyproject.toml)
  • Groovy reference impl
  • Cross-cutting (RFC, AGENTS.md, sandbox, privacy-LLM)
  • Documentation (docs/, README.md, CONTRIBUTING.md)
  • Project template (projects/_template/)
  • CI / dev loop (prek, workflows, validators)
  • Other:

Test plan

  • prek run --all-files passes
  • For Python packages touched: uv run pytest / ruff check / mypy passes
  • For Groovy bridges touched: command-line invocation tested end-to-end
  • For skill changes: eval suite passes for the affected skill
    (PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/)
  • For skill behaviour changes: a new or updated eval fixture is included in this PR
    (a regression test for the bug fixed / the behaviour added — see CONTRIBUTING.md)
  • Other:

RFC-AI-0004 compliance

  • HITL — any new mutation is gated on explicit user confirmation
  • Sandbox — no new unrestricted host access; network reach declared in the adapter
  • Vendor neutrality — placeholders (<PROJECT>, <tracker>, <upstream>, <security-list>) used in all skill / tool prose (the check-placeholders prek hook is the mechanical gate)
  • Conversational + correctable — agentic-override path documented if behaviour is adopter-tunable
  • Write-access discipline — no autonomous outbound messages; drafts only, sent on confirmation
  • Privacy LLM — private content does not reach a non-approved LLM; redactor invoked where needed

Linked issues

Notes for reviewers (optional)

@justinmclean justinmclean self-assigned this Jun 10, 2026

@potiuk potiuk left a comment

Member

cool :)

Every skill under skills/ must ship a matching behavioural eval suite
under tools/skill-evals/evals/<slug>/.  The new validate_eval_coverage
function surfaces missing suites as SOFT advisory violations so that
in-flight eval PRs do not fail the gate while their branches are pending
review.

Against the live repo the check correctly flags the two skills that
currently have in-flight eval branches (pr-management-quick-merge and
setup-status) and is silent on all others.  8 new test cases cover the
happy path, the missing-eval path, missing-both-dirs paths, the
soft-category membership, and the non-directory skip.

Addresses the Known Gap in specs/meta-and-quality-tooling.md:
"Eval coverage is incomplete — skills added before the per-skill-eval
convention have no suite."  The check prevents future regressions.

Generated-by: Claude (Opus 4.7)
@justinmclean

Member Author

Correctness

[advisory] Check-number inconsistency: docstring:47 + test say "#8"; section comment:1660 says "check #9". Reconcile (see cross-PR note).
[advisory] validate_eval_coverage — evals_base.iterdir() / skills_base.iterdir() are unguarded against OSError/PermissionError; a restrictive CI runner would get an unhandled exception instead of a graceful violation. Wrap in try/except like collect_tool_python_files does.
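
One way the second advisory could be addressed is sketched below. The helper name `list_subdirs` and its return shape are hypothetical, and this is not a copy of how collect_tool_python_files handles the error; it only illustrates turning an OSError (PermissionError is a subclass) into a reportable message instead of an unhandled exception.

```python
from pathlib import Path
from typing import Optional


def list_subdirs(base: Path) -> tuple[list[str], Optional[str]]:
    """Return (sorted subdirectory names, error message or None).

    Hypothetical helper: the caller can convert a non-None error into a
    graceful violation rather than letting the exception propagate.
    """
    try:
        # sorted() consumes the iterator inside the try block, so errors
        # raised lazily during iteration are caught as well.
        names = sorted(p.name for p in base.iterdir() if p.is_dir())
    except OSError as exc:  # covers PermissionError and FileNotFoundError
        return [], f"cannot read {base}: {exc}"
    return names, None
```

A caller would check the second element first and, if it is set, report it as a violation for that directory and skip the coverage comparison.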

Security / Conventions

No findings

@potiuk

potiuk commented Jun 11, 2026

Member

Rebasing it :)

@potiuk

potiuk commented Jun 11, 2026

Member

The cool thing with agents is that they even replace "eight" with "nine" while resolving this conflict :)

@potiuk potiuk force-pushed the check-eval-coverage branch from 3af6556 to b571a3a June 11, 2026 08:44
@potiuk potiuk merged commit 6866070 into apache:main Jun 11, 2026
26 checks passed