feat(evals): add eval suite for setup-override-upstream skill#334
Conversation
15 cases across 4 steps covering the override-upstreaming workflow: pre-flight repo/drift checks, override selection dispatch, upstreamability classification, and PR confirmation gating. Generated-by: Claude (Opus 4.7)
de495d4 to
2015b29
Compare
potiuk
left a comment
There was a problem hiding this comment.
LGTM — 15 cases across the four highest-signal decision points in
setup-override-upstream (pre-flight repo/drift, override pick, upstreamability
classification, PR confirm). All four step-configs resolve to real SKILL.md
headings. The two prompt-injection cases (override-pick and upstreamability)
correctly preserve real-state output rather than the injected answer. Runner
walks all 15 cases without parse errors.
This review was drafted by an AI-assisted tool and confirmed by an Apache Steward
maintainer. The maintainer approving this PR has read the findings and signed off.
If something feels off, please reply on the PR and a maintainer will follow up.More on how Apache Steward handles maintainer review:
CONTRIBUTING.md.
Summary
15 cases across 4 steps covering the override-upstreaming workflow: pre-flight repo/drift checks, override selection dispatch, upstreamability classification, and PR confirmation gating.
Generated-by: Claude (Opus 4.7)
Type of change
.claude/skills/<name>/) — eval fixtures updated belowtools/<system>/*.md)tools/*/withpyproject.toml)docs/,README.md,CONTRIBUTING.md)projects/_template/)prek, workflows, validators)Test plan
prek run --all-filespassesuv run pytest/ruff check/mypypasses(
PYTHONPATH=tools/skill-evals/src python3 -m skill_evals.runner tools/skill-evals/evals/<skill>/)(a regression test for the bug fixed / the behaviour added — see CONTRIBUTING.md)