feat(mentoring): add pr-management-mentor intervention eval suite; mark Mentoring experimental #252
Merged
Conversation
Member
|
Hi @justinmclean — heads-up first: Now the actual reason for this ping: small conflict in
Could you rebase or merge |
Member
Author
|
Yep I can rebase for you |
…erimental

Adds the missing `intervention` eval suite (8 cases) to the `pr-management-mentor` eval tree, covering steps 3–5 of the runtime loop: out-of-scope check, maintainer-engaged check, and trigger matching for all four templates plus the multi-trigger and no-trigger paths.

Updates `docs/modes.md` to reflect the prototype skill that already shipped: Mentoring row moves from `proposed / 0 skills` to `experimental / 1 skill`, and the section body is rewritten to point at the live skill rather than the "lands in a follow-up PR" forward reference.

Validation:
  test -f docs/mentoring/spec.md ✓
  uv run --project tools/skill-validator skill-validate ✓ (no violations)

Generated-by: Claude (Opus 4.7)
79fd114 to
87a43fa
Compare
Member
Author
|
@potiuk should be good once the CI finishes |
potiuk
approved these changes
May 25, 2026
andreahlert
added a commit
to andreahlert/magpie
that referenced
this pull request
May 27, 2026
…ng row

Address Justin's two open points on `process.md`:

- Flowchart: S12, S13, and S14 are independent terminals from S11, not 12→13 / 14→13. Step 13's inputs (planning issue, [VOTE] and [RESULT] URLs, voter list, artefact list, promotion revision, [ANNOUNCE] URL) come from everything through Step 11; the archive sweep (12) and post-release snapshot bump (14) feed nothing into the audit log.
- stateDiagram-v2: previously ended `announced → archived → [*]`, dropping 13 and 14 entirely. Added parallel branches `announced → audited` (Step 13) and `announced → bumped` (Step 14), each terminating at `[*]`, matching the flowchart.

Also sync the README.md mentoring row with current `docs/modes.md` (experimental, 1 skill shipping) instead of the stale "proposed — not yet formally adopted" wording carried over from before apache#252.
What
Adds the missing `intervention` eval suite (8 cases) to the existing `pr-management-mentor` skill's eval tree, covering the intervention-selection decision: the out-of-scope and maintainer-engaged checks, the four intervention templates, the multi-trigger (`ask`) path, the no-trigger (silent) path, and the hand-off triggers.
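Read as code, the decision this suite pins might look roughly like the sketch below. It is illustrative only: the `Thread` fields, the function name, and the choice to stay silent on out-of-scope or maintainer-engaged threads are assumptions rather than anything taken from the skill. Only the `ask`, silent, `handoff`, and `draft / N` outcomes, and the fact that a hand-off trigger overrides a template match, come from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Hypothetical summary of a PR thread; all field names are illustrative."""
    out_of_scope: bool = False
    maintainer_engaged: bool = False
    handoff_triggers: list[int] = field(default_factory=list)   # fired hand-off triggers
    template_triggers: list[int] = field(default_factory=list)  # fired templates (1-4)


def select_intervention(thread: Thread) -> str:
    # Assumption: both gate checks simply end in silence.
    if thread.out_of_scope or thread.maintainer_engaged:
        return "silent"
    # A fired hand-off trigger overrides drafting (the why-pushback case).
    if thread.handoff_triggers:
        return "handoff"
    # No-trigger path.
    if not thread.template_triggers:
        return "silent"
    # Multi-trigger path: ask rather than pick a template.
    if len(thread.template_triggers) > 1:
        return "ask"
    return f"draft / {thread.template_triggers[0]}"
```

Under these assumptions, `select_intervention(Thread(template_triggers=[4], handoff_triggers=[2]))` returns `"handoff"`, while the same thread without the hand-off trigger returns `"draft / 4"`.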
Also syncs `docs/modes.md`: the Mentoring row moves from `proposed / 0 skills` to `experimental / 1 skill`, and the section points at the shipped skill rather than a forward reference.
Why
`pr-management-mentor` shipped without a matching eval suite for its intervention-selection step, and the framework treats a skill without evals as incomplete. This back-fills that coverage so the skill's decision logic is pinned by fixtures.
Changes
- `tools/skill-evals/evals/pr-management-mentor/intervention/`: 8-case eval suite (system prompt, user-prompt template, case fixtures).
- `docs/modes.md`: Mentoring row + skill table.

Testing, and an issue the loop did not detect
The suite assembles cleanly and, after the fix below, an independent agent run matches all 8 cases against their ground truth.

On the first test pass, case-4 (`why-pushback`) failed. Its `expected.json` is `handoff`, which is correct per the skill: the contributor argues after the agent already answered the "why" once, firing the skill's hand-off trigger 2 ("answer the why once, don't argue", defined in `hand-off.md`). But the eval's own `system-prompt.md` never encoded the hand-off triggers, so a model following the prompt returned `draft / 4` instead. The build loop generated an eval whose ground truth assumed skill logic its own prompt left out, a self-inconsistency the loop did not catch. Fixed here by adding the four hand-off triggers (documented `4 → 3 → 1 → 2` order) to `system-prompt.md`; re-running the independent check then passed 8/8.
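A crude version of the missing check could be sketched as follows. This is not part of the PR: the function names, the `{"decision": ...}` shape of `expected.json`, and the one-directory-per-case layout are all assumptions made for illustration. It only flags expected decisions whose leading keyword never appears in the system prompt, which would have been enough to surface the `handoff` gap described above.

```python
import json
from pathlib import Path


def unmentioned_decisions(system_prompt: str, expected: dict[str, str]) -> dict[str, str]:
    """Return {case: decision} for decisions whose first word is absent from the prompt."""
    prompt = system_prompt.lower()
    missing = {}
    for case, decision in expected.items():
        keyword = decision.split()[0].lower()  # "draft / 4" -> "draft"
        if keyword not in prompt:
            missing[case] = decision
    return missing


def load_expected(suite_dir: Path) -> dict[str, str]:
    """Assumed layout: <suite>/<case>/expected.json containing {"decision": "..."}."""
    return {
        p.parent.name: json.loads(p.read_text())["decision"]
        for p in sorted(suite_dir.glob("*/expected.json"))
    }
```

For example, `unmentioned_decisions("Reply with draft / N, ask, or silent.", {"case-4": "handoff"})` returns `{"case-4": "handoff"}`. A substring match is deliberately simple and will miss spelling variants (a prompt that writes "hand-off") and produce false passes on short keywords, so it is a smoke test rather than a proof of consistency.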
Notes
- … `SKILL.md`) is changed.
- … that the build loop needs a self-consistency check between an eval's fixtures and the prompt it ships.