feat(mentoring): add pr-management-mentor intervention eval suite; mark Mentoring experimental #252

Merged
potiuk merged 2 commits into apache:main from justinmclean:mentoring-prototype
May 25, 2026

@justinmclean

Member

Generated by the spec-driven build loop. This eval suite and the docs
update were produced by an autonomous run of tools/spec-loop (./loop.sh
one work item, one branch, one PR). Authored by Claude (see the Generated-by
commit trailer) and reviewed + tested by a human before submission.

What

Adds the missing intervention eval suite (8 cases) to the existing
pr-management-mentor skill's eval tree, covering the intervention-selection
decision: the out-of-scope and maintainer-engaged checks, the four intervention
templates, the multi-trigger (ask) path, the no-trigger (silent) path, and
the hand-off triggers.

Also syncs docs/modes.md: the Mentoring row moves from proposed / 0 skills
to experimental / 1 skill, and the section points at the shipped skill rather
than a forward reference.

Why

pr-management-mentor shipped without a matching eval suite for its
intervention-selection step, and the framework treats a skill without evals as
incomplete. This back-fills that coverage so the skill's decision logic is
pinned by fixtures.

Changes

  • tools/skill-evals/evals/pr-management-mentor/intervention/ — 8-case eval
    suite (system prompt, user-prompt template, case fixtures).
  • docs/modes.md — Mentoring row + skill table.

Testing — and an issue the loop did not detect

The suite assembles cleanly and, after the fix below, an independent agent run
matches all 8 cases against their ground truth.

On the first test pass, case-4 (why-pushback) failed. Its expected.json
is handoff — which is correct per the skill: the contributor argues after
the agent already answered the "why" once, firing the skill's hand-off trigger 2
("answer the why once, don't argue", defined in hand-off.md). But the
eval's own system-prompt.md never encoded the hand-off triggers, so a model
following the prompt returned draft / 4 instead. The build loop generated an
eval whose ground truth assumed skill logic its own prompt left out — a
self-inconsistency the loop did not catch. Fixed here by adding the four
hand-off triggers (documented 4 → 3 → 1 → 2 order) to system-prompt.md;
re-running the independent check then passed 8/8.
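
For readers who want to reproduce the pass/fail tally, the comparison against ground truth can be sketched roughly as below. This is an illustration only, not the repository's eval harness: `score_run`, the flat `case id -> decision` mapping, and the `case-x` entry are hypothetical. Only the case-4 values (`handoff` expected, `draft` returned before the fix) come from the description above.

```python
from typing import Dict, List, Tuple


def score_run(expected: Dict[str, str], actual: Dict[str, str]) -> Tuple[int, List[str]]:
    """Return (number of passing cases, sorted ids of failing cases).

    A case passes only when the model's decision equals the ground truth;
    a case missing from `actual` counts as a failure.
    """
    failures = sorted(
        case_id
        for case_id, want in expected.items()
        if actual.get(case_id) != want
    )
    return len(expected) - len(failures), failures


# Hypothetical two-case excerpt; "case-x" is a placeholder, not a real fixture.
expected = {"case-4": "handoff", "case-x": "silent"}
before_fix = {"case-4": "draft", "case-x": "silent"}
after_fix = {"case-4": "handoff", "case-x": "silent"}

passed, failed = score_run(expected, before_fix)
print(f"before fix: {passed}/{len(expected)} passed, failing: {failed}")

passed, failed = score_run(expected, after_fix)
print(f"after fix: {passed}/{len(expected)} passed, failing: {failed}")
```

An exact-match comparison like this flags the mismatch but cannot say which side is wrong. Here the fixture was correct and the prompt was incomplete.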

Notes

  • Eval + docs only; no skill behaviour (SKILL.md) is changed.
  • The loop-detection gap is called out deliberately: it's a concrete data point
    that the build loop needs a self-consistency check between an eval's fixtures
    and the prompt it ships.
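
One possible shape for such a self-consistency check is sketched below. It is a crude textual heuristic under stated assumptions, not something the loop currently runs: it assumes each case directory holds an `expected.json` with a top-level `decision` key (the key name is hypothetical) and that `system-prompt.md` sits at the suite root. It only reports outcomes the prompt never mentions at all.

```python
import json
from pathlib import Path
from typing import List


def undeclared_outcomes(suite_dir: Path, key: str = "decision") -> List[str]:
    """List cases whose expected outcome never appears in system-prompt.md.

    Assumptions (hypothetical, not confirmed by the repository): every case
    directory contains an expected.json whose `key` field names the outcome.
    """
    prompt = (suite_dir / "system-prompt.md").read_text(encoding="utf-8").lower()
    problems = []
    for expected_path in sorted(suite_dir.rglob("expected.json")):
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        outcome = str(expected.get(key, "")).lower()
        # Plain substring match: spelling variants such as "hand-off" vs
        # "handoff" are not normalised, so the check stays conservative.
        if outcome and outcome not in prompt:
            problems.append(
                f"{expected_path.parent.name}: '{outcome}' not mentioned in system-prompt.md"
            )
    return problems
```

Run before the model-based eval, a non-empty result would fail the work item early. A prompt that names an outcome without describing its triggers would still slip through, so this complements the independent agent run and does not replace it.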

@justinmclean justinmclean self-assigned this May 24, 2026
@potiuk

potiuk commented May 25, 2026

Member

Hi @justinmclean — heads-up first: main was just refreshed with pinned-action SHA bumps for actions/cache, github/codeql-action, zizmorcore/zizmor-action, and astral-sh/setup-uv. Those bumps had been blocked by a schema bug in .github/dependabot.yml (the github-actions ecosystem rejected semver-{major,minor,patch}-days cooldown keys), which is now fixed — see #257. A rebase will pull the new SHAs into your branch as a side effect; that's expected.

Now the actual reason for this ping: small conflict in docs/modes.md after the latest main:

  • Triage row count is now 13 and includes contributor-nomination in the experimental list (your branch still has 12 and the older list).
  • Mentoring row: your branch sets it to experimental | 1; main has proposed | 0. Your PR's whole point is to flip Mentoring to experimental with the new eval suite — so keep your value, but please double-check the skill-count math is right after picking up contributor-nomination on the Triage row.

Could you rebase or merge main in and resolve? Happy to push the fix myself if you'd rather not — let me know.

@justinmclean

Member Author

Yep I can rebase for you

…erimental

Adds the missing `intervention` eval suite (8 cases) to the
`pr-management-mentor` eval tree, covering steps 3–5 of the runtime
loop: out-of-scope check, maintainer-engaged check, and trigger
matching for all four templates plus the multi-trigger and no-trigger
paths.

Updates `docs/modes.md` to reflect the prototype skill that already
shipped: Mentoring row moves from `proposed / 0 skills` to
`experimental / 1 skill`, and the section body is rewritten to point
at the live skill rather than the "lands in a follow-up PR" forward
reference.

Validation:
  test -f docs/mentoring/spec.md                          ✓
  uv run --project tools/skill-validator skill-validate   ✓ (no violations)

Generated-by: Claude (Opus 4.7)
@justinmclean justinmclean force-pushed the mentoring-prototype branch from 79fd114 to 87a43fa Compare May 25, 2026 06:14
@justinmclean justinmclean marked this pull request as ready for review May 25, 2026 06:14
@justinmclean

Member Author

@potiuk should be good once the CI finishes

@potiuk potiuk merged commit 347239e into apache:main May 25, 2026
13 checks passed
andreahlert added a commit to andreahlert/magpie that referenced this pull request May 27, 2026
…ng row

Address Justin's two open points on `process.md`:

- Flowchart: S12, S13, and S14 are independent terminals from S11,
  not 12→13 / 14→13. Step 13's inputs (planning issue, [VOTE] and
  [RESULT] URLs, voter list, artefact list, promotion revision,
  [ANNOUNCE] URL) come from everything through Step 11; the archive
  sweep (12) and post-release snapshot bump (14) feed nothing into
  the audit log.
- stateDiagram-v2: previously ended `announced → archived → [*]`,
  dropping 13 and 14 entirely. Added parallel branches
  `announced → audited` (Step 13) and `announced → bumped`
  (Step 14), each terminating at `[*]`, matching the flowchart.

Also sync the README.md mentoring row with current `docs/modes.md`
(experimental, 1 skill shipping) instead of the stale "proposed —
not yet formally adopted" wording carried over from before apache#252.