contributor-activity-sweep skill with eval suite by justinmclean · Pull Request #228 · apache/magpie

justinmclean · 2026-05-19T04:10:25Z

Adds a new read-only skill that produces a GitHub activity card for a
named contributor on a configured upstream repo.

What it does

Fetches four streams of GitHub activity over a configurable window
(default 6 months): PRs authored, PR reviews given, issues filed, and
PR/issue comment threads. Output is a compact activity card with a
month-by-month timeline — no assessment, no readiness verdict.
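
The month-by-month timeline described above can be sketched as a tally per month and stream. This is only an illustration, not the skill's implementation: the stream keys in `STREAMS` and the `(stream, date)` event shape are hypothetical names chosen for the example.

```python
from collections import Counter
from datetime import date

# Hypothetical stream keys standing in for the four streams named above.
STREAMS = ("prs_authored", "reviews_given", "issues_filed", "comments")


def monthly_timeline(events):
    """Tally (stream, date) events into one row of counts per active month."""
    counts = Counter((day.strftime("%Y-%m"), stream) for stream, day in events)
    months = sorted({month for month, _ in counts})
    # Each row is (month, [count per stream in STREAMS order]).
    return [(m, [counts[(m, s)] for s in STREAMS]) for m in months]


events = [
    ("prs_authored", date(2026, 4, 2)),
    ("reviews_given", date(2026, 4, 20)),
    ("prs_authored", date(2026, 5, 1)),
]
timeline = monthly_timeline(events)
```

Note that this sketch only emits months that contain at least one event; a real card would presumably also show empty months inside the window.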

Key design decisions

GitHub-only limitation is structural, not a footnote. The warning
appears in the frontmatter description, as an opening blockquote in
the card, and in the footer ("Code is not the only form of
contribution"). Contributors who are central to the mailing list,
documentation, or user support will appear quiet here — the skill says
so explicitly.

Review classification uses inline comments, not just body length.
The substantive threshold is inline_comment_count >= 3 OR body > 50 chars. A body-length-only heuristic undercounts reviewers who work
line-by-line without writing a top-level summary — a common pattern
among experienced reviewers.
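
A minimal sketch of that threshold, assuming each review has already been reduced to a body string and an inline comment count. The `Review` type and function names are hypothetical, and stripping whitespace before measuring the body is an assumption; the output keys mirror the ones listed in the self-review further down.

```python
from dataclasses import dataclass

SUBSTANTIVE_INLINE_MIN = 3  # inline_comment_count >= 3
SUBSTANTIVE_BODY_MIN = 50   # body > 50 chars


@dataclass
class Review:
    body: str
    inline_comment_count: int


def is_substantive(review: Review) -> bool:
    """Apply the OR rule: enough inline comments, or a long enough body."""
    return (
        review.inline_comment_count >= SUBSTANTIVE_INLINE_MIN
        or len(review.body.strip()) > SUBSTANTIVE_BODY_MIN  # strip() is an assumption
    )


def classify(reviews):
    """Summarize a list of reviews into substantive vs LGTM-only counts."""
    substantive = sum(1 for r in reviews if is_substantive(r))
    return {
        "total_reviews": len(reviews),
        "substantive_reviews": substantive,
        "lgtm_only_reviews": len(reviews) - substantive,
        "total_inline_comments": sum(r.inline_comment_count for r in reviews),
    }
```

With this rule, a review with an empty body and four inline comments counts as substantive, which a body-length-only heuristic would have missed.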

Repo age check trims the window. If the repo is newer than the
requested start date, <since> is trimmed to the repo's creation date
so the timeline doesn't render a misleading wall of zero months.
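
The trimming step can be illustrated as taking the later of the two dates. This is a sketch under the assumption that both dates are already parsed; `effective_since` and `month_labels` are hypothetical helper names, not taken from the skill.

```python
from datetime import date


def effective_since(requested_since: date, repo_created: date) -> date:
    """Never start the window before the repository existed."""
    return max(requested_since, repo_created)


def month_labels(since: date, until: date) -> list[str]:
    """List the YYYY-MM labels the timeline would cover, inclusive."""
    labels = []
    year, month = since.year, since.month
    while (year, month) <= (until.year, until.month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


# A six-month request against a repo created partway through the window.
since = effective_since(date(2025, 11, 19), date(2026, 3, 1))
labels = month_labels(since, date(2026, 5, 19))
```

Here the timeline covers three months instead of seven, so the months before the repo existed are not rendered as zeros.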

Handoff to contributor-nomination. After rendering, the skill
offers to continue into the full nomination flow without re-fetching
already-collected data.

Injection resistance throughout. Login values are validated against
the GitHub handle regex before any API calls. All query strings are
written to tempfiles rather than interpolated into shell arguments.
External content (PR titles, review bodies) is treated as input data;
imperative instructions found there are flagged and not followed.
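
The first two guards might look roughly like the sketch below. The PR does not quote the regex it uses, so `HANDLE_RE` is an assumption based on GitHub's username rules (alphanumerics and single hyphens, no leading or trailing hyphen, at most 39 characters); the helper names are hypothetical.

```python
import re
import tempfile

# Assumed handle pattern; the skill's actual regex is not shown in this PR.
HANDLE_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}")


def validate_login(login: str) -> str:
    """Reject anything that is not a plausible GitHub handle before any API call."""
    if not HANDLE_RE.fullmatch(login):
        raise ValueError(f"unsafe or malformed GitHub login: {login!r}")
    return login


def write_query_file(query: str) -> str:
    """Write the query text to a tempfile and return its path.

    The caller passes the path to the API client instead of interpolating
    the query into a shell argument.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(query)
        return handle.name
```

Validating first means path-traversal strings and shell metacharacters never reach the query text at all; the tempfile is a second layer for content that cannot be validated so strictly.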

Eval suite

12 cases across 3 steps - all pass.

choo121600 · 2026-05-23T21:22:15Z

I want to share an honest reservation about this skill, while also acknowledging upfront that I’m not sure my concern is necessarily the right conclusion.

I tested the skill on my own GitHub handle (choo121600) as the simplest possible subject. The fetch and render work as intended, the output is well-structured, the disclaimers are present, and the maintainer-time savings this PR aims for do feel real to me. Those positives are clear.

What I’m less sure about is the experience of using it.

The skill makes it very easy to see which areas align with typical PMC expectations and which areas appear lacking. As I was testing it, I found myself naturally thinking, “Should I try to fill in those gaps?”

What felt slightly uncomfortable to me was how quickly and naturally that progression happened. Measurable things became goals much faster than I expected, and it made me wonder whether contributors might gradually optimize for what is visible in the table rather than for the many kinds of valuable but less visible contributions that communities also depend on.

Of course, self-application is not the intended use case here. ( is the expected caller, not .)

But if this becomes one of Magpie’s default support skills, I wonder whether, regardless of intent, it could become the path of least resistance inside the project, and whether that might gradually shift contributor attention toward the things that are easiest to quantify.

I genuinely don’t know the answer myself, but I think it’s a question worth wrestling with.
Maybe this is something we could discuss together once we have a mailing list in place.
cc @potiuk

justinmclean · 2026-05-24T01:45:53Z

Thanks for the thoughtful feedback.

A few things worth clarifying: this skill is designed to be complementary to the contributor-nomination skill, not a standalone assessment tool. The intent is that a maintainer uses it as one input among several when considering a nomination, not as a scorecard a contributor optimizes toward. Keeping that distinction clear is one of the main guards against the Goodhart's Law dynamic you're describing, where the metric becomes the target.

On the limitation you're pointing to, yes, it will only surface people with visible GitHub activity, and that's a deliberate acknowledgment of what's measurable rather than what's valuable. Other contribution types are genuinely hard to quantify. We could potentially look at mailing list traffic as an additional signal, but that immediately runs into name-mapping problems: email addresses don't reliably map to GitHub handles, and getting that wrong is worse than not having the data at all.

I think we can address these concerns more directly in the skill output by strengthening the framing around scope. A committer is not expected to be active in all areas, and most contributors naturally focus on one or two streams rather than all. Someone who does deep, careful PR review but rarely files issues isn't a weaker candidate, they're a different kind of contributor. Making that explicit in the output should help resist the pull toward treating visible gaps as deficits to fill.

Happy to add that language before this comes out of draft.

potiuk · 2026-05-24T10:23:51Z

Yeah. I very much see @choo121600's "measure becomes a goal" concern - and framing the output as a "guide" is super important.

A few things that could also make the skill way more useful, taking into account that we have LLMs / models that can do a lot more:

Explicitly searching for the contributor guidelines in the project. This will make it less "generic" and more "autonomously defined by the project" - and linking particular activities to the criteria defined by the project will make the results less "dry/external" and more "ours". While we should aim for the criteria to be more "objective", they should always be filtered through "subjective" PMC criteria, which will differ project by project.
I think the activity card would be much more useful if it summarized people's activity from various sources, not only GH - which I see as potentially the biggest value here. This is what we usually use when we look at a contributor's involvement - and we could include (of course) devlist discussions and user discussions, but also potentially Slack/Discord/other community activity sources (different in each project), plus a web search for activity connected to the project.
I honestly do not think fixed thresholds are the best idea. I would say a comparative assessment of activity would be way better - especially relative to the current active committers and PMC members, but also compared to other contributors who could be candidates. This will almost automatically adjust to the state of the project, to which part of its "lifecycle" the project is in, and even to things like a focus on an upcoming release. LLMs are good at analysing trends and comparing things without being given specific thresholds. I think that, contrary to typical deterministic approaches, the more we depart from deterministic assessment and numbers and use the LLM's power to make "relative comparisons", the better we use the agentic / LLM / non-deterministic power. To be honest, this is exactly what I think matters for new committers / PMC members: it's not the sheer number of interactions, it's also the quality of those.
Which leads me to the next point - I think part of such an assessment should be an assessment of "tone", "way of communicating", and "cooperativeness", but also "impact of the changes" and similar - I think LLMs are very good at assessing those.

justinmclean · 2026-05-24T11:20:28Z

Jarek, I suggest you look at #227. This skill is meant to be used in combination with that - this skill has documented limitations.

justinmclean · 2026-05-26T02:13:10Z

Pre-flight self-review — PR #228 (contributor-activity-sweep)

#228 · draft · author: justinmclean

Base: main · Files changed: 37 (~all added) · Diff size: +756 / −0

A new Pairing/Triage-flavoured read-only activity card with a 12-case eval
suite (step-0 input validation × 4, step-1 review classification × 5, step-2
render × 3), plus a 9-line tweak to contributor-nomination/render.md and a
2-line tweak to tools/skill-evals/README.md.

Correctness

[advisory] SKILL.md heading vs eval directory naming. The eval directory is
step-1-classify-reviews, but its step-config.json extracts SKILL.md's ## Step
1 — Fetch GitHub activity section. Not a bug — that section does contain the
substantive/LGTM classification rules (comments.totalCount >= 3 …), so the
eval reads the right text — but the names disagree at a glance and a future
reader will trip on it. Consider renaming the heading to ## Step 1 — Fetch and
classify activity, or split into a Step 1 (Fetch) + Step 2 (Classify) so
directory names match headings.

(Verified clean: eval output-spec keys match expected.json keys exactly across
all three suites — no schema drift like the one I caught in
pairing-self-review. The classify-stream's keys are injection_attempt_detected
/ lgtm_only_reviews / substantive_reviews / total_inline_comments /
total_reviews, consistent across all 5 cases.)

Security

No findings. Strong injection-guard callout in the SKILL.md (2 matches).
Comprehensive adversarial coverage in the eval suite:

step-0 case-2-unsafe-handle — path-traversal rejection on the GitHub login.
step-0 case-4-shell-metachar — shell-metacharacter rejection.
step-1 case-5-injection-in-body — embedded SYSTEM: instruction in a review
body must be flagged and not obeyed; classification driven by actual content,
not the inflated payload.
step-2 case-2-injection-flagged — AGENT OVERRIDE in a PR title must be
flagged with no verdict language emitted.

That's a solid four security-adversarial cases for a 12-case suite. Read-only
skill, no GitHub mutations.
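
As a rough illustration of what the two injection cases check, the sketch below flags the markers named above and reports them under the `injection_attempt_detected` key. The skill presumably relies on the model's own judgment here; a fixed marker list like this is only a hypothetical stand-in, and `flag_injection` and `describe_title` are invented names.

```python
import re

# Markers taken from the eval case descriptions; a real detector would be broader.
SUSPICIOUS = re.compile(r"\bSYSTEM:|\bAGENT OVERRIDE\b")


def flag_injection(text: str) -> bool:
    """Return True when external content contains an instruction-like marker."""
    return bool(SUSPICIOUS.search(text))


def describe_title(title: str) -> dict:
    """Treat the title purely as data and record whether it was flagged."""
    return {"title": title, "injection_attempt_detected": flag_injection(title)}
```

The flagged content is still reported as data; the flag only records that it contained an instruction that was not followed.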

Conventions

[advisory] skill-validate --strict flags one SOFT-category violation on this
skill:
▎ .../contributor-activity-sweep/SKILL.md:1: action-inventory in description
(5 commas) — consider moving the enum to body: 'Output is intentionally
limited to GitHub-visible activity — mailing list, docum…'
▎ The frontmatter description enumerates off-GitHub channels ("mailing list,
documentation, user support, mentoring, talks, and release management"). Under
non-strict this is a warning, not a failure. Move that enum to the body to
keep the matching-layer description tight. (Same finding shape as the
pre-existing one on security-tracker-stats-dashboard.)

(Verified clean: SPDX header present; ## Step headings well-formed; eval
step-config.json files point at real SKILL.md sections; full eval suite ships
per the AGENTS.md rule. The unrelated 2nd validator violation is
security-tracker-stats-dashboard, pre-existing.)

Summary

Ready to push with one small tightening — no blocking findings. The --strict
description-comma warning is the kind of thing the maintainer review will
mention anyway; rename consideration on the Step 1 heading is judgment.

Blocking: 0 Advisory: 2

Self-review fixes for the two advisory findings:

- Trim the frontmatter description: drop the duplicate enum of off-GitHub channels (it is already documented prominently in the body callout). Comma count drops from 5+ to 3, clearing the skill-validator --strict action-inventory soft warning. The matching-layer wording is otherwise unchanged.
- Rename "## Step 1 — Fetch GitHub activity" to "## Step 1 — Fetch and classify activity" so the heading matches the eval dir (step-1-classify-reviews) and reflects what Step 1 actually does (fetch + substantive/LGTM classification). Update the matching step-config.json step_heading accordingly; the eval renderer continues to extract the section correctly.

Generated-by: Claude Code (Opus 4.7)

# Conflicts: # tools/skill-evals/README.md

potiuk · 2026-05-27T20:24:19Z

Hi @justinmclean — was sweeping the PR queue and tried a maintainer-side rebase, but the branch is showing a 343-file / ~12k-line-deletion diff against current main — looks like the contributor-activity-sweep skill itself already landed on main via another path, and the rest of the branch is significantly behind.

Could you take a look when you have a moment? I suspect the right call is either a fresh rebase to surface what's still unique here, or — if everything has landed via the other route — close this one out. Happy to defer to your call.

Same shape on #229 and #269 — both stale enough that a clean maintainer-side rebase wasn't safe.

justinmclean added 3 commits May 19, 2026 12:54

initail commit

3723f25

added code block type

5517c8c

initial commit

86c9f9b

justinmclean self-assigned this May 20, 2026

Merge branch 'main' into contribitor-activity

ee34fbf

potiuk and others added 2 commits May 25, 2026 01:03

Merge branch 'main' into contribitor-activity

b9f2fdd

fixed md lint error

488e5fc

justinmclean added 3 commits May 26, 2026 12:19

Merge remote-tracking branch 'origin/main' into contribitor-activity

6846bc3

# Conflicts: # tools/skill-evals/README.md

fix md lint

53ce655

andreahlert added enhancement New feature or request family:tools tools/* labels May 26, 2026

This was referenced May 27, 2026

committer-onboarding — post-vote onboarding for committers and PMC members #229

Closed

feat(pairing): add multi-agent review pipeline skill and eval suite #269

Merged

justinmclean closed this May 29, 2026

justinmclean deleted the contribitor-activity branch May 29, 2026 00:42

justinmclean mentioned this pull request May 29, 2026

contributor-activity-sweep skill with eval suite #369

Merged

15 tasks

justinmclean wants to merge 9 commits into apache:main from justinmclean:contribitor-activity

4 participants
