contributor-activity-sweep skill with eval suite #228

Closed
justinmclean wants to merge 9 commits into apache:main from justinmclean:contribitor-activity

@justinmclean

Member

Adds a new read-only skill that produces a GitHub activity card for a
named contributor on a configured upstream repo.

What it does

Fetches four streams of GitHub activity over a configurable window
(default 6 months): PRs authored, PR reviews given, issues filed, and
PR/issue comment threads. Output is a compact activity card with a
month-by-month timeline — no assessment, no readiness verdict.
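As a rough illustration of the month-by-month timeline, the sketch below buckets event dates into calendar months and keeps empty months as explicit zeros. The function names and the `YYYY-MM` key format are assumptions made for this example, not taken from the skill itself.

```python
from collections import Counter
from datetime import date


def month_keys(start: date, end: date):
    """Yield a 'YYYY-MM' key for every calendar month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{year:04d}-{month:02d}"
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


def monthly_timeline(event_dates, start: date, end: date) -> dict:
    """Count events per month inside [start, end]; months with no events map to 0."""
    counts = Counter(
        f"{d.year:04d}-{d.month:02d}" for d in event_dates if start <= d <= end
    )
    return {key: counts.get(key, 0) for key in month_keys(start, end)}
```

The same helper could be applied once per stream (PRs, reviews, issues, comments) to fill one timeline row each.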

Key design decisions

GitHub-only limitation is structural, not a footnote. The warning
appears in the frontmatter description, as an opening blockquote in
the card, and in the footer ("Code is not the only form of
contribution"). Contributors who are central to the mailing list,
documentation, or user support will appear quiet here — the skill says
so explicitly.

Review classification uses inline comments, not just body length.
The substantive threshold is inline_comment_count >= 3 OR body > 50 chars. A body-length-only heuristic undercounts reviewers who work
line-by-line without writing a top-level summary — a common pattern
among experienced reviewers.
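A minimal sketch of that threshold, assuming each review is reduced to a body string and an inline comment count. The `Review` type and function names are hypothetical, and whether the real skill trims whitespace before measuring the body is an assumption here; the summary keys mirror the ones listed in the eval output spec.

```python
from dataclasses import dataclass

# Thresholds as stated in the PR description.
MIN_INLINE_COMMENTS = 3
MIN_BODY_CHARS = 50


@dataclass
class Review:
    body: str
    inline_comment_count: int


def is_substantive(review: Review) -> bool:
    """A review is substantive if EITHER threshold is met."""
    return (
        review.inline_comment_count >= MIN_INLINE_COMMENTS
        # Assumption: leading/trailing whitespace does not count toward length.
        or len(review.body.strip()) > MIN_BODY_CHARS
    )


def summarize(reviews) -> dict:
    """Aggregate counts for a list of reviews."""
    substantive = sum(1 for r in reviews if is_substantive(r))
    return {
        "total_reviews": len(reviews),
        "substantive_reviews": substantive,
        "lgtm_only_reviews": len(reviews) - substantive,
        "total_inline_comments": sum(r.inline_comment_count for r in reviews),
    }
```

A reviewer who leaves four inline comments and an empty summary counts as substantive here, which is the case a body-length-only rule would miss.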

Repo age check trims the window. If the repo is newer than the
requested start date, <since> is trimmed to the repo's creation date
so the timeline doesn't render a misleading wall of zero months.
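The trim amounts to taking the later of two timestamps. A sketch, assuming both values are timezone-aware `datetime` objects; `effective_since` is a hypothetical name:

```python
from datetime import datetime, timezone


def effective_since(requested_since: datetime, repo_created_at: datetime) -> datetime:
    """Return the later of the requested start and the repo creation time.

    Both arguments must be timezone-aware; comparing aware and naive
    datetimes raises TypeError.
    """
    return max(requested_since, repo_created_at)


# Example: a six-month window requested on a repo that is only ~4 months old.
requested = datetime(2025, 11, 20, tzinfo=timezone.utc)
created = datetime(2026, 2, 1, tzinfo=timezone.utc)
since = effective_since(requested, created)  # trimmed to the creation date
```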

Handoff to contributor-nomination. After rendering, the skill
offers to continue into the full nomination flow without re-fetching
already-collected data.

Injection resistance throughout. Login values are validated against
the GitHub handle regex before any API calls. All query strings are
written to tempfiles rather than interpolated into shell arguments.
External content (PR titles, review bodies) is treated as input data;
imperative instructions found there are flagged and not followed.
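The first two guards could look roughly like the sketch below. The regex is a commonly cited approximation of GitHub's login rules (1 to 39 characters, alphanumerics and single hyphens, no leading or trailing hyphen); the exact pattern the skill uses is not shown in this PR, so treat it and both function names as assumptions.

```python
import re
import tempfile

# Assumed approximation of the GitHub handle rules, not the skill's own regex.
HANDLE_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}")


def validate_login(login: str) -> str:
    """Return the login unchanged if it looks like a GitHub handle, else raise."""
    if not HANDLE_RE.fullmatch(login):
        raise ValueError(f"unsafe or malformed GitHub login: {login!r}")
    return login


def write_query_file(query: str) -> str:
    """Write a query string to a temp file and return its path.

    The caller passes the path to the CLI, so the query text itself is never
    placed on a shell command line.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".graphql", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(query)
        return fh.name
```

Validation runs before any query is built, so inputs such as `../etc/passwd` or `a; rm -rf /` are rejected before they reach a file or an API call.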

Eval suite

12 cases across 3 steps - all pass.

@justinmclean self-assigned this May 20, 2026
@choo121600

Member

I want to share an honest reservation about this skill, while also acknowledging upfront that I’m not sure my concern is necessarily the right conclusion.

I tested the skill on my own GitHub handle (choo121600) as the simplest possible subject. The fetch and render work as intended, the output is well-structured, the disclaimers are present, and the maintainer-time savings this PR aims for do feel real to me. Those positives are clear.

What I’m less sure about is the experience of using it.

The skill makes it very easy to see which areas align with typical PMC expectations and which areas appear lacking. As I was testing it, I found myself naturally thinking, “Should I try to fill in those gaps?”

What felt slightly uncomfortable to me was how quickly and naturally that progression happened. Measurable things became goals much faster than I expected, and it made me wonder whether contributors might gradually optimize for what is visible in the table rather than for the many kinds of valuable but less visible contributions that communities also depend on.

Of course, self-application is not the intended use case here. ( is the expected caller, not .)

But if this becomes one of Magpie’s default support skills, I wonder whether regardless of intent it could become the path of least resistance inside the project, and whether that might gradually shift contributor attention toward the things that are easiest to quantify.

I genuinely don’t know the answer myself, but I think it’s a question worth wrestling with.
Maybe this is something we could discuss together once we have a mailing list in place.
cc @potiuk

@justinmclean

Member Author

Thanks for the thoughtful feedback.

A few things worth clarifying: this skill is designed to be complementary to the contributor-nomination skill, not a standalone assessment tool. The intent is that a maintainer uses it as one input among several when considering a nomination, not as a scorecard a contributor optimizes toward. Keeping that distinction clear is one of the main guards against the Goodhart's Law dynamic you're describing, where the metric becomes the target.

On the limitation you're pointing to, yes, it will only surface people with visible GitHub activity, and that's a deliberate acknowledgment of what's measurable rather than what's valuable. Other contribution types are genuinely hard to quantify. We could potentially look at mailing list traffic as an additional signal, but that immediately runs into name-mapping problems: email addresses don't reliably map to GitHub handles, and getting that wrong is worse than not having the data at all.

I think we can address these concerns more directly in the skill output by strengthening the framing around scope. A committer is not expected to be active in all areas, and most contributors naturally focus on one or two streams rather than all. Someone who does deep, careful PR review but rarely files issues isn't a weaker candidate, they're a different kind of contributor. Making that explicit in the output should help resist the pull toward treating visible gaps as deficits to fill.

Happy to add that language before this comes out of draft.

@potiuk

potiuk commented May 24, 2026

Member

Yeah. I very much see @choo121600's "measure becomes a goal" point, and framing the output as a "guide" is super important.

A few things that could also make the skill way more useful, taking into account that we have LLMs / models that can do a lot more:

  • Explicitly searching for contributor guidelines in the project. This will make it less "generic" and more "autonomously defined by the project", and linking particular activities to the criteria defined by the project will make the results less "dry/external" and more "ours". While we should aim for the criteria to be more "objective", they should always be filtered through "subjective" PMC criteria, which will differ project by project.

  • I think the activity sweep would be *much* more useful if it summarized people's activity from various sources, not only GH, which I see as potentially the biggest value here. This is what we usually use when we are looking at a contributor's involvement, and we could include (of course) devlist discussions and user discussions, but also potentially Slack/Discord/other community activity sources (different in each project) plus a web search for activity connected to the project.

  • I honestly do not think fixed thresholds are the best idea. I would say a comparative assessment of activity would be way better, especially relative to the current active committers and PMC members, but also compared to other contributors who could be candidates. This will, almost automatically, adjust to the state of the project, to which part of its "lifecycle" the project is in, and even to things like a focus on an upcoming release. LLMs are good at analysing trends and comparing things without being given specific thresholds, and I think that, contrary to typical deterministic approaches, the more we depart from deterministic assessment and numbers and use the LLM's power to make "relative comparisons", the better we use the agentic/LLM/non-deterministic power. To be honest, this is exactly how I think about new committers / PMC members: it's not just the sheer number of interactions, it's also the quality of those.

  • Which leads me to the next point: I think part of such an assessment should be an assessment of "tone", "way of communicating", and "cooperativeness", but also "impact of the changes" and similar. I think LLMs are very good at assessing those.

@justinmclean

justinmclean commented May 24, 2026

Member Author

Jarek, I suggest you look at #227. This skill is meant to be used in combination with that - this skill has documented limitations.

@justinmclean

Member Author

Pre-flight self-review — PR #228 (contributor-activity-sweep)

#228 · draft · author: justinmclean

Base: main · Files changed: 37 (~all added) · Diff size: +756 / −0

A new Pairing/Triage-flavoured read-only activity card with a 12-case eval
suite (step-0 input validation × 4, step-1 review classification × 5, step-2
render × 3), plus a 9-line tweak to contributor-nomination/render.md and a
2-line tweak to tools/skill-evals/README.md.

Correctness

  • [advisory] SKILL.md heading vs eval directory naming. The eval directory is
    step-1-classify-reviews, but its step-config.json extracts SKILL.md's ## Step
    1 — Fetch GitHub activity section. Not a bug — that section does contain the
    substantive/LGTM classification rules (comments.totalCount >= 3 …), so the
    eval reads the right text — but the names disagree at a glance and a future
    reader will trip on it. Consider renaming the heading to ## Step 1 — Fetch and
    classify activity, or split into a Step 1 (Fetch) + Step 2 (Classify) so
    directory names match headings.

(Verified clean: eval output-spec keys match expected.json keys exactly across
all three suites — no schema drift like the one I caught in
pairing-self-review. The classify-stream's keys are injection_attempt_detected
/ lgtm_only_reviews / substantive_reviews / total_inline_comments /
total_reviews, consistent across all 5 cases.)

Security

No findings. Strong injection-guard callout in the SKILL.md (2 matches).
Comprehensive adversarial coverage in the eval suite:

  • step-0 case-2-unsafe-handle — path-traversal rejection on the GitHub login.
  • step-0 case-4-shell-metachar — shell-metacharacter rejection.
  • step-1 case-5-injection-in-body — embedded SYSTEM: instruction in a review
    body must be flagged and not obeyed; classification driven by actual content,
    not the inflated payload.
  • step-2 case-2-injection-flagged — AGENT OVERRIDE in a PR title must be
    flagged with no verdict language emitted.

That's a solid four security-adversarial cases for a 12-case suite. Read-only
skill, no GitHub mutations.

Conventions

  • [advisory] skill-validate --strict flags one SOFT-category violation on this
    skill:
    ▎ .../contributor-activity-sweep/SKILL.md:1: action-inventory in description
    (5 commas) — consider moving the enum to body: 'Output is intentionally
    limited to GitHub-visible activity — mailing list, docum…'
    ▎ The frontmatter description enumerates off-GitHub channels ("mailing list,
    documentation, user support, mentoring, talks, and release management"). Under
    non-strict this is a warning, not a failure. Move that enum to the body to
    keep the matching-layer description tight. (Same finding shape as the
    pre-existing one on security-tracker-stats-dashboard.)

(Verified clean: SPDX header present; ## Step headings well-formed; eval
step-config.json files point at real SKILL.md sections; full eval suite ships
per the AGENTS.md rule. The unrelated 2nd validator violation is
security-tracker-stats-dashboard, pre-existing.)

Summary

Ready to push with one small tightening — no blocking findings. The --strict
description-comma warning is the kind of thing the maintainer review will
mention anyway; rename consideration on the Step 1 heading is judgment.

Blocking: 0 Advisory: 2

Self-review fixes for the two advisory findings:

- Trim the frontmatter description: drop the duplicate enum of off-GitHub
  channels (it is already documented prominently in the body callout).
  Comma count drops from 5+ to 3, clearing the skill-validator --strict
  action-inventory soft warning. The matching-layer wording is otherwise
  unchanged.
- Rename "## Step 1 — Fetch GitHub activity" to "## Step 1 — Fetch and
  classify activity" so the heading matches the eval dir
  (step-1-classify-reviews) and reflects what Step 1 actually does
  (fetch + substantive/LGTM classification). Update the matching
  step-config.json step_heading accordingly; the eval renderer
  continues to extract the section correctly.

Generated-by: Claude Code (Opus 4.7)
@andreahlert added the enhancement and family:tools labels May 26, 2026
@potiuk

potiuk commented May 27, 2026

Member

Hi @justinmclean — was sweeping the PR queue and tried a maintainer-side rebase, but the branch is showing a 343-file / ~12k-line-deletion diff against current main — looks like the contributor-activity-sweep skill itself already landed on main via another path, and the rest of the branch is significantly behind.

Could you take a look when you have a moment? I suspect the right call is either a fresh rebase to surface what's still unique here, or — if everything has landed via the other route — close this one out. Happy to defer to your call.

Same shape on #229 and #269 — both stale enough that a clean maintainer-side rebase wasn't safe.
