contributor-activity-sweep skill with eval suite #228
|
I want to share an honest reservation about this skill, while also acknowledging upfront that I’m not sure my concern is necessarily the right conclusion. I tested the skill on my own GitHub handle (choo121600) as the simplest possible subject. The fetch and render work as intended, the output is well-structured, the disclaimers are present, and the maintainer-time savings this PR aims for do feel real to me. Those positives are clear. What I’m less sure about is the experience of using it. The skill makes it very easy to see which areas align with typical PMC expectations and which areas appear lacking. As I was testing it, I found myself naturally thinking, “Should I try to fill in those gaps?” What felt slightly uncomfortable to me was how quickly and naturally that progression happened. Measurable things became goals much faster than I expected, and it made me wonder whether contributors might gradually optimize for what is visible in the table rather than for the many kinds of valuable but less visible contributions that communities also depend on. Of course, self-application is not the intended use case here. ( is the expected caller, not .) But if this becomes one of Magpie’s default support skills, I wonder whether, regardless of intent, it could become the path of least resistance inside the project, and whether that might gradually shift contributor attention toward the things that are easiest to quantify. I genuinely don’t know the answer myself, but I think it’s a question worth wrestling with. |
|
Thanks for the thoughtful feedback. A few things worth clarifying: this skill is designed to be complementary to the contributor-nomination skill, not a standalone assessment tool. The intent is that a maintainer uses it as one input among several when considering a nomination, not as a scorecard a contributor optimizes toward. Keeping that distinction clear is one of the main guards against the Goodhart's Law dynamic you're describing, where the metric becomes the target. On the limitation you're pointing to: yes, it will only surface people with visible GitHub activity, and that's a deliberate acknowledgment of what's measurable rather than what's valuable. Other contribution types are genuinely hard to quantify. We could potentially look at mailing list traffic as an additional signal, but that immediately runs into name-mapping problems: email addresses don't reliably map to GitHub handles, and getting that wrong is worse than not having the data at all. I think we can address these concerns more directly in the skill output by strengthening the framing around scope. A committer is not expected to be active in all areas, and most contributors naturally focus on one or two streams rather than all four. Someone who does deep, careful PR review but rarely files issues isn't a weaker candidate; they're a different kind of contributor. Making that explicit in the output should help resist the pull toward treating visible gaps as deficits to fill. Happy to add that language before this comes out of draft. |
|
Yeah. I very much see @choo121600's "measure becomes a goal" point, and framing the output as a "guide" is super important. A few things that could also make the skill way more useful, taking into account that we have LLMs / models to do a lot more
|
|
Jarek, I suggest you look at #227. This skill is meant to be used in combination with that - this skill has documented limitations. |
|
Pre-flight self-review — PR #228 (contributor-activity-sweep)
#228 · draft · author: Base: main · Files changed: 37 (~all added) · Diff size: +756 / −0
A new Pairing/Triage-flavoured read-only activity card with a 12-case eval
Correctness
(Verified clean: eval output-spec keys match expected.json keys exactly across
Security
No findings. Strong injection-guard callout in the SKILL.md (2 matches).
That's a solid four security-adversarial cases for a 12-case suite. Read-only
Conventions
(Verified clean: SPDX header present; ## Step headings well-formed; eval
Summary
Ready to push with one small tightening — no blocking findings. The --strict
Blocking: 0 Advisory: 2 |
Self-review fixes for the two advisory findings:
- Trim the frontmatter description: drop the duplicate enum of off-GitHub channels (it is already documented prominently in the body callout). Comma count drops from 5+ to 3, clearing the skill-validator --strict action-inventory soft warning. The matching-layer wording is otherwise unchanged.
- Rename "## Step 1 — Fetch GitHub activity" to "## Step 1 — Fetch and classify activity" so the heading matches the eval dir (step-1-classify-reviews) and reflects what Step 1 actually does (fetch + substantive/LGTM classification). Update the matching step-config.json step_heading accordingly; the eval renderer continues to extract the section correctly.
Generated-by: Claude Code (Opus 4.7)
# Conflicts:
#	tools/skill-evals/README.md
|
Hi @justinmclean — was sweeping the PR queue and tried a maintainer-side rebase, but the branch is showing a 343-file / ~12k-line-deletion diff against current Could you take a look when you have a moment? I suspect the right call is either a fresh rebase to surface what's still unique here, or — if everything has landed via the other route — close this one out. Happy to defer to your call. Same shape on #229 and #269 — both stale enough that a clean maintainer-side rebase wasn't safe. |
Adds a new read-only skill that produces a GitHub activity card for a
named contributor on a configured upstream repo.
What it does
Fetches four streams of GitHub activity over a configurable window
(default 6 months): PRs authored, PR reviews given, issues filed, and
PR/issue comment threads. Output is a compact activity card with a
month-by-month timeline — no assessment, no readiness verdict.
Key design decisions
GitHub-only limitation is structural, not a footnote. The warning
appears in the frontmatter description, as an opening blockquote in
the card, and in the footer ("Code is not the only form of
contribution"). Contributors who are central to the mailing list,
documentation, or user support will appear quiet here — the skill says
so explicitly.
Review classification uses inline comments, not just body length.
The substantive threshold is inline_comment_count >= 3 OR body > 50 chars.
A body-length-only heuristic undercounts reviewers who work
line-by-line without writing a top-level summary — a common pattern
among experienced reviewers.
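As a rough illustration of the threshold above, the rule could be sketched in Python like this. The `Review` dataclass, the constant names, and the `"lgtm"` label for non-substantive reviews are assumptions made for the example (the label borrows the substantive/LGTM wording from the self-review commit); only the two threshold values come from this description. Whether surrounding whitespace counts toward body length is not stated here, so the sketch strips it.

```python
from dataclasses import dataclass

# Threshold values are taken from the PR description; the names are illustrative.
MIN_INLINE_COMMENTS = 3
MIN_BODY_CHARS = 50


@dataclass
class Review:
    """Hypothetical minimal shape of one fetched PR review."""
    body: str
    inline_comment_count: int


def classify_review(review: Review) -> str:
    """Return "substantive" or "lgtm" for a single review."""
    # Assumption: leading/trailing whitespace does not count toward body length.
    body_length = len(review.body.strip())
    if review.inline_comment_count >= MIN_INLINE_COMMENTS:
        return "substantive"  # line-by-line review, even with an empty summary
    if body_length > MIN_BODY_CHARS:
        return "substantive"  # long top-level summary
    return "lgtm"
```

Under this sketch a review with an empty body and three inline comments counts as substantive, which is the case a body-length-only heuristic would miss.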
Repo age check trims the window. If the repo is newer than the
requested start date, <since> is trimmed to the repo's creation date
so the timeline doesn't render a misleading wall of zero months.
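A minimal sketch of the window trimming, assuming both timestamps are timezone-aware `datetime` values. The function names `effective_since` and `month_labels` are hypothetical and not taken from the skill; the second helper is only there to show how the trimmed start shortens the month-by-month timeline.

```python
from datetime import datetime, timezone


def effective_since(requested_since: datetime, repo_created_at: datetime) -> datetime:
    """Never start the window before the repo existed."""
    return max(requested_since, repo_created_at)


def month_labels(since: datetime, until: datetime) -> list[str]:
    """List the YYYY-MM buckets a timeline would render for the window."""
    labels = []
    year, month = since.year, since.month
    while (year, month) <= (until.year, until.month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


# Example: a six-month request against a repo created partway through the window.
requested = datetime(2024, 1, 1, tzinfo=timezone.utc)
created = datetime(2024, 4, 10, tzinfo=timezone.utc)
until = datetime(2024, 6, 30, tzinfo=timezone.utc)
timeline = month_labels(effective_since(requested, created), until)
```

Here `timeline` starts at the creation month, so the months before the repo existed are not rendered as zero-activity rows.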
Handoff to contributor-nomination. After rendering, the skill
offers to continue into the full nomination flow without re-fetching
already-collected data.
Injection resistance throughout. Login values are validated against
the GitHub handle regex before any API calls. All query strings are
written to tempfiles rather than interpolated into shell arguments.
External content (PR titles, review bodies) is treated as input data;
imperative instructions found there are flagged and not followed.
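The first two guards could look roughly like the following sketch. The regex is a commonly used approximation of GitHub's handle rules (alphanumerics and single hyphens, no leading or trailing hyphen, at most 39 characters) and is an assumption, since the exact pattern the skill uses is not quoted here. The helper names and the `.txt` suffix are likewise hypothetical.

```python
import re
import tempfile

# Assumption: approximate GitHub handle rules, not the skill's exact pattern.
HANDLE_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9]|-(?=[A-Za-z0-9])){0,38}")


def validate_login(login: str) -> str:
    """Reject anything that is not a plausible GitHub handle before any API call."""
    if not HANDLE_RE.fullmatch(login):
        raise ValueError(f"refusing to query: {login!r} is not a valid GitHub handle")
    return login


def write_query_file(query: str) -> str:
    """Write a query string to a tempfile and return its path.

    The caller passes the path to the API client instead of interpolating
    the query text into a shell argument.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(query)
        return handle.name


login = validate_login("choo121600")
query_path = write_query_file(f"author:{login} type:pr")
```

Validating first means the value written into the query file can only contain letters, digits, and hyphens, so shell metacharacters never reach the query text at all.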
Eval suite
12 cases across 3 steps - all pass.