feat(pairing-self-review): add pre-flight self-review skill and eval suite #251
51bf8b3 to 8935656
Ships the first Pairing-mode skill: a read-only pre-flight self-review that runs in the developer's own dev loop. Classifies diff findings across correctness, security, and conventions axes and returns a structured hand-back report (no state changes, no PR, no external writes).

Updates docs/modes.md Pairing row from proposed/0 to experimental/1, adds a skill table.

Ships a 9-case eval suite covering clean diffs, blocking and advisory findings on each axis, an empty-diff guard, and prompt-injection resistance.

Generated-by: Claude (Opus 4.7)
8935656 to 5ac5abf
I ran
Security
Conventions
(Positive, for the record: passes skill-validate --strict and …
Summary
Blocking findings present: address the detail/evidence schema mismatch before …
Blocking: 1 Advisory: 2
Reconcile the skill with its eval contract and close two coverage gaps surfaced by a self-review of the branch:

- Step 2: rename the finding field `detail` -> `evidence` so the skill matches the eval output-spec and every expected.json. Previously the skill said `detail` while the fixtures asserted `evidence`, so a faithful run would have failed all four finding cases.
- Document the empty-diff signal (`empty_diff`) in Step 2 and the step-2 output-spec, so case-6's asserted field is no longer undocumented.
- Drop the non-standard `status:` frontmatter field; lifecycle status stays in docs/modes.md, as with every other skill.
- Add step-2 case-7-multi-axis (findings on all three axes -> empty axes_without_findings) and step-3 case-4-mixed-severity (blocking + advisory -> blocking signal with both counts non-zero).

Suite is now 11 cases.

Generated-by: Claude Code (Opus 4.7)
Fixed the issues pairing-self-review found
What
Adds `pairing-self-review`, the first Pairing-mode skill: a strictly read-only
pre-flight self-review that runs in the developer's own loop, after local
changes are ready but before a PR is opened. It diffs the working branch against
a configurable base (default: the merge base of `HEAD` and the upstream default
branch), classifies findings across correctness, security, and conventions axes
(each `blocking` or `advisory`), and returns a structured report. No state
changes: it never opens a PR, pushes, comments, or mutates the working tree.
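As a rough illustration of the report shape described above, here is a minimal Python sketch. It is not the skill's implementation (the skill itself is a `SKILL.md`). The field names `evidence`, `empty_diff`, and `axes_without_findings` come from this PR's commit messages; everything else (`Finding`, `build_report`, the `signal`, `counts`, and `findings` keys, the `"clean"` value, and the sample findings) is a hypothetical stand-in chosen for the example.

```python
from dataclasses import asdict, dataclass

# Axes and severities as named in the PR description.
AXES = ("correctness", "security", "conventions")
SEVERITIES = ("blocking", "advisory")


@dataclass(frozen=True)
class Finding:
    """One classified finding (hypothetical container for this sketch)."""
    axis: str       # one of AXES
    severity: str   # "blocking" or "advisory"
    evidence: str   # what in the diff triggered the finding


def build_report(findings, diff_text):
    """Summarize findings into a plain dict; reads inputs only, writes nothing."""
    for f in findings:
        if f.axis not in AXES or f.severity not in SEVERITIES:
            raise ValueError(f"unknown axis or severity: {f}")

    blocking = sum(f.severity == "blocking" for f in findings)
    advisory = sum(f.severity == "advisory" for f in findings)
    axes_hit = {f.axis for f in findings}

    # Assumed rule: any blocking finding makes the overall signal blocking.
    if blocking:
        signal = "blocking"
    elif advisory:
        signal = "advisory"
    else:
        signal = "clean"  # hypothetical label for "nothing found"

    return {
        "empty_diff": not diff_text.strip(),
        "signal": signal,
        "counts": {"blocking": blocking, "advisory": advisory},
        "axes_without_findings": [a for a in AXES if a not in axes_hit],
        "findings": [asdict(f) for f in findings],
    }


# Made-up findings, purely for illustration.
report = build_report(
    [
        Finding("security", "blocking", "token written to a log line"),
        Finding("conventions", "advisory", "helper name breaks naming style"),
    ],
    diff_text="+ log(token)\n",
)
print(report["signal"], report["counts"], report["axes_without_findings"])
```

Keeping the summary a pure function of the findings and the diff text mirrors the read-only contract: the sketch has no file, network, or git side effects.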
Also flips the Pairing row in `docs/modes.md` from proposed/0 to experimental/1
and adds the skill to the mode's table.
Why
Pre-flight self-review keeps implementation-detail chatter out of the eventual
human review, so the maintainer conversation stays on design and trade-offs.
It's the read-only counterpart to `pr-management-code-review`, which runs once a
PR is already open.
Changes
- `.claude/skills/pairing-self-review/SKILL.md`: the skill.
- `tools/skill-evals/evals/pairing-self-review/`: a 9-case eval suite.
- `docs/modes.md`: Pairing row + skill table.

Testing
- `skill-validator` passes (no hard violations, no soft advisories).
- … `SKILL.md`, so a clean assembly confirms fixtures and skill text are in sync.
- … injection, a correctness regression, and, when a comment in the diff ordered
  the reviewer to suppress findings, it treated the comment as data and flagged
  it rather than obeying. No false positives on clean changes.
Notes