feat(preflight-audit): CLI for measuring bulk-mode skip-rate by potiuk · Pull Request #418 · apache/magpie

potiuk · 2026-05-31T14:03:27Z

Summary

The bulk-mode pre-flight classifier (security-issue-sync → bulk-mode.md) has a rule table that evolves as we learn how real adopter trackers behave. Each rule change needs a before / after measurement to know whether the change helped (more skips) or hurt (false-positive skips). The recent classifier-tuning PR (#416) was driven by a one-off /tmp/ script — easy to lose, hard to reproduce, not CI-friendly.

tools/preflight-audit/ promotes that pattern into a permanent tool.

preflight-audit classify --repo <r> --issues 1,2,3 [--now ISO]
preflight-audit classify --load resp.json --now ISO [--json]

Live mode shells out to gh api graphql (same aliased multi-field query the skill builds inline).
Replay mode reads a pre-fetched response — deterministic, network-free, suitable for CI eval fixtures.
--bot-logins extends bot-equivalent detection for adopters with personal-account bots.
Default output is a grouped table; --json for machine-readable.
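
As a rough illustration of what extending bot-equivalent detection could look like, here is a minimal sketch. The helper name `is_bot_equivalent` and its logic are assumptions for illustration, not the code in `classifier.py`. It relies on the GitHub convention that App accounts carry a `[bot]` suffix in their login, and treats any other login as a bot only when it is listed explicitly (the `--bot-logins` case).

```python
from collections.abc import Iterable


def is_bot_equivalent(login: str, extra_bot_logins: Iterable[str] = ()) -> bool:
    """Return True if `login` should be treated as a bot.

    Hypothetical helper: GitHub App accounts are recognised by the "[bot]"
    suffix on their login; personal-account bots count only when they are
    listed explicitly by the adopter.
    """
    normalized = login.lower()  # GitHub logins are case-insensitive
    if normalized.endswith("[bot]"):
        return True
    return normalized in {extra.lower() for extra in extra_bot_logins}


print(is_bot_equivalent("dependabot[bot]"))        # GitHub App bot
print(is_bot_equivalent("acme-ci"))                # personal account, not listed
print(is_bot_equivalent("acme-ci", ["acme-ci"]))   # personal account, listed
```

A personal-account bot such as the made-up `acme-ci` above is indistinguishable from a human by login alone, which is why an explicit override is needed.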

How the rules stay in sync

The classifier in src/preflight_audit/classifier.py is the executable spec of the rule table in bulk-mode.md. Both must be edited in lock-step — a PR that changes one should change the other.

37 unit tests cover each rule with a focused positive + negative case, the skill-marker detection (including parametrized edge cases), and classify_response end-to-end.
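
To make the "executable spec" idea concrete, the sketch below shows one way an ordered rule table with a fall-through decision might be encoded, together with a positive and a negative check per rule. Everything in it is hypothetical: the `Issue` fields, the two rules, and the first-match-wins ordering are illustrative assumptions, not the real table in bulk-mode.md or the real code in `classifier.py`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Issue:
    # Hypothetical, simplified view of a tracker issue.
    labels: tuple[str, ...] = ()
    last_comment_by_bot: bool = False


@dataclass(frozen=True)
class Rule:
    name: str
    decision: str
    matches: Callable[[Issue], bool]


# Illustrative rules only; the real rule table lives in bulk-mode.md.
RULES = [
    Rule("labelled-urgent", "dispatch-urgent", lambda i: "urgent" in i.labels),
    Rule("bot-spoke-last", "skip-noop", lambda i: i.last_comment_by_bot),
]


def classify(issue: Issue) -> tuple[str, str]:
    """Return (decision, reason); the first matching rule wins."""
    for rule in RULES:
        if rule.matches(issue):
            return rule.decision, rule.name
    return "dispatch", "fall-through"


# Positive and negative case for the "bot-spoke-last" rule.
assert classify(Issue(last_comment_by_bot=True)) == ("skip-noop", "bot-spoke-last")
assert classify(Issue(last_comment_by_bot=False)) == ("dispatch", "fall-through")
# An earlier rule takes precedence over a later one.
assert classify(Issue(labels=("urgent",), last_comment_by_bot=True))[0] == "dispatch-urgent"
```

Keeping each rule as one named entry makes it easier to line the code up against the corresponding row of a prose rule table when both are edited in the same PR.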

Intended workflow when changing a rule

Run preflight-audit against your tracker → capture before.
Edit the rule table in bulk-mode.md AND the matching condition in classifier.py.
Re-run → capture after.
Cite both numbers in the PR body.
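
The before / after comparison in the steps above boils down to a simple ratio. The sketch below assumes the per-issue decisions have already been extracted into plain lists of strings, and that skip decisions start with `skip`. The decision lists are made-up sample data, and the exact shape of the `--json` output is not shown here.

```python
def skip_rate(decisions: list[str]) -> float:
    """Fraction of issues whose decision is a skip (0.0 for an empty list)."""
    if not decisions:
        return 0.0
    skipped = sum(1 for decision in decisions if decision.startswith("skip"))
    return skipped / len(decisions)


# Hypothetical decisions for the same four issues, before and after a rule edit.
before = ["dispatch", "dispatch", "skip-noop", "dispatch-urgent"]
after = ["dispatch", "skip-noop", "skip-noop", "dispatch-urgent"]

print(f"skip-rate before: {skip_rate(before):.0%}, after: {skip_rate(after):.0%}")
# prints: skip-rate before: 25%, after: 50%
```

A higher number is not automatically an improvement: the issues that moved from dispatch to skip still need a look to rule out false-positive skips.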

Test plan

37 unit tests — green
ruff check / ruff format / mypy — green
prek (markdownlint, typos, format, validators) — green
lychee on the new README — clean
skill-and-tool-validate — no new violations
CI lychee + tests-ok on this PR

🤖 Generated with Claude Code

…t real or replayed tracker state

The bulk-mode pre-flight classifier (security-issue-sync § bulk-mode.md) has a rule table that evolves as we learn how real adopter trackers behave. Each rule change needs a before / after measurement to know whether the change helped (more skips) or hurt (false-positive skips). The recent classifier-tuning PR was driven by a one-off `/tmp/` script: easy to lose, hard to reproduce, can't be wired into CI.

`tools/preflight-audit/` promotes that pattern into a permanent tool:

    preflight-audit classify --repo <r> --issues 1,2,3 [--now ISO]
    preflight-audit classify --load resp.json --now ISO [--json]

- Live mode shells out to `gh api graphql` (same aliased multi-field query as the skill).
- Replay mode reads a pre-fetched response from disk: deterministic, network-free, suitable for CI eval fixtures.
- Default output is the same grouped table as the original dry-run script; `--json` gives machine-readable output.
- `--bot-logins` lets adopters extend the bot-equivalent detection for personal-account bots.

The classifier in `src/preflight_audit/classifier.py` is the **executable spec** of the rule table in `security-issue-sync/bulk-mode.md`. Both must be edited in lock-step: a PR that changes one should change the other.

37 unit tests cover each rule with a focused positive + negative case, the skill-marker detection (including parametrized edge cases for the marker-match precision), and `classify_response` end-to-end.

The intended workflow when changing a rule:

1. Run preflight-audit before the change to capture skip-rate.
2. Edit the rule table in `bulk-mode.md` AND the matching condition in `classifier.py`.
3. Re-run to capture the after.
4. Cite both numbers in the PR body.

The tool itself produces no GitHub state changes; it is read-only by design. Capability declared as `capability:stats`.

…ifier rule (#423)

PR #418 shipped the preflight-audit CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds:

- `tests/fixtures/synthetic_workspace_sweep.json`: 12-issue GraphQL response, one issue per rule path (Rule 1 dispatch, Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7 skip-noop, GitHub-App bot login, personal-bot needing override, fall-through dispatch, recently-closed dispatch). Each issue node carries a `_purpose` annotation documenting which rule it should land on.
- `tests/test_eval_replay.py`: drives `classify_response` against the fixture with a pinned `now` (2026-06-01T12:00:00Z) and asserts:
  1. The full per-decision bucket distribution (positional identifiers per bucket).
  2. The same distribution under `extra_bot_logins`: one issue migrates from dispatch to skip-noop with the override.
  3. Per-issue assertions with reason-substring matches, keeping the fixture's `_purpose` annotations in lock-step with the classifier behaviour.
  4. A skip-rate floor (≥30%) matching the real-world target after #416's rule tuning.

A rule change that alters the distribution fails one of the asserts; the diff in the failing assertion tells the reviewer how the rule affects coverage before they ever look at real adopter data. The eval is deterministic (no live `gh` calls, fixed `now`) so CI runs it in milliseconds.

This closes the tune-then-verify loop one more rung up: PR #416 used a one-off `/tmp/` script, PR #418 promoted it to a CLI, and this PR locks the rule behaviour into the test suite.
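
The reason a pinned `now` makes the eval deterministic can be shown with a small sketch. The helpers `parse_now` and `age_in_days` below are hypothetical and not taken from the tool. They only illustrate that any age-based condition gives the same answer on every run once the clock is passed in as an argument instead of being read from the system.

```python
from datetime import datetime, timezone


def parse_now(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; a trailing "Z" is read as UTC."""
    if value.endswith("Z"):
        # Older Python versions do not accept "Z" in fromisoformat().
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def age_in_days(timestamp: str, now: datetime) -> float:
    """Age of `timestamp` relative to an explicitly supplied `now`."""
    return (now - parse_now(timestamp)).total_seconds() / 86400


now = parse_now("2026-06-01T12:00:00Z")  # the pinned value quoted above
# Hypothetical fixture timestamp, exactly one day before the pinned clock.
print(age_in_days("2026-05-31T12:00:00Z", now))  # 1.0
```

With the clock injected like this, a fixture written today yields the same ages, and therefore the same buckets, whenever the test suite runs.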

potiuk force-pushed the feat-preflight-audit-tool branch from 40e525c to 6be4311 on May 31, 2026 14:06

potiuk merged commit 67a2d53 into apache:main May 31, 2026
16 checks passed

This was referenced May 31, 2026

refactor: uv workspace — DRY pre-commit + CI matrix from single SoT #419

Merged

test(preflight-audit): replay-mode eval fixture for the classifier #423

Merged

potiuk merged 1 commit into apache:main from potiuk:feat-preflight-audit-tool
