feat(preflight-audit): CLI for measuring bulk-mode skip-rate #418
Merged
…t real or replayed tracker state

The bulk-mode pre-flight classifier (security-issue-sync § bulk-mode.md) has a rule table that evolves as we learn how real adopter trackers behave. Each rule change needs a before / after measurement to know whether the change helped (more skips) or hurt (false-positive skips). The recent classifier-tuning PR was driven by a one-off `/tmp/` script: easy to lose, hard to reproduce, can't be wired into CI.

`tools/preflight-audit/` promotes that pattern into a permanent tool:

    preflight-audit classify --repo <r> --issues 1,2,3 [--now ISO]
    preflight-audit classify --load resp.json --now ISO [--json]

- Live mode shells out to `gh api graphql` (same aliased multi-field query as the skill).
- Replay mode reads a pre-fetched response from disk: deterministic, network-free, suitable for CI eval fixtures.
- Default output is the same grouped table as the original dry-run script; `--json` gives machine-readable output.
- `--bot-logins` lets adopters extend the bot-equivalent detection for personal-account bots.

The classifier in `src/preflight_audit/classifier.py` is the **executable spec** of the rule table in `security-issue-sync/bulk-mode.md`. Both must be edited in lock-step: a PR that changes one should change the other.

37 unit tests cover each rule with a focused positive + negative case, the skill-marker detection (including parametrized edge cases for the marker-match precision), and `classify_response` end-to-end.

The intended workflow when changing a rule:

1. Run preflight-audit before the change to capture skip-rate.
2. Edit the rule table in `bulk-mode.md` AND the matching condition in `classifier.py`.
3. Re-run to capture the after.
4. Cite both numbers in the PR body.

The tool itself produces no GitHub state changes; it is read-only by design. Capability declared as `capability:stats`.
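The before / after measurement in the workflow above can be sketched as follows. This is a minimal illustration, not the tool's code: the `{issue number: decision}` mapping and the `skip_rate` / `compare` helpers are hypothetical, and the actual `--json` schema is not shown in this thread. Only the decision labels (`dispatch`, `dispatch-urgent`, `skip-noop`) come from the PR discussion.

```python
# Hypothetical sketch: compare two classification runs (before and after a rule change).
# The input shape {issue_number: decision} is an assumption, not the tool's --json schema.


def skip_rate(decisions: dict[int, str]) -> float:
    """Fraction of issues whose decision label starts with 'skip'."""
    if not decisions:
        return 0.0
    skipped = sum(1 for decision in decisions.values() if decision.startswith("skip"))
    return skipped / len(decisions)


def compare(before: dict[int, str], after: dict[int, str]) -> dict:
    """Return both skip-rates plus every issue that changed bucket."""
    moved = {
        number: (before[number], after[number])
        for number in sorted(before.keys() & after.keys())
        if before[number] != after[number]
    }
    return {"before": skip_rate(before), "after": skip_rate(after), "moved": moved}


# Made-up sample data for illustration only.
before = {1: "dispatch", 2: "skip-noop", 3: "dispatch", 4: "dispatch-urgent"}
after = {1: "skip-noop", 2: "skip-noop", 3: "dispatch", 4: "dispatch-urgent"}

report = compare(before, after)
print(f"skip-rate: {report['before']:.0%} -> {report['after']:.0%}")
for number, (old, new) in report["moved"].items():
    print(f"  #{number}: {old} -> {new}")
```

Listing the issues that moved, not just the two rates, is what lets a reviewer tell a helpful extra skip from a false-positive skip.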
40e525c to 6be4311
This was referenced May 31, 2026
potiuk added a commit that referenced this pull request May 31, 2026
…ifier rule (#423)

PR #418 shipped the preflight-audit CLI with a replay mode but no fixture exercising the full classifier end-to-end. This adds:

- `tests/fixtures/synthetic_workspace_sweep.json`: 12-issue GraphQL response, one issue per rule path (Rule 1 dispatch, Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7 skip-noop, GitHub-App bot login, personal-bot needing override, fall-through dispatch, recently-closed dispatch). Each issue node carries a `_purpose` annotation documenting which rule it should land on.
- `tests/test_eval_replay.py`: drives `classify_response` against the fixture with a pinned `now` (2026-06-01T12:00:00Z) and asserts:
  1. The full per-decision bucket distribution (positional identifiers per bucket).
  2. The same distribution under `extra_bot_logins`: one issue migrates from dispatch to skip-noop with the override.
  3. Per-issue assertions with reason-substring matches, keeping the fixture's `_purpose` annotations in lock-step with the classifier behaviour.
  4. A skip-rate floor (≥30%) matching the real-world target after #416's rule tuning.

A rule change that alters the distribution fails one of the asserts; the diff in the failing assertion tells the reviewer how the rule affects coverage before they ever look at real adopter data. The eval is deterministic (no live `gh` calls, fixed `now`) so CI runs it in milliseconds.

This closes the tune-then-verify loop one more rung up: PR #416 used a one-off `/tmp/` script, PR #418 promoted it to a CLI, and this PR locks the rule behaviour into the test suite.
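The shape of such a replay eval can be sketched with a toy stand-in. Everything below is hypothetical: `toy_classify`, the single rule it applies, the five-issue fixture, and the logins are invented for illustration and are not the real rule table, fixture, or `classify_response` signature. The sketch only shows the pattern described above: a pinned clock, a bucket-distribution assertion, a bot-login override that moves one issue, and a skip-rate floor.

```python
from datetime import datetime, timezone

# Pinned clock, mirroring the fixed `now` used by the eval described above.
NOW = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)


def is_bot(login: str, extra_bot_logins: frozenset = frozenset()) -> bool:
    # GitHub App accounts have logins ending in "[bot]"; personal-account
    # bots look like ordinary users and must be listed explicitly.
    return login.endswith("[bot]") or login in extra_bot_logins


def toy_classify(issue: dict, now: datetime, extra_bot_logins: frozenset = frozenset()) -> str:
    """Toy rule (NOT the real table): skip when the last comment is from a
    bot-equivalent account and is less than 7 days old."""
    last = datetime.fromisoformat(issue["last_comment_at"])
    if is_bot(issue["last_comment_by"], extra_bot_logins) and (now - last).days < 7:
        return "skip-noop"
    return "dispatch"


# Invented fixture; `_purpose` documents the expected landing bucket per issue.
FIXTURE = [
    {"number": 1, "last_comment_by": "triage-helper[bot]",
     "last_comment_at": "2026-05-30T09:00:00+00:00", "_purpose": "app bot, recent: skip"},
    {"number": 2, "last_comment_by": "alice-automation",
     "last_comment_at": "2026-05-31T09:00:00+00:00", "_purpose": "personal bot: needs override"},
    {"number": 3, "last_comment_by": "bob",
     "last_comment_at": "2026-05-31T09:00:00+00:00", "_purpose": "human: dispatch"},
    {"number": 4, "last_comment_by": "triage-helper[bot]",
     "last_comment_at": "2026-05-01T09:00:00+00:00", "_purpose": "stale bot comment: dispatch"},
    {"number": 5, "last_comment_by": "triage-helper[bot]",
     "last_comment_at": "2026-05-29T09:00:00+00:00", "_purpose": "app bot, recent: skip"},
]


def buckets(issues: list, now: datetime, extra_bot_logins: frozenset = frozenset()) -> dict:
    """Group issue numbers by decision."""
    grouped: dict = {}
    for issue in issues:
        decision = toy_classify(issue, now, extra_bot_logins)
        grouped.setdefault(decision, []).append(issue["number"])
    return grouped


default = buckets(FIXTURE, NOW)
overridden = buckets(FIXTURE, NOW, frozenset({"alice-automation"}))

# 1. Full bucket distribution; 2. one issue migrates under the override.
assert default == {"skip-noop": [1, 5], "dispatch": [2, 3, 4]}
assert overridden == {"skip-noop": [1, 2, 5], "dispatch": [3, 4]}
# 4. Skip-rate floor, here applied to the toy fixture.
assert len(default["skip-noop"]) / len(FIXTURE) >= 0.30
```

Because the clock and the fixture are fixed, any change to the toy rule shows up as a failing equality whose diff names the issues that moved.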
Summary
The bulk-mode pre-flight classifier (`security-issue-sync` → `bulk-mode.md`) has a rule table that evolves as we learn how real adopter trackers behave. Each rule change needs a before / after measurement to know whether the change helped (more skips) or hurt (false-positive skips). The recent classifier-tuning PR (#416) was driven by a one-off `/tmp/` script: easy to lose, hard to reproduce, not CI-friendly. `tools/preflight-audit/` promotes that pattern into a permanent tool.

- Live mode shells out to `gh api graphql` (same aliased multi-field query the skill builds inline).
- Replay mode reads a pre-fetched response from disk: deterministic, network-free, suitable for CI eval fixtures.
- `--bot-logins` extends bot-equivalent detection for adopters with personal-account bots.
- Default output is the same grouped table as the original dry-run script; `--json` for machine-readable output.

How the rules stay in sync
The classifier in `src/preflight_audit/classifier.py` is the executable spec of the rule table in `bulk-mode.md`. Both must be edited in lock-step: a PR that changes one should change the other.

37 unit tests cover each rule with a focused positive + negative case, the skill-marker detection (including parametrized edge cases), and `classify_response` end-to-end.

Intended workflow when changing a rule
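1. Run `preflight-audit` against your tracker → capture before.
2. Edit the rule table in `bulk-mode.md` AND the matching condition in `classifier.py`.
3. Re-run → capture after.
4. Cite both numbers in the PR body.

Test plan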
- `ruff check` / `ruff format` / `mypy`: green
- `prek` (markdownlint, typos, format, validators): green
- `lychee` on the new README: clean
- `skill-and-tool-validate`: no new violations

🤖 Generated with Claude Code