
feat(preflight-audit): CLI for measuring bulk-mode skip-rate #418

Merged
potiuk merged 1 commit into
apache:main from
potiuk:feat-preflight-audit-tool
May 31, 2026

@potiuk potiuk commented May 31, 2026

Member

Summary

The bulk-mode pre-flight classifier (security-issue-sync/bulk-mode.md) has a rule table that evolves as we learn how real adopter trackers behave. Each rule change needs a before/after measurement to show whether the change helped (more skips) or hurt (false-positive skips). The recent classifier-tuning PR (#416) was driven by a one-off /tmp/ script: easy to lose, hard to reproduce, and not CI-friendly.

tools/preflight-audit/ promotes that pattern into a permanent tool.

preflight-audit classify --repo <r> --issues 1,2,3 [--now ISO]
preflight-audit classify --load resp.json --now ISO [--json]
  • Live mode shells out to gh api graphql (the same aliased multi-field query that the skill builds inline).
  • Replay mode reads a pre-fetched response from disk: deterministic, network-free, suitable for CI eval fixtures.
  • --bot-logins extends bot-equivalent detection for adopters with personal-account bots.
  • Default output is a grouped table; --json gives machine-readable output.
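
To make the two invocation shapes concrete, here is a minimal argparse sketch of how a CLI with these flags could be wired. It is an illustration only, not the code in tools/preflight-audit/: `build_parser` and `pick_mode` are hypothetical names, the comma-separated format for `--bot-logins` is an assumption, and treating `--now` as mandatory in replay mode is inferred from the usage lines above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two usage lines in the summary."""
    parser = argparse.ArgumentParser(prog="preflight-audit")
    sub = parser.add_subparsers(dest="command", required=True)
    classify = sub.add_parser("classify")
    classify.add_argument("--repo", help="owner/name of the tracker (live mode)")
    classify.add_argument(
        "--issues",
        # Turn "1,2,3" into [1, 2, 3]; argparse reports bad integers as usage errors.
        type=lambda raw: [int(part) for part in raw.split(",")],
        help="comma-separated issue numbers (live mode)",
    )
    classify.add_argument("--load", help="pre-fetched response file (replay mode)")
    classify.add_argument("--now", help="ISO timestamp used as the classifier clock")
    # Assumption: extra bot logins are passed as one comma-separated string.
    classify.add_argument("--bot-logins", default="")
    classify.add_argument("--json", action="store_true")
    return parser


def pick_mode(args: argparse.Namespace) -> str:
    """Return "replay" or "live", rejecting incomplete flag combinations."""
    if args.load:
        if not args.now:
            # Without a pinned clock a replay would not be deterministic.
            raise SystemExit("replay mode requires --now")
        return "replay"
    if args.repo and args.issues:
        return "live"
    raise SystemExit("pass either --load or --repo with --issues")
```

Requiring `--now` only in replay mode keeps live runs convenient while ensuring that a stored fixture always classifies the same way.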

How the rules stay in sync

The classifier in src/preflight_audit/classifier.py is the executable spec of the rule table in bulk-mode.md. Both must be edited in lock-step — a PR that changes one should change the other.

37 unit tests cover each rule with a focused positive + negative case, the skill-marker detection (including parametrized edge cases), and classify_response end-to-end.

Intended workflow when changing a rule

  1. Run preflight-audit against your tracker → capture before.
  2. Edit the rule table in bulk-mode.md AND the matching condition in classifier.py.
  3. Re-run → capture after.
  4. Cite both numbers in the PR body.
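
The before/after comparison in steps 1, 3 and 4 could be scripted roughly as follows. This is a sketch under assumptions: the `--json` schema is not described in this PR, so the list-of-records shape and the `decision` key are hypothetical, and any decision whose name starts with `skip` is counted as a skip.

```python
from collections import Counter


def skip_rate(records: list[dict]) -> float:
    """Fraction of issues whose decision is a skip (assumed "skip*" naming)."""
    if not records:
        return 0.0
    counts = Counter(record["decision"] for record in records)
    skipped = sum(n for decision, n in counts.items() if decision.startswith("skip"))
    return skipped / len(records)


def summarize(before: list[dict], after: list[dict]) -> str:
    """One line suitable for pasting into a PR body."""
    b, a = skip_rate(before), skip_rate(after)
    return f"skip-rate: {b:.1%} -> {a:.1%} ({(a - b) * 100:+.1f} pp)"


# Hypothetical captures from two runs against the same four issues.
before = [
    {"number": 1, "decision": "skip-noop"},
    {"number": 2, "decision": "dispatch"},
    {"number": 3, "decision": "dispatch"},
    {"number": 4, "decision": "dispatch-urgent"},
]
after = [
    {"number": 1, "decision": "skip-noop"},
    {"number": 2, "decision": "skip-noop"},
    {"number": 3, "decision": "dispatch"},
    {"number": 4, "decision": "dispatch-urgent"},
]
print(summarize(before, after))  # skip-rate: 25.0% -> 50.0% (+25.0 pp)
```

A higher number alone does not show the change helped; the newly skipped issues still need a manual check for false-positive skips.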

Test plan

  • 37 unit tests — green
  • ruff check / ruff format / mypy — green
  • prek (markdownlint, typos, format, validators) — green
  • lychee on the new README — clean
  • skill-and-tool-validate — no new violations
  • CI lychee + tests-ok on this PR

🤖 Generated with Claude Code

…t real or replayed tracker state

The bulk-mode pre-flight classifier
(security-issue-sync § bulk-mode.md) has a rule table that
evolves as we learn how real adopter trackers behave. Each rule
change needs a before / after measurement to know whether the
change helped (more skips) or hurt (false-positive skips). The
recent classifier-tuning PR was driven by a one-off `/tmp/`
script — easy to lose, hard to reproduce, can't be wired into
CI.

`tools/preflight-audit/` promotes that pattern into a permanent
tool:

  preflight-audit classify --repo <r> --issues 1,2,3 [--now ISO]
  preflight-audit classify --load resp.json --now ISO [--json]

- Live mode shells out to `gh api graphql` (same aliased
  multi-field query as the skill).
- Replay mode reads a pre-fetched response from disk —
  deterministic, network-free, suitable for CI eval fixtures.
- Default output is the same grouped table as the original
  dry-run script; `--json` gives machine-readable output.
- `--bot-logins` lets adopters extend the bot-equivalent
  detection for personal-account bots.

The classifier in `src/preflight_audit/classifier.py` is the
**executable spec** of the rule table in
`security-issue-sync/bulk-mode.md`. Both must be edited in
lock-step — a PR that changes one should change the other. 37
unit tests cover each rule with a focused positive + negative
case, the skill-marker detection (including parametrized edge
cases for the marker-match precision), and `classify_response`
end-to-end.

The intended workflow when changing a rule:

1. Run preflight-audit before the change to capture skip-rate.
2. Edit the rule table in `bulk-mode.md` AND the matching
   condition in `classifier.py`.
3. Re-run to capture the after.
4. Cite both numbers in the PR body.

The tool itself produces no GitHub state changes — it is
read-only by design. Capability declared as `capability:stats`.
potiuk force-pushed the feat-preflight-audit-tool branch from 40e525c to 6be4311 May 31, 2026 14:06
potiuk merged commit 67a2d53 into apache:main May 31, 2026
16 checks passed
potiuk added a commit that referenced this pull request May 31, 2026
…ifier rule (#423)

PR #418 shipped the preflight-audit CLI with a replay mode but no
fixture exercising the full classifier end-to-end. This adds:

- `tests/fixtures/synthetic_workspace_sweep.json` — 12-issue
  GraphQL response, one issue per rule path (Rule 1 dispatch,
  Rule 1-yields-then-Rule-7, Rule 2 dispatch-urgent, Rules 3-7
  skip-noop, GitHub-App bot login, personal-bot needing
  override, fall-through dispatch, recently-closed dispatch).
  Each issue node carries a `_purpose` annotation documenting
  which rule it should land on.

- `tests/test_eval_replay.py` — drives `classify_response`
  against the fixture with a pinned `now` (2026-06-01T12:00:00Z)
  and asserts:
  1. The full per-decision bucket distribution (positional
     identifiers per bucket).
  2. The same distribution under `extra_bot_logins` — one issue
     migrates from dispatch to skip-noop with the override.
  3. Per-issue assertions with reason-substring matches, keeping
     the fixture's `_purpose` annotations in lock-step with the
     classifier behaviour.
  4. A skip-rate floor (≥30%) matching the real-world target
     after #416's rule tuning.

A rule change that alters the distribution fails one of the
asserts; the diff in the failing assertion tells the reviewer
how the rule affects coverage before they ever look at real
adopter data. The eval is deterministic (no live `gh` calls,
fixed `now`) so CI runs it in milliseconds.
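
The shape of such a replay eval can be sketched with a toy stand-in. Nothing below is the real `classify_response` or the real rule table: `toy_classify`, its two rules, the five-issue inline fixture, and the `acme-helper` login are all hypothetical, chosen only to show a pinned clock, a bucket-distribution assertion, the bot-login override, and a skip-rate floor.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Pinned clock, so results do not depend on when the test runs.
NOW = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical fixture: one record per scenario.
FIXTURE = [
    {"number": 1, "last_author": "github-actions[bot]", "updated": "2026-05-20T09:00:00+00:00"},
    {"number": 2, "last_author": "alice", "updated": "2026-05-30T09:00:00+00:00"},
    {"number": 3, "last_author": "acme-helper", "updated": "2026-05-20T09:00:00+00:00"},
    {"number": 4, "last_author": "dependabot[bot]", "updated": "2026-06-01T11:00:00+00:00"},
    {"number": 5, "last_author": "github-actions[bot]", "updated": "2026-05-01T09:00:00+00:00"},
]


def toy_classify(issue, now, extra_bot_logins=frozenset()):
    """Toy rules: recent activity dispatches; otherwise bot-authored state is a no-op."""
    updated = datetime.fromisoformat(issue["updated"])
    if now - updated < timedelta(hours=24):
        return "dispatch"
    login = issue["last_author"]
    # GitHub App logins end in "[bot]"; personal-account bots need the override.
    if login.endswith("[bot]") or login in extra_bot_logins:
        return "skip-noop"
    return "dispatch"


def buckets(issues, now, extra_bot_logins=frozenset()):
    """Group issue numbers by decision, preserving fixture order."""
    out = defaultdict(list)
    for issue in issues:
        out[toy_classify(issue, now, extra_bot_logins)].append(issue["number"])
    return dict(out)


def test_distribution():
    assert buckets(FIXTURE, NOW) == {"skip-noop": [1, 5], "dispatch": [2, 3, 4]}


def test_override_moves_one_issue():
    result = buckets(FIXTURE, NOW, extra_bot_logins=frozenset({"acme-helper"}))
    assert result == {"skip-noop": [1, 3, 5], "dispatch": [2, 4]}


def test_skip_rate_floor():
    skipped = len(buckets(FIXTURE, NOW).get("skip-noop", []))
    assert skipped / len(FIXTURE) >= 0.30
```

Asserting on the full bucket contents rather than only on counts means a failing diff names the exact issues that moved, which is the property the commit message relies on.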

This closes the tune-then-verify loop one more rung up — PR
#416 used a one-off `/tmp/` script, PR #418 promoted it to a
CLI, and this PR locks the rule behaviour into the test suite.