F-022: perf(services): RegexSet first-pass for secret scanning by Sephyi · Pull Request #25 · Sephyi/commitbee

Sephyi · 2026-04-22T19:51:23Z

Summary

perf(services): RegexSet first-pass for secret scanning.

Audit context

Closes audit entry F-022 from #3.

Verification

cargo fmt --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test --all-targets

Note: one pre-existing test porcelain_exits_within_timeout_with_no_staged_changes is a known macOS cold-start flake that reproduces on unmodified development — unrelated to this change.

Secret scanning previously ran each of the 24 built-in regexes against every added diff line, making the cost O(N x 24 x L) per scan. The common case (no secret on the line) paid the full per-pattern overhead.

Bundle the patterns into a new `PatternSet` that pairs the `Vec<SecretPattern>` with a `regex::RegexSet` built from the same sources. Per-line scanning now asks the aggregated automaton first; only lines that actually hit a pattern resolve an index back to the owning `SecretPattern`.

`RegexSet::matches().iter()` yields indices in ascending order, so taking `.next()` preserves the prior "first pattern wins" semantics; the output shape of `SecretMatch` is unchanged. The `RegexSet` is compiled once per `build_patterns` call (and cached via `LazyLock` for the default pattern set), so build cost is amortised.

All 38 existing safety tests plus the three proptests continue to pass.

Closes audit entry F-022 from #3.

Copilot

Pull request overview

Introduces a RegexSet-backed first-pass filter for secret scanning to reduce per-line scanning overhead and address audit finding F-022.

Changes:

Add PatternSet to bundle secret patterns with a compiled RegexSet for fast first-pass matching.
Update build_patterns/DEFAULT_PATTERNS to produce and cache a PatternSet instead of a raw Vec<SecretPattern>.
Refactor both staged-diff and full-diff scanners to use a shared first_match helper.

Copilot · 2026-04-22T19:55:57Z

+    /// Since every individual [`Regex`] in `patterns` is already known to
+    /// compile, the combined [`RegexSet`] compile is expected to succeed; if
+    /// it somehow does not (size limits, etc.), we fall back to an empty
+    /// [`RegexSet`] and the scanner still works correctly — it just loses
+    /// the first-pass speedup and runs the per-pattern check on every line.
+    fn from_patterns(patterns: Vec<SecretPattern>) -> Self {
+        // `Regex::as_str` returns the exact source the regex was built from,
+        // so feeding the same strings into `RegexSet` yields identical
+        // semantics to the per-pattern `Regex::is_match` fallback.
+        let sources: Vec<&str> = patterns.iter().map(|p| p.regex.as_str()).collect();
+        let set = RegexSet::new(&sources).unwrap_or_else(|_| RegexSet::empty());
+        Self { patterns, set }
+    }


PatternSet::from_patterns falls back to RegexSet::empty() when RegexSet::new fails, but the rest of the scanner now relies exclusively on RegexSet via first_match. In that failure mode (e.g., size limits with custom patterns), scanning will return false negatives for all secrets. Preserve correctness by recording the compile failure (e.g., Option<RegexSet> / Result) and falling back to per-pattern Regex::is_match when the set is unavailable (or when patterns is non-empty but the set has length 0 due to a compile fallback).
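
One way to read the suggestion above is to store the combined set as an `Option`, so that a failed build is recorded instead of being hidden behind an empty set. The following is a minimal sketch under stated assumptions, not the PR's code: to stay runnable without external crates it replaces `regex::Regex` and `regex::RegexSet` with hypothetical substring-based stand-ins (`needle`, `CombinedSet`), and `build_set` and `demo_patterns` are invented for illustration. Only the control flow in `first_match` is the point.

```rust
/// Stand-in for the PR's `SecretPattern`; the real type holds a `regex::Regex`.
struct SecretPattern {
    name: &'static str,
    needle: &'static str, // assumption: substring match instead of a regex
}

impl SecretPattern {
    fn is_match(&self, line: &str) -> bool {
        line.contains(self.needle)
    }
}

/// Hypothetical stand-in for `regex::RegexSet`.
struct CombinedSet {
    needles: Vec<&'static str>,
}

impl CombinedSet {
    /// Indices of all matching patterns, in ascending order.
    fn matches(&self, line: &str) -> Vec<usize> {
        self.needles
            .iter()
            .enumerate()
            .filter(|(_, needle)| line.contains(**needle))
            .map(|(idx, _)| idx)
            .collect()
    }
}

struct PatternSet {
    patterns: Vec<SecretPattern>,
    /// `None` records that the combined set could not be built.
    set: Option<CombinedSet>,
}

impl PatternSet {
    /// `build_set` is a hypothetical flag that simulates a failed set build.
    fn from_patterns(patterns: Vec<SecretPattern>, build_set: bool) -> Self {
        let set = build_set.then(|| CombinedSet {
            needles: patterns.iter().map(|p| p.needle).collect(),
        });
        Self { patterns, set }
    }
}

fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
    match &patterns.set {
        // Fast path: one combined pass; the lowest matching index wins.
        Some(set) => {
            let idx = set.matches(line).into_iter().next()?;
            Some(&patterns.patterns[idx])
        }
        // Set unavailable: check each pattern in declaration order.
        None => patterns.patterns.iter().find(|p| p.is_match(line)),
    }
}

/// Two illustrative patterns; the names and needles are made up.
fn demo_patterns(build_set: bool) -> PatternSet {
    let patterns = vec![
        SecretPattern { name: "aws", needle: "AKIA" },
        SecretPattern { name: "github", needle: "ghp_" },
    ];
    PatternSet::from_patterns(patterns, build_set)
}

fn main() {
    for build_set in [true, false] {
        let patterns = demo_patterns(build_set);
        let hit = first_match("+key AKIA123 ghp_abc", &patterns).map(|p| p.name);
        println!("set available: {build_set}, first match: {hit:?}");
    }
}
```

Compared with falling back to the per-pattern loop on every miss, this shape only pays the per-pattern cost when the set is actually unavailable, so non-matching lines keep the single-pass fast path in the normal case.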

Copilot · 2026-04-22T19:55:57Z

+/// A bundle of secret-detection patterns together with a [`RegexSet`] over
+/// the same patterns for a fast first-pass filter.
+///
+/// Per-line scanning calls [`RegexSet::matches`] first; only on hits does it
+/// fall back to individual [`Regex::is_match`] on the matching pattern. This
+/// turns the common no-match case from `O(N × 24 × L)` into
+/// `O(N × L)` (one aggregated automaton pass) plus an `O(1)` membership check.
+pub struct PatternSet {


The doc for PatternSet/from_patterns describes a per-pattern Regex::is_match fallback after RegexSet::matches, but first_match currently returns the first RegexSet hit without any per-pattern verification, and there is no fallback path when the set can’t be built. Please align the documentation with the actual algorithm, or implement the described fallback behavior (especially for the RegexSet compile-failure case).

Copilot · 2026-04-22T19:55:57Z

+/// [`RegexSet::matches`] as a single-automaton first-pass filter — the
+/// common case (no match) returns without touching any individual
+/// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so
+/// taking `.next()` preserves the previous "first pattern wins" semantics.
+fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
+    let idx = patterns.set.matches(line).iter().next()?;
+    Some(&patterns.patterns[idx])


first_match uses only patterns.set.matches(line) to decide a match. If the RegexSet was not successfully compiled (currently represented as RegexSet::empty()), this will always return None even when individual patterns would match. Consider making first_match explicitly fall back to scanning patterns.patterns with Regex::is_match when the set is unavailable/empty while patterns is non-empty.

Suggested change

-/// [`RegexSet::matches`] as a single-automaton first-pass filter — the
-/// common case (no match) returns without touching any individual
-/// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so
-/// taking `.next()` preserves the previous "first pattern wins" semantics.
-fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
-    let idx = patterns.set.matches(line).iter().next()?;
-    Some(&patterns.patterns[idx])
+/// [`RegexSet::matches`] as a single-automaton first-pass filter, then
+/// falls back to checking individual [`Regex`] values if the set yields
+/// no match. This preserves correctness if the set is unavailable while
+/// keeping the fast path for the common case. `SetMatches::iter()`
+/// yields indices in ascending order, so taking `.next()` preserves the
+/// previous "first pattern wins" semantics.
+fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
+    if let Some(idx) = patterns.set.matches(line).iter().next() {
+        return Some(&patterns.patterns[idx]);
+    }
+    patterns
+        .patterns
+        .iter()
+        .find(|pattern| pattern.regex.is_match(line))

Copilot · 2026-04-22T19:55:57Z

-                    });
-                    break; // One match per line is enough
-                }
+            if let Some(pat) = first_match(line, patterns) {


In scan_for_secrets_with_patterns, first_match is called with the full diff line (including the leading '+'), while scan_full_diff_with_patterns strips the '+' before matching. This inconsistency can cause anchored custom patterns (e.g. ^TOKEN_...) to never match in the per-file API. Consider matching against &line[1..] here as well to align semantics across both scanners.

Suggested change

-            if let Some(pat) = first_match(line, patterns) {
+            let content = &line[1..];
+            if let Some(pat) = first_match(content, patterns) {
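
To make the anchoring issue concrete, here is a small sketch of why the leading `+` matters. It is an illustration under assumptions rather than the scanner's code: `added_content` and `anchored_token` are hypothetical helpers, and the anchored regex `^TOKEN_` is modelled with a plain prefix check so the example needs no external crates. It uses `str::strip_prefix` instead of `&line[1..]`, which also avoids a slicing panic if the helper is ever handed an empty line.

```rust
/// Content of an added diff line without its leading `+`;
/// `None` for lines that do not start with `+`.
fn added_content(line: &str) -> Option<&str> {
    line.strip_prefix('+')
}

/// Hypothetical stand-in for an anchored custom pattern such as `^TOKEN_`.
fn anchored_token(text: &str) -> bool {
    text.starts_with("TOKEN_")
}

fn main() {
    let line = "+TOKEN_abc123";

    // Matching the raw diff line: the `+` sits where the anchor expects `T`.
    println!("raw line matches: {}", anchored_token(line));

    // Matching the stripped content, as the full-diff scanner is said to do.
    if let Some(content) = added_content(line) {
        println!("stripped content matches: {}", anchored_token(content));
    }
}
```

With the raw line the anchored check fails, while the stripped content matches, which is the inconsistency the comment above describes.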

Copilot AI review requested due to automatic review settings April 22, 2026 19:51

Sephyi added the audit Codebase audit cleanup (issue #3) label Apr 22, 2026

Sephyi self-assigned this Apr 22, 2026

Copilot started reviewing on behalf of Sephyi April 22, 2026 19:52 View session

Copilot AI reviewed Apr 22, 2026

Sephyi wants to merge 1 commit into development from audit/f-022-regexset-secret-scanning
