F-022: perf(services): RegexSet first-pass for secret scanning#25
F-022: perf(services): RegexSet first-pass for secret scanning#25Sephyi wants to merge 1 commit intodevelopmentfrom
Conversation
Secret scanning previously ran each of the 24 built-in regexes against every added diff line, making the cost O(N x 24 x L) per scan. The common case (no secret on the line) paid the full per-pattern overhead. Bundle the patterns into a new `PatternSet` that pairs the `Vec<SecretPattern>` with a `regex::RegexSet` built from the same sources. Per-line scanning now asks the aggregated automaton first; only lines that actually hit a pattern resolve an index back to the owning `SecretPattern`. `RegexSet::matches().iter()` yields indices in ascending order, so taking `.next()` preserves the prior "first pattern wins" semantics — output shape of `SecretMatch` is unchanged. The `RegexSet` is compiled once per `build_patterns` call (and cached via `LazyLock` for the default pattern set), so build cost is amortised. All 38 existing safety tests plus the three proptests continue to pass. Closes audit entry F-022 from #3.
There was a problem hiding this comment.
Pull request overview
Introduces a RegexSet-backed first-pass filter for secret scanning to reduce per-line scanning overhead and address audit finding F-022.
Changes:
- Add
PatternSetto bundle secret patterns with a compiledRegexSetfor fast first-pass matching. - Update
build_patterns/DEFAULT_PATTERNSto produce and cache aPatternSetinstead of a rawVec<SecretPattern>. - Refactor both staged-diff and full-diff scanners to use a shared
first_matchhelper.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Since every individual [`Regex`] in `patterns` is already known to | ||
| /// compile, the combined [`RegexSet`] compile is expected to succeed; if | ||
| /// it somehow does not (size limits, etc.), we fall back to an empty | ||
| /// [`RegexSet`] and the scanner still works correctly — it just loses | ||
| /// the first-pass speedup and runs the per-pattern check on every line. | ||
| fn from_patterns(patterns: Vec<SecretPattern>) -> Self { | ||
| // `Regex::as_str` returns the exact source the regex was built from, | ||
| // so feeding the same strings into `RegexSet` yields identical | ||
| // semantics to the per-pattern `Regex::is_match` fallback. | ||
| let sources: Vec<&str> = patterns.iter().map(|p| p.regex.as_str()).collect(); | ||
| let set = RegexSet::new(&sources).unwrap_or_else(|_| RegexSet::empty()); | ||
| Self { patterns, set } | ||
| } |
There was a problem hiding this comment.
PatternSet::from_patterns falls back to RegexSet::empty() when RegexSet::new fails, but the rest of the scanner now relies exclusively on RegexSet via first_match. In that failure mode (e.g., size limits with custom patterns), scanning will return false negatives for all secrets. Preserve correctness by recording the compile failure (e.g., Option<RegexSet> / Result) and falling back to per-pattern Regex::is_match when the set is unavailable (or when patterns is non-empty but the set has length 0 due to a compile fallback).
| /// A bundle of secret-detection patterns together with a [`RegexSet`] over | ||
| /// the same patterns for a fast first-pass filter. | ||
| /// | ||
| /// Per-line scanning calls [`RegexSet::matches`] first; only on hits does it | ||
| /// fall back to individual [`Regex::is_match`] on the matching pattern. This | ||
| /// turns the common no-match case from `O(N × 24 × L)` into | ||
| /// `O(N × L)` (one aggregated automaton pass) plus an `O(1)` membership check. | ||
| pub struct PatternSet { |
There was a problem hiding this comment.
The doc for PatternSet/from_patterns describes a per-pattern Regex::is_match fallback after RegexSet::matches, but first_match currently returns the first RegexSet hit without any per-pattern verification, and there is no fallback path when the set can’t be built. Please align the documentation with the actual algorithm, or implement the described fallback behavior (especially for the RegexSet compile-failure case).
| /// [`RegexSet::matches`] as a single-automaton first-pass filter — the | ||
| /// common case (no match) returns without touching any individual | ||
| /// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so | ||
| /// taking `.next()` preserves the previous "first pattern wins" semantics. | ||
| fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> { | ||
| let idx = patterns.set.matches(line).iter().next()?; | ||
| Some(&patterns.patterns[idx]) |
There was a problem hiding this comment.
first_match uses only patterns.set.matches(line) to decide a match. If the RegexSet was not successfully compiled (currently represented as RegexSet::empty()), this will always return None even when individual patterns would match. Consider making first_match explicitly fall back to scanning patterns.patterns with Regex::is_match when the set is unavailable/empty while patterns is non-empty.
| /// [`RegexSet::matches`] as a single-automaton first-pass filter — the | |
| /// common case (no match) returns without touching any individual | |
| /// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so | |
| /// taking `.next()` preserves the previous "first pattern wins" semantics. | |
| fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> { | |
| let idx = patterns.set.matches(line).iter().next()?; | |
| Some(&patterns.patterns[idx]) | |
| /// [`RegexSet::matches`] as a single-automaton first-pass filter, then | |
| /// falls back to checking individual [`Regex`] values if the set yields | |
| /// no match. This preserves correctness if the set is unavailable while | |
| /// keeping the fast path for the common case. `SetMatches::iter()` | |
| /// yields indices in ascending order, so taking `.next()` preserves the | |
| /// previous "first pattern wins" semantics. | |
| fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> { | |
| if let Some(idx) = patterns.set.matches(line).iter().next() { | |
| return Some(&patterns.patterns[idx]); | |
| } | |
| patterns | |
| .patterns | |
| .iter() | |
| .find(|pattern| pattern.regex.is_match(line)) |
| }); | ||
| break; // One match per line is enough | ||
| } | ||
| if let Some(pat) = first_match(line, patterns) { |
There was a problem hiding this comment.
In scan_for_secrets_with_patterns, first_match is called with the full diff line (including the leading '+'), while scan_full_diff_with_patterns strips the '+' before matching. This inconsistency can cause anchored custom patterns (e.g. ^TOKEN_...) to never match in the per-file API. Consider matching against &line[1..] here as well to align semantics across both scanners.
| if let Some(pat) = first_match(line, patterns) { | |
| let content = &line[1..]; | |
| if let Some(pat) = first_match(content, patterns) { |
Summary
perf(services): RegexSet first-pass for secret scanning.
Audit context
Closes audit entry F-022 from #3.
Verification
cargo fmt --checkcargo clippy --all-targets --all-features -- -D warningscargo test --all-targets