Skip to content

F-022: perf(services): RegexSet first-pass for secret scanning#25

Open
Sephyi wants to merge 1 commit intodevelopmentfrom
audit/f-022-regexset-secret-scanning
Open

F-022: perf(services): RegexSet first-pass for secret scanning#25
Sephyi wants to merge 1 commit intodevelopmentfrom
audit/f-022-regexset-secret-scanning

Conversation

@Sephyi
Copy link
Copy Markdown
Owner

@Sephyi Sephyi commented Apr 22, 2026

Summary

perf(services): RegexSet first-pass for secret scanning.

Audit context

Closes audit entry F-022 from #3.

Verification

  • cargo fmt --check
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo test --all-targets

Note: one pre-existing test porcelain_exits_within_timeout_with_no_staged_changes is a known macOS cold-start flake that reproduces on unmodified development — unrelated to this change.

Secret scanning previously ran each of the 24 built-in regexes against
every added diff line, making the cost O(N x 24 x L) per scan. The
common case (no secret on the line) paid the full per-pattern overhead.

Bundle the patterns into a new `PatternSet` that pairs the
`Vec<SecretPattern>` with a `regex::RegexSet` built from the same
sources. Per-line scanning now asks the aggregated automaton first; only
lines that actually hit a pattern resolve an index back to the owning
`SecretPattern`. `RegexSet::matches().iter()` yields indices in
ascending order, so taking `.next()` preserves the prior
"first pattern wins" semantics — output shape of `SecretMatch` is
unchanged.

The `RegexSet` is compiled once per `build_patterns` call (and cached
via `LazyLock` for the default pattern set), so build cost is amortised.
All 38 existing safety tests plus the three proptests continue to pass.

Closes audit entry F-022 from #3.
Copilot AI review requested due to automatic review settings April 22, 2026 19:51
@Sephyi Sephyi added the audit Codebase audit cleanup (issue #3) label Apr 22, 2026
@Sephyi Sephyi self-assigned this Apr 22, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a RegexSet-backed first-pass filter for secret scanning to reduce per-line scanning overhead and address audit finding F-022.

Changes:

  • Add PatternSet to bundle secret patterns with a compiled RegexSet for fast first-pass matching.
  • Update build_patterns/DEFAULT_PATTERNS to produce and cache a PatternSet instead of a raw Vec<SecretPattern>.
  • Refactor both staged-diff and full-diff scanners to use a shared first_match helper.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/services/safety.rs
Comment on lines +62 to +74
/// Since every individual [`Regex`] in `patterns` is already known to
/// compile, the combined [`RegexSet`] compile is expected to succeed; if
/// it somehow does not (size limits, etc.), we fall back to an empty
/// [`RegexSet`] and the scanner still works correctly — it just loses
/// the first-pass speedup and runs the per-pattern check on every line.
fn from_patterns(patterns: Vec<SecretPattern>) -> Self {
// `Regex::as_str` returns the exact source the regex was built from,
// so feeding the same strings into `RegexSet` yields identical
// semantics to the per-pattern `Regex::is_match` fallback.
let sources: Vec<&str> = patterns.iter().map(|p| p.regex.as_str()).collect();
let set = RegexSet::new(&sources).unwrap_or_else(|_| RegexSet::empty());
Self { patterns, set }
}
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PatternSet::from_patterns falls back to RegexSet::empty() when RegexSet::new fails, but the rest of the scanner now relies exclusively on RegexSet via first_match. In that failure mode (e.g., size limits with custom patterns), scanning will return false negatives for all secrets. Preserve correctness by recording the compile failure (e.g., Option<RegexSet> / Result) and falling back to per-pattern Regex::is_match when the set is unavailable (or when patterns is non-empty but the set has length 0 due to a compile fallback).

Copilot uses AI. Check for mistakes.
Comment thread src/services/safety.rs
Comment on lines +26 to +33
/// A bundle of secret-detection patterns together with a [`RegexSet`] over
/// the same patterns for a fast first-pass filter.
///
/// Per-line scanning calls [`RegexSet::matches`] first; only on hits does it
/// fall back to individual [`Regex::is_match`] on the matching pattern. This
/// turns the common no-match case from `O(N × 24 × L)` into
/// `O(N × L)` (one aggregated automaton pass) plus an `O(1)` membership check.
pub struct PatternSet {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc for PatternSet/from_patterns describes a per-pattern Regex::is_match fallback after RegexSet::matches, but first_match currently returns the first RegexSet hit without any per-pattern verification, and there is no fallback path when the set can’t be built. Please align the documentation with the actual algorithm, or implement the described fallback behavior (especially for the RegexSet compile-failure case).

Copilot uses AI. Check for mistakes.
Comment thread src/services/safety.rs
Comment on lines +255 to +261
/// [`RegexSet::matches`] as a single-automaton first-pass filter — the
/// common case (no match) returns without touching any individual
/// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so
/// taking `.next()` preserves the previous "first pattern wins" semantics.
fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
let idx = patterns.set.matches(line).iter().next()?;
Some(&patterns.patterns[idx])
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first_match uses only patterns.set.matches(line) to decide a match. If the RegexSet was not successfully compiled (currently represented as RegexSet::empty()), this will always return None even when individual patterns would match. Consider making first_match explicitly fall back to scanning patterns.patterns with Regex::is_match when the set is unavailable/empty while patterns is non-empty.

Suggested change
/// [`RegexSet::matches`] as a single-automaton first-pass filter — the
/// common case (no match) returns without touching any individual
/// [`Regex`]. `SetMatches::iter()` yields indices in ascending order, so
/// taking `.next()` preserves the previous "first pattern wins" semantics.
fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
let idx = patterns.set.matches(line).iter().next()?;
Some(&patterns.patterns[idx])
/// [`RegexSet::matches`] as a single-automaton first-pass filter, then
/// falls back to checking individual [`Regex`] values if the set yields
/// no match. This preserves correctness if the set is unavailable while
/// keeping the fast path for the common case. `SetMatches::iter()`
/// yields indices in ascending order, so taking `.next()` preserves the
/// previous "first pattern wins" semantics.
fn first_match<'a>(line: &str, patterns: &'a PatternSet) -> Option<&'a SecretPattern> {
if let Some(idx) = patterns.set.matches(line).iter().next() {
return Some(&patterns.patterns[idx]);
}
patterns
.patterns
.iter()
.find(|pattern| pattern.regex.is_match(line))

Copilot uses AI. Check for mistakes.
Comment thread src/services/safety.rs
});
break; // One match per line is enough
}
if let Some(pat) = first_match(line, patterns) {
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In scan_for_secrets_with_patterns, first_match is called with the full diff line (including the leading '+'), while scan_full_diff_with_patterns strips the '+' before matching. This inconsistency can cause anchored custom patterns (e.g. ^TOKEN_...) to never match in the per-file API. Consider matching against &line[1..] here as well to align semantics across both scanners.

Suggested change
if let Some(pat) = first_match(line, patterns) {
let content = &line[1..];
if let Some(pat) = first_match(content, patterns) {

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

audit Codebase audit cleanup (issue #3)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants