feat: string-value scanning in objectReplacer with scanStringValues opt-out#300
Conversation
Scans string values on non-sensitive-key fields for embedded sensitive patterns (e.g. message: 'api_key=hunter2' is now masked). Uses a module-level regex cache and OR pre-filter to minimise per-call cost. Adds scanStringValues option (default: true) to opt out and recover pre-feature performance. Expands benchmark suite with before/after pairs and worst-case workloads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughAdds a ChangesString-value scanning feature
Sequence Diagram(s)sequenceDiagram
participant objectReplacer
participant buildStringScanRegexes
participant stringScanCache
participant scanStringValue
objectReplacer->>buildStringScanRegexes: request regexes for current options/matchers
buildStringScanRegexes->>stringScanCache: lookup by config key
alt cache hit
stringScanCache-->>buildStringScanRegexes: return cached regexes
else cache miss
buildStringScanRegexes->>buildStringScanRegexes: compute preFilter + replacement/removal regexes
buildStringScanRegexes->>stringScanCache: store regexes by config key
end
buildStringScanRegexes-->>objectReplacer: preFilter and regexes
objectReplacer->>scanStringValue: apply preFilter and regexes to string value
scanStringValue-->>objectReplacer: return masked/removed string
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
Coverage Report
File Coverage
|
||||||||||||||||||||||||||||||||||||||||||||||||||
ops/s alone doesn't communicate how long a single call takes, which is what matters for request pipeline impact. Each throughput figure now includes the mean call time in µs or ms. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- formEncodedMatcher now uses [^\n&]* so it stops at newlines instead of consuming them, preserving stack traces and other multiline content - zero-width lookahead (?=\n|$) in the mask terminator keeps the newline in output so lines that follow a matched field are not lost - string-scan regex cache (stringScanCache) now caps at 10 entries; LRU eviction via Map insertion order prevents unbounded memory growth - adds 4 new matcher tests covering multiline masking and removal - adds stack-trace preservation test to objectReplacer suite - adds cache eviction correctness test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…cale workloads - adds cold start vs warm cache comparison - adds removeMatches overhead across object, array, and string workloads - adds large 10KB non-sensitive string field benchmark - adds array-of-strings (100 log lines) benchmark - adds form-encoded and escaped JSON string input benchmarks - adds deeply nested object with many safe strings benchmark - extends simple and complex array suites to 1M items Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on gotchas - adds docs/performance.md with sortable overhead bar chart, throughput line chart, full benchmark tables, cold start cost, removeMatches overhead, string workload table, high pattern count table, and production gotcha notes (LRU cache memory growth, form-encoded multiline safety) - updates README performance section with 10KB string row, 1M array row, and a reference link to the new docs/performance.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/replacers.ts (1)
38-43:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCache key can collide for distinct custom matcher closures.
Line 38 uses
m.toString()in the cache key. Different matcher instances can share identical source text but produce different regexes via captured state, causing incorrect cache hits and wrong sanitization behaviour.Suggested fix
+const matcherIds = new WeakMap<DataSanitizationMatcher, number>(); +let nextMatcherId = 0; + +const getMatcherId = (matcher: DataSanitizationMatcher): number => { + const existing = matcherIds.get(matcher); + if (existing !== undefined) return existing; + const id = nextMatcherId++; + matcherIds.set(matcher, id); + return id; +}; + const buildStringScanRegexes = ( matchers: DataSanitizationMatcher[], patterns: string[], removeMatches: boolean, ): StringScanRegexes => { - const key = - matchers.map((m) => m.toString()).join('\x00') + - '\x01' + - patterns.join('\x00') + - '\x01' + - removeMatches; + const key = JSON.stringify({ + matcherIds: matchers.map(getMatcherId), + patterns, + removeMatches, + });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/replacers.ts` around lines 38 - 43, The cache key uses matchers.map(m => m.toString()), which can collide for different matcher closures; change the key construction to incorporate a stable, unique identifier per matcher instead of m.toString(): for RegExp matchers include their source and flags (m.source + '/' + m.flags) and for function/closure matchers assign a persistent id via a WeakMap (e.g., matcherIds) that increments when a matcher is first seen, and use that id in the key along with patterns and removeMatches; update the code that builds key (the const key in this module) to use these identifiers so different closure instances won't collide.src/matchers.ts (1)
59-70:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winCRLF newline pairs are not fully preserved.
The updated pattern stops at
\nbut still consumes\rin\r\ninputs, which can alter multiline content on Windows-style line endings.Suggested fix
- const fieldValue = '[^\\n&]*'; + const fieldValue = '[^\\r\\n&]*'; @@ - const maskField = `(${fieldPrefix})${fieldValue}(&|(?=\\n|$))`; + const maskField = `(${fieldPrefix})${fieldValue}(&|(?=\\r?\\n|$))`;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/matchers.ts` around lines 59 - 70, The regex currently treats newline as \n only so CR in CRLF is consumed; change fieldValue from '[^\\n&]*' to exclude CR as well ('[^\\r\\n&]*') and update the mask and remove patterns to use a CRLF-aware lookahead: use (?=\\r?\\n|$) for maskField and ensure removeLeadingField/removeField rely on the updated fieldValue so they won't capture a stray \r; update the references to fieldValue, maskField, removeLeadingField, and removeField (keeping MATCHER_FLAGS) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bench/sanitize-data.bench.ts`:
- Line 263: Update the misleading size comment "Sizes: 1k / 10k / 100k / 1M /
10M" to reflect the actual benchmark coverage (up to 1,000,000 only) by removing
the 10M entry or replacing it with the correct value (e.g., "Sizes: 1k / 10k /
100k / 1M"); locate the comment text in sanitize-data.bench.ts and edit the
comment so it accurately matches the suite sizes used.
---
Outside diff comments:
In `@src/matchers.ts`:
- Around line 59-70: The regex currently treats newline as \n only so CR in CRLF
is consumed; change fieldValue from '[^\\n&]*' to exclude CR as well
('[^\\r\\n&]*') and update the mask and remove patterns to use a CRLF-aware
lookahead: use (?=\\r?\\n|$) for maskField and ensure
removeLeadingField/removeField rely on the updated fieldValue so they won't
capture a stray \r; update the references to fieldValue, maskField,
removeLeadingField, and removeField (keeping MATCHER_FLAGS) accordingly.
In `@src/replacers.ts`:
- Around line 38-43: The cache key uses matchers.map(m => m.toString()), which
can collide for different matcher closures; change the key construction to
incorporate a stable, unique identifier per matcher instead of m.toString(): for
RegExp matchers include their source and flags (m.source + '/' + m.flags) and
for function/closure matchers assign a persistent id via a WeakMap (e.g.,
matcherIds) that increments when a matcher is first seen, and use that id in the
key along with patterns and removeMatches; update the code that builds key (the
const key in this module) to use these identifiers so different closure
instances won't collide.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 15ad5a5e-d6cc-4edf-b929-b188d240222a
📒 Files selected for processing (7)
README.mdbench/sanitize-data.bench.tsdocs/performance.mdsrc/matchers.tssrc/replacers.tstest/matchers.test.tstest/replacers.test.ts
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 1 file(s) based on 1 unresolved review comment. Files modified:
Commit: The changes have been pushed to the Time taken: |
Fixed 1 file(s) based on 1 unresolved review comment. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
- formEncodedMatcher fieldValue now excludes \r ([^\r\n&]*) so CR is not consumed as part of the matched value in CRLF input; the mask lookahead uses (?=\r?\n|$) to match either LF or CRLF line endings - string-scan cache key now uses per-matcher WeakMap integer IDs instead of m.toString(); closures with identical source text but different captured state no longer hash to the same key - adds two CRLF matcher tests (stop-at-CR, mask-and-preserve) - adds a closure-collision regression test: primes cache with matcherA, verifies matcherB independently masks its own prefix Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
objectReplacer: string values on non-sensitive-key fields are now scanned for embedded sensitive patterns by default (e.g.{ message: 'api_key=hunter2' }→{ message: 'api_key=**********' })scanStringValuesoption (defaulttrue) to opt out and recover pre-feature performancescanStringValues: true/falsepaired comparisons, many-embedded-matches worst case, and wide-object + high-pattern-count casesscanStringValuestradeoffPerformance impact (Apple M-series, Node.js 22)
scanStringValues: truescanStringValues: falseTest plan
yarn test— 93 tests passingyarn lint— cleanyarn format:check— cleanyarn build— cleanyarn bench— all 14 benchmark cases run and print resultsscanStringValues: falserestores pre-feature key-masking behaviour without value scanningChecklist
scanStringValueshas no effect on string input (documented in TSDoc and README)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Bug Fixes
Tests