feat: string-value scanning in objectReplacer with scanStringValues opt-out#300

Merged: ioncache merged 7 commits into main from worktree-feat+string-value-scanning on May 22, 2026

@ioncache ioncache commented May 22, 2026

Summary

  • Adds string-value scanning to objectReplacer: string values on non-sensitive-key fields are now scanned for embedded sensitive patterns by default (e.g. { message: 'api_key=hunter2' } → { message: 'api_key=**********' })
  • Adds scanStringValues option (default true) to opt out and recover pre-feature performance
  • Uses a module-level regex cache and OR pre-filter to minimise per-call cost — first call with a given config pays construction; all subsequent calls are near-free
  • Expands benchmark suite from 4 to 14 cases, including scanStringValues: true/false paired comparisons, many-embedded-matches worst case, and wide-object + high-pattern-count cases
  • Updates README Performance section with full before/after throughput tables and documents the scanStringValues tradeoff
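
The pre-filter and cache described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/replacers.ts: `ScanRegexes`, `regexCache`, `getScanRegexes`, `scanString`, and the `key=value` masking rule are all hypothetical names and simplifications chosen for the example.

```typescript
// Hypothetical sketch: none of these names come from the library itself.
type ScanRegexes = { preFilter: RegExp; maskers: RegExp[] };

// Module-level cache so regex construction is paid once per configuration.
const regexCache = new Map<string, ScanRegexes>();

// Escape regex metacharacters so a pattern such as "api.key" is matched literally.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

const getScanRegexes = (patterns: string[]): ScanRegexes => {
  const key = JSON.stringify(patterns);
  const cached = regexCache.get(key);
  if (cached) return cached;

  const escaped = patterns.map(escapeRegExp);
  const built: ScanRegexes = {
    // One alternation answers "could this string contain anything sensitive?"
    preFilter: new RegExp(escaped.join('|'), 'i'),
    // One masking regex per pattern: capture "<pattern>=" and match the value after it.
    maskers: escaped.map((p) => new RegExp(`(${p}=)[^\\s&]+`, 'gi')),
  };
  regexCache.set(key, built);
  return built;
};

const scanString = (value: string, patterns: string[]): string => {
  const { preFilter, maskers } = getScanRegexes(patterns);
  // Fast path: strings with no sensitive pattern are returned untouched.
  if (!preFilter.test(value)) return value;
  return maskers.reduce((acc, re) => acc.replace(re, '$1**********'), value);
};

console.log(scanString('api_key=hunter2', ['api_key', 'password'])); // api_key=**********
console.log(scanString('nothing to hide', ['api_key', 'password'])); // nothing to hide
```

In this sketch, a string that fails the single pre-filter test never reaches the per-pattern regexes, which is one plausible reason the "no embedded credential" workloads stay comparatively cheap.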

Performance impact (Apple M-series, Node.js 22)

| Workload | scanStringValues: true | scanStringValues: false |
| --- | --- | --- |
| Shallow object (4 fields) | ~249,000 ops/s | ~558,000 ops/s |
| Deeply nested (5 levels) | ~208,000 ops/s | ~389,000 ops/s |
| Object with embedded credential | ~106,000 ops/s | ~399,000 ops/s |
| Many embedded matches (20 fields) | ~13,000 ops/s | — |
| Large flat object (50 fields) | ~72,000 ops/s | ~91,000 ops/s |
| Large array (1,000 items) | ~2,200 ops/s | ~2,400 ops/s |

Test plan

  • yarn test — 93 tests passing
  • yarn lint — clean
  • yarn format:check — clean
  • yarn build — clean
  • yarn bench — all 14 benchmark cases run and print results
  • Verify scanStringValues: false restores pre-feature key-masking behaviour without value scanning
  • Verify embedded credential in string value is masked by default

Checklist

  • New option documented in README Options table and TSDoc
  • Performance section updated with full before/after tables
  • scanStringValues has no effect on string input (documented in TSDoc and README)
  • Module-level cache keyed on matchers + patterns + removeMatches

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added scanStringValues option (default: true) to scan and sanitize embedded sensitive patterns inside string values on non-sensitive object keys; can be disabled to recover pre-scan object performance.
  • Documentation

    • Added Performance section with benchmark comparisons, guidance, and bench invocation; clarified object vs string input behaviour and string-scan overhead.
  • Bug Fixes

    • Fixed form-encoded masking to preserve newlines and subsequent content.
  • Tests

    • Added unit and extensive benchmark tests covering string-value scanning, removal, custom matchers, nesting, arrays and cache-eviction scenarios.

Review Change Stack

Scans string values on non-sensitive-key fields for embedded sensitive
patterns (e.g. message: 'api_key=hunter2' is now masked). Uses a
module-level regex cache and OR pre-filter to minimise per-call cost.
Adds scanStringValues option (default: true) to opt out and recover
pre-feature performance. Expands benchmark suite with before/after
pairs and worst-case workloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented May 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 13b982bc-5e47-4969-afdc-e91bde4d35c7

📥 Commits

Reviewing files that changed from the base of the PR and between 0c80927 and 059d82f.

📒 Files selected for processing (4)
  • src/matchers.ts
  • src/replacers.ts
  • test/matchers.test.ts
  • test/replacers.test.ts

📝 Walkthrough

Adds a scanStringValues option (default true) to object sanitization to scan string values on non-sensitive keys for embedded sensitive patterns, with a cached regex builder, matcher newline handling, expanded benchmarks, docs, and tests.

Changes

String-value scanning feature

| Layer / File(s) | Summary |
| --- | --- |
| Type contract for string-value scanning option (src/types.ts) | DataSanitizationReplacerOptions gains optional scanStringValues?: boolean to control string-value scanning. |
| Public API and README updates (src/index.ts, README.md) | JSDoc and README examples updated to mention string-scan options; README adds Performance TOC entry and points to docs/performance.md. |
| Core regex cache and objectReplacer refactor (src/replacers.ts) | Adds StringScanRegexes, a capped config-keyed cache and buildStringScanRegexes; refactors objectReplacer to precompute preFilter + matcher-driven regexes and run scanStringValue (mask/remove) on non-sensitive-key string values when enabled; updates docs/examples. |
| Form-encoded matcher newline handling (src/matchers.ts) | Narrow fieldValue to exclude CR/LF and use a lookahead so masking/removal preserves subsequent lines and does not consume newlines. |
| Benchmark suite expansion (bench/sanitize-data.bench.ts) | Expanded benchmarks to compare scanStringValues: true vs false across many workloads: shallow/deep objects, logs, large flat objects, arrays up to 1M items, custom-patterns, warm/cold cache, and removeMatches matrix. |
| Performance documentation (docs/performance.md) | New performance doc with throughput data, scan overhead analysis, array-scaling notes, cold-start vs warm-cache guidance, removeMatches overhead, cache growth/eviction details, form-encoded multiline notes, and yarn bench instructions. |
| Matcher and replacer test suites (test/matchers.test.ts, test/replacers.test.ts) | Add form-encoded multiline and CRLF tests; add headerMatcher helper and objectReplacer string-value scanning tests covering masking, removal, custom matchers, disabled scanning, stack-trace handling, cache-collision, and cache-eviction. |

Sequence Diagram(s)

sequenceDiagram
  participant objectReplacer
  participant buildStringScanRegexes
  participant stringScanCache
  participant scanStringValue
  objectReplacer->>buildStringScanRegexes: request regexes for current options/matchers
  buildStringScanRegexes->>stringScanCache: lookup by config key
  alt cache hit
    stringScanCache-->>buildStringScanRegexes: return cached regexes
  else cache miss
    buildStringScanRegexes->>buildStringScanRegexes: compute preFilter + replacement/removal regexes
    buildStringScanRegexes->>stringScanCache: store regexes by config key
  end
  buildStringScanRegexes-->>objectReplacer: preFilter and regexes
  objectReplacer->>scanStringValue: apply preFilter and regexes to string value
  scanStringValue-->>objectReplacer: return masked/removed string

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 I hop through keys and strings with care,
Hunting secrets hidden in text laid bare,
Caches kept tidy, regexes primed to scan,
Benches hum loudly for each workload's plan,
Masks and removals tidy traces by hand.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding string-value scanning capability to objectReplacer with a scanStringValues option for opt-out. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions Bot commented May 22, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 100% (🎯 100%) | 154 / 154 |
| 🔵 | Statements | 100% (🎯 100%) | 157 / 157 |
| 🔵 | Functions | 100% (🎯 100%) | 21 / 21 |
| 🔵 | Branches | 100% (🎯 100%) | 90 / 90 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| src/index.ts | 100% | 100% | 100% | 100% | |
| src/matchers.ts | 100% | 100% | 100% | 100% | |
| src/replacers.ts | 100% | 100% | 100% | 100% | |

Generated in workflow #198 for commit 059d82f by the Vitest Coverage Report Action

ioncache and others added 4 commits May 22, 2026 15:04
ops/s alone doesn't communicate how long a single call takes, which is
what matters for request pipeline impact. Each throughput figure now
includes the mean call time in µs or ms.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
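
The relationship between throughput and mean call time mentioned in this commit is a simple reciprocal. The helper below is an illustrative sketch (the function names are made up for this example), applied to two of the throughput figures quoted in the PR description rather than to the values actually added to the docs.

```typescript
// Convert a throughput figure (operations per second) into mean time per call in microseconds.
const meanCallTimeMicros = (opsPerSecond: number): number => 1_000_000 / opsPerSecond;

// Format as µs below one millisecond, otherwise as ms.
const formatMeanCallTime = (opsPerSecond: number): string => {
  const micros = meanCallTimeMicros(opsPerSecond);
  return micros >= 1000 ? `${(micros / 1000).toFixed(2)} ms` : `${micros.toFixed(2)} µs`;
};

console.log(formatMeanCallTime(249_000)); // 4.02 µs  (shallow object, scanStringValues: true)
console.log(formatMeanCallTime(2_200)); // 454.55 µs (large array, 1,000 items)
```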
- formEncodedMatcher now uses [^\n&]* so it stops at newlines instead of
  consuming them, preserving stack traces and other multiline content
- zero-width lookahead (?=\n|$) in the mask terminator keeps the newline
  in output so lines that follow a matched field are not lost
- string-scan regex cache (stringScanCache) now caps at 10 entries; LRU
  eviction via Map insertion order prevents unbounded memory growth
- adds 4 new matcher tests covering multiline masking and removal
- adds stack-trace preservation test to objectReplacer suite
- adds cache eviction correctness test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
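
A capped cache with eviction based on Map insertion order, as this commit describes, might look like the sketch below. The `LruCache` class is hypothetical and written only for illustration; in particular, whether the real stringScanCache re-inserts an entry on every hit is an assumption here, not something the commit message confirms.

```typescript
// Hypothetical sketch of a small LRU cache built on Map insertion order.
class LruCache<K, V> {
  private readonly entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used entry.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    // Deleting first ensures an existing key moves to the end of the order.
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}

// Usage with a cap of 2 (the commit caps the real cache at 10 entries).
const cache = new LruCache<string, number>(2);
cache.set('a', 1);
cache.set('b', 2);
cache.get('a'); // touch "a" so "b" becomes the oldest entry
cache.set('c', 3); // exceeds the cap and evicts "b"
console.log(cache.get('b')); // undefined
console.log(cache.get('a')); // 1
```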
…cale workloads

- adds cold start vs warm cache comparison
- adds removeMatches overhead across object, array, and string workloads
- adds large 10KB non-sensitive string field benchmark
- adds array-of-strings (100 log lines) benchmark
- adds form-encoded and escaped JSON string input benchmarks
- adds deeply nested object with many safe strings benchmark
- extends simple and complex array suites to 1M items

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on gotchas

- adds docs/performance.md with sortable overhead bar chart, throughput
  line chart, full benchmark tables, cold start cost, removeMatches overhead,
  string workload table, high pattern count table, and production gotcha notes
  (LRU cache memory growth, form-encoded multiline safety)
- updates README performance section with 10KB string row, 1M array row,
  and a reference link to the new docs/performance.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/replacers.ts (1)

38-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cache key can collide for distinct custom matcher closures.

Line 38 uses m.toString() in the cache key. Different matcher instances can share identical source text but produce different regexes via captured state, causing incorrect cache hits and wrong sanitization behaviour.
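
The collision can be reproduced in isolation. The sketch below assumes a matcher is a function that turns a pattern into a RegExp; the `Matcher` type and `makePrefixMatcher` factory are hypothetical stand-ins, not the library's DataSanitizationMatcher.

```typescript
// Hypothetical matcher shape: a function that builds a RegExp for a given pattern.
type Matcher = (pattern: string) => RegExp;

// The prefix lives in captured state, not in the function's source text.
const makePrefixMatcher = (prefix: string): Matcher =>
  (pattern) => new RegExp(`${prefix}${pattern}=\\S+`, 'g');

const matcherA = makePrefixMatcher('x-');
const matcherB = makePrefixMatcher('y-');

// Both closures stringify to the same source, so a toString()-based key cannot tell them apart.
console.log(matcherA.toString() === matcherB.toString()); // true

// Yet they build different regexes from the same pattern.
console.log(matcherA('token').source === matcherB('token').source); // false
```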

Suggested fix
+const matcherIds = new WeakMap<DataSanitizationMatcher, number>();
+let nextMatcherId = 0;
+
+const getMatcherId = (matcher: DataSanitizationMatcher): number => {
+  const existing = matcherIds.get(matcher);
+  if (existing !== undefined) return existing;
+  const id = nextMatcherId++;
+  matcherIds.set(matcher, id);
+  return id;
+};
+
 const buildStringScanRegexes = (
   matchers: DataSanitizationMatcher[],
   patterns: string[],
   removeMatches: boolean,
 ): StringScanRegexes => {
-  const key =
-    matchers.map((m) => m.toString()).join('\x00') +
-    '\x01' +
-    patterns.join('\x00') +
-    '\x01' +
-    removeMatches;
+  const key = JSON.stringify({
+    matcherIds: matchers.map(getMatcherId),
+    patterns,
+    removeMatches,
+  });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/replacers.ts` around lines 38 - 43, The cache key uses matchers.map(m =>
m.toString()), which can collide for different matcher closures; change the key
construction to incorporate a stable, unique identifier per matcher instead of
m.toString(): for RegExp matchers include their source and flags (m.source + '/'
+ m.flags) and for function/closure matchers assign a persistent id via a
WeakMap (e.g., matcherIds) that increments when a matcher is first seen, and use
that id in the key along with patterns and removeMatches; update the code that
builds key (the const key in this module) to use these identifiers so different
closure instances won't collide.
src/matchers.ts (1)

59-70: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

CRLF newline pairs are not fully preserved.

The updated pattern stops at \n but still consumes \r in \r\n inputs, which can alter multiline content on Windows-style line endings.
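
The effect is easy to see with simplified stand-ins for the mask pattern. In this sketch the field name `password`, the `g` flag, and the fixed ten-asterisk mask are assumptions for illustration; the real matcher builds its patterns from fieldPrefix and MATCHER_FLAGS.

```typescript
// Simplified stand-ins for the form-encoded mask pattern before and after the suggested fix.
const lfOnly = /(password=)[^\n&]*(&|(?=\n|$))/g;
const crlfAware = /(password=)[^\r\n&]*(&|(?=\r?\n|$))/g;

const input = 'password=hunter2\r\nnext line';

const maskedLfOnly = input.replace(lfOnly, '$1**********$2');
const maskedCrlfAware = input.replace(crlfAware, '$1**********$2');

// JSON.stringify makes the line-ending characters visible.
console.log(JSON.stringify(maskedLfOnly)); // "password=**********\nnext line" (the \r was consumed)
console.log(JSON.stringify(maskedCrlfAware)); // "password=**********\r\nnext line"
```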

Suggested fix
-  const fieldValue = '[^\\n&]*';
+  const fieldValue = '[^\\r\\n&]*';
@@
-  const maskField = `(${fieldPrefix})${fieldValue}(&|(?=\\n|$))`;
+  const maskField = `(${fieldPrefix})${fieldValue}(&|(?=\\r?\\n|$))`;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/matchers.ts` around lines 59 - 70, The regex currently treats newline as
\n only so CR in CRLF is consumed; change fieldValue from '[^\\n&]*' to exclude
CR as well ('[^\\r\\n&]*') and update the mask and remove patterns to use a
CRLF-aware lookahead: use (?=\\r?\\n|$) for maskField and ensure
removeLeadingField/removeField rely on the updated fieldValue so they won't
capture a stray \r; update the references to fieldValue, maskField,
removeLeadingField, and removeField (keeping MATCHER_FLAGS) accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bench/sanitize-data.bench.ts`:
- Line 263: Update the misleading size comment "Sizes: 1k / 10k / 100k / 1M /
10M" to reflect the actual benchmark coverage (up to 1,000,000 only) by removing
the 10M entry or replacing it with the correct value (e.g., "Sizes: 1k / 10k /
100k / 1M"); locate the comment text in sanitize-data.bench.ts and edit the
comment so it accurately matches the suite sizes used.
🪄 Autofix (Beta)

✅ Autofix completed


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 15ad5a5e-d6cc-4edf-b929-b188d240222a

📥 Commits

Reviewing files that changed from the base of the PR and between f21a09a and 9339106.

📒 Files selected for processing (7)
  • README.md
  • bench/sanitize-data.bench.ts
  • docs/performance.md
  • src/matchers.ts
  • src/replacers.ts
  • test/matchers.test.ts
  • test/replacers.test.ts

Comment thread bench/sanitize-data.bench.ts Outdated

coderabbitai Bot commented May 22, 2026

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 1 file(s) based on 1 unresolved review comment.

Files modified:

  • bench/sanitize-data.bench.ts

Commit: 0c8092717d59301d8dfc533ce10024055fcc24cf

The changes have been pushed to the worktree-feat+string-value-scanning branch.

Time taken: 1m 45s

coderabbitai Bot and others added 2 commits May 22, 2026 21:28
Fixed 1 file(s) based on 1 unresolved review comment.

Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
- formEncodedMatcher fieldValue now excludes \r ([^\r\n&]*) so CR is not
  consumed as part of the matched value in CRLF input; the mask lookahead
  uses (?=\r?\n|$) to match either LF or CRLF line endings
- string-scan cache key now uses per-matcher WeakMap integer IDs instead of
  m.toString(); closures with identical source text but different captured
  state no longer hash to the same key
- adds two CRLF matcher tests (stop-at-CR, mask-and-preserve)
- adds a closure-collision regression test: primes cache with matcherA,
  verifies matcherB independently masks its own prefix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ioncache ioncache merged commit 269ca48 into main May 22, 2026
5 checks passed
@ioncache ioncache deleted the worktree-feat+string-value-scanning branch May 22, 2026 21:37