feat: string-value scanning in objectReplacer with scanStringValues opt-out by ioncache · Pull Request #300 · ioncache/data-sanitization

ioncache · 2026-05-22T19:00:38Z

Summary

Adds string-value scanning to objectReplacer: string values on non-sensitive-key fields are now scanned for embedded sensitive patterns by default (e.g. { message: 'api_key=hunter2' } → { message: 'api_key=**********' })
Adds scanStringValues option (default true) to opt out and recover pre-feature performance
Uses a module-level regex cache and OR pre-filter to minimise per-call cost — first call with a given config pays construction; all subsequent calls are near-free
Expands benchmark suite from 4 to 14 cases, including scanStringValues: true/false paired comparisons, many-embedded-matches worst case, and wide-object + high-pattern-count cases
Updates README Performance section with full before/after throughput tables and documents the scanStringValues tradeoff
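
As a rough illustration of the behaviour summarised above, the following sketch scans string values on non-sensitive keys and uses a single alternation regex as a pre-filter. It is not the library's code: `sanitize`, `scanStringValue`, the pattern list, and the `name=value` matching rule are all assumptions made for this example.

```typescript
// Hypothetical sketch; names and matching rules are assumptions, not the library's API.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

const patterns = ["api_key", "password", "token"]; // assumed sensitive names
const MASK = "**********";

// One cheap alternation test that rejects most strings before any replacement runs.
const preFilter = new RegExp(patterns.join("|"), "i");
// One replacement regex per pattern: keep "name=" (or "name:") and mask the value.
const maskRegexes = patterns.map(
  (p) => new RegExp(`(${p}\\s*[=:]\\s*)[^\\s&,;]+`, "gi"),
);

const scanStringValue = (value: string): string => {
  if (!preFilter.test(value)) return value; // fast path: nothing sensitive-looking
  return maskRegexes.reduce((acc, re) => acc.replace(re, `$1${MASK}`), value);
};

const sanitize = (input: Json, scanStringValues = true): Json => {
  if (typeof input === "string") {
    return scanStringValues ? scanStringValue(input) : input;
  }
  if (Array.isArray(input)) {
    return input.map((item) => sanitize(item, scanStringValues));
  }
  if (input === null || typeof input !== "object") return input;

  const out: { [key: string]: Json } = {};
  for (const [key, value] of Object.entries(input)) {
    // Sensitive key: mask the whole value. Otherwise recurse or scan the string.
    out[key] = preFilter.test(key) ? MASK : sanitize(value, scanStringValues);
  }
  return out;
};

console.log(sanitize({ message: "api_key=hunter2" }));
// { message: 'api_key=**********' }
console.log(sanitize({ message: "api_key=hunter2" }, false));
// { message: 'api_key=hunter2' }
```

Passing `false` as the second argument skips the value scan entirely, which is the trade-off the opt-out is meant to expose.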

Performance impact (Apple M-series, Node.js 22)

| Workload | `scanStringValues: true` | `scanStringValues: false` |
| --- | --- | --- |
| Shallow object (4 fields) | ~249,000 ops/s | ~558,000 ops/s |
| Deeply nested (5 levels) | ~208,000 ops/s | ~389,000 ops/s |
| Object with embedded credential | ~106,000 ops/s | ~399,000 ops/s |
| Many embedded matches (20 fields) | ~13,000 ops/s | n/a |
| Large flat object (50 fields) | ~72,000 ops/s | ~91,000 ops/s |
| Large array (1,000 items) | ~2,200 ops/s | ~2,400 ops/s |

Test plan

yarn test — 93 tests passing
yarn lint — clean
yarn format:check — clean
yarn build — clean
yarn bench — all 14 benchmark cases run and print results
Verify scanStringValues: false restores pre-feature key-masking behaviour without value scanning
Verify embedded credential in string value is masked by default

Checklist

New option documented in README Options table and TSDoc
Performance section updated with full before/after tables
scanStringValues has no effect on string input (documented in TSDoc and README)
Module-level cache keyed on matchers + patterns + removeMatches

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added scanStringValues option (default: true) to scan and sanitize embedded sensitive patterns inside string values on non-sensitive object keys; can be disabled to recover pre-scan object performance.
Documentation
- Added Performance section with benchmark comparisons, guidance, and bench invocation; clarified object vs string input behaviour and string-scan overhead.
Bug Fixes
- Fixed form-encoded masking to preserve newlines and subsequent content.
Tests
- Added unit and extensive benchmark tests covering string-value scanning, removal, custom matchers, nesting, arrays and cache-eviction scenarios.

Scans string values on non-sensitive-key fields for embedded sensitive patterns (e.g. message: 'api_key=hunter2' is now masked). Uses a module-level regex cache and OR pre-filter to minimise per-call cost. Adds scanStringValues option (default: true) to opt out and recover pre-feature performance. Expands benchmark suite with before/after pairs and worst-case workloads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-22T19:00:51Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 13b982bc-5e47-4969-afdc-e91bde4d35c7

📥 Commits

Reviewing files that changed from the base of the PR and between 0c80927 and 059d82f.

📒 Files selected for processing (4)

src/matchers.ts
src/replacers.ts
test/matchers.test.ts
test/replacers.test.ts

📝 Walkthrough

Adds a scanStringValues option (default true) to object sanitization to scan string values on non-sensitive keys for embedded sensitive patterns, with a cached regex builder, matcher newline handling, expanded benchmarks, docs, and tests.

Changes

String-value scanning feature

| Layer / File(s) | Summary |
| --- | --- |
| Type contract for string-value scanning option `src/types.ts` | `DataSanitizationReplacerOptions` gains optional `scanStringValues?: boolean` to control string-value scanning. |
| Public API and README updates `src/index.ts`, `README.md` | JSDoc and README examples updated to mention string-scan options; README adds Performance TOC entry and points to `docs/performance.md`. |
| Core regex cache and objectReplacer refactor `src/replacers.ts` | Adds `StringScanRegexes`, a capped config-keyed cache and `buildStringScanRegexes`; refactors `objectReplacer` to precompute `preFilter` + matcher-driven regexes and run `scanStringValue` (mask/remove) on non-sensitive-key string values when enabled; updates docs/examples. |
| Form-encoded matcher newline handling `src/matchers.ts` | Narrow `fieldValue` to exclude CR/LF and use a lookahead so masking/removal preserves subsequent lines and does not consume newlines. |
| Benchmark suite expansion `bench/sanitize-data.bench.ts` | Expanded benchmarks to compare `scanStringValues: true` vs `false` across many workloads: shallow/deep objects, logs, large flat objects, arrays up to 1M items, custom-patterns, warm/cold cache, and `removeMatches` matrix. |
| Performance documentation `docs/performance.md` | New performance doc with throughput data, scan overhead analysis, array-scaling notes, cold-start vs warm-cache guidance, `removeMatches` overhead, cache growth/eviction details, form-encoded multiline notes, and `yarn bench` instructions. |
| Matcher and replacer test suites `test/matchers.test.ts`, `test/replacers.test.ts` | Add form-encoded multiline and CRLF tests; add `headerMatcher` helper and `objectReplacer` string-value scanning tests covering masking, removal, custom matchers, disabled scanning, stack-trace handling, cache-collision, and cache-eviction. |

Sequence Diagram(s)

sequenceDiagram
  participant objectReplacer
  participant buildStringScanRegexes
  participant stringScanCache
  participant scanStringValue
  objectReplacer->>buildStringScanRegexes: request regexes for current options/matchers
  buildStringScanRegexes->>stringScanCache: lookup by config key
  alt cache hit
    stringScanCache-->>buildStringScanRegexes: return cached regexes
  else cache miss
    buildStringScanRegexes->>buildStringScanRegexes: compute preFilter + replacement/removal regexes
    buildStringScanRegexes->>stringScanCache: store regexes by config key
  end
  buildStringScanRegexes-->>objectReplacer: preFilter and regexes
  objectReplacer->>scanStringValue: apply preFilter and regexes to string value
  scanStringValue-->>objectReplacer: return masked/removed string
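
The hit/miss flow in the diagram can be sketched roughly as follows. The names `getScanRegexes`, `ScanRegexes`, and `buildCount` are hypothetical, and the key here omits matchers for brevity; the real `buildStringScanRegexes` may differ in both shape and detail.

```typescript
// Hypothetical sketch of a config-keyed regex cache; not the library's source.
interface ScanRegexes {
  preFilter: RegExp;
  replacements: RegExp[];
}

const cache = new Map<string, ScanRegexes>();
let buildCount = 0; // instrumentation for this example only

const getScanRegexes = (patterns: string[], removeMatches: boolean): ScanRegexes => {
  // removeMatches is part of the key because a real implementation would build
  // different regexes per mode; this sketch builds the same ones either way.
  const key = JSON.stringify({ patterns, removeMatches });
  const hit = cache.get(key);
  if (hit) return hit; // cache hit: no regex construction

  // Cache miss: pay construction once for this configuration.
  buildCount++;
  const built: ScanRegexes = {
    preFilter: new RegExp(patterns.join("|"), "i"),
    replacements: patterns.map((p) => new RegExp(`(${p}=)[^&\\s]*`, "gi")),
  };
  cache.set(key, built);
  return built;
};

const first = getScanRegexes(["api_key", "token"], false);
const second = getScanRegexes(["api_key", "token"], false);
console.log(first === second, buildCount); // true 1
```

The second call returns the same object without constructing any regex, which is where the warm-cache saving would come from.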

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

ioncache/data-sanitization#282: Related changes to objectReplacer sanitization flow previously introduced.
ioncache/data-sanitization#299: Prior benchmark and README Performance work touching the same bench/docs files.
ioncache/data-sanitization#283: Related form-encoded matcher changes and test coverage.

🐰 I hop through keys and strings with care,
Hunting secrets hidden in text laid bare,
Caches kept tidy, regexes primed to scan,
Benches hum loudly for each workload's plan,
Masks and removals tidy traces by hand.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding string-value scanning capability to objectReplacer with a scanStringValues option for opt-out. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions · 2026-05-22T19:01:04Z

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 100% (🎯 100%) | 154 / 154 |
| 🔵 | Statements | 100% (🎯 100%) | 157 / 157 |
| 🔵 | Functions | 100% (🎯 100%) | 21 / 21 |
| 🔵 | Branches | 100% (🎯 100%) | 90 / 90 |

File Coverage

| File | Stmts | Branches | Functions | Lines |
| --- | --- | --- | --- | --- |
| Changed Files | | | | |
| src/index.ts | 100% | 100% | 100% | 100% |
| src/matchers.ts | 100% | 100% | 100% | 100% |
| src/replacers.ts | 100% | 100% | 100% | 100% |

Generated in workflow #198 for commit 059d82f by the Vitest Coverage Report Action

ops/s alone doesn't communicate how long a single call takes, which is what matters for request pipeline impact. Each throughput figure now includes the mean call time in µs or ms. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
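
For reference, the conversion from throughput to per-call time is simple arithmetic. The helper below is a hypothetical sketch that assumes the mean call time is just the reciprocal of ops/s; the benchmark tool may report a slightly different measured mean.

```typescript
// Hypothetical helper: converts ops/s into an approximate mean time per call.
const formatMeanCallTime = (opsPerSecond: number): string => {
  const microseconds = 1_000_000 / opsPerSecond;
  return microseconds >= 1000
    ? `${(microseconds / 1000).toFixed(2)} ms`
    : `${microseconds.toFixed(2)} µs`;
};

// Figures taken from the performance table in the PR description.
console.log(formatMeanCallTime(249_000)); // 4.02 µs
console.log(formatMeanCallTime(2_200)); // 454.55 µs
```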

- formEncodedMatcher now uses [^\n&]* so it stops at newlines instead of consuming them, preserving stack traces and other multiline content
- zero-width lookahead (?=\n|$) in the mask terminator keeps the newline in output so lines that follow a matched field are not lost
- string-scan regex cache (stringScanCache) now caps at 10 entries; LRU eviction via Map insertion order prevents unbounded memory growth
- adds 4 new matcher tests covering multiline masking and removal
- adds stack-trace preservation test to objectReplacer suite
- adds cache eviction correctness test

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
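
A capped cache that evicts by `Map` insertion order, as the commit message describes, might look roughly like this. `CappedCache` is a hypothetical name, and whether the real `stringScanCache` refreshes recency on a hit is not stated in the PR; this sketch does so by re-inserting the key.

```typescript
// Hypothetical sketch of an LRU-style cache built on Map insertion order.
const MAX_ENTRIES = 10; // cap stated in the commit message

class CappedCache<V> {
  private readonly entries = new Map<string, V>();

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= MAX_ENTRIES) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, value);
  }

  get size(): number {
    return this.entries.size;
  }
}

const lru = new CappedCache<number>();
for (let i = 0; i < 10; i++) lru.set(`k${i}`, i);
lru.get("k0"); // touch k0 so k1 becomes the oldest entry
lru.set("k10", 10); // exceeds the cap, so k1 is evicted
console.log(lru.get("k1"), lru.get("k0"), lru.size); // undefined 0 10
```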

…cale workloads

- adds cold start vs warm cache comparison
- adds removeMatches overhead across object, array, and string workloads
- adds large 10KB non-sensitive string field benchmark
- adds array-of-strings (100 log lines) benchmark
- adds form-encoded and escaped JSON string input benchmarks
- adds deeply nested object with many safe strings benchmark
- extends simple and complex array suites to 1M items

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…on gotchas

- adds docs/performance.md with sortable overhead bar chart, throughput line chart, full benchmark tables, cold start cost, removeMatches overhead, string workload table, high pattern count table, and production gotcha notes (LRU cache memory growth, form-encoded multiline safety)
- updates README performance section with 10KB string row, 1M array row, and a reference link to the new docs/performance.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/replacers.ts (1)

38-43: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Cache key can collide for distinct custom matcher closures.

Line 38 uses m.toString() in the cache key. Different matcher instances can share identical source text but produce different regexes via captured state, causing incorrect cache hits and wrong sanitization behaviour.
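
To make the collision concrete, here is a small sketch. The `Matcher` type and `makePrefixMatcher` factory are assumptions for illustration and may not match the shape of `DataSanitizationMatcher`.

```typescript
// Hypothetical matcher shape: a function that turns a pattern into a RegExp.
type Matcher = (pattern: string) => RegExp;

// Factory whose returned closures share source text but capture different prefixes.
const makePrefixMatcher = (prefix: string): Matcher =>
  (pattern) => new RegExp(`${prefix}${pattern}=\\S+`, "g");

const matcherA = makePrefixMatcher("x-");
const matcherB = makePrefixMatcher("y-");

// Identical source text, so a toString()-based cache key cannot tell them apart.
console.log(matcherA.toString() === matcherB.toString()); // true
// Yet they produce different regexes because of the captured prefix.
console.log(matcherA("token").source === matcherB("token").source); // false
```

With a `toString()` key, whichever matcher is seen first would populate the cache entry that the other then reuses.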

Suggested fix

+const matcherIds = new WeakMap<DataSanitizationMatcher, number>();
+let nextMatcherId = 0;
+
+const getMatcherId = (matcher: DataSanitizationMatcher): number => {
+  const existing = matcherIds.get(matcher);
+  if (existing !== undefined) return existing;
+  const id = nextMatcherId++;
+  matcherIds.set(matcher, id);
+  return id;
+};
+
 const buildStringScanRegexes = (
   matchers: DataSanitizationMatcher[],
   patterns: string[],
   removeMatches: boolean,
 ): StringScanRegexes => {
-  const key =
-    matchers.map((m) => m.toString()).join('\x00') +
-    '\x01' +
-    patterns.join('\x00') +
-    '\x01' +
-    removeMatches;
+  const key = JSON.stringify({
+    matcherIds: matchers.map(getMatcherId),
+    patterns,
+    removeMatches,
+  });

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/replacers.ts` around lines 38 - 43, The cache key uses matchers.map(m =>
m.toString()), which can collide for different matcher closures; change the key
construction to incorporate a stable, unique identifier per matcher instead of
m.toString(): for RegExp matchers include their source and flags (m.source + '/'
+ m.flags) and for function/closure matchers assign a persistent id via a
WeakMap (e.g., matcherIds) that increments when a matcher is first seen, and use
that id in the key along with patterns and removeMatches; update the code that
builds key (the const key in this module) to use these identifiers so different
closure instances won't collide.

src/matchers.ts (1)

59-70: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

CRLF newline pairs are not fully preserved.

The updated pattern stops at \n but still consumes \r in \r\n inputs, which can alter multiline content on Windows-style line endings.

Suggested fix

-  const fieldValue = '[^\\n&]*';
+  const fieldValue = '[^\\r\\n&]*';
@@
-  const maskField = `(${fieldPrefix})${fieldValue}(&|(?=\\n|$))`;
+  const maskField = `(${fieldPrefix})${fieldValue}(&|(?=\\r?\\n|$))`;
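
The effect of the change can be seen with a simplified pair of regexes. The hard-coded `password=` prefix stands in for the real `fieldPrefix`, which is not shown in this thread, so treat this as a sketch of the idea only.

```typescript
// Simplified illustration; "password=" is a stand-in for the real field prefix.
const input = "password=hunter2\r\nnext line";

// Before: only LF is excluded, so the value match swallows the CR.
const lfOnly = /(password=)[^\n&]*(&|(?=\n|$))/g;
// After: CR is excluded too, and the lookahead accepts LF or CRLF.
const crlfAware = /(password=)[^\r\n&]*(&|(?=\r?\n|$))/g;

const mask = (re: RegExp): string => input.replace(re, "$1**********$2");

console.log(JSON.stringify(mask(lfOnly))); // "password=**********\nnext line"
console.log(JSON.stringify(mask(crlfAware))); // "password=**********\r\nnext line"
```

The first output has lost the `\r`, while the second keeps the CRLF pair intact.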

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/matchers.ts` around lines 59 - 70, The regex currently treats newline as
\n only so CR in CRLF is consumed; change fieldValue from '[^\\n&]*' to exclude
CR as well ('[^\\r\\n&]*') and update the mask and remove patterns to use a
CRLF-aware lookahead: use (?=\\r?\\n|$) for maskField and ensure
removeLeadingField/removeField rely on the updated fieldValue so they won't
capture a stray \r; update the references to fieldValue, maskField,
removeLeadingField, and removeField (keeping MATCHER_FLAGS) accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bench/sanitize-data.bench.ts`:
- Line 263: Update the misleading size comment "Sizes: 1k / 10k / 100k / 1M /
10M" to reflect the actual benchmark coverage (up to 1,000,000 only) by removing
the 10M entry or replacing it with the correct value (e.g., "Sizes: 1k / 10k /
100k / 1M"); locate the comment text in sanitize-data.bench.ts and edit the
comment so it accurately matches the suite sizes used.

🪄 Autofix (Beta)

✅ Autofix completed

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 15ad5a5e-d6cc-4edf-b929-b188d240222a

📥 Commits

Reviewing files that changed from the base of the PR and between f21a09a and 9339106.

📒 Files selected for processing (7)

README.md
bench/sanitize-data.bench.ts
docs/performance.md
src/matchers.ts
src/replacers.ts
test/matchers.test.ts
test/replacers.test.ts

coderabbitai · 2026-05-22T21:26:26Z

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 1 file(s) based on 1 unresolved review comment.

Files modified:

bench/sanitize-data.bench.ts

Commit: 0c8092717d59301d8dfc533ce10024055fcc24cf

The changes have been pushed to the worktree-feat+string-value-scanning branch.

Time taken: 1m 45s

Fixed 1 file(s) based on 1 unresolved review comment. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

- formEncodedMatcher fieldValue now excludes \r ([^\r\n&]*) so CR is not consumed as part of the matched value in CRLF input; the mask lookahead uses (?=\r?\n|$) to match either LF or CRLF line endings
- string-scan cache key now uses per-matcher WeakMap integer IDs instead of m.toString(); closures with identical source text but different captured state no longer hash to the same key
- adds two CRLF matcher tests (stop-at-CR, mask-and-preserve)
- adds a closure-collision regression test: primes cache with matcherA, verifies matcherB independently masks its own prefix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ioncache and others added 4 commits May 22, 2026 15:04

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread bench/sanitize-data.bench.ts Outdated

coderabbitai Bot and others added 2 commits May 22, 2026 21:28

fix: apply CodeRabbit auto-fixes

0c80927

Fixed 1 file(s) based on 1 unresolved review comment. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

ioncache merged commit 269ca48 into main May 22, 2026
5 checks passed

ioncache deleted the worktree-feat+string-value-scanning branch May 22, 2026 21:37

ioncache added a commit that referenced this pull request May 22, 2026

chore: mark string-value scanning roadmap items complete (#300)

2f4b4bc

ioncache merged 7 commits into main from worktree-feat+string-value-scanning
