Performance

sanitizeData is designed for in-process sanitization of log payloads, request/response objects, and similar data before they leave your application. It is not designed for streaming pipelines or bulk batch processing of large files.

All numbers below are rough throughput on a modern laptop (Apple M-series, Node.js 22). Run the suite yourself with yarn bench.

String-value scanning overhead

String-value scanning (scanStringValues: true, the default) checks every non-sensitive string field for embedded patterns using a fast OR pre-filter before running the full regex suite. The pre-filter cost is low even when no pattern matches, but it is not zero — the overhead scales with the length and quantity of non-sensitive string values in the input.
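The pre-filter idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's actual implementation: the two patterns and the names `buildPrefilter`, `scanString`, and `fullSuiteRuns` are hypothetical and exist only for this example.

```typescript
// Hypothetical patterns; the library's real pattern set is not shown here.
const patterns: RegExp[] = [
  /(api_key=)[^\n&]*/g,
  /(password=)[^\n&]*/g,
];

// Combine all pattern sources into one alternation so that a single pass
// over the string answers "could any pattern match at all?".
function buildPrefilter(all: RegExp[]): RegExp {
  return new RegExp(all.map((p) => `(?:${p.source})`).join("|"));
}

const prefilter = buildPrefilter(patterns);
let fullSuiteRuns = 0; // counts how often the expensive path is taken

function scanString(value: string): string {
  // Fast exit: a clean string pays only for the single pre-filter test,
  // which still has to walk the string, so long values cost more.
  if (!prefilter.test(value)) return value;
  fullSuiteRuns++;
  // Slow path: run every pattern's replacement over the string.
  return patterns.reduce((acc, p) => acc.replace(p, "$1**********"), value);
}
```

In this sketch a clean string returns unchanged after one regex test, while a string containing `api_key=...` falls through to the per-pattern replacements, which is the cost difference the benchmarks below measure.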

The chart below shows the throughput reduction from enabling scanning relative to disabling it, sorted from highest to lowest overhead:

xychart-beta
    title "scanStringValues overhead by workload (sorted)"
    x-axis ["Log stack hit", "10KB string", "Log embed", "Arr-of-strs", "Shallow", "Log stack miss", "Nested", "Flat 1-key", "Flat 5-key", "Arrays"]
    y-axis "overhead pct" 0 --> 100
    bar [88, 68, 66, 47, 18, 18, 14, 10, 9, 3]

Key observations:

  • Log objects with long strings pay the most — a stack trace containing embedded credentials incurs ~88% overhead from the full regex suite running on a long string. A clean stack trace (pre-filter fast-exit) still incurs ~18% from the pre-filter scan alone.
  • 10KB non-sensitive string values incur ~68% overhead: the pre-filter must scan the full length of the string before it can report that nothing matched.
  • Array-of-strings fields (e.g. 100 log lines) pay ~47% — per-item pre-filter cost accumulates across all array elements.
  • Small shallow objects pay ~18% overhead — visible but sub-millisecond (~0.002 ms/call).
  • Large flat objects pay ~9–10% — scanning 45–49 non-sensitive fields costs less per field than scanning fewer long fields.
  • Arrays pay only ~1–5% — the per-item pre-filter cost is negligible compared to the work of traversing each item.

Array scaling

Array processing time scales nearly linearly with item count, so per-item throughput stays roughly flat. The chart below shows items processed per second (ops/s × items/call) across four sizes for simple items (3 fields, 1 sensitive key), with scan enabled and disabled:

xychart-beta
    title "Array throughput items per second thousands"
    x-axis ["1k items", "10k items", "100k items", "1M items"]
    y-axis "items per sec thousands" 0 --> 2400
    line [2161, 2150, 1850, 1700]
    line [2272, 2180, 1890, 1800]

The two lines are scan enabled (lower) and scan disabled (upper). They are nearly indistinguishable — the ~1–5% gap is smaller than benchmark noise at this scale. The slight drop at 100k and 1M items reflects GC pressure from the large input array, not algorithmic degradation.
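As a worked example of the ops/s × items/call conversion, the snippet below recomputes the scan-enabled chart points from the rounded ops/s figures in the benchmark table further down. The names `scanEnabledRuns` and `itemsPerSecThousands` are made up for this sketch. Because the table values are rounded, the 100k point comes out at 1800 rather than the charted 1850, which was presumably derived from unrounded measurements.

```typescript
// Rounded ops/s for simple items with scan enabled, copied from the benchmark table.
const scanEnabledRuns = [
  { itemsPerCall: 1000, opsPerSec: 2161 },
  { itemsPerCall: 10000, opsPerSec: 215 },
  { itemsPerCall: 100000, opsPerSec: 18 },
  { itemsPerCall: 1000000, opsPerSec: 1.7 },
];

// items per second (in thousands) = ops/s × items per call / 1000
function itemsPerSecThousands(itemsPerCall: number, opsPerSec: number): number {
  return Math.round((itemsPerCall * opsPerSec) / 1000);
}

const chartPoints = scanEnabledRuns.map((r) =>
  itemsPerSecThousands(r.itemsPerCall, r.opsPerSec),
);
// chartPoints is [2161, 2150, 1800, 1700]
```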

Object workload benchmarks

Rough throughput on a modern laptop (Apple M-series, Node.js 22):

| Workload | Case | ops/s (scanStringValues: true) | ms/call (scanStringValues: true) | ops/s (scanStringValues: false) | ms/call (scanStringValues: false) | scan overhead |
| --- | --- | --- | --- | --- | --- | --- |
| Shallow object (4 fields) | 1 sensitive key | ~464,000 | ~0.002 | ~563,000 | ~0.002 | ~18% |
| | 4 sensitive keys (all) | ~494,000 | ~0.002 | | | |
| Deeply nested (5 levels) | multiple sensitive keys | ~311,000 | ~0.003 | ~362,000 | ~0.003 | ~14% |
| Log object (5 fields) | embedded credential in string value | ~138,000 | ~0.007 | ~407,000 | ~0.002 | ~66% |
| | stack trace with embedded credentials | ~46,000 | ~0.022 | ~387,000 | ~0.003 | ~88% |
| | clean stack trace (pre-filter fast-exit) | ~318,000 | ~0.003 | ~387,000 | ~0.003 | ~18% |
| Many embedded matches (21 fields) | 20 string values all containing a pattern | ~14,000 | ~0.072 | | | |
| Large flat object (50 fields) | 1 sensitive key | ~82,000 | ~0.012 | ~91,000 | ~0.011 | ~10% |
| | 5 sensitive keys | ~81,000 | ~0.012 | ~89,000 | ~0.011 | ~9% |
| Object with 10KB string field | 1 sensitive key + 10KB non-sensitive value | ~200,000 | ~0.005 | ~619,000 | ~0.002 | ~68% |
| | array-of-strings field (100 clean log lines) | ~223,000 | ~0.004 | ~425,000 | ~0.002 | ~47% |
| Deeply nested (5 × 10 safe strings) | 5 levels, 10 non-sensitive string fields each | ~30,000 | ~0.033 | ~32,000 | ~0.031 | ~6% |
| Array: simple items (3 fields: 1 sensitive) | 1,000 items | ~2,161 | ~0.46 | ~2,272 | ~0.44 | ~5% |
| | 10,000 items | ~215 | ~4.7 | ~218 | ~4.6 | ~1% |
| | 100,000 items | ~18 | ~54 | ~19 | ~53 | ~2% |
| | 1,000,000 items | ~1.7 | ~574 | ~1.8 | ~552 | ~4% |
| Array: complex items (10 fields: 5 sensitive) | 1,000 items | ~590 | ~1.69 | ~565 | ~1.77 | ~0% |
| | 10,000 items | ~55 | ~18.1 | ~58 | ~17.2 | ~5% |
| | 100,000 items | ~5.3 | ~191 | ~5.3 | ~187 | ~0% |
| | 1,000,000 items | ~0.50 | ~2,015 | ~0.50 | ~1,982 | ~2% |

The "Many embedded matches" case is the worst case: every scanned string value actually contains a pattern and runs the full regex suite.

Set scanStringValues: false to recover the pre-scanning performance when you control your data structure and know sensitive values only appear on sensitive-named keys.
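The scan overhead column above appears consistent with the relative throughput loss, 1 − (ops/s with scan) / (ops/s without scan). That formula is inferred from the published numbers rather than taken from the benchmark source, and `scanOverheadPct` is a hypothetical helper name. A short check against three rows:

```typescript
// Relative throughput lost by enabling scanning, as a whole-number percentage.
function scanOverheadPct(opsScanOn: number, opsScanOff: number): number {
  return Math.round((1 - opsScanOn / opsScanOff) * 100);
}

const shallow = scanOverheadPct(464000, 563000); // 18
const embeddedCredential = scanOverheadPct(138000, 407000); // 66
const stackTraceHit = scanOverheadPct(46000, 387000); // 88
```

Since the table's ops/s figures are themselves rounded, recomputing other rows this way can land a point or so away from the listed percentage.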

Cold start cost

On first call with a given set of options, sanitizeData compiles and caches the regex set for that configuration. Subsequent calls with the same options reuse the cache and pay no compile cost.

| Case | ops/s | ms/call |
| --- | --- | --- |
| Warm cache (same options each call) | ~451,000 | ~0.002 |
| Cold start (unique options per call) | ~14,000 | ~0.070 |

The first call is ~32× slower than a warm call due to regex compilation. In steady-state server usage this cost is paid once per process lifetime and is negligible. It becomes visible only in tests or scripts that create many distinct option configurations (e.g. per-request custom patterns).

See Cache memory growth below for the memory implication of many distinct configurations.

removeMatches overhead

removeMatches: true deletes matched fields from objects and matched key=value pairs from strings instead of masking them. The cost is about the same as masking for objects but measurably higher for string inputs (roughly 20% in the table below) because the removal path uses a different regex replacement.

| Workload | ops/s (mask, default) | ms/call (mask, default) | ops/s (remove) | ms/call (remove) | remove overhead |
| --- | --- | --- | --- | --- | --- |
| Shallow object (4 fields, 1 sensitive) | ~440,000 | ~0.002 | ~441,000 | ~0.002 | ~0% |
| Large flat object (50 fields, 1 sensitive) | ~80,000 | ~0.013 | ~77,000 | ~0.013 | ~3% |
| Array (1,000 items, 1 sensitive key) | ~2,132 | ~0.47 | ~2,167 | ~0.46 | ~0% |
| Form-encoded string | ~104,000 | ~0.010 | ~81,000 | ~0.012 | ~22% |

For objects, removal and masking are nearly equivalent: both write a result object with the same traversal cost. For strings, removal is roughly 18–24% slower in these benchmarks because the match-and-remove regex path involves different replacement semantics than the $1<mask>$2 substitution.
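To make the difference between the two string paths concrete, here is a minimal sketch for a form-encoded input. The regex, the `password` key, and the function names `maskField` and `removeField` are assumptions for illustration only. The library's real matchers are not reproduced here, and this sketch puts both capture groups before the value instead of using the exact `$1<mask>$2` layout.

```typescript
// Hypothetical matcher for one sensitive key in a form-encoded string.
// Group 1 is the leading delimiter (start of input or "&"), group 2 is "key=",
// and the value runs up to the next "&" or newline.
const formMatcher = /(^|&)(password=)[^\n&]*/g;

// Masking keeps the delimiter and key and swaps only the value.
function maskField(input: string): string {
  return input.replace(formMatcher, "$1$2**********");
}

// Removal drops the whole pair, then tidies a leading "&" that is left
// behind when the removed pair was the first field.
function removeField(input: string): string {
  return input.replace(formMatcher, "").replace(/^&/, "");
}

const masked = maskField("user=alice&password=hunter2&lang=en");
// masked is "user=alice&password=**********&lang=en"
const removed = removeField("user=alice&password=hunter2&lang=en");
// removed is "user=alice&lang=en"
```

Even in this toy version the removal path has to handle the delimiter in addition to the value, which hints at why the real removal path is somewhat slower on strings.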

String workloads

String input always scans the full string regardless of scanStringValues. The option only affects the object traversal path.

| Workload | ops/s | ms/call | remove ops/s |
| --- | --- | --- | --- |
| Long JSON string (50 sensitive key/value pairs) | ~6,989 | ~0.143 | |
| Form-encoded string (1 sensitive field) | ~102,000 | ~0.010 | ~84,000 |
| Escaped JSON string (1 sensitive field) | ~91,000 | ~0.011 | ~69,000 |

Parser-first JSON strings

When parseJsonStrings: true is set, string inputs that are valid JSON objects or arrays are parsed and sanitized via the object path rather than the regex path. The parse-and-re-serialize overhead is more than offset because object traversal is faster than running every matcher pattern across the full string. The key correctness advantage is that numeric-typed sensitive fields (e.g. {"password":12345}) are masked with numericMask, whereas the default regex path cannot detect or replace bare numeric values in strings.

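| Workload | ops/s (parseJsonStrings: false, default) | ms/call (parseJsonStrings: false, default) | ops/s (parseJsonStrings: true) | ms/call (parseJsonStrings: true) | speedup |
| --- | --- | --- | --- | --- | --- |
| Small JSON string (5 fields, 1 sensitive) | ~78,073 | ~0.0128 | ~312,452 | ~0.0032 | ~4.0× |
| Large JSON string (50 fields, 5 sensitive string + 5 sensitive numeric) | ~17,608 | ~0.0568 | ~58,763 | ~0.0170 | ~3.3× |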

The large input case also demonstrates the correctness benefit: with parseJsonStrings enabled, numeric token_N fields are correctly masked with numericMask, whereas the default regex path leaves them unmasked.
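The correctness point can be illustrated with a small parse-first sketch. Everything here is hypothetical: the key list, the mask constants (the library's numericMask default is not stated in this document, so `0` is a placeholder), the quoted-value regex standing in for the regex path, and the function names.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

const SENSITIVE_KEYS = ["password", "token"]; // hypothetical key list
const STRING_MASK = "**********";
const NUMERIC_MASK = 0; // placeholder value, not the library's documented default

// Walk the parsed value and mask primitives that sit under a sensitive key.
function maskParsed(value: Json): Json {
  if (Array.isArray(value)) return value.map((v) => maskParsed(v));
  if (value === null || typeof value !== "object") return value;
  const out: { [key: string]: Json } = {};
  for (const key of Object.keys(value)) {
    const child = value[key];
    if (SENSITIVE_KEYS.indexOf(key) === -1) out[key] = maskParsed(child);
    else if (typeof child === "number") out[key] = NUMERIC_MASK;
    else if (typeof child === "string") out[key] = STRING_MASK;
    else out[key] = maskParsed(child);
  }
  return out;
}

// Parser-first path: parse, sanitize the object, re-serialize.
function sanitizeJsonString(input: string): string {
  try {
    return JSON.stringify(maskParsed(JSON.parse(input) as Json));
  } catch {
    return input; // not valid JSON; a real implementation would use the regex path
  }
}

// A regex that targets quoted string values cannot see a bare number.
const quotedValueRegex = /("password"\s*:\s*")[^"]*(")/g;
const viaRegex = '{"user":"alice","password":12345}'.replace(quotedValueRegex, "$1**********$2");
const viaParser = sanitizeJsonString('{"user":"alice","password":12345}');
// viaRegex is unchanged; viaParser is '{"user":"alice","password":0}'
```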

parseJsonStrings and scanStringValues interaction

Both options interact on JSON string input. scanStringValues has no effect when parseJsonStrings is disabled — string input goes through the regex path, which does not use scanStringValues. When parseJsonStrings is enabled, string input is parsed to an object first; scanStringValues then applies normally on the object path.

The chart below uses a representative 15-field log payload: 6 sensitive-named fields, 1 field with an embedded credential in a non-sensitive key, 1 stack trace, and 7 safe fields. The upper line is scanStringValues: false; the lower line is scanStringValues: true.

xychart-beta
    title "parseJsonStrings x scanStringValues interaction (15-field log payload, ops/s)"
    x-axis ["parseJsonStrings off", "parseJsonStrings on"]
    y-axis "ops/s" 0 --> 200000
    line [43000, 92000]
    line [43000, 181000]

The lines start at the same point — scanStringValues makes no difference on the regex path. They diverge when parseJsonStrings is on and the object path is active. The embedded-credential field and stack trace add scanStringValues overhead on the object path, explaining the ~2× gap between the two parseJsonStrings: true cases.

| Option combination | ops/s | ms/call |
| --- | --- | --- |
| parseJsonStrings: false, scanStringValues: true (default) | ~43,000 | ~0.023 |
| parseJsonStrings: false, scanStringValues: false | ~43,000 | ~0.023 |
| parseJsonStrings: true, scanStringValues: true | ~92,000 | ~0.011 |
| parseJsonStrings: true, scanStringValues: false | ~181,000 | ~0.0055 |

High pattern counts

Pattern count affects object workloads proportionally when scanStringValues: true. With default patterns disabled:

| Workload | ops/s | ms/call |
| --- | --- | --- |
| 50-field object, 50 custom patterns (no string match) | ~22,000 | ~0.046 |
| 3-field object, 50 custom patterns (no string match) | ~55,000 | ~0.018 |
| 3-field object, 50 custom patterns (string value hits) | ~18,000 | ~0.056 |

Production gotchas

Cache memory growth

sanitizeData caches compiled regex sets in a module-level LRU Map keyed by the full option fingerprint (matchers + patterns + removeMatches flag). The cache holds at most 10 entries; when full, the least-recently-used entry is evicted to make room for the new one.

In steady-state usage — a fixed configuration, possibly with a static list of customPatterns — the cache stays at 1–3 entries and this is not a concern.

If customPatterns vary per call (e.g. injected from user input or request data), entries will cycle through the cache and every call will pay the cold-start regex compilation cost (~32× slower than a warm call). In that scenario, prebuild the options object once (or a small set of them) and reuse it across calls, or set scanStringValues: false, which bypasses the cache entirely.
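The eviction behaviour described in this section can be approximated with a `Map`, which iterates in insertion order. The sketch below is an assumption-laden stand-in, not the library's code: the fingerprint is a plain `JSON.stringify` of the inputs, and `getCompiled`, `compile`, and `compileCount` are names invented for the example.

```typescript
const MAX_ENTRIES = 10; // cache size stated in the text
const cache = new Map<string, RegExp[]>();
let compileCount = 0; // how many times the expensive compile step ran

function compile(patternSources: string[]): RegExp[] {
  compileCount++;
  return patternSources.map((src) => new RegExp(src, "g"));
}

function getCompiled(patternSources: string[], removeMatches: boolean): RegExp[] {
  // Stand-in fingerprint: any stable serialization of the options would do.
  const key = JSON.stringify([patternSources, removeMatches]);
  const hit = cache.get(key);
  if (hit !== undefined) {
    // Re-insert so this entry becomes the most recently used.
    cache.delete(key);
    cache.set(key, hit);
    return hit;
  }
  if (cache.size >= MAX_ENTRIES) {
    // The first key in iteration order is the least recently used.
    const oldest = cache.keys().next().value;
    if (oldest !== undefined) cache.delete(oldest);
  }
  const compiled = compile(patternSources);
  cache.set(key, compiled);
  return compiled;
}
```

With this sketch, calling `getCompiled` repeatedly with one fixed configuration compiles once, while cycling through 11 distinct configurations in order misses on every call, because each lookup finds its entry already evicted. That is the thrashing pattern that reusing a small, fixed set of option objects avoids.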

Form-encoded matcher and multiline strings

The built-in form-encoded matcher uses [^\n&]* to match a field value — stopping at either an & delimiter or a newline. This means content on lines after a matched value is preserved:

Input:  "Error: auth failed — api_key=hunter2\n    at foo (bar.js:10)"
Output: "Error: auth failed — api_key=**********\n    at foo (bar.js:10)"

Stack traces and other multiline fields are safe to scan.
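A quick way to see why the newline in the character class matters is to compare it with a class that only stops at `&`. The matcher below is an assumed shape: only the `[^\n&]*` value class comes from the text above, and the input is a lightly simplified variant of the example.

```typescript
// Assumed matcher shape built around the documented [^\n&]* value class.
const apiKeyMatcher = /(api_key=)[^\n&]*/g;

const traceInput = "Error: auth failed: api_key=hunter2\n    at foo (bar.js:10)";

// Stops at the newline, so the stack frame on the next line survives.
const lineSafe = traceInput.replace(apiKeyMatcher, "$1**********");

// For contrast: a class without \n keeps matching across lines
// and swallows the rest of the trace.
const lineGreedy = traceInput.replace(/(api_key=)[^&]*/g, "$1**********");
```

Here `lineSafe` keeps the `at foo (bar.js:10)` frame intact, while `lineGreedy` ends right after the mask.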

Running the benchmarks

yarn bench

Benchmarks live in bench/sanitize-data.bench.ts.