Skip to content

perf: prime the row split cache from the engine's multiline loop (#120)#125

Merged
stevehansen merged 2 commits into
masterfrom
fix/prime-rawsplit-cache-120
May 16, 2026
Merged

perf: prime the row split cache from the engine's multiline loop (#120)#125
stevehansen merged 2 commits into
masterfrom
fix/prime-rawsplit-cache-120

Conversation

@stevehansen
Copy link
Copy Markdown
Owner

Summary

  • Engine's multiline continuation loop computes options.Splitter.Split per iteration to check for unterminated quotes, then discards the final split. The yielded row would lazily Split again on first field access — one redundant Split per multiline-enabled row.
  • Extend IRowFactory<TRow>.Create with an IList<MemoryText>? rawSplit parameter; each factory assigns it to the row's rawSplitLine field. Engine threads the captured split from both the header-init pre-pass and the per-row multiline branch.
  • Non-multiline paths pass null and the row's existing lazy split path is unchanged.

Resolves #120.

What changed

  • Csv/CsvReader.Engine.cs: IRowFactory<TRow>.Create gains IList<MemoryText>? rawSplit parameter. All four factories prime the row's rawSplitLine when non-null. Both Enumerate and EnumerateAsync declare rawSplit at iteration scope and capture from both multiline branches.
  • Csv/CsvReader.cs: rawSplitLine field on ReadLine, ReadLineSpan, ReadLineSpanOptimized moves from private to internal (sibling-nested factory structs don't get private access; sibling-class private access only applies to types nested within the field's owning type).
  • Csv/CsvReader.FromMemory.cs: same for ReadLineFromMemory.
  • The row classes themselves are already internal sealed; InternalsVisibleTo(\"Csv.Tests\") is in place from StringHelpers.cs. No public surface widened.

Perf delta

Path Before After
Multiline-enabled record, single physical line 1 Split in engine (check) + 1 Split on consumer's first field access 1 Split in engine, cache primed; no resplit
Multiline-enabled record, N physical lines N+1 Splits in engine + 1 Split on consumer's first field access N+1 Splits in engine, cache primed; no resplit
Non-multiline record 0 engine Splits + 1 lazy Split on first field access unchanged

Test plan

  • dotnet build clean on netstandard2.0, net8.0, net9.0
  • dotnet test — 169/169 passing (excluding the pre-existing flaky Memory_AllocationComparison GC-ratio test)
  • Existing multiline tests in Csv.Tests/EngineUnificationTests.cs, Csv.Tests/Tests.cs, and Csv.Tests/IssuesTests.cs continue to pass — they cover correctness; the perf change is unobservable without benchmarks (no dedicated regression test added)

🤖 Generated with Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the CSV reading process by allowing pre-split line data to be passed directly to row factories, reducing redundant splitting operations. The IRowFactory.Create method and its various implementations were updated to accept an optional rawSplit parameter, and internal fields were adjusted to store this data. Review feedback suggests further performance improvements by passing the expected column count as an initial capacity to the Split method in both synchronous and asynchronous enumeration paths to avoid unnecessary list reallocations.

Comment thread Csv/CsvReader.Engine.cs Outdated
if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode)
{
var rawSplit = options.Splitter.Split(line, options);
rawSplit = options.Splitter.Split(line, options);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since the engine already knows the expected column count from the headers array at this point, passing it as the initial capacity to the Split method can avoid unnecessary reallocations of the internal list buffer. This aligns with the performance goals of this pull request.

                    rawSplit = options.Splitter.Split(line, options, headers.Length);

Comment thread Csv/CsvReader.Engine.cs Outdated
if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode)
{
var rawSplit = options.Splitter.Split(line, options);
rawSplit = options.Splitter.Split(line, options);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similar to the synchronous Enumerate method, passing headers.Length as the initial capacity here will optimize the list allocation for the split results in the asynchronous path.

                    rawSplit = options.Splitter.Split(line, options, headers.Length);

The engine's multiline continuation loop computes options.Splitter.Split
per iteration to check for unterminated quotes, but the final split was
discarded -- the yielded row would lazily Split again on first field
access. One redundant Split per row whenever AllowNewLineInEnclosedFieldValues
is enabled.

Fix: extend IRowFactory<TRow>.Create with an IList<MemoryText>? rawSplit
parameter. Each factory assigns it to the row's rawSplitLine field
directly. Engine declares rawSplit at the iteration scope, populates it
in the header-init multiline pre-pass and the per-row multiline branch,
and passes it through to factory.Create. Non-multiline paths pass null
and the row's lazy split path is unchanged.

The rawSplitLine field on ReadLine/ReadLineSpan/ReadLineSpanOptimized/
ReadLineFromMemory moves from private to internal (the factory structs
are siblings, not nested-in, so sibling private access doesn't apply).
The classes themselves are already internal sealed; InternalsVisibleTo
("Csv.Tests") is in place. No public surface widened.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevehansen stevehansen force-pushed the fix/prime-rawsplit-cache-120 branch from e66f7f9 to 08f7cc5 Compare May 16, 2026 11:30
In the multiline yield loop, both Enumerate and EnumerateAsync know
the column count from `headers` before re-splitting. Thread it through
to Splitter.Split as the initial list capacity so we avoid the
List<T> grow-and-copy on the hot path.

Per-row pre-sizing only — the header-init split (where headers don't
exist yet) still uses the default capacity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevehansen
Copy link
Copy Markdown
Owner Author

Folded in @gemini-code-assist's capacity-hint suggestion in 76f983c — passes headers.Length to Splitter.Split in both the sync and async yield-loop re-splits.

@gemini-code-assist
Copy link
Copy Markdown

Thanks for the update, @stevehansen. That looks like a solid refinement to ensure the Splitter.Split calls are consistently optimized with the headers.Length hint across both sync and async paths. This should effectively minimize redundant work as intended.

@stevehansen stevehansen merged commit 7e4106b into master May 16, 2026
3 checks passed
@stevehansen stevehansen deleted the fix/prime-rawsplit-cache-120 branch May 16, 2026 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Follow-ups from #118: prime row split cache and check only last field for unterminated quotes

1 participant