perf: prime the row split cache from the engine's multiline loop (#120) by stevehansen · Pull Request #125 · stevehansen/csv

stevehansen · 2026-05-16T11:27:44Z

Summary

Engine's multiline continuation loop computes options.Splitter.Split per iteration to check for unterminated quotes, then discards the final split. The yielded row would lazily Split again on first field access — one redundant Split per multiline-enabled row.
Extend IRowFactory<TRow>.Create with an IList<MemoryText>? rawSplit parameter; each factory assigns it to the row's rawSplitLine field. Engine threads the captured split from both the header-init pre-pass and the per-row multiline branch.
Non-multiline paths pass null and the row's existing lazy split path is unchanged.

Resolves #120.

What changed

Csv/CsvReader.Engine.cs: IRowFactory<TRow>.Create gains IList<MemoryText>? rawSplit parameter. All four factories prime the row's rawSplitLine when non-null. Both Enumerate and EnumerateAsync declare rawSplit at iteration scope and capture from both multiline branches.
Csv/CsvReader.cs: rawSplitLine field on ReadLine, ReadLineSpan, ReadLineSpanOptimized moves from private to internal (sibling-nested factory structs don't get private access; sibling-class private access only applies to types nested within the field's owning type).
Csv/CsvReader.FromMemory.cs: same for ReadLineFromMemory.
The row classes themselves are already internal sealed; InternalsVisibleTo(\"Csv.Tests\") is in place from StringHelpers.cs. No public surface widened.

Perf delta

Path	Before	After
Multiline-enabled record, single physical line	1 Split in engine (check) + 1 Split on consumer's first field access	1 Split in engine, cache primed; no resplit
Multiline-enabled record, N physical lines	N+1 Splits in engine + 1 Split on consumer's first field access	N+1 Splits in engine, cache primed; no resplit
Non-multiline record	0 engine Splits + 1 lazy Split on first field access	unchanged

Test plan

dotnet build clean on netstandard2.0, net8.0, net9.0
dotnet test — 169/169 passing (excluding the pre-existing flaky Memory_AllocationComparison GC-ratio test)
Existing multiline tests in Csv.Tests/EngineUnificationTests.cs, Csv.Tests/Tests.cs, and Csv.Tests/IssuesTests.cs continue to pass — they cover correctness; the perf change is unobservable without benchmarks (no dedicated regression test added)

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request optimizes the CSV reading process by allowing pre-split line data to be passed directly to row factories, reducing redundant splitting operations. The IRowFactory.Create method and its various implementations were updated to accept an optional rawSplit parameter, and internal fields were adjusted to store this data. Review feedback suggests further performance improvements by passing the expected column count as an initial capacity to the Split method in both synchronous and asynchronous enumeration paths to avoid unnecessary list reallocations.

gemini-code-assist · 2026-05-16T11:30:00Z

                if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode)
                {
-                    var rawSplit = options.Splitter.Split(line, options);
+                    rawSplit = options.Splitter.Split(line, options);


Since the engine already knows the expected column count from the headers array at this point, passing it as the initial capacity to the Split method can avoid unnecessary reallocations of the internal list buffer. This aligns with the performance goals of this pull request.

rawSplit = options.Splitter.Split(line, options, headers.Length);

gemini-code-assist · 2026-05-16T11:30:00Z

                if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode)
                {
-                    var rawSplit = options.Splitter.Split(line, options);
+                    rawSplit = options.Splitter.Split(line, options);


Similar to the synchronous Enumerate method, passing headers.Length as the initial capacity here will optimize the list allocation for the split results in the asynchronous path.

rawSplit = options.Splitter.Split(line, options, headers.Length);

The engine's multiline continuation loop computes options.Splitter.Split per iteration to check for unterminated quotes, but the final split was discarded -- the yielded row would lazily Split again on first field access. One redundant Split per row whenever AllowNewLineInEnclosedFieldValues is enabled. Fix: extend IRowFactory<TRow>.Create with an IList<MemoryText>? rawSplit parameter. Each factory assigns it to the row's rawSplitLine field directly. Engine declares rawSplit at the iteration scope, populates it in the header-init multiline pre-pass and the per-row multiline branch, and passes it through to factory.Create. Non-multiline paths pass null and the row's lazy split path is unchanged. The rawSplitLine field on ReadLine/ReadLineSpan/ReadLineSpanOptimized/ ReadLineFromMemory moves from private to internal (the factory structs are siblings, not nested-in, so sibling private access doesn't apply). The classes themselves are already internal sealed; InternalsVisibleTo ("Csv.Tests") is in place. No public surface widened. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In the multiline yield loop, both Enumerate and EnumerateAsync know the column count from `headers` before re-splitting. Thread it through to Splitter.Split as the initial list capacity so we avoid the List<T> grow-and-copy on the hot path. Per-row pre-sizing only — the header-init split (where headers don't exist yet) still uses the default capacity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

stevehansen · 2026-05-16T11:42:01Z

Folded in @gemini-code-assist's capacity-hint suggestion in 76f983c — passes headers.Length to Splitter.Split in both the sync and async yield-loop re-splits.

gemini-code-assist · 2026-05-16T11:43:22Z

Thanks for the update, @stevehansen. That looks like a solid refinement to ensure the Splitter.Split calls are consistently optimized with the headers.Length hint across both sync and async paths. This should effectively minimize redundant work as intended.

gemini-code-assist Bot reviewed May 16, 2026

View reviewed changes

stevehansen force-pushed the fix/prime-rawsplit-cache-120 branch from e66f7f9 to 08f7cc5 Compare May 16, 2026 11:30

stevehansen merged commit 7e4106b into master May 16, 2026
3 checks passed

stevehansen deleted the fix/prime-rawsplit-cache-120 branch May 16, 2026 11:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: prime the row split cache from the engine's multiline loop (#120)#125

perf: prime the row split cache from the engine's multiline loop (#120)#125
stevehansen merged 2 commits into
masterfrom
fix/prime-rawsplit-cache-120

stevehansen commented May 16, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 16, 2026

Uh oh!

gemini-code-assist Bot May 16, 2026

Uh oh!

stevehansen commented May 16, 2026

Uh oh!

gemini-code-assist Bot commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stevehansen commented May 16, 2026

Summary

What changed

Perf delta

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

stevehansen commented May 16, 2026

Uh oh!

gemini-code-assist Bot commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant