perf: prime the row split cache from the engine's multiline loop (#120) #125
Code Review
This pull request optimizes the CSV reading process by allowing pre-split line data to be passed directly to row factories, reducing redundant splitting operations. The IRowFactory.Create method and its various implementations were updated to accept an optional rawSplit parameter, and internal fields were adjusted to store this data. Review feedback suggests further performance improvements by passing the expected column count as an initial capacity to the Split method in both synchronous and asynchronous enumeration paths to avoid unnecessary list reallocations.
  if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode)
  {
-     var rawSplit = options.Splitter.Split(line, options);
+     rawSplit = options.Splitter.Split(line, options);
Since the engine already knows the expected column count from the headers array at this point, passing it as the initial capacity to the Split method can avoid unnecessary reallocations of the internal list buffer. This aligns with the performance goals of this pull request.
rawSplit = options.Splitter.Split(line, options, headers.Length);
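To illustrate why a capacity hint can help, here is a minimal sketch. It assumes the splitter collects fields into a `List<T>`, which the review comment implies but the diff does not show. `CapacityHintSketch` and `SplitSimple` are hypothetical stand-ins, not the library's `Split` method, and the sketch ignores quoting entirely.

```csharp
using System;
using System.Collections.Generic;

public static class CapacityHintSketch
{
    // Hypothetical stand-in for a splitter; NOT the library's Split method.
    // initialCapacity plays the role of the headers.Length hint from the review.
    public static List<string> SplitSimple(string line, char separator, int initialCapacity = 0)
    {
        // A positive hint sizes the backing array once; without it the list
        // starts empty and has to grow (allocate and copy) as fields are added.
        var fields = initialCapacity > 0
            ? new List<string>(initialCapacity)
            : new List<string>();

        int start = 0;
        for (int i = 0; i <= line.Length; i++)
        {
            // A separator or the end of the line closes the current field.
            if (i == line.Length || line[i] == separator)
            {
                fields.Add(line.Substring(start, i - start));
                start = i + 1;
            }
        }
        return fields;
    }

    public static void Main()
    {
        string[] headers = { "id", "name", "city" };

        var hinted = SplitSimple("1,Ada,London", ',', headers.Length);
        var unhinted = SplitSimple("1,Ada,London", ',');

        Console.WriteLine($"hinted: {hinted.Count} fields, capacity {hinted.Capacity}");
        Console.WriteLine($"unhinted: {unhinted.Count} fields, capacity {unhinted.Capacity}");
    }
}
```

When the hint matches the column count, the list never has to grow. A row with more fields than headers still works, because the list simply grows as usual.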
The engine's multiline continuation loop computes options.Splitter.Split
per iteration to check for unterminated quotes, but the final split was
discarded -- the yielded row would lazily Split again on first field
access. One redundant Split per row whenever AllowNewLineInEnclosedFieldValues
is enabled.
Fix: extend IRowFactory<TRow>.Create with an IList<MemoryText>? rawSplit
parameter. Each factory assigns it to the row's rawSplitLine field
directly. Engine declares rawSplit at the iteration scope, populates it
in the header-init multiline pre-pass and the per-row multiline branch,
and passes it through to factory.Create. Non-multiline paths pass null
and the row's lazy split path is unchanged.
The rawSplitLine field on ReadLine/ReadLineSpan/ReadLineSpanOptimized/
ReadLineFromMemory moves from private to internal: the factory structs
are sibling types rather than types nested inside the row classes, so
they cannot reach the rows' private members. The classes themselves are
already internal sealed, and InternalsVisibleTo("Csv.Tests") is in
place. No public surface widened.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
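The priming pattern described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's code: `SketchRow`, `SketchRowFactory`, `primedSplit`, and `SplitCalls` are hypothetical names, plain `string` stands in for `MemoryText`, and the split is a naive comma split. The cache field is `internal` so that the sibling factory type can assign it.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical row type; illustrative only, not the library's ReadLine classes.
internal sealed class SketchRow
{
    private readonly string raw;

    // internal rather than private so the sibling factory type can prime it.
    internal IList<string>? primedSplit;

    // Counts how often a split actually runs, to make the saving visible.
    public static int SplitCalls;

    public SketchRow(string raw) { this.raw = raw; }

    // Lazy path: split only if no pre-split data was supplied.
    private IList<string> Fields => primedSplit ??= SplitNow(raw);

    public string this[int index] => Fields[index];

    internal static IList<string> SplitNow(string line)
    {
        SplitCalls++;
        return line.Split(',');
    }
}

internal static class SketchRowFactory
{
    // rawSplit is optional: passing null keeps the lazy behaviour unchanged.
    public static SketchRow Create(string raw, IList<string>? rawSplit = null)
    {
        var row = new SketchRow(raw);
        row.primedSplit = rawSplit;
        return row;
    }
}

public static class Program
{
    public static void Main()
    {
        // The "engine" splits once, for example to check for open quotes.
        var split = SketchRow.SplitNow("1,Ada,London");

        // Handing that split to the factory means field access does not re-split.
        var primed = SketchRowFactory.Create("1,Ada,London", split);
        Console.WriteLine(primed[1]);
        Console.WriteLine(SketchRow.SplitCalls);

        // Without a primed split, the first field access triggers the lazy split.
        var lazy = SketchRowFactory.Create("2,Bob,Paris");
        Console.WriteLine(lazy[2]);
        Console.WriteLine(SketchRow.SplitCalls);
    }
}
```

The optional parameter keeps callers that have no split available on the existing lazy path, which matches the commit's note that non-multiline paths pass null.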
e66f7f9 to 08f7cc5
In the multiline yield loop, both Enumerate and EnumerateAsync know the column count from `headers` before re-splitting. Thread it through to Splitter.Split as the initial list capacity so we avoid the List<T> grow-and-copy on the hot path. Per-row pre-sizing only: the header-init split (where headers don't exist yet) still uses the default capacity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Folded in @gemini-code-assist's capacity-hint suggestion in 76f983c - passes
Thanks for the update, @stevehansen. That looks like a solid refinement to ensure the
Summary
- The engine's multiline continuation loop computes `options.Splitter.Split` per iteration to check for unterminated quotes, then discards the final split. The yielded row would lazily `Split` again on first field access: one redundant `Split` per multiline-enabled row.
- Fix: extend `IRowFactory<TRow>.Create` with an `IList<MemoryText>? rawSplit` parameter; each factory assigns it to the row's `rawSplitLine` field. Engine threads the captured split from both the header-init pre-pass and the per-row multiline branch.
- Non-multiline paths pass `null` and the row's existing lazy split path is unchanged.

Resolves #120.
What changed
- `Csv/CsvReader.Engine.cs`: `IRowFactory<TRow>.Create` gains an `IList<MemoryText>? rawSplit` parameter. All four factories prime the row's `rawSplitLine` when it is non-null. Both `Enumerate` and `EnumerateAsync` declare `rawSplit` at iteration scope and capture it from both multiline branches.
- `Csv/CsvReader.cs`: the `rawSplitLine` field on `ReadLine`, `ReadLineSpan`, and `ReadLineSpanOptimized` moves from `private` to `internal` (the factory structs are sibling nested types, and C# private access extends only to types nested within the field's owning type).
- `Csv/CsvReader.FromMemory.cs`: same for `ReadLineFromMemory`.
- The classes themselves are already `internal sealed`; `InternalsVisibleTo("Csv.Tests")` is in place from `StringHelpers.cs`. No public surface widened.

Perf delta
Test plan
- `dotnet build` clean on netstandard2.0, net8.0, net9.0
- `dotnet test`: 169/169 passing (excluding the pre-existing flaky `Memory_AllocationComparison` GC-ratio test)
- `Csv.Tests/EngineUnificationTests.cs`, `Csv.Tests/Tests.cs`, and `Csv.Tests/IssuesTests.cs` continue to pass. They cover correctness; the perf change is unobservable without benchmarks (no dedicated regression test added)

🤖 Generated with Claude Code