refactor: unify the four CsvReader read loops behind one JIT-devirtualized engine (#118)#121
Conversation
…lized engine (#118) Collapse the five duplicated read-loop implementations (Read / ReadAsSpan / ReadAsync / ReadFromMemoryOptimized / ReadFromMemory) into a single internal Enumerate<TSource, TFactory, TRow> + EnumerateAsync<...> engine. TSource and TFactory are generic struct constraints (struct, ILineSource / IAsyncLineSource / IRowFactory<TRow>), so the JIT specializes one monomorphized native body per concrete (source, factory) pair with no virtual dispatch on the per-row hot path. Correctness: the HeaderAbsent + multiline-first-record pre-pass now applies uniformly across all five paths. Previously ReadAsync, ReadFromMemoryOptimized, and ReadFromMemory silently miscounted columns when HeaderMode = HeaderAbsent and the first record contained an embedded newline inside a quoted field. Pinned by the cross-path regression test in Csv.Tests/EngineUnificationTests.cs. Performance: TextReader-backed paths thread the natural-string-form alongside the ReadOnlyMemory<char> view via ILineSource.TryReadLine's out parameter, so StringRowFactory / SpanRowFactory pass the original string straight into the row ctor instead of paying new string(span) per row. MemoryReaderLineSource.Concat delegates to StringHelpers.Concat (single allocation) instead of the rent-then-allocate anti-pattern. EnumerateAsync plumbs CancellationToken all the way to TextReader.ReadLineAsync(ct) -- the async path had no cancellation support before this change. Public API unchanged. CsvLineSplitter, CsvWriter, CsvOptions, and the row classes are untouched except for a private -> internal accessibility shift on the row classes required by the new internal factories (InternalsVisibleTo("Csv.Tests") already in place). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request unifies the CSV parsing logic by introducing a centralized engine using ILineSource and IRowFactory abstractions, which replaces duplicated implementations across various reading paths. It also adds a comprehensive test suite to ensure consistency. Feedback identifies critical bugs in TryReadLine implementations where empty lines were incorrectly terminating enumeration, and suggests performance optimizations for string concatenation and multiline field detection in .NET 8+ environments.
| if (newlineIndex == -1) | ||
| { | ||
| line = csv.Slice(position); | ||
| position = csv.Length; | ||
| return !line.IsEmpty; | ||
| } | ||
|
|
||
| var lineLength = newlineIndex; | ||
| var slice = csv.Slice(position, lineLength); | ||
|
|
||
| position += lineLength; | ||
| if (position < csv.Length) | ||
| { | ||
| var ch = csv.Span[position]; | ||
| if (ch == '\r' || ch == '\n') | ||
| { | ||
| position++; | ||
| if (position < csv.Length && ch == '\r' && csv.Span[position] == '\n') | ||
| position++; | ||
| } | ||
| } | ||
|
|
||
| if (slice.IsEmpty) | ||
| { | ||
| line = default; | ||
| return false; | ||
| } | ||
|
|
||
| line = slice; | ||
| return true; |
There was a problem hiding this comment.
The current implementation of TryReadLine in MemorySliceLineSource incorrectly returns false when it encounters an empty line (where slice.IsEmpty is true). This causes the parser to stop prematurely even if there is more data in the source. In CSV parsing, an empty line should be treated as a valid (though empty) record, or skipped based on CsvOptions.SkipRow, but it should not terminate the enumeration unless the end of the source has been reached.
Additionally, the check at line 138 should always return true because line 125 already handles the EOF condition.
if (newlineIndex == -1)
{
line = csv.Slice(position);
position = csv.Length;
return true;
}
var lineLength = newlineIndex;
var slice = csv.Slice(position, lineLength);
position += lineLength;
if (position < csv.Length)
{
var ch = csv.Span[position];
if (ch == '
' || ch == '
')
{
position++;
if (position < csv.Length && ch == '
' && csv.Span[position] == '
')
position++;
}
}
line = slice;
return true;| line = csv.ReadLine(ref position); | ||
| return !line.IsEmpty; |
There was a problem hiding this comment.
| public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined) | ||
| { | ||
| combined = null; | ||
|
|
||
| var separator = newLine.AsMemory(); | ||
| var totalLength = head.Length + separator.Length + tail.Length; | ||
| var buffer = memoryOptions.CharArrayPool.Rent(totalLength); | ||
|
|
||
| try | ||
| { | ||
| var span = buffer.AsSpan(); | ||
| head.Span.CopyTo(span); | ||
| separator.Span.CopyTo(span.Slice(head.Length)); | ||
| tail.Span.CopyTo(span.Slice(head.Length + separator.Length)); | ||
|
|
||
| var result = new char[totalLength]; | ||
| span.Slice(0, totalLength).CopyTo(result); | ||
| return result.AsMemory(); | ||
| } | ||
| finally | ||
| { | ||
| memoryOptions.CharArrayPool.Return(buffer); | ||
| } | ||
| } |
There was a problem hiding this comment.
As noted in the PR description, the current pooling logic in Concat is inefficient because it still performs a heap allocation for the final array. Since this code is already scoped to .NET 8+, you can simplify this significantly and improve performance by using string.Concat with spans, which performs a single allocation and avoids the overhead of renting/returning from a pool.
public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
{
combined = string.Concat(head.Span, newLine.AsSpan(), tail.Span);
return combined.AsMemory();
}| public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined) | ||
| { | ||
| combined = null; | ||
| return StringHelpers.Concat(head, newLine, tail); | ||
| } |
There was a problem hiding this comment.
For consistency and better performance in .NET 8+, consider using string.Concat with spans here as well, similar to the other ILineSource implementations.
public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
{
combined = string.Concat(head.Span, newLine.AsSpan(), tail.Span);
return combined.AsMemory();
}| { | ||
| var splitLine = options.Splitter.Split(line, options); | ||
|
|
||
| while (splitLine.Any(f => CsvLineSplitter.IsUnterminatedQuotedValue(f.AsSpan(), options))) |
There was a problem hiding this comment.
When checking for unterminated quoted values in a multiline field, only the last field of the split line can be unterminated. Using splitLine.Any(...) is less efficient as it checks every field. Checking only the last element is sufficient.
while (splitLine.Count > 0 && CsvLineSplitter.IsUnterminatedQuotedValue(splitLine[splitLine.Count - 1].AsSpan(), options))| if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode) | ||
| { | ||
| var rawSplit = options.Splitter.Split(line, options); | ||
| while (rawSplit.Any(f => CsvLineSplitter.IsUnterminatedQuotedValue(f.AsSpan(), options))) |
There was a problem hiding this comment.
…continuation By RFC 4180, only the last field of a split line can be unterminated -- an unterminated quote in any earlier field would have been swallowed by Split into a single multi-comma field. Replace splitLine.Any(...) with a direct check of splitLine[Count-1] in both sync and async engine bodies (header pre-pass and per-row loop, 4 spots). Drops the System.Linq import in the engine file. Closes the deferred should-fix item from the #118 review and the corresponding part of #120. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the review @gemini-code-assist. Dispositions: High-priority items 1 & 2 (empty-line termination in Medium-priority item 3 ( Medium-priority item 4 ( Medium-priority items 5 & 6 (only check the last field for unterminated quotes) — applied in 84e2c96. Replaced |
|
Thanks for the detailed breakdown, @stevehansen. It's good to see the rationale behind the prioritization and the confirmation that the identified items are either pre-existing or tracked in separate issues. The application of the fix for the unterminated quote check in the header pre-pass and per-row loop is a solid improvement. I'm satisfied with this resolution. |
Summary
Read,ReadAsSpan,ReadAsync,ReadFromMemoryOptimized,ReadFromMemory) into a single internalEnumerate<TSource, TFactory, TRow>+EnumerateAsync<...>engine usingwhere TSource : struct, ILineSource/IAsyncLineSourceandwhere TFactory : struct, IRowFactory<TRow>constraints so the JIT specializes one monomorphized native body per concrete (source, factory) pair — no virtual dispatch on the per-row hot path.ReadAsync,ReadFromMemoryOptimized, andReadFromMemorypreviously skipped the HeaderAbsent + multiline-first-record pre-pass thatReadandReadAsSpanperformed. With the unified engine the pre-pass applies to all five paths.CancellationTokenthroughEnumerateAsync→TextReader.ReadLineAsync(ct). The async path had no cancellation support before.CsvLineSplitter,CsvWriter,CsvOptions, and the row classes are untouched except for aprivate→internalaccessibility shift on the row classes required by the new internal factories (InternalsVisibleTo(\"Csv.Tests\")already in place).Resolves #118.
What's in the diff
Csv/CsvReader.Engine.cs(new) — 3 interfaces (ILineSource,IAsyncLineSource,IRowFactory<TRow>), 4 struct line sources (TextReaderLineSource,AsyncTextReaderLineSource,MemorySliceLineSource,MemoryReaderLineSource), 4 readonly-struct factories (StringRowFactory,SpanRowFactory,OptimizedRowFactory,MemoryRowFactory), and the two generic iterator methods.Csv/CsvReader.cs— four*Implmethods reduced to one-line delegates;ConcatenateMemoryandReadLineOptimizedremoved.Csv/CsvReader.FromMemory.cs—ReadFromMemoryreduced to a one-line delegate.Csv.Tests/EngineUnificationTests.cs(new) — 17 cross-path tests covering skip/header/alias/duplicate matrix, multiline correctness (including the HeaderAbsent-first-record regression case), per-path contracts, and an allocation-parity smoke test.Performance notes
ILineSource.TryReadLineandConcatthread the natural-string-form alongside theReadOnlyMemory<char>view viaout string? lineString/out string? combined.StringRowFactoryandSpanRowFactorypass the original string straight into the row ctor instead of payingnew string(span)per row.MemoryReaderLineSource.Concatdelegates toStringHelpers.Concat(single allocation, matches pre-refactorReadFromMemorybehavior).MemorySliceLineSource.Concatis the verbatim port of the pre-existingConcatenateMemory; the pool-then-allocate anti-pattern there is tracked separately in ConcatenateMemory rents from CharArrayPool but allocates a fresh char[] anyway, defeating the pool #119.Debug.Assert(options.Splitter == null, ...)guards against concurrentCsvOptionsreuse (already documented as unsupported).Behavior change to call out in CHANGELOG
ReadAsync,ReadFromMemoryOptimized, andReadFromMemorynow apply the HeaderAbsent + multiline-first-record pre-pass thatReadandReadAsSpanalready performed. Strictly a correctness improvement.Test plan
dotnet buildclean on netstandard2.0, net8.0, net9.0dotnet test— 169/169 passing (excluding the pre-existing flakyMemory_AllocationComparisonGC-ratio test inPerformanceTests.cs:150which is unrelated to this PR)When_HeaderAbsentAndMultilineInFirstRecord_Then_AllPathsProduceCorrectColumnCountpins the new uniform behavior across all five pathsEnumerate<TextReaderLineSource, StringRowFactory, ReadLine>to confirm nocallvirtonTryReadLine/Concat/Create(recommended but not gating)[MemoryDiagnoser]allocation-parity run forRead,ReadAsSpan,ReadAsync,ReadFromMemoryOptimized,ReadFromMemoryvs pre-refactor baseline (recommended but not gating)Follow-ups (filed as separate issues)
ConcatenateMemoryrents from the pool but allocates a freshchar[]anyway, defeating the pool. Preserved verbatim by this PR; tracked for future cleanup.// TODO: only check the last partoptimization. Both touch row-class internals scoped out of RFC: Unify the four CsvReader read loops behind one JIT-devirtualized engine #118.🤖 Generated with Claude Code