refactor: unify the four CsvReader read loops behind one JIT-devirtualized engine (#118) by stevehansen · Pull Request #121 · stevehansen/csv

stevehansen · 2026-05-16T10:52:40Z

Summary

Collapses the five duplicated read-loop implementations (Read, ReadAsSpan, ReadAsync, ReadFromMemoryOptimized, ReadFromMemory) into a single internal Enumerate<TSource, TFactory, TRow> + EnumerateAsync<...> engine using where TSource : struct, ILineSource / IAsyncLineSource and where TFactory : struct, IRowFactory<TRow> constraints so the JIT specializes one monomorphized native body per concrete (source, factory) pair — no virtual dispatch on the per-row hot path.
Fixes a latent correctness bug: ReadAsync, ReadFromMemoryOptimized, and ReadFromMemory previously skipped the HeaderAbsent + multiline-first-record pre-pass that Read and ReadAsSpan performed. With the unified engine the pre-pass applies to all five paths.
Plumbs CancellationToken through EnumerateAsync → TextReader.ReadLineAsync(ct). The async path had no cancellation support before.
Public API unchanged. CsvLineSplitter, CsvWriter, CsvOptions, and the row classes are untouched except for a private → internal accessibility shift on the row classes required by the new internal factories (InternalsVisibleTo("Csv.Tests") already in place).
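The struct-constrained generic pattern described in the summary can be sketched as follows. This is a minimal illustration, not the PR's actual code: the interfaces here are simplified to a single member each, and SplitRowFactory and the Engine class are hypothetical names. The point it shows is that when TSource and TFactory are constrained to struct, the runtime compiles a separate body for each value-type instantiation, so the interface calls inside the loop can be resolved statically.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Simplified, hypothetical shapes; the PR's real interfaces carry more members.
public interface ILineSource
{
    bool TryReadLine(out string line);
}

public interface IRowFactory<TRow>
{
    TRow Create(string line);
}

// Struct line source wrapping a TextReader.
public struct TextReaderLineSource : ILineSource
{
    private readonly TextReader reader;

    public TextReaderLineSource(TextReader reader) => this.reader = reader;

    public bool TryReadLine(out string line)
    {
        var next = reader.ReadLine();
        line = next ?? string.Empty;
        return next != null; // false only at end of input
    }
}

// Hypothetical factory for illustration: naive comma split, no quoting rules.
public readonly struct SplitRowFactory : IRowFactory<string[]>
{
    public string[] Create(string line) => line.Split(',');
}

public static class Engine
{
    // One generic loop; each (TSource, TFactory) pair of structs gets its own
    // specialized native body, so TryReadLine and Create need no virtual dispatch.
    public static IEnumerable<TRow> Enumerate<TSource, TFactory, TRow>(TSource source, TFactory factory)
        where TSource : struct, ILineSource
        where TFactory : struct, IRowFactory<TRow>
    {
        while (source.TryReadLine(out var line))
            yield return factory.Create(line);
    }
}
```

A caller has to spell out the type arguments, because TRow cannot be inferred from the parameters: Engine.Enumerate<TextReaderLineSource, SplitRowFactory, string[]>(new TextReaderLineSource(new StringReader("a,b\nc,d")), new SplitRowFactory()).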

Resolves #118.

What's in the diff

Csv/CsvReader.Engine.cs (new) — 3 interfaces (ILineSource, IAsyncLineSource, IRowFactory<TRow>), 4 struct line sources (TextReaderLineSource, AsyncTextReaderLineSource, MemorySliceLineSource, MemoryReaderLineSource), 4 readonly-struct factories (StringRowFactory, SpanRowFactory, OptimizedRowFactory, MemoryRowFactory), and the two generic iterator methods.
Csv/CsvReader.cs — four *Impl methods reduced to one-line delegates; ConcatenateMemory and ReadLineOptimized removed.
Csv/CsvReader.FromMemory.cs — ReadFromMemory reduced to a one-line delegate.
Csv.Tests/EngineUnificationTests.cs (new) — 17 cross-path tests covering skip/header/alias/duplicate matrix, multiline correctness (including the HeaderAbsent-first-record regression case), per-path contracts, and an allocation-parity smoke test.

Performance notes

ILineSource.TryReadLine and Concat thread the natural-string-form alongside the ReadOnlyMemory<char> view via out string? lineString / out string? combined. StringRowFactory and SpanRowFactory pass the original string straight into the row ctor instead of paying new string(span) per row.
MemoryReaderLineSource.Concat delegates to StringHelpers.Concat (single allocation, matches pre-refactor ReadFromMemory behavior).
MemorySliceLineSource.Concat is the verbatim port of the pre-existing ConcatenateMemory; the pool-then-allocate anti-pattern there is tracked separately in #119 (ConcatenateMemory rents from CharArrayPool but allocates a fresh char[] anyway, defeating the pool).
Debug.Assert(options.Splitter == null, ...) guards against concurrent CsvOptions reuse (already documented as unsupported).

Behavior change to call out in CHANGELOG

ReadAsync, ReadFromMemoryOptimized, and ReadFromMemory now apply the HeaderAbsent + multiline-first-record pre-pass that Read and ReadAsSpan already performed. Strictly a correctness improvement.

Test plan

dotnet build clean on netstandard2.0, net8.0, net9.0
dotnet test — 169/169 passing (excluding the pre-existing flaky Memory_AllocationComparison GC-ratio test in PerformanceTests.cs:150 which is unrelated to this PR)
All 17 new engine-unification tests green
Bug-fix regression test When_HeaderAbsentAndMultilineInFirstRecord_Then_AllPathsProduceCorrectColumnCount pins the new uniform behavior across all five paths
Spot-check JIT codegen for Enumerate<TextReaderLineSource, StringRowFactory, ReadLine> to confirm no callvirt on TryReadLine / Concat / Create (recommended but not gating)
BenchmarkDotNet [MemoryDiagnoser] allocation-parity run for Read, ReadAsSpan, ReadAsync, ReadFromMemoryOptimized, ReadFromMemory vs pre-refactor baseline (recommended but not gating)

Follow-ups (filed as separate issues)

#119: ConcatenateMemory rents from CharArrayPool but allocates a fresh char[] anyway, defeating the pool. Preserved verbatim by this PR; tracked for future cleanup.
#120 (Follow-ups from #118: prime row split cache and check only last field for unterminated quotes): eager split-then-discard in the multiline continuation loop, and the dropped // TODO: only check the last part optimization. Both touch row-class internals that were scoped out of #118.

🤖 Generated with Claude Code

…lized engine (#118)

Collapse the five duplicated read-loop implementations (Read / ReadAsSpan / ReadAsync / ReadFromMemoryOptimized / ReadFromMemory) into a single internal Enumerate<TSource, TFactory, TRow> + EnumerateAsync<...> engine. TSource and TFactory are generic struct constraints (struct, ILineSource / IAsyncLineSource / IRowFactory<TRow>), so the JIT specializes one monomorphized native body per concrete (source, factory) pair with no virtual dispatch on the per-row hot path.

Correctness: the HeaderAbsent + multiline-first-record pre-pass now applies uniformly across all five paths. Previously ReadAsync, ReadFromMemoryOptimized, and ReadFromMemory silently miscounted columns when HeaderMode = HeaderAbsent and the first record contained an embedded newline inside a quoted field. Pinned by the cross-path regression test in Csv.Tests/EngineUnificationTests.cs.

Performance: TextReader-backed paths thread the natural-string-form alongside the ReadOnlyMemory<char> view via ILineSource.TryReadLine's out parameter, so StringRowFactory / SpanRowFactory pass the original string straight into the row ctor instead of paying new string(span) per row. MemoryReaderLineSource.Concat delegates to StringHelpers.Concat (single allocation) instead of the rent-then-allocate anti-pattern.

EnumerateAsync plumbs CancellationToken all the way to TextReader.ReadLineAsync(ct) -- the async path had no cancellation support before this change.

Public API unchanged. CsvLineSplitter, CsvWriter, CsvOptions, and the row classes are untouched except for a private -> internal accessibility shift on the row classes required by the new internal factories (InternalsVisibleTo("Csv.Tests") already in place).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request unifies the CSV parsing logic by introducing a centralized engine using ILineSource and IRowFactory abstractions, which replaces duplicated implementations across various reading paths. It also adds a comprehensive test suite to ensure consistency. Feedback identifies critical bugs in TryReadLine implementations where empty lines were incorrectly terminating enumeration, and suggests performance optimizations for string concatenation and multiline field detection in .NET 8+ environments.

gemini-code-assist · 2026-05-16T10:58:19Z

+                if (newlineIndex == -1)
+                {
+                    line = csv.Slice(position);
+                    position = csv.Length;
+                    return !line.IsEmpty;
+                }
+
+                var lineLength = newlineIndex;
+                var slice = csv.Slice(position, lineLength);
+
+                position += lineLength;
+                if (position < csv.Length)
+                {
+                    var ch = csv.Span[position];
+                    if (ch == '\r' || ch == '\n')
+                    {
+                        position++;
+                        if (position < csv.Length && ch == '\r' && csv.Span[position] == '\n')
+                            position++;
+                    }
+                }
+
+                if (slice.IsEmpty)
+                {
+                    line = default;
+                    return false;
+                }
+
+                line = slice;
+                return true;


The current implementation of TryReadLine in MemorySliceLineSource incorrectly returns false when it encounters an empty line (where slice.IsEmpty is true). This causes the parser to stop prematurely even if there is more data in the source. In CSV parsing, an empty line should be treated as a valid (though empty) record, or skipped based on CsvOptions.SkipRow, but it should not terminate the enumeration unless the end of the source has been reached.

Additionally, the check at line 138 should always return true because line 125 already handles the EOF condition.

    if (newlineIndex == -1)
    {
        line = csv.Slice(position);
        position = csv.Length;
        return true;
    }

    var lineLength = newlineIndex;
    var slice = csv.Slice(position, lineLength);

    position += lineLength;
    if (position < csv.Length)
    {
        var ch = csv.Span[position];
        if (ch == '\r' || ch == '\n')
        {
            position++;
            if (position < csv.Length && ch == '\r' && csv.Span[position] == '\n')
                position++;
        }
    }

    line = slice;
    return true;
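To make the reviewer's point concrete, here is a small self-contained sketch of a memory-backed line reader in which only end of input returns false, so a blank line in the middle of the data is yielded as an empty record instead of ending enumeration. BlankLineDemo, its TryReadLine signature, and ReadAll are hypothetical stand-ins for illustration; they are not the PR's MemorySliceLineSource.

```csharp
using System;
using System.Collections.Generic;

public static class BlankLineDemo
{
    // Returns false only when the input is exhausted; an empty slice is a valid line.
    public static bool TryReadLine(ReadOnlyMemory<char> csv, ref int position, out ReadOnlyMemory<char> line)
    {
        if (position >= csv.Length)
        {
            line = default;
            return false;
        }

        var rest = csv.Span.Slice(position);
        var newlineIndex = rest.IndexOfAny('\r', '\n');
        if (newlineIndex == -1)
        {
            // Last line without a trailing terminator.
            line = csv.Slice(position);
            position = csv.Length;
            return true;
        }

        line = csv.Slice(position, newlineIndex);
        position += newlineIndex + 1;

        // Treat "\r\n" as a single terminator.
        if (rest[newlineIndex] == '\r' && position < csv.Length && csv.Span[position] == '\n')
            position++;

        return true;
    }

    // Collects every line, including blank ones, as strings.
    public static List<string> ReadAll(string text)
    {
        var lines = new List<string>();
        var position = 0;
        while (TryReadLine(text.AsMemory(), ref position, out var line))
            lines.Add(line.ToString());
        return lines;
    }
}
```

With this shape, ReadAll("a,b\n\nc,d") yields three entries, the middle one empty. Whether a blank record is then kept or dropped is left to the caller, which is where a policy such as CsvOptions.SkipRow would apply.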

gemini-code-assist · 2026-05-16T10:58:19Z

+                line = csv.ReadLine(ref position);
+                return !line.IsEmpty;


Similar to MemorySliceLineSource, MemoryReaderLineSource.TryReadLine incorrectly returns false on empty lines. Since line 207 already handles the EOF check, any line read (even if empty) should return true to allow the enumeration to continue.

line = csv.ReadLine(ref position); return true;

gemini-code-assist · 2026-05-16T10:58:19Z

+            public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
+            {
+                combined = null;
+
+                var separator = newLine.AsMemory();
+                var totalLength = head.Length + separator.Length + tail.Length;
+                var buffer = memoryOptions.CharArrayPool.Rent(totalLength);
+
+                try
+                {
+                    var span = buffer.AsSpan();
+                    head.Span.CopyTo(span);
+                    separator.Span.CopyTo(span.Slice(head.Length));
+                    tail.Span.CopyTo(span.Slice(head.Length + separator.Length));
+
+                    var result = new char[totalLength];
+                    span.Slice(0, totalLength).CopyTo(result);
+                    return result.AsMemory();
+                }
+                finally
+                {
+                    memoryOptions.CharArrayPool.Return(buffer);
+                }
+            }


As noted in the PR description, the current pooling logic in Concat is inefficient because it still performs a heap allocation for the final array. Since this code is already scoped to .NET 8+, you can simplify this significantly and improve performance by using string.Concat with spans, which performs a single allocation and avoids the overhead of renting/returning from a pool.

    public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
    {
        combined = string.Concat(head.Span, newLine.AsSpan(), tail.Span);
        return combined.AsMemory();
    }

gemini-code-assist · 2026-05-16T10:58:19Z

+            public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
+            {
+                combined = null;
+                return StringHelpers.Concat(head, newLine, tail);
+            }


For consistency and better performance in .NET 8+, consider using string.Concat with spans here as well, similar to the other ILineSource implementations.

    public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
    {
        combined = string.Concat(head.Span, newLine.AsSpan(), tail.Span);
        return combined.AsMemory();
    }

gemini-code-assist · 2026-05-16T10:58:19Z

+                    {
+                        var splitLine = options.Splitter.Split(line, options);
+
+                        while (splitLine.Any(f => CsvLineSplitter.IsUnterminatedQuotedValue(f.AsSpan(), options)))


When checking for unterminated quoted values in a multiline field, only the last field of the split line can be unterminated. Using splitLine.Any(...) is less efficient as it checks every field. Checking only the last element is sufficient.

while (splitLine.Count > 0 && CsvLineSplitter.IsUnterminatedQuotedValue(splitLine[splitLine.Count - 1].AsSpan(), options))

gemini-code-assist · 2026-05-16T10:58:19Z

+                if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode)
+                {
+                    var rawSplit = options.Splitter.Split(line, options);
+                    while (rawSplit.Any(f => CsvLineSplitter.IsUnterminatedQuotedValue(f.AsSpan(), options)))


Using rawSplit.Any(...) is less efficient than checking only the last field, which is the only one that can be unterminated in this context.

while (rawSplit.Count > 0 && CsvLineSplitter.IsUnterminatedQuotedValue(rawSplit[rawSplit.Count - 1].AsSpan(), options))

…continuation

By RFC 4180, only the last field of a split line can be unterminated -- an unterminated quote in any earlier field would have been swallowed by Split into a single multi-comma field. Replace splitLine.Any(...) with a direct check of splitLine[Count-1] in both sync and async engine bodies (header pre-pass and per-row loop, 4 spots). Drops the System.Linq import in the engine file.

Closes the deferred should-fix item from the #118 review and the corresponding part of #120.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
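The reasoning in this commit message can be illustrated with a toy splitter. The sketch below is an assumption-laden stand-in, not the library's CsvLineSplitter: Split toggles a quote flag and ignores commas while a quote is open, and IsUnterminated simply counts quote characters. Under those assumptions, an unclosed quote absorbs every later comma, so the open field is always the last one and checking it alone is enough.

```csharp
using System;
using System.Collections.Generic;

public static class LastFieldDemo
{
    // Hypothetical quote-aware splitter: commas inside an open quote do not split.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var start = 0;
        var inQuotes = false;
        for (var i = 0; i < line.Length; i++)
        {
            if (line[i] == '"')
            {
                inQuotes = !inQuotes;
            }
            else if (line[i] == ',' && !inQuotes)
            {
                fields.Add(line.Substring(start, i - start));
                start = i + 1;
            }
        }
        fields.Add(line.Substring(start));
        return fields;
    }

    // Hypothetical check: an odd number of quote characters means the field is still open.
    public static bool IsUnterminated(string field)
    {
        var quotes = 0;
        foreach (var c in field)
            if (c == '"') quotes++;
        return quotes % 2 == 1;
    }

    // Only the last field is inspected, mirroring the splitLine[Count-1] check.
    public static bool NeedsContinuation(string line)
    {
        var fields = Split(line);
        return fields.Count > 0 && IsUnterminated(fields[fields.Count - 1]);
    }
}
```

For the input a,"b,c the splitter returns two fields, a and "b,c, and NeedsContinuation reports true; for a,"b",c it returns three closed fields and reports false.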

stevehansen · 2026-05-16T11:09:21Z

Thanks for the review @gemini-code-assist. Dispositions:

High-priority items 1 & 2 (empty-line termination in MemorySliceLineSource / MemoryReaderLineSource) — confirmed real bugs, but pre-existing in ReadLineOptimized and MemoryText.ReadLine(ref position). The pre-refactor ReadFromMemoryOptimized and ReadFromMemory paths had the same termination-on-blank-line behavior; #118 preserved it verbatim per the refactor's "behavior-preservation" scope. Filed separately as #122 — fixing it needs its own correctness analysis around SkipRow semantics for blank lines, which is out of scope for this consolidation PR.

Medium-priority item 3 (MemorySliceLineSource.Concat pool anti-pattern) — tracked in #119. Preserved verbatim from the pre-#118 ConcatenateMemory per approved deviation.

Medium-priority item 4 (MemoryReaderLineSource.Concat could capture the combined string) — currently delegates to StringHelpers.Concat which already does string.Concat(spans).AsMemory() internally. The suggestion to also publish the materialized string via out combined is cosmetic — MemoryRowFactory.Create ignores rawString, so it would never be consumed downstream on this path. Leaving as-is.

Medium-priority items 5 & 6 (only check the last field for unterminated quotes) — applied in 84e2c96. Replaced splitLine.Any(...) with splitLine[Count-1] in both sync and async engine bodies (header pre-pass + per-row loop, 4 spots). Drops the unused using System.Linq. This was also tracked in #120 and is now closed by this commit.

gemini-code-assist · 2026-05-16T11:09:34Z

Thanks for the detailed breakdown, @stevehansen. It's good to see the rationale behind the prioritization and the confirmation that the identified items are either pre-existing or tracked in separate issues. The application of the fix for the unterminated quote check in the header pre-pass and per-row loop is a solid improvement. I'm satisfied with this resolution.

gemini-code-assist Bot reviewed May 16, 2026

stevehansen mentioned this pull request May 16, 2026

Memory-based reader paths terminate enumeration on blank middle-of-stream lines #122

Closed

stevehansen mentioned this pull request May 16, 2026

Follow-ups from #118: prime row split cache and check only last field for unterminated quotes #120

Closed

stevehansen changed the title ~~refactor: unify the four CsvReader read loops behind one JIT-devirtualized engine~~ refactor: unify the four CsvReader read loops behind one JIT-devirtualized engine (#118) May 16, 2026

stevehansen merged commit 24cea69 into master May 16, 2026
3 checks passed

stevehansen deleted the refactor/unify-reader-loops-118 branch May 16, 2026 11:11

This was referenced May 16, 2026

fix: memory-based reader paths drop records after blank lines (#122) #123

Merged

Escaped " in column in middle of line cause invalid operation exception. #114

Closed

stevehansen merged 2 commits into master from refactor/unify-reader-loops-118
