
refactor: unify the four CsvReader read loops behind one JIT-devirtualized engine (#118) #121

Merged
stevehansen merged 2 commits into master from refactor/unify-reader-loops-118 on May 16, 2026

@stevehansen
Owner

Summary

  • Collapses the five duplicated read-loop implementations (Read, ReadAsSpan, ReadAsync, ReadFromMemoryOptimized, ReadFromMemory) into a single internal Enumerate<TSource, TFactory, TRow> + EnumerateAsync<...> engine using where TSource : struct, ILineSource / IAsyncLineSource and where TFactory : struct, IRowFactory<TRow> constraints so the JIT specializes one monomorphized native body per concrete (source, factory) pair — no virtual dispatch on the per-row hot path.
  • Fixes a latent correctness bug: ReadAsync, ReadFromMemoryOptimized, and ReadFromMemory previously skipped the HeaderAbsent + multiline-first-record pre-pass that Read and ReadAsSpan performed. With the unified engine the pre-pass applies to all five paths.
  • Plumbs CancellationToken through EnumerateAsync to TextReader.ReadLineAsync(ct). The async path had no cancellation support before.
  • Public API unchanged. CsvLineSplitter, CsvWriter, CsvOptions, and the row classes are untouched except for a private -> internal accessibility shift on the row classes required by the new internal factories (InternalsVisibleTo("Csv.Tests") already in place).
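
The struct-constrained generic pattern named in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's code: the interface members are pared down (the real ILineSource also has Concat and an out string? parameter), and UpperCaseRowFactory is a hypothetical factory invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical, simplified shapes; the PR's real interfaces carry more members.
public interface ILineSource
{
    bool TryReadLine(out ReadOnlyMemory<char> line);
}

public interface IRowFactory<TRow>
{
    TRow Create(ReadOnlyMemory<char> line, int index);
}

// Struct source: because TSource is constrained to `struct`, the runtime
// compiles a dedicated native body per value type, so this call is direct.
public struct TextReaderLineSource : ILineSource
{
    private readonly TextReader reader;

    public TextReaderLineSource(TextReader reader) => this.reader = reader;

    public bool TryReadLine(out ReadOnlyMemory<char> line)
    {
        var text = reader.ReadLine();
        line = text is null ? default : text.AsMemory();
        return text is not null;
    }
}

// Hypothetical factory used only to make the output visible.
public readonly struct UpperCaseRowFactory : IRowFactory<string>
{
    public string Create(ReadOnlyMemory<char> line, int index)
        => index + ":" + line.ToString().ToUpperInvariant();
}

public static class Engine
{
    // One generic loop; each (source, factory) pair gets its own specialization.
    public static IEnumerable<TRow> Enumerate<TSource, TFactory, TRow>(TSource source, TFactory factory)
        where TSource : struct, ILineSource
        where TFactory : struct, IRowFactory<TRow>
    {
        var index = 0;
        while (source.TryReadLine(out var line))
            yield return factory.Create(line, index++);
    }
}
```

A call site has to spell out all three type arguments, since TRow cannot be inferred from the parameters: Engine.Enumerate<TextReaderLineSource, UpperCaseRowFactory, string>(new TextReaderLineSource(new StringReader("a,b\nc,d")), default).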

Resolves #118.

What's in the diff

  • Csv/CsvReader.Engine.cs (new) — 3 interfaces (ILineSource, IAsyncLineSource, IRowFactory<TRow>), 4 struct line sources (TextReaderLineSource, AsyncTextReaderLineSource, MemorySliceLineSource, MemoryReaderLineSource), 4 readonly-struct factories (StringRowFactory, SpanRowFactory, OptimizedRowFactory, MemoryRowFactory), and the two generic iterator methods.
  • Csv/CsvReader.cs — four *Impl methods reduced to one-line delegates; ConcatenateMemory and ReadLineOptimized removed.
  • Csv/CsvReader.FromMemory.cs: ReadFromMemory reduced to a one-line delegate.
  • Csv.Tests/EngineUnificationTests.cs (new) — 17 cross-path tests covering skip/header/alias/duplicate matrix, multiline correctness (including the HeaderAbsent-first-record regression case), per-path contracts, and an allocation-parity smoke test.

Performance notes

  • ILineSource.TryReadLine and Concat thread the natural-string-form alongside the ReadOnlyMemory<char> view via out string? lineString / out string? combined. StringRowFactory and SpanRowFactory pass the original string straight into the row ctor instead of paying new string(span) per row.
  • MemoryReaderLineSource.Concat delegates to StringHelpers.Concat (single allocation, matches pre-refactor ReadFromMemory behavior).
  • MemorySliceLineSource.Concat is the verbatim port of the pre-existing ConcatenateMemory; the pool-then-allocate anti-pattern there is tracked separately in #119 ("ConcatenateMemory rents from CharArrayPool but allocates a fresh char[] anyway, defeating the pool").
  • Debug.Assert(options.Splitter == null, ...) guards against concurrent CsvOptions reuse (already documented as unsupported).
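
To make the two Concat strategies in these notes concrete, here is a rough side-by-side sketch. It assumes a modern .NET target where the span-based string.Concat overload is available, and it uses ArrayPool<char>.Shared as a stand-in for the library's CharArrayPool option; the class and method names are hypothetical.

```csharp
using System;
using System.Buffers;

public static class ConcatSketch
{
    // Rent a pooled buffer, fill it, then copy into a fresh array anyway.
    // The pool saves nothing because the result is still a new allocation.
    public static ReadOnlyMemory<char> RentThenAllocate(
        ReadOnlyMemory<char> head, string newLine, ReadOnlyMemory<char> tail)
    {
        var total = head.Length + newLine.Length + tail.Length;
        var buffer = ArrayPool<char>.Shared.Rent(total);
        try
        {
            var span = buffer.AsSpan(0, total);
            head.Span.CopyTo(span);
            newLine.AsSpan().CopyTo(span.Slice(head.Length));
            tail.Span.CopyTo(span.Slice(head.Length + newLine.Length));
            return span.ToArray(); // the extra, unpooled allocation
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);
        }
    }

    // Single allocation: build the combined string once and expose it both
    // as a string and as a memory view over that same string.
    public static ReadOnlyMemory<char> SingleAllocation(
        ReadOnlyMemory<char> head, string newLine, ReadOnlyMemory<char> tail, out string combined)
    {
        combined = string.Concat(head.Span, newLine.AsSpan(), tail.Span);
        return combined.AsMemory();
    }
}
```

Both methods produce the same characters; the difference is only in how many buffers are touched along the way, which is what a [MemoryDiagnoser] run would be expected to surface.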

Behavior change to call out in CHANGELOG

ReadAsync, ReadFromMemoryOptimized, and ReadFromMemory now apply the HeaderAbsent + multiline-first-record pre-pass that Read and ReadAsSpan already performed. Strictly a correctness improvement.

Test plan

  • dotnet build clean on netstandard2.0, net8.0, net9.0
  • dotnet test — 169/169 passing (excluding the pre-existing flaky Memory_AllocationComparison GC-ratio test in PerformanceTests.cs:150 which is unrelated to this PR)
  • All 17 new engine-unification tests green
  • Bug-fix regression test When_HeaderAbsentAndMultilineInFirstRecord_Then_AllPathsProduceCorrectColumnCount pins the new uniform behavior across all five paths
  • Spot-check JIT codegen for Enumerate<TextReaderLineSource, StringRowFactory, ReadLine> to confirm no callvirt on TryReadLine / Concat / Create (recommended but not gating)
  • BenchmarkDotNet [MemoryDiagnoser] allocation-parity run for Read, ReadAsSpan, ReadAsync, ReadFromMemoryOptimized, ReadFromMemory vs pre-refactor baseline (recommended but not gating)

Follow-ups (filed as separate issues)

🤖 Generated with Claude Code

…lized engine (#118)

Collapse the five duplicated read-loop implementations (Read / ReadAsSpan /
ReadAsync / ReadFromMemoryOptimized / ReadFromMemory) into a single
internal Enumerate<TSource, TFactory, TRow> + EnumerateAsync<...> engine.
TSource and TFactory are generic struct constraints (struct, ILineSource /
IAsyncLineSource / IRowFactory<TRow>), so the JIT specializes one
monomorphized native body per concrete (source, factory) pair with no
virtual dispatch on the per-row hot path.

Correctness: the HeaderAbsent + multiline-first-record pre-pass now
applies uniformly across all five paths. Previously ReadAsync,
ReadFromMemoryOptimized, and ReadFromMemory silently miscounted columns
when HeaderMode = HeaderAbsent and the first record contained an embedded
newline inside a quoted field. Pinned by the cross-path regression test
in Csv.Tests/EngineUnificationTests.cs.

Performance: TextReader-backed paths thread the natural-string-form
alongside the ReadOnlyMemory<char> view via ILineSource.TryReadLine's
out parameter, so StringRowFactory / SpanRowFactory pass the original
string straight into the row ctor instead of paying new string(span) per
row. MemoryReaderLineSource.Concat delegates to StringHelpers.Concat
(single allocation) instead of the rent-then-allocate anti-pattern.
EnumerateAsync plumbs CancellationToken all the way to
TextReader.ReadLineAsync(ct) -- the async path had no cancellation
support before this change.

Public API unchanged. CsvLineSplitter, CsvWriter, CsvOptions, and the
row classes are untouched except for a private -> internal accessibility
shift on the row classes required by the new internal factories
(InternalsVisibleTo("Csv.Tests") already in place).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request unifies the CSV parsing logic by introducing a centralized engine using ILineSource and IRowFactory abstractions, which replaces duplicated implementations across various reading paths. It also adds a comprehensive test suite to ensure consistency. Feedback identifies critical bugs in TryReadLine implementations where empty lines were incorrectly terminating enumeration, and suggests performance optimizations for string concatenation and multiline field detection in .NET 8+ environments.

Comment thread Csv/CsvReader.Engine.cs
Comment on lines +134 to +163
    if (newlineIndex == -1)
    {
        line = csv.Slice(position);
        position = csv.Length;
        return !line.IsEmpty;
    }

    var lineLength = newlineIndex;
    var slice = csv.Slice(position, lineLength);

    position += lineLength;
    if (position < csv.Length)
    {
        var ch = csv.Span[position];
        if (ch == '\r' || ch == '\n')
        {
            position++;
            if (position < csv.Length && ch == '\r' && csv.Span[position] == '\n')
                position++;
        }
    }

    if (slice.IsEmpty)
    {
        line = default;
        return false;
    }

    line = slice;
    return true;

high

The current implementation of TryReadLine in MemorySliceLineSource incorrectly returns false when it encounters an empty line (where slice.IsEmpty is true). This causes the parser to stop prematurely even if there is more data in the source. In CSV parsing, an empty line should be treated as a valid (though empty) record, or skipped based on CsvOptions.SkipRow, but it should not terminate the enumeration unless the end of the source has been reached.

Additionally, the check at line 138 should always return true because line 125 already handles the EOF condition.

                if (newlineIndex == -1)
                {
                    line = csv.Slice(position);
                    position = csv.Length;
                    return true;
                }

                var lineLength = newlineIndex;
                var slice = csv.Slice(position, lineLength);

                position += lineLength;
                if (position < csv.Length)
                {
                    var ch = csv.Span[position];
                    if (ch == '\r' || ch == '\n')
                    {
                        position++;
                        if (position < csv.Length && ch == '\r' && csv.Span[position] == '\n')
                            position++;
                    }
                }

                line = slice;
                return true;

Comment thread Csv/CsvReader.Engine.cs
Comment on lines +213 to +214
line = csv.ReadLine(ref position);
return !line.IsEmpty;

high

Similar to MemorySliceLineSource, MemoryReaderLineSource.TryReadLine incorrectly returns false on empty lines. Since line 207 already handles the EOF check, any line read (even if empty) should return true to allow the enumeration to continue.

                line = csv.ReadLine(ref position);
                return true;

Comment thread Csv/CsvReader.Engine.cs
Comment on lines +166 to +189
    public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
    {
        combined = null;

        var separator = newLine.AsMemory();
        var totalLength = head.Length + separator.Length + tail.Length;
        var buffer = memoryOptions.CharArrayPool.Rent(totalLength);

        try
        {
            var span = buffer.AsSpan();
            head.Span.CopyTo(span);
            separator.Span.CopyTo(span.Slice(head.Length));
            tail.Span.CopyTo(span.Slice(head.Length + separator.Length));

            var result = new char[totalLength];
            span.Slice(0, totalLength).CopyTo(result);
            return result.AsMemory();
        }
        finally
        {
            memoryOptions.CharArrayPool.Return(buffer);
        }
    }

medium

As noted in the PR description, the current pooling logic in Concat is inefficient because it still performs a heap allocation for the final array. Since this code is already scoped to .NET 8+, you can simplify this significantly and improve performance by using string.Concat with spans, which performs a single allocation and avoids the overhead of renting/returning from a pool.

            public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
            {
                combined = string.Concat(head.Span, newLine.AsSpan(), tail.Span);
                return combined.AsMemory();
            }

Comment thread Csv/CsvReader.Engine.cs
Comment on lines +217 to +221
    public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
    {
        combined = null;
        return StringHelpers.Concat(head, newLine, tail);
    }

medium

For consistency and better performance in .NET 8+, consider using string.Concat with spans here as well, similar to the other ILineSource implementations.

            public MemoryText Concat(MemoryText head, string newLine, MemoryText tail, out string? combined)
            {
                combined = string.Concat(head.Span, newLine.AsSpan(), tail.Span);
                return combined.AsMemory();
            }

Comment thread Csv/CsvReader.Engine.cs Outdated
    {
        var splitLine = options.Splitter.Split(line, options);

        while (splitLine.Any(f => CsvLineSplitter.IsUnterminatedQuotedValue(f.AsSpan(), options)))

medium

When checking for unterminated quoted values in a multiline field, only the last field of the split line can be unterminated. Using splitLine.Any(...) is less efficient as it checks every field. Checking only the last element is sufficient.

                        while (splitLine.Count > 0 && CsvLineSplitter.IsUnterminatedQuotedValue(splitLine[splitLine.Count - 1].AsSpan(), options))

Comment thread Csv/CsvReader.Engine.cs Outdated
    if (options.AllowNewLineInEnclosedFieldValues && !isFirstDataLineInHeaderAbsentMode)
    {
        var rawSplit = options.Splitter.Split(line, options);
        while (rawSplit.Any(f => CsvLineSplitter.IsUnterminatedQuotedValue(f.AsSpan(), options)))

medium

Using rawSplit.Any(...) is less efficient than checking only the last field, which is the only one that can be unterminated in this context.

                    while (rawSplit.Count > 0 && CsvLineSplitter.IsUnterminatedQuotedValue(rawSplit[rawSplit.Count - 1].AsSpan(), options))

…continuation

By RFC 4180, only the last field of a split line can be unterminated --
an unterminated quote in any earlier field would have been swallowed by
Split into a single multi-comma field. Replace splitLine.Any(...) with a
direct check of splitLine[Count-1] in both sync and async engine bodies
(header pre-pass and per-row loop, 4 spots).

Drops the System.Linq import in the engine file. Closes the deferred
should-fix item from the #118 review and the corresponding part of #120.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevehansen
Owner Author

Thanks for the review @gemini-code-assist. Dispositions:

High-priority items 1 & 2 (empty-line termination in MemorySliceLineSource / MemoryReaderLineSource) — confirmed real bugs, but pre-existing in ReadLineOptimized and MemoryText.ReadLine(ref position). The pre-refactor ReadFromMemoryOptimized and ReadFromMemory paths had the same termination-on-blank-line behavior; #118 preserved it verbatim per the refactor's "behavior-preservation" scope. Filed separately as #122 — fixing it needs its own correctness analysis around SkipRow semantics for blank lines, which is out of scope for this consolidation PR.
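
For readers following along, the termination-on-blank-line behavior can be reproduced with a small stand-alone reader. This is a simplified sketch, not the library's TryReadLine: it only recognizes "\n", and the emptyEndsInput flag is a hypothetical switch added so both behaviors can be compared on the same input.

```csharp
using System;
using System.Collections.Generic;

public static class BlankLineSketch
{
    // Simplified stand-in for a memory-backed TryReadLine; handles "\n" only.
    private static bool TryReadLine(
        ReadOnlyMemory<char> csv, ref int position, bool emptyEndsInput, out ReadOnlyMemory<char> line)
    {
        line = default;
        if (position >= csv.Length)
            return false; // genuine end of input

        var rest = csv.Slice(position);
        var newlineIndex = rest.Span.IndexOf('\n');
        if (newlineIndex < 0)
        {
            line = rest;
            position = csv.Length;
        }
        else
        {
            line = rest.Slice(0, newlineIndex);
            position += newlineIndex + 1; // step past the newline
        }

        // With emptyEndsInput set, an empty line is reported as "no more lines",
        // which is the conflation the review flagged.
        return !(emptyEndsInput && line.IsEmpty);
    }

    public static List<string> ReadAll(string text, bool emptyEndsInput)
    {
        var csv = text.AsMemory();
        var position = 0;
        var lines = new List<string>();
        while (TryReadLine(csv, ref position, emptyEndsInput, out var line))
            lines.Add(line.ToString());
        return lines;
    }
}
```

With the input "a,b\n\nc,d", ReadAll(text, true) stops after the first record, while ReadAll(text, false) yields three lines including the empty one. Whether that empty line should then be surfaced or skipped is the SkipRow question deferred to #122.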

Medium-priority item 3 (MemorySliceLineSource.Concat pool anti-pattern) — tracked in #119. Preserved verbatim from the pre-#118 ConcatenateMemory per approved deviation.

Medium-priority item 4 (MemoryReaderLineSource.Concat could capture the combined string) — currently delegates to StringHelpers.Concat which already does string.Concat(spans).AsMemory() internally. The suggestion to also publish the materialized string via out combined is cosmetic — MemoryRowFactory.Create ignores rawString, so it would never be consumed downstream on this path. Leaving as-is.

Medium-priority items 5 & 6 (only check the last field for unterminated quotes) — applied in 84e2c96. Replaced splitLine.Any(...) with splitLine[Count-1] in both sync and async engine bodies (header pre-pass + per-row loop, 4 spots). Drops the unused using System.Linq. This was also tracked in #120 and is now closed by this commit.
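
A small sketch of why the last-field check is sufficient. The splitter below is a naive quote-aware stand-in written for this example (it is not CsvLineSplitter, and it assumes a comma separator and double-quote enclosure); all names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LastFieldSketch
{
    // Naive quote-aware split: separators inside an open quote are not split on.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var start = 0;
        var inQuotes = false;
        for (var i = 0; i < line.Length; i++)
        {
            if (line[i] == '"')
            {
                inQuotes = !inQuotes;
            }
            else if (line[i] == ',' && !inQuotes)
            {
                fields.Add(line.Substring(start, i - start));
                start = i + 1;
            }
        }
        fields.Add(line.Substring(start));
        return fields;
    }

    // An odd number of quotes means a quote was opened but never closed.
    public static bool IsUnterminated(string field)
    {
        var quotes = 0;
        foreach (var ch in field)
            if (ch == '"') quotes++;
        return quotes % 2 == 1;
    }

    // Only the final field needs inspecting: once a quote is left open, every
    // later separator is swallowed into that same field, making it the last one.
    public static bool NeedsContinuation(List<string> fields)
        => fields.Count > 0 && IsUnterminated(fields[fields.Count - 1]);
}
```

For example, Split("a,\"b,c") yields two fields, with the open-quoted "b,c" text as the last one, so NeedsContinuation is true; Split("a,\"b\",c") yields three closed fields and NeedsContinuation is false.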

@gemini-code-assist

Thanks for the detailed breakdown, @stevehansen. It's good to see the rationale behind the prioritization and the confirmation that the identified items are either pre-existing or tracked in separate issues. The application of the fix for the unterminated quote check in the header pre-pass and per-row loop is a solid improvement. I'm satisfied with this resolution.

@stevehansen stevehansen changed the title from "refactor: unify the four CsvReader read loops behind one JIT-devirtualized engine" to "refactor: unify the four CsvReader read loops behind one JIT-devirtualized engine (#118)" on May 16, 2026
@stevehansen stevehansen merged commit 24cea69 into master May 16, 2026
3 checks passed
@stevehansen stevehansen deleted the refactor/unify-reader-loops-118 branch May 16, 2026 11:11

Development

Successfully merging this pull request may close these issues.

RFC: Unify the four CsvReader read loops behind one JIT-devirtualized engine

1 participant