Fixes continuation row grouping and adds strictness #10

Merged: vethman merged 1 commit into main from strictness on Apr 9, 2026

vethman (Member) commented Apr 8, 2026

This pull request changes the CSV parsing library so that batch and streaming parsing produce the same results, validation is stricter, and memory use during streaming is bounded. It aligns the grouping behavior of CsvStreamParser with CsvParser, enforces strict validation of identifierColumn, clarifies the documentation, and adds a memory safeguard for continuation groups. Several development dependencies are also updated.

CSV Parsing Consistency and Validation:

  • CsvStreamParser now always emits nested grouped output, matching CsvParser continuation-row semantics. The previous nested option is removed so that the batch and streaming APIs cannot diverge in how they group continuation rows.
  • Enforced strict identifierColumn validation: if the configured identifier column is missing from the headers, parsing throws a CsvParseError instead of continuing ambiguously. A continuation row also cannot start a group: if the first data row has an empty identifier, parsing throws a CsvParseError.
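
The grouping and validation rules above can be illustrated with a minimal, standalone sketch. It is not the library's implementation: `groupContinuationRows`, the `Row` type, and the local `CsvParseError` class are hypothetical stand-ins (only the name CsvParseError comes from this PR), and it assumes that an "empty identifier" means a missing value or an empty string.

```typescript
// Hypothetical stand-in for the library's error type.
class CsvParseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CsvParseError";
  }
}

type Row = Record<string, string>;

// Attaches every row with an empty identifier to the closest preceding row
// that has one. Each returned group is [identified row, ...continuation rows].
function groupContinuationRows(
  headers: string[],
  rows: Row[],
  identifierColumn: string,
): Row[][] {
  // Strict check: the identifier column must exist in the headers.
  if (headers.indexOf(identifierColumn) === -1) {
    throw new CsvParseError(
      `identifierColumn "${identifierColumn}" is missing from headers`,
    );
  }

  const groups: Row[][] = [];
  let current: Row[] | undefined;
  for (const row of rows) {
    // Assumption: a missing value and an empty string both count as empty.
    const id = row[identifierColumn] ?? "";
    if (id !== "") {
      current = [row];
      groups.push(current);
    } else if (current === undefined) {
      // A continuation row cannot start a group.
      throw new CsvParseError("first data row has an empty identifier");
    } else {
      current.push(row);
    }
  }
  return groups;
}
```

Under these assumptions, three rows with identifiers `"1"`, `""`, and `"2"` produce two groups, the first of which holds two rows.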

Streaming Parser Improvements:

  • Added a maxContinuationGroupSize option (default: 10,000) to CsvStreamParser to prevent unbounded memory usage when identifier values are missing for long stretches. Exceeding this limit throws a CsvParseError.
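
A rough sketch of how such a guard can bound memory while rows arrive one at a time. The `ContinuationGrouper` class and its `push`/`flush` methods are hypothetical and are not part of CsvStreamParser's API. The sketch also assumes the limit counts every row in a group, including the identified first row, which the PR description does not specify.

```typescript
// Hypothetical stand-in for the library's error type.
class CsvParseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CsvParseError";
  }
}

type Row = Record<string, string>;

// Buffers only the group currently being built, never the whole input.
class ContinuationGrouper {
  private current: Row[] | undefined;
  private readonly identifierColumn: string;
  private readonly maxContinuationGroupSize: number;

  constructor(identifierColumn: string, maxContinuationGroupSize = 10000) {
    this.identifierColumn = identifierColumn;
    this.maxContinuationGroupSize = maxContinuationGroupSize;
  }

  // Feeds one row. Returns the finished previous group when this row starts
  // a new group, otherwise undefined.
  push(row: Row): Row[] | undefined {
    const id = row[this.identifierColumn] ?? "";
    if (id !== "") {
      const finished = this.current;
      this.current = [row];
      return finished;
    }
    if (this.current === undefined) {
      throw new CsvParseError("first data row has an empty identifier");
    }
    // Assumption: the limit applies to the total number of rows in the group.
    if (this.current.length >= this.maxContinuationGroupSize) {
      throw new CsvParseError(
        `continuation group exceeds ${this.maxContinuationGroupSize} rows`,
      );
    }
    this.current.push(row);
    return undefined;
  }

  // Call once at end of input to obtain the last buffered group, if any.
  flush(): Row[] | undefined {
    const finished = this.current;
    this.current = undefined;
    return finished;
  }
}
```

A group can only be emitted once the next identified row (or end of input) is seen, so without a cap a long run of rows with empty identifiers would accumulate in the buffer indefinitely.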

Documentation Updates:

  • Updated the README to clarify the distinction between CsvParser.parseStream() (buffers the entire stream in memory) and CsvStreamParser (true streaming, memory efficient, always groups continuation rows). Also clarified options such as includeColumns, excludeColumns, null handling, and the new maxContinuationGroupSize safeguard.

Bug Fixes and Internal Improvements:

  • Improved line splitting in CsvReader.splitLines() to handle custom quote characters and escaped quotes correctly.
  • Updated dev dependencies to their latest versions for better compatibility and tooling.
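
To illustrate why quote tracking matters for line splitting, here is a minimal standalone sketch. It is not the code of CsvReader.splitLines(): the function below, its `quote` and `escape` parameters, and the choice to drop a trailing empty line are assumptions made for illustration. By default an escaped quote is a doubled quote character, as in RFC 4180.

```typescript
// Splits CSV text into logical lines, keeping newlines that occur inside
// quoted fields. `escape` defaults to the quote character (doubled quotes).
function splitLines(text: string, quote = '"', escape = quote): string[] {
  const lines: string[] = [];
  let current = "";
  let inQuotes = false;

  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    const next = text.charAt(i + 1);

    if (inQuotes && ch === escape && next === quote) {
      // Escaped quote inside a quoted field: keep both characters and stay
      // inside the quoted field.
      current += ch + next;
      i++;
    } else if (ch === quote) {
      inQuotes = !inQuotes;
      current += ch;
    } else if (!inQuotes && (ch === "\n" || ch === "\r")) {
      // Line break outside quotes ends the logical line; treat \r\n as one.
      if (ch === "\r" && next === "\n") i++;
      lines.push(current);
      current = "";
    } else {
      current += ch;
    }
  }

  // Assumption: a trailing empty line is not reported.
  if (current !== "") lines.push(current);
  return lines;
}
```

For example, `splitLines("a,'x\ny'\nb,'it\\'s'", "'", "\\")` keeps the newline inside `'x\ny'` and does not treat the backslash-escaped quote in `'it\'s'` as the end of the field.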

These changes make the CSV parsing behavior more predictable, robust, and safe for large-scale streaming workloads.

Resolves critical issues where continuation rows (rows with empty identifier columns) were not properly grouped in CsvStreamParser, causing discrepancies between streaming and non-streaming parsing results.

Changes CsvStreamParser to buffer continuation groups by default, matching CsvParser behavior. Adds a maxContinuationGroupSize guard (default: 10000) to prevent unbounded memory growth when identifier values are missing.

Improves identifierColumn validation to throw early when the configured column doesn't exist after filtering/transformation. When headerTransformer or columnMapping is used, identifierColumn must be given as the transformed column name.
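
A small sketch of what validating against the transformed headers can look like. The function name `resolveHeaders`, the option shapes, and the order of steps (headerTransformer, then columnMapping, then excludeColumns) are assumptions for illustration. The PR only states that validation happens after filtering/transformation.

```typescript
// Hypothetical stand-in for the library's error type.
class CsvParseError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "CsvParseError";
  }
}

interface HeaderOptions {
  headerTransformer?: (header: string) => string;
  columnMapping?: Record<string, string>;
  excludeColumns?: string[];
}

// Computes the final header names and checks that identifierColumn is one of
// them. Assumed order: transform, then map, then filter.
function resolveHeaders(
  rawHeaders: string[],
  identifierColumn: string,
  options: HeaderOptions = {},
): string[] {
  const transform = options.headerTransformer ?? ((h: string) => h);
  const mapping = options.columnMapping ?? {};
  const excluded = options.excludeColumns ?? [];

  const finalHeaders = rawHeaders
    .map((h) => transform(h))
    .map((h) => mapping[h] ?? h)
    .filter((h) => excluded.indexOf(h) === -1);

  if (finalHeaders.indexOf(identifierColumn) === -1) {
    throw new CsvParseError(
      `identifierColumn "${identifierColumn}" not found; available columns: ` +
        finalHeaders.join(", "),
    );
  }
  return finalHeaders;
}
```

With a transformer that lowercases headers and replaces spaces with underscores, `"Order ID"` becomes `"order_id"`, so in this sketch `identifierColumn: "order_id"` passes while `identifierColumn: "Order ID"` throws.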

Fixes nested array handling in JsonToCsv to properly emit child continuation values under parent array items rather than at root level.

Enhances documentation to clarify that parseStream() buffers the full content in memory, while CsvStreamParser provides true incremental processing with continuation grouping.

Updates dependencies and fixes quote character handling in splitLines to properly track escaped quotes.
vethman self-assigned this Apr 8, 2026
vethman merged commit 82a3a8e into main Apr 9, 2026
4 checks passed
vethman deleted the strictness branch April 9, 2026 08:09