[fix](serde) fix split_by_delimiter missing backslash escape handling#61995
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 29317 ms |
TPC-DS: Total hot run time: 179264 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
split_by_delimiter in complex_type_deserialize_util.h did not handle '\' escape characters. When Stream Load JSON writes Map<String,String> columns, the JSON reader stores the map as a raw JSON string (via to_json_string), which is later parsed back by from_string → split_by_delimiter. Without escape handling, '\"' inside a value flips the quote tracker, causing subsequent ':' or ',' to be mis-identified as delimiters. This leads to "Map does not have even number of key-value pairs" error, which is silently swallowed by the nullable serde and results in the Map column being NULL. Also fix test_jsonb_cast and test_jsonb_with_unescaped_string test data: the CSV input 'foo\\'bar' was ambiguous under escape rules (\\' means escaped backslash followed by quote-close, leaving 'bar' dangling). Changed to 'foo\'bar' and updated .out files to reflect correct parsing with backslash escape handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
493abfc to
e4dfb2c
Compare
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29264 ms |
TPC-DS: Total hot run time: 180400 ms |
|
run buildall |
TPC-H: Total hot run time: 29194 ms |
TPC-DS: Total hot run time: 181088 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
There was a problem hiding this comment.
I found one blocking issue in the current patch.
- The code change does not appear to touch the Stream Load JSON execution path described in the PR body. When
_is_loadis true,NewJsonReader::_simdjson_write_data_to_column()still routes complex JSON values throughdeserialize_one_cell_from_json()(be/src/format/json/new_json_reader.cpp:1084-1121), andDataTypeMapSerDe::deserialize_one_cell_from_json()already has its own backslash-skip logic (be/src/core/data_type_serde/data_type_map_serde.cpp:244-245). The newsplit_by_delimiter()handling only affectsfrom_string()callers such as string/JSONB-to-complex conversions, and the added unit test exercises only that path. So the current change does not yet prove, or appear to implement, the Stream Load JSON fix claimed by the PR description.
Critical checkpoints:
- Goal / proof: Not yet satisfied. The patch demonstrates a
from_string()fix, not the stated Stream Load JSON bug, and there is no test covering the real load path. - Scope / minimality: The code change is small, but it is currently applied to a different path than the reported bug.
- Concurrency: Not applicable in the touched code.
- Lifecycle / static initialization: Not applicable.
- Configuration: No config changes.
- Compatibility: No storage or protocol compatibility issue found in the touched lines.
- Parallel code paths: This is the main problem. The JSON load path uses
deserialize_one_cell_from_json(), while this patch changes only the sharedfrom_string()splitter path. - Special conditional checks: The new escape check itself is straightforward.
- Test coverage: Insufficient for the reported bug; an end-to-end reproducer on the actual JSON Stream Load path is still missing.
- Observability: Not applicable.
- Transaction / persistence: Not applicable.
- Data writes / modifications: This is a user-visible parsing fix, so it needs validation on the real ingestion path before merge.
- FE / BE variable passing: Not applicable.
- Performance: No obvious concern.
- Other issues: None beyond the path mismatch above.
User focus points: no additional focus points were provided.
| for (int pos = 0; pos < str.size; ++pos) { | ||
| char c = str.data[pos]; | ||
| if (c == '"' || c == '\'') { | ||
| if (c == '\\' && pos + 1 < static_cast<int>(str.size)) { |
There was a problem hiding this comment.
This helper is only used by the *_serde::from_string() implementations. The Stream Load JSON path described in the PR body still goes through NewJsonReader::_simdjson_write_data_to_column() and then DataTypeMapSerDe::deserialize_one_cell_from_json() when _is_load is true (be/src/format/json/new_json_reader.cpp:1084-1121). That map JSON path already has its own \ handling at be/src/core/data_type_serde/data_type_map_serde.cpp:244-245, and it never calls split_by_delimiter(). So this change does not appear to fix the reported Stream Load regression; it fixes the from_string() path instead. Can you either add a reproducer on the real load path and patch that path if needed, or narrow the PR description/tests to the string or JSONB-to-map conversion flow that this code actually changes?
There was a problem hiding this comment.
The new_json_reader doesn't parse the input directly into a map — it parses it into a string, and then casts that string into a map. The fix is exactly on the path that casts a string into a map.
…apache#61995) ### Bug Stream Load (JSON format) writing to `Map<String, String>` columns silently sets the column to NULL when the map value contains both `\"` (escaped quotes) and `:` or `,`. Stream Load returns Success with FilteredRows=0 — data loss with no error indication. ### Fix Added `\` escape skip in `split_by_delimiter`'s character loop, before quote detection — consistent with the existing escape handling in `deserialize_one_cell_from_json`.
…#61995) ### Bug Stream Load (JSON format) writing to `Map<String, String>` columns silently sets the column to NULL when the map value contains both `\"` (escaped quotes) and `:` or `,`. Stream Load returns Success with FilteredRows=0 — data loss with no error indication. ### Fix Added `\` escape skip in `split_by_delimiter`'s character loop, before quote detection — consistent with the existing escape handling in `deserialize_one_cell_from_json`.
…apache#61995) Manual backport of apache#61995 to branch-4.0. The serde util still lives under be/src/vec/data_types/serde/ on this branch (namespace doris::vectorized), so the core fix and the new UT are applied at the vec/ paths and the test's added include points to vec/data_types/serde/complex_type_deserialize_util.h. The core change, the UT body, and the jsonb regression data are unchanged. ### Bug Stream Load (JSON format) writing to Map<String, String> columns silently sets the column to NULL when the map value contains both \" (escaped quotes) and : or ,. Stream Load returns Success with FilteredRows=0 — data loss with no error indication. ### Fix Added \ escape skip in split_by_delimiter's character loop, before quote detection — consistent with the existing escape handling in deserialize_one_cell_from_json. (cherry picked from commit 0c9fff3) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#61995 ([fix](serde) fix split_by_delimiter missing backslash escape handling) made the complex-type splitter skip escaped characters so that '\"' inside a value no longer flips quote state. ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
#61995 ([fix](serde) fix split_by_delimiter missing backslash escape handling) made the complex-type splitter skip escaped characters so that '\"' inside a value no longer flips quote state. ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into --> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bug
Stream Load (JSON format) writing to
Map<String, String>columns silently sets the column to NULL when the map value contains both\"(escaped quotes) and:or,. Stream Load returns Success with FilteredRows=0 — data loss with no error indication.Fix
Added
\escape skip insplit_by_delimiter's character loop, before quote detection — consistent with the existing escape handling indeserialize_one_cell_from_json.What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)