[fix](hive) Preserve empty text records#64671
Conversation
|
/review |
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
Codex automated review failed and did not complete. Error: You've hit your usage limit. Visit https://chatgpt.com/codex/settings/usage to purchase more credits or try again at Jun 24th, 2026 10:56 PM. Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
|
/review |
|
run buildall |
|
Codex automated review failed and did not complete. Error: You've hit your usage limit. Visit https://chatgpt.com/codex/settings/usage to purchase more credits or try again at Jun 24th, 2026 10:56 PM. Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
|
/review |
There was a problem hiding this comment.
Review result: no blocking issues found in the PR diff.
Critical checkpoint conclusions:
- Goal/test proof: the change preserves single-column Hive text empty records by routing empty text lines through normal Hive text deserialization, and the added Hive regression checks both row count and null/empty-string classification for Hive 2 and Hive 3 outputs.
- Scope/focus: the implementation is narrow, limited to CsvReader/TextReader empty-line handling plus regression seed/output updates.
- Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle-managed state, config item, protocol field, storage format, or rolling-upgrade compatibility surface was introduced.
- Parallel paths: CSV behavior is preserved through the default
_empty_line_as_record() == false; Hive text overrides it. COUNT pushdown follows the same empty-record predicate, and nullable counted-slot pushdown remains blocked by existing FE checks. - Tests/results: the added outputs are deterministic scalar aggregates. I did not run the full external Hive regression locally; current CI shows formatter/checkstyle/compile passing, external/regression TeamCity checks still pending, and the macOS BE UT failure is environmental (
JAVA version is 25, it must be JDK-17) before tests run. - Observability/transactions/persistence/data writes: not applicable for this reader-side parsing fix.
- Performance: the added branch only affects zero-length lines and keeps normal non-empty-line processing unchanged.
User focus: no additional user-provided focus points were supplied.
Subagent conclusions: optimizer-rewrite found no candidate findings; tests-session-config found no candidate findings. I independently verified their dismissed points in the reader, line-reader, Hive SerDe property, aggregate pushdown, and regression-output paths. Final convergence round 1 ended with both subagents replying NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set, and no inline comments were proposed.
32d1e52 to
ecdd9f8
Compare
|
run buildall |
TPC-H: Total hot run time: 29493 ms |
TPC-DS: Total hot run time: 173623 ms |
ClickBench: Total hot run time: 25.33 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1f8297f to
c26e6e6
Compare
|
run buildall |
TPC-H: Total hot run time: 29621 ms |
TPC-DS: Total hot run time: 173045 ms |
ClickBench: Total hot run time: 25.32 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
c26e6e6 to
448f1d2
Compare
|
run buildall |
TPC-H: Total hot run time: 29272 ms |
TPC-DS: Total hot run time: 174281 ms |
ClickBench: Total hot run time: 25.19 s |
448f1d2 to
090ad5d
Compare
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
ClickBench: Total hot run time: 25.15 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 29321 ms |
TPC-DS: Total hot run time: 173213 ms |
ClickBench: Total hot run time: 25.24 s |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: The Hive serde regression fixture added a multi-column text table to verify the default Hive null marker, but the table DDL did not specify a tab field delimiter while the prepared data file used tab-separated rows. Hive LazySimpleSerDe defaults to Ctrl-A as the field delimiter, so Doris did not split rows such as a\tb into two columns and the expected aggregation result mismatched. This change declares the table with ROW FORMAT DELIMITED and FIELDS TERMINATED BY '\t' while leaving serialization.null.format unset, so the case continues to validate the default \N null format.
### Release note
None
### Check List (For Author)
- Test: Regression test
- Ran ./run-regression-test.sh --run -d external_table_p0/hive -s test_hive_serde_prop; the suite succeeded but the Hive test body was skipped because enableHiveTest=false in the local config.
- Ran git diff --check.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#64671 Problem Summary: Refreshing the Hive regression module reruns all data/regression run.sh scripts. The credit-data fixture uploaded files before executing its Hive DDL, but the DDL drops and recreates the managed Hive table. On refresh, DROP TABLE can remove the table location after the files have just been uploaded, leaving crdmm_data empty and causing test_external_credit_data to return no rows. This change runs the DDL first, then recreates the target HDFS directory and uploads the credit-data files into the table location. ### Release note None ### Check List (For Author) - Test: Regression test - Ran bash -n docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh. - Ran git diff --check. - Ran ./run-regression-test.sh --run -d external_table_p0/hive -s test_external_credit_data; the suite succeeded but the Hive test body was skipped because enableHiveTest=false in the local config. - Behavior changed: No - Does this need documentation: No
|
run buildall |
TPC-H: Total hot run time: 29147 ms |
TPC-DS: Total hot run time: 172615 ms |
ClickBench: Total hot run time: 25.35 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
There was a problem hiding this comment.
Reviewed the PR patch for Hive TEXTFILE empty-record preservation and the Hive regression fixture refresh changes. I did not find a substantiated blocking issue to raise inline.
Checkpoint conclusions:
- Goal/test proof: The BE reader change preserves empty physical TEXTFILE records and the regression suite adds count/value checks for custom and default null formats across hive2/hive3 expected output sections.
- Scope/focus: The code change is narrowly scoped to CsvReader/TextReader empty-line behavior plus Hive fixture setup.
- Concurrency/lifecycle/config: No new concurrency, lifecycle ownership, persistent state, or dynamic config behavior is introduced.
- Parallel paths: Normal materialization and COUNT pushdown paths were both checked for empty-line consistency.
- Compatibility: CSV default behavior stays unchanged because
_empty_line_as_record()defaults to false; the override is TextReader-only. - Testing: Added Hive regression coverage targets the changed behavior; local full execution was not rerun because this runner lacks initialized Hive/thirdparty test setup.
- Style/basic checks:
bash -npassed for modified shell scripts; added PR patch lines did not contain trailing whitespace. - User focus: No additional user-provided focus was present.
Subagent conclusions: The optimizer-rewrite subagent reported NO_NEW_VALUABLE_FINDINGS. The tests-session-config subagent reported NO_NEW_VALUABLE_FINDINGS. No subagent candidate became an inline comment, no duplicates were merged, and convergence round 1 ended with all live subagents reporting NO_NEW_VALUABLE_FINDINGS for the current ledger/comment set.
### What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
When scanning Hive TEXTFILE tables, Doris previously skipped empty
physical lines unless `read_csv_empty_line_as_null` was enabled. This is
inconsistent with Hive TEXTFILE semantics: an empty physical line is
still a record. For a single-column text table it represents one empty
field, and for multi-column text tables missing trailing fields should
be filled using the table's null format.
This can cause Doris to return fewer rows than Hive for text files
containing empty lines, especially when the table uses `LazySimpleSerDe`
and custom or default `serialization.null.format`.
This PR fixes the behavior by adding a format-level hook for empty-line
handling:
- CSV keeps the existing default behavior and does not treat empty lines
as records.
- Hive TEXT overrides the hook and treats empty physical lines as
records.
- Empty Hive text lines are passed through normal field deserialization
so string/null handling stays consistent with `null_format`.
The PR also adds Hive regression coverage for:
- a single-column text table with custom `serialization.null.format`;
- a multi-column text table using the default Hive null marker `\N`;
- preservation of empty records and correct NULL/empty-string
classification.
In addition, the credit-data Hive fixture upload order is made
refresh-safe. The Hive regression module refresh may rerun all
`data/regression` setup scripts; `crdmm_data` now recreates the Hive
table before re-uploading its HDFS data so `DROP TABLE` cannot remove
freshly uploaded files.
### Release note
Fix Hive TEXTFILE scans to preserve empty physical lines as records,
matching Hive behavior.
### Check List (For Author)
- Test: Regression test
- Added/updated `external_table_p0/hive/test_hive_serde_prop`.
- Ran `./run-regression-test.sh --run -d external_table_p0/hive -s
test_hive_serde_prop`; local config had `enableHiveTest=false`, so the
Hive test body was skipped.
- Ran `./run-regression-test.sh --run -d external_table_p0/hive -s
test_external_credit_data`; local config had `enableHiveTest=false`, so
the Hive test body was skipped.
- Ran `bash -n
docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh`.
- Ran `git diff --check`.
- Behavior changed: Yes. Hive TEXTFILE scans now preserve empty physical
lines as records instead of skipping them.
- Does this need documentation: No
### What problem does this PR solve? Issue Number: None Related PR: apache#63781, apache#64671 Problem Summary: File scanner v2 did not carry the same fixes as the existing file scanner path. Predicate rows filtered inside v2 file readers were still reported through scanner load counters unless the scanner was a real load source, and Hive TEXTFILE empty physical lines were still skipped unless read_csv_empty_line_as_null was enabled. This change gates v2 load counter reporting with the same FILE_STREAM exception used by FileScanner and adds a delimited text hook so Hive Text v2 treats empty physical lines as records while CSV keeps the old default behavior. ### Release note Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line handling. ### Check List (For Author) - Test: Unit Test / Manual test - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown. - Ran git diff --check. - Ran clang-format v16 through build-support/run_clang_format.py for changed files. - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built. - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available. - Behavior changed: Yes. File scanner v2 now matches v1 load counter gating and Hive TEXTFILE empty-line semantics. - Does this need documentation: No
### What problem does this PR solve? Issue Number: None Related PR: #63781, #64671 Problem Summary: File scanner v2 did not carry the same fixes as the existing file scanner path. Predicate rows filtered inside v2 file readers were still reported through scanner load counters unless the scanner was a real load source, and Hive TEXTFILE empty physical lines were still skipped unless read_csv_empty_line_as_null was enabled. This change gates v2 load counter reporting with the same FILE_STREAM exception used by FileScanner and adds a delimited text hook so Hive Text v2 treats empty physical lines as records while CSV keeps the old default behavior. ### Release note Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line handling. ### Check List (For Author) - Test: Unit Test / Manual test - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown. - Ran git diff --check. - Ran clang-format v16 through build-support/run_clang_format.py for changed files. - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built. - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available. - Behavior changed: Yes. File scanner v2 now matches v1 load counter gating and Hive TEXTFILE empty-line semantics. - Does this need documentation: No ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
When scanning Hive TEXTFILE tables, Doris previously skipped empty physical lines unless
read_csv_empty_line_as_nullwas enabled. This is inconsistent with Hive TEXTFILE semantics: an empty physical line is still a record. For a single-column text table it represents one empty field, and for multi-column text tables missing trailing fields should be filled using the table's null format.This can cause Doris to return fewer rows than Hive for text files containing empty lines, especially when the table uses
LazySimpleSerDeand custom or defaultserialization.null.format.This PR fixes the behavior by adding a format-level hook for empty-line handling:
null_format.The PR also adds Hive regression coverage for:
serialization.null.format;\N;In addition, the credit-data Hive fixture upload order is made refresh-safe. The Hive regression module refresh may rerun all
data/regressionsetup scripts;crdmm_datanow recreates the Hive table before re-uploading its HDFS data soDROP TABLEcannot remove freshly uploaded files.Release note
Fix Hive TEXTFILE scans to preserve empty physical lines as records, matching Hive behavior.
Check List (For Author)
external_table_p0/hive/test_hive_serde_prop../run-regression-test.sh --run -d external_table_p0/hive -s test_hive_serde_prop; local config hadenableHiveTest=false, so the Hive test body was skipped../run-regression-test.sh --run -d external_table_p0/hive -s test_external_credit_data; local config hadenableHiveTest=false, so the Hive test body was skipped.bash -n docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh.git diff --check.