branch-4.1: [fix](hive) Preserve empty text records #64671 #64837
Merged
### What problem does this PR solve?
Issue Number: close #xxx
Problem Summary:
When scanning Hive TEXTFILE tables, Doris previously skipped empty
physical lines unless `read_csv_empty_line_as_null` was enabled. This is
inconsistent with Hive TEXTFILE semantics, where an empty physical line is
still a record. For a single-column text table it represents one empty
field, and for multi-column text tables, missing trailing fields should
be filled using the table's null format.
This can cause Doris to return fewer rows than Hive for text files
containing empty lines, especially when the table uses `LazySimpleSerDe`
with a custom or default `serialization.null.format`.
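
To make the intended semantics concrete, here is a minimal Python sketch of how one physical line could map to a row. It is an illustration only, not the Doris or Hive implementation: the function name `parse_text_line` and its parameters are hypothetical, it models string columns only, and it assumes missing trailing fields are read as NULL.

```python
from typing import List, Optional


def parse_text_line(
    line: str,
    num_columns: int,
    field_delim: str = "\x01",   # assumed default field delimiter
    null_format: str = "\\N",    # default Hive null marker, i.e. backslash + N
) -> List[Optional[str]]:
    """Deserialize one physical line into `num_columns` string fields.

    None stands for SQL NULL. Hypothetical helper for illustration only.
    """
    # Splitting an empty line yields [""], i.e. exactly one empty field,
    # so the line is still a record rather than something to skip.
    fields = line.split(field_delim)
    row: List[Optional[str]] = []
    for i in range(num_columns):
        if i >= len(fields):
            row.append(None)          # missing trailing field: treated as NULL
        elif fields[i] == null_format:
            row.append(None)          # field equals the table's null marker
        else:
            row.append(fields[i])     # ordinary value, possibly ""
    return row


# Single-column table, default null marker: the empty line is an empty string.
print(parse_text_line("", 1))
# Single-column table whose null marker is the empty string: the line is NULL.
print(parse_text_line("", 1, null_format=""))
# Three-column table: one empty field, then two missing trailing fields.
print(parse_text_line("", 3))
```

Under these assumptions, whether an empty line becomes NULL or an empty string depends only on the table's null format, which is why the row must not be dropped before deserialization.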
This PR fixes the behavior by adding a format-level hook for empty-line
handling:
- CSV keeps the existing default behavior and does not treat empty lines
as records.
- Hive TEXT overrides the hook and treats empty physical lines as
records.
- Empty Hive text lines are passed through normal field deserialization
so string/null handling stays consistent with `null_format`.
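
The hook described above can be sketched roughly as follows. This is a simplified Python model rather than the actual Doris code: the class names `CsvLineFormat` and `HiveTextLineFormat`, the method `empty_line_is_record`, and the `scan` helper are all hypothetical, and the `read_csv_empty_line_as_null` option is deliberately left out.

```python
from typing import Iterable, Iterator, List, Optional

Row = List[Optional[str]]


class CsvLineFormat:
    """Hypothetical base format: empty physical lines are not records."""

    def __init__(self, num_columns: int, field_delim: str = ",",
                 null_format: str = "\\N") -> None:
        self.num_columns = num_columns
        self.field_delim = field_delim
        self.null_format = null_format

    def empty_line_is_record(self) -> bool:
        # Format-level hook; CSV keeps the existing default behavior.
        return False

    def deserialize(self, line: str) -> Row:
        # Normal field deserialization, shared by every format.
        fields = line.split(self.field_delim)
        row: Row = []
        for i in range(self.num_columns):
            if i >= len(fields) or fields[i] == self.null_format:
                row.append(None)      # missing field or null marker
            else:
                row.append(fields[i])
        return row


class HiveTextLineFormat(CsvLineFormat):
    """Hypothetical Hive TEXT format: overrides only the hook."""

    def empty_line_is_record(self) -> bool:
        return True


def scan(lines: Iterable[str], fmt: CsvLineFormat) -> Iterator[Row]:
    for line in lines:
        if line == "" and not fmt.empty_line_is_record():
            continue                  # skip empty lines for CSV
        # Empty Hive lines take the same path as any other line.
        yield fmt.deserialize(line)


lines = ["a", "", "b"]
print(list(scan(lines, CsvLineFormat(1))))       # two rows
print(list(scan(lines, HiveTextLineFormat(1))))  # three rows
```

Keeping the decision in a single overridable method means the reader loop stays format-agnostic, and the empty line is classified by the same `null_format` comparison as every other field.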
The PR also adds Hive regression coverage for:
- a single-column text table with custom `serialization.null.format`;
- a multi-column text table using the default Hive null marker `\N`;
- preservation of empty records and correct NULL/empty-string
classification.
In addition, the credit-data Hive fixture upload order is made
refresh-safe. The Hive regression module refresh may rerun all
`data/regression` setup scripts; `crdmm_data` now recreates the Hive
table before re-uploading its HDFS data so `DROP TABLE` cannot remove
freshly uploaded files.
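
Why the order matters on a rerun can be shown with a toy model. The Python sketch below is not the fixture script: `ToyWarehouse`, the two setup functions, and the path `/data/crdmm_data` are hypothetical, and it assumes that dropping the table also deletes the files under its location.

```python
class ToyWarehouse:
    """Toy model in which DROP TABLE deletes files under the table location."""

    def __init__(self) -> None:
        self.tables = {}      # table name -> storage location
        self.files = set()    # HDFS-like file paths

    def upload(self, path: str) -> None:
        self.files.add(path)

    def create_table(self, name: str, location: str) -> None:
        self.tables[name] = location

    def drop_table(self, name: str) -> None:
        location = self.tables.pop(name, None)
        if location is not None:
            prefix = location + "/"
            self.files = {f for f in self.files if not f.startswith(prefix)}


def setup_upload_first(wh: ToyWarehouse) -> None:
    # Upload, then drop and recreate: unsafe when the table already exists.
    wh.upload("/data/crdmm_data/part-0")
    wh.drop_table("crdmm_data")
    wh.create_table("crdmm_data", "/data/crdmm_data")


def setup_recreate_first(wh: ToyWarehouse) -> None:
    # Drop and recreate, then upload: the ordering the PR describes.
    wh.drop_table("crdmm_data")
    wh.create_table("crdmm_data", "/data/crdmm_data")
    wh.upload("/data/crdmm_data/part-0")


unsafe, safe = ToyWarehouse(), ToyWarehouse()
for _ in range(2):            # the second pass models a module refresh
    setup_upload_first(unsafe)
    setup_recreate_first(safe)

print(sorted(unsafe.files))   # data lost on the rerun
print(sorted(safe.files))     # data still present
```

In this model the first run succeeds with either ordering, and only the rerun exposes the difference, which matches the refresh scenario described above.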
### Release note
Fix Hive TEXTFILE scans to preserve empty physical lines as records,
matching Hive behavior.
### Check List (For Author)
- Test: Regression test
  - Added/updated `external_table_p0/hive/test_hive_serde_prop`.
  - Ran `./run-regression-test.sh --run -d external_table_p0/hive -s test_hive_serde_prop`; local config had `enableHiveTest=false`, so the Hive test body was skipped.
  - Ran `./run-regression-test.sh --run -d external_table_p0/hive -s test_external_credit_data`; local config had `enableHiveTest=false`, so the Hive test body was skipped.
  - Ran `bash -n docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh`.
  - Ran `git diff --check`.
- Behavior changed: Yes. Hive TEXTFILE scans now preserve empty physical
lines as records instead of skipping them.
- Does this need documentation: No
yiguolei
approved these changes
Jun 25, 2026
Cherry-picked from #64671