branch-4.1: [fix](hive) Preserve empty text records #64671 #64837

Merged: yiguolei merged 1 commit into branch-4.1 from auto-pick-64671-branch-4.1 on Jun 25, 2026

@github-actions

Contributor

Cherry-picked from #64671

### What problem does this PR solve?

Issue Number: close #xxx

Problem Summary:

When scanning Hive TEXTFILE tables, Doris previously skipped empty
physical lines unless `read_csv_empty_line_as_null` was enabled. This is
inconsistent with Hive TEXTFILE semantics: an empty physical line is
still a record. For a single-column text table it represents one empty
field, and for multi-column text tables missing trailing fields should
be filled using the table's null format.

As a result, Doris can return fewer rows than Hive for text files that
contain empty lines, especially when the table uses `LazySimpleSerDe`
with a custom or default `serialization.null.format`.
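
The record semantics described above can be sketched roughly as follows. This is a simplified illustration, not the Doris or Hive implementation: the function name `deserialize_text_line` is hypothetical, every column is assumed to be a string, and escaping, quoting, and typed conversion are ignored.

```python
from typing import List, Optional


def deserialize_text_line(
    line: str,
    num_columns: int,
    field_delim: str = "\x01",   # Hive's default field delimiter (Ctrl-A)
    null_format: str = "\\N",    # Hive's default null marker
) -> List[Optional[str]]:
    """Turn one physical line into exactly `num_columns` string fields.

    Hypothetical helper for illustration only.
    """
    # An empty line still splits into one (empty) field, so it is a record.
    fields = line.split(field_delim)
    # Missing trailing fields are filled with the table's null marker.
    fields += [null_format] * (num_columns - len(fields))
    # Extra fields beyond the schema are ignored in this sketch.
    fields = fields[:num_columns]
    # A field equal to the null marker becomes NULL; anything else,
    # including the empty string, stays a string.
    return [None if f == null_format else f for f in fields]


# Single string column, default null marker: empty line -> one empty string.
single_default = deserialize_text_line("", 1)
# Single string column, custom null marker "": empty line -> one NULL.
single_custom = deserialize_text_line("", 1, null_format="")
# Three columns, default null marker: trailing fields are NULL.
multi_default = deserialize_text_line("", 3)
```

Under these assumptions the same empty line is classified as an empty string or as NULL depending only on the null marker, which is why the row has to be kept and passed through field deserialization rather than dropped.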

This PR fixes the behavior by adding a format-level hook for empty-line
handling:

- CSV keeps the existing default behavior and does not treat empty lines
as records.
- Hive TEXT overrides the hook and treats empty physical lines as
records.
- Empty Hive text lines are passed through normal field deserialization
so string/null handling stays consistent with `null_format`.
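
A minimal sketch of the hook pattern, assuming a shared line-reading loop with a per-format override. The class and method names (`CsvLineReader`, `HiveTextLineReader`, `treats_empty_line_as_record`) are invented for illustration and are not the identifiers used in the Doris BE; the `read_csv_empty_line_as_null` option is left out to keep the example short.

```python
from typing import Iterable, List


class CsvLineReader:
    """Hypothetical stand-in for a delimited-text reader."""

    def __init__(self, field_delim: str = ","):
        self.field_delim = field_delim

    def treats_empty_line_as_record(self) -> bool:
        # Format-level hook: CSV keeps its existing default and skips.
        return False

    def read_rows(self, lines: Iterable[str]) -> List[List[str]]:
        rows = []
        for line in lines:
            if line == "" and not self.treats_empty_line_as_record():
                continue  # empty physical line is not a record
            # Empty lines that survive go through the normal split path.
            rows.append(line.split(self.field_delim))
        return rows


class HiveTextLineReader(CsvLineReader):
    """Hypothetical Hive TEXT reader that overrides only the hook."""

    def __init__(self, field_delim: str = "\x01"):
        super().__init__(field_delim)

    def treats_empty_line_as_record(self) -> bool:
        return True


lines = ["a", "", "b"]
csv_rows = CsvLineReader().read_rows(lines)        # 2 rows
hive_rows = HiveTextLineReader().read_rows(lines)  # 3 rows
```

Keeping the decision in a single overridable method leaves the CSV path untouched while letting Hive TEXT reuse the same loop.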

The PR also adds Hive regression coverage for:

- a single-column text table with custom `serialization.null.format`;
- a multi-column text table using the default Hive null marker `\N`;
- preservation of empty records and correct NULL/empty-string
classification.

In addition, the credit-data Hive fixture upload order is made
refresh-safe. The Hive regression module refresh may rerun all
`data/regression` setup scripts; `crdmm_data` now recreates the Hive
table before re-uploading its HDFS data so `DROP TABLE` cannot remove
freshly uploaded files.
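
The ordering problem can be illustrated with a toy model. Everything here is an assumption made for the sketch: `FakeWarehouse`, the table name, the location, and the premise that dropping the table deletes the files under its location. It does not reproduce the actual `run.sh`.

```python
class FakeWarehouse:
    """Toy model in which a table owns the files under its location."""

    def __init__(self):
        self.tables = {}  # table name -> location
        self.files = {}   # location -> set of file names

    def upload(self, location, name):
        self.files.setdefault(location, set()).add(name)

    def drop_table(self, table):
        location = self.tables.pop(table, None)
        if location is not None:
            # Assumed behavior: dropping the table removes its data files.
            self.files.pop(location, None)

    def create_table(self, table, location):
        self.tables[table] = location
        self.files.setdefault(location, set())


TABLE, LOCATION = "credit_table", "/hypothetical/crdmm_data"


def setup_upload_first(wh):
    # Upload, then recreate: safe on a fresh cluster, unsafe on a rerun.
    wh.upload(LOCATION, "part-0")
    wh.drop_table(TABLE)
    wh.create_table(TABLE, LOCATION)


def setup_recreate_first(wh):
    # Recreate, then upload: the drop can never see the new files.
    wh.drop_table(TABLE)
    wh.create_table(TABLE, LOCATION)
    wh.upload(LOCATION, "part-0")


def files_after_two_runs(setup):
    wh = FakeWarehouse()
    setup(wh)  # initial setup
    setup(wh)  # module refresh reruns the same script
    return wh.files[LOCATION]


lost = files_after_two_runs(setup_upload_first)
kept = files_after_two_runs(setup_recreate_first)
```

In this model the upload-first order only fails on the second run, which matches the description of a problem that appears when the setup scripts are rerun during a refresh.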

### Release note

Fix Hive TEXTFILE scans to preserve empty physical lines as records,
matching Hive behavior.

### Check List (For Author)

- Test: Regression test
    - Added/updated `external_table_p0/hive/test_hive_serde_prop`.
    - Ran `./run-regression-test.sh --run -d external_table_p0/hive -s test_hive_serde_prop`; local config had `enableHiveTest=false`, so the Hive test body was skipped.
    - Ran `./run-regression-test.sh --run -d external_table_p0/hive -s test_external_credit_data`; local config had `enableHiveTest=false`, so the Hive test body was skipped.
    - Ran `bash -n docker/thirdparties/docker-compose/hive/scripts/data/regression/crdmm_data/run.sh`.
    - Ran `git diff --check`.
- Behavior changed: Yes. Hive TEXTFILE scans now preserve empty physical
lines as records instead of skipping them.
- Does this need documentation: No

@github-actions (bot) requested a review from yiguolei as a code owner on June 25, 2026, 07:24

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen

Contributor

run buildall

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/18) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 54.04% (20633/38182) |
| Line Coverage | 37.64% (196372/521720) |
| Region Coverage | 33.99% (153428/451422) |
| Branch Coverage | 34.93% (66902/191514) |

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (18/18) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.71% (27498/37308) |
| Line Coverage | 57.38% (297927/519210) |
| Region Coverage | 55.01% (250050/454528) |
| Branch Coverage | 56.37% (108127/191811) |

@yiguolei yiguolei merged commit 05a8c3e into branch-4.1 Jun 25, 2026
29 of 32 checks passed