[refactor](table) Refactor table and file reader#63893
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#63893 Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema. ### Release note None ### Check List (For Author) - Test: Unit Test - Added BE UT cases in table_reader_test and parquet_reader_test. - Ran git diff --check. - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17. - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: #63893 Problem Summary: Add focused BE unit coverage for new table reader and new parquet reader edge cases, including aggregate pushdown over split ranges, Iceberg equality/position deletes, row lineage after delete filtering, Parquet dictionary/statistics pruning, and IOContext release. Also clean up temporary delete predicate expression columns in the new Parquet reader so equality delete predicates with cast children do not alter the returned file block schema. ### Release note None ### Check List (For Author) - Test: Unit Test - Added BE UT cases in table_reader_test and parquet_reader_test. - Ran git diff --check. - Tried ./run-be-ut.sh with focused filters, but local JAVA_HOME points to JDK 11 and JDK_17 is not set; the runner requires JDK 17. - Behavior changed: No - Does this need documentation: No ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
837cc56 to
475e48a
Compare
|
run buildall |
TPC-H: Total hot run time: 29107 ms |
FE Regression Coverage ReportIncrement line coverage |
TPC-DS: Total hot run time: 168765 ms |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29068 ms |
TPC-DS: Total hot run time: 171113 ms |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29390 ms |
FE Regression Coverage ReportIncrement line coverage |
TPC-DS: Total hot run time: 168294 ms |
|
run buildall |
1 similar comment
|
run buildall |
TPC-H: Total hot run time: 29426 ms |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
267891d to
64d2dda
Compare
### What problem does this PR solve? Issue Number: None Related PR: #63781, #64671 Problem Summary: File scanner v2 did not carry the same fixes as the existing file scanner path. Predicate rows filtered inside v2 file readers were still reported through scanner load counters unless the scanner was a real load source, and Hive TEXTFILE empty physical lines were still skipped unless read_csv_empty_line_as_null was enabled. This change gates v2 load counter reporting with the same FILE_STREAM exception used by FileScanner and adds a delimited text hook so Hive Text v2 treats empty physical lines as records while CSV keeps the old default behavior. ### Release note Fix file scanner v2 load counter reporting and Hive TEXTFILE empty-line handling. ### Check List (For Author) - Test: Unit Test / Manual test - Added TextV2ReaderTest coverage for Hive TEXTFILE empty line records, single-column empty string fields, and COUNT pushdown. - Ran git diff --check. - Ran clang-format v16 through build-support/run_clang_format.py for changed files. - Attempted ./run-be-ut.sh --run --filter='TextV2ReaderTest.*:FileScannerV2Test.*', but the local run was blocked because the script needed to update/download datasketches-cpp and network access was unavailable; no BE UT binary was already built. - Attempted clang-tidy with the available compile_commands.json, but it pointed at a stale /mnt/disk3/gabriel path; the project clang-tidy wrapper also requires bash 4+ while only system bash is available. - Behavior changed: Yes. File scanner v2 now matches v1 load counter gating and Hive TEXTFILE empty-line semantics. - Does this need documentation: No ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
|
run buildall |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
1 similar comment
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#63893 Problem Summary: Restore regression expected outputs that only changed because INT96 timestamp reads moved away from the old parquet reader timezone behavior. The code now restores old INT96 compatibility, so these expected result rows should also return to the old timezone-adjusted values while leaving unrelated output differences untouched. ### Release note None ### Check List (For Author) - Test: Not run (expected result adjustment only) - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#63893 Problem Summary: The previous regression expected-result update restored timestamp output too broadly. Only the lines produced by INT96 parquet inputs should be restored to match the old parquet reader compatibility behavior. This change keeps the expected results limited to the affected INT96 tvf cases and avoids unrelated timestamp output changes. ### Release note None ### Check List (For Author) - Test: No need to test (expected-result scope correction only) - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: close #xxx Related PR: apache#63893 Problem Summary: Restore regression expected outputs that only changed because INT96 timestamp reads moved away from the old parquet reader timezone behavior. The code now restores old INT96 compatibility, so these expected result rows should also return to the old timezone-adjusted values while leaving unrelated output differences untouched. ### Release note None ### Check List (For Author) - Test: Not run (expected result adjustment only) - Behavior changed: No - Does this need documentation: No
TPC-H: Total hot run time: 29192 ms |
TPC-H: Total hot run time: 29229 ms |
TPC-DS: Total hot run time: 170764 ms |
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: File scanner v2 reads Parquet through Arrow, so the old
vparquet page cache path is not used. Repeated scans still go through
the Doris file reader for serialized Parquet column chunk data even when
the Parquet page cache option is enabled. This change registers the
selected Parquet column chunk byte ranges after row-group planning and
lets the Arrow RandomAccessFile adapter reuse StoragePageCache for reads
inside those ranges. Footer and metadata reads happen before range
registration and are intentionally excluded.
### Release note
None
### Check List (For Author)
- Test: Manual test
- Ran git diff --check.
- Ran build-support/run_clang_format.py with clang-format 16 on modified
BE files.
- Could not compile with existing be/cmake-build-debug-dev-perf because
CMakeCache.txt was generated for /mnt/disk3/gabriel/Workspace/dev1/doris
and the configured ninja path is not available in this worktree.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
Issue Number: close #xxx Related PR: apache#63893 Problem Summary: Restore regression expected outputs that only changed because INT96 timestamp reads moved away from the old parquet reader timezone behavior. The code now restores old INT96 compatibility, so these expected result rows should also return to the old timezone-adjusted values while leaving unrelated output differences untouched. None - Test: Not run (expected result adjustment only) - Behavior changed: No - Does this need documentation: No
TPC-DS: Total hot run time: 171959 ms |
ClickBench: Total hot run time: 25.78 s |
ClickBench: Total hot run time: 25.08 s |
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: FileScannerV2 did not support Doris Native files. This
change adds a native v2 FileReader implementation instead of wrapping
the legacy NativeReader. The reader validates the Native header, reads
serialized PBlock payloads, caches and replays the first block for
schema probing, exposes nullable file-local schema, projects requested
columns, and applies file-local filters. Shared materialized-column
filtering is also used by JSON and delimited text readers so predicate
accounting stays consistent. WAL is intentionally not implemented on the
v2 path because current group commit WAL scans are load scans and
FileScanOperator only selects FileScannerV2 when src_tuple_id does not
resolve to an input tuple.
### Release note
None
### Check List (For Author)
- Test:
- Style check: build-support/check-format.sh
- Unit Test: not run locally because sandbox execution cannot write
.git/modules for submodule setup and cannot download datasketches-cpp;
the attempted run-be-ut command failed before compiling tests.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
- [ ] Regression test
- [ ] Unit Test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
- [ ] Previous test can cover this change.
- [ ] No code files have been changed.
- [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
- [ ] No.
- [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
- [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 29032 ms |
TPC-DS: Total hot run time: 171569 ms |
ClickBench: Total hot run time: 25.96 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)