[improvement](be) Add scanner v2 parquet page cache by Gabriel39 · Pull Request #64883 · apache/doris

Gabriel39 · 2026-06-26T07:33:22Z

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: File scanner v2 reads Parquet through Arrow, so the old vparquet page cache path is not used. Repeated scans still go through the Doris file reader for serialized Parquet column chunk data even when the Parquet page cache option is enabled. This change registers the selected Parquet column chunk byte ranges after row-group planning and lets the Arrow RandomAccessFile adapter reuse StoragePageCache for reads inside those ranges. Footer and metadata reads happen before range registration and are intentionally excluded.
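The range-gated caching described above can be illustrated with a small standalone sketch. This is not the code from this PR: `FakeFile` and `RangeCachedReader` are hypothetical names, the `std::map` stands in for StoragePageCache, and keying the cache by exact (offset, length) is an assumption made only to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the underlying file reader; counts physical reads.
struct FakeFile {
    std::string data;
    int physical_reads = 0;

    std::string read_at(uint64_t offset, uint64_t length) {
        ++physical_reads;
        return data.substr(offset, length);
    }
};

// Hypothetical adapter: only reads that fall entirely inside a registered
// column chunk range are served from (and stored in) the cache.
class RangeCachedReader {
public:
    explicit RangeCachedReader(FakeFile* file) : _file(file) {}

    // Would be called after row-group planning, once per selected column chunk.
    void register_range(uint64_t offset, uint64_t length) {
        assert(length > 0); // an empty range is treated as a caller bug in this sketch
        _ranges.emplace_back(offset, offset + length);
    }

    std::string read_at(uint64_t offset, uint64_t length) {
        if (!_inside_registered_range(offset, offset + length)) {
            // Footer and metadata reads happen before registration, so they
            // always take this uncached path.
            return _file->read_at(offset, length);
        }
        const std::pair<uint64_t, uint64_t> key(offset, length);
        auto it = _cache.find(key);
        if (it != _cache.end()) {
            return it->second; // cache hit: no physical read
        }
        std::string bytes = _file->read_at(offset, length);
        _cache.emplace(key, bytes);
        return bytes;
    }

private:
    bool _inside_registered_range(uint64_t begin, uint64_t end) const {
        return std::any_of(_ranges.begin(), _ranges.end(),
                           [&](const std::pair<uint64_t, uint64_t>& r) {
                               return begin >= r.first && end <= r.second;
                           });
    }

    FakeFile* _file;
    std::vector<std::pair<uint64_t, uint64_t>> _ranges; // [begin, end) byte ranges
    std::map<std::pair<uint64_t, uint64_t>, std::string> _cache;
};
```

In this sketch, a repeated read inside a registered range costs one physical read in total, while a read outside every registered range (such as a footer read) goes to the file each time.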

Release note

None

Check List (For Author)

Test: Manual test
- Ran git diff --check.
- Ran build-support/run_clang_format.py with clang-format 16 on modified BE files.
- Could not compile with existing be/cmake-build-debug-dev-perf because CMakeCache.txt was generated for /mnt/disk3/gabriel/Workspace/dev1/doris and the configured ninja path is not available in this worktree.
Behavior changed: No
Does this need documentation: No

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The Hive OpenX JSON regression expected ten null rows for the ignore.malformed.json table, while the malformed input file contains eleven physical malformed JSON lines. The JSON v2 reader returns one null row for each malformed line when ignore.malformed.json is enabled, so the regression expectation needs to include the extra null row.

Release note

None

Check List (For Author)

Test: Manual test
- Verified the expected q1 output now contains eleven null rows and ran git diff --check.
Behavior changed: No
Does this need documentation: No

hello-stephen · 2026-06-26T07:33:27Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Gabriel39 added 2 commits June 26, 2026 13:52

Gabriel39 merged commit 14430d2 into apache:refact_reader_branch Jun 26, 2026
18 of 26 checks passed

Gabriel39 merged 2 commits into apache:refact_reader_branch from Gabriel39:parquet_cache