[improvement](be) Add scanner v2 parquet page cache #64883

Merged
Gabriel39 merged 2 commits into apache:refact_reader_branch from Gabriel39:parquet_cache
Jun 26, 2026

@Gabriel39

Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: File scanner v2 reads Parquet through Arrow, so the old vparquet page cache path is not used. Repeated scans still go through the Doris file reader for serialized Parquet column chunk data even when the Parquet page cache option is enabled. This change registers the selected Parquet column chunk byte ranges after row-group planning and lets the Arrow RandomAccessFile adapter reuse StoragePageCache for reads inside those ranges. Footer and metadata reads happen before range registration and are intentionally excluded.
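
The range-gated read path described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ToyPageCache` and `CachedRangeReader` are hypothetical names, the cache is a plain in-memory map rather than StoragePageCache, and the cache key here is just (offset, length), whereas a real key would presumably also identify the file. Reads that fall entirely inside a registered column chunk range go through the cache; anything else, such as a footer read issued before registration, goes straight to the underlying file reader.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a page cache; NOT the StoragePageCache API.
class ToyPageCache {
public:
    bool lookup(int64_t offset, int64_t length, std::string* out) const {
        auto it = _entries.find(std::make_pair(offset, length));
        if (it == _entries.end()) return false;
        *out = it->second;
        return true;
    }
    void insert(int64_t offset, int64_t length, std::string bytes) {
        _entries[std::make_pair(offset, length)] = std::move(bytes);
    }

private:
    std::map<std::pair<int64_t, int64_t>, std::string> _entries;
};

// Hypothetical adapter that sits between the Parquet reader and the file.
class CachedRangeReader {
public:
    // Callback that reads `length` bytes at `offset` from the real file.
    using ReadFn = std::function<std::string(int64_t, int64_t)>;

    CachedRangeReader(ReadFn read_file, ToyPageCache* cache)
            : _read_file(std::move(read_file)), _cache(cache) {}

    // Called after row-group planning with a half-open range [begin, end)
    // covering one selected column chunk.
    void register_range(int64_t begin, int64_t end) {
        assert(begin <= end);
        _ranges.emplace_back(begin, end);
    }

    std::string read_at(int64_t offset, int64_t length) {
        // Reads outside every registered range bypass the cache entirely.
        if (!_inside_registered_range(offset, length)) {
            return _read_file(offset, length);
        }
        std::string bytes;
        if (_cache->lookup(offset, length, &bytes)) {
            return bytes; // cache hit: no file I/O
        }
        bytes = _read_file(offset, length);
        _cache->insert(offset, length, bytes);
        return bytes;
    }

private:
    // True only if [offset, offset + length) lies fully inside one range.
    bool _inside_registered_range(int64_t offset, int64_t length) const {
        for (const auto& r : _ranges) {
            if (offset >= r.first && offset + length <= r.second) return true;
        }
        return false;
    }

    ReadFn _read_file;
    ToyPageCache* _cache;
    std::vector<std::pair<int64_t, int64_t>> _ranges;
};
```

In this sketch a read that straddles the end of a registered range is treated as uncacheable and served from the file; whether the actual adapter handles partial overlaps that way is not stated in the summary.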

Release note

None

Check List (For Author)

  • Test: Manual test
    • Ran git diff --check.
    • Ran build-support/run_clang_format.py with clang-format 16 on modified BE files.
    • Could not compile with existing be/cmake-build-debug-dev-perf because CMakeCache.txt was generated for /mnt/disk3/gabriel/Workspace/dev1/doris and the configured ninja path is not available in this worktree.
  • Behavior changed: No
  • Does this need documentation: No


Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The Hive OpenX JSON regression expected ten null rows for the ignore.malformed.json table, while the malformed input file contains eleven physical malformed JSON lines. The JSON v2 reader returns one null row for each malformed line when ignore.malformed.json is enabled, so the regression expectation needs to include the extra null row.

### Release note

None

### Check List (For Author)

- Test: Manual test
    - Verified the expected q1 output now contains eleven null rows and ran git diff --check.
- Behavior changed: No
- Does this need documentation: No
@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39 Gabriel39 merged commit 14430d2 into apache:refact_reader_branch Jun 26, 2026
18 of 26 checks passed
