Skip to content

fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325

Open
zhf999 wants to merge 4 commits into
alibaba:mainfrom
zhf999:page-length-fix
Open

fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325
zhf999 wants to merge 4 commits into
alibaba:mainfrom
zhf999:page-length-fix

Conversation

@zhf999
Copy link
Copy Markdown
Contributor

@zhf999 zhf999 commented Jun 1, 2026

Purpose

When dictionary encoding is enabled, ComputePageRanges computes chunk_end as
data_page_offset + total_compressed_size. This is incorrect because
total_compressed_size covers the entire column chunk including the dictionary page,
but data_page_offset points to the first data page (after the dictionary page).

This causes the last data page's byte range to overshoot the actual chunk boundary
by exactly the dictionary page size, potentially reading into the next column chunk
or beyond the file.

The fix subtracts the dictionary page size from total_compressed_size before
computing chunk_end, so that chunk_end = data_page_offset + (total_compressed_size - dict_size).

Tests

  • Added ComputePageRangesWithDictionaryEncoding unit test that writes a Parquet file
    with dictionary encoding enabled and verifies that no computed page range exceeds
    the true column chunk boundary.

API and Format

No.

Documentation

No.

Generative AI tooling

Claude 4.7 opus

Copilot AI review requested due to automatic review settings June 1, 2026 08:57
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds coverage and fixes range computation for Parquet column chunks that include a dictionary page, preventing computed read ranges from exceeding the actual chunk boundary.

Changes:

  • Fix ComputePageRanges to account for dictionary page size when computing data-page read ranges and chunk_end.
  • Add an end-to-end unit test reproducing/guarding the dictionary-encoding chunk-end overshoot bug.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/paimon/format/parquet/page_filtered_row_group_reader.cpp Adjusts chunk boundary and fallback read range sizing when a dictionary page is present.
src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp Adds a regression test ensuring computed ranges don’t exceed the true column chunk end with dictionary encoding.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/paimon/format/parquet/page_filtered_row_group_reader.cpp
Comment thread src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp
Comment thread src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp
Comment thread src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants