fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325
Open
zhf999 wants to merge 4 commits into
Open
fix: fix incorrect chunk_end calculation in ComputePageRanges when dictionary page is present#325zhf999 wants to merge 4 commits into
zhf999 wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds coverage and fixes range computation for Parquet column chunks that include a dictionary page, preventing computed read ranges from exceeding the actual chunk boundary.
Changes:
- Fix
ComputePageRangesto account for dictionary page size when computing data-page read ranges andchunk_end. - Add an end-to-end unit test reproducing/guarding the dictionary-encoding chunk-end overshoot bug.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/paimon/format/parquet/page_filtered_row_group_reader.cpp | Adjusts chunk boundary and fallback read range sizing when a dictionary page is present. |
| src/paimon/format/parquet/page_filtered_row_group_reader_test.cpp | Adds a regression test ensuring computed ranges don’t exceed the true column chunk end with dictionary encoding. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
When dictionary encoding is enabled,
ComputePageRangescomputeschunk_endasdata_page_offset + total_compressed_size. This is incorrect becausetotal_compressed_sizecovers the entire column chunk including the dictionary page,but
data_page_offsetpoints to the first data page (after the dictionary page).This causes the last data page's byte range to overshoot the actual chunk boundary
by exactly the dictionary page size, potentially reading into the next column chunk
or beyond the file.
The fix subtracts the dictionary page size from
total_compressed_sizebeforecomputing
chunk_end, so thatchunk_end = data_page_offset + (total_compressed_size - dict_size).Tests
ComputePageRangesWithDictionaryEncodingunit test that writes a Parquet filewith dictionary encoding enabled and verifies that no computed page range exceeds
the true column chunk boundary.
API and Format
No.
Documentation
No.
Generative AI tooling
Claude 4.7 opus