fix: Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY#22995
fix: Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY#22995lyne7-sc wants to merge 3 commits into
Conversation
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn test_row_group_bloom_filter_pruning_predicate_decimal128() { |
There was a problem hiding this comment.
Nice regression coverage for the fixed-width truncation path. It might be worth adding a negative decimal case as well, since Parquet fixed-len decimal bytes depend on two's-complement sign extension and truncation. For example, you could write row groups with negative values and assert that a predicate like decimal_col = -500 keeps only the matching row group.
There was a problem hiding this comment.
Added a negative Decimal128 regression test using values from -100 to -600 and verifying decimal_col = -500 keeps only the matching row group.
| /// * Parquet physical [`Type`] needed to evaluate literals against the filter | ||
| column_sbbf: HashMap<String, (Sbbf, Type)>, | ||
| /// * Type length from the Parquet column descriptor | ||
| column_sbbf: HashMap<String, (Sbbf, Type, i32)>, |
There was a problem hiding this comment.
This tuple now carries a pretty important type_length contract. A small named struct, such as struct ColumnBloomFilter { sbbf: Sbbf, physical_type: Type, type_length: i32 }, could make the invariant clearer and help avoid accidentally mixing up tuple fields at call sites.
There was a problem hiding this comment.
Addressed by replacing the tuple with a ColumnBloomFilter struct.
|
@kosiew Thanks for the review! I’ve addressed the comments. |
Which issue does this PR close?
Rationale for this change
Parquet bloom filter pruning can incorrectly prune decimal columns encoded as
FIXED_LEN_BYTE_ARRAY.Bloom filters are checked against the physical bytes stored in the Parquet file. For
FIXED_LEN_BYTE_ARRAY, the byte width comes from the Parquet column descriptor'stype_length. DataFusion was checking decimal literals using a fixed-width integer byte representation, which can differ from thefile's fixed byte width and cause false negatives.What changes are included in this PR?
type_lengthtogether with the bloom filter metadata.type_lengthwhen checking decimal literals againstFIXED_LEN_BYTE_ARRAYbloom filters.Are these changes tested?
Yes.
Are there any user-facing changes?
No API changes. This fixes incorrect query results when Parquet bloom filter pruning is enabled for fixed-length decimal columns.