fix: Compute murmur3 hash with dictionary input correctly#433
Conversation
| } | ||
| } | ||
| } | ||
| hash_array_boolean!(BooleanArray, col, i32, hashes_buffer); |
There was a problem hiding this comment.
I am wondering why this is a macro. It looks like this is the only use case?
There was a problem hiding this comment.
Two reasons:
- it could be reused for xxhash64 too, which I am currently working in feat: Add xxhash64 function support #424
- mainly style issue, to be consistent with other types in this function, which are all called by macro.
| macro_rules! hash_array_boolean { | ||
| ($array_type: ident, $column: ident, $hash_input_type: ident, $hashes: ident) => { | ||
| let array = $column.as_any().downcast_ref::<$array_type>().unwrap(); | ||
| if array.null_count() == 0 { | ||
| for (i, hash) in $hashes.iter_mut().enumerate() { | ||
| *hash = spark_compatible_murmur3_hash( | ||
| $hash_input_type::from(array.value(i)).to_le_bytes(), | ||
| *hash, | ||
| ); | ||
| } | ||
| } else { | ||
| for (i, hash) in $hashes.iter_mut().enumerate() { | ||
| if !array.is_null(i) { | ||
| *hash = spark_compatible_murmur3_hash( | ||
| $hash_input_type::from(array.value(i)).to_le_bytes(), | ||
| *hash, | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| }; | ||
| } |
There was a problem hiding this comment.
This is pull out as a macro because you will use different hash function other than spark_compatible_murmur3_hash later?
There was a problem hiding this comment.
yeah. It could be used to support xxhash64 function.
cadb5be to
a417a20
Compare
| test("hash functions with random input") { | ||
| val dataGen = DataGenerator.DEFAULT | ||
| // sufficient number of rows to create dictionary encoded ArrowArray. | ||
| val randomNumRows = 1000 |
There was a problem hiding this comment.
Some note here:
I'm not 100 percent sure how we could trigger a dictionary array in the native side from Spark.
When the random number is small, such as 100/200, there's no dictionary array involved in the native side, although the parquet should be written as all columns dictionary encoded.
I tweaked a bit and settled with 1000, which triggers a dictionary encoded ArrowArray in the rust side.
There was a problem hiding this comment.
Potentially we can add repeated values to force dictionary. E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows
There was a problem hiding this comment.
E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows
So dictionary encoding is only triggered with enough repetition?
There was a problem hiding this comment.
Yes, makeParquetFileAllTypes or some existing dictionary related tests may be helpful
There was a problem hiding this comment.
The Parquet file writer will automatically generate a dictionary if the cardinality is low (i.e there is a small number of unique values).
| |insert into $table values | ||
| |('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999) | ||
| |, ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999) | ||
| |""".stripMargin) |
There was a problem hiding this comment.
Did you insert extra space characters?
There was a problem hiding this comment.
oops, this is by accident. Let me try again and revert it.
Did another check. The current version now has 4 space indentations, which should be correct.
I think it was wrong in the previous commit and could be updated in this PR.
|
@viirya @kazuyukitanimura @sunchao PTAL when you have time. |
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
|
Gently ping @viirya @sunchao and @andygrove |
andygrove
left a comment
There was a problem hiding this comment.
LGTM. Thank you @advancedxy
* fix: Handle compute murmur3 hash with dictionary input correctly * add unit tests * spotless apply * apply scala fix * address comment * another style issue * Update spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> --------- Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> (cherry picked from commit 93af704)
Which issue does this PR close?
Closes #427
Rationale for this change
Bug fixes. When submitting #424, we found there's a bug in spark_hash, which doesn't handle dictionary array correctly.
This PR tries to fix this first.
What changes are included in this PR?
This PR currently depends on #426, will rebase once that's merged.
How are these changes tested?
Updated test with randomized input.