fix: Compute murmur3 hash with dictionary input correctly by advancedxy · Pull Request #433 · apache/datafusion-comet

advancedxy · 2024-05-15T12:47:37Z

Which issue does this PR close?

Closes #427

Rationale for this change

Bug fix. While working on #424, we found a bug in spark_hash: it does not handle dictionary arrays correctly.
This PR fixes that first.

What changes are included in this PR?

Refactor parts of spark_hash.rs to prepare for xxhash64 support
Unpack dictionary arrays when computing hashes
Update the tests
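
To illustrate what "unpacking" a dictionary means here, the following is a minimal sketch and not the code from this PR. It assumes a simplified, hypothetical `DictColumn` type in place of Arrow's dictionary array, and a stand-in `toy_hash` in place of `spark_compatible_murmur3_hash`. The point is only that each row's hash should be computed from the value its key refers to, so a dictionary-encoded column and the equivalent plain column produce the same hashes.

```rust
/// Stand-in for a seeded, chainable hash function. This is NOT Spark's
/// murmur3; it is FNV-1a style mixing used only to keep the sketch
/// self-contained.
fn toy_hash(bytes: &[u8], seed: u32) -> u32 {
    let mut h = seed ^ 0x811c_9dc5;
    for b in bytes {
        h ^= u32::from(*b);
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

/// Hypothetical, simplified dictionary column: `keys[i]` indexes into
/// `values`, and `None` marks a null row.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<i32>,
}

/// Update one running hash per row from a plain column. Null rows leave
/// the running hash untouched.
fn hash_plain(col: &[Option<i32>], hashes: &mut [u32]) {
    for (v, h) in col.iter().zip(hashes.iter_mut()) {
        if let Some(v) = v {
            *h = toy_hash(&v.to_le_bytes(), *h);
        }
    }
}

/// Unpack the dictionary: resolve each row's key to its value and hash
/// the value, so the result matches `hash_plain` on the decoded column.
fn hash_dict(col: &DictColumn, hashes: &mut [u32]) {
    for (k, h) in col.keys.iter().zip(hashes.iter_mut()) {
        if let Some(k) = k {
            *h = toy_hash(&col.values[*k].to_le_bytes(), *h);
        }
    }
}
```

Calling `hash_dict` on keys `[0, 1, null, 0]` with values `[10, 20]` and `hash_plain` on `[10, 20, null, 10]`, both starting from the same seed buffer, should leave the two hash buffers equal, with the null row still holding its seed.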

This PR currently depends on #426; I will rebase once that is merged.

How are these changes tested?

Updated test with randomized input.

kazuyukitanimura · 2024-05-15T17:38:08Z

-                        }
-                    }
-                }
+                hash_array_boolean!(BooleanArray, col, i32, hashes_buffer);


I am wondering why this is a macro. It looks like this is the only use case?

Two reasons:

It could be reused for xxhash64 too, which I am currently working on in feat: Add xxhash64 function support #424.

It is mainly a style choice: it stays consistent with the other types in this function, which are all handled through macros.

viirya · 2024-05-15T18:41:24Z

+macro_rules! hash_array_boolean {
+    ($array_type: ident, $column: ident, $hash_input_type: ident, $hashes: ident) => {
+        let array = $column.as_any().downcast_ref::<$array_type>().unwrap();
+        if array.null_count() == 0 {
+            for (i, hash) in $hashes.iter_mut().enumerate() {
+                *hash = spark_compatible_murmur3_hash(
+                    $hash_input_type::from(array.value(i)).to_le_bytes(),
+                    *hash,
+                );
+            }
+        } else {
+            for (i, hash) in $hashes.iter_mut().enumerate() {
+                if !array.is_null(i) {
+                    *hash = spark_compatible_murmur3_hash(
+                        $hash_input_type::from(array.value(i)).to_le_bytes(),
+                        *hash,
+                    );
+                }
+            }
+        }
+    };
+}


Is this pulled out as a macro because you will later use a hash function other than spark_compatible_murmur3_hash?

Yes. It could be used to support the xxhash64 function.
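
As a sketch of how such a macro could serve more than one hash function, the variant below takes the hash function as an extra macro argument. This is an assumption about one possible shape, not the code in this PR or in #424: `hash_bool_column`, `toy_hash32`, and `toy_hash64` are hypothetical names, the two hash functions are stand-ins (not murmur3 or xxhash64), and a slice of `Option<bool>` replaces the Arrow `BooleanArray`.

```rust
/// Stand-in 32-bit hash over a 4-byte input (NOT Spark's murmur3).
fn toy_hash32(bytes: [u8; 4], seed: u32) -> u32 {
    (seed ^ u32::from_le_bytes(bytes))
        .wrapping_mul(0x9E37_79B1)
        .rotate_left(13)
}

/// Stand-in 64-bit hash over an 8-byte input (NOT xxhash64).
fn toy_hash64(bytes: [u8; 8], seed: u64) -> u64 {
    (seed ^ u64::from_le_bytes(bytes))
        .wrapping_mul(0x9E37_79B9_7F4A_7C15)
        .rotate_left(31)
}

/// Hypothetical generalisation of a boolean hashing macro: the integer
/// type the boolean is widened to and the hash function are parameters.
macro_rules! hash_bool_column {
    ($column: ident, $hash_input_type: ident, $hashes: ident, $hash_fn: ident) => {
        for (value, hash) in $column.iter().zip($hashes.iter_mut()) {
            // Null rows are skipped, so their running hash is unchanged.
            if let Some(v) = value {
                *hash = $hash_fn($hash_input_type::from(*v).to_le_bytes(), *hash);
            }
        }
    };
}

/// 32-bit flavour: booleans are widened to i32 before hashing.
fn hash_bools_32(col: &[Option<bool>], hashes: &mut [u32]) {
    hash_bool_column!(col, i32, hashes, toy_hash32);
}

/// 64-bit flavour: booleans are widened to i64 before hashing.
fn hash_bools_64(col: &[Option<bool>], hashes: &mut [u64]) {
    hash_bool_column!(col, i64, hashes, toy_hash64);
}
```

Both wrappers expand the same macro body; only the widened integer type and the hash function differ, which is the kind of reuse discussed above.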

advancedxy · 2024-05-16T00:55:14Z

+  test("hash functions with random input") {
+    val dataGen = DataGenerator.DEFAULT
+    // sufficient number of rows to create dictionary encoded ArrowArray.
+    val randomNumRows = 1000


A note here:
I'm not 100 percent sure how we can trigger a dictionary array on the native side from Spark.

When the number of rows is small, such as 100 or 200, no dictionary array is involved on the native side, although the Parquet file should be written with all columns dictionary encoded.

I experimented a bit and settled on 1000, which triggers a dictionary-encoded ArrowArray on the Rust side.

Potentially we can add repeated values to force dictionary encoding. E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows

E.g. randomly generate 100 rows and repeat 10 times to make 1000 rows

So dictionary encoding is only triggered with enough repetition?

Yes, makeParquetFileAllTypes or some of the existing dictionary-related tests may be helpful.

The Parquet file writer will automatically generate a dictionary if the cardinality is low (i.e., there is a small number of unique values).
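
The "100 random rows repeated 10 times" idea can be sketched as follows. This is a toy in-memory illustration, not the Parquet writer's logic: `pseudo_random`, `repeated_rows`, and `dictionary_encode` are hypothetical helpers, and whether a real writer keeps the dictionary encoding depends on its own configuration. The sketch only shows that repetition caps the number of distinct values, so a 1000-row column needs at most 100 dictionary entries.

```rust
use std::collections::HashMap;

/// Deterministic pseudo-random values from a simple linear congruential
/// generator, so the sketch needs no external crates.
fn pseudo_random(n: usize, mut state: u64) -> Vec<i64> {
    (0..n)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 33) as i64
        })
        .collect()
}

/// Generate `base` pseudo-random rows and repeat them `repeats` times.
fn repeated_rows(base: usize, repeats: usize) -> Vec<i64> {
    let seed_rows = pseudo_random(base, 7);
    seed_rows
        .iter()
        .cycle()
        .take(base * repeats)
        .copied()
        .collect()
}

/// Toy dictionary encoding: returns the distinct values in first-seen
/// order and, for every row, the index of its value in that list.
fn dictionary_encode(rows: &[i64]) -> (Vec<i64>, Vec<usize>) {
    let mut index: HashMap<i64, usize> = HashMap::new();
    let mut values = Vec::new();
    let mut keys = Vec::with_capacity(rows.len());
    for &row in rows {
        let next = values.len();
        let key = *index.entry(row).or_insert_with(|| {
            values.push(row);
            next
        });
        keys.push(key);
    }
    (values, keys)
}
```

With `repeated_rows(100, 10)`, the encoded column has 1000 keys but no more than 100 dictionary values, and mapping each key back through the values reproduces the original rows.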

viirya · 2024-05-16T04:09:36Z

+              |insert into $table values
+              |('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
+              |, ('Spark SQL  ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)
+              |""".stripMargin)


Did you insert extra space characters?

~~oops, this is by accident. Let me try again and revert it.~~

I checked again. The current version uses 4-space indentation, which should be correct.
I think the previous commit was wrong, so it can be fixed in this PR.

advancedxy · 2024-05-16T13:39:01Z

@viirya @kazuyukitanimura @sunchao PTAL when you have time.

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

advancedxy · 2024-05-20T02:40:48Z

Gentle ping @viirya, @sunchao and @andygrove

kazuyukitanimura

LGTM

andygrove

LGTM. Thank you @advancedxy

* fix: Handle compute murmur3 hash with dictionary input correctly
* add unit tests
* spotless apply
* apply scala fix
* address comment
* another style issue
* Update spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

---------

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

(cherry picked from commit 93af704)

advancedxy mentioned this pull request May 15, 2024

feat: Add xxhash64 function support #424

Merged

advancedxy changed the title ~~fix: Handle compute murmur3 hash with dictionary input correctly~~ fix: Compute murmur3 hash with dictionary input correctly May 15, 2024

andygrove requested a review from sunchao May 15, 2024 15:04

kazuyukitanimura reviewed May 15, 2024

viirya reviewed May 15, 2024

Comment thread core/src/execution/datafusion/spark_hash.rs Outdated

viirya reviewed May 15, 2024

Comment thread core/src/execution/datafusion/spark_hash.rs Outdated

advancedxy added 5 commits May 16, 2024 08:35

fix: Handle compute murmur3 hash with dictionary input correctly

07705a3

add unit tests

e273e27

spotless apply

2ed2ddb

apply scala fix

0a0ee52

address comment

a417a20

advancedxy force-pushed the fix_murmur3_hash branch from cadb5be to a417a20 Compare May 16, 2024 00:39

another style issue

b708ba0

advancedxy commented May 16, 2024

viirya reviewed May 16, 2024

Comment thread core/src/execution/datafusion/spark_hash.rs

viirya reviewed May 16, 2024

Comment thread spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala Outdated

Update spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

089b37a

Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>

kazuyukitanimura approved these changes May 20, 2024

andygrove approved these changes May 24, 2024

andygrove merged commit 93af704 into apache:main May 24, 2024

5 participants
