feat: Add xxhash64 function support #424
27112b6 to dd9738d
Thanks @advancedxy. I plan on reviewing this PR today. Could you also update |
Of course. I will update that along with a few other things, such as the review comments and the inspection file:
I'd like to see the tests use some randomly generated inputs. As a quick hack, I added the following test to
test("xxhash64") {
  val input = generateStrings(timestampPattern, 8).toDF("a")
  withTempPath { dir =>
    val data = roundtripParquet(input, dir).coalesce(1)
    data.createOrReplaceTempView("t")
    val df = spark.sql(s"select a, xxhash64(a) from t order by a")
    checkSparkAnswerAndOperator(df)
  }
}
Some differences: We could extract the
Our |
Good catch, and a good way to make sure the impl is correct. Let me check why the test is failing first. |
Found the issue. The Let me try to fix that first. |
See #426 for proposed DataGenerator class |
I filed #427 |
Thanks for filing this. I think it's the same issue for both murmur3 hash and xxhash64. I will submit a PR to fix that first.
I have submitted the fix in this PR and am waiting for CI to pass. In the morning (Beijing time), I will first create a separate PR that includes the murmur3 hash fix and depends on your #426.
@andygrove @viirya I have created #433 and marked this as a draft. We should merge that first and then come back to this PR. PTAL when you have time.
b6e42c3 to ebb3675
@andygrove @viirya @parthchandra and @sunchao would you mind taking a look at this? I think it's ready for review.
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
let num_rows = args[0..args.len() - 1]
    .iter()
    .find_map(|arg| match arg {
        ColumnarValue::Array(array) => Some(array.len()),
        ColumnarValue::Scalar(_) => None,
    })
    .unwrap_or(1);
let mut hashes: Vec<u64> = vec![0_u64; num_rows];
hashes.fill(*seed as u64);
let arrays = args[0..args.len() - 1]
    .iter()
    .map(|arg| match arg {
        ColumnarValue::Array(array) => array.clone(),
        ColumnarValue::Scalar(scalar) => {
            scalar.clone().to_array_of_size(num_rows).unwrap()
        }
    })
    .collect::<Vec<ArrayRef>>();
nit: I feel this can be simplified a little bit
let arrays = args[0..args.len() - 1]
...;
let mut hashes: Vec<u64> = vec![0_u64; arrays.len()];
hashes.fill(*seed as u64);
hmm. I think we have to compute num_rows first?
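The point about computing num_rows first can be illustrated with a small dependency-free sketch. It assumes a hypothetical `Value` enum standing in for DataFusion's `ColumnarValue` (plain `Vec<i64>` instead of Arrow arrays) and takes the seed as a separate parameter rather than as the last argument. The hash buffer needs one slot per row, while `arrays.len()` would give the number of columns, and scalars can only be expanded once the row count is known.

```rust
// Hypothetical stand-in for ColumnarValue; not the DataFusion type.
#[derive(Clone, Debug)]
enum Value {
    Array(Vec<i64>),
    Scalar(i64),
}

/// Row count comes from the first array argument; all-scalar input yields 1.
fn num_rows(args: &[Value]) -> usize {
    args.iter()
        .find_map(|arg| match arg {
            Value::Array(a) => Some(a.len()),
            Value::Scalar(_) => None,
        })
        .unwrap_or(1)
}

/// Expand every scalar to `rows` copies so all columns share one length.
fn to_arrays(args: &[Value], rows: usize) -> Vec<Vec<i64>> {
    args.iter()
        .map(|arg| match arg {
            Value::Array(a) => a.clone(),
            Value::Scalar(s) => vec![*s; rows],
        })
        .collect()
}

/// Returns the expanded columns plus one seed-initialized hash slot per row.
fn init_hashes(args: &[Value], seed: i64) -> (Vec<Vec<i64>>, Vec<u64>) {
    let rows = num_rows(args);
    let arrays = to_arrays(args, rows);
    // Sized by rows, not by arrays.len() (which is the column count).
    let hashes = vec![seed as u64; rows];
    (arrays, hashes)
}
```

In this sketch, `vec![seed as u64; rows]` does the allocation and the seed fill in one step, which may be one way to shorten the original two-line `vec!` plus `fill` without changing the order of operations.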
DataType::Boolean => {
    hash_array_boolean!(BooleanArray, col, i32, $hashes_buffer, $hash_method);
}
nit: I wonder if we can make BooleanArray and i32 macro arguments, so that we can shrink this large match...
hmm, let me give it a try. I will report back if it's too hard to do that.
If I understand your proposal correctly, do you mean something like:
match col.data_type() {
    DataType::Int8 | DataType::Int16 | DataType::Int32 | DataType::Int64 | DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64 => {
        hash_array_primitive!(get_array_type_of!(col.data_type()), col, get_input_native_type_of!(col.data_type()), $hashes_buffer, $hash_method);
    }
    ....
}?
I tried to implement that, but couldn't find a way to do it. col.data_type() is a runtime value, so I don't think we can infer it at compile time.
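A small sketch of why the match arms are hard to collapse: `macro_rules!` is expanded at compile time and receives types as tokens, so each arm has to name its concrete type explicitly. Everything below is hypothetical and dependency-free: `Column` stands in for an Arrow array plus its `DataType`, `hash_values!` stands in for the PR's `hash_array_*!` macros, and `mix` is a toy step, not xxhash64 or murmur3.

```rust
// Hypothetical column type; the variant plays the role of the runtime DataType.
enum Column {
    Boolean(Vec<bool>),
    Int32(Vec<i32>),
    Int64(Vec<i64>),
}

// Toy mixing step, NOT xxhash64 or murmur3; only here to make the sketch run.
fn mix(hash: u64, value: i64) -> u64 {
    hash.wrapping_mul(31).wrapping_add(value as u64)
}

// `$widen` is a type token fixed when the macro is expanded, i.e. at compile time.
macro_rules! hash_values {
    ($values:expr, $widen:ty, $hashes:expr) => {
        for (hash, v) in $hashes.iter_mut().zip($values.iter()) {
            *hash = mix(*hash, (*v as $widen) as i64);
        }
    };
}

// The runtime match selects an arm; each arm is a separate macro expansion
// with its own concrete widening type.
fn hash_column(col: &Column, hashes: &mut [u64]) {
    match col {
        Column::Boolean(values) => {
            hash_values!(values, i32, hashes);
        }
        Column::Int32(values) => {
            hash_values!(values, i32, hashes);
        }
        Column::Int64(values) => {
            hash_values!(values, i64, hashes);
        }
    }
}
```

Under these assumptions, the per-arm body shrinks to a single macro call, but the number of arms stays the same, because something still has to map each runtime variant to a compile-time type.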
Gentle ping @andygrove @viirya, do you have any more comments?
andygrove left a comment
This looks great to me. Thank you @advancedxy
Thanks all for reviewing, @andygrove @viirya @kazuyukitanimura @parthchandra |
* feat: Add xxhash64 function support
* Update related docs
* Update core/src/execution/datafusion/spark_hash.rs
  Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
* Update QueriesList results
---------
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Parth Chandra <parthc@apple.com>
Which issue does this PR close?
Part of #205
Closes #344
Rationale for this change
More function coverage
What changes are included in this PR?
How are these changes tested?
Newly added test.