feat: Implement Spark unhex #342
| let fun = BuiltinScalarFunction::from_str(fun_name); | ||
| if fun.is_err() { | ||
| Ok(ScalarFunctionDefinition::UDF(registry.udf(fun_name)?)) | ||
| if let Ok(fun) = BuiltinScalarFunction::from_str(fun_name) { |
There was a problem hiding this comment.
Unrelated, but more idiomatic IMO
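A minimal sketch of the pattern suggested here, for reference. The `Builtin` and `Definition` types and the `resolve` function are hypothetical stand-ins for `BuiltinScalarFunction`, `ScalarFunctionDefinition`, and the registry lookup; only the `if let Ok(..)` shape is taken from the diff.

```rust
use std::str::FromStr;

// Hypothetical stand-in for BuiltinScalarFunction; only the FromStr shape matters.
#[derive(Debug, PartialEq)]
enum Builtin {
    Abs,
    Sqrt,
}

impl FromStr for Builtin {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "abs" => Ok(Builtin::Abs),
            "sqrt" => Ok(Builtin::Sqrt),
            other => Err(format!("unknown builtin: {other}")),
        }
    }
}

// Hypothetical stand-in for ScalarFunctionDefinition.
#[derive(Debug, PartialEq)]
enum Definition {
    BuiltIn(Builtin),
    Udf(String),
}

// Binding the parsed value in the condition avoids a separate
// `is_err()` check followed by an `unwrap()` in the other branch.
fn resolve(name: &str) -> Definition {
    if let Ok(fun) = Builtin::from_str(name) {
        Definition::BuiltIn(fun)
    } else {
        // Assumption: anything that is not a builtin is looked up as a UDF.
        Definition::Udf(name.to_string())
    }
}
```

Compared with the `is_err()` form, the parsed value is available directly inside the first branch, so no later `unwrap()` is needed.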
| val optExpr = scalarExprToProto("atan2", leftExpr, rightExpr) | ||
| optExprWithInfo(optExpr, expr, left, right) | ||
|
|
||
| case e @ Unhex(child, failOnError) => |
There was a problem hiding this comment.
https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1152
vs
https://github.com/apache/spark/blob/45ba9224602eb18fe45e339cbb8cf2e8a4924f0b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1125
There was a problem hiding this comment.
We need to use a shim approach. We currently have one folder with source for spark-3.x. We should add additional folders as needed e.g. spark-3.3.x / spark-3.4.x with any code that is specific to those versions.
I am happy to help with this.
There was a problem hiding this comment.
I think we can handle this using existing shim, something like
case e @ Unhex =>
e.child...
getFailOnError(e) ...
There was a problem hiding this comment.
I was wrong; getFailOnError actually needs to be modified for this approach, as Unhex is a UnaryExpression
There was a problem hiding this comment.
@andygrove Would you mind having a look at the updated PR and letting me know what you think of the shim implementation? The thought was to set up a major and a minor shim, keeping the existing shim for all of Spark 3 and then applying the more granular differences in the minor one. It gets a little weird in that I'm applying 3.3's shim to 3.2.
@kazuyukitanimura Thanks for looking!
There was a problem hiding this comment.
I think we can have duplicated shim for 3.3 and 3.2
| <additional.3_4.test.source>spark-3.4</additional.3_4.test.source> | ||
| <shims.source>spark-3.x</shims.source> | ||
| <shims.majorSource>spark-3.x</shims.majorSource> | ||
| <shims.minorSource>spark-3.4.x</shims.minorSource> |
There was a problem hiding this comment.
FYI spark-3.x will be gone and is supposed to be replaced by independent spark-3.2, spark-3.3, and spark-3.4 dirs
So let's go ahead and start using spark-3.4 instead of spark-3.4.x? I.e.
<shims.source.shared>spark-3.x</shims.source.shared>
<shims.source>spark-3.4</shims.source>
| <!-- we don't add special test suits for spark-3.2, so a not existed dir is specified--> | ||
| <additional.3_3.test.source>not-needed-yet</additional.3_3.test.source> | ||
| <additional.3_4.test.source>not-needed-yet</additional.3_4.test.source> | ||
| <shims.minorSource>spark-3.3.x</shims.minorSource> |
There was a problem hiding this comment.
I guess this should be spark-3.2?
| val childCast = Cast(unHex._1, StringType) | ||
| val failOnErrorCast = Cast(unHex._2, BooleanType) |
There was a problem hiding this comment.
What would happen if we do not use Cast?
| * An utility object for query plan and expression serialization. | ||
| */ | ||
| object QueryPlanSerde extends Logging with ShimQueryPlanSerde { | ||
| object QueryPlanSerde extends Logging with ShimQueryPlanSerde with ShimCometUnhexExpr { |
There was a problem hiding this comment.
I would say ShimCometUnhexExpr should have a more generic name, not an unhex-specific one. What about ShimCometExpr (or if you have another idea, please feel free to propose it)?
Otherwise, we will have to keep adding trait class per function.
Eventually we should merge ShimQueryPlanSerde and ShimCometExpr into one when we remove spark-3.x dir.
|
Hi, I updated the shim handling for 3.2 and made various other updates (based on PR feedback and general cleanup). Please have another look and let me know what you think, thanks! |
|
|
||
| // Adjust the string if it has an odd length, and prepare to add a padding byte if needed. | ||
| let needs_padding = string.len() % 2 != 0; | ||
| let adjusted_string = if needs_padding { &string[1..] } else { string }; |
There was a problem hiding this comment.
If I understand this correctly, string[0] is discarded when the length is odd, is it intentional?
There was a problem hiding this comment.
Here is the logic in Spark 3.4.2 for handling the first char if the input is padded, for reference. It looks like there is some validation of the first digit that we do not have in this PR, and it also looks like the unhexed digit is stored in the output and used as the return value if the length of the input string is 1. It would be good to make sure that we have tests covering this case.
if ((bytes.length & 0x01) != 0) {
// padding with '0'
if (bytes(0) < 0) {
return null
}
val v = Hex.unhexDigits(bytes(0))
if (v == -1) {
return null
}
out(0) = v
i += 1
oddShift = 1
}
| fn test_unhex() -> Result<(), Box<dyn std::error::Error>> { | ||
| let mut result = Vec::new(); | ||
|
|
||
| unhex("537061726B2053514C", &mut result)?; |
There was a problem hiding this comment.
Could we also have a test for the case where the input is padded?
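As a rough illustration of the padded case discussed above, here is a self-contained sketch. It is not the PR's implementation: `unhex_digit` and `unhex_padded` are hypothetical names, and returning `None` for invalid input is an assumption chosen to mirror the `return null` branches in the quoted Spark logic.

```rust
/// Maps one ASCII hex character to its 4-bit value, or None if it is not a hex digit.
fn unhex_digit(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'a'..=b'f' => Some(c - b'a' + 10),
        b'A'..=b'F' => Some(c - b'A' + 10),
        _ => None,
    }
}

/// Decodes a hex string into bytes. An odd-length input is treated as if it
/// had a leading '0', so its first character is validated and decoded on its
/// own instead of being discarded.
fn unhex_padded(input: &str) -> Option<Vec<u8>> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity((bytes.len() + 1) / 2);
    let mut i = 0;

    if bytes.len() % 2 != 0 {
        // Implicit '0' padding: the first digit becomes the low nibble of byte 0.
        out.push(unhex_digit(bytes[0])?);
        i = 1;
    }

    // The remaining length is even, so bytes[i + 1] is always in bounds.
    while i < bytes.len() {
        let hi = unhex_digit(bytes[i])?;
        let lo = unhex_digit(bytes[i + 1])?;
        out.push((hi << 4) | lo);
        i += 2;
    }
    Some(out)
}
```

Under these assumptions, `"A1B"` and `"0A1B"` decode to the same two bytes, a single character such as `"F"` decodes to one byte, and an invalid first digit makes the whole input invalid.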
| val table = "test" | ||
| withTable(table) { | ||
| sql(s"create table $table(col string) using parquet") | ||
| sql(s"insert into $table values('537061726B2053514C')") |
There was a problem hiding this comment.
I'd like to see more values being tested here, both valid and invalid, and covering the padded case.
| <sources> | ||
| <source>src/main/${shims.source}</source> | ||
| <source>src/main/${shims.majorSource}</source> | ||
| <source>src/main/${shims.minorSource}</source> |
There was a problem hiding this comment.
Thanks for adding the minor version shims. These are going to help me with some of the work around supporting cast.
|
Thanks for all the feedback. I think I've addressed the build/naming/etc feedback, and will have a look at improving the tests and any associated implementation changes sometime tomorrow. I'll request a review via GH when it's ready. |
|
I think this is ready for review. I updated the |
|
Thanks for the updates @tshauck. I plan on reviewing later today. |
| |('A1B'), | ||
| |('0A1B')""".stripMargin) | ||
|
|
||
| checkSparkAnswerAndOperator(s"SELECT unhex(col) FROM $table") |
There was a problem hiding this comment.
It may be worth adding an ORDER BY clause here to ensure that the test is deterministic.
| checkSparkAnswerAndOperator(s"SELECT unhex(col) FROM $table") | |
| checkSparkAnswerAndOperator(s"SELECT unhex(col) FROM $table ORDER BY col") |
There was a problem hiding this comment.
I wonder if the Spark 3.2 failure is at all related to ordering?
[WrappedArray(10, 27)] [WrappedArray(10, 27)]
![WrappedArray(115, 116, 114, 105, 110, 103)] [WrappedArray(10, 27)]
![WrappedArray(27, 0)] [WrappedArray(115, 116, 114, 105, 110, 103)]
There was a problem hiding this comment.
Looking at the spark repo a bit more, it looks like there was a bug in the 3.2 implementation that was fixed in subsequent versions.
How should I handle this? Should I write a native implementation of the erroneous code in order to be faithful to Spark, have it be correct even if that differs from Spark, or maybe just fall back to the Spark implementation on that version (not sure if that's possible)? Thanks!
There was a problem hiding this comment.
Actually adding the ORDER BY causes the test to fail because it no longer runs fully native, so ignore that suggestion.
Also, I do see code differences in Spark between 3.2 and 3.4 in the unhex algorithm. 3.2 does not have the oddShift variable. oddShift was added in apache/spark@276abe3
I guess the options are:
- mark unhex as incompat just for 3.2 and skip the test for 3.2 (probably the easiest path)
- implement per-spark-version logic in Rust for unhex
Let me know what you think or if you have any questions on this. This is a good example of the challenge of supporting multiple Spark versions 😓
There was a problem hiding this comment.
Thanks for looking, I think we found the same commit w.r.t. 3.2's differences.
I'll just go the easy route since matching 3.2 would be implementing buggy code and 3.2 is the oldest supported version (so presumably it'll be dropped at some point). I'll use this as a chance to poke around the repo, but may ask in discord if I get blocked on marking it incompat for 3.2 and skipping the tests for that version.
There was a problem hiding this comment.
Made an update in 36baf8e that seems to work (https://github.com/tshauck/arrow-datafusion-comet/actions/runs/8994978806). Happy to get any feedback about the approach taken.
Co-authored-by: Andy Grove <andygrove73@gmail.com>
|
@viirya @kazuyukitanimura do you have any additional feedback? |
|
I plan on merging this tomorrow if there is no more feedback |
kazuyukitanimura
left a comment
There was a problem hiding this comment.
Sorry, one more comment
| } else if fail_on_error { | ||
| return exec_err!("Input to unhex is not a valid hex string: {s}"); | ||
| } else { | ||
| builder.append_null(); |
There was a problem hiding this comment.
If unhex() fails and fail_on_error=false, the encoded is not cleared? Is this the same behavior as Spark? It would be great if we could add a test to verify.
There was a problem hiding this comment.
All good, thank you very much for looking. I think you're right in that there was a bug, which should be fixed and tested in c5c3fcd.
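To illustrate the kind of bug described in this thread, here is a hedged sketch, not the code from c5c3fcd. It assumes a scratch buffer that is reused across rows; `hex_val`, `decode_into`, and `unhex_rows` are hypothetical names, and a plain `Vec<Option<Vec<u8>>>` stands in for the Arrow builder.

```rust
/// Returns the value of one hex digit, or None for any other byte.
fn hex_val(c: u8) -> Option<u8> {
    (c as char).to_digit(16).map(|d| d as u8)
}

/// Decodes `input` onto the end of `out`. On error, any bytes decoded before
/// the invalid character are left in `out`, which is what makes clearing matter.
fn decode_into(input: &str, out: &mut Vec<u8>) -> Result<(), String> {
    let bytes = input.as_bytes();
    let mut i = 0;
    if bytes.len() % 2 != 0 {
        let v = hex_val(bytes[0])
            .ok_or_else(|| format!("Input to unhex is not a valid hex string: {input}"))?;
        out.push(v);
        i = 1;
    }
    while i < bytes.len() {
        match (hex_val(bytes[i]), hex_val(bytes[i + 1])) {
            (Some(hi), Some(lo)) => out.push((hi << 4) | lo),
            _ => return Err(format!("Input to unhex is not a valid hex string: {input}")),
        }
        i += 2;
    }
    Ok(())
}

/// Decodes every row with one shared scratch buffer. Invalid rows become None
/// unless `fail_on_error` is set, in which case the first error is returned.
fn unhex_rows(rows: &[&str], fail_on_error: bool) -> Result<Vec<Option<Vec<u8>>>, String> {
    let mut scratch = Vec::new();
    let mut result = Vec::with_capacity(rows.len());
    for row in rows {
        match decode_into(row, &mut scratch) {
            Ok(()) => result.push(Some(scratch.clone())),
            Err(e) if fail_on_error => return Err(e),
            Err(_) => result.push(None),
        }
        // Clear on every path so partial bytes from a failed row never
        // leak into the value produced for the next row.
        scratch.clear();
    }
    Ok(result)
}
```

A test along these lines would feed a partially valid row such as `"4142ZZ"` followed by a valid row and check that the second result contains only its own bytes.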
| val optExpr = scalarExprToProto("atan2", leftExpr, rightExpr) | ||
| optExprWithInfo(optExpr, expr, left, right) | ||
|
|
||
| case e: Unhex if !isSpark32 => |
There was a problem hiding this comment.
Unrelated, but potentially we can vote in the community when to deprecate 3.2 support...
|
I forgot to ask about dictionary and scalar. Filed #477 |


Which issue does this PR close?
Closes #341
Rationale for this change
unhex is currently unsupported by comet. This is my first PR into this repo, so certainly open to any feedback to make it more in line w/ expectations.
What changes are included in this PR?
Add unhex as well as make some minor refactors.
How are these changes tested?
Added simple tests to the rust and spark sql side of the code.