feat: Implement Spark unhex by tshauck · Pull Request #342 · apache/datafusion-comet

tshauck · 2024-04-29T04:17:59Z

Which issue does this PR close?

Closes #341

Rationale for this change

unhex is currently unsupported by Comet. This is my first PR into this repo, so I'm certainly open to any feedback to bring it more in line with expectations.

What changes are included in this PR?

Add unhex as well as make some minor refactors.

How are these changes tested?

Added simple tests to both the Rust and Spark SQL sides of the code.

tshauck · 2024-04-29T04:20:11Z

-                let fun = BuiltinScalarFunction::from_str(fun_name);
-                if fun.is_err() {
+
+                if let Ok(fun) = BuiltinScalarFunction::from_str(fun_name) {


Unrelated, but more idiomatic IMO

tshauck · 2024-04-29T04:20:34Z

-            let fun = BuiltinScalarFunction::from_str(fun_name);
-            if fun.is_err() {
-                Ok(ScalarFunctionDefinition::UDF(registry.udf(fun_name)?))
+            if let Ok(fun) = BuiltinScalarFunction::from_str(fun_name) {


Unrelated, but more idiomatic IMO
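For readers unfamiliar with the idiom, here is a minimal, self-contained sketch of the `if let Ok(..)` pattern from the diff above. The `Builtin` and `FunctionDef` enums and the `resolve` function are hypothetical stand-ins invented for illustration; they are not the DataFusion or Comet types.

```rust
use std::str::FromStr;

// Hypothetical stand-in for a built-in function enum (not the DataFusion type).
#[derive(Debug, PartialEq)]
enum Builtin {
    Abs,
    Ceil,
}

impl FromStr for Builtin {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "abs" => Ok(Builtin::Abs),
            "ceil" => Ok(Builtin::Ceil),
            other => Err(format!("unknown builtin: {other}")),
        }
    }
}

// Hypothetical stand-in for a function definition that is either built in or a UDF.
#[derive(Debug, PartialEq)]
enum FunctionDef {
    BuiltIn(Builtin),
    Udf(String),
}

// Binding the parsed value in the `if let` avoids a separate `is_err()` check
// followed by an `unwrap()` in the other branch.
fn resolve(fun_name: &str) -> FunctionDef {
    if let Ok(fun) = Builtin::from_str(fun_name) {
        FunctionDef::BuiltIn(fun)
    } else {
        FunctionDef::Udf(fun_name.to_string())
    }
}
```

The success branch receives the parsed value directly, so no code path needs to unwrap a `Result` that was already checked.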

tshauck · 2024-04-29T04:50:29Z

          val optExpr = scalarExprToProto("atan2", leftExpr, rightExpr)
          optExprWithInfo(optExpr, expr, left, right)

+        case e @ Unhex(child, failOnError) =>


https://github.com/apache/spark/blob/59d5946cfd377e9203ccf572deb34f87fab7510c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1152
vs
https://github.com/apache/spark/blob/45ba9224602eb18fe45e339cbb8cf2e8a4924f0b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1125

We need to use a shim approach. We currently have one folder with source for spark-3.x. We should add additional folders as needed e.g. spark-3.3.x / spark-3.4.x with any code that is specific to those versions.

I am happy to help with this.

I think we can handle this using existing shim, something like

case e @ Unhex => e.child... getFailOnError(e) ...

I was wrong. getFailOnError would actually need to be modified for this approach, because Unhex is a UnaryExpression.

@andygrove Would you mind having a look at the updated PR and letting me know what you think of the shim implementation? The thought was to set up a major and a minor shim: keep the existing shim for all of Spark 3, then apply the more granular differences in the minor one. It gets a little weird in that I'm applying 3.3's shim to 3.2.

@kazuyukitanimura Thanks for looking!

Thanks @tshauck I will review this tomorrow.

I think we can have duplicated shim for 3.3 and 3.2

kazuyukitanimura · 2024-05-01T07:41:44Z

    <additional.3_4.test.source>spark-3.4</additional.3_4.test.source>
-    <shims.source>spark-3.x</shims.source>
+    <shims.majorSource>spark-3.x</shims.majorSource>
+    <shims.minorSource>spark-3.4.x</shims.minorSource>


FYI, spark-3.x will go away and is supposed to be replaced by independent spark-3.2, spark-3.3, and spark-3.4 dirs.
So let's go ahead and start using spark-3.4 instead of spark-3.4.x? I.e.

    <shims.source.shared>spark-3.x</shims.source.shared>
    <shims.source>spark-3.4</shims.source>

kazuyukitanimura · 2024-05-01T07:42:18Z

        <!-- we don't add special test suits for spark-3.2, so a not existed dir is specified-->
        <additional.3_3.test.source>not-needed-yet</additional.3_3.test.source>
        <additional.3_4.test.source>not-needed-yet</additional.3_4.test.source>
+        <shims.minorSource>spark-3.3.x</shims.minorSource>


I guess this should be spark-3.2?

kazuyukitanimura · 2024-05-01T07:55:45Z

+          val childCast = Cast(unHex._1, StringType)
+          val failOnErrorCast = Cast(unHex._2, BooleanType)


What would happen if we do not use Cast?

kazuyukitanimura · 2024-05-01T08:06:41Z

 * An utility object for query plan and expression serialization.
 */
-object QueryPlanSerde extends Logging with ShimQueryPlanSerde {
+object QueryPlanSerde extends Logging with ShimQueryPlanSerde with ShimCometUnhexExpr {


I would say ShimCometUnhexExpr should be more generic name, not unhex specific. What about ShimCometExpr (or if you have other idea, please feel free to propose)?
Otherwise, we will have to keep adding trait class per function.

Eventually we should merge ShimQueryPlanSerde and ShimCometExpr into one when we remove spark-3.x dir.

tshauck · 2024-05-03T05:13:02Z

Hi, I updated the shim handling for 3.2 and made various other updates (based on PR feedback and general cleanup). Please have another look and let me know what you think, thanks!

kazuyukitanimura · 2024-05-03T07:36:10Z

+
+    // Adjust the string if it has an odd length, and prepare to add a padding byte if needed.
+    let needs_padding = string.len() % 2 != 0;
+    let adjusted_string = if needs_padding { &string[1..] } else { string };


If I understand this correctly, string[0] is discarded when the length is odd, is it intentional?

Here is the logic in Spark 3.4.2 for handling the first char if the input is padded, for reference. It looks like there is some validation of the first digit that we do not have in this PR, and it also looks like the unhexed digit is stored in the output and used in the return value if the length of the input string is 1. It would be good to make sure that we have tests covering this case.

    if ((bytes.length & 0x01) != 0) {
      // padding with '0'
      if (bytes(0) < 0) {
        return null
      }
      val v = Hex.unhexDigits(bytes(0))
      if (v == -1) {
        return null
      }
      out(0) = v
      i += 1
      oddShift = 1
    }
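The odd-length handling discussed in this comment can be sketched in Rust roughly as follows. This is an illustrative sketch, not the code that was merged: the names `unhex_digit` and `unhex` and the `Option`-based signature are assumptions chosen to keep the example short. An odd-length input is treated as if it had a leading '0', so its first digit becomes a byte of its own instead of being discarded.

```rust
/// Map one ASCII hex digit to its numeric value, or `None` if it is not a hex digit.
fn unhex_digit(c: u8) -> Option<u8> {
    match c {
        b'0'..=b'9' => Some(c - b'0'),
        b'A'..=b'F' => Some(10 + c - b'A'),
        b'a'..=b'f' => Some(10 + c - b'a'),
        _ => None,
    }
}

/// Decode a hex string into bytes. Returns `None` if any character is not a hex digit.
/// (Hypothetical signature for illustration only.)
fn unhex(input: &str) -> Option<Vec<u8>> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity((bytes.len() + 1) / 2);
    let mut i = 0;

    if bytes.len() % 2 != 0 {
        // Odd length: behave as if the input were padded with a leading '0',
        // so the first digit is validated and emitted as its own byte.
        out.push(unhex_digit(bytes[0])?);
        i = 1;
    }

    // The remaining length is even, so `i + 1` is always in bounds here.
    while i < bytes.len() {
        let hi = unhex_digit(bytes[i])?;
        let lo = unhex_digit(bytes[i + 1])?;
        out.push((hi << 4) | lo);
        i += 2;
    }
    Some(out)
}
```

Under this sketch `unhex("A1B")` and `unhex("0A1B")` decode to the same two bytes, a single-digit input such as `"F"` yields one byte, and an invalid first digit is rejected rather than silently skipped.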

andygrove · 2024-05-03T11:00:24Z

+    fn test_unhex() -> Result<(), Box<dyn std::error::Error>> {
+        let mut result = Vec::new();
+
+        unhex("537061726B2053514C", &mut result)?;


Could we also have a test for the case where the input is padded?

andygrove · 2024-05-03T11:08:16Z

+        val table = "test"
+        withTable(table) {
+          sql(s"create table $table(col string) using parquet")
+          sql(s"insert into $table values('537061726B2053514C')")


I'd like to see more values being tested here, both valid and invalid, and covering the padded case.

andygrove · 2024-05-03T11:09:03Z

              <sources>
-                <source>src/main/${shims.source}</source>
+                <source>src/main/${shims.majorSource}</source>
+                <source>src/main/${shims.minorSource}</source>


Thanks for adding the minor version shims. These are going to help me with some of the work around supporting cast.

andygrove

Thanks for the contribution @tshauck. This is looking great. I added some comments around additional testing.

tshauck · 2024-05-03T19:47:16Z

Thanks for all the feedback. I think I've addressed the build/naming/etc feedback, and will have a look at improving the tests and any associated implementation changes sometime tomorrow. I'll request a review via GH when it's ready.

tshauck · 2024-05-04T20:35:04Z

I think this is ready for review. I updated the unhex impl to be more faithful to Spark's (for odd-length inputs in particular), added better null handling, and added more tests from the Spark repo.

tshauck · 2024-05-04T22:12:40Z

Err... looks to be an issue w/ spark 3.2 I'll need to look into. Hopefully the majority of the code'll remain unchanged.

andygrove · 2024-05-07T17:46:33Z

Thanks for the updates @tshauck. I plan on reviewing later today.

andygrove · 2024-05-07T23:41:55Z

+        |('A1B'),
+        |('0A1B')""".stripMargin)

+      checkSparkAnswerAndOperator(s"SELECT unhex(col) FROM $table")


It may be worth adding an ORDER BY clause here to ensure that the test is deterministic.

Suggested change

-      checkSparkAnswerAndOperator(s"SELECT unhex(col) FROM $table")
+      checkSparkAnswerAndOperator(s"SELECT unhex(col) FROM $table ORDER BY col")

I wonder if the Spark 3.2 failure is at all related to ordering?

     [WrappedArray(10, 27)]                           [WrappedArray(10, 27)]
    ![WrappedArray(115, 116, 114, 105, 110, 103)]     [WrappedArray(10, 27)]
    ![WrappedArray(27, 0)]                            [WrappedArray(115, 116, 114, 105, 110, 103)]

Thanks, yeah, I looked at that bit. I think the test results are sorted, but the Spark case (left column) 27 sorts after 115.

Unfortunately, I think it may be legitimately different behavior between the 3.2 and the more recent versions.

Looking at the spark repo a bit more, it looks like there was a bug in the 3.2 implementation that was fixed in subsequent versions.

How should I handle this? Should I write a native implementation of the erroneous code in order to be faithful to Spark, keep it correct even if that differs from 3.2, or maybe just fall back to the Spark implementation on that version (not sure if that's possible)? Thanks!

Actually adding the ORDER BY causes the test to fail because it no longer runs fully native, so ignore that suggestion.

Also, I do see code differences in Spark between 3.2 and 3.4 in the unhex algorithm. 3.2 does not have the oddShift variable. oddShift was added in apache/spark@276abe3

I guess the options are:

mark unhex as incompat just for 3.2 and skip the test for 3.2 (probably the easiest path)

implement per-spark-version logic in Rust for unhex

Let me know what you think or if you have any questions on this. This is a good example of the challenge of supporting multiple Spark versions 😓

Thanks for looking, I think we found the same commit w.r.t. 3.2's differences.

I'll just go the easy route since matching 3.2 would be implementing buggy code and 3.2 is the oldest supported version (so presumably it'll be dropped at some point). I'll use this as a chance to poke around the repo, but may ask in discord if I get blocked on marking it incompat for 3.2 and skipping the tests for that version.

Made an update in 36baf8e that seems to work (https://github.com/tshauck/arrow-datafusion-comet/actions/runs/8994978806). Happy to get any feedback about the approach taken.

Co-authored-by: Andy Grove <andygrove73@gmail.com>

andygrove

Thank you @tshauck. This is a great first contribution! I added a suggestion to fix the Scala 2.13 build failures and also a suggestion for an additional comment.

LGTM pending CI.

andygrove · 2024-05-08T14:08:38Z

@viirya @kazuyukitanimura do you have any additional feedback?

andygrove · 2024-05-08T20:33:12Z

I plan on merging this tomorrow if there is no more feedback

kazuyukitanimura

Sorry, one more comment

kazuyukitanimura · 2024-05-08T21:57:25Z

+                    } else if fail_on_error {
+                        return exec_err!("Input to unhex is not a valid hex string: {s}");
+                    } else {
+                        builder.append_null();


If unhex() fails and fail_on_error=false, encoded is not cleared? Is this the same behavior as Spark? It would be great if we could add a test to verify.

All good, thank you very much for looking. I think you're right in that there was a bug, which should be fixed and tested in c5c3fcd.
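To illustrate the kind of bug being discussed, here is a minimal sketch, assuming a scratch buffer that is reused across rows. The names `hex_digit`, `unhex_into`, and `unhex_column` and the plain `Vec`-based column are hypothetical; the actual fix in c5c3fcd works against Arrow builders and may differ. The point is that when decoding fails partway through and `fail_on_error` is false, any bytes already written must be dropped before a null is emitted, or they leak into the next row.

```rust
/// Map one ASCII hex digit to its value, or return an error message.
fn hex_digit(c: u8) -> Result<u8, String> {
    match c {
        b'0'..=b'9' => Ok(c - b'0'),
        b'A'..=b'F' => Ok(10 + c - b'A'),
        b'a'..=b'f' => Ok(10 + c - b'a'),
        _ => Err(format!("invalid hex digit: byte {c}")),
    }
}

/// Append the decoded bytes of `input` to `out`.
/// On error, `out` may already contain partially decoded bytes.
fn unhex_into(input: &str, out: &mut Vec<u8>) -> Result<(), String> {
    let bytes = input.as_bytes();
    let mut i = 0;
    if bytes.len() % 2 != 0 {
        out.push(hex_digit(bytes[0])?);
        i = 1;
    }
    while i < bytes.len() {
        let hi = hex_digit(bytes[i])?;
        let lo = hex_digit(bytes[i + 1])?;
        out.push((hi << 4) | lo);
        i += 2;
    }
    Ok(())
}

/// Decode a column of optional hex strings (hypothetical helper).
/// Invalid rows become `None` unless `fail_on_error` is set, in which case
/// the whole call fails.
fn unhex_column(
    values: &[Option<&str>],
    fail_on_error: bool,
) -> Result<Vec<Option<Vec<u8>>>, String> {
    let mut scratch: Vec<u8> = Vec::new();
    let mut result = Vec::with_capacity(values.len());
    for value in values {
        match value {
            None => result.push(None),
            Some(s) => match unhex_into(s, &mut scratch) {
                // Move the decoded bytes out, leaving an empty scratch buffer.
                Ok(()) => result.push(Some(std::mem::take(&mut scratch))),
                Err(_) if fail_on_error => {
                    return Err(format!("Input to unhex is not a valid hex string: {s}"));
                }
                Err(_) => {
                    // Drop partially decoded bytes so they cannot leak into the next row.
                    scratch.clear();
                    result.push(None);
                }
            },
        }
    }
    Ok(result)
}
```

Without the `scratch.clear()` call, decoding `"41ZZ"` followed by `"42"` would yield a second row containing the stale byte from the failed first row.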

kazuyukitanimura · 2024-05-08T22:00:17Z

          val optExpr = scalarExprToProto("atan2", leftExpr, rightExpr)
          optExprWithInfo(optExpr, expr, left, right)

+        case e: Unhex if !isSpark32 =>


Unrelated, but potentially we can vote in the community when to deprecate 3.2 support...

kazuyukitanimura

LGTM

kazuyukitanimura · 2024-05-27T01:42:18Z

I forgot to ask about dictionary and scalar. Filed #477

feat: ic for native unhex

5dbd4aa

tshauck commented Apr 29, 2024

tshauck added 7 commits April 30, 2024 19:11

feat: setup shim for unhex

04bb619

style: cleanup

6cb88c7

refactor: set minor source

c649aef

style: fix clippy in core

bb4ad43

fix: fix tests

a0bdbbe

style: fix clippy in core

70c9ddd

style: run scalacheck

663aef5

kazuyukitanimura reviewed May 1, 2024

tshauck added 3 commits May 2, 2024 15:25

refactor: update w/ feedback

bfe92c4

refactor: delete unused code

966d307

refactor: improve rust

97eae4b

tshauck mentioned this pull request May 3, 2024

Write a guide on contributing a new expression #370

Closed

tshauck requested review from andygrove and kazuyukitanimura May 3, 2024 05:12

viirya changed the title ~~feat: ic for native unhex~~ feat: Implement Spark unhex May 3, 2024

viirya reviewed May 3, 2024

Comment thread spark/src/main/spark-3.2/org/apache/comet/shims/ShimCometUnhexExpr.scala Outdated

viirya reviewed May 3, 2024

Comment thread spark/src/main/spark-3.2/org/apache/comet/shims/ShimCometUnhexExpr.scala Outdated

viirya reviewed May 3, 2024

Comment thread pom.xml Outdated

viirya reviewed May 3, 2024

Comment thread spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala Outdated

kazuyukitanimura reviewed May 3, 2024

andygrove reviewed May 3, 2024

refactor: rename to CometExprShim and update docs

a378f74

refactor: rename majorSource to majorVerSrc, same with minor

112c7c6

tshauck added 3 commits May 4, 2024 11:24

refactor: import unhex impl and testing

6146f3e

tests: improve spark tests, better null handling

1de0887

docs: tweak docs

bd07fed

tshauck requested review from andygrove, kazuyukitanimura and viirya May 4, 2024 20:32

andygrove reviewed May 7, 2024

build: dont use unhex on spark 3.2

36baf8e

andygrove reviewed May 8, 2024

Comment thread spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

tshauck and others added 2 commits May 8, 2024 07:03

fix: escape null byte for scala 2.13

d5a1c46

docs: better docs around why unhex test is skipped

fb1c24a

Co-authored-by: Andy Grove <andygrove73@gmail.com>

andygrove reviewed May 8, 2024

Comment thread spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

andygrove approved these changes May 8, 2024

andygrove mentioned this pull request May 8, 2024

docs: Document incompatibility of unhex with Spark 3.2 #400

Closed

kazuyukitanimura reviewed May 8, 2024

fix: clear result vec of incomplete conversion on fail

c5c3fcd

tshauck requested a review from kazuyukitanimura May 8, 2024 22:36

kazuyukitanimura approved these changes May 8, 2024

andygrove merged commit 2b42a61 into apache:main May 9, 2024

tshauck deleted the add-unhex branch May 9, 2024 00:43

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024

feat: Implement Spark unhex (apache#342)

86582b9

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024

Cargo updates after bumping the version (apache#342)

43f31a9
