[SPARK-47413][SQL] - add support to substr/left/right for collations by GideonPotok · Pull Request #46040 · apache/spark

GideonPotok · 2024-04-13T09:38:26Z

https://issues.apache.org/jira/browse/SPARK-46830

What changes were proposed in this pull request?

Add collation support to types of return values for calls to substr, left, right, when passed in arguments of an explicit, implicit, or session-specified collations. Add tests to validate behavior.

Why are the changes needed?

We are incrementally adding collation support to built-in string functions in Spark. These functions are intended to be supported for collated types.

Does this PR introduce any user-facing change?

these sql functions will now not throw errors when passed in collated types. Instead, they will return the right value, of the passed in type. Or of the default collation.

How was this patch tested?

Unit testing + ad-hoc spark shell and pyspark shell interactions.

Was this patch authored or co-authored using generative AI tooling?

No.

GideonPotok · 2024-04-14T08:15:41Z

@uros-db please review

HyukjinKwon · 2024-04-15T00:05:45Z

What's idff w/ #46039?

GideonPotok · 2024-04-15T08:54:01Z

@uros-db I made all suggested changes. Please re-review. Thanks!

uros-db · 2024-04-15T09:00:15Z

I don't think we need this many test cases here, if you didn't modify the way Substring/Left/Right expressions behave when given collated strings (i.e. you didn't introduce any collation awareness to nullSafeEval/doCodeGen), then there should be no need to go this deep - a couple of test cases should do the trick just fine

also, I think these tests can be combined with the one above to make:
test("Support Left/Right/Substr with collation") {

so that we could have something like:

checks.foreach { check => // Result & data type (explicit collation) ... // Result & data type (implicit collation) ...

Will do (But without implicit collation at all, as per your other review comment.)

GideonPotok · 2024-04-15T10:12:06Z

What's idff w/ #46039?

@HyukjinKwon Sorry for confusion. I closed out the other one to clear up which one to review.

(To answer your question, that other one did not have the unit test over a struct field, because in the past I have dealt with GHA Test flakiness over using withTable expressions. )

GideonPotok · 2024-04-16T06:35:54Z

@uros-db please re-review!

GideonPotok · 2024-04-17T12:00:07Z

@uros-db please re-review this one too.

uros-db · 2024-04-17T12:15:56Z

28 cases * 4 collations = 112 tests

I'd say we don't need that many SQL tests, there's no need to do Seq("utf8_binary_lcase", "utf8_binary", "unicode", "unicode_ci").flatMap, only 4 tests (with valid values) per function (substring/left/right) should be enough

a couple of additional tests for improper values are fine as well, but we don't need to test every possible pair of collation & function

since we don't have unit tests for these functions, let's make sure to use a smaller number of tests here in order to test more things - for example, I don't see any case/accent variation here, as well as a wider variety of variable len characters, etc.

@uros-db done. Please re-review

uros-db

fix tests

uros-db · 2024-04-17T17:37:41Z

please follow the format of tests in this suite, queries should be constructed from these parameterized cases, otherwise they don't have a point

@uros-db I will change it accordingly. Please advise - Do you want three case classes, or one case class but with a parameter for function name? If the latter (one case class), how do you want me to handle the third parameter (len), which left and right do not have, and which is optional for substr? Maybe with an Option[String]?

Is the quantity of tests satisfactory? I got it down from 112 tests to 25 tests. Thus 13 for valid values and 12 for invalid values. I can get it down to 12 valid test cases and, say, six invalid values if you prefer.

It's fine if there's a parameter and it's not used in some cases, I don't think that causes any error

otherwise you can always introduce LeftRightTestCase, whatever works

quantity is fine here, what's important is proper coverage without needless repetition, these e2e sql tests are pretty slow, so having hundreds of them for trivial expressions is less than ideal

@uros-db Done.

uros-db

no need to avoid scalastyle guides

GideonPotok · 2024-04-19T06:40:08Z

@uros-db I have made the suggested changes. please re-review.

format format format format fewer test cases now fewer test cases now fewer test cases now fewer test cases now fewer test cases now Fewer test cases and fewer tests for left/right/substr rename t1234 to t1234<index> rename QTestCase to SubstringTestCase remove redundant unit test unify test naming with struct test tests pass locally format test impl to make more in line with refactor. next add struct test, maybe. left impl right impl

GideonPotok · 2024-04-21T21:53:04Z

@uros-db

uros-db

lgtm

cloud-fan · 2024-04-22T07:16:56Z

thanks, merging to master!

github-actions Bot added the SQL label Apr 13, 2024

GideonPotok mentioned this pull request Apr 14, 2024

[SPARK-47413][SQL] - add support to substr/left/right for collations #45738

Closed

GideonPotok marked this pull request as ready for review April 14, 2024 08:15

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

GideonPotok requested a review from uros-db April 15, 2024 08:53

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db reviewed Apr 15, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

GideonPotok force-pushed the spark_collation_47413_5 branch 2 times, most recently from 62e22c6 to f6efb32 Compare April 16, 2024 06:35

GideonPotok requested a review from uros-db April 16, 2024 06:35

GideonPotok force-pushed the spark_collation_47413_5 branch from f6efb32 to 1d91e95 Compare April 16, 2024 10:39

uros-db reviewed Apr 17, 2024

View reviewed changes

uros-db suggested changes Apr 17, 2024

View reviewed changes

GideonPotok requested a review from uros-db April 17, 2024 13:11

uros-db reviewed Apr 17, 2024

View reviewed changes

GideonPotok force-pushed the spark_collation_47413_5 branch from a4d3c1e to 95be248 Compare April 18, 2024 08:42

GideonPotok requested a review from uros-db April 18, 2024 09:07

GideonPotok changed the title ~~[SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor]~~ [SPARK-47413][SQL] - add support to substr/left/right for collations Apr 18, 2024

uros-db reviewed Apr 18, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db reviewed Apr 18, 2024

View reviewed changes

Comment thread sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala Outdated

uros-db suggested changes Apr 18, 2024

View reviewed changes

GideonPotok requested a review from uros-db April 19, 2024 06:39

GideonPotok force-pushed the spark_collation_47413_5 branch from 5050cf6 to b0357b7 Compare April 21, 2024 12:46

uros-db approved these changes Apr 22, 2024

View reviewed changes

cloud-fan closed this in e1432ef Apr 22, 2024

Uh oh!

Conversation

GideonPotok commented Apr 13, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

GideonPotok commented Apr 14, 2024

Uh oh!

HyukjinKwon commented Apr 15, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

GideonPotok commented Apr 15, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

GideonPotok commented Apr 15, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

GideonPotok commented Apr 16, 2024

Uh oh!

GideonPotok commented Apr 17, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GideonPotok Apr 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

uros-db left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

uros-db left a comment

Choose a reason for hiding this comment

Uh oh!

GideonPotok commented Apr 19, 2024

Uh oh!

GideonPotok commented Apr 21, 2024

Uh oh!

uros-db left a comment

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Apr 22, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

GideonPotok Apr 17, 2024 •

edited

Loading