Skip to content

[SPARK-47413][SQL] - add support to substr/left/right for collations#46040

Closed
GideonPotok wants to merge 1 commit into
apache:masterfrom
GideonPotok:spark_collation_47413_5
Closed

[SPARK-47413][SQL] - add support to substr/left/right for collations#46040
GideonPotok wants to merge 1 commit into
apache:masterfrom
GideonPotok:spark_collation_47413_5

Conversation

@GideonPotok

Copy link
Copy Markdown
Contributor

https://issues.apache.org/jira/browse/SPARK-46830

What changes were proposed in this pull request?

Add collation support to types of return values for calls to substr, left, right, when passed in arguments of an explicit, implicit, or session-specified collations. Add tests to validate behavior.

Why are the changes needed?

We are incrementally adding collation support to built-in string functions in Spark. These functions are intended to be supported for collated types.

Does this PR introduce any user-facing change?

these sql functions will now not throw errors when passed in collated types. Instead, they will return the right value, of the passed in type. Or of the default collation.

How was this patch tested?

Unit testing + ad-hoc spark shell and pyspark shell interactions.

Was this patch authored or co-authored using generative AI tooling?

No.

@GideonPotok

Copy link
Copy Markdown
Contributor Author

@uros-db please review

@HyukjinKwon

Copy link
Copy Markdown
Member

What's idff w/ #46039?

@GideonPotok GideonPotok requested a review from uros-db April 15, 2024 08:53
@GideonPotok

Copy link
Copy Markdown
Contributor Author

@uros-db I made all suggested changes. Please re-review. Thanks!

Comment on lines 191 to 220

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this many test cases here, if you didn't modify the way Substring/Left/Right expressions behave when given collated strings (i.e. you didn't introduce any collation awareness to nullSafeEval/doCodeGen), then there should be no need to go this deep - a couple of test cases should do the trick just fine

also, I think these tests can be combined with the one above to make:
test("Support Left/Right/Substr with collation") {

so that we could have something like:

checks.foreach { check =>
// Result & data type (explicit collation)
...
// Result & data type (implicit collation)
...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do (But without implicit collation at all, as per your other review comment.)

@GideonPotok

Copy link
Copy Markdown
Contributor Author

What's idff w/ #46039?

@HyukjinKwon Sorry for confusion. I closed out the other one to clear up which one to review.

(To answer your question, that other one did not have the unit test over a struct field, because in the past I have dealt with GHA Test flakiness over using withTable expressions. )

@GideonPotok GideonPotok force-pushed the spark_collation_47413_5 branch 2 times, most recently from 62e22c6 to f6efb32 Compare April 16, 2024 06:35
@GideonPotok GideonPotok requested a review from uros-db April 16, 2024 06:35
@GideonPotok

Copy link
Copy Markdown
Contributor Author

@uros-db please re-review!

@GideonPotok GideonPotok force-pushed the spark_collation_47413_5 branch from f6efb32 to 1d91e95 Compare April 16, 2024 10:39
@GideonPotok

Copy link
Copy Markdown
Contributor Author

@uros-db please re-review this one too.

Comment on lines 219 to 247

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

28 cases * 4 collations = 112 tests

I'd say we don't need that many SQL tests, there's no need to do Seq("utf8_binary_lcase", "utf8_binary", "unicode", "unicode_ci").flatMap, only 4 tests (with valid values) per function (substring/left/right) should be enough

a couple of additional tests for improper values are fine as well, but we don't need to test every possible pair of collation & function

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we don't have unit tests for these functions, let's make sure to use a smaller number of tests here in order to test more things - for example, I don't see any case/accent variation here, as well as a wider variety of variable len characters, etc.

@GideonPotok GideonPotok Apr 17, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uros-db done. Please re-review

@uros-db uros-db left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix tests

@GideonPotok GideonPotok requested a review from uros-db April 17, 2024 13:11
Comment on lines 218 to 531

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please follow the format of tests in this suite, queries should be constructed from these parameterized cases, otherwise they don't have a point

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uros-db I will change it accordingly. Please advise - Do you want three case classes, or one case class but with a parameter for function name? If the latter (one case class), how do you want me to handle the third parameter (len), which left and right do not have, and which is optional for substr? Maybe with an Option[String]?

Is the quantity of tests satisfactory? I got it down from 112 tests to 25 tests. Thus 13 for valid values and 12 for invalid values. I can get it down to 12 valid test cases and, say, six invalid values if you prefer.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine if there's a parameter and it's not used in some cases, I don't think that causes any error

otherwise you can always introduce LeftRightTestCase, whatever works

quantity is fine here, what's important is proper coverage without needless repetition, these e2e sql tests are pretty slow, so having hundreds of them for trivial expressions is less than ideal

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uros-db Done.

@GideonPotok GideonPotok force-pushed the spark_collation_47413_5 branch from a4d3c1e to 95be248 Compare April 18, 2024 08:42
@GideonPotok GideonPotok requested a review from uros-db April 18, 2024 09:07
@GideonPotok GideonPotok changed the title [SPARK-47413][SQL] - add support to substr/left/right for collations [Post-Refactor] [SPARK-47413][SQL] - add support to substr/left/right for collations Apr 18, 2024

@uros-db uros-db left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to avoid scalastyle guides

@GideonPotok GideonPotok requested a review from uros-db April 19, 2024 06:39
@GideonPotok

Copy link
Copy Markdown
Contributor Author

@uros-db I have made the suggested changes. please re-review.

format

format

format

format

fewer test cases now

fewer test cases now

fewer test cases now

fewer test cases now

fewer test cases now

Fewer test cases and fewer tests for left/right/substr

rename t1234 to t1234<index>

rename QTestCase to SubstringTestCase

remove redundant unit test

 unify test naming

with struct test

tests pass locally

format

test impl to make more in line with refactor. next add struct test, maybe.

left impl

right impl
@GideonPotok GideonPotok force-pushed the spark_collation_47413_5 branch from 5050cf6 to b0357b7 Compare April 21, 2024 12:46
@GideonPotok

Copy link
Copy Markdown
Contributor Author

@uros-db

@uros-db uros-db left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@cloud-fan

Copy link
Copy Markdown
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in e1432ef Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants