[SPARK-47413][SQL] - add support to substr/left/right for collations#46040
[SPARK-47413][SQL] - add support to substr/left/right for collations#46040GideonPotok wants to merge 1 commit into
Conversation
|
@uros-db please review |
|
What's idff w/ #46039? |
|
@uros-db I made all suggested changes. Please re-review. Thanks! |
There was a problem hiding this comment.
I don't think we need this many test cases here, if you didn't modify the way Substring/Left/Right expressions behave when given collated strings (i.e. you didn't introduce any collation awareness to nullSafeEval/doCodeGen), then there should be no need to go this deep - a couple of test cases should do the trick just fine
also, I think these tests can be combined with the one above to make:
test("Support Left/Right/Substr with collation") {
so that we could have something like:
checks.foreach { check =>
// Result & data type (explicit collation)
...
// Result & data type (implicit collation)
...
There was a problem hiding this comment.
Will do (But without implicit collation at all, as per your other review comment.)
@HyukjinKwon Sorry for confusion. I closed out the other one to clear up which one to review. (To answer your question, that other one did not have the unit test over a struct field, because in the past I have dealt with GHA Test flakiness over using withTable expressions. ) |
62e22c6 to
f6efb32
Compare
|
@uros-db please re-review! |
f6efb32 to
1d91e95
Compare
|
@uros-db please re-review this one too. |
There was a problem hiding this comment.
28 cases * 4 collations = 112 tests
I'd say we don't need that many SQL tests, there's no need to do Seq("utf8_binary_lcase", "utf8_binary", "unicode", "unicode_ci").flatMap, only 4 tests (with valid values) per function (substring/left/right) should be enough
a couple of additional tests for improper values are fine as well, but we don't need to test every possible pair of collation & function
There was a problem hiding this comment.
since we don't have unit tests for these functions, let's make sure to use a smaller number of tests here in order to test more things - for example, I don't see any case/accent variation here, as well as a wider variety of variable len characters, etc.
There was a problem hiding this comment.
please follow the format of tests in this suite, queries should be constructed from these parameterized cases, otherwise they don't have a point
There was a problem hiding this comment.
@uros-db I will change it accordingly. Please advise - Do you want three case classes, or one case class but with a parameter for function name? If the latter (one case class), how do you want me to handle the third parameter (len), which left and right do not have, and which is optional for substr? Maybe with an Option[String]?
Is the quantity of tests satisfactory? I got it down from 112 tests to 25 tests. Thus 13 for valid values and 12 for invalid values. I can get it down to 12 valid test cases and, say, six invalid values if you prefer.
There was a problem hiding this comment.
It's fine if there's a parameter and it's not used in some cases, I don't think that causes any error
otherwise you can always introduce LeftRightTestCase, whatever works
quantity is fine here, what's important is proper coverage without needless repetition, these e2e sql tests are pretty slow, so having hundreds of them for trivial expressions is less than ideal
a4d3c1e to
95be248
Compare
uros-db
left a comment
There was a problem hiding this comment.
no need to avoid scalastyle guides
|
@uros-db I have made the suggested changes. please re-review. |
format format format format fewer test cases now fewer test cases now fewer test cases now fewer test cases now fewer test cases now Fewer test cases and fewer tests for left/right/substr rename t1234 to t1234<index> rename QTestCase to SubstringTestCase remove redundant unit test unify test naming with struct test tests pass locally format test impl to make more in line with refactor. next add struct test, maybe. left impl right impl
5050cf6 to
b0357b7
Compare
|
thanks, merging to master! |
https://issues.apache.org/jira/browse/SPARK-46830
What changes were proposed in this pull request?
Add collation support to types of return values for calls to substr, left, right, when passed in arguments of an explicit, implicit, or session-specified collations. Add tests to validate behavior.
Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark. These functions are intended to be supported for collated types.
Does this PR introduce any user-facing change?
these sql functions will now not throw errors when passed in collated types. Instead, they will return the right value, of the passed in type. Or of the default collation.
How was this patch tested?
Unit testing + ad-hoc spark shell and pyspark shell interactions.
Was this patch authored or co-authored using generative AI tooling?
No.