[SPARK-47413][SQL] - add support to substr/left/right for collations #45738
[SPARK-47413][SQL] - add support to substr/left/right for collations #45738GideonPotok wants to merge 7 commits into
Conversation
|
I have some tests, and have substring implementation basically passing tests. I think it is caused by additional work needed on the overridden implementation of |
|
since the "COMPLEX_EXPRESSION_UNSUPPORTED_INPUT.MISMATCHED_TYPES" problem seems to be in |
|
@uros-db I think that did the trick! I will let you know when to review again. Thanks! |
|
@uros-db what do you think of it? Anything else you want me to test for? I also did some ad-hoc testing of these functions in spark shell and pyspark shell, all looks good.. |
There was a problem hiding this comment.
try covering some more edge cases in these tests, such as: empty strings, uppercase and lowercase mix, different byte-length chars, etc.
good example would be this PR: #45749
There was a problem hiding this comment.
@uros-db Please see what I have since added and let me know whether what to keep going, or whether we have enough!
There was a problem hiding this comment.
I think this will do it, thanks! Now just move over all tests to CollationStringExpressionsSuite, re-run any failing CI checks, and modify the PR title so we can open this up for review to Spark committers
There was a problem hiding this comment.
Sorry @uros-db, I didn't realize you wanted me to add the additional edge cases to this suite specifically. Are you sure you want me to move all the new tests to CollationStringExpressionsSuite?
There was a problem hiding this comment.
yeah, let's put all these tests that are specific to string functions into the separate suite CollationStringExpressionsSuite
this way, CollationSuite won't get cluttered as we continue adding more and more support for various string expressions
9a6f2b9 to
0108768
Compare
a065adf to
63385ed
Compare
9b49d7d to
5270c0a
Compare
67ff2da to
f25e8a9
Compare
|
@uros-db this would be ready for review but spark sql unit tests repeatedly fail for CSVLegacyTimeParserSuite, which I did not modify. I have reran it a total of four times. Any idea how to get this check to pass successfully? |
|
did you try to run this suite locally and investigate any potential issues? |
@uros-db I would love to. I need to fix my setup though, first -- there is an issue I have been putting off where any unit tests (in the spark codebase) that make use of the local file system seems to hang indefinitely when run locally: any tests that use I usually run from sbt directly. I can provide additional setup details, eg my jvmopts and sbt jvm opts. I have also done some cursory JVM performance profiling and can provide some of those measurements. I can also let you know what I have tried (The thing I was most optimistic about was Let me know what to look into first. Thanks so much. |
|
I don't think |
Same issue running tests locally with maven. I always build with maven and when running tests with it (Eg I agree that my changes are probably not what is causing |
|
@uros-db I got the file-writing tests to work locally when I simply More importantly, All GHA are passing. So this is ready for reviewers to begin to review. |
|
@uros-db this is ready for review. |
|
@GideonPotok nice work, thanks! Heads up though: we will soon be finishing some code refactoring related to collation-aware string expression support (SPARK-47410), and will likely need to rewrite this PRs a bit in order to comply with some new design before proceeding to final reivew let's put this PR on hold for now, and I'll ping you when we're ready to move on |
|
@uros-db No problem at all. if I understand your refactor correctly, my changes will basically either stay in the same place or move to the new I will in the meantime take care of implementing https://issues.apache.org/jira/browse/SPARK-47412 and will also then put that on hold until after the refactor is merged. If that sounds good? |
|
PS: Do you think changes, such as these, which are only to implementations of Thank you, kind sir :) |
|
@GideonPotok You are correct, this refactor should not greatly affect your current PR in particular - I expect you'll only need to refactor testing a bit (shouldn't be too much work) Feel free to move over to the next ticket, and I'll let you know when and how to consolidate your code/design so we can move to final review and sign off on this PR |
uros-db
left a comment
There was a problem hiding this comment.
we’ve done some major code restructuring in #45978, so please sync these changes before moving on
@GideonPotok you’ll likely need to rewrite the code in this PR a bit, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410
https://issues.apache.org/jira/browse/SPARK-46830
What changes were proposed in this pull request?
Add collation support to types of return values for calls to substr, left, right, when passed in arguments of an explicit, implicit, or session-specified collations. Add tests to validate behavior.
Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark. These functions are intended to be supported for collated types.
Does this PR introduce any user-facing change?
these sql functions will now not throw errors when passed in collated types. Instead, they will return the right value, of the passed in type. Or of the default collation.
How was this patch tested?
Unit testing + ad-hoc spark shell and pyspark shell interactions.
Was this patch authored or co-authored using generative AI tooling?
No.