Skip to content

[SPARK-32559][SQL] Fix the trim logic did't handle ASCII control characters correctly#41535

Closed
beryllw wants to merge 3 commits into
apache:masterfrom
beryllw:trim_bugfix
Closed

[SPARK-32559][SQL] Fix the trim logic did't handle ASCII control characters correctly#41535
beryllw wants to merge 3 commits into
apache:masterfrom
beryllw:trim_bugfix

Conversation

@beryllw

@beryllw beryllw commented Jun 9, 2023

Copy link
Copy Markdown

What changes were proposed in this pull request?

The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
image
And hive
image

Why are the changes needed?

The behavior described above doesn't consistent with the behavior of Hive

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

add ut

@github-actions github-actions Bot added the SQL label Jun 9, 2023
@beryllw

beryllw commented Jun 9, 2023

Copy link
Copy Markdown
Author

@cloud-fan @yaooqinn @gengliangwang @WangGuangxin Could you please help review this?

@beryllw beryllw changed the title Fix the trim logic did't handle ASCII control characters correctly [SPARK-32559]Fix the trim logic did't handle ASCII control characters correctly Jun 9, 2023
Comment thread common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java Outdated
@srowen

srowen commented Jun 9, 2023

Copy link
Copy Markdown
Member

I'm not sure this is correct. SQL TRIM from other DBs seems to default to removing whitespace or the space char, not control codes.

@beryllw

beryllw commented Jun 10, 2023

Copy link
Copy Markdown
Author

I'm not sure this is correct. SQL TRIM from other DBs seems to default to removing whitespace or the space char, not control codes.

I conducted tests on both MySQL and Hive, and found that Hive supports trimming control characters and spaces but does not support trimming characters within the range 0x7F to 0x9F. MySQL, on the other hand, supports trimming specific control characters.
image

@yaooqinn yaooqinn changed the title [SPARK-32559]Fix the trim logic did't handle ASCII control characters correctly [SPARK-32559][SQL] Fix the trim logic did't handle ASCII control characters correctly Jun 12, 2023

@yaooqinn yaooqinn left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Spark is a general engine supporting various kinds of data formats and sources, IMO, it's necessary to trim controls that are very likely to be involved during data exchange.

So, LGTM.

…/DateTimeUtilsSuite.scala

Co-authored-by: Kent Yao <yao@apache.org>
@yaooqinn yaooqinn closed this in 80588e4 Jun 13, 2023
yaooqinn pushed a commit that referenced this pull request Jun 13, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes #41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Jun 13, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in #29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes #41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
@yaooqinn

Copy link
Copy Markdown
Member

Thanks, merged to master and 3.4/3.3

@beryllw beryllw deleted the trim_bugfix branch June 14, 2023 02:06
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @Kwafoor , @yaooqinn and all.
SPARK-32559 is delivered as Apache Spark 3.0.1 patch.
We need a new JIRA issue because this will land at Apache Spark 3.5.0/3.4.1/3.3.3 newly.

@yaooqinn

yaooqinn commented Jun 20, 2023

Copy link
Copy Markdown
Member

@dongjoon-hyun Oops, My bad. For the current status, is it OK to link this pull request to a new jira?

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
@beryllw

beryllw commented Jul 12, 2023

Copy link
Copy Markdown
Author

Hi, @Kwafoor , @yaooqinn and all. SPARK-32559 is delivered as Apache Spark 3.0.1 patch. We need a new JIRA issue because this will land at Apache Spark 3.5.0/3.4.1/3.3.3 newly.

My mistake, should I raise a new issue to link this pr?

@yaooqinn

Copy link
Copy Markdown
Member

@Kwafoor Yes

@beryllw

beryllw commented Jul 12, 2023

Copy link
Copy Markdown
Author

@Kwafoor Yes

I created an issue https://issues.apache.org/jira/browse/SPARK-44383, but how to link this pull request to a new jira?

I saw that spark 3.4.1 was released, with this pr and the wrong issue name SPARK-32559.I’m very sorry about this.

image

@yaooqinn

Copy link
Copy Markdown
Member

I have manually resolved and re-targeted the links.

We probably need to push a new PR to the spark-website repo to fix the release note @Kwafoor

@beryllw

beryllw commented Jul 13, 2023

Copy link
Copy Markdown
Author

I have manually resolved and re-targeted the links.

We probably need to push a new PR to the spark-website repo to fix the release note @Kwafoor

I created PR-466 to the spark-website repo to fix the release note.

@yaooqinn

Copy link
Copy Markdown
Member

thank you @Kwafoor

GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…acters correctly

### What changes were proposed in this pull request?
The trim logic in Cast expression introduced in apache#29375 trim ASCII control characters unexpectly.

Before this patch
![image](https://github.com/apache/spark/assets/25627922/ca6a2fb1-2143-4264-84d1-70b6bb755ec7)
And hive
![image](https://github.com/apache/spark/assets/25627922/017aaa4a-133e-4396-9694-79f03f027bbe)

### Why are the changes needed?
The behavior described above doesn't consistent with the behavior of Hive

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
add ut

Closes apache#41535 from Kwafoor/trim_bugfix.

Lead-authored-by: wangjunbo <wangjunbo@qiyi.com>
Co-authored-by: Junbo wang <1042815068@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 80588e4)
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants