Skip to content

[GLUTEN-7749][VL] Trim ISOControl characters in string for casting to integral type#7806

Merged
philo-he merged 7 commits into
apache:mainfrom
wForget:GLUTEN-7749
Nov 6, 2024
Merged

[GLUTEN-7749][VL] Trim ISOControl characters in string for casting to integral type#7806
philo-he merged 7 commits into
apache:mainfrom
wForget:GLUTEN-7749

Conversation

@wForget

@wForget wForget commented Nov 4, 2024

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Fixes: #7749, to align with Spark's change introduced by apache/spark#41535.

How was this patch tested?

added unit test

@github-actions github-actions Bot added CORE works for Gluten Core VELOX labels Nov 4, 2024
@github-actions

github-actions Bot commented Nov 4, 2024

Copy link
Copy Markdown

#7749

@github-actions

github-actions Bot commented Nov 4, 2024

Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions

github-actions Bot commented Nov 4, 2024

Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions

github-actions Bot commented Nov 4, 2024

Copy link
Copy Markdown

Run Gluten Clickhouse CI

@github-actions

github-actions Bot commented Nov 5, 2024

Copy link
Copy Markdown

Run Gluten Clickhouse CI

@philo-he philo-he self-requested a review November 5, 2024 07:08
@github-actions

github-actions Bot commented Nov 5, 2024

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

1 similar comment
@wForget

wForget commented Nov 5, 2024

Copy link
Copy Markdown
Member Author

Run Gluten Clickhouse CI on x86

@jackylee-ch jackylee-ch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically look good to me. @philo-he any more comments?

@philo-he philo-he left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wForget, thanks for your fix! Could you give me some details about where these control chars are skipped when casting to integral types in Spark? I only found white space characters are checked and skipped.

// Needs to be trimmed for casting to float/double/decimal
val trimSpaceStr = ('\u0000' to '\u0020').toList.mkString
// ISOControl characters, refer java.lang.Character.isISOControl(int)
val isoControlControlStr = (('\u0000' to '\u001F') ++ ('\u007F' to '\u009F')).toList.mkString

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isoControlControlStr->isoControlStr, a typo?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed

@wForget

wForget commented Nov 6, 2024

Copy link
Copy Markdown
Member Author

@wForget, thanks for your fix! Could you give me some details about where these control chars are skipped when casting to integral types in Spark? I only found white space characters are checked and skipped.

Introduced from #41535, the relevant calls are as follows:

https://github.com/apache/spark/blob/36410f073cb978fc504f85fb25b4942dac10db3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L785

https://github.com/apache/spark/blob/36410f073cb978fc504f85fb25b4942dac10db3f/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L1617

@github-actions

github-actions Bot commented Nov 6, 2024

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@philo-he

philo-he commented Nov 6, 2024

Copy link
Copy Markdown
Member

@wForget, I see. What I checked is Spark-3.3.1, which doesn't cover that change.

Let's note this is not applicable to all supported Spark versions. But I think it may be acceptable to end users.

@philo-he philo-he merged commit 3099799 into apache:main Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[VL][1.2][Result mismatch] Cast string to integral type does not ignore ISO control characters

3 participants