[SPARK-18653][SQL] Fix incorrect space padding for unicode character at Dataset.show #16086
kiszk wants to merge 5 commits into
Test build #69424 has finished for PR 16086 at commit
Test build #69425 has finished for PR 16086 at commit
Test build #69443 has finished for PR 16086 at commit
Test build #69446 has finished for PR 16086 at commit
@gatorsmile would it be possible to review this? You would be familiar with Kanji?
I'm not against it, but I'm a little hesitant to bring in all this weight to fix a basically cosmetic problem. This may already be included transitively though. Worth checking a) the license of this library and b) whether it's already among the transitive dependencies.
Sure.
a) I think the license imposes no limitation.
b) I cannot find this jar among the current transitive dependencies.
OK, yes that's a cat-A license so it's OK. With any dependency I'd also want to check whether it brings in anything else under a different license or whether it's particularly large, etc.
Disregard my other comment. I think I was thinking of the fact that Lucene already uses this.
ICU is widely used, as shown at http://site.icu-project.org. It is a very useful package; we have previously used it for codepage conversion.
There is a ";" at the end of this line.
Any reason we replace StringUtils.leftPad/rightPad with repeatPadding?
StringUtils.leftPad/rightPad use String.length internally, which causes the same problem, so the new code does not use these methods.
Oh, got it.
For this purpose, the current repeatPadding looks verbose. If you just want to create an exact number of spaces, you can use " " * n.
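The suggestion above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `PaddingSketch`, `rightPad`, `leftPad`, and the `widthOf` parameter are hypothetical names, and `widthOf` stands for whatever function reports how many terminal columns a string occupies.

```scala
object PaddingSketch {
  // `widthOf` is a hypothetical parameter: any function that returns the number
  // of terminal columns a string occupies (String.length is the naive choice).

  // Append spaces until the string occupies `total` columns.
  def rightPad(s: String, total: Int, widthOf: String => Int): String =
    s + " " * math.max(0, total - widthOf(s))

  // Prepend spaces until the string occupies `total` columns.
  def leftPad(s: String, total: Int, widthOf: String => Int): String =
    " " * math.max(0, total - widthOf(s)) + s
}
```

Because the number of spaces is derived from `widthOf` rather than from `String.length`, swapping in a width function that counts wide characters as two columns changes the padding without touching the padding code itself.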
Test build #69483 has finished for PR 16086 at commit
Test build #69487 has finished for PR 16086 at commit
Test build #69486 has finished for PR 16086 at commit
    if (locale == null) {
      throw new NullPointerException("locale is null")
    }
    val ambiguousLen = if (EAST_ASIAN_LANGS.contains(locale.getLanguage())) 2 else 1
How about creating a separate helper function for the default width?
I can create a separate helper for the default width. The challenge is deciding when the helper applies, given only a string.
I have been thinking about these conditions, but I do not have an answer yet.
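One possible shape for such a helper is sketched below. It is an assumption-laden illustration rather than the PR's code: `AmbiguousWidthSketch` and `ambiguousWidth` are hypothetical names, and the language set uses "ko" because that is the ISO 639-1 code `Locale.getLanguage` returns for Korean, which differs from the "kr" entry in the quoted snippet.

```scala
import java.util.{Locale, Objects}

object AmbiguousWidthSketch {
  // Assumed language set. "ko" (not "kr") is what Locale.getLanguage
  // returns for Korean locales.
  private val EastAsianLangs = Set("ja", "ko", "vi", "zh")

  // Width, in columns, to assume for characters whose East Asian Width
  // is "ambiguous": 2 for East Asian locales, 1 otherwise.
  def ambiguousWidth(locale: Locale): Int = {
    // Throws NullPointerException with this message, like the quoted snippet.
    Objects.requireNonNull(locale, "locale is null")
    if (EastAsianLangs.contains(locale.getLanguage)) 2 else 1
  }
}
```

This only answers the locale half of the question; deciding from the string alone whether the helper should apply remains open, as noted above.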
    }
    }

    val EAST_ASIAN_LANGS = Seq("ja", "vi", "kr", "zh")

    val value = UCharacter.getIntPropertyValue(codePoint, UProperty.EAST_ASIAN_WIDTH)
    len = len + (value match {
      case UCharacter.EastAsianWidth.NARROW | UCharacter.EastAsianWidth.NEUTRAL |
           UCharacter.EastAsianWidth.HALFWIDTH => 1
Now,
Yeah, I would think this is a relatively rare use case. We need to consider whether it is worth the extra complexity.
I agree; I don't think this is worth the complexity.
Unless someone vigorously objects, yes, let's close this.
I am thinking about a simpler approach. However, it is fine to close for now.
Closes apache#12968
Closes apache#16215
Closes apache#16212
Closes apache#16086
Closes apache#15713
Closes apache#16413
Closes apache#16396

Author: Sean Owen <sowen@cloudera.com>

Closes apache#16447 from srowen/CloseStalePRs.
What changes were proposed in this pull request?
This PR applies correct space padding for Unicode characters in Dataset.show(). The padding was incorrect because string width was computed with string.length. This PR computes string width using the East Asian Width property instead.
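To illustrate why `string.length` miscounts, here is a minimal sketch that uses only the JDK. It is not the PR's implementation, which relies on ICU's East Asian Width property; `DisplayWidthSketch`, `approxWidth`, and `displayWidth` are hypothetical names, and the script-based rule is a rough approximation that ignores halfwidth Katakana, fullwidth Latin, and ambiguous-width characters.

```scala
import java.lang.Character.UnicodeScript

object DisplayWidthSketch {
  // Hypothetical approximation: treat the main CJK scripts as two columns wide.
  // The real East Asian Width property is finer-grained than this.
  def approxWidth(codePoint: Int): Int = UnicodeScript.of(codePoint) match {
    case UnicodeScript.HAN | UnicodeScript.HIRAGANA |
         UnicodeScript.KATAKANA | UnicodeScript.HANGUL => 2
    case _ => 1
  }

  // Sum per-code-point widths instead of counting UTF-16 code units.
  def displayWidth(s: String): Int =
    s.codePoints().toArray.map(approxWidth).sum
}
```

Under these assumptions, a two-character Kanji string has `length` 2 but a display width of 4, so padding computed from `length` comes out two spaces too long and the column border shifts to the right.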
Example program
Output without this PR
Output with this PR
How was this patch tested?
Added a test suite.