Skip to content

[SPARK-18653][SQL] Fix incorrect space padding for unicode character at Dataset.show#16086

Closed
kiszk wants to merge 5 commits into
apache:masterfrom
kiszk:SPARK-18653
Closed

[SPARK-18653][SQL] Fix incorrect space padding for unicode character at Dataset.show#16086
kiszk wants to merge 5 commits into
apache:masterfrom
kiszk:SPARK-18653

Conversation

@kiszk

@kiszk kiszk commented Nov 30, 2016

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

This PR put correct space padding for unicode character at Dataset.show().
The reason of putting incorrect padding is to count string width by string.length. This PR counds string width by using East Asian Width.

Example program

case class UnicodeCaseClass(整数: Int, 実数: Double, s: String)
Seq(UnicodeCaseClass(1, 1.1, "文字列1")).toDS.show

Output without this PR

+---+---+----+
| 整数| 実数|   s|
+---+---+----+
|  1|1.1|文字列1|
+---+---+----+

Output with this PR

+----+----+-------+
|整数|実数|      s|
+----+----+-------+
|   1| 1.1|文字列1|
+----+----+-------+

How was this patch tested?

Add a test suite

@SparkQA

SparkQA commented Nov 30, 2016

Copy link
Copy Markdown

Test build #69424 has finished for PR 16086 at commit 93379b6.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 30, 2016

Copy link
Copy Markdown

Test build #69425 has finished for PR 16086 at commit e4555f7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2016

Copy link
Copy Markdown

Test build #69443 has finished for PR 16086 at commit 4e71dc6.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2016

Copy link
Copy Markdown

Test build #69446 has finished for PR 16086 at commit 350e1ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk

kiszk commented Dec 1, 2016

Copy link
Copy Markdown
Member Author

@gatorsmile would it be possible to review this? You would be familiar with Kanji?

Comment thread sql/core/pom.xml Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against it, but I'm a little hesitant to bring in all this weight to fix a basically cosmetic problem. This may already be included transitively though. WOrth checking the a) license of this library and b) whether it's already in use in the transitive dependencies?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.
a) I think there is no limitation in the licence
b) I cannot find this jar in the current transitive dependency

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, yes that's a cat-A license so it's OK. With any dependency I'd also want to check whether it brings in anything else under a different license or whether it's particularly large, etc.

Disregard my other comment. I think I was thinking of the fact that Lucene already uses this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ICU is widely used, as shown in http://site.icu-project.org . A very useful package. Before, we used it for codepage conversion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why var?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a ";" at the end of this line.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ident looks weird.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, updated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we replace StringUtils.leftPad/rightPad with repeatPadding?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringUtils.leftPad/rightPad uses String.length. Since this usage causes the same problem, the new code does not use these methods.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh. Got it.

For this purpose, current repeatPadding looks verbose. If you just want to create exact number of spaces, you can use " " * n.

@SparkQA

SparkQA commented Dec 1, 2016

Copy link
Copy Markdown

Test build #69483 has finished for PR 16086 at commit 60aa2cb.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2016

Copy link
Copy Markdown

Test build #69487 has finished for PR 16086 at commit e828ad9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 1, 2016

Copy link
Copy Markdown

Test build #69486 has finished for PR 16086 at commit 7049809.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (locale == null) {
throw new NullPointerException("locale is null")
}
val ambiguousLen = if (EAST_ASIAN_LANGS.contains(locale.getLanguage())) 2 else 1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating a separate helper function for the default width?

@kiszk kiszk Dec 2, 2016

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can create the separate helper for the default width. A challenge is how we can decide the helper can be applied when we have got a string.
While I have been thinking about these conditions, I have not answers yet.

}
}

val EAST_ASIAN_LANGS = Seq("ja", "vi", "kr", "zh")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only these four?

val value = UCharacter.getIntPropertyValue(codePoint, UProperty.EAST_ASIAN_WIDTH)
len = len + (value match {
case UCharacter.EastAsianWidth.NARROW | UCharacter.EastAsianWidth.NEUTRAL |
UCharacter.EastAsianWidth.HALFWIDTH => 1

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An indent issue.

@gatorsmile

Copy link
Copy Markdown
Member

Now, showString become more complex for handling the Unicode width. I am neutral on this fix.

@viirya

viirya commented Dec 2, 2016

Copy link
Copy Markdown
Member

yeah, I would think this is a relatively rare use case. Need to consider if it is worth extra complexity.

@rxin

rxin commented Dec 2, 2016

Copy link
Copy Markdown
Contributor

I agree - don't think this is worth the complexity.

@srowen

srowen commented Dec 11, 2016

Copy link
Copy Markdown
Member

Unless someone vigorously objects, yes let's close this.

@kiszk

kiszk commented Dec 12, 2016

Copy link
Copy Markdown
Member Author

I am thinking about an simpler approach. However, it is fine to close for now.

srowen added a commit to srowen/spark that referenced this pull request Jan 1, 2017
@srowen srowen mentioned this pull request Jan 1, 2017
@asfgit asfgit closed this in ba48812 Jan 2, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#12968
Closes apache#16215
Closes apache#16212
Closes apache#16086
Closes apache#15713
Closes apache#16413
Closes apache#16396

Author: Sean Owen <sowen@cloudera.com>

Closes apache#16447 from srowen/CloseStalePRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants