[SPARK-16429][SQL] Include `StringType` columns in `describe()` by dongjoon-hyun · Pull Request #14095 · apache/spark

dongjoon-hyun · 2016-07-07T20:52:29Z

What changes were proposed in this pull request?

Currently, `describe` supports `StringType` columns when they are named explicitly. However, `describe()` without arguments returns a dataset covering only the numeric columns. This PR includes `StringType` columns in the no-argument `describe()` as well.

Background

scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+

Before

scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+

After

scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
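
To make the rows in the tables above concrete, here is a minimal plain-Scala sketch of how the five summary values could be derived for one numeric column and one string column. It is a model for illustration only, not Spark's implementation: `DescribeSketch`, `describeNumeric`, `describeString`, and `sampleStddev` are hypothetical names, and nulls are modeled as `None`. It assumes the stddev row is the sample standard deviation (n - 1 denominator), which is consistent with the value shown for the two ages.

```scala
// Plain-Scala model of the summary rows; hypothetical names, not Spark's code.
object DescribeSketch {
  // Sample standard deviation (n - 1 in the denominator); undefined below two values.
  def sampleStddev(xs: Seq[Double]): Option[Double] =
    if (xs.size < 2) None
    else {
      val mean = xs.sum / xs.size
      Some(math.sqrt(xs.map(x => (x - mean) * (x - mean)).sum / (xs.size - 1)))
    }

  // Numeric column: nulls (None) are skipped and all five statistics are defined
  // as long as enough non-null values exist.
  def describeNumeric(col: Seq[Option[Double]]): Map[String, Option[String]] = {
    val xs = col.flatten
    Map(
      "count"  -> Some(xs.size.toString),
      "mean"   -> (if (xs.isEmpty) None else Some((xs.sum / xs.size).toString)),
      "stddev" -> sampleStddev(xs).map(_.toString),
      "min"    -> xs.reduceOption(_ min _).map(_.toString),
      "max"    -> xs.reduceOption(_ max _).map(_.toString)
    )
  }

  // String column: mean and stddev have no meaning (rendered as null in the table),
  // while min and max use lexicographic order.
  def describeString(col: Seq[Option[String]]): Map[String, Option[String]] = {
    val xs = col.flatten
    Map(
      "count"  -> Some(xs.size.toString),
      "mean"   -> None,
      "stddev" -> None,
      "min"    -> xs.reduceOption((a, b) => if (a.compareTo(b) <= 0) a else b),
      "max"    -> xs.reduceOption((a, b) => if (a.compareTo(b) >= 0) a else b)
    )
  }
}
```

Calling `DescribeSketch.describeNumeric(Seq(None, Some(30.0), Some(19.0)))` models the `age` column (one null, two values), and `DescribeSketch.describeString` on three names models the `name` column. The concrete third name is an assumption about the sample file; the tables only show the min and max.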

How was this patch tested?

Pass the Jenkins tests with an updated test case.

rxin · 2016-07-07T21:16:23Z

`private` rather than `private[sql]`?

That would be better.

dongjoon-hyun · 2016-07-07T21:36:59Z

Thank you for the fast review, @rxin. I updated it.

SparkQA · 2016-07-07T22:28:52Z

Test build #61929 has finished for PR 14095 at commit df2edd7.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-07-07T22:43:28Z

Oh, it's documented behavior.

Computes statistics for **numeric** columns

SparkQA · 2016-07-07T23:09:02Z

Test build #61930 has finished for PR 14095 at commit b6673cb.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-07-08T04:12:41Z

Can you fix Python?

dongjoon-hyun · 2016-07-08T05:32:52Z

Oh, sure!

rxin · 2016-07-08T05:39:05Z

And also update the documentation.

dongjoon-hyun · 2016-07-08T05:49:43Z

Of course!

dongjoon-hyun · 2016-07-08T06:01:31Z

I fixed Python/R and the docs accordingly, and tested locally.

rxin · 2016-07-08T06:06:40Z

+      .filter(f => f.dataType.isInstanceOf[NumericType] || f.dataType.isInstanceOf[StringType])
+      .map { n =>
+        queryExecution.analyzed.resolveQuoted(n.name, sparkSession.sessionState.analyzer.resolver)
+          .get


is it possible that this would fail?

Well, this is a direct extension of line 225 of the existing numericColumns.

https://github.com/apache/spark/pull/14095/files#diff-7a46f10c3cedbf013cf255564d9483cdR225

You mean the failure of resolveQuoted?

It will not fail because the names come from schema.fields.
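
The argument above can be sketched in plain Scala. The types below are hypothetical stand-ins for Spark's `DataType` hierarchy so that the snippet runs without Spark on the classpath; `DefaultColumns`, `aggregatable`, and `allResolve` are names invented for this illustration. The sketch shows the same filter as the diff (numeric or string fields only) and why a by-name lookup over the result cannot miss: every name was read from the schema being searched.

```scala
// Hypothetical stand-ins for Spark's type hierarchy, so the sketch runs without Spark.
sealed trait DataType
sealed trait NumericType extends DataType
case object IntegerType extends NumericType
case object DoubleType extends NumericType
case object StringType extends DataType
case object BooleanType extends DataType

final case class StructField(name: String, dataType: DataType)

object DefaultColumns {
  // Keep only the fields a no-argument describe() would summarize:
  // numeric fields and string fields, in schema order.
  def aggregatable(schema: Seq[StructField]): Seq[String] =
    schema.collect {
      case StructField(name, _: NumericType) => name
      case StructField(name, StringType)     => name
    }

  // Every selected name comes from the schema, so looking it up again always succeeds.
  def allResolve(schema: Seq[StructField]): Boolean = {
    val known = schema.map(_.name).toSet
    aggregatable(schema).forall(known.contains)
  }
}
```

For a schema with an integer `age`, a string `name`, and a boolean field, `aggregatable` keeps `age` and `name` and drops the boolean. Real name resolution in Spark also deals with quoting and case sensitivity, which this model ignores.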

SparkQA · 2016-07-08T08:01:09Z

Test build #61962 has finished for PR 14095 at commit 8915adb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-07-08T08:08:49Z

Hi, @rxin .
It's ready for review again.

SparkQA · 2016-07-08T08:15:16Z

Test build #61965 has finished for PR 14095 at commit fa4d3b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-07-08T21:36:32Z

Thanks - merging in master.

dongjoon-hyun · 2016-07-08T21:41:53Z

Thank you for merging, @rxin .

rxin reviewed Jul 7, 2016

dongjoon-hyun changed the title ~~[SPARK-16429][SQL] Include StringType columns in Scala/Python describe()~~ [SPARK-16429][SQL] Include StringType columns in describe() Jul 7, 2016

dongjoon-hyun added 2 commits July 7, 2016 22:41

[SPARK-16429][SQL] Include StringType columns in Scala/Python `desc…

52b8562

…ribe()`

Replace private[sql] to private.

4127918

Fix documents and Python/R API consistently.

8915adb

rxin reviewed Jul 8, 2016

Fix the doc.

fa4d3b4

asfgit closed this in 142df48 Jul 8, 2016

dongjoon-hyun deleted the SPARK-16429 branch July 20, 2016 07:43

shivaram mentioned this pull request Jul 29, 2016

[Spark-16579][SparkR] add install.spark function #14258

Closed
