[SPARK-17699] Support for parsing JSON string columns by marmbrus · Pull Request #15274 · apache/spark

marmbrus · 2016-09-28T02:56:22Z

Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new function, from_json, that converts a string column into a nested StructType with a user-specified schema.

Example usage:

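import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", IntegerType)

df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]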

This PR adds support for Java, Scala and Python. I leveraged our existing JSON parsing support by moving it into catalyst (so that we could define expressions using it). I left SQL out for now, because I'm not sure how users would specify a schema.
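
To illustrate the "one column amongst others" case from the description, here is a minimal sketch. The `key`/`value` column names, the sample rows, and the local session setup are assumptions made for illustration; they are not taken from the PR.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("from_json-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input: a key column next to a JSON payload column,
// loosely resembling records read from Kafka.
val records = Seq(
  ("k1", """{"a": 1, "b": "x"}"""),
  ("k2", """{"a": 2, "b": "y"}""")
).toDF("key", "value")

// The user-specified schema that the JSON strings are parsed against.
val schema = new StructType().add("a", IntegerType).add("b", StringType)

// Parse the string column into a struct column, keeping the other column.
val parsed = records.select($"key", from_json($"value", schema).as("json"))

// Fields of the resulting struct can then be addressed by name.
val flat = parsed.select($"key", $"json.a", $"json.b")
```

Once the payload is a struct, its fields behave like any other nested column, so they can be selected, filtered, or aggregated without further string handling.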

SparkQA · 2016-09-28T02:59:21Z

Test build #66016 has finished for PR 15274 at commit 62f56a7.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2016-09-28T03:07:52Z

+/**
+ * Converts an json input string to a [[StructType]] with the specified schema.
+ */
+case class JsonToStruct(schema: StructType, options: Map[String, String], child: Expression)


Should this implement ExpectsInputTypes?

Ah, yes, it definitely should. Let me update.

rxin · 2016-09-28T07:26:15Z

Might want to send a dev list email to solicit feedback on the API?

marmbrus · 2016-09-28T19:18:25Z

Emailed the list. Seems like a popular feature so far :)

SparkQA · 2016-09-28T19:29:13Z

Test build #66048 has finished for PR 15274 at commit 983def2.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-09-28T21:50:03Z

Test build #66052 has finished for PR 15274 at commit 360b97b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-09-29T01:47:45Z

@marmbrus I just wonder if adding to_json makes sense (although maybe it should be done in another PR). Just curious. I am imagining the case of writing out DataFrames through data sources that do not support nested structured types.

yhuai · 2016-09-29T19:18:36Z

LGTM. Merging to master.

marmbrus · 2016-09-29T20:09:24Z

@HyukjinKwon absolutely. I actually changed the name from json_parser to from_json in anticipation of adding to_json :)
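
For readers wondering what the inverse direction could look like: to_json is not part of this PR. The sketch below assumes a later Spark release in which `to_json` is available in `org.apache.spark.sql.functions`; the sample data and session setup are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

// Local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("to_json-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical DataFrame with two flat columns.
val df = Seq((1, "x")).toDF("a", "b")

// Pack the columns into a struct and serialize it to one JSON string column,
// which a sink without nested-type support could then write out.
val json = df.select(to_json(struct($"a", $"b")).as("value"))
```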

DanielMe · 2016-10-17T10:12:01Z

@marmbrus: Is there any workaround I can use to achieve a similar effect in 1.6?

yhuai · 2016-10-17T16:46:29Z

@DanielMe The best options for 1.6 are get_json_object and json_tuple (their docs can be found at https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.functions$).

DanielMe · 2016-10-18T08:58:40Z

@yhuai thanks! My impression was that get_json_object does not convert json arrays to ArrayTypes, maybe I misunderstood the way it's supposed to be used though.

yhuai · 2016-10-18T21:17:09Z

@DanielMe oh, I see. get_json_object does not parse JSON arrays. You need a UDF to do that in Spark 1.6.
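
A rough sketch of the workaround discussed above, under several assumptions: the sample document and the `parseIntArray` helper are hypothetical, the UDF relies on Jackson being on the classpath, and the session setup uses SparkSession for brevity (it does not exist in 1.6, where the DataFrame and implicits would come from a SQLContext instead). The functions get_json_object, json_tuple and udf are the ones named in the thread.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{get_json_object, json_tuple, udf}

// Local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("json-workaround-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical input document containing a JSON array.
val df = Seq("""{"id": 7, "xs": [1, 2, 3]}""").toDF("value")

// get_json_object extracts by path, but the array comes back as a JSON string.
val asString = df.select(get_json_object($"value", "$.xs").as("xs"))

// json_tuple extracts several top-level fields at once, also as strings.
val tupled = df.select(json_tuple($"value", "id", "xs"))

// Hypothetical UDF that turns a JSON array string into a Seq[Int],
// which Spark represents as an array-of-integers column.
val parseIntArray = udf { (json: String) =>
  if (json == null) Seq.empty[Int] // treat missing input as an empty array
  else {
    // A new ObjectMapper per call keeps the closure trivially serializable;
    // a real job would likely reuse one per partition.
    val it = new ObjectMapper().readTree(json).elements()
    val out = scala.collection.mutable.ArrayBuffer.empty[Int]
    while (it.hasNext) out += it.next().asInt()
    out.toSeq
  }
}

val arrays = asString.select(parseIntArray($"xs").as("xs"))
```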

gatorsmile · 2017-01-30T07:47:04Z

Actually, to specify the schema in SQL, maybe we can use a JSON string. It is a little bit odd, though. So far, nobody is asking for it, I guess. Let us see whether users need it in SQL.

Sazpaimon · 2017-03-04T18:58:26Z

@gatorsmile Alternatively, one can do what brickhouse's from_json Hive UDF does ( https://gist.github.com/jeromebanks/8855408#file-gistfile1-sql )

(For the record, I actually need this in SQL)

gatorsmile · 2017-03-04T19:16:04Z

Based on @marmbrus's comment in a JIRA, we prefer using our DDL format. For example, as we did for CREATE TABLE, we can specify the schema as: a int, b string
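
A sketch of how the DDL-format idea might look in practice. This PR does not add SQL support; the example assumes a later Spark release in which from_json is callable from SQL with a DDL-formatted schema string and in which `StructType.fromDDL` exists. The sample JSON and session setup are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Local session for illustration only.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("ddl-schema-sketch")
  .getOrCreate()
import spark.implicits._

// The same column list one would write in CREATE TABLE, parsed into a StructType.
val schema = StructType.fromDDL("a int, b string")

// Assumed SQL form: the schema is passed as a DDL string literal.
val parsed = spark.sql(
  """SELECT from_json('{"a": 1, "b": "x"}', 'a int, b string') AS json"""
)
val a = parsed.select("json.a").as[Int].head()
```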

[SPARK-17699] Support for parsing JSON string columns

62f56a7

marmbrus mentioned this pull request Sep 28, 2016

[SPARK-17346][SQL] Add Kafka source for Structured Streaming #15102

Closed

hvanhovell reviewed Sep 28, 2016

address comments

983def2

style

360b97b

asfgit closed this in fe33121 Sep 29, 2016

maropu mentioned this pull request Oct 11, 2017

[SPARK-22245][SQL] partitioned data set should always put partition columns at the end #19471

Closed
