
[SPARK-17699] Support for parsing JSON string columns #15274

Closed
marmbrus wants to merge 3 commits into apache:master from marmbrus:jsonParser

Conversation

@marmbrus

Contributor

Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column among others. This is particularly true when reading from sources such as Kafka. This PR adds a new function, from_json, that converts a string column into a nested StructType with a user-specified schema.

Example usage:

val df = Seq("""{"a": 1}""").toDS()
val schema = new StructType().add("a", IntegerType)

df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]

This PR adds support for Java, Scala, and Python. I leveraged our existing JSON parsing support by moving it into catalyst (so that we could define expressions using it). I left SQL out for now, because I'm not sure how users would specify a schema.
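
For readers who want to run the example above end to end, here is a minimal sketch. It assumes Spark 2.1 or later (the first release line that contains from_json) and a local master. The object name, the extra id column, and the app name are illustrative and not part of the PR.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

// Hypothetical wrapper object, used only to make the snippet self-contained.
object FromJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from_json-example")
      .getOrCreate()
    import spark.implicits._

    // A JSON string column sitting next to an ordinary column,
    // similar to what a Kafka source would produce.
    val df = Seq((1L, """{"a": 1}""")).toDF("id", "value")
    val schema = new StructType().add("a", IntegerType)

    // Parse the string column into a struct column named "json".
    val parsed = df.select($"id", from_json($"value", schema).as("json"))
    parsed.printSchema()

    // Nested fields can then be addressed with dot notation.
    parsed.select($"id", $"json.a").show()

    spark.stop()
  }
}
```

How malformed input is reported depends on the Spark version and parser options, so the sketch only feeds well-formed JSON.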

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66016 has finished for PR 15274 at commit 62f56a7.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
 * Converts a JSON input string to a [[StructType]] with the specified schema.
 */
case class JsonToStruct(schema: StructType, options: Map[String, String], child: Expression)

@hvanhovell hvanhovell Sep 28, 2016

Contributor

Should this implement ExpectsInputTypes?

Contributor Author

Ah, yes, it definitely should. Let me update.

@rxin

rxin commented Sep 28, 2016

Contributor

Might want to send a dev list email to solicit feedback on the API?

@marmbrus

Contributor Author

Emailed the list. Seems like a popular feature so far :)

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66048 has finished for PR 15274 at commit 983def2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66052 has finished for PR 15274 at commit 360b97b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Member

@marmbrus I wonder whether adding to_json makes sense (although maybe it should be done in another PR). Just curious. I am imagining the case of writing out DataFrames to data sources that do not support nested structured types.

@yhuai

yhuai commented Sep 29, 2016

Contributor

LGTM. Merging to master.

@asfgit asfgit closed this in fe33121 Sep 29, 2016
@marmbrus

Contributor Author

@HyukjinKwon absolutely. I actually changed the name from json_parser to from_json in anticipation of adding to_json :)
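
A to_json function was added to Spark in a later change. As a rough illustration of the use case raised above (flattening nested data for sinks that cannot store structs), here is a minimal sketch. It assumes a Spark release that ships to_json (2.1 or later, to the best of my knowledge); the object and column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_json}

// Hypothetical wrapper object, used only to make the snippet self-contained.
object ToJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("to_json-example")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "x")).toDF("a", "b")

    // Pack the columns into a struct, then serialize the struct
    // into a single JSON string column.
    val flat = df.select(to_json(struct($"a", $"b")).as("value"))
    flat.printSchema()
    flat.show(false)

    spark.stop()
  }
}
```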

@DanielMe

DanielMe commented Oct 17, 2016

@marmbrus: Is there any workaround I can use to achieve a similar effect in 1.6?

@yhuai

yhuai commented Oct 17, 2016

Contributor

@DanielMe The best options for 1.6 are get_json_object and json_tuple (their docs can be found at https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.sql.functions$).
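
A minimal sketch of the two functions mentioned above, written in the Spark 1.6 style with SQLContext. The sample data and the object, app, and column names are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{get_json_object, json_tuple}

// Hypothetical wrapper object, used only to make the snippet self-contained.
object JsonFunctions16 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("json-functions-1.6")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq((1, """{"a": 1, "b": "x"}""")).toDF("id", "value")

    // get_json_object extracts a single value by JSON path; the result is a string column.
    df.select($"id", get_json_object($"value", "$.a").as("a")).show()

    // json_tuple extracts several top-level fields in one pass; each result is a string column.
    df.select(json_tuple($"value", "a", "b")).show()

    sc.stop()
  }
}
```

Both functions return strings, so any numeric fields still need an explicit cast afterwards.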

@DanielMe

@yhuai thanks! My impression was that get_json_object does not convert json arrays to ArrayTypes, maybe I misunderstood the way it's supposed to be used though.

@yhuai

yhuai commented Oct 18, 2016

Contributor

@DanielMe oh, I see. get_json_object does not parse JSON arrays. You need a UDF to do that in Spark 1.6.
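
One possible shape for such a UDF is sketched below. It is not taken from the PR. It assumes the column holds a JSON array of integers and that Jackson's jackson-databind (which Spark itself depends on) is available on the classpath. The names JsonArrayUdf, parseIntArray, and parseIntArrayUdf are hypothetical.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.functions.udf

import scala.collection.JavaConverters._
import scala.util.Try

// Hypothetical helper object; not part of Spark.
object JsonArrayUdf {
  // One mapper per JVM, created lazily on each executor.
  private lazy val mapper = new ObjectMapper()

  // Returns null (SQL NULL) when the input is null, malformed, or not a JSON array.
  // Elements that are not numbers fall back to Jackson's asInt() default of 0.
  val parseIntArray: String => Seq[Int] = { json =>
    Try(mapper.readTree(json)).toOption
      .filter(node => node != null && node.isArray)
      .map(node => node.elements().asScala.map(_.asInt()).toList)
      .orNull
  }

  // Wrap the plain function so it can be applied to a string column;
  // the result is an array<int> column.
  val parseIntArrayUdf = udf(parseIntArray)
}
```

With the usual implicits in scope it could be applied as df.select(JsonArrayUdf.parseIntArrayUdf($"value").as("values")). Arrays of other element types would need their own variant.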

@gatorsmile

gatorsmile commented Jan 30, 2017

Member

Actually, to specify the schema in SQL, maybe we could use a JSON string, though that is a little odd. So far, nobody seems to be asking for it. Let us see whether users need it in SQL.

@Sazpaimon

@gatorsmile Alternatively, one can do what brickhouse's from_json Hive UDF does ( https://gist.github.com/jeromebanks/8855408#file-gistfile1-sql )

(For the record, I actually need this in SQL)

@gatorsmile

Member

Based on @marmbrus's comment in a JIRA, we prefer using our DDL format. For example, as we did for CREATE TABLE, we can specify the schema as a int, b string.
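
As a rough sketch of what a DDL-formatted schema looks like from SQL: the snippet below assumes a Spark release new enough to register from_json as a SQL function that accepts a DDL schema string (2.2 onward, as far as I know; treat the version as an assumption). The object and app names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to make the snippet self-contained.
object FromJsonSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from_json-sql-example")
      .getOrCreate()

    // The second argument is the schema in the same "name type, ..." form used by CREATE TABLE.
    val parsed = spark.sql(
      """SELECT from_json('{"a": 1, "b": "x"}', 'a INT, b STRING') AS json"""
    )
    parsed.printSchema()
    parsed.show(false)

    spark.stop()
  }
}
```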

9 participants