[SPARK-17620][SQL] Determine Serde by hive.default.fileformat when Creating Hive Serde Tables #15495

Closed
dilipbiswal wants to merge 5 commits into apache:master from dilipbiswal:orc2
dilipbiswal commented Oct 14, 2016

Contributor

What changes were proposed in this pull request?

Reopens the closed PR #15190
(Please refer to the above link for review comments on the PR)

Make sure hive.default.fileformat is used when creating the storage format metadata.

Output

scala> spark.sql("SET hive.default.fileformat=orc")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("CREATE TABLE tmp_default(id INT)")
res2: org.apache.spark.sql.DataFrame = []

Before

scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]

After

scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]
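
The Before/After difference can be read as a lookup problem: the serde, input format, and output format should all be resolved together from the configured default file format, rather than leaving the serde at the text default. The following is a minimal sketch of that idea in plain Scala, not Spark's actual implementation. The names `DefaultFormatSketch`, `StorageFormat`, and `resolve` are hypothetical, and the fallback to `textfile` for unknown formats is an assumption made for illustration.

```scala
object DefaultFormatSketch {
  // Hypothetical container for the three storage fields shown by DESC FORMATTED.
  final case class StorageFormat(serde: String, inputFormat: String, outputFormat: String)

  // Only two formats are listed; the class names are the Hive ones seen in the output above.
  private val known: Map[String, StorageFormat] = Map(
    "orc" -> StorageFormat(
      "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
      "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
      "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat"),
    "textfile" -> StorageFormat(
      "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
      "org.apache.hadoop.mapred.TextInputFormat",
      "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"))

  // Resolve all three fields from one setting so they cannot disagree.
  // Assumption: unknown values fall back to the text format.
  def resolve(defaultFileFormat: String): StorageFormat =
    known.getOrElse(defaultFileFormat.toLowerCase(java.util.Locale.ROOT), known("textfile"))

  def main(args: Array[String]): Unit = {
    val orc = resolve("orc")
    println(orc.serde)        // org.apache.hadoop.hive.ql.io.orc.OrcSerde
    println(orc.inputFormat)  // org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
  }
}
```

In the Before output, the serde stayed at `LazySimpleSerDe` while the input and output formats were already ORC; resolving the three fields from a single entry avoids that kind of mismatch.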

How was this patch tested?

Added new tests to HiveDDLCommandSuite and SQLQuerySuite.

// This testcase verifies that setting `hive.default.fileformat` has no impact on
// the target table's fileformat in case of CTAS.
assert(sessionState.conf.defaultDataSourceName === "parquet")
checkRelation(tableName = table, isDataSourceTable = true, format = "parquet")

Contributor Author

In Scala 2.10, we need to name all the fields when we use named parameters. This is the reason for the build failure with Scala 2.10.

Member

Hmm, I think the error arises because, in your mix of named and positional arguments, the positional arguments do not form a prefix of the argument list.

I.e., the compilation error:

checkRelation(table, isDataSourceTable = true, "parquet")

Should be fixed by:

checkRelation(table, isDataSourceTable = true, format = "parquet")
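
To make the rule concrete, here is a small self-contained sketch. The `checkRelation` below is a hypothetical stand-in with an assumed three-parameter signature; the real helper in the test suite may differ. It shows the two call shapes that are safe across Scala versions: positional arguments first with named arguments after them, or all arguments named.

```scala
object NamedArgsSketch {
  // Hypothetical stand-in for the test helper discussed above.
  def checkRelation(tableName: String, isDataSourceTable: Boolean, format: String): String =
    s"$tableName:$isDataSourceTable:$format"

  def main(args: Array[String]): Unit = {
    // Positional arguments form a prefix, named arguments follow: accepted.
    val prefixStyle = checkRelation("t", isDataSourceTable = true, format = "parquet")

    // Every argument named: accepted, and the order is free.
    val allNamed = checkRelation(format = "parquet", tableName = "t", isDataSourceTable = true)

    // The shape reported in this thread as failing under Scala 2.10 puts a
    // positional argument after a named one:
    //   checkRelation("t", isDataSourceTable = true, "parquet")

    assert(prefixStyle == allNamed)
    println(prefixStyle) // t:true:parquet
  }
}
```

Naming every argument, as the patch does, is the most defensive choice because it does not depend on how a particular compiler version treats mixed calls.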

@dilipbiswal dilipbiswal Oct 15, 2016

Contributor Author

@viirya So what I have should work, right? I have also named the first field for code readability, even though it's not strictly necessary to fix the compilation issue. I also had a question: do you know how to trigger a test for Scala 2.10? I would like to run against 2.10 if possible. I have run it in my local env, though.

Member

As far as I know, we can't trigger it. Maybe @yhuai knows? You can compile it with Scala 2.10 locally to make sure it passes.

Member

Yeah, I think it is no problem that you have named all the arguments.

Contributor

I think we cannot trigger a Scala 2.10 build for a PR.

SparkQA commented Oct 15, 2016

Test build #66995 has finished for PR 15495 at commit a98e173.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

Member

@yhuai Do you think it is good enough to merge? Thank you!

yhuai commented Oct 17, 2016

Contributor

@gatorsmile If this PR fixes the problem related to the build, I am fine with merging it.

@gatorsmile

Member

Thank you! Will do it soon.

@gatorsmile

Member

retest this please

SparkQA commented Oct 18, 2016

Test build #67097 has finished for PR 15495 at commit a98e173.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit closed this in 813ab5e Oct 18, 2016

@gatorsmile

Member

Merging to master! Thanks!

@dilipbiswal

Contributor Author

@gatorsmile @yhuai Many thanks !!

robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
…eating Hive Serde Tables

## What changes were proposed in this pull request?
Reopens the closed PR apache#15190
(Please refer to the above link for review comments on the PR)

Make sure hive.default.fileformat is used when creating the storage format metadata.

Output
``` SQL
scala> spark.sql("SET hive.default.fileformat=orc")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("CREATE TABLE tmp_default(id INT)")
res2: org.apache.spark.sql.DataFrame = []
```
Before
```SQL
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]
```
After
```SQL
scala> spark.sql("DESC FORMATTED tmp_default").collect.foreach(println)
..
[# Storage Information,,]
[SerDe Library:,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat:,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Compressed:,No,]
[Storage Desc Parameters:,,]
[  serialization.format,1,]

```
## How was this patch tested?

Added new tests to HiveDDLCommandSuite, SQLQuerySuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes apache#15495 from dilipbiswal/orc2.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017