
[SPARK-15676] [SQL] Disallow Column Names as Partition Columns For Hive Tables #13415

Closed
gatorsmile wants to merge 4 commits into apache:master from gatorsmile:partitionColumnsInTableSchema

@gatorsmile

Member

What changes were proposed in this pull request?

When creating a Hive table (as opposed to a data source table), a common user error is to specify an existing column name as a partition column. Hive returns the following in this case:

hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns

Currently, the error Spark issues is very confusing:

org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.);

This PR fixes the issue by catching the usage error in the parser.

How was this patch tested?

Added a test case to DDLCommandSuite

@@ -937,7 +937,13 @@ class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder {

selectQuery match {
case Some(q) => CreateTableAsSelectLogicalPlan(tableDesc, q, ifNotExists)

Member Author

For CTAS, another PR (#13395) resolves the issue by disallowing the PARTITIONED BY clause.

@SparkQA

SparkQA commented May 31, 2016

Test build #59666 has finished for PR 13415 at commit 48ddb08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

gatorsmile commented May 31, 2016

Member Author

cc @cloud-fan @yhuai @andrewor14

val partitionColsInTable = partitionCols.map(_.name).toSet.intersect(cols.map(_.name).toSet)
if (partitionColsInTable.nonEmpty) {
  throw new ParseException(s"Column repeated in partitioning columns: " +
    partitionColsInTable.mkString("[", ",", "]"), ctx)
}

Contributor

Hi @gatorsmile, this looks OK, but a better place for the check seems to be up at L885, where we concatenate the schema with the partition columns. There we can just check whether schema.map(_.name) contains any duplicate values.

Member Author

I see. I can move it there.

The reason I put it here is that CTAS should not see the partitioning columns. If we move the check there, we could issue this error message before the expected message: https://github.com/yhuai/spark/blob/fa8908122a238d6cdc0a9fc0f003221ef5601565/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala#L940-L948

@andrewor14 andrewor14 May 31, 2016

Contributor

That's fine. I would still move it. I might even move the datasource partition check before this exception; we don't have to throw that one so late.
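The suggestion in this thread, checking the concatenated schema for duplicate names, can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: `Column` and `DuplicateCheck.findDuplicates` are hypothetical names, not Spark classes, and the sketch assumes the table columns and partition columns have already been concatenated into one sequence.

```scala
// Hypothetical stand-in for a catalog column; not Spark's actual class.
final case class Column(name: String, dataType: String)

object DuplicateCheck {
  // Returns every name that occurs more than once, sorted alphabetically.
  def findDuplicates(names: Seq[String]): Seq[String] =
    names.groupBy(identity).collect {
      case (name, occurrences) if occurrences.size > 1 => name
    }.toSeq.sorted
}

object DuplicateCheckExample {
  def main(args: Array[String]): Unit = {
    // Mirrors the Hive example from the PR description.
    val cols = Seq(Column("id", "bigint"), Column("data", "string"))
    val partitionCols = Seq(Column("data", "string"), Column("part", "string"))
    val schema = cols ++ partitionCols
    println(DuplicateCheck.findDuplicates(schema.map(_.name)))
  }
}
```

A single check over the concatenated schema catches both duplicate table columns and repeated partition columns, but it cannot tell the two cases apart.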


// Ensure that no duplicate name is used in the table definition.
// Also ensure that existing columns are not used as partition columns.
checkDuplicateNames(colNames = schema.map(_.name), ctx)

Member Author

After the code changes, we verify two cases: duplicate names in the table definition, and columns repeated in the partitioning columns.

Contributor

Actually, it might be better to explicitly check whether cols and partitionCols have any columns in common. Then we can give a better error message.

Member Author

I see. Thanks!
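A minimal sketch of the explicit overlap check discussed in this thread, written in plain Scala. `PartitionColumnCheck` and `SchemaError` are hypothetical names used only for illustration; the real change throws the parser's own exception with a parse context, which is omitted here. The message text is the one from the diff earlier in this conversation.

```scala
object PartitionColumnCheck {
  // Hypothetical exception type standing in for the parser's error.
  final class SchemaError(message: String) extends Exception(message)

  // Names declared both as regular columns and as partition columns,
  // in the order they appear in the PARTITIONED BY list.
  def overlap(cols: Seq[String], partitionCols: Seq[String]): Seq[String] = {
    val colSet: Set[String] = cols.toSet
    partitionCols.filter(colSet.contains).distinct
  }

  // Throws with a message specific to the partition-column case.
  def validate(cols: Seq[String], partitionCols: Seq[String]): Unit = {
    val repeated = overlap(cols, partitionCols)
    if (repeated.nonEmpty) {
      throw new SchemaError(
        "Column repeated in partitioning columns: " + repeated.mkString("[", ",", "]"))
    }
  }
}
```

Keeping the overlap check separate from the general duplicate check lets each failure report its own message instead of a generic "duplicate column" error.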

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59695 has finished for PR 13415 at commit 6c5c2d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59705 has finished for PR 13415 at commit 942366f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

Member Author

retest this please

@SparkQA

SparkQA commented Jun 1, 2016

Test build #59712 has finished for PR 13415 at commit 942366f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

Member Author

retest this please

val duplicateColumns = colNames.groupBy(identity).collect {
  case (x, ys) if ys.length > 1 => "\"" + x + "\""
}
throw new ParseException(s"Duplicate column name key(s) in the table definition: " +

Contributor

What does "column name key(s)" mean? I think we should just say: Duplicated column names found in table definition: ...

Contributor

It would also be good to throw operationNotAllowed here.

@andrewor14 andrewor14 Jun 2, 2016

Contributor

Can you also print the table name? For example: found in table definition for 'my_table'

Member Author

: ) This just follows the error message of Hive. Will change it. Thanks!
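The reviewers' wording suggestions in this thread could be combined roughly as follows. This is a sketch only: `DuplicateMessage.build` is a hypothetical helper, and the exact wording that ended up in Spark is not shown in this conversation, so the message below is an assumption assembled from the comments above.

```scala
object DuplicateMessage {
  // Returns None when there are no duplicates; otherwise a message that
  // names the table and lists each duplicated column in double quotes.
  // The wording is assembled from the review comments, not taken from Spark.
  def build(tableName: String, colNames: Seq[String]): Option[String] = {
    val duplicates = colNames.groupBy(identity).collect {
      case (name, occurrences) if occurrences.length > 1 => "\"" + name + "\""
    }.toSeq.sorted
    if (duplicates.isEmpty) None
    else Some(
      s"Duplicated column names found in table definition for '$tableName': " +
        duplicates.mkString(", "))
  }
}
```

Returning an `Option` keeps message construction separate from the decision about which exception to throw.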

@cloud-fan

Contributor

I'm thinking about case sensitivity. Maybe we should put this check in the analyzer instead of the parser?

@gatorsmile

Member Author

@cloud-fan Yeah, agreed. I knew you would say that. : )

@SparkQA

SparkQA commented Jun 2, 2016

Test build #59866 has finished for PR 13415 at commit 942366f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14

Contributor

We should still do it in the parser, but use the SQLConf

@gatorsmile

Member Author

Sure, will do it. Thanks!
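To illustrate the case-sensitivity concern raised above, here is a small sketch of a duplicate check that takes the setting as a parameter. The `caseSensitive` flag is a hypothetical stand-in for a value read from SQLConf, and `CaseAwareDuplicates` is an illustrative name; neither is taken from the Spark code base.

```scala
import java.util.Locale

object CaseAwareDuplicates {
  // `caseSensitive` is a hypothetical flag standing in for a SQLConf setting.
  // When it is false, names are lower-cased before comparison, so "Data"
  // and "data" count as the same column.
  def find(names: Seq[String], caseSensitive: Boolean): Seq[String] = {
    val normalized =
      if (caseSensitive) names else names.map(_.toLowerCase(Locale.ROOT))
    normalized.groupBy(identity).collect {
      case (name, occurrences) if occurrences.size > 1 => name
    }.toSeq.sorted
  }
}
```

With `caseSensitive = false`, the names `Data` and `data` are reported as one duplicate; with `caseSensitive = true`, they are treated as distinct columns.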

@gatorsmile

gatorsmile commented Jun 3, 2016

Member Author

@cloud-fan @andrewor14 In this scenario, we do not have the case sensitivity issues. The names of all the catalog columns are converted to lower case by

As I recall, we gave up case sensitivity support in this release.

Let me know if you have any questions about the current implementation. Thanks!

@SparkQA

SparkQA commented Jun 3, 2016

Test build #59912 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile

Member Author

retest this please

@SparkQA

SparkQA commented Jun 7, 2016

Test build #60099 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

Contributor

LGTM, cc @andrewor14 for final sign off

@gatorsmile

Member Author

Thank you! @cloud-fan

@gatorsmile

Member Author

retest this please

@SparkQA

SparkQA commented Jun 10, 2016

Test build #60275 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14

andrewor14 commented Jun 10, 2016

Contributor

LGTM, sorry for the wait.

@gatorsmile

Member Author

Thank you! @andrewor14

@gatorsmile

Member Author

retest this please

@SparkQA

SparkQA commented Jun 13, 2016

Test build #60408 has finished for PR 13415 at commit f4207e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai

yhuai commented Jun 13, 2016

Contributor

Thanks. Merging to master and branch 2.0.

asfgit pushed a commit that referenced this pull request Jun 13, 2016
…e Tables

#### What changes were proposed in this pull request?
When creating a Hive table (as opposed to a data source table), a common user error is to specify an existing column name as a partition column. Hive returns the following in this case:
```
hive> CREATE TABLE partitioned (id bigint, data string) PARTITIONED BY (data string, part string);
FAILED: SemanticException [Error 10035]: Column repeated in partitioning columns
```
Currently, the error Spark issues is very confusing:
```
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:For direct MetaStore DB connections, we don't support retries at the client level.);
```
This PR fixes the issue by catching the usage error in `Parser`.

#### How was this patch tested?
Added a test case to `DDLCommandSuite`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13415 from gatorsmile/partitionColumnsInTableSchema.

(cherry picked from commit 3b7fb84)
Signed-off-by: Yin Huai <yhuai@databricks.com>

@asfgit closed this in 3b7fb84 on Jun 13, 2016

5 participants