
[SPARK-21912][SQL] ORC/Parquet table should not create invalid column names #19124

Closed
dongjoon-hyun wants to merge 12 commits into apache:master from dongjoon-hyun:SPARK-21912

dongjoon-hyun commented Sep 4, 2017

Member

What changes were proposed in this pull request?

Currently, jobs abort when users create or alter ORC/Parquet tables with invalid column names. We should prevent this by raising an AnalysisException that advises using aliases instead, as Parquet data source tables already do.

BEFORE

scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:28:21 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows.

AFTER

scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
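
For illustration only, a minimal sketch of a character-blacklist check that would produce a message like the one above. It assumes the invalid characters are exactly those listed in the message; `FieldNameBlacklist` and `InvalidColumnNameException` are hypothetical stand-ins, not Spark classes.

```scala
object FieldNameBlacklist {
  // Hypothetical stand-in for Spark's AnalysisException.
  final class InvalidColumnNameException(message: String) extends Exception(message)

  // Assumption: the invalid characters are exactly those named in the error message.
  private val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

  // Throws if the name contains any blacklisted character.
  def checkFieldName(name: String): Unit = {
    if (name.exists(c => invalidChars.contains(c))) {
      throw new InvalidColumnNameException(
        "Column name \"" + name + "\" contains invalid character(s) among " +
          "\" ,;{}()\\n\\t=\". Please use alias to rename it.")
    }
  }

  // Convenience wrapper that reports validity as a Boolean.
  def isValid(name: String): Boolean =
    try { checkFieldName(name); true }
    catch { case _: InvalidColumnNameException => false }
}
```

With this list, `FieldNameBlacklist.isValid("a b")` is false, while a name such as `a?b` passes because `?` is not in the list.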

How was this patch tested?

Pass Jenkins with a new test case.

SparkQA commented Sep 4, 2017

Test build #81391 has finished for PR 19124 at commit 808dfe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

private def checkFieldName(name: String): Unit = {
// ,;{}()\n\t= and space are special characters in ORC schema

@tejasapatil Sep 4, 2017

Contributor

Is this an exhaustive list? For example, it looks like '?' is not allowed either. Given that the underlying library (ORC) can evolve to support or stop supporting certain characters, it's safer to rely on a library method than to maintain a blacklist. Can you simply call TypeInfoUtils.getTypeInfoFromTypeString, or a related method that performs this check?

Caused by: java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<i?:int,j:int,k:string>' but '?' is found.
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:770)
  at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:194)
  at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:231)
  at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91)
...
...

Member Author

Thank you for the review, @tejasapatil!
That's a good idea. Right, it's not an exhaustive list. I'll update the PR.
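
As a rough sketch of the suggestion above (delegate validation to a schema parser instead of keeping a blacklist), the check could wrap a parse call and translate its failure into a user-facing error. Here `parseStructType` is a hypothetical, simplified stand-in for the library parser, and `InvalidColumnNameException` stands in for AnalysisException; the accepted character set is an assumption of this sketch only.

```scala
object ParserBackedCheck {
  // Hypothetical stand-in for Spark's AnalysisException.
  final class InvalidColumnNameException(message: String, cause: Throwable)
    extends Exception(message, cause)

  // Hypothetical stand-in for a library schema parser. It accepts only
  // "struct<name:int>" where name consists of letters, digits, or underscores.
  def parseStructType(schema: String): String = {
    val Pattern = "struct<([A-Za-z0-9_]+):int>".r
    schema match {
      case Pattern(field) => field
      case _ => throw new IllegalArgumentException(s"Cannot parse schema string '$schema'")
    }
  }

  // Let the parser decide validity and rethrow with a clearer message.
  def checkFieldName(name: String): Unit =
    try {
      parseStructType(s"struct<$name:int>")
      ()
    } catch {
      case e: IllegalArgumentException =>
        throw new InvalidColumnNameException(
          "Column name \"" + name + "\" contains invalid character(s). " +
            "Please use alias to rename it.", e)
    }

  def isValid(name: String): Boolean =
    try { checkFieldName(name); true }
    catch { case _: InvalidColumnNameException => false }
}
```

The advantage of this shape is that the set of rejected names follows the parser automatically, so a character like '?' is caught without being listed anywhere.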

SparkQA commented Sep 5, 2017

Test build #81394 has finished for PR 19124 at commit a738943.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} catch {
case _: IllegalArgumentException =>
throw new AnalysisException(
s"""Attribute name "$name" contains invalid character(s).

Member

Nit: Attribute -> Column

Member Author

Thank you for the review. Sure.

Member Author

I agree with you that "column" is more accurate here. Previously, I borrowed this from ParquetSchemaConverter:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565-L572

withTable("orc1") {
Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
val m = intercept[AnalysisException] {
sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")

Member

This is CTAS. How about CREATE TABLE?

@dongjoon-hyun Sep 5, 2017

Member Author

Yep. I'll check the code path, too.

Member Author

It seems to be the same situation with Parquet. CREATE TABLE passes but SELECT raises exceptions.

scala> sql("CREATE TABLE parquet1(`a b` int) using parquet")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet1").show
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

Member Author

Do we need to add a data-source-specific operation to createDataSourceTables for Parquet and ORC?

@dongjoon-hyun Sep 5, 2017

Member Author

@gatorsmile, I tried the following in CreateDataSourceTableCommand. We can add a check for ParquetFileFormat, but not for OrcFileFormat. Should I change the PR title and scope instead?

    table.provider.get.toLowerCase match {
      case "parquet" =>
        dataSource.schema.map(_.name).foreach(
          org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.checkFieldName)
      case "orc" =>
        dataSource.schema.map(_.name).foreach(
          org.apache.spark.sql.hive.OrcRelation.checkFieldName)
    }

Member Author

I'll try another way.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] Creating ORC datasource table should check invalid column names [SPARK-21912][SQL] Creating ORC/Parquet datasource table should check invalid column names Sep 5, 2017
}
}

table.provider.get.toLowerCase match {

Member Author

We are able to check here for a normal CREATE TABLE.

@gatorsmile Sep 5, 2017

Member

This just covers CREATE DATA SOURCE TABLES. How about CREATE HIVE SERDE TABLES?

Member Author

Ya. That's a good point!

Member Author

@gatorsmile, I have a question. Do we have an issue with Hive SerDe tables?

CREATE TABLE t(`a b` INT) USING hive OPTIONS (fileFormat 'parquet')

I thought the Hive schema was preferred over the Parquet schema.

Member

What is Hive schema? What is Parquet schema?

Could you insert rows to the table you created?

Member Author

Oh, I see. It fails.

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
res5: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> sql("INSERT INTO t VALUES(1)")
17/09/05 11:34:03 ERROR Utils: Aborting task
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'b' at line 1:   optional int32 a b
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)


import org.apache.spark.sql.AnalysisException

private[sql] object OrcFileFormat {

Member Author

Fortunately, we already have the new ORC dependency.

SparkQA commented Sep 5, 2017

Test build #81403 has finished for PR 19124 at commit cd539fe.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 5, 2017

Test build #81404 has finished for PR 19124 at commit 66aff54.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 5, 2017

Test build #81405 has finished for PR 19124 at commit aa78eaf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

dongjoon-hyun commented Sep 5, 2017

Member Author

Hi, @gatorsmile .

I can fix it in most cases, but we have the following test case.

-- !query 2
CREATE TABLE showcolumn1 (col1 int, `col 2` int) USING parquet
-- !query 2 schema
struct<>
-- !query 2 output

For Parquet, CREATE TABLE is currently allowed, while CTAS and SELECT raise AnalysisException. How should I proceed?

@dongjoon-hyun

Member Author

I just updated the output answer file to show the result for review only.

SparkQA commented Sep 5, 2017

Test build #81408 has finished for PR 19124 at commit 0bf3b43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -23,7 +23,8 @@ CREATE TABLE showcolumn1 (col1 int, `col 2` int) USING parquet
-- !query 2 schema

Member

Please change it to JSON

Member Author

Sure, thank you for the guide!

@@ -145,15 +146,27 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
* `PreprocessTableInsertion`.
*/
object HiveAnalysis extends Rule[LogicalPlan] {

Member Author

I thought HiveAnalysis is the best place to check this.

@dongjoon-hyun

Member Author

Thank you, @gatorsmile. The PR has become much more general.
When I started this PR, I didn't realize that Spark currently had so many missing cases like this.



-- !query 3
CREATE TABLE showcolumn2 (price int, qty int, year int, month int) USING parquet partitioned by (year, month)

Member

We do not need to change this.

Member Author

Yep. It's reverted.

@gatorsmile

Member

That is normal. When we find a bug, it usually means we have overlooked it in more than one place. Thus, we need to check all the other code paths that could trigger it.

}
}

table.provider.get.toLowerCase(Locale.ROOT) match {

Member

We should do it in DataSourceAnalysis

Member Author

Thanks. Right.

classOf[MapRedOutputFormat[_, _]])
}

org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.checkFieldNames(dataSchema)

Member

Do we still need this once we move the check to DataSourceAnalysis?

Member Author

I see. I'll check this and remove it. Maybe, we can remove the similar logic from ParquetFileFormat, too.

@gatorsmile

Member

Could this PR cover this scenario?

SparkQA commented Sep 6, 2017

Test build #81432 has finished for PR 19124 at commit 8ee87dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

Member Author

I created SPARK-21929 for "Support ALTER TABLE table_name ADD COLUMNS(..) for ORC data source".

For Parquet ALTER TABLE, yes. I think I can include that here.
But I'm not sure about the PR title. It would be unclear because the coverage is only partial.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] Creating ORC/Parquet datasource table should check invalid column names [SPARK-21912][SQL] ORC/Parquet datasource table should check invalid column names Sep 6, 2017
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] ORC/Parquet datasource table should check invalid column names [SPARK-21912][SQL] ORC/Parquet table should create invalid column names Sep 6, 2017
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] ORC/Parquet table should create invalid column names [SPARK-21912][SQL] ORC/Parquet table should not create invalid column names Sep 6, 2017
conf.caseSensitiveAnalysis)

val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))

Member Author

For this command, it's not easy to get CatalogTable at DataSourceStrategy.

Member

    val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
    val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)

    SchemaUtils.checkColumnNameDuplication(
      reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
      conf.caseSensitiveAnalysis)
    DDLUtils.checkFieldNames(catalogTable.copy(schema = newSchema))

    catalog.alterTableSchema(table, newSchema)

Member Author

Er, actually, excluding partition columns was intentional.
Maybe I used a misleading PR title and description here.
So far, I have checked dataSchema only. I think partition columns are okay because they are not part of the Parquet/ORC file schema.

Member Author

Is it okay to use the following?

    val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)

    SchemaUtils.checkColumnNameDuplication(
      reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
      conf.caseSensitiveAnalysis)
    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))

    catalog.alterTableSchema(
      table, catalogTable.schema.copy(fields = reorderedSchema.toArray))

Member Author

Sorry, I found that your code is better. I'll update it to match yours.
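
To make the ADD COLUMNS discussion above concrete, here is a simplified sketch of the two steps it relies on: placing new columns between the existing data columns and the partition columns, and detecting duplicate names with optional case sensitivity. `Field` and `AddColumnsSketch` are hypothetical stand-ins for illustration, not Spark's StructField or SchemaUtils.

```scala
import java.util.Locale

object AddColumnsSketch {
  // Hypothetical, simplified stand-in for a schema field.
  final case class Field(name: String, dataType: String)

  // New columns go after existing data columns and before partition columns.
  def reorder(data: Seq[Field], added: Seq[Field], partition: Seq[Field]): Seq[Field] =
    data ++ added ++ partition

  // Returns the names that occur more than once, sorted for stable output.
  def duplicateNames(fields: Seq[Field], caseSensitive: Boolean): Seq[String] = {
    val names =
      if (caseSensitive) fields.map(_.name)
      else fields.map(_.name.toLowerCase(Locale.ROOT))
    names.groupBy(identity).collect { case (n, group) if group.size > 1 => n }.toSeq.sorted
  }
}
```

The duplicate check runs over the full reordered schema, since a new column must not clash with a partition column either.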

}

// TODO: After SPARK-21929, we need to check ORC, too.
Seq("PARQUET").foreach { source =>

Member Author

I added only a Parquet test case because of SPARK-21929.

SparkQA commented Sep 6, 2017

Test build #81435 has finished for PR 19124 at commit c6e9ab6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[sql] object OrcFileFormat {
private def checkFieldName(name: String): Unit = {
try {
TypeDescription.fromString(s"struct<$name:int>")

@viirya Sep 6, 2017

Member

Member

parseName does not look public, though. I don't like this line either, but I could not think of an alternative for now.

@viirya Sep 6, 2017

Member

Oh, right, I forgot that it is Java...

Member Author

Yep. I agree that it's a little ugly now.

} else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
ParquetSchemaConverter.checkFieldNames(table.dataSchema)
} else {
table.provider.get.toLowerCase(Locale.ROOT) match {

Member

table.provider could be None in the previous versions of Spark. Thus, .get is risky.


private[sql] def checkFieldNames(table: CatalogTable): Unit = {
val serde = table.storage.serde
if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {

Member

This way is not right. Let's use your previous approach with a foreach loop:

    table.provider.foreach {
      _.toLowerCase(Locale.ROOT) match {
        case "hive" =>

Member Author

Yep!
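
A small sketch of why the `foreach` form is safer than `.get`: when the provider is absent, the body is simply skipped instead of throwing. `Table` and `checkDataColumns` are hypothetical, simplified stand-ins for CatalogTable and the real check, and the per-format check is passed in as a function so the example stays self-contained.

```scala
import java.util.Locale

object ProviderDispatch {
  // Hypothetical, simplified stand-in for CatalogTable.
  final case class Table(provider: Option[String], dataColumns: Seq[String])

  // Runs `check(format, columnName)` for formats that restrict column names.
  // A missing provider or an unrelated provider results in no checks at all.
  def checkDataColumns(table: Table, check: (String, String) => Unit): Unit =
    table.provider.foreach { p =>
      // Locale.ROOT keeps the lower-casing independent of the JVM default locale.
      p.toLowerCase(Locale.ROOT) match {
        case "parquet" => table.dataColumns.foreach(c => check("parquet", c))
        case "orc" => table.dataColumns.foreach(c => check("orc", c))
        case _ => () // other providers: no restriction in this sketch
      }
    }
}
```

The explicit default case also matters: without it, any provider other than the listed ones would raise a MatchError.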

val serde = table.storage.serde
if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
OrcFileFormat.checkFieldNames(table.dataSchema)
} else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {

Member

We could have different Parquet serde. For example, parquet.hive.serde.ParquetHiveSerDe and org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe. How about ORC?

Member Author

AFAIK, it's only org.apache.hadoop.hive.ql.io.orc.OrcSerde. I checked again whether Apache ORC 1.4.0 has a renamed one under the hive-storage API, but it doesn't.

For parquet, I'll handle that too.
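
One way to handle several serde class names per format, sketched under the assumption that the three class names mentioned in this thread are the only ones of interest. `SerdeFormat` and `formatOf` are hypothetical names for illustration.

```scala
object SerdeFormat {
  // Serde class names taken from the discussion above.
  private val parquetSerdes = Set(
    "parquet.hive.serde.ParquetHiveSerDe",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")
  private val orcSerdes = Set("org.apache.hadoop.hive.ql.io.orc.OrcSerde")

  // Maps an optional serde class name to the file format whose naming rules apply.
  def formatOf(serde: Option[String]): Option[String] = serde.flatMap { s =>
    if (parquetSerdes.contains(s)) Some("parquet")
    else if (orcSerdes.contains(s)) Some("orc")
    else None
  }
}
```

Matching against a set, rather than comparing with a single expected class name, covers both the old and new Parquet serde names with one lookup.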

@dongjoon-hyun

Member Author

Oh, thank you for the review, @viirya, @HyukjinKwon, and @gatorsmile!
I'll follow up on your comments!

}
}

private[sql] def checkFieldNames(table: CatalogTable): Unit = {

Member Author

I'll rename this to checkDataSchemaFieldNames.

SparkQA commented Sep 6, 2017

Test build #81443 has finished for PR 19124 at commit 46847f8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

Member Author

Retest this please

SparkQA commented Sep 6, 2017

Test build #81445 has finished for PR 19124 at commit 46847f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SchemaUtils.checkColumnNameDuplication(
reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
conf.caseSensitiveAnalysis)
DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema))

Member

newSchema also contains partition schema. How about partition schema? Do we have the same limits on it?

Member Author

It's okay. Inside checkDataSchemaFieldNames, we only use table.dataSchema, like the following.

ParquetSchemaConverter.checkFieldNames(table.dataSchema)

For the partition columns, we have been allowing the special characters.
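
The point above can be sketched as follows: even when a table definition carries both data and partition columns, only the data columns are validated. `Table` here is a hypothetical stand-in for CatalogTable, and the invalid-character set is assumed to be the Parquet list quoted earlier in this conversation.

```scala
object DataSchemaOnlyCheck {
  // Hypothetical, simplified stand-in for CatalogTable.
  final case class Table(dataColumns: Seq[String], partitionColumns: Seq[String])

  // Assumption: the same character list as in the Parquet error message.
  private val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

  // Partition columns are deliberately ignored: in this sketch they are
  // treated as not being part of the file schema.
  def invalidDataColumns(table: Table): Seq[String] =
    table.dataColumns.filter(name => name.exists(c => invalidChars.contains(c)))
}
```

For a table with data columns `price` and `unit price` and a partition column `sale year`, only `unit price` is reported.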

Member

Could you add test cases and ensure the partitioning columns with special characters work?

Member Author

Member Author

I think this PR passes the above two test cases, too.

@gatorsmile

Member

LGTM

@gatorsmile

Member

Thanks! Merged to master.

@asfgit asfgit closed this in eea2b87 Sep 7, 2017
@dongjoon-hyun

Member Author

@gatorsmile, thank you for your help! This PR was almost entirely made by you.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-21912 branch September 7, 2017 05:28
@dongjoon-hyun

Member Author

Thank you for reviewing and helping with this PR, @tejasapatil, @viirya, and @HyukjinKwon, too!
