[SPARK-21912][SQL] ORC/Parquet table should not create invalid column names by dongjoon-hyun · Pull Request #19124 · apache/spark

dongjoon-hyun · 2017-09-04T20:50:39Z

What changes were proposed in this pull request?

Currently, jobs abort when users create or alter ORC/Parquet tables with invalid column names. We should prevent this by raising an AnalysisException that advises using aliases instead, as Parquet data source tables already do.

BEFORE

scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:28:21 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows.

AFTER

scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
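
As a rough illustration of the kind of check behind the AFTER message, here is a minimal sketch in plain Scala. It assumes the character list quoted in the message; `FieldNameBlacklist` is a hypothetical name, and `IllegalArgumentException` stands in for Spark's `AnalysisException` so the snippet runs without Spark on the classpath. It is not the code in this PR.

```scala
object FieldNameBlacklist {
  // Characters listed in the error message: space , ; { } ( ) newline tab =
  private val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

  /** True when `name` contains none of the listed characters. */
  def isValid(name: String): Boolean = !name.exists(invalidChars.contains)

  /** Fails fast on an invalid name (stand-in for raising AnalysisException). */
  def checkFieldName(name: String): Unit =
    if (!isValid(name)) {
      throw new IllegalArgumentException(
        s"""Attribute name "$name" contains invalid character(s). Please use alias to rename it.""")
    }
}
```

Running a check like this at analysis time is what turns the aborted write job into an immediate, readable error.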

How was this patch tested?

Pass the Jenkins with a new test case.

SparkQA · 2017-09-04T22:49:37Z

Test build #81391 has finished for PR 19124 at commit 808dfe0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-09-04T23:53:24Z

  }
+
+  private def checkFieldName(name: String): Unit = {
+    // ,;{}()\n\t= and space are special characters in ORC schema


Is this an exhaustive list? E.g., it looks like ? is not allowed either. Given that the underlying lib (ORC) can evolve to support or stop supporting certain chars, it's safer to rely on some method rather than coming up with a blacklist. Can you simply call TypeInfoUtils.getTypeInfoFromTypeString or any related method which would do this check?

Caused by: java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<i?:int,j:int,k:string>' but '?' is found.
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
    at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:770)
    at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:194)
    at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:231)
    at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91)
    ... ...

Thank you for review, @tejasapatil !
That's a good idea. Right, it's not an exhaustive list. I'll update the PR.
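
The suggestion above, delegating validation to the format's own schema parser instead of keeping a character blacklist, could be sketched roughly as follows. Everything here is illustrative: `parseStructType` is a hypothetical stand-in for the library parser (its accepted character set is an assumption, not ORC's actual grammar), and `InvalidColumnNameException` stands in for `AnalysisException`.

```scala
object DelegatingFieldNameCheck {
  // Hypothetical stand-in for AnalysisException.
  final class InvalidColumnNameException(msg: String) extends Exception(msg)

  // Hypothetical stand-in for the library's type-string parser. It accepts only
  // letters, digits, '_' and '.' in the field name and throws otherwise.
  def parseStructType(schema: String): Unit = {
    val pattern = "struct<([A-Za-z0-9_.]+):int>".r
    schema match {
      case pattern(_) => ()
      case _ => throw new IllegalArgumentException(s"Cannot parse schema: $schema")
    }
  }

  // Wrap the name in a one-field struct, let the parser decide, and translate
  // the parser's failure into a user-facing error.
  def checkFieldName(name: String, parse: String => Unit = parseStructType): Unit =
    try {
      parse(s"struct<$name:int>")
    } catch {
      case _: IllegalArgumentException =>
        throw new InvalidColumnNameException(
          s"""Column name "$name" contains invalid character(s). Please use alias to rename it.""")
    }
}
```

The benefit of this shape is that the set of rejected names tracks whatever the parser accepts, so it does not need updating when the library changes.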

SparkQA · 2017-09-05T03:30:46Z

Test build #81394 has finished for PR 19124 at commit a738943.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-05T07:16:32Z

+    } catch {
+      case _: IllegalArgumentException =>
+        throw new AnalysisException(
+          s"""Attribute name "$name" contains invalid character(s).


Nit: Attribute-> Column

Thank you for review. Sure.

I agree with you that column is more accurate here. Previously, I borrowed this from ParquetSchemaConverter

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565-L572

gatorsmile · 2017-09-05T07:17:54Z

+    withTable("orc1") {
+      Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
+        val m = intercept[AnalysisException] {
+          sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")


This is CTAS. How about CREATE TABLE?

Yep. I'll check the code path, too.

It seems to be the same situation with Parquet. CREATE TABLE passes but SELECT raises exceptions.

scala> sql("CREATE TABLE parquet1(`a b` int) using parquet")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet1").show
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

Do we need to add Datasource specific operation on createDataSourceTables for Parquet and ORC?

@gatorsmile . I tried the following in CreateDataSourceTableCommand. We can add a check for ParquetFileFormat, but not for OrcFileFormat. Should I change the PR title and scope instead?

table.provider.get.toLowerCase match {
  case "parquet" =>
    dataSource.schema.map(_.name).foreach(
      org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.checkFieldName)
  case "orc" =>
    dataSource.schema.map(_.name).foreach(
      org.apache.spark.sql.hive.OrcRelation.checkFieldName)
}

I'll try in another way.

dongjoon-hyun · 2017-09-05T08:40:05Z

      }
    }

+    table.provider.get.toLowerCase match {


We are able to check here for a normal CREATE TABLE.

This just covers CREATE DATA SOURCE TABLES. How about CREATE HIVE SERDE TABLES?

Ya. That's a good point!

@gatorsmile . I have a question. Do we have an issue on Hive SERDE table?

CREATE TABLE t(`a b` INT) USING hive OPTIONS (fileFormat 'parquet')

I thought the Hive schema was preferred over the Parquet schema.

What is Hive schema? What is Parquet schema?

Could you insert rows to the table you created?

Oh, I see. It fails.

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
res5: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> sql("INSERT INTO t VALUES(1)")
17/09/05 11:34:03 ERROR Utils: Aborting task
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'b' at line 1: optional int32 a b
    at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
    at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)

dongjoon-hyun · 2017-09-05T08:40:57Z

+
+import org.apache.spark.sql.AnalysisException
+
+private[sql] object OrcFileFormat {


Fortunately, we already have the new ORC dependency.

SparkQA · 2017-09-05T08:44:02Z

Test build #81403 has finished for PR 19124 at commit cd539fe.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-09-05T08:59:43Z

Test build #81404 has finished for PR 19124 at commit 66aff54.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-09-05T09:53:06Z

Test build #81405 has finished for PR 19124 at commit aa78eaf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-09-05T10:46:19Z

Hi, @gatorsmile .

I can fix it in most cases, but we have the following test case.

-- !query 2
CREATE TABLE showcolumn1 (col1 int, `col 2` int) USING parquet
-- !query 2 schema
struct<>
-- !query 2 output

In the case of Parquet, CREATE TABLE is currently allowed, while CTAS and SELECT throw AnalysisException. How should I proceed?

dongjoon-hyun · 2017-09-05T10:52:03Z

I just updated the output answer file to show the result for review only.

SparkQA · 2017-09-05T13:32:27Z

Test build #81408 has finished for PR 19124 at commit 0bf3b43.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-05T15:56:42Z

@@ -23,7 +23,8 @@ CREATE TABLE showcolumn1 (col1 int, `col 2` int) USING parquet
 -- !query 2 schema


Please change it to JSON

Sure, thank you for the guide!

dongjoon-hyun · 2017-09-05T19:52:08Z

@@ -145,15 +146,27 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
 * `PreprocessTableInsertion`.
 */
 object HiveAnalysis extends Rule[LogicalPlan] {


I thought HiveAnalysis is the best place to check this.

dongjoon-hyun · 2017-09-05T19:53:32Z

Thank you, @gatorsmile . The PR has become much more general.
When I started this PR, I didn't notice that the current Spark had so many missing cases like this.

gatorsmile · 2017-09-05T20:14:48Z



 -- !query 3
-CREATE TABLE showcolumn2 (price int, qty int, year int, month int) USING parquet partitioned by (year, month)


We do not need to change this.

Yep. It's reverted.

gatorsmile · 2017-09-05T20:16:44Z

That is normal. When we find a bug, it normally means we ignore it in more than one place. Thus, we need to check all the other code paths that could trigger it.

gatorsmile · 2017-09-05T20:18:16Z

      }
    }

+    table.provider.get.toLowerCase(Locale.ROOT) match {


We should do it in DataSourceAnalysis

Thanks. Right.

gatorsmile · 2017-09-05T20:19:39Z

          classOf[MapRedOutputFormat[_, _]])
    }

+    org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.checkFieldNames(dataSchema)


Do we still need this when we move the check to DataSourceAnalysis ?

I see. I'll check this and remove it. Maybe, we can remove the similar logic from ParquetFileFormat, too.

gatorsmile · 2017-09-06T00:43:07Z

Could this PR cover this scenario?

SparkQA · 2017-09-06T00:50:01Z

Test build #81432 has finished for PR 19124 at commit 8ee87dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-09-06T00:51:02Z

I created SPARK-21929 for "Support ALTER TABLE table_name ADD COLUMNS(..) for ORC data source".

For Parquet ALTER TABLE, yes. I think I can include that here.
But, for the title of PR, I'm not sure. It's not clear because it's partial.

dongjoon-hyun · 2017-09-06T01:29:21Z

      conf.caseSensitiveAnalysis)

+    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
+    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))


For this command, it's not easy to get CatalogTable at DataSourceStrategy.

val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)

SchemaUtils.checkColumnNameDuplication(
  reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
  conf.caseSensitiveAnalysis)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newSchema))

catalog.alterTableSchema(table, newSchema)

Ur, actually. Excluding partition columns was intentional.
Maybe, I used a misleading PR title and description here.
So far, I checked dataSchema only. I think partition columns are okay because they are not a part of Parquet/ORC file schema.

Is it okay to use the following?

val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
val newDataSchema = StructType(catalogTable.dataSchema ++ columns)

SchemaUtils.checkColumnNameDuplication(
  reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
  conf.caseSensitiveAnalysis)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))

catalog.alterTableSchema(
  table, catalogTable.schema.copy(fields = reorderedSchema.toArray))

Sorry, I found that your code is better. I'll update it like yours.

dongjoon-hyun · 2017-09-06T01:29:58Z

+        }
+
+        // TODO: After SPARK-21929, we need to check ORC, too.
+        Seq("PARQUET").foreach { source =>


I added only a Parquet test case due to SPARK-21929.

SparkQA · 2017-09-06T04:06:14Z

Test build #81435 has finished for PR 19124 at commit c6e9ab6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-09-06T04:32:12Z

+private[sql] object OrcFileFormat {
+  private def checkFieldName(name: String): Unit = {
+    try {
+      TypeDescription.fromString(s"struct<$name:int>")


This seems to be equivalent to calling TypeDescription.parseName(name).

https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/TypeDescription.java#L325

parseName doesn't look public, though. I don't like this line either, but I couldn't think of an alternative for now.

Oh, right, I forgot that is java...

Yep. I agree that it's a little ugly now.

gatorsmile · 2017-09-06T04:46:22Z

+    } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
+      ParquetSchemaConverter.checkFieldNames(table.dataSchema)
+    } else {
+      table.provider.get.toLowerCase(Locale.ROOT) match {


table.provider could be None in the previous versions of Spark. Thus, .get is risky.
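
To make the concern concrete, here is a small sketch of dispatching on an optional provider without calling `.get`. `TableMeta` is a hypothetical, trimmed-down stand-in for Spark's `CatalogTable`, and `selectCheck` is an illustrative helper that is not part of this PR.

```scala
import java.util.Locale

object ProviderDispatch {
  // Hypothetical stand-in for CatalogTable: only the fields this sketch needs.
  final case class TableMeta(provider: Option[String], dataColumns: Seq[String])

  /** Names the format-specific check to run, or None when no check applies. */
  def selectCheck(table: TableMeta): Option[String] =
    table.provider
      // Locale.ROOT keeps lower-casing independent of the JVM's default locale.
      .map(_.toLowerCase(Locale.ROOT))
      .collect {
        case "parquet" => "parquet"
        case "orc" => "orc"
      }
}
```

With `Option` combinators, a table whose provider is `None` simply skips the check, whereas `table.provider.get` would throw `NoSuchElementException`.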

gatorsmile · 2017-09-06T04:52:07Z

+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {
+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {


This way is not right. Let's use your previous way with a foreach loop

table.provider.foreach {
  _.toLowerCase(Locale.ROOT) match {
    case "hive" =>

gatorsmile · 2017-09-06T04:56:49Z

+    val serde = table.storage.serde
+    if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
+      OrcFileFormat.checkFieldNames(table.dataSchema)
+    } else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {


We could have different Parquet serde. For example, parquet.hive.serde.ParquetHiveSerDe and org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe. How about ORC?

AFAIK, it's only org.apache.hadoop.hive.ql.io.orc.OrcSerde. I checked again whether Apache ORC 1.4.0 has a renamed one under the hive-storage API, but it doesn't.

For parquet, I'll handle that too.
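
One way to handle more than one SerDe class name per format is a set lookup rather than a single equality test. The sketch below uses only the class names quoted in this thread; `SerdeFormats` and `formatOf` are hypothetical names, not Spark APIs.

```scala
object SerdeFormats {
  // The two Parquet SerDe class names mentioned above.
  val parquetSerdes: Set[String] = Set(
    "parquet.hive.serde.ParquetHiveSerDe",
    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe")

  // The single ORC SerDe class name mentioned above.
  val orcSerde: String = "org.apache.hadoop.hive.ql.io.orc.OrcSerde"

  /** Maps an optional SerDe class name to the file format it implies, if any. */
  def formatOf(serde: Option[String]): Option[String] = serde.collect {
    case s if parquetSerdes.contains(s) => "parquet"
    case s if s == orcSerde => "orc"
  }
}
```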

dongjoon-hyun · 2017-09-06T05:20:50Z

Oh, thank you for review, @viirya, @HyukjinKwon and @gatorsmile !
I'll follow up your comments!

dongjoon-hyun · 2017-09-06T06:30:50Z

    }
  }
+
+  private[sql] def checkFieldNames(table: CatalogTable): Unit = {


I'll rename this to checkDataSchemaFieldNames.

SparkQA · 2017-09-06T07:04:47Z

Test build #81443 has finished for PR 19124 at commit 46847f8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-09-06T07:25:06Z

Retest this please

SparkQA · 2017-09-06T10:04:04Z

Test build #81445 has finished for PR 19124 at commit 46847f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-09-06T20:38:27Z

    SchemaUtils.checkColumnNameDuplication(
      reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
      conf.caseSensitiveAnalysis)
+    DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema))


newSchema also contains partition schema. How about partition schema? Do we have the same limits on it?

It's okay. Inside checkDataSchemaFieldNames, we only use table.dataSchema, as in the following.

ParquetSchemaConverter.checkFieldNames(table.dataSchema)

For the partition columns, we have been allowing the special characters.
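
The distinction can be sketched as follows: only data columns are validated, and partition columns are left alone. `Table` is a hypothetical stand-in for `CatalogTable`, and the character list is the one from the Parquet error message quoted earlier in this thread.

```scala
object DataSchemaCheck {
  // Hypothetical stand-in for CatalogTable, split the way dataSchema and
  // partitionSchema are split in the discussion above.
  final case class Table(dataColumns: Seq[String], partitionColumns: Seq[String])

  private val invalidChars: Set[Char] = " ,;{}()\n\t=".toSet

  /** Returns the data columns that would be rejected; partition columns are ignored. */
  def invalidDataColumns(table: Table): Seq[String] =
    table.dataColumns.filter(_.exists(invalidChars.contains))
}
```

Here a table partitioned by a column named `year month` passes, while a data column named `col 2` is reported.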

Could you add test cases and ensure the partitioning columns with special characters work?

DDLSuite and HiveDDLSuite have them here.

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala#L2045-L2070

https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala#L1736

I think this PR passes the above two test cases, too.

gatorsmile · 2017-09-07T05:20:26Z

LGTM

gatorsmile · 2017-09-07T05:21:05Z

Thanks! Merged to master.

dongjoon-hyun · 2017-09-07T05:28:51Z

@gatorsmile . Thank you for your help! This PR is almost made by you.

dongjoon-hyun · 2017-09-07T05:41:39Z

Thank you for reviewing and helping with this PR, @tejasapatil , @viirya , and @HyukjinKwon , too!

[SPARK-21912][SQL] Creating ORC datasource table should check invalid…

808dfe0

… column names

tejasapatil reviewed Sep 4, 2017

Address comments.

a738943

gatorsmile reviewed Sep 5, 2017

Address comments.

cd539fe

dongjoon-hyun changed the title ~~[SPARK-21912][SQL] Creating ORC datasource table should check invalid column names~~ [SPARK-21912][SQL] Creating ORC/Parquet datasource table should check invalid column names Sep 5, 2017

dongjoon-hyun commented Sep 5, 2017

Add a newline at the end of the file.

66aff54

fix

aa78eaf

Update answer file to show the result.

0bf3b43

gatorsmile reviewed Sep 5, 2017

dongjoon-hyun added 2 commits September 5, 2017 09:07

Replace parquet with json.

79929e9

Address comments.

368b242

dongjoon-hyun commented Sep 5, 2017

gatorsmile reviewed Sep 5, 2017

Use DataSourceStategy.

c70c03c

dongjoon-hyun changed the title ~~[SPARK-21912][SQL] Creating ORC/Parquet datasource table should check invalid column names~~ [SPARK-21912][SQL] ORC/Parquet datasource table should check invalid column names Sep 6, 2017

dongjoon-hyun changed the title ~~[SPARK-21912][SQL] ORC/Parquet datasource table should check invalid column names~~ [SPARK-21912][SQL] ORC/Parquet table should create invalid column names Sep 6, 2017

dongjoon-hyun changed the title ~~[SPARK-21912][SQL] ORC/Parquet table should create invalid column names~~ [SPARK-21912][SQL] ORC/Parquet table should not create invalid column names Sep 6, 2017

Add ALTER TABLE and address comments.

c6e9ab6

dongjoon-hyun commented Sep 6, 2017

viirya reviewed Sep 6, 2017

gatorsmile reviewed Sep 6, 2017

dongjoon-hyun commented Sep 6, 2017

Address comments.

46847f8

gatorsmile reviewed Sep 6, 2017

asfgit closed this in eea2b87 Sep 7, 2017

dongjoon-hyun deleted the SPARK-21912 branch September 7, 2017 05:28

AngersZhuuuu mentioned this pull request Jan 21, 2022

[SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc #35253

Closed

