[SPARK-32889][SQL] orc table column name supports special characters. by jzc928 · Pull Request #29761 · apache/spark

jzc928 · 2020-09-15T13:21:04Z

What changes were proposed in this pull request?

Make ORC table column names support special characters such as $.

Why are the changes needed?

Hive allows special characters such as $ in ORC table column names.
However, executing the command "CREATE TABLE tbl(`$` INT, b INT) using orc" in Spark fails with the error below, so Spark is not compatible with Hive here.

Column name "$" contains invalid character(s). Please use alias to rename it.;
Column name "$" contains invalid character(s). Please use alias to rename it.;
org.apache.spark.sql.AnalysisException: Column name "$" contains invalid character(s). Please use alias to rename it.;
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.checkFieldName(OrcFileFormat.scala:51)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1(OrcFileFormat.scala:59)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1$adapted(OrcFileFormat.scala:59)
    at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
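
For anyone who wants to reproduce this outside spark-shell, here is a minimal standalone sketch. It assumes Spark SQL 3.x is on the classpath. The object name OrcDollarColumnRepro, the local[1] master, and the app name are hypothetical choices for illustration, not part of the patch. On a build without this change the CREATE TABLE statement is expected to be rejected with the AnalysisException quoted in this description. With the change it is expected to succeed.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}

// Hypothetical reproduction harness; names here are illustrative only.
object OrcDollarColumnRepro {
  def main(args: Array[String]): Unit = {
    // Local session with the default (non-Hive) catalog.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("orc-special-char-column")
      .getOrCreate()
    try {
      spark.sql("DROP TABLE IF EXISTS tbl")
      // The column name is backquoted so the SQL parser accepts `$` as an identifier;
      // the ORC field-name check then decides whether the name is allowed.
      spark.sql("CREATE TABLE tbl(`$` INT, b INT) USING orc")
      println("created")
      spark.table("tbl").printSchema()
    } catch {
      // Report a rejection by the analyzer instead of letting it escape.
      case e: AnalysisException => println(s"rejected: ${e.getMessage}")
    } finally {
      spark.sql("DROP TABLE IF EXISTS tbl")
      spark.stop()
    }
  }
}
```

Running the same program against builds before and after the change gives a quick check of the behavior difference without needing a Hive metastore.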

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

wzhfy · 2020-09-15T15:09:51Z

test this please

wzhfy · 2020-09-15T15:23:44Z

also cc @dongjoon-hyun @cloud-fan

SparkQA · 2020-09-15T20:01:33Z

Test build #128719 has finished for PR 29761 at commit 0a2cca7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

Since the data source already supports special column names, I believe this PR is okay. I left a few comments, @jzc928.

scala> Seq(1, 2).toDF("$").write.orc("/tmp/orc")

scala> spark.read.orc("/tmp/orc").printSchema
root
 |-- $: integer (nullable = true)


scala> sc.version
res3: String = 3.0.1

dongjoon-hyun · 2020-09-17T05:40:49Z

Retest this please.

dongjoon-hyun · 2020-09-17T05:43:31Z

@jzc928, I left a few comments. Please update the PR accordingly. Although this differs from Parquet, it is the same as the JSON data source. So I think we can accept this approach once the PR is revised and passes the Jenkins CI tests.

jzc928 · 2020-09-17T06:50:08Z

@dongjoon-hyun The comments have been addressed.

SparkQA · 2020-09-17T07:05:02Z

Test build #128797 has finished for PR 29761 at commit c3c7f4c.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2020-09-17T14:53:06Z

Retest this please.

SparkQA · 2020-09-17T20:47:25Z

Test build #128828 has finished for PR 29761 at commit 7d63fd9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @jzc928 and @wzhfy.
Merged to master for Apache Spark 3.1.0, planned for December 2020.

dongjoon-hyun · 2020-09-17T21:53:54Z

Thank you for your first contribution, @jzc928.
You have been added to the Apache Spark contributor group, and SPARK-32889 is assigned to you.
Welcome to the Apache Spark community!

orc table column name supports special characters.

0a2cca7

probot-autolabeler Bot added the SQL label Sep 15, 2020

wzhfy reviewed Sep 15, 2020

Comment thread sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala Outdated

dongjoon-hyun reviewed Sep 15, 2020

Comment thread sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala Outdated

dongjoon-hyun reviewed Sep 15, 2020

Comment thread sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala Outdated

dongjoon-hyun requested changes Sep 15, 2020

dongjoon-hyun reviewed Sep 15, 2020

Comment thread sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala Outdated

add orc datasource column name check.

c3c7f4c

jzc928 force-pushed the orcColSpecialChar branch from b378671 to c3c7f4c on September 17, 2020 04:10

jzc928 requested review from dongjoon-hyun and wzhfy September 17, 2020 04:11