Skip to content

[SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc#35253

Closed
AngersZhuuuu wants to merge 6 commits into
apache:masterfrom
AngersZhuuuu:SPARK-37965
Closed

[SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc#35253
AngersZhuuuu wants to merge 6 commits into
apache:masterfrom
AngersZhuuuu:SPARK-37965

Conversation

@AngersZhuuuu

@AngersZhuuuu AngersZhuuuu commented Jan 20, 2022

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Remove supportFieldName check in DataSource ORCFormat.

  1. org.apache.spark.sql.hive.orc.OrcFileFormat didn't add this check too
  2. Tried a lot of wield column name, all can be reading and writing.

Why are the changes needed?

Remove unnecessary check

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@AngersZhuuuu AngersZhuuuu marked this pull request as draft January 20, 2022 03:13
@github-actions github-actions Bot added the SQL label Jan 20, 2022
@HyukjinKwon

Copy link
Copy Markdown
Member

I checked the history. Seems like we added this check mainly because Parquet restricts the column names that will be removed from #35229. So this change seems fine to me but would be great to double check w/ @dongjoon-hyun

@HyukjinKwon

Copy link
Copy Markdown
Member

@AngersZhuuuu BTW, I think it would be great to explain why we can remove this change in the PR description with pointing out the commits in the history.

@HyukjinKwon HyukjinKwon changed the title [WIP][SPARK-37965][SQL]Remove check field name when reading/writing existing data in Orc [WIP][SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc Jan 20, 2022
@AngersZhuuuu

Copy link
Copy Markdown
Contributor Author

@AngersZhuuuu BTW, I think it would be great to explain why we can remove this change in the PR description with pointing out the commits in the history.

Yea, will do this later.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, @HyukjinKwon 's comment is correct.
Let's review this after #35229 landed to the master first.

Thank you for keeping Apache Spark data sources consistent.

@AngersZhuuuu

AngersZhuuuu commented Jan 21, 2022

Copy link
Copy Markdown
Contributor Author

This check is added in #19124 but change to use back quote to wrap field name in #29761

And in pr #29761 added a test test("SPARK-32889: ORC table column name supports special characters")

and a b is supported. , is not supported caused by we cannot create a table having a column whose name
contains commas in Hive metastore. It's not related to file format

@AngersZhuuuu AngersZhuuuu marked this pull request as ready for review January 21, 2022 08:26
@AngersZhuuuu AngersZhuuuu changed the title [WIP][SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc [SPARK-37965][SQL] Remove check field name when reading/writing existing data in Orc Jan 21, 2022
@AngersZhuuuu

Copy link
Copy Markdown
Contributor Author

@cloud-fan

Copy link
Copy Markdown
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5b91381 Jan 24, 2022
@dongjoon-hyun

Copy link
Copy Markdown
Member

+1, LGTM.

@cxzl25

cxzl25 commented Nov 21, 2022

Copy link
Copy Markdown
Contributor

I think we still check for the empty character case, or add the supportFieldName method back.

Here are some tests.

native

native write

set spark.sql.orc.impl=native;
create table t_1 stored as orc  as select '' ;

suceess.

native read

set spark.sql.orc.impl=native;
select t_1;
java.lang.IllegalArgumentException: Empty quoted field name at '``^'
        at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
        at org.apache.orc.OrcUtils.convertTypeFromProtobuf(OrcUtils.java:352)
        at org.apache.orc.impl.OrcTail.<init>(OrcTail.java:72)
        at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:845)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
        at org.apache.orc.OrcFile.createReader(OrcFile.java:385)
        at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:147)

hive read

set spark.sql.orc.impl=hive;
select t_1;
java.lang.IllegalArgumentException: Empty quoted field name at '``^'
        at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
        at org.apache.orc.OrcUtils.convertTypeFromProtobuf(OrcUtils.java:352)
        at org.apache.orc.impl.OrcTail.<init>(OrcTail.java:72)
        at org.apache.orc.impl.ReaderImpl.extractFileTail(ReaderImpl.java:845)
        at org.apache.orc.impl.ReaderImpl.<init>(ReaderImpl.java:566)
        at org.apache.hadoop.hive.ql.io.orc.ReaderImpl.<init>(ReaderImpl.java:63)
        at org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(OrcFile.java:55)
        at org.apache.spark.sql.hive.orc.OrcFileOperator$.$anonfun$getFileReader$3(OrcFileOperator.scala:76)

hive

hive write

set spark.sql.orc.impl=hive;
create table t_1 stored as orc  as select '' ;
java.lang.IllegalArgumentException: Error: name expected at the position 7 of 'struct<:string>' but ':' is found.
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:378)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:502)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
        at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:831)
        at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:242)
        at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:279)
        at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:110)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:161)

use HiveFileFormat

set spark.sql.hive.convertMetastoreOrc=false;
create table t_1 stored as orc  as select '' ;
Error in query:  Column name "" contains invalid character(s). Please use alias to rename it.

org.apache.spark.sql.hive.execution.HiveFileFormat#supportFieldName

case "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" =>
try {
TypeInfoUtils.getTypeInfoFromTypeString(s"struct<$name:int>")
true
} catch {
case _: IllegalArgumentException => false

@cloud-fan

Copy link
Copy Markdown
Contributor

@AngersZhuuuu is there a way to only check field name in the write side?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants