[WIP][SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL by gengliangwang · Pull Request #25239 · apache/spark

gengliangwang · 2019-07-24T07:18:30Z

What changes were proposed in this pull request?

In Spark version 2.4 and earlier, when inserting into a table, Spark casts the data type of the input query to the data type of the target table by coercion. This can be confusing: e.g., a user can mistakenly write string values to an int column.

In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double and long -> int are not. The rules of UpCast were originally created for Dataset type coercion. They are quite strict and differ from the behavior of all existing popular DBMSs. Making them the default behavior of table insertion is a breaking change, and it may hurt some Spark users after the 3.0 release.

This PR proposes that we follow the store assignment rules (section 9.2) of ANSI SQL. There are two significant differences from UpCast:

Any numeric type can be assigned to another numeric type.
TimestampType can be assigned to DateType.

The new behavior is consistent with PostgreSQL. It is more explainable and acceptable than using UpCast.
The change will be applied in data source V2 first. If it is merged, we can apply it to data source V1.
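
A minimal sketch of how the proposed rule differs from UpCast, using a toy type model rather than Spark's own DataType or Cast classes. The names `SqlType`, `canUpCast`, and `canAssign` below are hypothetical. `canUpCast` models only the cases named in the description, and `canAssign` assumes the proposal amounts to "UpCast plus the two differences listed above".

```scala
// Toy model only: these types and functions are illustrative and are not Spark's API.
object StoreAssignmentSketch {
  sealed trait SqlType
  case object IntT extends SqlType
  case object LongT extends SqlType
  case object DoubleT extends SqlType
  case object DecimalT extends SqlType
  case object StringT extends SqlType
  case object DateT extends SqlType
  case object TimestampT extends SqlType

  private val numeric: Set[SqlType] = Set(IntT, LongT, DoubleT, DecimalT)

  // Strict policy: only the widening cases mentioned in the PR description.
  def canUpCast(from: SqlType, to: SqlType): Boolean = (from, to) match {
    case (a, b) if a == b => true
    case (IntT, LongT)    => true
    case (IntT, StringT)  => true
    case _                => false
  }

  // Assumed reading of the proposal: everything UpCast allows, plus
  // numeric -> numeric and timestamp -> date.
  def canAssign(from: SqlType, to: SqlType): Boolean =
    canUpCast(from, to) ||
      (numeric(from) && numeric(to)) ||
      (from == TimestampT && to == DateT)
}
```

Under this sketch, long -> int and decimal -> double are rejected by `canUpCast` but accepted by `canAssign`, while string -> int is rejected by both.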

How was this patch tested?

Unit test

gengliangwang · 2019-07-24T07:19:19Z

A copy of ANSI SQL 2009 is available at http://jtc1sc32.org/doc/N1801-1850/32N1822T-text_for_ballot-CD_9075-2.pdf

gengliangwang · 2019-07-24T07:19:42Z

@cloud-fan @maropu @rdblue @gatorsmile

cloud-fan · 2019-07-24T09:13:38Z

+ * Cast the child expression to the target data type, but will throw error if the cast violates
+ * the store assignment rules of ANSI SQL, e.g. string -> int, array -> string.
+ */
+case class AssignableCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String] = Nil)


Can we apply it to v2 table insertion first? We probably need another PR to apply it to v1 table insertion, as it's a behavior change and needs a migration guide.

BTW I'm not sure if we need walkedTypePath. Upcast needs it because it needs to record the field path in a class. I think for AssignableCast we only need to record the column name.

cloud-fan · 2019-07-24T09:22:44Z

@@ -2454,7 +2455,7 @@ class Analyzer(
      } else {
        // always add an UpCast. it will be removed in the optimizer if it is unnecessary.


The comment needs an update.

Also, we might need to update the comment in the header:

* - Insert safe casts when data types do not match

https://github.com/apache/spark/pull/25239/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R2353

maropu · 2019-07-24T10:05:26Z

+          if !Cast.canAssign(child.dataType, dataType) =>
+          fail(child, dataType, walkedTypePath)
+
+        case AssignableCast(child, dataType, _) => Cast(child, dataType.asNullable)


We don't need rounding/truncating/out-of-range checks here for some cases, e.g., int->short, double->float?

As per ANSI SQL, rounding/truncating is allowed.
So here we still convert the assignable writes to Cast; the result is null if the conversion is out of range. Later on, we can add a configuration mode for throwing exceptions on out-of-range conversion.

In the current PR, is the result null in out-of-range cases?
For example, in the case of int -> short casts, it seems Cast just returns a weird value for an out-of-range value:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala

Line 479 in 167fa04

b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b).toShort

scala> sql("create table t (s short) using parquet")
scala> sql("insert into t values (int(32768))")
// InsertIntoTable Relation[s#12] parquet, false, false
// +- Project [cast(col1#31 as smallint) AS s#32]
//    +- LocalRelation [col1#31]
scala> sql("select * from t").show
+------+
|     s|
+------+
|-32768|
+------+
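
The wraparound in the session above can be reproduced outside Spark, because the quoted Cast line ends in a plain `.toShort`. The following is a small illustration in plain Scala; `NarrowingDemo` and `narrow` are hypothetical names, not part of Spark.

```scala
// Illustrative only: shows why an unchecked Int -> Short narrowing wraps around.
object NarrowingDemo {
  // Int.toShort keeps only the low 16 bits, so out-of-range inputs
  // silently map to a different in-range value instead of failing.
  def narrow(i: Int): Short = i.toShort
}
```

With this helper, `NarrowingDemo.narrow(32768)` yields -32768, matching the value stored in the table above, while in-range inputs such as 32767 are unchanged.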

@maropu nice catch. It seems that it is an existing issue in Cast. We can fix it first.

Created https://issues.apache.org/jira/browse/SPARK-28503 for this.

Thanks! Just a note: I'm a bit worried about the additional overhead of checking valid value ranges inside Cast, since it is already used in many places. As another option, I think we can wrap Cast with an If expression to check value ranges, e.g., IF('value range check', CAST, Literal(null)).
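
The IF('value range check', CAST, Literal(null)) idea can be sketched in plain Scala, with `Option` standing in for a nullable SQL value. This is only an illustration of the shape of the check under that assumption; `RangeCheckedCast` and `toShortOrNull` are hypothetical names, and the real change would be expressed with Catalyst expressions.

```scala
// Illustrative only: a range check wrapped around the narrowing cast.
object RangeCheckedCast {
  // None plays the role of SQL NULL for values that do not fit in a Short.
  def toShortOrNull(i: Int): Option[Short] =
    if (i >= Short.MinValue && i <= Short.MaxValue) Some(i.toShort) else None
}
```

Keeping the check outside the cast means existing callers of the unchecked cast pay no extra cost, which is the concern raised in the comment above.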

SparkQA · 2019-07-24T11:57:02Z

Test build #108081 has finished for PR 25239 at commit b7fdc93.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AssignableCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String] = Nil)

SparkQA · 2019-07-24T12:48:44Z

Test build #108100 has finished for PR 25239 at commit 84c9200.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-24T14:29:23Z

Test build #108101 has finished for PR 25239 at commit 1abcc9d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-07-24T16:49:31Z

Later on, we can add a configuration mode for throwing exceptions on out-of-range conversion.

Is this later on in this PR, or in a follow-up change? If it is in a follow-up, what is the JIRA issue? I think that should be a blocker for 3.0 if it isn't included in this PR.

rdblue · 2019-07-24T17:38:51Z

@gengliangwang, can you bring this up in a DISCUSS thread on the dev list? I think the decision about this behavior should include more people and requires a vote.

gengliangwang · 2019-07-24T17:41:26Z

Is this later on in this PR, or in a follow-up change? If it is in a follow-up, what is the JIRA issue? I think that should be a blocker for 3.0 if it isn't included in this PR.

I meant it will be in a follow-up. I can create one.

@gengliangwang, can you bring this up in a DISCUSS thread on the dev list? I think the decision about this behavior should include more people and requires a vote.

Good suggestion. I will do it :)

gengliangwang · 2019-07-25T07:43:22Z

@maropu @rdblue I have created a JIRA for the new optional behavior that throws runtime exceptions on casting failures: https://issues.apache.org/jira/browse/SPARK-28512

rdblue · 2019-08-20T23:03:50Z

@gengliangwang, it looks like this is intended to be added after #25453, is that correct?

cloud-fan · 2019-08-21T00:28:33Z

Yes, it's blocked by #25453

gengliangwang · 2019-08-21T02:57:37Z

Yes, this will be added after #25453 is merged.

gatorsmile · 2019-08-24T06:36:25Z

#25453 has been merged. We can start working on this.

gengliangwang · 2019-08-26T11:40:47Z

Close this one and open #25581

Assignable Cast

b7fdc93

dongjoon-hyun added the SQL label Jul 24, 2019

cloud-fan reviewed Jul 24, 2019

maropu reviewed Jul 24, 2019

simplify code

84c9200

gengliangwang changed the title ~~[SPARK-28495][SQL] AssignableCast: A new type coercion following store assignment rules of ANSI SQL~~ [SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL Jul 24, 2019

gengliangwang changed the title ~~[SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL~~ [WIP][SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL Jul 24, 2019

fix build

1abcc9d

maropu mentioned this pull request Jul 31, 2019

[SPARK-28503][SQL] Return null result on cast an out-of-range value to a integral type #25300

Closed

gengliangwang mentioned this pull request Aug 14, 2019

[SPARK-28730][SQL] Configurable type coercion policy for table insertion #25453

Closed

gengliangwang closed this Aug 26, 2019


7 participants
