[WIP][SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL#25239
[WIP][SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL#25239gengliangwang wants to merge 3 commits into
Conversation
|
We can find a copy of ANSI SQL 2009 in http://jtc1sc32.org/doc/N1801-1850/32N1822T-text_for_ballot-CD_9075-2.pdf |
| * Cast the child expression to the target data type, but will throw error if the cast violates | ||
| * the store assignment rules of ANSI SQL, e.g. string -> int, array -> string. | ||
| */ | ||
| case class AssignableCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String] = Nil) |
There was a problem hiding this comment.
can we apply it for the v2 table insertion first? We probably need another PR to apply it for v1 table insertion, as it's a behavior change and needs migration guide.
There was a problem hiding this comment.
BTW I'm not sure if we need walkedTypePath. Upcast needs it because it needs to record the field path in a class. I think for AssignableCast we only need to record the column name.
| @@ -2454,7 +2455,7 @@ class Analyzer( | |||
| } else { | |||
| // always add an UpCast. it will be removed in the optimizer if it is unnecessary. | |||
There was a problem hiding this comment.
Also, we might need to update the comment in the header;
* - Insert safe casts when data types do not match
https://github.com/apache/spark/pull/25239/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R2353
| if !Cast.canAssign(child.dataType, dataType) => | ||
| fail(child, dataType, walkedTypePath) | ||
|
|
||
| case AssignableCast(child, dataType, _) => Cast(child, dataType.asNullable) |
There was a problem hiding this comment.
We don't need rounding/truncating/out-of-range checks here for some cases, e.g., int->short, double->float?
There was a problem hiding this comment.
As per the ANSI SQL, rounding/truncating is allowed.
So here we still convert the assignable writes to cast, the result is null if the conversion is out-of-range. Later on, we can add a configuration mode for throwing exceptions on out-of-range conversion.
There was a problem hiding this comment.
In the current pr, is the result null if out-of-range cases?
For example, in case of int->short casts, it seems Cast just returns a weired value for a out-of-range value?;
scala> sql("create table t (s short) using parquet")
scala> sql("insert into t values (int(32768))")
// InsertIntoTable Relation[s#12] parquet, false, false
// +- Project [cast(col1#31 as smallint) AS s#32]
// +- LocalRelation [col1#31]
scala> sql("select * from t").show
+------+
| s|
+------+
|-32768|
+------+
There was a problem hiding this comment.
@maropu nice catch. It seems that it is an existing issue in Cast. We can fix it first.
There was a problem hiding this comment.
Created https://issues.apache.org/jira/browse/SPARK-28503 for this.
There was a problem hiding this comment.
Thanks! Just a note; I'm a bit worried about additional overheads to check valid value ranges inside Cast since it is already used in many places. As another option, I think we can wrap Cast with a If expression to check value ranges, e.g., IF('value range check', CAST, Literal(null)).
|
Test build #108081 has finished for PR 25239 at commit
|
|
Test build #108100 has finished for PR 25239 at commit
|
|
Test build #108101 has finished for PR 25239 at commit
|
Is this later on in this PR, or in a follow-up change? If it is in a follow-up, what is the JIRA issue? I think that should be a blocker for 3.0 if it isn't included in this PR. |
|
@gengliangwang, can you bring this up in a DISCUSS thread on the dev list? I think the decision about this behavior should include more people and requires a vote. |
I meant it will be in a follow-up. I can create one.
Good suggestion. I will do it :) |
|
@maropu @rdblue I have created a JIRA for the new optional behavior that throws runtime exceptions on casting failures: https://issues.apache.org/jira/browse/SPARK-28512 |
|
@gengliangwang, it looks like this intended to be added after #25453, is that correct? |
|
Yes, it's blocked by #25453 |
|
Yes, this will be added after #25453 is merged. |
|
#25453 has been merged. We can start working on this. |
|
Close this one and open #25581 |
What changes were proposed in this pull request?
In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column.
In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast was originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. Making it the default behavior of the table insertion is breaking change. It is possible that it would hurt some Spark users after 3.0 releases.
This PR proposes that we can follow the rules of store assignment(section 9.2) in ANSI SQL. Two significant differences from Up-Cast:
The new behavior is consistent with PostgreSQL. It is more explainable and acceptable than using UpCast .
The change will be applied in Data Source V2 first. If it is merged, we can apply it into data source V1.
How was this patch tested?
Unit test