Skip to content

[WIP][SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL#25239

Closed
gengliangwang wants to merge 3 commits into
apache:masterfrom
gengliangwang:assignCast
Closed

[WIP][SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL#25239
gengliangwang wants to merge 3 commits into
apache:masterfrom
gengliangwang:assignCast

Conversation

@gengliangwang

@gengliangwang gengliangwang commented Jul 24, 2019

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column.

In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast was originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. Making it the default behavior of the table insertion is breaking change. It is possible that it would hurt some Spark users after 3.0 releases.

This PR proposes that we can follow the rules of store assignment(section 9.2) in ANSI SQL. Two significant differences from Up-Cast:

  1. Any numeric type can be assigned to another numeric type.
  2. TimestampType can be assigned DateType

The new behavior is consistent with PostgreSQL. It is more explainable and acceptable than using UpCast .
The change will be applied in Data Source V2 first. If it is merged, we can apply it into data source V1.

How was this patch tested?

Unit test

@gengliangwang

Copy link
Copy Markdown
Member Author

We can find a copy of ANSI SQL 2009 in http://jtc1sc32.org/doc/N1801-1850/32N1822T-text_for_ballot-CD_9075-2.pdf

@gengliangwang

Copy link
Copy Markdown
Member Author

* Cast the child expression to the target data type, but will throw error if the cast violates
* the store assignment rules of ANSI SQL, e.g. string -> int, array -> string.
*/
case class AssignableCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String] = Nil)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we apply it for the v2 table insertion first? We probably need another PR to apply it for v1 table insertion, as it's a behavior change and needs migration guide.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I'm not sure if we need walkedTypePath. Upcast needs it because it needs to record the field path in a class. I think for AssignableCast we only need to record the column name.

@@ -2454,7 +2455,7 @@ class Analyzer(
} else {
// always add an UpCast. it will be removed in the optimizer if it is unnecessary.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment needs update.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we might need to update the comment in the header;

   * - Insert safe casts when data types do not match

https://github.com/apache/spark/pull/25239/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R2353

if !Cast.canAssign(child.dataType, dataType) =>
fail(child, dataType, walkedTypePath)

case AssignableCast(child, dataType, _) => Cast(child, dataType.asNullable)

@maropu maropu Jul 24, 2019

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need rounding/truncating/out-of-range checks here for some cases, e.g., int->short, double->float?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the ANSI SQL, rounding/truncating is allowed.
So here we still convert the assignable writes to cast, the result is null if the conversion is out-of-range. Later on, we can add a configuration mode for throwing exceptions on out-of-range conversion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current pr, is the result null if out-of-range cases?
For example, in case of int->short casts, it seems Cast just returns a weired value for a out-of-range value?;

b => x.numeric.asInstanceOf[Numeric[Any]].toInt(b).toShort

scala> sql("create table t (s short) using parquet")
scala> sql("insert into t values (int(32768))")
// InsertIntoTable Relation[s#12] parquet, false, false
// +- Project [cast(col1#31 as smallint) AS s#32]
//    +- LocalRelation [col1#31]
scala> sql("select * from t").show
+------+
|     s|
+------+
|-32768|
+------+

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu nice catch. It seems that it is an existing issue in Cast. We can fix it first.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Just a note; I'm a bit worried about additional overheads to check valid value ranges inside Cast since it is already used in many places. As another option, I think we can wrap Cast with a If expression to check value ranges, e.g., IF('value range check', CAST, Literal(null)).

@SparkQA

SparkQA commented Jul 24, 2019

Copy link
Copy Markdown

Test build #108081 has finished for PR 25239 at commit b7fdc93.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AssignableCast(child: Expression, dataType: DataType, walkedTypePath: Seq[String] = Nil)

@gengliangwang gengliangwang changed the title [SPARK-28495][SQL] AssignableCast: A new type coercion following store assignment rules of ANSI SQL [SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL Jul 24, 2019
@gengliangwang gengliangwang changed the title [SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL [WIP][SPARK-28495][SQL] Table insertion: follow store assignment rules of ANSI SQL Jul 24, 2019
@SparkQA

SparkQA commented Jul 24, 2019

Copy link
Copy Markdown

Test build #108100 has finished for PR 25239 at commit 84c9200.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2019

Copy link
Copy Markdown

Test build #108101 has finished for PR 25239 at commit 1abcc9d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue

rdblue commented Jul 24, 2019

Copy link
Copy Markdown
Contributor

Later on, we can add a configuration mode for throwing exceptions on out-of-range conversion.

Is this later on in this PR, or in a follow-up change? If it is in a follow-up, what is the JIRA issue? I think that should be a blocker for 3.0 if it isn't included in this PR.

@rdblue

rdblue commented Jul 24, 2019

Copy link
Copy Markdown
Contributor

@gengliangwang, can you bring this up in a DISCUSS thread on the dev list? I think the decision about this behavior should include more people and requires a vote.

@gengliangwang

Copy link
Copy Markdown
Member Author

Is this later on in this PR, or in a follow-up change? If it is in a follow-up, what is the JIRA issue? I think that should be a blocker for 3.0 if it isn't included in this PR.

I meant it will be in a follow-up. I can create one.

@gengliangwang, can you bring this up in a DISCUSS thread on the dev list? I think the decision about this behavior should include more people and requires a vote.

Good suggestion. I will do it :)

@gengliangwang

Copy link
Copy Markdown
Member Author

@maropu @rdblue I have created a JIRA for the new optional behavior that throws runtime exceptions on casting failures: https://issues.apache.org/jira/browse/SPARK-28512

@rdblue

rdblue commented Aug 20, 2019

Copy link
Copy Markdown
Contributor

@gengliangwang, it looks like this intended to be added after #25453, is that correct?

@cloud-fan

Copy link
Copy Markdown
Contributor

Yes, it's blocked by #25453

@gengliangwang

Copy link
Copy Markdown
Member Author

Yes, this will be added after #25453 is merged.

@gatorsmile

Copy link
Copy Markdown
Member

#25453 has been merged. We can start working on this.

@gengliangwang

Copy link
Copy Markdown
Member Author

Close this one and open #25581

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants