[SPARK-42855][SQL] Use runtime null checks in TableOutputResolver#40655
[SPARK-42855][SQL] Use runtime null checks in TableOutputResolver#40655aokolnychyi wants to merge 2 commits into
Conversation
| } | ||
| (matchedCol.dataType, expectedCol.dataType) match { | ||
| case (matchedType: StructType, expectedType: StructType) => | ||
| checkNullability(matchedCol, expectedCol, conf, addError, newColPath) |
There was a problem hiding this comment.
Moved nullability checks inside resolveXXX methods to share them with the position-based path.
| } | ||
| } | ||
|
|
||
| private def resolveColumnsByPosition( |
There was a problem hiding this comment.
Similar recursion to what we have in the name-based path but using positions.
| val extraColsStr = inputCols.takeRight(inputCols.size - expectedCols.size) | ||
| .map(col => s"'${col.name}'") | ||
| .mkString(", ") | ||
| addError(s"Cannot write extra fields to struct '${colPath.quoted}': $extraColsStr") |
There was a problem hiding this comment.
Kept the same error messages we have in DataType$canWrite to reduce the number of test changes.
| !Cast.canUpCast(cast.child.dataType, cast.dataType) | ||
| } | ||
|
|
||
| private def isCompatible(tableAttr: Attribute, queryExpr: NamedExpression): Boolean = { |
There was a problem hiding this comment.
Moved the existing logic into a separate method as I added a nested if/else block and it became hard to read.
| val xRequiredTable = TestRelation(StructType(Seq( | ||
| StructField("x", FloatType, nullable = false), | ||
| StructField("y", DoubleType))).toAttributes) | ||
| StructField("y", FloatType))).toAttributes) |
There was a problem hiding this comment.
Had to change this to still have 2 error messages as this test verifies multiple errors are reported.
| import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils} | ||
| import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StructType} | ||
|
|
||
| class RuntimeNullChecksV2Writes extends QueryTest with SQLTestUtils with SharedSparkSession { |
There was a problem hiding this comment.
The suite above verifies asserts are added and covers all V2 write commands. This one focuses on actually failing queries and preserving null values during rewrites where needed.
| addError(s"Cannot write nullable values to non-null column '${colPath.quoted}'") | ||
| colPath: Seq[String]): Expression = { | ||
| if (requiresNullChecks(input, expected, conf)) { | ||
| AssertNotNull(input, colPath) |
There was a problem hiding this comment.
I have two concerns about the current behavior of AssertNotNull.
- It throws a generic NPE, which I believe triggers task retries.
- The way column path is formatted using new lines is a bit hard to read. I would consider switching to what we have here (e.g.
'col1.nested_col1.deeply_nested_col1').
Any thoughts?
There was a problem hiding this comment.
It throws a generic NPE, which I believe triggers task retries.
It seems to be an existing problem, for example ANSI mode. @gengliangwang shall we update the task retry logic to not retry if the exception has an error class which means a user error?
The way column path is formatted using new lines is a bit hard to read.
We probably need to do both. The error reporting is also for dataset operation and the new line is better to display the object path.
There was a problem hiding this comment.
+1 for updating the task retry logic to avoid unnecessary retries.
+1 for displaying both formats but it is minor. We can keep it as is too.
There was a problem hiding this comment.
I created SPARK-43033 and SPARK-43034. I consider those as improvements and they shouldn't be blockers.
There was a problem hiding this comment.
@cloud-fan good point!
@aokolnychyi yeah we can improve it later. The failed insertion won't create partial records in the target directory anyway.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Could you check the UT failures, @aokolnychyi ? This looks relevant.
[info] - columnresolution.sql_analyzer_test *** FAILED *** (834 milliseconds)
[info] columnresolution.sql_analyzer_test
[info] Expected "...#x as int) AS i1#x, [cast(col2#x as struct<i1:int,i2:int>]) AS t5#x]
[info] +- Loc...", but got "...#x as int) AS i1#x, [named_struct(i1, cast(col2#x.col1 as int), i2, cast(col2#x.col2 as int)]) AS t5#x]
[info] +- Loc..." Result did not match for query #41
[info] INSERT INTO t5 VALUES(1, (2, 3)) (SQLQueryTestSuite.scala:777)
|
@dongjoon-hyun, let me look into test failures. |
|
Ok, all tests have been adapted. This PR is ready for a detailed review. |
|
@aokolnychyi @cloud-fan I am +0 for changing the behavior since I haven't heard complaints about this from end-users. Instead, relaxing the strict compiler check can bring complaints. Do we consider other alternatives? For example, we can have a new function |
|
@gengliangwang, this PR is based on the consensus we reached in this thread. Each approach has its own pros/cons. The primary problem is that our behavior is not consistent (e.g. inserts and updates behave differently). In that thread, it seemed the best way forward is to use runtime checks everywhere. If I were to pick one approach, I think runtime checks are a bit better as they only fail if we really have null values. Otherwise, we rely on null propagation, which may not be that reliable. The primary motivation for this PR is to have consistent behavior rather than replace static checks with runtime checks as those are better. Let me know what you think! |
|
@aokolnychyi Yes I got it. My concern was around the behavior change. I am OK with the idea and merging this one. |
|
@gengliangwang, got it. I was initially concerned as well but I believe this is the right thing to do after we discussed it. Thanks for taking a look! |
|
thanks, merging to master |
This PR migrates `TableOutputResolver` to use runtime NOT NULL checks instead of checking type compatibility during the analysis phase. These changes are needed per discussion that happened [here](apache#40308 (comment)). Nullability exceptions will be thrown at runtime (instead of analysis) but there is no API change. This PR comes with tests. Closes apache#40655 from aokolnychyi/spark-42855-v2. Authored-by: aokolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4ad55b6)
What changes were proposed in this pull request?
This PR migrates
TableOutputResolverto use runtime NOT NULL checks instead of checking type compatibility during the analysis phase.Why are the changes needed?
These changes are needed per discussion that happened here.
Does this PR introduce any user-facing change?
Nullability exceptions will be thrown at runtime (instead of analysis) but there is no API change.
How was this patch tested?
This PR comes with tests.