Skip to content

[SPARK-42855][SQL] Use runtime null checks in TableOutputResolver#40655

Closed
aokolnychyi wants to merge 2 commits into
apache:masterfrom
aokolnychyi:spark-42855-v2
Closed

[SPARK-42855][SQL] Use runtime null checks in TableOutputResolver#40655
aokolnychyi wants to merge 2 commits into
apache:masterfrom
aokolnychyi:spark-42855-v2

Conversation

@aokolnychyi

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR migrates TableOutputResolver to use runtime NOT NULL checks instead of checking type compatibility during the analysis phase.

Why are the changes needed?

These changes are needed per discussion that happened here.

Does this PR introduce any user-facing change?

Nullability exceptions will be thrown at runtime (instead of analysis) but there is no API change.

How was this patch tested?

This PR comes with tests.

@github-actions github-actions Bot added the SQL label Apr 4, 2023
}
(matchedCol.dataType, expectedCol.dataType) match {
case (matchedType: StructType, expectedType: StructType) =>
checkNullability(matchedCol, expectedCol, conf, addError, newColPath)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved nullability checks inside resolveXXX methods to share them with the position-based path.

}
}

private def resolveColumnsByPosition(

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar recursion to what we have in the name-based path but using positions.

val extraColsStr = inputCols.takeRight(inputCols.size - expectedCols.size)
.map(col => s"'${col.name}'")
.mkString(", ")
addError(s"Cannot write extra fields to struct '${colPath.quoted}': $extraColsStr")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept the same error messages we have in DataType$canWrite to reduce the number of test changes.

!Cast.canUpCast(cast.child.dataType, cast.dataType)
}

private def isCompatible(tableAttr: Attribute, queryExpr: NamedExpression): Boolean = {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the existing logic into a separate method as I added a nested if/else block and it became hard to read.

val xRequiredTable = TestRelation(StructType(Seq(
StructField("x", FloatType, nullable = false),
StructField("y", DoubleType))).toAttributes)
StructField("y", FloatType))).toAttributes)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to change this to still have 2 error messages as this test verifies multiple errors are reported.

import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StructType}

class RuntimeNullChecksV2Writes extends QueryTest with SQLTestUtils with SharedSparkSession {

@aokolnychyi aokolnychyi Apr 4, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suite above verifies asserts are added and covers all V2 write commands. This one focuses on actually failing queries and preserving null values during rewrites where needed.

@aokolnychyi

Copy link
Copy Markdown
Contributor Author

addError(s"Cannot write nullable values to non-null column '${colPath.quoted}'")
colPath: Seq[String]): Expression = {
if (requiresNullChecks(input, expected, conf)) {
AssertNotNull(input, colPath)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two concerns about the current behavior of AssertNotNull.

  • It throws a generic NPE, which I believe triggers task retries.
  • The way column path is formatted using new lines is a bit hard to read. I would consider switching to what we have here (e.g. 'col1.nested_col1.deeply_nested_col1').

Any thoughts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It throws a generic NPE, which I believe triggers task retries.

It seems to be an existing problem, for example ANSI mode. @gengliangwang shall we update the task retry logic to not retry if the exception has an error class which means a user error?

The way column path is formatted using new lines is a bit hard to read.

We probably need to do both. The error reporting is also for dataset operation and the new line is better to display the object path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for updating the task retry logic to avoid unnecessary retries.

+1 for displaying both formats but it is minor. We can keep it as is too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created SPARK-43033 and SPARK-43034. I consider those as improvements and they shouldn't be blockers.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan good point!
@aokolnychyi yeah we can improve it later. The failed insertion won't create partial records in the target directory anyway.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check the UT failures, @aokolnychyi ? This looks relevant.

[info] - columnresolution.sql_analyzer_test *** FAILED *** (834 milliseconds)
[info]   columnresolution.sql_analyzer_test
[info]   Expected "...#x as int) AS i1#x, [cast(col2#x as struct<i1:int,i2:int>]) AS t5#x]
[info]      +- Loc...", but got "...#x as int) AS i1#x, [named_struct(i1, cast(col2#x.col1 as int), i2, cast(col2#x.col2 as int)]) AS t5#x]
[info]      +- Loc..." Result did not match for query #41
[info]   INSERT INTO t5 VALUES(1, (2, 3)) (SQLQueryTestSuite.scala:777)

@aokolnychyi

Copy link
Copy Markdown
Contributor Author

@dongjoon-hyun, let me look into test failures.

@aokolnychyi

Copy link
Copy Markdown
Contributor Author

Ok, all tests have been adapted. This PR is ready for a detailed review.

@gengliangwang

Copy link
Copy Markdown
Member

@aokolnychyi @cloud-fan I am +0 for changing the behavior since I haven't heard complaints about this from end-users. Instead, relaxing the strict compiler check can bring complaints.

Do we consider other alternatives? For example, we can have a new function as_not_null appending on the input value/columns to bypass the static checks? E.g.

insert into target select as_not_null(null_column_name) from source

@aokolnychyi

Copy link
Copy Markdown
Contributor Author

@gengliangwang, this PR is based on the consensus we reached in this thread. Each approach has its own pros/cons. The primary problem is that our behavior is not consistent (e.g. inserts and updates behave differently). In that thread, it seemed the best way forward is to use runtime checks everywhere. If I were to pick one approach, I think runtime checks are a bit better as they only fail if we really have null values. Otherwise, we rely on null propagation, which may not be that reliable. The primary motivation for this PR is to have consistent behavior rather than replace static checks with runtime checks as those are better.

Let me know what you think!

@gengliangwang

Copy link
Copy Markdown
Member

@aokolnychyi Yes I got it. My concern was around the behavior change. I am OK with the idea and merging this one.

@aokolnychyi

aokolnychyi commented Apr 5, 2023

Copy link
Copy Markdown
Contributor Author

@gengliangwang, got it. I was initially concerned as well but I believe this is the right thing to do after we discussed it. Thanks for taking a look!

@cloud-fan

Copy link
Copy Markdown
Contributor

thanks, merging to master

@cloud-fan cloud-fan closed this in 4ad55b6 Apr 5, 2023
viirya pushed a commit to viirya/spark-1 that referenced this pull request Oct 19, 2023
This PR migrates `TableOutputResolver` to use runtime NOT NULL checks instead of checking type compatibility during the analysis phase.

These changes are needed per discussion that happened [here](apache#40308 (comment)).

Nullability exceptions will be thrown at runtime (instead of analysis) but there is no API change.

This PR comes with tests.

Closes apache#40655 from aokolnychyi/spark-42855-v2.

Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4ad55b6)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants