[SPARK-42855][SQL] Use runtime null checks in TableOutputResolver by aokolnychyi · Pull Request #40655 · apache/spark

aokolnychyi · 2023-04-04T03:23:51Z

What changes were proposed in this pull request?

This PR migrates TableOutputResolver to use runtime NOT NULL checks instead of checking type compatibility during the analysis phase.

Why are the changes needed?

These changes are needed per the discussion on #40308 ([SPARK-42151][SQL] Align UPDATE assignments with table attributes).

Does this PR introduce any user-facing change?

Nullability violations will now raise exceptions at runtime rather than during analysis, but there is no API change.
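
To make the behavior change concrete, here is a minimal sketch in plain Scala with no Spark dependency. `Column`, `staticCheck`, and `assertNotNull` are hypothetical names used only for illustration, not Spark APIs. The static error text is the message this PR removes elsewhere in the diff; the runtime message is made up.

```scala
// Toy model only: none of these names are Spark APIs.
object NullCheckSketch {
  final case class Column(name: String, nullable: Boolean)

  // Analysis-time style: reject the write whenever the schema alone permits a null,
  // even if the data never contains one.
  def staticCheck(input: Column, target: Column): Either[String, Unit] =
    if (input.nullable && !target.nullable)
      Left(s"Cannot write nullable values to non-null column '${target.name}'")
    else
      Right(())

  // Runtime style: accept the plan and fail only when a null value actually appears.
  def assertNotNull(values: Seq[Option[Int]], colName: String): Seq[Int] =
    values.map(_.getOrElse(
      throw new NullPointerException(s"Null value appeared in non-nullable column '$colName'")))
}
```

With a nullable input column that happens to contain no nulls, `staticCheck` rejects the write up front, while `assertNotNull` lets it through and would only throw on an actual null.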

How was this patch tested?

This PR comes with tests.

aokolnychyi · 2023-04-04T03:24:39Z

        }
        (matchedCol.dataType, expectedCol.dataType) match {
          case (matchedType: StructType, expectedType: StructType) =>
-            checkNullability(matchedCol, expectedCol, conf, addError, newColPath)


Moved nullability checks inside resolveXXX methods to share them with the position-based path.

aokolnychyi · 2023-04-04T03:25:25Z

    }
  }

+  private def resolveColumnsByPosition(


Similar recursion to what we have in the name-based path but using positions.
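
As a rough illustration of position-based recursion, the sketch below pairs input and expected fields by index and descends into nested structs. It is a simplified model in plain Scala: `DType`, `IntT`, `StructT`, `resolveByPosition`, and the mismatch message are all hypothetical and do not reflect the actual `TableOutputResolver` code, which also handles casts, arrays, and maps.

```scala
// Simplified model of by-position resolution; all names are illustrative.
object PositionResolutionSketch {
  sealed trait DType
  case object IntT extends DType
  final case class StructT(fields: Seq[(String, DType)]) extends DType

  // Pairs the i-th input field with the i-th expected field, ignoring input names.
  // Returns one "inputName -> expected.path" mapping per leaf field.
  def resolveByPosition(
      input: Seq[(String, DType)],
      expected: Seq[(String, DType)],
      addError: String => Unit,
      path: Seq[String] = Nil): Seq[String] = {
    if (input.size != expected.size) {
      addError(s"Field count mismatch at '${path.mkString(".")}'")
      Nil
    } else {
      input.zip(expected).flatMap {
        // Both sides are structs: recurse, extending the path with the expected name.
        case ((_, StructT(in)), (name, StructT(exp))) =>
          resolveByPosition(in, exp, addError, path :+ name)
        // Leaf (type mismatches are not modeled in this sketch).
        case ((inName, _), (name, _)) =>
          Seq(s"$inName -> ${(path :+ name).mkString(".")}")
      }
    }
  }
}
```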

aokolnychyi · 2023-04-04T03:25:59Z

+      val extraColsStr = inputCols.takeRight(inputCols.size - expectedCols.size)
+        .map(col => s"'${col.name}'")
+        .mkString(", ")
+      addError(s"Cannot write extra fields to struct '${colPath.quoted}': $extraColsStr")


Kept the same error messages we have in DataType$canWrite to reduce the number of test changes.
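
For reference, the message construction in the hunk above can be exercised in isolation. This is a sketch over plain strings: `extraFieldsError` is a hypothetical helper, and the column path is joined with dots as a stand-in for `colPath.quoted`, whose exact quoting rules are not reproduced here.

```scala
// Stand-alone version of the message shape shown in the diff; names are illustrative.
object ExtraFieldsSketch {
  def extraFieldsError(
      colPath: Seq[String],
      inputCols: Seq[String],
      expectedCols: Seq[String]): Option[String] =
    if (inputCols.size > expectedCols.size) {
      // The trailing input columns beyond the expected count are the "extra" ones.
      val extraColsStr = inputCols
        .takeRight(inputCols.size - expectedCols.size)
        .map(col => s"'$col'")
        .mkString(", ")
      Some(s"Cannot write extra fields to struct '${colPath.mkString(".")}': $extraColsStr")
    } else {
      None
    }
}
```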

aokolnychyi · 2023-04-04T03:28:23Z

      !Cast.canUpCast(cast.child.dataType, cast.dataType)
  }

+  private def isCompatible(tableAttr: Attribute, queryExpr: NamedExpression): Boolean = {


Moved the existing logic into a separate method as I added a nested if/else block and it became hard to read.

aokolnychyi · 2023-04-04T03:29:13Z

    val xRequiredTable = TestRelation(StructType(Seq(
      StructField("x", FloatType, nullable = false),
-      StructField("y", DoubleType))).toAttributes)
+      StructField("y", FloatType))).toAttributes)


Had to change this so the test still produces 2 error messages, since it verifies that multiple errors are reported.

aokolnychyi · 2023-04-04T03:31:00Z

+import org.apache.spark.sql.test.{SharedSparkSession, SQLTestUtils}
+import org.apache.spark.sql.types.{ArrayType, IntegerType, MapType, StructType}
+
+class RuntimeNullChecksV2Writes extends QueryTest with SQLTestUtils with SharedSparkSession {


The suite above verifies asserts are added and covers all V2 write commands. This one focuses on actually failing queries and preserving null values during rewrites where needed.

aokolnychyi · 2023-04-04T03:33:37Z

cc @gengliangwang @cloud-fan @dongjoon-hyun @viirya @huaxingao @sunchao

aokolnychyi · 2023-04-04T03:37:43Z

-      addError(s"Cannot write nullable values to non-null column '${colPath.quoted}'")
+      colPath: Seq[String]): Expression = {
+    if (requiresNullChecks(input, expected, conf)) {
+      AssertNotNull(input, colPath)


I have two concerns about the current behavior of AssertNotNull.

It throws a generic NPE, which I believe triggers task retries.

The column path is formatted with newlines, which is a bit hard to read. I would consider switching to what we have here (e.g. 'col1.nested_col1.deeply_nested_col1').

Any thoughts?

> It throws a generic NPE, which I believe triggers task retries.

This seems to be an existing problem; ANSI mode, for example, has the same issue. @gengliangwang shall we update the task retry logic to not retry if the exception has an error class that indicates a user error?

> The way column path is formatted using new lines is a bit hard to read.

We probably need to do both. This error reporting is also used for Dataset operations, where newlines are better for displaying the object path.

+1 for updating the task retry logic to avoid unnecessary retries.

+1 for displaying both formats but it is minor. We can keep it as is too.

I created SPARK-43033 and SPARK-43034. I consider those as improvements and they shouldn't be blockers.

@cloud-fan good point!
@aokolnychyi yeah we can improve it later. The failed insertion won't create partial records in the target directory anyway.
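
The retry idea discussed above (later tracked as SPARK-43033) can be sketched as follows. This is plain Scala, not Spark's scheduler: `UserDataException` and `runWithRetries` are hypothetical names, and the sketch assumes user errors can be told apart by exception type, whereas the thread suggests keying on an error class.

```scala
import scala.util.{Failure, Success, Try}

// Toy retry loop; not Spark's task scheduling code.
object RetrySketch {
  // Hypothetical marker for failures caused by the user's data rather than the cluster.
  final class UserDataException(msg: String) extends RuntimeException(msg)

  // Runs `task` up to `maxAttempts` times, passing the 1-based attempt number.
  // User errors are rethrown immediately because retrying cannot fix them.
  def runWithRetries[T](maxAttempts: Int)(task: Int => T): T = {
    @annotation.tailrec
    def loop(attempt: Int): T =
      Try(task(attempt)) match {
        case Success(value) => value
        case Failure(e: UserDataException) => throw e
        case Failure(_) if attempt < maxAttempts => loop(attempt + 1)
        case Failure(e) => throw e
      }
    loop(1)
  }
}
```

A null written to a non-null column would fail identically on every attempt, so skipping retries for such failures only saves time.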

dongjoon-hyun

Could you check the UT failures, @aokolnychyi? They look relevant.

[info] - columnresolution.sql_analyzer_test *** FAILED *** (834 milliseconds)
[info]   columnresolution.sql_analyzer_test
[info]   Expected "...#x as int) AS i1#x, [cast(col2#x as struct<i1:int,i2:int>]) AS t5#x]
[info]      +- Loc...", but got "...#x as int) AS i1#x, [named_struct(i1, cast(col2#x.col1 as int), i2, cast(col2#x.col2 as int)]) AS t5#x]
[info]      +- Loc..." Result did not match for query #41
[info]   INSERT INTO t5 VALUES(1, (2, 3)) (SQLQueryTestSuite.scala:777)

aokolnychyi · 2023-04-04T16:47:20Z

@dongjoon-hyun, let me look into test failures.

aokolnychyi · 2023-04-04T22:22:23Z

Ok, all tests have been adapted. This PR is ready for a detailed review.

gengliangwang · 2023-04-04T23:35:51Z

@aokolnychyi @cloud-fan I am +0 on changing the behavior, since I haven't heard complaints about it from end users, whereas relaxing the strict compiler check could bring complaints.

Have we considered other alternatives? For example, we could add a new function, as_not_null, that wraps the input values/columns to bypass the static checks. E.g.

insert into target select as_not_null(null_column_name) from source

aokolnychyi · 2023-04-05T00:42:53Z

@gengliangwang, this PR is based on the consensus we reached in this thread. Each approach has its own pros and cons. The primary problem is that our behavior is not consistent (e.g. inserts and updates behave differently). In that thread, the best way forward seemed to be using runtime checks everywhere. If I had to pick one approach, I think runtime checks are a bit better, as they only fail if we really have null values. Otherwise, we rely on null propagation, which may not be that reliable. That said, the primary motivation for this PR is consistent behavior, not replacing static checks with runtime checks because the latter are better.

Let me know what you think!

gengliangwang · 2023-04-05T02:48:07Z

@aokolnychyi Yes I got it. My concern was around the behavior change. I am OK with the idea and merging this one.

aokolnychyi · 2023-04-05T05:20:28Z

@gengliangwang, got it. I was initially concerned as well but I believe this is the right thing to do after we discussed it. Thanks for taking a look!

cloud-fan · 2023-04-05T08:17:13Z

thanks, merging to master

This PR migrates `TableOutputResolver` to use runtime NOT NULL checks instead of checking type compatibility during the analysis phase. These changes are needed per discussion that happened [here](apache#40308 (comment)). Nullability exceptions will be thrown at runtime (instead of analysis) but there is no API change. This PR comes with tests. Closes apache#40655 from aokolnychyi/spark-42855-v2. Authored-by: aokolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 4ad55b6)

[SPARK-42855][SQL] Use runtime null checks in TableOutputResolver

7d9ca9c

github-actions Bot added the SQL label Apr 4, 2023

aokolnychyi mentioned this pull request Apr 4, 2023

[SPARK-42151][SQL] Align UPDATE assignments with table attributes #40308

Closed

dongjoon-hyun reviewed Apr 4, 2023

Adapt tests to new logic

622ee8a

cloud-fan approved these changes Apr 5, 2023

cloud-fan closed this in 4ad55b6 Apr 5, 2023

This was referenced Apr 7, 2023

[SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks #40703

Closed

[SPARK-43033][SQL] Avoid task retries due to AssertNotNull checks #40707

Closed

dongjoon-hyun mentioned this pull request Aug 22, 2024

[SPARK-49352][SQL] Avoid redundant array transform for identical expression #47843

Closed

dongjoon-hyun mentioned this pull request Feb 27, 2026

[SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts #54517

Closed
