[SPARK-42151][SQL] Align UPDATE assignments with table attributes by aokolnychyi · Pull Request #40308 · apache/spark

aokolnychyi · 2023-03-06T21:51:16Z

What changes were proposed in this pull request?

This PR adds a rule to align UPDATE assignments with table attributes.

Why are the changes needed?

These changes are needed so that we can rewrite UPDATE statements into executable plans for tables that support row-level operations. In particular, our row-level mutation framework assumes Spark is responsible for building an updated version of each affected row and that row is passed back to the data source.
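To make "aligning assignments with table attributes" concrete, here is a minimal sketch for top-level columns only. It is a Spark-free model, not the rule added by this PR: the `AlignTopLevel` object, the `Assignment` case class, and the use of plain strings for column names and values are all assumptions made for illustration.

```scala
// Hypothetical, Spark-free model: column names and value expressions are plain strings.
object AlignTopLevel {
  final case class Assignment(key: String, value: String)

  // Produce exactly one assignment per table column, in table order.
  // Columns the UPDATE did not mention are assigned their own current value.
  def align(columns: Seq[String], updates: Seq[Assignment]): Seq[Assignment] = {
    val unknown = updates.map(_.key).filterNot(columns.contains)
    require(unknown.isEmpty, s"Unknown columns: ${unknown.mkString(", ")}")
    val byKey = updates.groupBy(_.key)
    columns.map { col =>
      byKey.get(col) match {
        case Some(Seq(single)) => single
        case Some(_) => throw new IllegalArgumentException(s"Multiple assignments for $col")
        case None => Assignment(col, col)
      }
    }
  }
}
```

With columns `id, name, age` and `SET age = 30`, the aligned list is `id = id, name = name, age = 30`, which is enough to build a full updated row to hand back to the data source.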

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR comes with tests.

aokolnychyi · 2023-03-06T21:59:17Z

We need to think about a reliable way to check if default values have been resolved. Right now, it simply relies on the order of rules, which is fragile. Ideas are welcome.

aokolnychyi · 2023-03-06T21:59:55Z

I follow what we do for V2 inserts.

aokolnychyi · 2023-03-07T04:51:28Z

The logic in this method tries to follow the by-name resolution we have in V2 tables.

aokolnychyi · 2023-03-07T05:01:53Z

cc @huaxingao @cloud-fan @dongjoon-hyun @sunchao @viirya @gengliangwang

aokolnychyi · 2023-03-07T05:06:51Z

I am using DATATYPE_MISMATCH as it seems appropriate.

aokolnychyi · 2023-03-08T03:44:57Z

@cloud-fan, this doc gives a bit more detail about why this PR is a prerequisite for rewriting UPDATEs. Let me know if this makes sense!

aokolnychyi · 2023-03-08T03:48:19Z

We can apply this rule only if the table implements SupportsRowLevelOperations, if that feels safer?

What happens if the table doesn't implement SupportsRowLevelOperations?

I plan to ignore such statements when rewriting UPDATEs into executable plans, like we do today for DELETE. This would allow data sources to inject their own handling.

cloud-fan · 2023-03-10T14:45:02Z

hmm, do we expect a data source that can directly update an inner field? For such data sources, this is a regression.

Thinking about this more, I think this is required by the row level operation framework so we have no choice. Data sources can skip it (skipSchemaResolution return true) and use a more advanced implementation if they can.

Correct, the existing row-level APIs assume Spark is responsible for building an updated version of the row. That should work for Delta, Iceberg, Hudi, Hive ACID.

Once there is another use case, we should be able to extend the framework to cover it.

Let me also know if you think we should only apply this to implementations of SupportsRowLevelOperations.
Here is the original question.

cloud-fan · 2023-03-10T14:56:06Z

how about GetArrayStructField?

For ALTER COLUMN we support a special syntax to reference any inner field, for example, array_col.element.field1, map_col.key.field2, etc. Shall we support this syntax in UPDATE as well? The related code is in StructType.findNestedField

We should eventually. This PR doesn't support updating arrays or maps, though. I wanted to work on it later and unblock further row-level operation development for now. For now, I throw an exception and support only nested fields in structs.

Actually, I can add support for those expressions here but fail temporarily in the rewrite logic.

viirya · 2023-03-13T00:30:42Z

Maybe add a comment that the order of this rule cannot be changed for now.

Added a comment above.

johanl-db · 2023-03-13T15:27:10Z

There should be a test case for "UPDATE nested_struct_table SET s.n_i = 1" that ensures the struct s.n_s is preserved as a whole instead of recursing and generating assignments for each of its children.

This is important if s.n_s contains null values: the assignments must be (s.n_i = 1, s.n_s = s.n_s), not (s.n_i = 1, s.n_s.dn_i = s.n_s.dn_i, s.n_s.dn_l = s.n_s.dn_l) so that s.n_s is still null after the update.

Will make sure there is a test case for this.
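The null-preservation concern can be shown with a small model. This is only a sketch: `Inner`, `keepWhole`, and `rebuildFromFields` are hypothetical names, and `Option` stands in for SQL nullability.

```scala
object NullStructPreservation {
  // Hypothetical model of s.n_s: a nullable struct with two nullable fields.
  final case class Inner(dn_i: Option[Int], dn_l: Option[Long])

  // Alignment that keeps the struct whole: s.n_s = s.n_s.
  def keepWhole(ns: Option[Inner]): Option[Inner] = ns

  // Alignment that recurses into the children and rebuilds the struct:
  // s.n_s.dn_i = s.n_s.dn_i, s.n_s.dn_l = s.n_s.dn_l.
  // Reading a field of a null struct yields null, but the rebuilt struct
  // itself is no longer null.
  def rebuildFromFields(ns: Option[Inner]): Option[Inner] =
    Some(Inner(ns.flatMap(_.dn_i), ns.flatMap(_.dn_l)))
}
```

For a null `s.n_s`, `keepWhole` returns null while `rebuildFromFields` returns a non-null struct whose fields are all null, which is the behavior change the requested test guards against.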

### What changes were proposed in this pull request?

This PR migrates `TableOutputResolver` to use runtime NOT NULL checks instead of checking type compatibility during the analysis phase.

### Why are the changes needed?

These changes are needed per discussion that happened [here](#40308 (comment)).

### Does this PR introduce _any_ user-facing change?

Nullability exceptions will be thrown at runtime (instead of analysis) but there is no API change.

### How was this patch tested?

This PR comes with tests.

Closes #40655 from aokolnychyi/spark-42855-v2.

Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

aokolnychyi · 2023-04-07T00:06:48Z

I have removed this resolution but the logic is the same: runtime null checks, varchar/char length checks, etc.

aokolnychyi · 2023-04-07T00:18:04Z

@johanl-db, here is the test we talked about. If you have time to contribute any other tests or to check the alignment logic works for Delta, it would be great!

aokolnychyi · 2023-04-07T16:29:16Z

Failures don't seem to be related.

dongjoon-hyun

Thank you for updating PR, @aokolnychyi .

cloud-fan · 2023-04-10T15:53:42Z

A few ideas to make the code more robust:

1. I think it's better to operate on the resolved column expressions, instead of turning back the expression to Seq[String]

2. Given the parser rule for the UPDATE command, the column expression can only be AttributeReference or accessing (array of) struct's fields. We can group by expr.references.head to get a map from AttributeReference to Seq[Expression] and the corresponding update expressions.

3. We validate the map we got in step 2: for each top-level column, its expressions must be of the same tree height (to avoid updating both 'a.b' and 'a.b.c'), and must be different from each other.

4. Now it's easy to build the new update expressions: for each top-level column, if it doesn't have a match in the map, use the actual column value as the update expression, else ... (same algorithm below)

it's better to operate on the resolved column expressions

I agree, let's see if we can avoid the conversion to references.

We validate the map we got in step 2: for each top-level column, its expressions must be of the same tree height (to avoid updating both 'a.b' and 'a.b.c')

Could you elaborate a bit on how you see the tree height check? Like add a separate method for computing expression height? What about cases when it is OK to have different expression heights like 'a.b.n1' and 'a.c' where a, b, c are all structs?

ah you are right, we can't simply check the tree height. I think a better way is to use an ExpressionSet to make sure these column expressions have no duplication.

Using ExpressionSet to detect duplicate assignments to a.b.c and a.b.c would be easy. What about cases like a.b and a.b.c where we assign a value to a struct and its field at the same time? Are you thinking of recursively adding all subparts of each column key to ExpressionSet? For instance, we would need to add a.b, a.b.c, a.b.c.d to ExpressionSet for a.b.c.d?

We could probably build ExpressionSet for each update key per top-level attribute and check the intersection across all ExpressionSet is empty. Let me know if that's similar to what you thought.

yea this SGTM.
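The validation rule under discussion (a.b.c and a.b.d may both be assigned, a.b and a.b.c may not) can be sketched as a prefix check. Note that this swaps in plain dotted string paths for the ExpressionSet approach mentioned above; `AssignmentConflicts`, `Path`, `path`, and `conflicts` are hypothetical names.

```scala
object AssignmentConflicts {
  // A key such as a.b.c modeled as Seq("a", "b", "c").
  type Path = Seq[String]

  def path(s: String): Path = s.split('.').toSeq

  // True if `a` equals `b` or is an ancestor of `b` (a.b is an ancestor of a.b.c).
  private def isPrefix(a: Path, b: Path): Boolean = b.startsWith(a)

  // Returns every pair of keys where one key equals, or is nested inside, the other.
  def conflicts(keys: Seq[Path]): Seq[(Path, Path)] =
    for {
      (k1, i) <- keys.zipWithIndex
      (k2, j) <- keys.zipWithIndex
      if i < j && (isPrefix(k1, k2) || isPrefix(k2, k1))
    } yield (k1, k2)
}
```

Under this model, keys of different depth such as a.b.n1 and a.c are accepted, which is the case that ruled out a pure tree-height comparison.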

@cloud-fan, any ideas on how to avoid deconstructing Seq[String] when applying a set of assignments to a top-level attribute? The problem is that we recurse top to bottom in applyUpdates whereas assignment.key is a set of nested GetStructField calls with the outer expression referring to the leaf column.

I can see ways to perform the validation without converting keys to Seq[String] but I don't see an easy way to avoid that in applyUpdates.

I'm thinking about this

def alignAssignments(
    assignments: Seq[Assignment],
    attrs: Seq[Attribute]): Seq[Assignment] = {
  // use ExpressionSet to check assignments have no duplication
  ...
  attrs.map { attr =>
    Assignment(attr, applyUpdates(assignments, attr))
  }
}

def applyUpdates(
    assignments: Seq[Assignment],
    col: Expression): Expression = {
  val (exactAssignments, others) = assignments.partition { assignment =>
    assignment.key.semanticEquals(col)
  }
  val relatedAssignments = others.filter { assignment =>
    assignment.key.exists(_.semanticEquals(col))
  }
  assert(exactAssignments.length <= 1)
  if (exactAssignments.nonEmpty) {
    if (relatedAssignments.nonEmpty) fail...
    exactAssignments.head.value
  } else {
    if (relatedAssignments.isEmpty) {
      col
    } else {
      assert(col.dataType.isInstanceOf[StructType])
      CreatedStruct(col.dataType.asInstanceOf[StructType].fields.flatMap { field =>
        Literal(field.name) :: applyUpdates(relatedAssignments, GetStruct(col, field.name)) :: Nil
      })
    }
  }
}

Perfect, I forgot about exists. Thanks!
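For readers who want to run the recursion, here is a self-contained sketch of the same idea on a miniature expression tree. It is not Catalyst code: `Attr`, `GetField`, `Lit`, `MakeStruct`, and `contains` are hypothetical stand-ins, with case-class equality playing the role of `semanticEquals` and `contains` the role of `key.exists(...)`.

```scala
object ApplyUpdatesSketch {
  // Hypothetical miniature type and expression model; not Spark's Catalyst classes.
  sealed trait Type
  case object IntType extends Type
  final case class StructType(fields: Seq[(String, Type)]) extends Type

  sealed trait Expr { def dataType: Type }
  final case class Attr(name: String, dataType: Type) extends Expr
  final case class GetField(child: Expr, field: String) extends Expr {
    def dataType: Type = child.dataType match {
      case StructType(fields) => fields.toMap.apply(field)
      case other => throw new IllegalArgumentException(s"$other has no fields")
    }
  }
  final case class Lit(value: Int) extends Expr { val dataType: Type = IntType }
  final case class MakeStruct(fields: Seq[(String, Expr)]) extends Expr {
    def dataType: Type = StructType(fields.map { case (n, e) => n -> e.dataType })
  }
  final case class Assignment(key: Expr, value: Expr)

  // True if `target` occurs anywhere on the field-access chain of `expr`.
  def contains(expr: Expr, target: Expr): Boolean = expr == target || (expr match {
    case GetField(child, _) => contains(child, target)
    case _ => false
  })

  def applyUpdates(assignments: Seq[Assignment], col: Expr): Expr = {
    val (exact, others) = assignments.partition(_.key == col)
    val related = others.filter(a => contains(a.key, col))
    if (exact.size > 1) sys.error(s"Multiple assignments for $col")
    if (exact.nonEmpty) {
      // Assigning both a struct and one of its fields is rejected.
      if (related.nonEmpty) sys.error(s"Conflicting assignments for $col and its fields")
      exact.head.value
    } else if (related.isEmpty) {
      col // untouched: keep the current value as a whole
    } else {
      col.dataType match {
        case StructType(fields) =>
          // Rebuild the struct field by field, recursing only where needed.
          MakeStruct(fields.map { case (name, _) =>
            name -> applyUpdates(related, GetField(col, name))
          })
        case other => sys.error(s"Cannot update fields of non-struct type $other")
      }
    }
  }

  def align(assignments: Seq[Assignment], attrs: Seq[Attr]): Seq[Assignment] =
    attrs.map(attr => Assignment(attr, applyUpdates(assignments, attr)))
}
```

In this model, `SET s.n_i = 1` on a table with columns `id` and `s` yields `id = id` and `s = struct(n_i = 1, n_s = s.n_s)`: the untouched nested struct is carried over whole rather than rebuilt from its children.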

aokolnychyi · 2023-04-15T00:50:37Z

+      assignment.key.exists(_.semanticEquals(colExpr))
+    }
+
+    if (exactAssignments.size > 1) {


@cloud-fan, I've changed the approach to avoid deconstructing references. However, I decided to keep the validation while recursing vs doing this in a separate step as we discussed. When I tried to implement that idea, it turned out to be pretty involved with lots of edge cases. For instance, we can't have multiple assignments per top-level key but keys can reference top-level fields many times, a.b.c and a.b.d are allowed but a.b and a.b.c are not.

It felt easier to validate while recursing, similar to TableOutputResolver.

Let me know what you think.

aokolnychyi · 2023-04-15T00:57:24Z

  override def dataType: DataType = throw new UnresolvedException("nullable")
  override def left: Expression = key
  override def right: Expression = value
+  override def sql: String = s"${key.sql} = ${value.sql}"


Added this for better error messages.

aokolnychyi · 2023-04-15T01:19:22Z

+  private def requiresAlignment(table: LogicalPlan): Boolean = {
+    EliminateSubqueryAliases(table) match {
+      case r: NamedRelation if r.skipSchemaResolution => false
+      case DataSourceV2Relation(_: SupportsRowLevelOperations, _, _, _, _) => true


@viirya, I decided not to align assignments if tables don't extend SupportsRowLevelOperations. That way, data sources using their own implementations won't be affected. They can still use AssignmentUtils.
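The decision described here can be modeled in a few lines. This is a sketch under assumptions: `Table`, `RowLevelTable`, and `Relation` are hypothetical stand-ins for the relation and table traits named in the diff, and the fall-through `false` case is inferred from the comment above rather than copied from the patch.

```scala
object RequiresAlignmentSketch {
  // Hypothetical stand-ins; not Spark's NamedRelation or DataSourceV2Relation.
  trait Table
  trait RowLevelTable extends Table
  final case class Relation(table: Table, skipSchemaResolution: Boolean)

  def requiresAlignment(relation: Relation): Boolean = relation match {
    case r if r.skipSchemaResolution => false  // source opted out of schema resolution
    case Relation(_: RowLevelTable, _) => true // Spark builds the updated rows
    case _ => false                            // custom handling stays untouched
  }
}
```

The order of the cases matters: a table that supports row-level operations but skips schema resolution is still left unaligned.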

aokolnychyi · 2023-04-17T17:21:43Z

+    }
+  }
+
+  private def resolveAssignments(p: LogicalPlan): LogicalPlan = {


Copied from ResolveOutputRelation to preserve the existing behavior for data sources that rely on custom implementations.

aokolnychyi · 2023-04-17T23:54:05Z

Failures in streaming tests don't seem related.

cloud-fan · 2023-04-18T07:59:06Z

thanks, merging to master!

aokolnychyi · 2023-04-18T14:33:24Z

Thanks for reviewing, @cloud-fan @huaxingao @dongjoon-hyun @viirya @johanl-db!

This PR migrates `TableOutputResolver` to use runtime NOT NULL checks instead of checking type compatibility during the analysis phase.

These changes are needed per discussion that happened [here](apache#40308 (comment)).

Nullability exceptions will be thrown at runtime (instead of analysis) but there is no API change.

This PR comes with tests.

Closes apache#40655 from aokolnychyi/spark-42855-v2.

Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4ad55b6)

### What changes were proposed in this pull request?

This PR adds a rule to align UPDATE assignments with table attributes.

### Why are the changes needed?

These changes are needed so that we can rewrite UPDATE statements into executable plans for tables that support row-level operations. In particular, our row-level mutation framework assumes Spark is responsible for building an updated version of each affected row and that row is passed back to the data source.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR comes with tests.

Closes apache#40308 from aokolnychyi/spark-42151-v2.

Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1c057f5)

github-actions Bot added CORE SQL labels Mar 6, 2023

aokolnychyi force-pushed the spark-42151-v2 branch from 278247f to 8a7d17b Compare March 7, 2023 04:49

aokolnychyi force-pushed the spark-42151-v2 branch from 8a7d17b to 05f4e91 Compare March 7, 2023 04:57

aokolnychyi mentioned this pull request Apr 4, 2023

[SPARK-42855][SQL] Use runtime null checks in TableOutputResolver #40655

Closed

aokolnychyi force-pushed the spark-42151-v2 branch from 05f4e91 to e8ac4ed Compare April 6, 2023 21:58

aokolnychyi force-pushed the spark-42151-v2 branch from e8ac4ed to 4b9417d Compare April 7, 2023 04:29

aokolnychyi force-pushed the spark-42151-v2 branch from 4b9417d to 0eb2e2f Compare April 7, 2023 19:51

[SPARK-42151][SQL] Align UPDATE assignments with table attributes

5f89106

Change approach

2ebfcef

aokolnychyi force-pushed the spark-42151-v2 branch from 0eb2e2f to 2ebfcef Compare April 15, 2023 00:42

aokolnychyi added 2 commits April 17, 2023 10:05

Minor updates

70de747

Remove not needed comment

162de02

cloud-fan approved these changes Apr 18, 2023

cloud-fan closed this in 1c057f5 Apr 18, 2023

aokolnychyi mentioned this pull request Apr 24, 2023

[SPARK-43204][SQL] Align MERGE assignments with table attributes #40919

Closed

revans2 mentioned this pull request Jun 24, 2024

Figure out why MapFromArrays appears in the tests for hive parquet write NVIDIA/cudf-spark#10948

Closed

dongjoon-hyun mentioned this pull request Feb 27, 2026

[SPARK-55716][SQL] Support NOT NULL constraint enforcement for V1 file source table inserts #54517

Closed
