[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan by xy-xin · Pull Request #25626 · apache/spark

xy-xin · 2019-08-30T02:27:51Z

What changes were proposed in this pull request?

This PR supports UPDATE in the parser and adds the corresponding logical plan. The SQL syntax is a standard UPDATE statement:

UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate?
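
To make the shape of that syntax concrete, here is a minimal sketch of the pieces a parser would have to capture. `Assignment` and `UpdateSketch` are hypothetical names for illustration only, not the classes added by this PR; treating the alias as optional and keeping values as plain strings are assumptions.

```scala
// Hypothetical, simplified model of a parsed UPDATE statement.
// These are NOT Spark's classes; values and predicates are kept as raw strings.
final case class Assignment(column: String, value: String)

final case class UpdateSketch(
    table: String,
    alias: Option[String],        // assumption: the alias is optional
    assignments: Seq[Assignment], // one or more colName=value pairs
    where: Option[String]) {      // the trailing `?` makes WHERE optional
  require(assignments.nonEmpty, "UPDATE needs at least one assignment")

  // Renders the statement back to SQL text in the form shown above.
  def sql: String = {
    val target = (table +: alias.toSeq).mkString(" ")
    val sets = assignments.map(a => s"${a.column}=${a.value}").mkString(", ")
    val filter = where.map(p => s" WHERE $p").getOrElse("")
    s"UPDATE $target SET $sets$filter"
  }
}
```

For instance, `UpdateSketch("tbl", Some("t"), Seq(Assignment("t.a", "1")), Some("t.id = 3")).sql` renders a single-column update with an alias and a predicate.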

Why are the changes needed?

With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test cases added.

xy-xin · 2019-08-30T02:54:43Z

cc @cloud-fan @rdblue

cloud-fan · 2019-08-30T07:03:12Z

ok to test

cloud-fan · 2019-08-30T07:03:19Z

add to whitelist

cloud-fan · 2019-08-30T07:20:01Z

      DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

+    case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
+      val attrsNames = attrs.map(_.name)


Shall we fail if some attrs are resolved to a nested field?

Added a check for nested fields that throws an exception if a field is nested.

BTW, I believe supporting nested field updates would be helpful, but it may be difficult to support; for example, how would we pass the nested field to the data source?

SparkQA · 2019-08-30T10:35:48Z

Test build #109938 has finished for PR 25626 at commit a58a87b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile

The test coverage is not good enough. For example, all the test cases are just updating a single column? Try to check the test cases in the other open source databases?

xy-xin · 2019-09-02T12:06:49Z

The test coverage is not good enough. For example, all the test cases are just updating a single column? Try to check the test cases in the other open source databases?

Hi @gatorsmile, I added more test cases for update, including multi-field updates, nested field updates, etc. Please review.
I also checked the test cases in PostgreSQL. There are many cases for other syntax, such as updating via a sub-query or updating multiple fields with one assignment, which Spark SQL does not currently support.
More test cases will be added once we support more powerful updates, such as updating a field with an expression, or updating multiple fields with a select or multiple values in one assignment, as mentioned above.

SparkQA · 2019-09-02T15:27:16Z

Test build #110012 has finished for PR 25626 at commit a76062c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-02T17:40:35Z

Test build #110016 has finished for PR 25626 at commit 5b738cf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-09-03T05:28:31Z

      DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

+    case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
+      val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])


Why do we need the .asInstanceOf[Seq[Any]]?

Without it, attrs is treated as Seq[Attribute], and filterNot treats each element as an Attribute, which throws a cast exception.
As you mentioned in #25626 (comment), a nested field is not resolved to an Attribute but to GetStructField or something else, so if we change the type of attrs to Seq[Expression], we can eliminate .asInstanceOf[Seq[Any]] here.

cloud-fan · 2019-09-03T05:33:30Z


+case class UpdateTable(
+    child: LogicalPlan,
+    attrs: Seq[Attribute],


Can we really use Seq[Attribute]? When Spark resolves it to a nested field, it will be an Alias, which is not an Attribute, and we will get weird errors.

As mentioned in #25626 (comment), Seq[Attribute] requires the ugly .asInstanceOf[Seq[Any]] when checking the type of fields. But do we need to change it to Seq[Expression]? My feeling is that the latter is too general to represent a field.

How about Seq[NamedExpression]?

Seq[NamedExpression] is OK with me, and I updated the code. But Seq[NamedExpression] is not a perfect solution, because some nested fields do not have a name and would be resolved to something like GetStructField, which is a subclass of Expression.
Updating a nested field is complex. A named field can be updated by specifying its name, but for an array, or a more complex JSON value composed of nested objects and arrays, it is hard to specify the field to be updated.
So currently, my thinking is that we can limit updates to non-nested fields, which can be resolved to a NamedExpression. What's your opinion, @cloud-fan?

Then I think we have no choice but to use Seq[Expression]. We should add comments to explain that attrs here holds expressions that may be attributes.

Or we can change the analyzer to only resolve column names in UpdateTable as attributes, which I don't think is worthwhile.

Updated it to Seq[Expression] and added some explanatory docs. Please review.
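
The outcome of this thread can be sketched roughly as follows: once the update targets are typed as Seq[Expression], the nested-field check needs no cast through Seq[Any]. The types below are simplified stand-ins written for this sketch, not the real catalyst classes, and `NestedFieldCheck.topLevelNames` is a hypothetical helper.

```scala
// Simplified stand-ins for catalyst types; hypothetical, not the real classes.
sealed trait Expression
sealed trait Attribute extends Expression { def name: String }
final case class AttributeReference(name: String) extends Attribute
final case class GetStructField(child: Expression, fieldName: String) extends Expression

object NestedFieldCheck {
  // Returns the names of the columns to update, or fails if any update
  // target is something other than a plain top-level attribute reference.
  def topLevelNames(attrs: Seq[Expression]): Seq[String] = {
    // attrs is already Seq[Expression], so no .asInstanceOf[Seq[Any]] is needed.
    val nested = attrs.filterNot(_.isInstanceOf[AttributeReference])
    if (nested.nonEmpty) {
      throw new IllegalArgumentException(
        s"Updating nested fields is not supported: ${nested.mkString(", ")}")
    }
    attrs.collect { case a: AttributeReference => a.name }
  }
}
```

Because the element type is the general Expression, the predicate accepts a GetStructField without any implicit downcast to Attribute, which is the cast that failed with Seq[Attribute].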

SparkQA · 2019-09-03T11:11:08Z

Test build #110037 has finished for PR 25626 at commit 4d048b3.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

rdblue · 2019-09-03T23:07:02Z

@xianyinxin, can you explain the required semantics for your proposed API?

void updateWhere(Map<String, Expression> sets, Filter[] filters);

It isn't clear what the requirement for a source would be.

In addition, Expression is internal to catalyst and should be removed from the API.

cloud-fan · 2019-09-04T00:49:40Z

@rdblue this PR uses the public expression org.apache.spark.sql.catalog.v2.expressions.Expression

xy-xin · 2019-09-04T06:21:41Z

@xianyinxin, can you explain the required semantics for your proposed API?
void updateWhere(Map<String, Expression> sets, Filter[] filters);
It isn't clear what the requirement for a source would be.

In addition, Expression is internal to catalyst and should be removed from the API.

Thank you @rdblue. Here Expression is the public data source expression org.apache.spark.sql.catalog.v2.expressions.Expression. The data source needs to know the value each field is updated to, so the key of sets specifies the field to be updated, and the value of sets is the new value. IMHO the value can be an expression, not just a literal, as in the case `UPDATE tbl SET a=a+1 WHERE ...`
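
One possible reading of those semantics, as a sketch over an in-memory table: every row that matches all filters has each key in sets replaced by its expression, evaluated against the row's pre-update values. Everything here (`SetExpr`, `EqualTo`, `UpdateWhereSketch`, integer-only rows) is a hypothetical stand-in, not the DS v2 Expression or Filter classes, and the choice to evaluate against pre-update values is an assumption.

```scala
// Hypothetical stand-ins; NOT the DS v2 Expression or Filter classes.
sealed trait SetExpr
final case class Lit(value: Int) extends SetExpr
final case class Col(name: String) extends SetExpr
final case class Add(left: SetExpr, right: SetExpr) extends SetExpr

// A single kind of filter is enough for the sketch: column = value.
final case class EqualTo(column: String, value: Int)

object UpdateWhereSketch {
  type Row = Map[String, Int]

  // Evaluates an update expression against one row.
  private def eval(e: SetExpr, row: Row): Int = e match {
    case Lit(v)    => v
    case Col(n)    => row(n)
    case Add(l, r) => eval(l, row) + eval(r, row)
  }

  // Rows matching every filter get each key in `sets` replaced by its
  // expression, evaluated against the row's values before the update.
  def updateWhere(rows: Seq[Row], sets: Map[String, SetExpr], filters: Seq[EqualTo]): Seq[Row] =
    rows.map { row =>
      if (filters.forall(f => row.get(f.column).contains(f.value))) {
        row ++ sets.map { case (col, e) => col -> eval(e, row) }
      } else row
    }
}
```

With this model, `UPDATE tbl SET a=a+1 WHERE id=1` corresponds to `sets = Map("a" -> Add(Col("a"), Lit(1)))` and `filters = Seq(EqualTo("id", 1))`.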

SparkQA · 2019-09-04T07:12:06Z

Test build #110102 has finished for PR 25626 at commit 0fda5c8.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2019-09-04T07:16:55Z

      DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

+    case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
+      val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])


Can we remove .asInstanceOf[Seq[Any]] now?

No. As I explained above, Seq[NamedExpression] cannot solve the problem. See #25626 (comment).

cloud-fan · 2019-09-20T03:04:52Z

I'm preparing JDBC v2 as a showcase of Data Source V2, which supports many catalog functionalities (CREATE/ALTER/DROP TABLE) and DELETE/UPDATE, so that people can know the benefits of DS V2 for end-users. I'd like to get this in if there is no objection. @xianyinxin can you resolve the code conflicts?

SparkQA · 2019-09-20T03:15:51Z

Test build #111041 has finished for PR 25626 at commit bcafbdb.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UpdateTable(
case class UpdateTableStatement(
case class UpdateTableExec(

SparkQA · 2019-09-20T07:05:02Z

Test build #111043 has finished for PR 25626 at commit 84fcd91.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UpdateTable(
case class UpdateTableStatement(
case class UpdateTableExec(

cloud-fan · 2019-09-20T07:16:54Z

retest this please

SparkQA · 2019-09-20T12:03:03Z

Test build #111055 has finished for PR 25626 at commit 84fcd91.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UpdateTable(
case class UpdateTableStatement(
case class UpdateTableExec(

rdblue · 2019-09-20T17:58:22Z

I don't think that an updated JDBC with UPDATE support qualifies as urgent. I would support adding this without the DSv2 API if you want to get just the parser changes in, but I think it is a bad idea to add a pushdown API without an implementation.

cloud-fan · 2019-09-23T03:32:08Z

Yes I can implement JDBC update without the DS v2 API as JDBC is an internal source. @xianyinxin can you exclude the DS v2 API changes from this PR? i.e. only keep the parser change and the UpdateStatement, as well as the parser tests.

When I finish the JDBC update, we can revisit it and see how to generalize the UPDATE API design.

xy-xin · 2019-09-23T04:57:26Z

OK, that looks like a nice plan. I'll update the code soon.

SparkQA · 2019-09-23T11:09:26Z

Test build #111196 has finished for PR 25626 at commit 3732a18.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UpdateTableStatement(

cloud-fan · 2019-09-23T11:26:10Z

thanks, merging to master!

rdblue · 2019-09-23T16:53:46Z

Thanks for coming up with a compromise, @xianyinxin and @cloud-fan!

xy-xin · 2019-09-24T00:50:03Z

Thank you for comments & suggestions! @cloud-fan @rdblue

### What changes were proposed in this pull request?
Add back the resolved logical plan for UPDATE TABLE. It was in #25626 before but was removed later.

### Why are the changes needed?
In #25626, we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan.

### Does this PR introduce any user-facing change?
no, UPDATE is still not supported yet.

### How was this patch tested?
new tests.

Closes #26025 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

xy-xin force-pushed the SPARK-28892 branch from ad42e13 to 57cfcdb · August 30, 2019 02:42

dongjoon-hyun added the SQL label Aug 30, 2019

cloud-fan reviewed Aug 30, 2019

gatorsmile reviewed Aug 31, 2019

Comment thread sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 Outdated

gatorsmile reviewed Aug 31, 2019

Comment thread sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 Outdated

gatorsmile reviewed Aug 31, 2019

Comment thread sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 Outdated

gatorsmile requested changes Aug 31, 2019

cloud-fan reviewed Sep 3, 2019

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala Outdated

cloud-fan reviewed Sep 3, 2019

Comment thread ...core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala Outdated

cloud-fan reviewed Sep 3, 2019

Comment thread sql/core/src/test/scala/org/apache/spark/sql/sources/v2/DataSourceV2SQLSuite.scala Outdated

cloud-fan reviewed Sep 3, 2019
