Skip to content

[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan#25626

Closed
xy-xin wants to merge 1 commit into
apache:masterfrom
xy-xin:SPARK-28892
Closed

[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan#25626
xy-xin wants to merge 1 commit into
apache:masterfrom
xy-xin:SPARK-28892

Conversation

@xy-xin

@xy-xin xy-xin commented Aug 30, 2019

Copy link
Copy Markdown

What changes were proposed in this pull request?

This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement:

UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate?

Why are the changes needed?

With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test cases added.

@xy-xin

xy-xin commented Aug 30, 2019

Copy link
Copy Markdown
Author

cc @cloud-fan @rdblue

@cloud-fan

Copy link
Copy Markdown
Contributor

ok to test

@cloud-fan

Copy link
Copy Markdown
Contributor

add to whitelist

DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
val attrsNames = attrs.map(_.name)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we fail if some attrs are resolved to a nested field?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add check for nested fields and throw exception if it is nested.

BTW, I believe it would be helpful if we support nested fields update, but it may be difficult for supporting that, like, how can we pass the nested field to datasource.

@SparkQA

SparkQA commented Aug 30, 2019

Copy link
Copy Markdown

Test build #109938 has finished for PR 25626 at commit a58a87b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 Outdated
Comment thread sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 Outdated
Comment thread sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 Outdated

@gatorsmile gatorsmile left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test coverage is not good enough. For example, all the test cases are just updating a single column? Try to check the test cases in the other open source databases?

@xy-xin

xy-xin commented Sep 2, 2019

Copy link
Copy Markdown
Author

The test coverage is not good enough. For example, all the test cases are just updating a single column? Try to check the test cases in the other open source databases?

Hi @gatorsmile , I added more test cases for update, including multi-fields updating, nested fields updating, etc. pls review.
Also checked the test cases in postgresql. There're many cases for other syntax like updating via sub-query, updating multi-fields with one assignment which is not supported current by spark sql.
More test cases will be add once we support more powerful update, like update a field with an expression, update multi-fields with select/multi-values in one assignment metioned above.

@SparkQA

SparkQA commented Sep 2, 2019

Copy link
Copy Markdown

Test build #110012 has finished for PR 25626 at commit a76062c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 2, 2019

Copy link
Copy Markdown

Test build #110016 has finished for PR 25626 at commit 5b738cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala Outdated
DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the .asInstanceOf[Seq[Any]]?

@xy-xin xy-xin Sep 3, 2019

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not, it will regard attrs to be of type Seq[Attribute], and filterNot will treat each element as Attribute which will throw a cast exception.
As you metioned in #25626 (comment), a nested filed will not be resolved to Attribute, but GetStructField or something other, so if we change the type of attrs to Seq[Expression], we can eliminate .asInstanceOf[Seq[Any]] here.


case class UpdateTable(
child: LogicalPlan,
attrs: Seq[Attribute],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we really use Seq[Attribute]? When Spark resolves it to nested field, it will be Alias which is not an Attribute, and we will get weird errors.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metioned in #25626 (comment), Seq[Attribute] involves the ugly .asInstanceOf[Seq[Any]] when checking the type of fields. But need we update it to Seq[Expression]? My feeling is the latter is too general to represent a field.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about Seq[NamedExpression]?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seq[NamedExpression] is ok for me, and I updated the code. But Seq[NamedExpression] is not a perfect solution, as some nested field does not have a name, which would be resolved to something like GetStructField, and it is a subclass of Expression.
Updating a nested field is complex. For a named field user can update it by specifying the name, but for an array, or a more complex json which composed by nested objects and arrays, it's hard to specify the field to be updated.
So currrently, my think is we can limit the updating to the scope of non-nested field, which can be resolved to a NamedExpression. What's your opnion? @cloud-fan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I think we have no choice but use Seq[Expression]. We should add comments to explain that attrs here holds the expressions that maybe attributes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can change the analyzer to only resolve column names in UpdateTable as attributes, which I don't think is worthwhile.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to Seq[Expression] and added some explaining doc, pls review.

@SparkQA

SparkQA commented Sep 3, 2019

Copy link
Copy Markdown

Test build #110037 has finished for PR 25626 at commit 4d048b3.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue

rdblue commented Sep 3, 2019

Copy link
Copy Markdown
Contributor

@xianyinxin, can you explain the required semantics for your proposed API?

void updateWhere(Map<String, Expression> sets, Filter[] filters);

It isn't clear what the requirement for a source would be.

In addition, Expression is internal to catalyst and should be removed from the API.

@cloud-fan

Copy link
Copy Markdown
Contributor

@rdblue this PR uses the public expression org.apache.spark.sql.catalog.v2.expressions.Expression

@xy-xin

xy-xin commented Sep 4, 2019

Copy link
Copy Markdown
Author

@xianyinxin, can you explain the required semantics for your proposed API?

void updateWhere(Map<String, Expression> sets, Filter[] filters);

It isn't clear what the requirement for a source would be.

In addition, Expression is internal to catalyst and should be removed from the API.

Thank you @rdblue . Here Expression is the public datasource expression org.apache.spark.sql.catalog.v2.expressions.Expression. Datasource needs to know the value that the field be updated to, so here the key of sets specifies the field to be updated, and the value of sets is the updated value. The value can be an expression, not just a literal, IMHO, like the case `UPDATE tbl SET a=a+1 WHERE ...

@SparkQA

SparkQA commented Sep 4, 2019

Copy link
Copy Markdown

Test build #110102 has finished for PR 25626 at commit 0fda5c8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala Outdated
Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala Outdated
DeleteFromTableExec(r.table.asDeletable, filters) :: Nil

case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) =>
val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove .asInstanceOf[Seq[Any]] now?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. As I explained above, Seq[NamedExpression] can not resolve the problem. See #25626 (comment).

@cloud-fan

cloud-fan commented Sep 20, 2019

Copy link
Copy Markdown
Contributor

I'm preparing JDBC v2 as a showcase of Data Source V2, which supports many catalog functionalities (CREATE/ALTER/DROP TABLE) and DELETE/UPDATE, so that people can know the benefits of DS V2 for end-users. I'd like to get this in if there is no objection. @xianyinxin can you resolve the code conflicts?

@SparkQA

SparkQA commented Sep 20, 2019

Copy link
Copy Markdown

Test build #111041 has finished for PR 25626 at commit bcafbdb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTable(
  • case class UpdateTableStatement(
  • case class UpdateTableExec(

@SparkQA

SparkQA commented Sep 20, 2019

Copy link
Copy Markdown

Test build #111043 has finished for PR 25626 at commit 84fcd91.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTable(
  • case class UpdateTableStatement(
  • case class UpdateTableExec(

@cloud-fan

Copy link
Copy Markdown
Contributor

retest this please

@SparkQA

SparkQA commented Sep 20, 2019

Copy link
Copy Markdown

Test build #111055 has finished for PR 25626 at commit 84fcd91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTable(
  • case class UpdateTableStatement(
  • case class UpdateTableExec(

@rdblue

rdblue commented Sep 20, 2019

Copy link
Copy Markdown
Contributor

I don't think that an updated JDBC with UPDATE support qualifies as urgent. I would support adding this without the DSv2 API if you want to get just the parser changes in, but I think it is a bad idea to add a pushdown API without an implementation.

@cloud-fan

Copy link
Copy Markdown
Contributor

Yes I can implement JDBC update without the DS v2 API as JDBC is an internal source. @xianyinxin can you exclude the DS v2 API changes from this PR? i.e. only keep the parser change and the UpdateStatement, as well as the parser tests.

When I finish the JDBC update, we can revisit it and see how to generalize the UPDATE API design.

@xy-xin

xy-xin commented Sep 23, 2019

Copy link
Copy Markdown
Author

Ok, it looks a nice plan. I'll update the code soon.

@SparkQA

SparkQA commented Sep 23, 2019

Copy link
Copy Markdown

Test build #111196 has finished for PR 25626 at commit 3732a18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdateTableStatement(

@cloud-fan cloud-fan changed the title [SPARK-28892][SQL] Add UPDATE support for DataSource V2 [SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan Sep 23, 2019
@cloud-fan

Copy link
Copy Markdown
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 655356e Sep 23, 2019
@rdblue

rdblue commented Sep 23, 2019

Copy link
Copy Markdown
Contributor

Thanks for coming up with a compromise, @xianyinxin and @cloud-fan!

@xy-xin

xy-xin commented Sep 24, 2019

Copy link
Copy Markdown
Author

Thank you for comments & suggestions! @cloud-fan @rdblue

gatorsmile pushed a commit that referenced this pull request Oct 8, 2019
### What changes were proposed in this pull request?

Add back the resolved logical plan for UPDATE TABLE. It was in #25626 before but was removed later.

### Why are the changes needed?

In #25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan.

### Does this PR introduce any user-facing change?

no, UPDATE is still not supported yet.

### How was this patch tested?

new tests.

Closes #26025 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
atronchi pushed a commit to atronchi/spark that referenced this pull request Oct 23, 2019
### What changes were proposed in this pull request?

Add back the resolved logical plan for UPDATE TABLE. It was in apache#25626 before but was removed later.

### Why are the changes needed?

In apache#25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan.

### Does this PR introduce any user-facing change?

no, UPDATE is still not supported yet.

### How was this patch tested?

new tests.

Closes apache#26025 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants