[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan#25626
[SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan#25626xy-xin wants to merge 1 commit into
Conversation
|
ok to test |
|
add to whitelist |
| DeleteFromTableExec(r.table.asDeletable, filters) :: Nil | ||
|
|
||
| case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) => | ||
| val attrsNames = attrs.map(_.name) |
There was a problem hiding this comment.
shall we fail if some attrs are resolved to a nested field?
There was a problem hiding this comment.
Add check for nested fields and throw exception if it is nested.
BTW, I believe it would be helpful if we support nested fields update, but it may be difficult for supporting that, like, how can we pass the nested field to datasource.
|
Test build #109938 has finished for PR 25626 at commit
|
gatorsmile
left a comment
There was a problem hiding this comment.
The test coverage is not good enough. For example, all the test cases are just updating a single column? Try to check the test cases in the other open source databases?
Hi @gatorsmile , I added more test cases for update, including multi-fields updating, nested fields updating, etc. pls review. |
|
Test build #110012 has finished for PR 25626 at commit
|
|
Test build #110016 has finished for PR 25626 at commit
|
| DeleteFromTableExec(r.table.asDeletable, filters) :: Nil | ||
|
|
||
| case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) => | ||
| val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference]) |
There was a problem hiding this comment.
why do we need the .asInstanceOf[Seq[Any]]?
There was a problem hiding this comment.
If not, it will regard attrs to be of type Seq[Attribute], and filterNot will treat each element as Attribute which will throw a cast exception.
As you metioned in #25626 (comment), a nested filed will not be resolved to Attribute, but GetStructField or something other, so if we change the type of attrs to Seq[Expression], we can eliminate .asInstanceOf[Seq[Any]] here.
|
|
||
| case class UpdateTable( | ||
| child: LogicalPlan, | ||
| attrs: Seq[Attribute], |
There was a problem hiding this comment.
can we really use Seq[Attribute]? When Spark resolves it to nested field, it will be Alias which is not an Attribute, and we will get weird errors.
There was a problem hiding this comment.
Metioned in #25626 (comment), Seq[Attribute] involves the ugly .asInstanceOf[Seq[Any]] when checking the type of fields. But need we update it to Seq[Expression]? My feeling is the latter is too general to represent a field.
There was a problem hiding this comment.
how about Seq[NamedExpression]?
There was a problem hiding this comment.
Seq[NamedExpression] is ok for me, and I updated the code. But Seq[NamedExpression] is not a perfect solution, as some nested field does not have a name, which would be resolved to something like GetStructField, and it is a subclass of Expression.
Updating a nested field is complex. For a named field user can update it by specifying the name, but for an array, or a more complex json which composed by nested objects and arrays, it's hard to specify the field to be updated.
So currrently, my think is we can limit the updating to the scope of non-nested field, which can be resolved to a NamedExpression. What's your opnion? @cloud-fan
There was a problem hiding this comment.
Then I think we have no choice but use Seq[Expression]. We should add comments to explain that attrs here holds the expressions that maybe attributes.
There was a problem hiding this comment.
Or we can change the analyzer to only resolve column names in UpdateTable as attributes, which I don't think is worthwhile.
There was a problem hiding this comment.
Updated it to Seq[Expression] and added some explaining doc, pls review.
|
Test build #110037 has finished for PR 25626 at commit
|
|
@xianyinxin, can you explain the required semantics for your proposed API? It isn't clear what the requirement for a source would be. In addition, |
|
@rdblue this PR uses the public expression |
Thank you @rdblue . Here Expression is the public datasource expression org.apache.spark.sql.catalog.v2.expressions.Expression. Datasource needs to know the value that the field be updated to, so here the key of sets specifies the field to be updated, and the value of sets is the updated value. The value can be an expression, not just a literal, IMHO, like the case `UPDATE tbl SET a=a+1 WHERE ... |
|
Test build #110102 has finished for PR 25626 at commit
|
| DeleteFromTableExec(r.table.asDeletable, filters) :: Nil | ||
|
|
||
| case UpdateTable(r: DataSourceV2Relation, attrs, values, condition) => | ||
| val nested = attrs.asInstanceOf[Seq[Any]].filterNot(_.isInstanceOf[AttributeReference]) |
There was a problem hiding this comment.
Can we remove .asInstanceOf[Seq[Any]] now?
There was a problem hiding this comment.
No. As I explained above, Seq[NamedExpression] can not resolve the problem. See #25626 (comment).
|
I'm preparing JDBC v2 as a showcase of Data Source V2, which supports many catalog functionalities (CREATE/ALTER/DROP TABLE) and DELETE/UPDATE, so that people can know the benefits of DS V2 for end-users. I'd like to get this in if there is no objection. @xianyinxin can you resolve the code conflicts? |
|
Test build #111041 has finished for PR 25626 at commit
|
|
Test build #111043 has finished for PR 25626 at commit
|
|
retest this please |
|
Test build #111055 has finished for PR 25626 at commit
|
|
I don't think that an updated JDBC with UPDATE support qualifies as urgent. I would support adding this without the DSv2 API if you want to get just the parser changes in, but I think it is a bad idea to add a pushdown API without an implementation. |
|
Yes I can implement JDBC update without the DS v2 API as JDBC is an internal source. @xianyinxin can you exclude the DS v2 API changes from this PR? i.e. only keep the parser change and the When I finish the JDBC update, we can revisit it and see how to generalize the UPDATE API design. |
|
Ok, it looks a nice plan. I'll update the code soon. |
|
Test build #111196 has finished for PR 25626 at commit
|
|
thanks, merging to master! |
|
Thanks for coming up with a compromise, @xianyinxin and @cloud-fan! |
|
Thank you for comments & suggestions! @cloud-fan @rdblue |
### What changes were proposed in this pull request? Add back the resolved logical plan for UPDATE TABLE. It was in #25626 before but was removed later. ### Why are the changes needed? In #25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan. ### Does this PR introduce any user-facing change? no, UPDATE is still not supported yet. ### How was this patch tested? new tests. Closes #26025 from cloud-fan/update. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request? Add back the resolved logical plan for UPDATE TABLE. It was in apache#25626 before but was removed later. ### Why are the changes needed? In apache#25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan. ### Does this PR introduce any user-facing change? no, UPDATE is still not supported yet. ### How was this patch tested? new tests. Closes apache#26025 from cloud-fan/update. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>
What changes were proposed in this pull request?
This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement:
Why are the changes needed?
With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New test cases added.