[SPARK-47007][SQL] Add the MapSort expression#45639
Conversation
…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
| * ascendingOrder - A boolean value describing the order in which the map will be sorted. | ||
| This can be either be ascending (true) or descending (false). | ||
| """, | ||
| examples = """ |
There was a problem hiding this comment.
we probably don't need this part
There was a problem hiding this comment.
Removed ExpressionDescription.
| | Object $o1 = (($simpleEntryType) $o1entry).getKey(); | ||
| | Object $o2 = (($simpleEntryType) $o2entry).getKey(); | ||
| | $comp; | ||
| | return $order ? $c : -$c; |
There was a problem hiding this comment.
sorry for just seeing this, but maybe we should do here the same thing that ArraySort does which is put the ordering into a variable outside of compare and just multiply it with the result?
this way we avoid branching in every comparison
stefankandic
left a comment
There was a problem hiding this comment.
LGTM if tests pass!
| copy(child = newChild) | ||
| } | ||
|
|
||
| case class MapSort(base: Expression, ascendingOrder: Expression) |
There was a problem hiding this comment.
| case class MapSort(base: Expression, ascendingOrder: Expression) | |
| case class MapSort(base: Expression, ascendingOrder: Boolean) |
There was a problem hiding this comment.
Doesn't the BinaryExpression require two expressions here? Do we demote this to UnaryExpression?
EDIT: Expression for ascendingOrder in array sorting has been set as well.
There was a problem hiding this comment.
What's is the internal use-cases for the expression? Do we need this parameter at all?
Seems like you are going to pass true as ascendingOrder always at
https://github.com/apache/spark/pull/45549/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR2488
case a @ Aggregate(groupingExpr, x, b) =>
val newGrouping = groupingExpr.map { expr =>
(expr, expr.dataType) match {
case (_: MapSort, _) => expr
case (_, _: MapType) =>
MapSort(expr, Literal.TrueLiteral)
case _ => exprThere was a problem hiding this comment.
From the point of internal use, we don't need it. Refactored expression as UnaryExpression and removed ordering altogether.
| |""".stripMargin | ||
| } | ||
|
|
||
| override def prettyName: String = "map_sort" |
There was a problem hiding this comment.
Remove this since the expression hasn't been bound to the function name.
| copy(child = newChild) | ||
| } | ||
|
|
||
| case class MapSort(base: Expression, ascendingOrder: Expression) |
There was a problem hiding this comment.
What's is the internal use-cases for the expression? Do we need this parameter at all?
Seems like you are going to pass true as ascendingOrder always at
https://github.com/apache/spark/pull/45549/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR2488
case a @ Aggregate(groupingExpr, x, b) =>
val newGrouping = groupingExpr.map { expr =>
(expr, expr.dataType) match {
case (_: MapSort, _) => expr
case (_, _: MapType) =>
MapSort(expr, Literal.TrueLiteral)
case _ => expr| case Literal(_: Boolean, BooleanType) => | ||
| TypeCheckResult.TypeCheckSuccess |
There was a problem hiding this comment.
Do you have to be so strict on this argument? For example, could you imagine a case where you want to select the sort order based on the values in another column or the result of an expression? Is this needlessly restrictive?
There was a problem hiding this comment.
For example, could you imagine a case where you want to select the sort order based on the values in another column or the result of an expression?
I can imagine the case but so far we are going to use the expression internally for one case only. Support of ascendingOrder = false or even an arbitrary boolean expression just overcomplicates the code.
MapSort expression
|
+1, LGTM. Merging to master. |
### What changes were proposed in this pull request? Changes proposed in this PR include: - Relaxed checks that prevent aggregating of map types - Added new analyzer rule that uses `MapSort` expression proposed in [this PR](#45639) - Created codegen that compares two sorted maps ### Why are the changes needed? Adding new functionality to GROUP BY map types ### Does this PR introduce _any_ user-facing change? Yes, ability to use `GROUP BY MapType` ### How was this patch tested? With new UTs ### Was this patch authored or co-authored using generative AI tooling? No Closes #45549 from stevomitric/stevomitric/map-group-by. Lead-authored-by: Stevo Mitric <stevo.mitric@databricks.com> Co-authored-by: Stefan Kandic <stefan.kandic@databricks.com> Co-authored-by: Stevo Mitric <stevomitric2000@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Added the new
MapSortexpression inCollationOperationsalongside new UTs.Why are the changes needed?
In order to add the ability to do GROUP BY on map types we first have to be able to sort the maps by their key
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests in
CollectionExpressionSuiteWas this patch authored or co-authored using generative AI tooling?
No