[SPARK-47007][SQL] Add the `MapSort` expression by stevomitric · Pull Request #45639 · apache/spark

stevomitric · 2024-03-21T13:11:03Z

What changes were proposed in this pull request?

Added the new `MapSort` expression to collectionOperations, along with new unit tests.

Why are the changes needed?

To support GROUP BY on map types, we first need to be able to sort maps by their keys.
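As a rough illustration of why key ordering matters for grouping, the sketch below models a map as an ordered list of entries (an assumption made for this example, not Spark's actual `MapData` layout). Two maps with the same entries in different insertion order compare unequal position by position, and compare equal once both are sorted by key. The class name `MapSortSketch` and the helper `sortByKey` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapSortSketch {
    // Hypothetical helper: returns a copy of the entries sorted by key, ascending.
    static <K extends Comparable<K>, V> List<Map.Entry<K, V>> sortByKey(
            List<Map.Entry<K, V>> entries) {
        List<Map.Entry<K, V>> sorted = new ArrayList<>(entries);
        sorted.sort(Map.Entry.comparingByKey());
        return sorted;
    }

    public static void main(String[] args) {
        // Same logical map, different entry order.
        List<Map.Entry<String, Integer>> m1 = List.of(Map.entry("a", 1), Map.entry("b", 2));
        List<Map.Entry<String, Integer>> m2 = List.of(Map.entry("b", 2), Map.entry("a", 1));

        // Positional comparison treats them as different grouping keys.
        System.out.println(m1.equals(m2));                       // false
        // After sorting by key, both reduce to the same canonical form.
        System.out.println(sortByKey(m1).equals(sortByKey(m2))); // true
    }
}
```

Sorting first gives each logical map a single canonical representation, which is what a grouping operator needs in order to place equal maps in the same group.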

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests in CollectionExpressionsSuite

Was this patch authored or co-authored using generative AI tooling?

No

stefankandic · 2024-03-21T13:13:07Z

+      * ascendingOrder - A boolean value describing the order in which the map will be sorted.
+          This can be either be ascending (true) or descending (false).
+  """,
+  examples = """


we probably don't need this part

Removed ExpressionDescription.

stefankandic · 2024-03-21T13:18:49Z

+       |    Object $o1 = (($simpleEntryType) $o1entry).getKey();
+       |    Object $o2 = (($simpleEntryType) $o2entry).getKey();
+       |    $comp;
+       |    return $order ? $c : -$c;


Sorry for only noticing this now, but maybe we should do the same thing ArraySort does here: put the ordering into a variable outside of compare and just multiply the result by it?

This way we avoid branching in every comparison.
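A minimal sketch of the suggestion, written as plain Java rather than as the actual generated code: the sort direction is resolved once into a sign, and each comparison multiplies by it instead of branching. The class and method names (`OrderingSketch`, `branching`, `hoisted`) are hypothetical and only illustrate the idea.

```java
import java.util.Arrays;
import java.util.Comparator;

class OrderingSketch {
    // Variant from the diff under review: the flag is tested on every comparison.
    static Comparator<Integer> branching(boolean ascending) {
        return (a, b) -> {
            int c = Integer.compare(a, b);
            return ascending ? c : -c;
        };
    }

    // Suggested variant: the direction is computed once, outside compare.
    static Comparator<Integer> hoisted(boolean ascending) {
        final int sign = ascending ? 1 : -1;
        // Integer.compare returns only -1, 0 or 1, so the product cannot overflow.
        return (a, b) -> sign * Integer.compare(a, b);
    }

    public static void main(String[] args) {
        Integer[] keys = {3, 1, 2};
        Arrays.sort(keys, hoisted(false));
        System.out.println(Arrays.toString(keys)); // [3, 2, 1]
    }
}
```

Both variants produce the same ordering; the second only moves the direction check out of the per-comparison path. Multiplying by the sign is safe here because the comparison result is limited to -1, 0 or 1; a comparator that returns arbitrary integers would need care with `Integer.MIN_VALUE`.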

stefankandic

LGTM if tests pass!

cloud-fan · 2024-03-21T14:01:06Z

    copy(child = newChild)
 }

+case class MapSort(base: Expression, ascendingOrder: Expression)


Suggested change

- case class MapSort(base: Expression, ascendingOrder: Expression)
+ case class MapSort(base: Expression, ascendingOrder: Boolean)

Doesn't the BinaryExpression require two expressions here? Do we demote this to UnaryExpression?

EDIT: ascendingOrder is declared as an Expression in the array sorting expression as well.

What are the internal use cases for the expression? Do we need this parameter at all?

It seems like you are always going to pass true as ascendingOrder at
https://github.com/apache/spark/pull/45549/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR2488

    case a @ Aggregate(groupingExpr, x, b) =>
      val newGrouping = groupingExpr.map { expr =>
        (expr, expr.dataType) match {
          case (_: MapSort, _) => expr
          case (_, _: MapType) => MapSort(expr, Literal.TrueLiteral)
          case _ => expr

From the point of view of internal use, we don't need it. Refactored the expression into a UnaryExpression and removed the ordering parameter altogether.

MaxGekk · 2024-03-21T16:53:22Z

+       |""".stripMargin
+  }
+
+  override def prettyName: String = "map_sort"


Remove this since the expression hasn't been bound to the function name.

MaxGekk · 2024-03-21T16:59:08Z

    copy(child = newChild)
 }

+case class MapSort(base: Expression, ascendingOrder: Expression)


What are the internal use cases for the expression? Do we need this parameter at all?

It seems like you are always going to pass true as ascendingOrder at
https://github.com/apache/spark/pull/45549/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR2488

    case a @ Aggregate(groupingExpr, x, b) =>
      val newGrouping = groupingExpr.map { expr =>
        (expr, expr.dataType) match {
          case (_: MapSort, _) => expr
          case (_, _: MapType) => MapSort(expr, Literal.TrueLiteral)
          case _ => expr

markj-db · 2024-03-21T18:35:51Z

+        case Literal(_: Boolean, BooleanType) =>
+          TypeCheckResult.TypeCheckSuccess


Do you have to be so strict on this argument? For example, could you imagine a case where you want to select the sort order based on the values in another column or the result of an expression? Is this needlessly restrictive?

> For example, could you imagine a case where you want to select the sort order based on the values in another column or the result of an expression?

I can imagine such a case, but so far we are going to use the expression internally for one case only. Supporting ascendingOrder = false, or even an arbitrary boolean expression, would just overcomplicate the code.

MaxGekk · 2024-03-22T07:43:59Z

+1, LGTM. Merging to master.
Thank you, @stevomitric, and @stefankandic, @cloud-fan, @markj-db for the review.

### What changes were proposed in this pull request?

Changes proposed in this PR include:
- Relaxed checks that prevent aggregating of map types
- Added new analyzer rule that uses `MapSort` expression proposed in [this PR](#45639)
- Created codegen that compares two sorted maps

### Why are the changes needed?

Adding new functionality to GROUP BY map types

### Does this PR introduce _any_ user-facing change?

Yes, ability to use `GROUP BY MapType`

### How was this patch tested?

With new UTs

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #45549 from stevomitric/stevomitric/map-group-by.

Lead-authored-by: Stevo Mitric <stevo.mitric@databricks.com>
Co-authored-by: Stefan Kandic <stefan.kandic@databricks.com>
Co-authored-by: Stevo Mitric <stevomitric2000@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

stefankandic and others added 25 commits February 29, 2024 09:48

initial working version

a081649

add golden files

1441549

add map sort to other languages

1be06e3

fix typoes

249e903

fix scalastyle issue

aaae883

add proto golden files

acaf95e

fix python function call

5619fdb

fix ci errors

7754c14

fix ci checks

f0ebf5d

Optimized map-sort by switching to array sorting

1f78167

Potential tests fix

a5eb480

Potential tests fix 2

9497f99

Removed TODOs and changed parmIndex to ordinal

5e7a033

Shortened map sort function and added more docs

ab70f1e

updated map_sort test suite

e79d65c

Update sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunction…

a435355

…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>

Update sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunction…

c9901d0

…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>

docs fix

da6a710

Updated codegen and removed once test-case

81008c2

Update python/pyspark/sql/functions/builtin.py

86b29c5

Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>

Updated 'select.show' to give more info in map_sort desc

c08ab6c

Restructured docs, removed unused variable and refactored code

31a797c

Removed map_sort function but left the MapSort expression

69e3b48

Merge branch 'master' into stevomitric/map-expr

51ab204

aditional erasions

8d9ac51

github-actions Bot added the SQL label Mar 21, 2024

stefankandic reviewed Mar 21, 2024

removed ExpressionDescription

2951bcc

stevomitric mentioned this pull request Mar 21, 2024

[SPARK-47007][SQL][PYTHON][R][CONNECT] Add the map_sort function #45069

Closed

stefankandic reviewed Mar 21, 2024

Moved ordering outside of comapre function

0fc3c6a

stefankandic approved these changes Mar 21, 2024

cloud-fan reviewed Mar 21, 2024

stevomitric requested a review from cloud-fan March 21, 2024 14:35

MaxGekk reviewed Mar 21, 2024

markj-db reviewed Mar 21, 2024

Removed oredering type

0c7d21a

stevomitric requested a review from MaxGekk March 21, 2024 21:58

MaxGekk changed the title ~~[SPARK-47007] Added MapSort expression~~ [SPARK-47007][SQL] Add the MapSort expression Mar 22, 2024

MaxGekk approved these changes Mar 22, 2024

cloud-fan approved these changes Mar 22, 2024

MaxGekk closed this in c94090e Mar 22, 2024

stevomitric mentioned this pull request Mar 22, 2024

[SPARK-47430][SQL] Support GROUP BY for MapType #45549

Closed
