[SPARK-47007][SQL][PYTHON][R][CONNECT] Add the `map_sort` function by stefankandic · Pull Request #45069 · apache/spark

stefankandic · 2024-02-08T13:58:42Z

What changes were proposed in this pull request?

Adding a new function map_sort to:

Scala API
Python API
R API
Spark Connect Scala Client
Spark Connect Python Client

Why are the changes needed?

To support GROUP BY on map types, we first need to be able to sort maps by their keys.
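As a rough illustration of the motivation (plain Python, not Spark code): if a map is stored as a list of (key, value) pairs, two maps with the same entries in a different physical order look different until the entries are sorted by key. The helper names `normalize` and `group_by_map` below are hypothetical and exist only for this sketch.

```python
from collections import defaultdict

def normalize(entries):
    """Return the map's entries as a tuple sorted by key (hypothetical helper)."""
    return tuple(sorted(entries, key=lambda kv: kv[0]))

def group_by_map(rows):
    """Count rows per map, treating maps with equal entries as the same group."""
    counts = defaultdict(int)
    for entries in rows:
        counts[normalize(entries)] += 1
    return dict(counts)

rows = [
    [(3, "c"), (1, "a")],
    [(1, "a"), (3, "c")],  # same map, different physical order
    [(2, "b")],
]
groups = group_by_map(rows)
```

Without the normalization step, the first two rows would land in separate groups.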

Does this PR introduce any user-facing change?

Yes, new function map_sort

How was this patch tested?

With new UTs

Was this patch authored or co-authored using generative AI tooling?

No

LuciferYang · 2024-02-10T02:12:13Z

Should re-run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *ExpressionsSchemaSuite" to re-generate golden files

zhengruifeng · 2024-03-07T04:27:03Z

Updated the title since it also touches python/r/connect.

HyukjinKwon · 2024-03-14T01:41:13Z



+@_try_remote_functions
+def map_sort(col: "ColumnOrName", asc: bool = True) -> Column:


let's also put it in the docs at python/docs/source/reference/pyspark.sql/functions.rst.

HyukjinKwon · 2024-03-14T01:42:17Z

+   * @since 4.0.0
+   */
+  def map_sort(e: Column): Column = map_sort(e, asc = true)
+  // TODO: add test for this


Let's remove this.

HyukjinKwon · 2024-03-14T01:42:48Z

cc @cloud-fan too

MaxGekk · 2024-03-15T15:45:02Z

+          DataTypeMismatch(
+            errorSubClass = "UNEXPECTED_INPUT_TYPE",
+            messageParameters = Map(
+              "paramIndex" -> "2",


Please, use ordinalNumber(1). See #45177

MaxGekk · 2024-03-15T15:45:42Z

+      DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> "1",


MaxGekk · 2024-03-15T15:46:14Z

+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> "1",
+          "requiredType" -> toSQLType(ArrayType),


array? not the map type?

MaxGekk · 2024-03-15T15:47:26Z

+  override def checkInputDataTypes(): TypeCheckResult = base.dataType match {
+    case MapType(kt, _, _) if RowOrdering.isOrderable(kt) =>
+      ascendingOrder match {
+        case Literal(_: Boolean, BooleanType) =>


Is requiring a Literal too strict? How about foldable?

This was set due to convention of SortArray.checkInputDataTypes which uses Literal as well.

MaxGekk · 2024-03-17T16:33:11Z

+    val sortedKeys = Array
+      .tabulate(numElements)(i => (keys.get(i, keyType).asInstanceOf[Any], i))
+      .sortBy(_._1)(ordering)
+
+    val newKeys = new Array[Any](numElements)
+    val newValues = new Array[Any](numElements)
+
+    sortedKeys.zipWithIndex.foreach { case (elem, index) =>
+      newKeys(index) = keys.get(elem._2, keyType)
+      newValues(index) = values.get(elem._2, valueType)
+    }
+
+    new ArrayBasedMapData(new GenericArrayData(newKeys), new GenericArrayData(newValues))


Just wondering, wouldn't this be easier?

    val sorted = Array
      .tabulate(numElements)(i => (keys.get(i, keyType), values.get(i, valueType)))
      .sortBy(_._1)(ordering)
    ArrayBasedMapData(sorted.map(_._1), sorted.map(_._2))

Or is there another reason, such as a simpler Java implementation?

Just wanted to replace existing tree-based sorting.
Replaced with this shortened impl.
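For readers comparing the two approaches discussed above, here is a minimal plain-Python sketch (not the Scala from the PR). It assumes keys and values live in two parallel lists; the function names are hypothetical.

```python
def sort_via_indices(keys, values):
    """Original idea: order the indices by key, then gather keys and values."""
    order = sorted(range(len(keys)), key=lambda i: keys[i])
    return [keys[i] for i in order], [values[i] for i in order]

def sort_via_pairs(keys, values):
    """Suggested idea: sort (key, value) pairs directly, then split them."""
    pairs = sorted(zip(keys, values), key=lambda kv: kv[0])
    return [k for k, _ in pairs], [v for _, v in pairs]

keys, values = [3, 1, 2], ["c", "a", "b"]
by_indices = sort_via_indices(keys, values)
by_pairs = sort_via_pairs(keys, values)
```

Both produce the same result; the pair-based version simply avoids the second lookup through the original index.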

MaxGekk · 2024-03-18T06:41:21Z


+@ExpressionDescription(
+  usage = """
+    _FUNC_(map[, ascendingOrder]) - Sorts the input map in ascending or descending order


Could you leave a high-level description of the expression here, and add an arguments section with detailed descriptions of map and ascendingOrder?

Since the expression focuses on sorting, please describe the characteristics of the sort algorithm: whether it is stable, etc.

MaxGekk · 2024-03-18T06:55:04Z

    )
  }

+  test("map_sort function") {


Could you cover more corner cases like:

empty map

duplicate keys, see the config spark.sql.mapKeyDedupPolicy

null keys

check the error class INVALID_ORDERING_TYPE

Updated map_sort tests
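A rough plain-Python sketch of two of the corner cases listed above (empty map and a key type that cannot be ordered). This is only an analogy: Python detects unorderable keys at run time while sorting, and the error label is reused here purely as a stand-in. The name `sort_map_entries` is hypothetical.

```python
def sort_map_entries(entries):
    """Sort (key, value) pairs by key; reject keys that cannot be ordered."""
    try:
        return sorted(entries, key=lambda kv: kv[0])
    except TypeError as exc:
        # Stand-in for the INVALID_ORDERING_TYPE check mentioned above.
        raise ValueError("INVALID_ORDERING_TYPE") from exc

empty_result = sort_map_entries([])

try:
    sort_map_entries([(1j, "a"), (2j, "b")])  # complex keys have no ordering
    unorderable_rejected = False
except ValueError:
    unorderable_rejected = True
```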

MaxGekk · 2024-03-18T14:36:03Z

+      exception = intercept[SparkRuntimeException] {
+        sql("SELECT map_sort(map(1, 1, 2, 2, 1, 1))").collect()
+      },
+      errorClass = "DUPLICATED_MAP_KEY",


What's the behaviour when spark.sql.mapKeyDedupPolicy is LAST_WIN?

It is my understanding that MAP_KEY_DEDUP_POLICY only applies to map creation - map_sort accepts already created maps so the policy will be caught before it even gets to the logic for this function (?)

ok, let's remove this test.
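The point made in this thread can be sketched in plain Python under the assumption that the dedup policy is applied when the map is built, so the sort only ever sees unique keys. `create_map` and `sort_map` are hypothetical helpers that model the policy values EXCEPTION and LAST_WIN; they are not Spark APIs.

```python
def create_map(pairs, dedup_policy="EXCEPTION"):
    """Build a map from (key, value) pairs, applying the dedup policy here."""
    result = {}
    for key, value in pairs:
        if key in result and dedup_policy == "EXCEPTION":
            raise ValueError(f"DUPLICATED_MAP_KEY: {key}")
        result[key] = value  # under LAST_WIN the later value overwrites
    return result

def sort_map(m):
    """Sort an already-built map by key; duplicates cannot reach this point."""
    return dict(sorted(m.items(), key=lambda kv: kv[0]))

last_win = sort_map(create_map([(2, "x"), (1, "a"), (2, "b")], "LAST_WIN"))

try:
    create_map([(1, 1), (2, 2), (1, 1)])  # default policy: EXCEPTION
    creation_failed = False
except ValueError:
    creation_failed = True
```

In this model the error is raised by `create_map`, before `sort_map` is ever called, which is why a duplicate-key test says nothing about the sort itself.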

MaxGekk · 2024-03-18T14:39:36Z

+  usage = """
+    _FUNC_(map[, ascendingOrder]) - Sorts the input map in ascending or descending order
+      according to the natural ordering of the map keys. The sorting algorithm used is
+      an adaptive, stable and iterative merge sort algorithm. If the input map is empty,


No need to expose the implementation details like merge-sort. It is enough to mention user-visible behavior like stable.

MaxGekk · 2024-03-18T14:41:35Z

+    """
+    Arguments:
+      * map - an expression. The map that will be sorted.
+      * ascendingOrder - an expression. The ordering in which the map will be sorted.


All args are expressions in general. Could you mention that it is a boolean parameter, and that true means ascending order (and false means descending order)?

I was following the template of the ExpressionDescription for Sequence, which states "- an expression".

Restructured the docs a bit now.
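The documented behavior of the second argument can be summarized with a small plain-Python sketch (an illustration of the semantics only, not the Spark expression). The name `map_sort_sketch` is hypothetical.

```python
def map_sort_sketch(entries, ascending_order=True):
    """Sort (key, value) pairs by key; True means ascending, False descending."""
    if not isinstance(ascending_order, bool):
        # Mirrors the idea of rejecting a non-boolean second argument.
        raise TypeError("ascending_order must be a boolean")
    return sorted(entries, key=lambda kv: kv[0], reverse=not ascending_order)

data = [(3, "c"), (1, "a"), (2, "b")]
ascending = map_sort_sketch(data)
descending = map_sort_sketch(data, False)
```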

MaxGekk · 2024-03-19T06:04:21Z

+  }
+
+  override def nullSafeEval(array: Any, ascending: Any): Any = {
+    // put keys and their respective indices inside a tuple


This comment is no longer relevant.

MaxGekk · 2024-03-19T06:14:03Z

+    val boxedKeyType = CodeGenerator.boxedType(keyType)
+    val javaKeyType = CodeGenerator.javaType(keyType)
+
+    val simpleEntryType = s"java.util.AbstractMap.SimpleEntry<$boxedKeyType, Integer>"


Could you align the implementation with the non-codegen one, and store values instead of indexes?

MaxGekk · 2024-03-19T06:22:38Z

+      exception = intercept[SparkRuntimeException] {
+        sql("SELECT map_sort(map(1, 1, 2, 2, 1, 1))").collect()
+      },
+      errorClass = "DUPLICATED_MAP_KEY",


ok, let's remove this test.

zhengruifeng · 2024-03-19T08:10:59Z

+
+    >>> import pyspark.sql.functions as sf
+    >>> df = spark.sql("SELECT map(3, 'c', 1, 'a', 2, 'b') as data")
+    >>> df.select(sf.map_sort(df.data, False)).show()


MaxGekk

@stevomitric @stefankandic Could you update the PR's description and title, and fix the function and expression names according to your changes:

Function:

map_sort

Expression:

case class MapSort

MaxGekk · 2024-03-19T15:54:38Z

+    +--------------------+
+    |map_sort(data, true)|
+    +--------------------+
+    |{1 -> a, 2 -> b, ...|


If truncate = False, why is it truncated?

I have double checked locally:

>>> df.select(sf.map_sort(df.data)).show(truncate=False)
+------------------------+
|map_sort(data, true)    |
+------------------------+
|{1 -> a, 2 -> b, 3 -> c}|
+------------------------+
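To see where the `...` in the docstring output comes from, here is a plain-Python sketch of the default cell truncation. The rule below (cut to 20 characters, with the last three replaced by dots) is my understanding of what `show()` does by default and should be treated as an assumption; `truncate_cell` is a hypothetical helper, and `truncate=0` stands in for `truncate=False`.

```python
def truncate_cell(text, truncate=20):
    """Shorten a cell roughly the way show() does by default (assumed rule)."""
    if truncate > 0 and len(text) > truncate:
        if truncate < 4:
            return text[:truncate]
        return text[: truncate - 3] + "..."
    return text

cell = "{1 -> a, 2 -> b, 3 -> c}"
default_view = truncate_cell(cell)           # what a plain show() would print
full_view = truncate_cell(cell, truncate=0)  # models show(truncate=False)
```

The 24-character map string exceeds the 20-character limit, so the truncated form appears unless truncation is switched off.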

MaxGekk · 2024-03-19T15:54:56Z

+    +---------------------+
+    |map_sort(data, false)|
+    +---------------------+
+    | {3 -> c, 2 -> b, ...|


The output probably needs to be fixed.

MaxGekk · 2024-03-19T15:57:19Z

+              "inputType" -> toSQLType(ascendingOrder.dataType))
+          )
+      }
+    case MapType(_, _, _) =>


Let's not depend on the number of parameters in MapType:

Suggested change

-    case MapType(_, _, _) =>
+    case _: MapType =>

See https://github.com/databricks/scala-style-guide?tab=readme-ov-file#pattern-matching

MaxGekk · 2024-03-19T16:01:56Z

+  override def dataType: DataType = base.dataType
+
+  override def checkInputDataTypes(): TypeCheckResult = base.dataType match {
+    case MapType(kt, _, _) if RowOrdering.isOrderable(kt) =>


Suggested change

-    case MapType(kt, _, _) if RowOrdering.isOrderable(kt) =>
+    case m: MapType if RowOrdering.isOrderable(m.keyType) =>

MaxGekk · 2024-03-19T16:03:43Z

+    val c = ctx.freshName("c")
+    val newKeys = ctx.freshName("newKeys")
+    val newValues = ctx.freshName("newValues")
+    val originalIndex = ctx.freshName("originalIndex")


Is it used somewhere?

MaxGekk · 2024-03-20T04:59:06Z

+1, LGTM. Merging to master.
Thank you, @stevomitric @stefankandic and @HyukjinKwon @zhengruifeng for review.

cloud-fan · 2024-03-20T10:52:14Z

Sorry I missed this. Why do we add this public function? Do other systems have it? To support GROUP BY map type, an internal MapSort expression is sufficient.

MaxGekk · 2024-03-20T19:10:51Z

Do other systems have it?

@stevomitric @stefankandic Could you check other systems, please.

cloud-fan · 2024-03-21T02:49:40Z

I can't find it in other systems, and it does not make sense as map elements are order-less. I'm reverting it, please re-submit it without exposing the function publicly.

dongjoon-hyun · 2024-03-21T03:39:17Z

+1 for Wenchen's decision. Thank you for reverting.

stevomitric · 2024-03-21T13:17:50Z

Created new PR which omits creating new map_sort function - @MaxGekk @cloud-fan

github-actions Bot added the SQL label Feb 8, 2024

HyukjinKwon changed the title ~~[SPARK-47007] SortMap function~~ [SPARK-47007][SQL] SortMap function Feb 10, 2024

HyukjinKwon reviewed Feb 10, 2024

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala Outdated

github-actions Bot added DOCS PYTHON R CONNECT labels Feb 14, 2024

stefankandic changed the title ~~[SPARK-47007][SQL] SortMap function~~ [SPARK-47007][SQL] MapSort function Feb 28, 2024

stefankandic requested a review from HyukjinKwon February 28, 2024 16:39

stefankandic added 8 commits February 29, 2024 09:48

initial working version

a081649

add golden files

1441549

add map sort to other languages

1be06e3

fix typoes

249e903

fix scalastyle issue

aaae883

add proto golden files

acaf95e

fix python function call

5619fdb

fix ci errors

7754c14

stefankandic force-pushed the SPARK-47007 branch from 8e4b9ef to 7754c14 Compare February 29, 2024 08:49

fix ci checks

f0ebf5d

zhengruifeng changed the title ~~[SPARK-47007][SQL] MapSort function~~ [SPARK-47007][SQL][PYTHON][R][CONNECT] MapSort function Mar 7, 2024

stevomitric added 3 commits March 12, 2024 17:44

Optimized map-sort by switching to array sorting

1f78167

Potential tests fix

a5eb480

Potential tests fix 2

9497f99

HyukjinKwon reviewed Mar 14, 2024

HyukjinKwon approved these changes Mar 14, 2024

MaxGekk requested changes Mar 15, 2024

MaxGekk reviewed Mar 18, 2024

MaxGekk requested changes Mar 18, 2024

stevomitric added 2 commits March 18, 2024 10:36

Shortened map sort function and added more docs

ab70f1e

updated map_sort test suite

e79d65c

MaxGekk reviewed Mar 18, 2024

stevomitric and others added 3 commits March 18, 2024 16:08

Update sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunction…

a435355

…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>

Update sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunction…

c9901d0

…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>

docs fix

da6a710

MaxGekk reviewed Mar 19, 2024

zhengruifeng reviewed Mar 19, 2024

Comment thread python/pyspark/sql/functions/builtin.py Outdated

zhengruifeng reviewed Mar 19, 2024

MaxGekk requested changes Mar 19, 2024

stevomitric and others added 3 commits March 19, 2024 09:26

Updated codegen and removed once test-case

81008c2

Update python/pyspark/sql/functions/builtin.py

86b29c5

Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>

Updated 'select.show' to give more info in map_sort desc

c08ab6c

stefankandic changed the title ~~[SPARK-47007][SQL][PYTHON][R][CONNECT] MapSort function~~ [SPARK-47007][SQL][PYTHON][R][CONNECT] map_sort function Mar 19, 2024

MaxGekk reviewed Mar 19, 2024

MaxGekk changed the title ~~[SPARK-47007][SQL][PYTHON][R][CONNECT] map_sort function~~ [SPARK-47007][SQL][PYTHON][R][CONNECT] Add the map_sort function Mar 19, 2024

MaxGekk requested changes Mar 19, 2024

Restructured docs, removed unused variable and refactored code

31a797c

MaxGekk approved these changes Mar 20, 2024

MaxGekk closed this in 747846b Mar 20, 2024


