[SPARK-47007][SQL][PYTHON][R][CONNECT] Add the map_sort function #45069
stefankandic wants to merge 22 commits into
Should re-run
8e4b9ef to 7754c14
Updated the title since it also touches python/r/connect.
    @_try_remote_functions
    def map_sort(col: "ColumnOrName", asc: bool = True) -> Column:
Let's also add it to the docs at python/docs/source/reference/pyspark.sql/functions.rst.
     * @since 4.0.0
     */
    def map_sort(e: Column): Column = map_sort(e, asc = true)
    // TODO: add test for this
cc @cloud-fan too
    DataTypeMismatch(
      errorSubClass = "UNEXPECTED_INPUT_TYPE",
      messageParameters = Map(
        "paramIndex" -> "2",

    DataTypeMismatch(
      errorSubClass = "UNEXPECTED_INPUT_TYPE",
      messageParameters = Map(
        "paramIndex" -> "1",

      errorSubClass = "UNEXPECTED_INPUT_TYPE",
      messageParameters = Map(
        "paramIndex" -> "1",
        "requiredType" -> toSQLType(ArrayType),
    override def checkInputDataTypes(): TypeCheckResult = base.dataType match {
      case MapType(kt, _, _) if RowOrdering.isOrderable(kt) =>
        ascendingOrder match {
          case Literal(_: Boolean, BooleanType) =>
Is requiring a Literal too strict? How about foldable?
This follows the convention of SortArray.checkInputDataTypes, which also uses Literal.
    val sortedKeys = Array
      .tabulate(numElements)(i => (keys.get(i, keyType).asInstanceOf[Any], i))
      .sortBy(_._1)(ordering)

    val newKeys = new Array[Any](numElements)
    val newValues = new Array[Any](numElements)

    sortedKeys.zipWithIndex.foreach { case (elem, index) =>
      newKeys(index) = keys.get(elem._2, keyType)
      newValues(index) = values.get(elem._2, valueType)
    }

    new ArrayBasedMapData(new GenericArrayData(newKeys), new GenericArrayData(newValues))
Just wondering, wouldn't this be simpler?
    val sorted = Array
      .tabulate(numElements)(i => (keys.get(i, keyType), values.get(i, valueType)))
      .sortBy(_._1)(ordering)
    ArrayBasedMapData(sorted.map(_._1), sorted.map(_._2))
Or is there another reason, like a simpler Java implementation?
I just wanted to replace the existing tree-based sorting.
Replaced it with this shortened implementation.
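To illustrate the simplification discussed in this thread, here is a minimal Python sketch (not the Spark implementation). It pairs each key with its value, sorts the pairs by key, and splits them back into parallel sequences. The name `sort_map_entries` and the list-based map representation are assumptions made purely for illustration.

```python
from typing import Any, List, Tuple


def sort_map_entries(
    keys: List[Any], values: List[Any], ascending: bool = True
) -> Tuple[List[Any], List[Any]]:
    """Sort parallel key/value lists by key and return new parallel lists."""
    # Pair every key with its value so the two travel together while sorting.
    entries = list(zip(keys, values))
    # Only the key takes part in the comparison; sorted() is a stable sort.
    entries = sorted(entries, key=lambda kv: kv[0], reverse=not ascending)
    # Split the sorted pairs back into a key list and a value list.
    return [k for k, _ in entries], [v for _, v in entries]


sorted_keys, sorted_values = sort_map_entries([3, 1, 2], ["c", "a", "b"])
print(sorted_keys, sorted_values)  # [1, 2, 3] ['a', 'b', 'c']
```

Carrying the values along with the keys avoids the second pass that looks each value up through a saved index, which is the point of the suggestion above.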
    @ExpressionDescription(
      usage = """
        _FUNC_(map[, ascendingOrder]) - Sorts the input map in ascending or descending order
Could you leave a high-level description of the expression here, and add arguments with detailed descriptions of map and ascendingOrder?
Since the expression focuses on sorting, describe the characteristics of the sort algorithm: whether it is stable, etc.
      )
    }

    test("map_sort function") {
Could you cover more corner cases, such as:
- empty map
- duplicate keys, see the config spark.sql.mapKeyDedupPolicy
- null keys
- check the error class INVALID_ORDERING_TYPE
Updated the map_sort tests.
      exception = intercept[SparkRuntimeException] {
        sql("SELECT map_sort(map(1, 1, 2, 2, 1, 1))").collect()
      },
      errorClass = "DUPLICATED_MAP_KEY",
What's the behaviour when spark.sql.mapKeyDedupPolicy is LAST_WIN?
My understanding is that MAP_KEY_DEDUP_POLICY only applies to map creation. map_sort accepts already-created maps, so the policy takes effect before execution even reaches the logic of this function (?)
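The point made in the reply can be sketched with a toy model in plain Python. This is not Spark code: `build_map` is a hypothetical stand-in for map creation, and its two policy names only mirror the EXCEPTION and LAST_WIN values of spark.sql.mapKeyDedupPolicy. The sketch assumes, as the reply does, that duplicates are resolved when the map is built, so any later sort only ever sees unique keys.

```python
from typing import Any, Dict, Iterable, Tuple


def build_map(pairs: Iterable[Tuple[Any, Any]], policy: str = "EXCEPTION") -> Dict[Any, Any]:
    """Toy model of map creation that applies a key dedup policy."""
    if policy not in ("EXCEPTION", "LAST_WIN"):
        raise ValueError(f"unknown policy: {policy}")
    result: Dict[Any, Any] = {}
    for key, value in pairs:
        if key in result and policy == "EXCEPTION":
            # Duplicates are rejected here, before any sorting could happen.
            raise ValueError(f"duplicated map key: {key!r}")
        # Under LAST_WIN the most recent value simply overwrites the older one.
        result[key] = value
    return result


deduped = build_map([(2, "b"), (1, "a"), (2, "z")], policy="LAST_WIN")
# Sorting runs on the already-deduplicated map.
print(sorted(deduped.items()))  # [(1, 'a'), (2, 'z')]
```

In this model the sorting step needs no policy handling of its own, which matches the behaviour the test above expects: the error is raised while the map literal is constructed.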
      usage = """
        _FUNC_(map[, ascendingOrder]) - Sorts the input map in ascending or descending order
          according to the natural ordering of the map keys. The sorting algorithm used is
          an adaptive, stable and iterative merge sort algorithm. If the input map is empty,
No need to expose implementation details like merge sort. It is enough to mention user-visible behavior such as stability.
      """
        Arguments:
          * map - an expression. The map that will be sorted.
          * ascendingOrder - an expression. The ordering in which the map will be sorted.
All args are expressions in general. Could you mention that it is a boolean parameter, and that true means ascending order (and false means descending order)?
I was following the template of the Sequence ExpressionDescription, which states "- an expression".
Restructured the docs a bit now.
…sSuite.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
    }

    override def nullSafeEval(array: Any, ascending: Any): Any = {
      // put keys and their respective indices inside a tuple
    val boxedKeyType = CodeGenerator.boxedType(keyType)
    val javaKeyType = CodeGenerator.javaType(keyType)

    val simpleEntryType = s"java.util.AbstractMap.SimpleEntry<$boxedKeyType, Integer>"
Could you align the implementation with the non-codegen path, and store values instead of indexes?
    >>> import pyspark.sql.functions as sf
    >>> df = spark.sql("SELECT map(3, 'c', 1, 'a', 2, 'b') as data")
    >>> df.select(sf.map_sort(df.data, False)).show()
MaxGekk left a comment:
@stevomitric @stefankandic Could you update the PR's description and title, and fix the function and expression names according to your changes:
Function: map_sort
Expression: case class MapSort
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
    +--------------------+
    |map_sort(data, true)|
    +--------------------+
    |{1 -> a, 2 -> b, ...|
If truncate = False, why is it truncated?
I have double-checked locally:
    >>> df.select(sf.map_sort(df.data)).show(truncate=False)
    +------------------------+
    |map_sort(data, true)    |
    +------------------------+
    |{1 -> a, 2 -> b, 3 -> c}|
    +------------------------+

    +---------------------+
    |map_sort(data, false)|
    +---------------------+
    | {3 -> c, 2 -> b, ...|
The output here probably needs to be fixed as well.
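For context on the truncated cells: as far as I know, `show()` shortens any cell longer than 20 characters by default, keeping the first 17 characters and appending `...`. The helper below is a hypothetical plain-Python model of that rule, not PySpark code; `truncate_cell` is an illustrative name, and a width of 0 stands in for `truncate=False`.

```python
def truncate_cell(cell: str, truncate: int = 20) -> str:
    """Model of how a table cell is shortened for display."""
    # A non-positive width stands in for truncate=False: show the cell in full.
    if truncate <= 0 or len(cell) <= truncate:
        return cell
    # Very small widths leave no room for an ellipsis.
    if truncate < 4:
        return cell[:truncate]
    # Otherwise keep truncate - 3 characters and mark the cut with "...".
    return cell[: truncate - 3] + "..."


full = "{1 -> a, 2 -> b, 3 -> c}"
print(truncate_cell(full))              # {1 -> a, 2 -> b, ...
print(truncate_cell(full, truncate=0))  # {1 -> a, 2 -> b, 3 -> c}
```

Under this model, a docstring example that passes `truncate=False` should show the full 24-character cell, which is what the local run above produced.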
        "inputType" -> toSQLType(ascendingOrder.dataType))
      )
    }
    case MapType(_, _, _) =>
Let's not depend on the number of parameters in MapType:
    - case MapType(_, _, _) =>
    + case _: MapType =>
See https://github.com/databricks/scala-style-guide?tab=readme-ov-file#pattern-matching
    override def dataType: DataType = base.dataType

    override def checkInputDataTypes(): TypeCheckResult = base.dataType match {
      case MapType(kt, _, _) if RowOrdering.isOrderable(kt) =>
Suggested change:
    - case MapType(kt, _, _) if RowOrdering.isOrderable(kt) =>
    + case m: MapType if RowOrdering.isOrderable(m.keyType) =>
    val c = ctx.freshName("c")
    val newKeys = ctx.freshName("newKeys")
    val newValues = ctx.freshName("newValues")
    val originalIndex = ctx.freshName("originalIndex")
+1, LGTM. Merging to master.
Sorry, I missed this. Why do we add this public function? Do other systems have it? To support GROUP BY map type, an internal
@stevomitric @stefankandic Could you check other systems, please?
I can't find it in other systems, and it does not make sense since map elements are unordered. I'm reverting it; please re-submit it without exposing the function publicly.
+1 for Wenchen's decision. Thank you for reverting.
Created new PR which omits creating new
What changes were proposed in this pull request?
Adding a new function map_sort to:
Why are the changes needed?
To add the ability to GROUP BY on map types, we first have to be able to sort maps by their keys.
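The motivation can be sketched in plain Python (an illustration only, not Spark's grouping code). Maps are modelled here as lists of (key, value) pairs, so two maps with the same entries in a different order compare as unequal until they are sorted by key. The names `canonical_key` and `group_by_map` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

MapEntries = List[Tuple[Any, Any]]


def canonical_key(entries: MapEntries) -> Tuple[Tuple[Any, Any], ...]:
    """Return a hashable, order-independent form of a map by sorting on key."""
    return tuple(sorted(entries, key=lambda kv: kv[0]))


def group_by_map(rows: List[Tuple[MapEntries, Any]]) -> Dict[tuple, List[Any]]:
    """Group payloads by their map column, ignoring entry order."""
    groups: Dict[tuple, List[Any]] = defaultdict(list)
    for entries, payload in rows:
        # Sorting first makes equal maps land in the same group.
        groups[canonical_key(entries)].append(payload)
    return dict(groups)


rows = [
    ([(1, "a"), (2, "b")], "x"),
    ([(2, "b"), (1, "a")], "y"),  # same map, different entry order
    ([(3, "c")], "z"),
]
print(group_by_map(rows))
```

Without the sort, the first two rows would fall into separate groups even though they hold the same map.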
Does this PR introduce any user-facing change?
Yes, new function map_sort.
How was this patch tested?
With new UTs
Was this patch authored or co-authored using generative AI tooling?
No