[SPARK-23936][SQL] Implement map_concat#21073
Conversation
|
Test build #89369 has finished for PR 21073 at commit
|
|
Test build #89378 has finished for PR 21073 at commit
|
There was a problem hiding this comment.
What's the result of map_concat(NULL, NULL)?
There was a problem hiding this comment.
@henryr empty map:
scala> df.select(map_concat('map1, 'map2).as('newMap)).show
+------+
|newMap|
+------+
| []|
| []|
+------+
Presto docs (from which the proposed spec comes) are quiet on the matter. Even after looking at the Presto code, I am still hard-pressed to say.
I did divine from the Presto code that there should be at least two inputs (and I don't currently verify that).
There was a problem hiding this comment.
Hm, seems a bit unusual to me to have, in effect, NULL ++ NULL => Map(). I checked with Presto and it looks like it returns NULL:
presto> select map_concat(NULL, NULL)
-> ;
_col0
-------
NULL
(1 row)
There was a problem hiding this comment.
@henryr Since Presto is the reference, map_concat should return NULL in this case. I will update.
There was a problem hiding this comment.
@henryr Another quick test of Presto also shows that if any input is NULL, the result is NULL:
presto:default> SELECT map_concat(NULL, map(ARRAY[1,3], ARRAY[2,4])); _col0 ------- NULL (1 row)
Looks like I need to check if any input is NULL.
|
Test build #89473 has finished for PR 21073 at commit
|
44137cc to
d3d6ad6
Compare
|
Test build #89523 has finished for PR 21073 at commit
|
|
Test build #89579 has finished for PR 21073 at commit
|
There was a problem hiding this comment.
Since this logic is big enough (and similar enough to the logic in eval), I wonder if the merge logic should be moved to a utility class and called from both eval as well as the generated code.
The FromUTCTimestamp expression does something sort of like that, where the eval method as well as the generated code both call utility functions in the DateTimeUtils scala object. Also, the Concat expression's eval method and generated code both call utility functions on UTF8String (although in this case, UTF8String is a Java class).
There was a problem hiding this comment.
FWIW, I don't really feel strongly either way here. The codegen method isn't so large as to be hard to understand yet.
|
cc @ueshin |
|
Test build #89590 has finished for PR 21073 at commit
|
henryr
left a comment
There was a problem hiding this comment.
This looks pretty good to me, would be good to have one of the people most familiar with codegen take a look.
There was a problem hiding this comment.
are the casts to Object necessary?
There was a problem hiding this comment.
is there one extra space before if?
There was a problem hiding this comment.
good idea to check Seq(mNull, m0) as well in case there's any asymmetry in the way the first argument is handled.
There was a problem hiding this comment.
can you put a blank line between tests? makes it a bit easier to see the separation.
There was a problem hiding this comment.
FWIW, I don't really feel strongly either way here. The codegen method isn't so large as to be hard to understand yet.
There was a problem hiding this comment.
what's this for?
Excellent question. I don't know, except that it seems sometimes the first column is a list of columns. I used other functions as a template.
|
Test build #89759 has finished for PR 21073 at commit
|
|
retest this please |
|
Test build #89785 has finished for PR 21073 at commit
|
|
@gatorsmile this looks ready for your review (asking because you filed the JIRA) if you time, thanks! |
There was a problem hiding this comment.
What about override def nullable: Boolean = children.exists(_.nullable)?
There was a problem hiding this comment.
Use cases with children.size < 2 don't make sense but I think that all functions with a variable number of children should behave the same way. Check implementation of Concat and Concat_ws.
There was a problem hiding this comment.
Do you need to inherit from CodegenFallback if you've overriden doGenCode?
There was a problem hiding this comment.
I think you should add handling of nulls when values are of a primitive type.
There was a problem hiding this comment.
Please add more test cases with null values.
|
Test build #89944 has finished for PR 21073 at commit
|
There was a problem hiding this comment.
How about merging these two lines into one line import org.apache.spark.sql.catalyst.util._?
|
retest this please |
1 similar comment
|
retest this please |
|
Test build #89993 has finished for PR 21073 at commit
|
|
A test failed with "./bin/spark-submit ... No such file or directory" Seems like there's lots of spurious test failures right now. I will hold off on re-running for a little while. |
|
retest this please |
There was a problem hiding this comment.
Update: The below appears to be by design (see SPARK-9415). That is, MapData objects explicitly should not support hashCode or equality. There is even a test for this. As a result, concatenating two Maps with keys that are also Maps can result in duplicate keys in the resulting map. Adding hashCode and equals fixed the issue, but violates the basis for SPARK-9415. Any opinion @rxin @viirya @gatorsmile? (pinging people on that Jira).
I found an issue. I was preparing to add some more tests when I noticed that using maps as keys doesn't work well in interpreted mode (seems to work fine in codegen mode, so far).
So, something like this doesn't work in interpreted mode (and in some cases gencode mode):
scala> dfmapmap.show(truncate=false)
+--------------------------------------------------+---------------------------------------------+
|mapmap1 |mapmap2 |
+--------------------------------------------------+---------------------------------------------+
|[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102]|[[11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
+--------------------------------------------------+---------------------------------------------+
scala> dfmapmap.select(map_concat('mapmap1, 'mapmap2).as('mapmap3)).show(truncate=false)
+-----------------------------------------------------------------------------------------------+
|mapmap3 |
+-----------------------------------------------------------------------------------------------+
|[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102, [11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
+-----------------------------------------------------------------------------------------------+
As you can see, the key [1 -> 2, 3 -> 4] shows up twice in the new map.
This is because:
val a1 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4))) val a2 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4))) a1 == a2 // will be false a1.hashCode() == a2.hashCode() // will be false
Different instances of ArrayBasedMapData with the exact same data are not considered the same key. The same seems to be the case for UnsafeMapData as well (but usually works out in gencode mode only because of some magic under the hood that returns the same reference for identical keys).
There was a problem hiding this comment.
@bersprockets Hi, thanks for the investigation. We don't need to care about key duplication like CreateMap for now.
|
Test build #90020 has finished for PR 21073 at commit
|
There was a problem hiding this comment.
nit: we can use ArrayBasedMapData.apply().
There was a problem hiding this comment.
Use ctx.splitExpressionsWithCurrentInputs() or something to avoid exceeding JVM limit.
|
Test build #90091 has finished for PR 21073 at commit
|
There was a problem hiding this comment.
Just a question. What happens if union.entrySet().toArray() has more than 0x7FFF_FFFF elements?
There was a problem hiding this comment.
I would imagine bad things would happen before you got this far (even Map's size method returns an Int).
|
Test build #92662 has finished for PR 21073 at commit
|
|
Test build #92660 has finished for PR 21073 at commit
|
|
Jenkins, retest this please. |
|
LGTM pending Jenkins. |
|
Test build #92672 has finished for PR 21073 at commit
|
|
retest this please. |
|
Test build #92689 has finished for PR 21073 at commit
|
|
I'd retrigger the build again, just in case. |
|
Jenkins, retest this please. |
|
Test build #92728 has finished for PR 21073 at commit
|
|
Jenkins, retest this please. |
|
Test build #92733 has finished for PR 21073 at commit
|
|
Jenkins, retest this please. |
|
Test build #92740 has finished for PR 21073 at commit
|
|
Jenkins, retest this please. |
|
Test build #92745 has finished for PR 21073 at commit
|
|
Thanks! merging to master. |
|
@ueshin Thanks for all your help! |
What changes were proposed in this pull request?
Implement map_concat high order function.
This implementation does not pick a winner when the specified maps have overlapping keys. Therefore, this implementation preserves existing duplicate keys in the maps and potentially introduces new duplicates (After discussion with @ueshin, we settled on option 1 from here).
How was this patch tested?
New tests
Manual tests
Run all sbt SQL tests
Run all pyspark sql tests