[SPARK-23936][SQL] Implement map_concat by bersprockets · Pull Request #21073 · apache/spark

bersprockets · 2018-04-15T03:42:23Z

What changes were proposed in this pull request?

Implement the map_concat higher-order function.

This implementation does not pick a winner when the specified maps have overlapping keys. It therefore preserves any duplicate keys already present in the input maps and may introduce new duplicates. (After discussion with @ueshin, we settled on option 1 from here.)
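The duplicate-key behaviour described above can be illustrated with a small plain-Python model. This is only a sketch, not Spark code: representing a map as a list of (key, value) pairs is an assumption made here so that duplicates stay visible, and `map_concat_model` is a hypothetical name.

```python
from typing import Any, List, Tuple

# Hypothetical model: a map is an ordered list of (key, value) entries,
# so duplicate keys stay visible instead of being collapsed.
MapEntries = List[Tuple[Any, Any]]


def map_concat_model(*maps: MapEntries) -> MapEntries:
    """Append the entries of every input in order; no key ever 'wins'."""
    result: MapEntries = []
    for m in maps:
        result.extend(m)
    return result


left = [(1, "a"), (2, "b")]
right = [(2, "c"), (3, "d")]
merged = map_concat_model(left, right)
print(merged)  # key 2 appears twice, once from each input
```

A dict-based model would silently keep only the last value for key 2, which is exactly the "pick a winner" behaviour this PR chose not to implement.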

How was this patch tested?

New tests
Manual tests
Run all sbt SQL tests
Run all pyspark sql tests

SparkQA · 2018-04-15T06:50:04Z

Test build #89369 has finished for PR 21073 at commit d04893b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-04-15T20:51:12Z

Test build #89378 has finished for PR 21073 at commit 97cffbe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

henryr · 2018-04-16T23:37:30Z

What's the result of map_concat(NULL, NULL)?

@henryr empty map:

    scala> df.select(map_concat('map1, 'map2).as('newMap)).show
    +------+
    |newMap|
    +------+
    |    []|
    |    []|
    +------+

Presto docs (from which the proposed spec comes) are quiet on the matter. Even after looking at the Presto code, I am still hard-pressed to say.

I did divine from the Presto code that there should be at least two inputs (and I don't currently verify that).

Hm, seems a bit unusual to me to have, in effect, NULL ++ NULL => Map(). I checked with Presto and it looks like it returns NULL:

    presto> select map_concat(NULL, NULL)
         -> ;
     _col0
    -------
     NULL
    (1 row)

@henryr Since Presto is the reference, map_concat should return NULL in this case. I will update.

@henryr Another quick test of Presto also shows that if any input is NULL, the result is NULL:

    presto:default> SELECT map_concat(NULL, map(ARRAY[1,3], ARRAY[2,4]));
     _col0
    -------
     NULL
    (1 row)

Looks like I need to check if any input is NULL.
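The NULL rule observed in Presto above (any NULL input makes the whole result NULL) could be modelled roughly as follows. This is a plain-Python sketch rather than the Spark implementation: `None` stands in for SQL NULL, maps are lists of (key, value) pairs, and `map_concat_null_aware` is a hypothetical name.

```python
from typing import Any, List, Optional, Tuple

MapEntries = List[Tuple[Any, Any]]


def map_concat_null_aware(*maps: Optional[MapEntries]) -> Optional[MapEntries]:
    """Return None (modelling SQL NULL) as soon as any input is None."""
    if any(m is None for m in maps):
        return None
    result: MapEntries = []
    for m in maps:
        result.extend(m)
    return result


m0 = [(1, 2), (3, 4)]
print(map_concat_null_aware(None, None))  # None
print(map_concat_null_aware(None, m0))    # None
print(map_concat_null_aware(m0, None))    # None
```

The check deliberately does not depend on argument position, so a NULL in the first slot and a NULL in a later slot behave the same way.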

SparkQA · 2018-04-18T01:59:42Z

Test build #89473 has finished for PR 21073 at commit 44137cc.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2018-04-18T21:28:31Z

Test build #89523 has finished for PR 21073 at commit d3d6ad6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-04-19T19:23:03Z

Test build #89579 has finished for PR 21073 at commit 62df629.

This patch fails from timeout after a configured wait of `300m`.
This patch merges cleanly.
This patch adds no public classes.

bersprockets · 2018-04-19T20:30:56Z

Since this logic is big enough (and similar enough to the logic in eval), I wonder if the merge logic should be moved to a utility class and called from both eval as well as the generated code.

The FromUTCTimestamp expression does something sort of like that, where the eval method as well as the generated code both call utility functions in the DateTimeUtils scala object. Also, the Concat expression's eval method and generated code both call utility functions on UTF8String (although in this case, UTF8String is a Java class).
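The idea of keeping the merge logic in one utility and calling it from both the interpreted path and the generated code can be sketched in plain Python. Everything here is hypothetical (`merge_entries`, `eval_interpreted`, `compile_generated`); the `exec` call only stands in for Spark's Java code generation and is not how Spark compiles expressions.

```python
from typing import Any, Callable, List, Optional, Tuple

Entries = List[Tuple[Any, Any]]


def merge_entries(maps: List[Optional[Entries]]) -> Optional[Entries]:
    """Shared helper: the single place where the merge logic lives."""
    if any(m is None for m in maps):
        return None
    return [entry for m in maps for entry in m]


def eval_interpreted(maps: List[Optional[Entries]]) -> Optional[Entries]:
    """Interpreted path: call the helper directly."""
    return merge_entries(maps)


def compile_generated(num_inputs: int) -> Callable[..., Optional[Entries]]:
    """'Codegen' path: emit source text that only delegates to the helper."""
    params = ", ".join(f"m{i}" for i in range(num_inputs))
    source = f"def generated({params}):\n    return merge_entries([{params}])\n"
    namespace = {"merge_entries": merge_entries}
    exec(source, namespace)  # stand-in for compiling generated code
    return namespace["generated"]


a = [(1, 2)]
b = [(3, 4)]
generated = compile_generated(2)
print(generated(a, b) == eval_interpreted([a, b]))
```

With this shape the generated source stays a one-line delegation, so the two paths cannot drift apart, at the cost of an extra call per row.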

gatorsmile · 2018-04-19T22:08:35Z

cc @ueshin

SparkQA · 2018-04-20T00:00:09Z

Test build #89590 has finished for PR 21073 at commit a904c17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

henryr

This looks pretty good to me, would be good to have one of the people most familiar with codegen take a look.

henryr · 2018-04-23T22:39:49Z

henryr · 2018-04-23T22:44:06Z

are the casts to Object necessary?

henryr · 2018-04-23T22:45:31Z

is there one extra space before if?

henryr · 2018-04-23T22:46:43Z

good idea to check Seq(mNull, m0) as well in case there's any asymmetry in the way the first argument is handled.

henryr · 2018-04-23T22:47:39Z

can you put a blank line between tests? makes it a bit easier to see the separation.

henryr · 2018-04-23T22:49:49Z

FWIW, I don't really feel strongly either way here. The codegen method isn't so large as to be hard to understand yet.

henryr · 2018-04-23T22:50:41Z

what's this for?

Excellent question. I don't know, except that it seems sometimes the first column is a list of columns. I used other functions as a template.

SparkQA · 2018-04-24T07:05:02Z

Test build #89759 has finished for PR 21073 at commit 13baf96.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

bersprockets · 2018-04-24T13:36:07Z

retest this please

SparkQA · 2018-04-24T17:19:49Z

Test build #89785 has finished for PR 21073 at commit 13baf96.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

henryr · 2018-04-24T23:54:19Z

@gatorsmile this looks ready for your review (asking because you filed the JIRA) if you have time, thanks!

mn-mikke · 2018-04-26T15:39:21Z

What about override def nullable: Boolean = children.exists(_.nullable)?
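A rough plain-Python analogue of the suggested `nullable` override, for readers less familiar with Scala: the result may be null exactly when at least one child may be null. `Expr` and `MapConcatModel` are hypothetical stand-ins, not Spark classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expr:
    """Hypothetical stand-in for a child expression."""
    nullable: bool = False


@dataclass
class MapConcatModel:
    """Hypothetical stand-in for the map_concat expression node."""
    children: List[Expr] = field(default_factory=list)

    @property
    def nullable(self) -> bool:
        # Mirrors children.exists(_.nullable)
        return any(child.nullable for child in self.children)


print(MapConcatModel([Expr(False), Expr(True)]).nullable)   # True
print(MapConcatModel([Expr(False), Expr(False)]).nullable)  # False
```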

mn-mikke · 2018-04-26T15:49:12Z

Use cases with children.size < 2 don't make sense but I think that all functions with a variable number of children should behave the same way. Check implementation of Concat and Concat_ws.

mn-mikke · 2018-04-26T16:02:29Z

Do you need to inherit from CodegenFallback if you've overridden doGenCode?

mn-mikke · 2018-04-26T16:24:59Z

I think you should add handling of nulls when values are of a primitive type.
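One way to picture the concern: a primitive getter cannot express null, so a copy loop has to consult a separate null check first. The sketch below is a plain-Python model under that assumption; `IntValuesModel` and `copy_values` are hypothetical names, and the storage layout (raw ints plus a null mask) is an illustration, not Spark's actual format.

```python
from typing import List, Optional


class IntValuesModel:
    """Hypothetical model of primitive storage: raw ints plus a null mask."""

    def __init__(self, items: List[Optional[int]]) -> None:
        self._raw = [0 if v is None else v for v in items]
        self._is_null = [v is None for v in items]

    def num_elements(self) -> int:
        return len(self._raw)

    def is_null_at(self, i: int) -> bool:
        return self._is_null[i]

    def get_int(self, i: int) -> int:
        # A primitive getter cannot express null; it returns the raw slot.
        return self._raw[i]


def copy_values(values: IntValuesModel) -> List[Optional[int]]:
    """Check the null mask before calling the primitive getter."""
    return [
        None if values.is_null_at(i) else values.get_int(i)
        for i in range(values.num_elements())
    ]


column = IntValuesModel([1, None, 3])
print(column.get_int(1))    # raw slot value, the null is lost
print(copy_values(column))  # null preserved
```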

mn-mikke · 2018-04-26T16:26:38Z

since = "2.4.0"

mn-mikke · 2018-04-26T16:28:23Z

Please add more test cases with null values.

SparkQA · 2018-04-28T01:06:00Z

Test build #89944 has finished for PR 21073 at commit 2e49b1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-04-28T01:50:18Z

How about merging these two lines into one line: import org.apache.spark.sql.catalyst.util._?

bersprockets · 2018-04-30T19:42:36Z

@mn-mikke @kiszk Thanks for the review. I addressed the comments. Please take a look when you have a chance.

bersprockets · 2018-04-30T19:57:30Z

retest this please

bersprockets · 2018-05-01T17:58:19Z

retest this please

SparkQA · 2018-05-01T19:49:11Z

Test build #89993 has finished for PR 21073 at commit d9dccd3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

bersprockets · 2018-05-01T22:58:36Z

A test failed with "./bin/spark-submit ... No such file or directory"

Seems like there are lots of spurious test failures right now. I will hold off on re-running for a little while.

bersprockets · 2018-05-02T01:18:45Z

retest this please

bersprockets · 2018-05-02T04:51:06Z

Update: The below appears to be by design (see SPARK-9415). That is, MapData objects explicitly should not support hashCode or equality. There is even a test for this. As a result, concatenating two Maps with keys that are also Maps can result in duplicate keys in the resulting map. Adding hashCode and equals fixed the issue, but violates the basis for SPARK-9415. Any opinion @rxin @viirya @gatorsmile? (pinging people on that Jira).

I found an issue. I was preparing to add some more tests when I noticed that using maps as keys doesn't work well in interpreted mode (seems to work fine in codegen mode, so far).

So, something like this doesn't work in interpreted mode (and in some cases gencode mode):

    scala> dfmapmap.show(truncate=false)
    +--------------------------------------------------+---------------------------------------------+
    |mapmap1                                           |mapmap2                                      |
    +--------------------------------------------------+---------------------------------------------+
    |[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102]|[[11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
    +--------------------------------------------------+---------------------------------------------+

    scala> dfmapmap.select(map_concat('mapmap1, 'mapmap2).as('mapmap3)).show(truncate=false)
    +-----------------------------------------------------------------------------------------------+
    |mapmap3                                                                                        |
    +-----------------------------------------------------------------------------------------------+
    |[[1 -> 2, 3 -> 4] -> 101, [5 -> 6, 7 -> 8] -> 102, [11 -> 12] -> 103, [1 -> 2, 3 -> 4] -> 1001]|
    +-----------------------------------------------------------------------------------------------+

As you can see, the key [1 -> 2, 3 -> 4] shows up twice in the new map.

This is because:

    val a1 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
    val a2 = new ArrayBasedMapData(new GenericArrayData(Array(1, 3)), new GenericArrayData(Array(2, 4)))
    a1 == a2 // will be false
    a1.hashCode() == a2.hashCode() // will be false

Different instances of ArrayBasedMapData with the exact same data are not considered the same key. The same seems to be the case for UnsafeMapData as well (but usually works out in gencode mode only because of some magic under the hood that returns the same reference for identical keys).
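The same effect is easy to reproduce outside Spark. In the plain-Python sketch below, `MapDataModel` is a hypothetical class that, like the map data objects described above, defines neither equality nor hashing, so two instances with identical contents act as distinct keys in a hash-based union.

```python
class MapDataModel:
    """Hypothetical stand-in: no __eq__/__hash__, so identity semantics apply."""

    def __init__(self, keys, values):
        self.keys = list(keys)
        self.values = list(values)


a1 = MapDataModel([1, 3], [2, 4])
a2 = MapDataModel([1, 3], [2, 4])

same = a1 == a2     # compared by identity, not by contents
union = {a1: 101}
union[a2] = 1001    # treated as a different key than a1
print(same, len(union))
```

Adding content-based equality and hashing to the class would collapse the two keys into one, which is the fix that conflicts with SPARK-9415.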

@bersprockets Hi, thanks for the investigation. We don't need to care about key duplication like CreateMap for now.

SparkQA · 2018-05-02T05:10:25Z

Test build #90020 has finished for PR 21073 at commit d9dccd3.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-05-03T04:21:55Z

nit: we can use ArrayBasedMapData.apply().

ueshin · 2018-05-03T04:38:42Z

ueshin · 2018-05-03T04:38:53Z

ueshin · 2018-05-03T04:41:34Z

Use ctx.splitExpressionsWithCurrentInputs() or something to avoid exceeding JVM limit.
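The general technique behind this suggestion is to break one long generated method into several small helper methods plus a driver, so that no single method grows past a JVM limit such as the 64 KB bytecode cap per method. The plain-Python sketch below only illustrates that chunking idea; `split_statements` is a hypothetical helper and does not reflect how `ctx.splitExpressionsWithCurrentInputs()` is implemented.

```python
from typing import List


def split_statements(statements: List[str], max_per_function: int) -> str:
    """Group statements into helper functions of bounded size, plus a driver."""
    chunks = [
        statements[i:i + max_per_function]
        for i in range(0, len(statements), max_per_function)
    ]
    lines: List[str] = []
    for n, chunk in enumerate(chunks):
        lines.append(f"def part_{n}(maps, out):")
        lines.extend(f"    {stmt}" for stmt in chunk)
    lines.append("def run(maps):")
    lines.append("    out = []")
    lines.extend(f"    part_{n}(maps, out)" for n in range(len(chunks)))
    lines.append("    return out")
    return "\n".join(lines) + "\n"


# One generated statement per input map.
statements = [f"out.extend(maps[{i}])" for i in range(5)]
source = split_statements(statements, max_per_function=2)
namespace: dict = {}
exec(source, namespace)  # stand-in for compiling the generated code
result = namespace["run"]([[(i, i * 10)] for i in range(5)])
print(result)
```

With five inputs and at most two statements per helper, the generated source contains three helper functions and one driver that calls them in order.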

SparkQA · 2018-05-03T07:05:01Z

Test build #90091 has finished for PR 21073 at commit 77ae014.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-05-03T15:58:24Z

Just a question. What happens if union.entrySet().toArray() has more than 0x7FFF_FFFF elements?

I would imagine bad things would happen before you got this far (even Map's size method returns an Int).
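If an explicit guard were wanted, it could look roughly like the plain-Python sketch below, which sums the input sizes and fails fast when the total exceeds the largest signed 32-bit value mentioned in the question. `checked_total_size` is hypothetical; nothing in this thread says such a check was added to the PR.

```python
from typing import Iterable

MAX_ELEMENTS = 0x7FFFFFFF  # largest signed 32-bit value, as in the question


def checked_total_size(sizes: Iterable[int]) -> int:
    """Sum per-map sizes and fail fast if the total cannot fit in an Int."""
    total = sum(sizes)
    if total > MAX_ELEMENTS:
        raise ValueError(
            f"cannot concatenate {total} entries; limit is {MAX_ELEMENTS}"
        )
    return total


print(checked_total_size([1, 2, 3]))
```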

SparkQA · 2018-07-06T05:20:27Z

Test build #92662 has finished for PR 21073 at commit 03328a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-06T06:55:36Z

Test build #92660 has finished for PR 21073 at commit 3c0da03.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-07-06T07:21:13Z

Jenkins, retest this please.

ueshin · 2018-07-06T07:22:23Z

LGTM pending Jenkins.

SparkQA · 2018-07-06T09:23:23Z

Test build #92672 has finished for PR 21073 at commit 03328a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

bersprockets · 2018-07-06T14:52:19Z

retest this please.

SparkQA · 2018-07-06T18:42:05Z

Test build #92689 has finished for PR 21073 at commit 03328a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-07-09T03:06:12Z

I'd retrigger the build again, just in case.

ueshin · 2018-07-09T03:06:22Z

Jenkins, retest this please.

SparkQA · 2018-07-09T06:09:24Z

Test build #92728 has finished for PR 21073 at commit 03328a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-07-09T06:10:55Z

Jenkins, retest this please.

SparkQA · 2018-07-09T07:05:02Z

Test build #92733 has finished for PR 21073 at commit 03328a4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-07-09T07:06:33Z

Jenkins, retest this please.

SparkQA · 2018-07-09T07:59:07Z

Test build #92740 has finished for PR 21073 at commit 03328a4.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-07-09T08:20:23Z

Jenkins, retest this please.

SparkQA · 2018-07-09T12:13:54Z

Test build #92745 has finished for PR 21073 at commit 03328a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-07-09T12:20:36Z

Thanks! merging to master.

bersprockets · 2018-07-11T00:32:08Z

@ueshin Thanks for all your help!

henryr reviewed Apr 16, 2018

bersprockets changed the title ~~[SPARK-23936][SQL][WIP] Implement map_concat~~ [SPARK-23936][SQL] Implement map_concat Apr 17, 2018

bersprockets force-pushed the SPARK-23936 branch from 44137cc to d3d6ad6 Compare April 18, 2018 17:33

bersprockets commented Apr 19, 2018

henryr reviewed Apr 23, 2018

mn-mikke reviewed Apr 26, 2018

kiszk reviewed Apr 28, 2018

bersprockets commented May 2, 2018

ueshin reviewed May 3, 2018

kiszk reviewed May 3, 2018

bersprockets added 6 commits July 5, 2018 17:54

Review feedback: use pre-existing empty collections

206db97

Allow duplicate keys

549300f

Remove extra line added during rebase

1b52dd1

Review comments

969c66e

Review comments

47f0cf5

Initial implementation of type coercion for map_concat

3c0da03

bersprockets force-pushed the SPARK-23936 branch from 4ee7b46 to 3c0da03 Compare July 6, 2018 03:03

Simplify type coercion for map_concat parameters

03328a4

asfgit closed this in 034913b Jul 9, 2018

bersprockets deleted the SPARK-23936 branch December 30, 2018 17:20

asiunov mentioned this pull request Jan 9, 2019

[MINOR] Follow up for SPARK-23936, fixed description of "map_concat" function #23493

Closed
