[SPARK-24636][SQL] Type coercion of arrays for array_join function#21620
[SPARK-24636][SQL] Type coercion of arrays for array_join function#21620mn-mikke wants to merge 4 commits into
Conversation
| case ArrayJoin(arr, d, nr) if !ArrayType(StringType).acceptsType(arr.dataType) && | ||
| ArrayType.acceptsType(arr.dataType) => | ||
| val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull | ||
| ArrayJoin(Cast(arr, ArrayType(StringType, containsNull)), d, nr) |
There was a problem hiding this comment.
not every type can be casted to StringType. What about using ImplicitTypeCasts.implicitCast in order to check if we can cast it?
There was a problem hiding this comment.
Hi @mgaido91,
to be honest, I've considered this option before submitting this PR. But I'm glad that you mentioned this approach. At least, we can discuss pros and cons of different solutions. Usage of ImplicitTypeCasts.implicitCast would enable conversion only from primitive types. I think it would be nice to support non-primitive types as well. WDYT?
Re: Casting to StringType: According to Cast.canCast method should be possible to cast any type to StringType:
line 42: case (_, StringType) => true
Or am I missing something? I hope test cases in .../typeCoercion/native/arrayJoin.sql cover to StringType conversions from all Spark types.
There was a problem hiding this comment.
I am not sure. I think it is arguable which is the right result of SELECT array_join(array(array('a', 'b'), array('c', 'd')), ';') for instance. With this PR, the result is [a, b];[c, d] but shouldn't it be [a;b];[c;d]? Moreover, Presto, which is the reference here, doesn't support nested arrays for instance:
presto> select array_join(array[array[1, 2, 3], array[3, 4, 5]], ';');
Query 20180625_090549_00003_bsbcg failed: Input type array(integer) not supported
So, I'd avoid that.
There was a problem hiding this comment.
Ok, no problem. Let's support just arrays of primitive types for now. Thanks!
|
Test build #92253 has finished for PR 21620 at commit
|
| case aj @ ArrayJoin(arr, d, nr) if !ArrayType(StringType).acceptsType(arr.dataType) && | ||
| ArrayType.acceptsType(arr.dataType) => | ||
| val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull | ||
| ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) |
There was a problem hiding this comment.
nit: usually we try to avoid long chains of functions, what about:
ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) match {
...
}
|
LGMT |
| ArrayType.acceptsType(arr.dataType) => | ||
| val containsNull = arr.dataType.asInstanceOf[ArrayType].containsNull | ||
| ImplicitTypeCasts.implicitCast(arr, ArrayType(StringType, containsNull)) match { | ||
| case Some(finalDataType) => ArrayJoin(finalDataType, d, nr) |
There was a problem hiding this comment.
castedArr or something instead of finalDataType?
|
Test build #92297 has finished for PR 21620 at commit
|
|
Test build #92298 has finished for PR 21620 at commit
|
|
Test build #92299 has finished for PR 21620 at commit
|
|
Merged to master. |
| -- !query 9 | ||
| SELECT array_join(array(timestamp '2016-11-15 20:54:00.000', timestamp '2016-11-12 20:54:00.000'), ', ') | ||
| -- !query 9 schema | ||
| struct<array_join(array(TIMESTAMP('2016-11-15 20:54:00.0'), TIMESTAMP('2016-11-12 20:54:00.0')), , ):string> |
There was a problem hiding this comment.
If the input array is very long, the automatically generated column name will be also super long?
There was a problem hiding this comment.
Hm, yes, it will be. In general, if an expression has children: Seq[Expression] as its argument, the automatically generated column name will be long for now?
What changes were proposed in this pull request?
Presto's implementation accepts arbitrary arrays of primitive types as an input:
This PR proposes to implement a type coercion rule for
array_joinfunction that converts arrays of primitive as well as non-primitive types to arrays of string.How was this patch tested?
New test cases add into: