Skip to content

[SPARK-32796][SQL] Make withField API support nested struct in array#29645

Closed
viirya wants to merge 2 commits into
apache:masterfrom
viirya:with-field-array
Closed

[SPARK-32796][SQL] Make withField API support nested struct in array#29645
viirya wants to merge 2 commits into
apache:masterfrom
viirya:with-field-array

Conversation

@viirya

@viirya viirya commented Sep 4, 2020

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

This patch adds nested struct support to Column.withField API.

Why are the changes needed?

Currently Column.withField only supports StructType. For nested struct in ArrayType, it doesn't support. We can support nested struct in array to make the API more general and useful.

Does this PR introduce any user-facing change?

Yes. Adding nested struct support to Column.withField API.

How was this patch tested?

Unit tests.

case q: LogicalPlan =>
q.transformExpressions {
case expr if !expr.childrenResolved => expr
case e: UnresolvedWithFields => WithFields(e.col, e.fieldName, e.expr)

@viirya viirya Sep 4, 2020

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be moved to other proper rule. Just not sure which one is good, so put as an individual rule first.

@SparkQA

SparkQA commented Sep 4, 2020

Copy link
Copy Markdown

Test build #128278 has finished for PR 29645 at commit 643728d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedWithFields(

@HyukjinKwon

Copy link
Copy Markdown
Member

retest this please


test("withField should add field to struct of array") {
checkAnswerAndSchema(
arrayLevel1.withColumn("a", 'a.withField("d", lit(4))),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya, BTW, is it same as:

transform($"struct_col", _.withField("new_col", col))

?

@HyukjinKwon

HyukjinKwon commented Sep 4, 2020

Copy link
Copy Markdown
Member

I personally more prefer an explicit call such as transform($"struct_col", _.withField("new_col", col)) as it's more explicit but I got that this can be sort of similar with the nested column access in SQL like a.b supports the arrays of structs. No strong opinion on this. WDYT @cloud-fan.

@cloud-fan

Copy link
Copy Markdown
Contributor

I agree with @HyukjinKwon that it's better to be more explicit. a.b is a bad example, as it arbitrarily decides to support only one level, array of array of struct is not supported. I prefer to not make the same mistake in withField.

@SparkQA

SparkQA commented Sep 4, 2020

Copy link
Copy Markdown

Test build #128288 has finished for PR 29645 at commit 643728d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedWithFields(

@viirya

viirya commented Sep 4, 2020

Copy link
Copy Markdown
Member Author

@cloud-fan @HyukjinKwon Thanks for comment. So let me get more ideas from you. This is like a syntax sugar to more easily express complex nested withField chain. I think we can explicitly expression it by using transform + withField, but as it goes deeply more, the syntax becomes hard to express and read. This doesn't create new syntax other than a.b used to access nested field in array, but just leverages existing one.

@viirya

viirya commented Sep 5, 2020

Copy link
Copy Markdown
Member Author

An example looks like:

private lazy val arrayType = ArrayType(
    StructType(Seq(
      StructField("a", IntegerType, nullable = false),
      StructField("b", IntegerType, nullable = true),
      StructField("c", IntegerType, nullable = false))),
    containsNull = true)

private lazy val arrayStructArrayLevel1: DataFrame = spark.createDataFrame(
    sparkContext.parallelize(Row(Array(Row(Array(Row(1, null, 3)), null, 3))) :: Nil),
    StructType(
      Seq(StructField("a", ArrayType(
        StructType(Seq(
          StructField("a", arrayType, nullable = false),
          StructField("b", IntegerType, nullable = true),
          StructField("c", IntegerType, nullable = false))),
        containsNull = false)))))

The data looks like:

+---------------------------+
|a                          |
+---------------------------+
|[{[{1, null, 3}], null, 3}]|
+---------------------------+

In order to replace deeply nested b column, like:

+------------------------+
|a                       |
+------------------------+
|[{[{1, 2, 3}], null, 3}]|
+------------------------+

Currently by using transform + withField, we probably need the following expressions. It looks complicated and takes a while to write it.

arrayStructArrayLevel1.withColumn("a",
      transform($"a", _.withField("a",
        flatten(transform($"a.a", transform(_, _.withField("b", lit(2))))))))

Using modified withField, we can do it like:

arrayStructArrayLevel1.withColumn("a", $"a".withField("a.b", lit(2)))

It could significantly simplify how we add/replace deeply nested fields.

@SparkQA

SparkQA commented Sep 5, 2020

Copy link
Copy Markdown

Test build #128313 has finished for PR 29645 at commit 27f37dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 5, 2020

Copy link
Copy Markdown

Test build #128314 has finished for PR 29645 at commit 39edb0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

Copy link
Copy Markdown
Contributor

We can save more code by supporting array of array of struct. It's a trade-off between "clear and simple semantic" vs "flexiblity of supporting various input types".

@viirya

viirya commented Sep 7, 2020

Copy link
Copy Markdown
Member Author

@cloud-fan Sorry if I mis-read your comment. Do you mean we should support array of array of struct?

@cloud-fan

Copy link
Copy Markdown
Contributor

I mean we should prefer "clear and simple semantic", otherwise people can always ask to be more flexible and save more code, like supporting array of array of struct.

@viirya

viirya commented Sep 8, 2020

Copy link
Copy Markdown
Member Author

Okay, I see. It also makes sense to me. This is a hard trade-off between simplicity and flexibility. I will close this now. If we need this flexibility in the future, we can revisit it.

@encarvlucas

Copy link
Copy Markdown

Could you revisit this?

@cloud-fan

Copy link
Copy Markdown
Contributor

Does transform($"struct_col", _.withField("new_col", col)) work for you?

@encarvlucas

Copy link
Copy Markdown

Unfortunately it doesn't. I'm using spark through the pyspark library

@cloud-fan

Copy link
Copy Markdown
Contributor

@HyukjinKwon shall we add the transform API in pyspark?

@encarvlucas

Copy link
Copy Markdown

There is a transform implementation in pyspark, but it does not work with python UDFs. Might be related to this issue.
Also, there is no withField function in Spark SQL.

@cloud-fan

Copy link
Copy Markdown
Contributor

Do you mean the col in transform($"struct_col", _.withField("new_col", col)) is a python udf? I'm not sure if this PR can help due to the limitation of python udf execution...

@encarvlucas

Copy link
Copy Markdown

Yes. Nevermind then, I retract my request.

@viirya viirya deleted the with-field-array branch December 27, 2023 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants