[SPARK-52449][CONNECT][PYTHON][ML] Make datatypes for Expression.Literal.Map/Array optional #51473
Closed
heyihong wants to merge 1 commit into
hvanhovell
reviewed
Jul 16, 2025
Member
cc @zhengruifeng too
Contributor
Author
Friendly ping @hvanhovell @HyukjinKwon @beliefer @zhengruifeng @WeichenXu123
Update: No need to review at the moment; I need to finish SPARK-52930 first.
zhengruifeng
reviewed
Sep 3, 2025
Contributor
What is the diff here? Why did it change the analyzed plan?
Contributor
Author
@zhengruifeng The diff is caused by an extra test case I added:
    fn.typedLit(
      Seq(
        mutable.LinkedHashMap("a" -> Seq("1", "2"), "b" -> Seq("3", "4")),
        mutable.LinkedHashMap("a" -> Seq("5", "6"), "b" -> Seq("7", "8")),
        mutable.LinkedHashMap("a" -> Seq.empty[String], "b" -> Seq.empty[String])))
    [keys: [a,b], values: [[1,2],[3,4]],keys: [a,b], values: [[5,6],[7,8]],keys: [a,b], values: [[],[]]] AS ARRAY(MAP('a', ARRAY('1', '2'), 'b', ARRAY('3', '4')), MAP('a', ARRAY('5', '6'), 'b', ARRAY('7', '8')), MAP('a', ARRAY(), 'b', ARRAY()))#0
zhengruifeng
approved these changes
Sep 8, 2025
Contributor
@heyihong please resolve the conflicts
…teral.Array optional
Contributor
merged to master
dongjoon-hyun
added a commit
to apache/spark-connect-swift
that referenced
this pull request
Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

To use the latest bug fixes and new messages to develop for new features of `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

Pass the CIs.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
huangxiaopingRD
pushed a commit
to huangxiaopingRD/spark
that referenced
this pull request
Nov 25, 2025
…ral.Map/Array optional

### What changes were proposed in this pull request?

This PR optimizes the `LiteralValueProtoConverter` to reduce redundant type information in Spark Connect protocol buffers. The key changes include:

1. **Optimized type inference for arrays and maps**: Modified the conversion logic to only include type information in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element.
2. **Added `needDataType` parameter**: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information.
3. **Updated protobuf documentation**: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference.
4. **Improved test coverage**: Added new test cases for complex nested structures including tuples and maps with array values.

### Why are the changes needed?

The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time.

### Does this PR introduce any user-facing change?

**No** - This PR does not introduce any user-facing changes. The change is backward compatible and existing connect clients will continue to work unchanged.

### How was this patch tested?

`build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.4.5

Closes apache#51473 from heyihong/SPARK-52449.

Authored-by: Yihong He <heyihong.cn@gmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
What changes were proposed in this pull request?
This PR optimizes the `LiteralValueProtoConverter` to reduce redundant type information in Spark Connect protocol buffers. The key changes include:

1. **Optimized type inference for arrays and maps**: Modified the conversion logic to only include type information in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element.
2. **Added `needDataType` parameter**: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information.
3. **Updated protobuf documentation**: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference.
4. **Improved test coverage**: Added new test cases for complex nested structures including tuples and maps with array values.
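The idea behind the first two changes can be sketched in a few lines of Python. This is an illustration only, not the code in `LiteralValueProtoConverter`: the `encode` function, the dict layout (`elements`, `element_type`, `value`, `type`) and the tuple encoding of array types are all hypothetical stand-ins for the real protobuf messages. Only the `need_data_type` flag mirrors the `needDataType` parameter named above. Maps are left out; the first key-value pair would presumably be handled the same way as the first array element.

```python
from typing import Any


def encode(value: Any, data_type: Any, need_data_type: bool = True) -> dict:
    """Encode a literal as a plain dict (hypothetical stand-in for a proto message).

    data_type is either a scalar type name such as "string", or a tuple
    ("array", element_type). Type information is attached only where
    need_data_type is True.
    """
    if isinstance(data_type, tuple):  # ("array", element_type)
        _, elem_type = data_type
        msg = {
            "elements": [
                # Only the first element is asked to carry type information.
                encode(v, elem_type, need_data_type and i == 0)
                for i, v in enumerate(value)
            ]
        }
        if need_data_type and not value:
            # An empty array has no first element to infer from,
            # so the element type has to be stated explicitly.
            msg["element_type"] = elem_type
        return msg

    msg = {"value": value}
    if need_data_type:
        msg["type"] = data_type
    return msg


# Usage: a nested array of strings, including an empty inner array.
message = encode([["1", "2"], [], ["3"]], ("array", ("array", "string")))
print(message)
```

In this sketch the type name `"string"` appears once in the whole message, on the first leaf, instead of once per leaf.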
Why are the changes needed?
The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time.
Does this PR introduce any user-facing change?
No - This PR does not introduce any user-facing changes.
The change is backward compatible and existing connect clients will continue to work unchanged.
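A rough sketch of the receiving side shows why inference from the first element and backward compatibility can coexist. As before, this is an assumption-laden illustration rather than the actual Spark Connect server code: `decode` and the dict layout (`elements`, `element_type`, `value`, `type`) are hypothetical, and maps are omitted. The decoder takes the type from the first element that states one and reuses it for the rest, so a message that still types every element decodes to the same result.

```python
from typing import Any, Tuple


def decode(msg: dict, expected_type: Any = None) -> Tuple[Any, Any]:
    """Return (value, data_type) for a dict-encoded literal (hypothetical layout).

    expected_type is the type already inferred from an earlier sibling; it is
    used whenever the message itself omits type information.
    """
    if "elements" in msg:
        elem_type = msg.get("element_type")
        if elem_type is None and expected_type is not None:
            elem_type = expected_type[1]  # expected_type is ("array", element_type)
        values = []
        for child in msg["elements"]:
            value, child_type = decode(child, elem_type)
            if elem_type is None:
                elem_type = child_type  # learned from the first typed element
            values.append(value)
        return values, ("array", elem_type)
    return msg["value"], msg.get("type", expected_type)


# New-style message: only the first leaf carries a type.
compact = {
    "elements": [
        {"elements": [{"value": "1", "type": "string"}, {"value": "2"}]},
        {"elements": []},
    ]
}
# Old-style message: every leaf and every empty array carries a type.
verbose = {
    "elements": [
        {"elements": [{"value": "1", "type": "string"},
                      {"value": "2", "type": "string"}]},
        {"elements": [], "element_type": "string"},
    ]
}
print(decode(compact) == decode(verbose))
```

Under these assumptions both messages yield the same value and the same inferred type, which is the property the backward-compatibility claim relies on.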
How was this patch tested?
`build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"`
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor 1.4.5