[SPARK-54194][CONNECT] Spark Connect Proto Plan Compression#52894
[SPARK-54194][CONNECT] Spark Connect Proto Plan Compression#52894xi-db wants to merge 23 commits into
Conversation
# Conflicts: # python/pyspark/sql/connect/client/core.py # python/pyspark/sql/connect/proto/base_pb2.py # sql/connect/server/src/test/scala/org/apache/spark/sql/connect/service/SparkConnectServiceE2ESuite.scala
| "subClass" : { | ||
| "CANNOT_PARSE" : { | ||
| "message" : [ | ||
| "Cannot decompress or parse the input plan (<errorMsg>)" | ||
| ] | ||
| }, | ||
| "PLAN_SIZE_LARGER_THAN_MAX" : { | ||
| "message" : [ | ||
| "The plan size is larger than max (<planSize> vs. <maxPlanSize>)" | ||
| ] |
There was a problem hiding this comment.
Let's add more info here on what the user facing resolution is.
For CANNOT_PARSE, it would be to disable compression and for the large plan size, add some information on when these situations may occur and offer an alternative (e.g split/reduce operations, utilise temp views for intermediate materialization etc)
There was a problem hiding this comment.
Yes, I've updated the message to include the possible resolution.
# Conflicts: # dev/requirements.txt # dev/spark-test-image/numpy-213/Dockerfile # dev/spark-test-image/python-310/Dockerfile # dev/spark-test-image/python-311/Dockerfile # dev/spark-test-image/python-312/Dockerfile # dev/spark-test-image/python-313-nogil/Dockerfile # dev/spark-test-image/python-313/Dockerfile # dev/spark-test-image/python-314/Dockerfile # dev/spark-test-image/python-minimum/Dockerfile # dev/spark-test-image/python-ps-minimum/Dockerfile
There was a problem hiding this comment.
Hi, @xi-db and @hvanhovell .
May I ask why we choose 0.23 here?
Since this is a new dependency, can we use the latest bug-fixed stable version, 0.25.0 (Sep 14, 2025), instead of the old one, 0.23.0 (Jul 14, 2024)?
zstandard>=0.23.0
Technically, 0.24.0 is the first version to support Python 3.14 officially.
There are many changes since 0.23.0.
|
@xi-db please also update the doc https://apache.github.io/spark/api/python/getting_started/install.html#spark-connect |
|
Thanks @zhengruifeng , I've just added the doc change to the followup PR: #53003. |
|
It seems that we missed |
### What changes were proposed in this pull request? This PR aims to recover MacOS CIs by installing `zstandard==0.25.0`. ### Why are the changes needed? After SPARK-54194, `zstandard` Python package is required for `Connect`. - #52894 Currently, MacOS CIs are broken. - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos26.yml - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos.yml ``` Traceback (most recent call last): File "/Users/runner/work/spark/spark/python/pyspark/sql/connect/utils.py", line 105, in require_minimum_zstandard_version import zstandard # noqa ^^^^^^^^^^^^^^^^ ModuleNotFoundError: No module named 'zstandard' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. ### How was this patch tested? Pass the CIs and manual review because MacOS CIs are triggered daily CIs only. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53025 from dongjoon-hyun/SPARK-54326. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? This PR aims to recover MacOS CIs by installing `zstandard==0.25.0`. ### Why are the changes needed? After SPARK-54194, `zstandard` Python package is required for `Connect`. - #52894 Currently, MacOS CIs are broken. - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos26.yml - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos.yml ``` Traceback (most recent call last): File "/Users/runner/work/spark/spark/python/pyspark/sql/connect/utils.py", line 105, in require_minimum_zstandard_version import zstandard # noqa ^^^^^^^^^^^^^^^^ ModuleNotFoundError: No module named 'zstandard' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. ### How was this patch tested? Pass the CIs and manual review because MacOS CIs are triggered daily CIs only. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53025 from dongjoon-hyun/SPARK-54326. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 5c16a73) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
|
Thank you @dongjoon-hyun ! |
|
Hi, @xi-db , @hvanhovell , @zhengruifeng . This seems to break our Python Classic CIs. I made a follow-up. Could you review the PR? |
… conditionally ### What changes were proposed in this pull request? This PR is a follow-up of the following to fix `connectutils.py` to import `pb2` conditionally. - #52894 ### Why are the changes needed? Currently, Python CIs are broken like the following. - https://github.com/apache/spark/actions/workflows/build_python_3.11_classic_only.yml - https://github.com/apache/spark/actions/runs/19316448951/job/55248810741 - https://github.com/apache/spark/actions/workflows/build_python_3.12.yml - https://github.com/apache/spark/actions/runs/19275741458/job/55212353468 ``` File "/__w/spark/spark/python/pyspark/testing/connectutils.py", line 26, in <module> import pyspark.sql.connect.proto as pb2 File "/__w/spark/spark/python/pyspark/sql/connect/proto/__init__.py", line 18, in <module> from pyspark.sql.connect.proto.base_pb2_grpc import * File "/__w/spark/spark/python/pyspark/sql/connect/proto/base_pb2_grpc.py", line 19, in <module> import grpc ModuleNotFoundError: No module named 'grpc' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. We has been importing `pyspark.sql.connect` conditionally before #52894 . ### How was this patch tested? Pass the CIs and manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53037 from dongjoon-hyun/SPARK-54194. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… conditionally ### What changes were proposed in this pull request? This PR is a follow-up of the following to fix `connectutils.py` to import `pb2` conditionally. - #52894 ### Why are the changes needed? Currently, Python CIs are broken like the following. - https://github.com/apache/spark/actions/workflows/build_python_3.11_classic_only.yml - https://github.com/apache/spark/actions/runs/19316448951/job/55248810741 - https://github.com/apache/spark/actions/workflows/build_python_3.12.yml - https://github.com/apache/spark/actions/runs/19275741458/job/55212353468 ``` File "/__w/spark/spark/python/pyspark/testing/connectutils.py", line 26, in <module> import pyspark.sql.connect.proto as pb2 File "/__w/spark/spark/python/pyspark/sql/connect/proto/__init__.py", line 18, in <module> from pyspark.sql.connect.proto.base_pb2_grpc import * File "/__w/spark/spark/python/pyspark/sql/connect/proto/base_pb2_grpc.py", line 19, in <module> import grpc ModuleNotFoundError: No module named 'grpc' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. We has been importing `pyspark.sql.connect` conditionally before #52894 . ### How was this patch tested? Pass the CIs and manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53037 from dongjoon-hyun/SPARK-54194. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit 63bcc87) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… - Scala Client ### What changes were proposed in this pull request? In the previous PR #52894 of Spark Connect Proto Plan Compression, both Server-side and PySpark client changes were implemented. In this PR, the corresponding Scala client changes are implemented, so plan compression are now supported on the Scala client as well. To reproduce the existing issue we are solving here, run this code on Spark Connect Scala client: ``` import scala.util.Random import org.apache.spark.sql.DataFrame import spark.implicits._ def randomLetters(n: Int): String = { Iterator.continually(Random.nextPrintableChar()) .filter(_.isLetter) .take(n) .mkString } val numUniqueSmallRelations = 5 val sizePerSmallRelation = 512 * 1024 val smallDfs: Seq[DataFrame] = (0 until numUniqueSmallRelations).map { _ => Seq(randomLetters(sizePerSmallRelation)).toDF("value") } var resultDf = smallDfs.head for (_ <- 0 until 500) { val idx = Random.nextInt(smallDfs.length) resultDf = resultDf.unionByName(smallDfs(idx)) } resultDf.collect() ``` It fails with RESOURCE_EXHAUSTED error with message `gRPC message exceeds maximum size 134217728: 269207219`, because the server is trying to send an ExecutePlanResponse of ~260MB to the client. With the improvement introduced by the PR, the above code runs successfully and prints the expected result. ### Why are the changes needed? It improves Spark Connect stability when handling large plans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53003 from xi-db/plan-compression-scala-client. Authored-by: Xi Lyu <xi.lyu@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
… - Scala Client ### What changes were proposed in this pull request? In the previous PR #52894 of Spark Connect Proto Plan Compression, both Server-side and PySpark client changes were implemented. In this PR, the corresponding Scala client changes are implemented, so plan compression are now supported on the Scala client as well. To reproduce the existing issue we are solving here, run this code on Spark Connect Scala client: ``` import scala.util.Random import org.apache.spark.sql.DataFrame import spark.implicits._ def randomLetters(n: Int): String = { Iterator.continually(Random.nextPrintableChar()) .filter(_.isLetter) .take(n) .mkString } val numUniqueSmallRelations = 5 val sizePerSmallRelation = 512 * 1024 val smallDfs: Seq[DataFrame] = (0 until numUniqueSmallRelations).map { _ => Seq(randomLetters(sizePerSmallRelation)).toDF("value") } var resultDf = smallDfs.head for (_ <- 0 until 500) { val idx = Random.nextInt(smallDfs.length) resultDf = resultDf.unionByName(smallDfs(idx)) } resultDf.collect() ``` It fails with RESOURCE_EXHAUSTED error with message `gRPC message exceeds maximum size 134217728: 269207219`, because the server is trying to send an ExecutePlanResponse of ~260MB to the client. With the improvement introduced by the PR, the above code runs successfully and prints the expected result. ### Why are the changes needed? It improves Spark Connect stability when handling large plans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #53003 from xi-db/plan-compression-scala-client. Authored-by: Xi Lyu <xi.lyu@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com> (cherry picked from commit 6cb88c1) Signed-off-by: Herman van Hovell <herman@databricks.com>
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Hi, @xi-db and @hvanhovell .
Could you confirm that Apache Spark 3.5 Spark Connect client still can talk with Apache Spark 4.1.0 Spark Connect Server by default without any additional setting?
| .internal() | ||
| .stringConf | ||
| .transform(_.toUpperCase(Locale.ROOT)) | ||
| .checkValues(ConnectPlanCompressionAlgorithm.values.map(_.toString)) |
There was a problem hiding this comment.
Is there a way to unset this feature? It seems that we enforce to use ZSTD because it's the only one, @xi-db and @hvanhovell ?
object ConnectPlanCompressionAlgorithm extends Enumeration {
val ZSTD = Value
}There was a problem hiding this comment.
To disable the feature, user can set the spark.connect.session.planCompression.threshold to -1 (link). Still, it is better to have a enum "None". Thanks for your PR!
|
I made a PR to support |
### What changes were proposed in this pull request? Currently, Spark Connect enforce gRPC message limits on both the client and the server. These limits are largely meant to protect the server from potential OOMs by rejecting abnormally large messages. However, there are several cases where genuine messages exceed the limit and cause execution failures. To improve Spark Connect stability, this PR implements compressing unresolved proto plans to mitigate the issue of oversized messages from the client to the server. The compression applies to ExecutePlan and AnalyzePlan - the only two methods that might hit the message limit. The other issue of message limit from the server to the client is a different issue, and it’s out of the scope (that one is already fixed in apache#52271). In the implementation, * Zstandard is leveraged to compress proto plan as it has consistent high performance in our benchmark and achieves a good balance between compression ratio and performance. * The config `spark.connect.maxPlanSize` is introduced to control the maximum size of a (decompressed) proto plan that can be executed in Spark Connect. It is mainly used to avoid decompression bomb attacks. (Scala client changes are being implemented in a follow-up PR.) To reproduce the existing issue we are solving here, run this code on Spark Connect: ``` import random import string def random_letters(length: int) -> str: return ''.join(random.choices(string.ascii_letters, k=length)) num_unique_small_relations = 5 size_per_small_relation = 512 * 1024 small_dfs = [spark.createDataFrame([(random_letters(size_per_small_relation),)],) for _ in range(num_unique_small_relations)] result_df = small_dfs[0] for _ in range(512): result_df = result_df.unionByName(small_dfs[random.randint(0, len(small_dfs) - 1)]) result_df.collect() ``` It fails with `StatusCode.RESOURCE_EXHAUSTED` error with message`Sent message larger than max (269178955 vs. 134217728)`, because the client was trying to send a too large message to the server. Note: repeated small local relations is just one way causing a large plan, the size of the plan can also be contributed by repeated subtrees of plan transformations, serialized UDFs, captured external variables by UDFs, etc. With the improvement introduced by the PR, the above code runs successfully and prints the expected result. ### Why are the changes needed? It improves Spark Connect stability when executing and analyzing large plans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests on both the server side and the client side. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52894 from xi-db/plan-compression. Authored-by: Xi Lyu <xi.lyu@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
### What changes were proposed in this pull request? This PR aims to recover MacOS CIs by installing `zstandard==0.25.0`. ### Why are the changes needed? After SPARK-54194, `zstandard` Python package is required for `Connect`. - apache#52894 Currently, MacOS CIs are broken. - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos26.yml - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos.yml ``` Traceback (most recent call last): File "/Users/runner/work/spark/spark/python/pyspark/sql/connect/utils.py", line 105, in require_minimum_zstandard_version import zstandard # noqa ^^^^^^^^^^^^^^^^ ModuleNotFoundError: No module named 'zstandard' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. ### How was this patch tested? Pass the CIs and manual review because MacOS CIs are triggered daily CIs only. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53025 from dongjoon-hyun/SPARK-54326. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… conditionally ### What changes were proposed in this pull request? This PR is a follow-up of the following to fix `connectutils.py` to import `pb2` conditionally. - apache#52894 ### Why are the changes needed? Currently, Python CIs are broken like the following. - https://github.com/apache/spark/actions/workflows/build_python_3.11_classic_only.yml - https://github.com/apache/spark/actions/runs/19316448951/job/55248810741 - https://github.com/apache/spark/actions/workflows/build_python_3.12.yml - https://github.com/apache/spark/actions/runs/19275741458/job/55212353468 ``` File "/__w/spark/spark/python/pyspark/testing/connectutils.py", line 26, in <module> import pyspark.sql.connect.proto as pb2 File "/__w/spark/spark/python/pyspark/sql/connect/proto/__init__.py", line 18, in <module> from pyspark.sql.connect.proto.base_pb2_grpc import * File "/__w/spark/spark/python/pyspark/sql/connect/proto/base_pb2_grpc.py", line 19, in <module> import grpc ModuleNotFoundError: No module named 'grpc' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. We has been importing `pyspark.sql.connect` conditionally before apache#52894 . ### How was this patch tested? Pass the CIs and manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53037 from dongjoon-hyun/SPARK-54194. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request? Currently, Spark Connect enforce gRPC message limits on both the client and the server. These limits are largely meant to protect the server from potential OOMs by rejecting abnormally large messages. However, there are several cases where genuine messages exceed the limit and cause execution failures. To improve Spark Connect stability, this PR implements compressing unresolved proto plans to mitigate the issue of oversized messages from the client to the server. The compression applies to ExecutePlan and AnalyzePlan - the only two methods that might hit the message limit. The other issue of message limit from the server to the client is a different issue, and it’s out of the scope (that one is already fixed in apache#52271). In the implementation, * Zstandard is leveraged to compress proto plan as it has consistent high performance in our benchmark and achieves a good balance between compression ratio and performance. * The config `spark.connect.maxPlanSize` is introduced to control the maximum size of a (decompressed) proto plan that can be executed in Spark Connect. It is mainly used to avoid decompression bomb attacks. (Scala client changes are being implemented in a follow-up PR.) To reproduce the existing issue we are solving here, run this code on Spark Connect: ``` import random import string def random_letters(length: int) -> str: return ''.join(random.choices(string.ascii_letters, k=length)) num_unique_small_relations = 5 size_per_small_relation = 512 * 1024 small_dfs = [spark.createDataFrame([(random_letters(size_per_small_relation),)],) for _ in range(num_unique_small_relations)] result_df = small_dfs[0] for _ in range(512): result_df = result_df.unionByName(small_dfs[random.randint(0, len(small_dfs) - 1)]) result_df.collect() ``` It fails with `StatusCode.RESOURCE_EXHAUSTED` error with message`Sent message larger than max (269178955 vs. 134217728)`, because the client was trying to send a too large message to the server. Note: repeated small local relations is just one way causing a large plan, the size of the plan can also be contributed by repeated subtrees of plan transformations, serialized UDFs, captured external variables by UDFs, etc. With the improvement introduced by the PR, the above code runs successfully and prints the expected result. ### Why are the changes needed? It improves Spark Connect stability when executing and analyzing large plans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests on both the server side and the client side. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52894 from xi-db/plan-compression. Authored-by: Xi Lyu <xi.lyu@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
### What changes were proposed in this pull request? This PR aims to recover MacOS CIs by installing `zstandard==0.25.0`. ### Why are the changes needed? After SPARK-54194, `zstandard` Python package is required for `Connect`. - apache#52894 Currently, MacOS CIs are broken. - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos26.yml - https://github.com/apache/spark/actions/workflows/build_python_3.11_macos.yml ``` Traceback (most recent call last): File "/Users/runner/work/spark/spark/python/pyspark/sql/connect/utils.py", line 105, in require_minimum_zstandard_version import zstandard # noqa ^^^^^^^^^^^^^^^^ ModuleNotFoundError: No module named 'zstandard' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. ### How was this patch tested? Pass the CIs and manual review because MacOS CIs are triggered daily CIs only. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53025 from dongjoon-hyun/SPARK-54326. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… conditionally ### What changes were proposed in this pull request? This PR is a follow-up of the following to fix `connectutils.py` to import `pb2` conditionally. - apache#52894 ### Why are the changes needed? Currently, Python CIs are broken like the following. - https://github.com/apache/spark/actions/workflows/build_python_3.11_classic_only.yml - https://github.com/apache/spark/actions/runs/19316448951/job/55248810741 - https://github.com/apache/spark/actions/workflows/build_python_3.12.yml - https://github.com/apache/spark/actions/runs/19275741458/job/55212353468 ``` File "/__w/spark/spark/python/pyspark/testing/connectutils.py", line 26, in <module> import pyspark.sql.connect.proto as pb2 File "/__w/spark/spark/python/pyspark/sql/connect/proto/__init__.py", line 18, in <module> from pyspark.sql.connect.proto.base_pb2_grpc import * File "/__w/spark/spark/python/pyspark/sql/connect/proto/base_pb2_grpc.py", line 19, in <module> import grpc ModuleNotFoundError: No module named 'grpc' ``` ### Does this PR introduce _any_ user-facing change? No behavior change. We has been importing `pyspark.sql.connect` conditionally before apache#52894 . ### How was this patch tested? Pass the CIs and manual test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53037 from dongjoon-hyun/SPARK-54194. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… - Scala Client ### What changes were proposed in this pull request? In the previous PR apache#52894 of Spark Connect Proto Plan Compression, both Server-side and PySpark client changes were implemented. In this PR, the corresponding Scala client changes are implemented, so plan compression are now supported on the Scala client as well. To reproduce the existing issue we are solving here, run this code on Spark Connect Scala client: ``` import scala.util.Random import org.apache.spark.sql.DataFrame import spark.implicits._ def randomLetters(n: Int): String = { Iterator.continually(Random.nextPrintableChar()) .filter(_.isLetter) .take(n) .mkString } val numUniqueSmallRelations = 5 val sizePerSmallRelation = 512 * 1024 val smallDfs: Seq[DataFrame] = (0 until numUniqueSmallRelations).map { _ => Seq(randomLetters(sizePerSmallRelation)).toDF("value") } var resultDf = smallDfs.head for (_ <- 0 until 500) { val idx = Random.nextInt(smallDfs.length) resultDf = resultDf.unionByName(smallDfs(idx)) } resultDf.collect() ``` It fails with RESOURCE_EXHAUSTED error with message `gRPC message exceeds maximum size 134217728: 269207219`, because the server is trying to send an ExecutePlanResponse of ~260MB to the client. With the improvement introduced by the PR, the above code runs successfully and prints the expected result. ### Why are the changes needed? It improves Spark Connect stability when handling large plans. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#53003 from xi-db/plan-compression-scala-client. Authored-by: Xi Lyu <xi.lyu@databricks.com> Signed-off-by: Herman van Hovell <herman@databricks.com>
…ift` source code ### What changes were proposed in this pull request? This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0`. ### Why are the changes needed? To use the latest bug fixes and new messages to develop for new features of `4.1.0`. - apache/spark#53024 - apache/spark#52894 - apache/spark#52890 - apache/spark#52872 - apache/spark#52746 - apache/spark#52831 ``` $ git clone -b v4.1.0 https://github.com/apache/spark.git $ cd spark/sql/connect/common/src/main/protobuf/ $ protoc --swift_out=. spark/connect/*.proto $ protoc --grpc-swift_out=. spark/connect/*.proto // Remove empty GRPC files $ cd spark/connect $ grep 'This file contained no services' * | awk -F: '{print $1}' | xargs rm ``` ### Does this PR introduce _any_ user-facing change? Pass the CIs. ### How was this patch tested? Pass the CIs. I manually tested with `Apache Spark 4.1.0`. ``` $ swift test --no-parallel ... Test run with 203 tests in 21 suites passed after 33.163 seconds. ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #271 from dongjoon-hyun/SPARK-54811. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
… logic in gRPC interceptor ### What changes were proposed in this pull request? Previously, decompression logic introduced by #52894 was scattered across multiple handlers (AnalyzePlanHandler, ExecutePlanHandler, and probably other handlers in the future), with each handler responsible for decompressing its own compressed requests. This approach made it difficult to maintain consistent decompression behavior across different RPC types. This PR introduces RequestDecompressionInterceptor as a gRPC interceptor that handles decompression for all compressed requests before they reach downstream handlers. To improve debuggability on decompression errors, * Enriched exceptions before throwing back to the client, so the correct error class with error message will be propagated to the client, like other errors thrown in handlers. * Added more logs. On error, the driver logs will have detailed information for debugging. ### Why are the changes needed? This refactoring ensures single responsibility and consistency, as decompression logic is now in one place, making it easier to maintain. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New tests for the interceptor and existing tests for plan compression. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #54113 from xi-db/move-decompression-into-a-single-interceptor. Authored-by: Xi Lyu <xi.lyu@databricks.com> Signed-off-by: Herman van Hövell <herman@databricks.com>
What changes were proposed in this pull request?
Currently, Spark Connect enforce gRPC message limits on both the client and the server. These limits are largely meant to protect the server from potential OOMs by rejecting abnormally large messages. However, there are several cases where genuine messages exceed the limit and cause execution failures.
To improve Spark Connect stability, this PR implements compressing unresolved proto plans to mitigate the issue of oversized messages from the client to the server. The compression applies to ExecutePlan and AnalyzePlan - the only two methods that might hit the message limit. The other issue of message limit from the server to the client is a different issue, and it’s out of the scope (that one is already fixed in #52271).
In the implementation,
spark.connect.maxPlanSizeis introduced to control the maximum size of a (decompressed) proto plan that can be executed in Spark Connect. It is mainly used to avoid decompression bomb attacks.(Scala client changes are being implemented in a follow-up PR.)
To reproduce the existing issue we are solving here, run this code on Spark Connect:
It fails with
StatusCode.RESOURCE_EXHAUSTEDerror with messageSent message larger than max (269178955 vs. 134217728), because the client was trying to send a too large message to the server.Note: repeated small local relations is just one way causing a large plan, the size of the plan can also be contributed by repeated subtrees of plan transformations, serialized UDFs, captured external variables by UDFs, etc.
With the improvement introduced by the PR, the above code runs successfully and prints the expected result.
Why are the changes needed?
It improves Spark Connect stability when executing and analyzing large plans.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New tests on both the server side and the client side.
Was this patch authored or co-authored using generative AI tooling?
No.