[PYTHON] Improve static typing for methods behind is_remote_only #56150

Open
durgaprasadml wants to merge 3 commits into apache:master from durgaprasadml:fix-remote-only-typing

Conversation

durgaprasadml commented May 27, 2026

What changes were proposed in this pull request?

This PR improves static typing support for PySpark methods conditionally defined behind is_remote_only().

Previously, methods such as:

  • DataFrame.rdd
  • SparkSession.newSession
  • SparkSession.sparkContext

were defined inside:

```python
if not is_remote_only():
```

Static type checkers such as mypy cannot evaluate is_remote_only() statically, so during type analysis these attributes fall back to __getattr__ resolution. As a result, an attribute such as df.rdd can be inferred as an incorrect union type (for example, Union[RDD[Row], Column]).

This PR updates those conditional definitions to:

```python
if TYPE_CHECKING or not is_remote_only():
```

This allows static analyzers to unconditionally resolve the methods during type checking while preserving the existing runtime behavior for Spark Connect and remote-only environments.
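As a rough illustration of the guard, here is a self-contained sketch that uses toy stand-ins rather than PySpark itself. `Column`, `make_frame_class`, and the `remote_only` flag are hypothetical names invented for this example; the flag plays the role of the value returned by is_remote_only().

```python
from typing import TYPE_CHECKING, List


class Column:
    """Hypothetical stand-in for a column reference."""

    def __init__(self, name: str) -> None:
        self.name = name


def make_frame_class(remote_only: bool) -> type:
    """Build a toy class the way a module body would at import time."""

    class Frame:
        # Fallback: any attribute that is not defined resolves to a Column.
        def __getattr__(self, name: str) -> Column:
            return Column(name)

        # A type checker treats TYPE_CHECKING as True and always sees `rdd`.
        # At runtime TYPE_CHECKING is False, so only `remote_only` decides.
        if TYPE_CHECKING or not remote_only:

            @property
            def rdd(self) -> List[int]:
                return [0, 1, 2]

    return Frame


LocalFrame = make_frame_class(remote_only=False)
RemoteFrame = make_frame_class(remote_only=True)

# Full install: the property exists and returns its declared type.
assert LocalFrame().rdd == [0, 1, 2]

# Remote-only install: the property is absent, so __getattr__ answers instead.
assert isinstance(RemoteFrame().rdd, Column)
```

The sketch only models the runtime side. What a type checker reports for the real PySpark classes still has to be confirmed with the typing tests listed in this PR.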

Additionally, typing regression tests were added for:

  • DataFrame.rdd
  • SparkSession.sparkContext
  • SparkSession.newSession

Why are the changes needed?

The existing conditional definitions are difficult for static analyzers to evaluate because is_remote_only() is a runtime condition.

For example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)
reveal_type(df.rdd)
```

Before this change, static analyzers could infer df.rdd as a union involving Column because of the __getattr__ fallback resolution.

Using TYPE_CHECKING is a standard Python typing pattern that exposes these definitions to static analyzers without changing runtime semantics.

Does this PR introduce any user-facing change?

No.

This change only affects static type analysis behavior and does not modify runtime behavior.

How was this patch tested?

Verified with targeted PySpark typing regression tests:

```bash
export PYTHONPATH=$(pwd)/python
MYPYPATH=python pytest \
  python/pyspark/sql/tests/typing/test_dataframe.yml \
  python/pyspark/sql/tests/typing/test_session.yml
```

Result:

```text
13 passed
```

Also verified modified files with mypy using Spark's typing configuration.

Closes #56141

durgaprasadml marked this pull request as ready for review May 28, 2026 01:19

iamkhav commented May 28, 2026

In #45053, the author mentions a pure-Python library, pyspark-connect, which is what the is_remote_only() change was intended for. Your changes work well for the type annotations in the full pyspark package, but is the codebase going to be shared between pyspark and pyspark-connect? If so, I'm not sure how to fix this well for both libraries. What do you think?

iamkhav commented May 28, 2026

Setting a constant actually works; I wasn't sure it would.

```python
from typing import TYPE_CHECKING

IS_PYSPARK_CONNECT = True  # <- set this to False, will break typechecker

if TYPE_CHECKING and IS_PYSPARK_CONNECT:
    def some_fn() -> int:
        return 1
else:
    def some_fn() -> int:
        return "string"
```

Maybe we could just set the constant when packaging pyspark-connect; I suppose that would also replace is_remote_only().

@durgaprasadml

Author

Thanks for pointing this out — I reviewed #45053 and the pyspark-connect packaging split introduced there.

My understanding is that this PR should still remain safe for both pyspark and pyspark-connect because the added condition only affects static analysis:

```python
if TYPE_CHECKING or not is_remote_only():
```

At runtime, TYPE_CHECKING is always False, so the existing is_remote_only() behavior and packaging separation introduced in #45053 remain unchanged.
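A quick way to sanity-check that claim, assuming nothing beyond the standard typing module: the sketch below evaluates the new condition for both possible results of is_remote_only(). The `guard` helper is a hypothetical name used only for this illustration.

```python
from typing import TYPE_CHECKING


def guard(remote_only: bool) -> bool:
    """Value of the new condition for a given is_remote_only() result."""
    return TYPE_CHECKING or not remote_only


# At runtime TYPE_CHECKING is False, so the guard reduces to `not remote_only`,
# which is exactly the condition used before this PR.
for remote_only in (True, False):
    assert guard(remote_only) == (not remote_only)
```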

The goal here was specifically to expose these members to static analyzers so they do not fall back to __getattr__ resolution during type checking.

I agree that a packaging-specific constant could also work, especially if there is future interest in making the static behavior package-aware. For now I tried to keep the change minimal and localized without introducing additional packaging/build logic.

Happy to adjust the approach if maintainers think the packaging-aware direction is preferable.


Development

Successfully merging this pull request may close these issues.

Pyspark: DataFrame methods behind is_remote_only() statically evaluate to Union during typechecking
