[PYTHON] Improve static typing for methods behind is_remote_only #56150
durgaprasadml wants to merge 3 commits into
In #45053, the author mentions a pure Python lib

Setting a constant actually works, I wasn't sure.

```python
from typing import TYPE_CHECKING

IS_PYSPARK_CONNECT = True  # <- set this to False, will break typechecker

if TYPE_CHECKING and IS_PYSPARK_CONNECT:
    def some_fn() -> int:
        return 1
else:
    def some_fn() -> int:
        return "string"
```

Maybe we could just set the constant on packaging

Thanks for pointing this out. I reviewed #45053 and the pyspark-connect packaging split introduced there.

My understanding is that this PR should still remain safe for both pyspark and pyspark-connect because the added condition only affects static analysis:

```python
if TYPE_CHECKING or not is_remote_only():
```

At runtime, TYPE_CHECKING is always False, so the existing is_remote_only() behavior and the packaging separation introduced in #45053 remain unchanged. The goal here was specifically to expose these members to static analyzers so they do not fall back to `__getattr__` resolution during type checking.

I agree that a packaging-specific constant could also work, especially if there is future interest in making the static behavior package-aware. For now I tried to keep the change minimal and localized, without introducing additional packaging or build logic. Happy to adjust the approach if maintainers think the packaging-aware direction is preferable.

What changes were proposed in this pull request?
This PR improves static typing support for PySpark methods conditionally defined behind is_remote_only().
Previously, methods such as:
were defined inside:
```python
if not is_remote_only():
```
Static type checkers such as mypy cannot statically evaluate is_remote_only(), which causes these methods to fall back to `__getattr__` resolution during type analysis. As a result, attributes such as df.rdd could be inferred as incorrect union types (for example Union[RDD[Row], Column]).
This PR updates those conditional definitions to:
```python
if TYPE_CHECKING or not is_remote_only():
```
This allows static analyzers to unconditionally resolve the methods during type checking while preserving the existing runtime behavior for Spark Connect and remote-only environments.
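As a rough illustration of the pattern described above, here is a minimal, self-contained sketch. It does not use PySpark: `is_remote_only`, `Column`, and `DataFrame` below are hypothetical stand-ins written only for this example, and `is_remote_only` is hard-coded to `True` to imitate a remote-only install.

```python
from typing import TYPE_CHECKING, List


def is_remote_only() -> bool:
    """Hypothetical stand-in for the runtime check; hard-coded for the demo."""
    return True


class Column:
    """Hypothetical placeholder for a column reference."""

    def __init__(self, name: str) -> None:
        self.name = name


class DataFrame:
    """Toy class, not the real pyspark.sql.DataFrame."""

    # Type checkers treat TYPE_CHECKING as True, so they always see `rdd`.
    # At runtime TYPE_CHECKING is False, so only is_remote_only() decides
    # whether the property is actually defined.
    if TYPE_CHECKING or not is_remote_only():

        @property
        def rdd(self) -> List[int]:
            return []

    def __getattr__(self, name: str) -> Column:
        # Only reached when normal attribute lookup fails.
        return Column(name)


df = DataFrame()
```

With the stand-in returning `True`, the property is never created at runtime, so `df.rdd` still goes through `__getattr__`, which is the same runtime behavior the unguarded `if not is_remote_only():` would give.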
Additionally, typing regression tests were added for:
Why are the changes needed?
The existing conditional definitions are difficult for static analyzers to evaluate because is_remote_only() is a runtime condition.
For example:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)
reveal_type(df.rdd)
```
Before this change, static analyzers could infer df.rdd as a union involving Column due to `__getattr__` fallback resolution.
Using TYPE_CHECKING is a standard Python typing pattern that exposes these definitions to static analyzers without changing runtime semantics.
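The claim that runtime semantics are unchanged can be checked with a small sketch that compares the two guard expressions for both possible values of the runtime check. The helper names `old_guard` and `new_guard` are hypothetical and exist only for this comparison.

```python
from typing import TYPE_CHECKING


def old_guard(remote_only: bool) -> bool:
    # Condition used before the change.
    return not remote_only


def new_guard(remote_only: bool) -> bool:
    # Condition used after the change; TYPE_CHECKING is False at runtime.
    return TYPE_CHECKING or not remote_only


# Map each value of the runtime check to (old result, new result).
results = {flag: (old_guard(flag), new_guard(flag)) for flag in (True, False)}
```

Both guards agree for either value, so only the view presented to a type checker (which treats `TYPE_CHECKING` as true) differs.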
Does this PR introduce any user-facing change?
No.
This change only affects static type analysis behavior and does not modify runtime behavior.
How was this patch tested?
Verified with targeted PySpark typing regression tests:
```bash
export PYTHONPATH=$(pwd)/python
MYPYPATH=python pytest \
    python/pyspark/sql/tests/typing/test_dataframe.yml \
    python/pyspark/sql/tests/typing/test_session.yml
```
Result:
```text
13 passed
```
Also verified modified files with mypy using Spark's typing configuration.
Closes #56141