[SPARK-48970][PYTHON][ML] Avoid using SparkSession.getActiveSession in spark ML reader/writer#47453
WeichenXu123 wants to merge 4 commits into
merged to master. |
| def loadMetadata(path: String, sc: SparkContext, expectedClassName: String = ""): Metadata = { | ||
| val metadataPath = new Path(path, "metadata").toString | ||
| val spark = SparkSession.getActiveSession.get | ||
| val spark = SparkSession.builder().sparkContext(sc).getOrCreate() |
Hi, @WeichenXu123 , @HyukjinKwon , @zhengruifeng .
This sounds like a regression of
If we cannot get an existing one, I believe we should not create SparkSession here.
Can we recover the existing code?
It will not be a regression. This is Spark ML, which is DataFrame-based MLlib by definition, so we should always have a default session running. The active session is specific to a thread, so it might not exist in the calling thread. Alternatively, we could use SparkSession.getDefaultSession.
| spark.createDataFrame( # type: ignore[union-attr] | ||
| [(metadataJson,)], schema=["value"] | ||
| ).coalesce(1).write.text(metadataPath) | ||
| spark = SparkSession._getActiveSessionOrCreate() |
| metadataPath = os.path.join(path, "metadata") | ||
| spark = SparkSession.getActiveSession() | ||
| metadataStr = spark.read.text(metadataPath).first()[0] # type: ignore[union-attr,index] | ||
| spark = SparkSession._getActiveSessionOrCreate() |
|
Initially, the existing PRs assumed that there was no regression because we use the active sessions. AFAIK, this assumption was the same in the dev mailing list discussion: https://lists.apache.org/thread/s24lqtmno0xtoxxz6pk6tyn726bfwp8q Is this regression inevitable, @HyukjinKwon?
|
|
I replied on the existing thread. |
|
There is no regression. This is Spark ML which is DataFrame-based MLlib. There should be a running Spark session always. |
|
@dongjoon-hyun loads the metadata then loads the model coefficients, you can see the |
|
I think probably we can change the signature of to to avoid such confusion. I will have a try |
|
Thank you, @HyukjinKwon and @zhengruifeng. I'm +1 for both, to have clear semantics.
|
|
For the record and the other reviewers, (2) is implemented and merged to Apache Spark 4.0.0. |
…n spark ML reader/writer
### What changes were proposed in this pull request?
`SparkSession.getActiveSession` returns a thread-local session, but Spark ML readers / writers might be executed in different threads, which causes `SparkSession.getActiveSession` to return None.
### Why are the changes needed?
It fixes bugs like the following:
```
spark = SparkSession.getActiveSession()
> spark.createDataFrame( # type: ignore[union-attr]
[(metadataJson,)], schema=["value"]
).coalesce(1).write.text(metadataPath)
E AttributeError: 'NoneType' object has no attribute 'createDataFrame'
```
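The failure mode above can be reproduced without Spark. The following is a minimal pure-Python sketch of the idea, not Spark's implementation: `ToySession`, `get_active`, `get_active_or_create`, and `lookup_in_worker` are hypothetical names introduced only for illustration. It assumes the active session lives in a per-thread slot while a process-wide default acts as the fallback, which is the behaviour described in this PR.

```python
import threading
from typing import Dict, Optional


class ToySession:
    """Hypothetical stand-in for a session object; this is not a PySpark class."""

    _active = threading.local()               # per-thread "active session" slot
    _default: Optional["ToySession"] = None   # process-wide fallback session
    _lock = threading.Lock()

    @classmethod
    def get_active(cls) -> Optional["ToySession"]:
        # Only sees a session that was activated in the *current* thread.
        return getattr(cls._active, "session", None)

    @classmethod
    def get_active_or_create(cls) -> "ToySession":
        # Prefer the thread-local session, then fall back to (or create) the default.
        session = cls.get_active()
        if session is not None:
            return session
        with cls._lock:
            if cls._default is None:
                cls._default = cls()
            return cls._default

    def activate(self) -> None:
        # Mark this session as active for the current thread only,
        # and register it as the default if none exists yet.
        cls = type(self)
        cls._active.session = self
        with cls._lock:
            if cls._default is None:
                cls._default = self


def lookup_in_worker() -> Dict[str, Optional[ToySession]]:
    """Run both lookups in a separate thread and return what each one found."""
    results: Dict[str, Optional[ToySession]] = {}

    def work() -> None:
        results["active"] = ToySession.get_active()
        results["fallback"] = ToySession.get_active_or_create()

    worker = threading.Thread(target=work)
    worker.start()
    worker.join()
    return results


main_session = ToySession()
main_session.activate()           # active in the main thread only
worker_view = lookup_in_worker()  # the worker thread never activated a session
```

In this sketch the worker thread's plain lookup finds nothing, which corresponds to the `AttributeError` on `None` shown above, while the get-or-create lookup falls back to the session created in the main thread.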
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#47453 from WeichenXu123/SPARK-48970.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>