[SPARK-57021][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect #55574
dbtsai wants to merge 1 commit into
Yicong-Huang left a comment
I think there are some changes needed, please see inline comments.
cloud-fan left a comment
Summary
Prior state and problem. SQLContext has been deprecated since Spark 2.0 in favor of SparkSession, but legacy code still constructs it directly via SQLContext(sc) or SQLContext.getOrCreate(). Both classic paths hard-require a JVM SparkContext (__init__ dereferences sparkContext._jsc, _jvm, and sparkSession._jsparkSession.sqlContext()). In a Spark Connect remote-only install (pyspark-client) there is no SparkContext, so any of these calls crashes. The Connect SparkSession itself goes further — __getattr__ raises JVM_ATTRIBUTE_NOT_SUPPORTED for _jsc, _jconf, _jvm, _jsparkSession, sparkContext, and newSession — so the classic class can't be made to work by handing it a Connect session.
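The failure mode described above can be sketched without Spark. This is a minimal illustration only: `FakeConnectSession`, `FakeClassicSQLContext`, and `JvmAttributeNotSupported` are hypothetical stand-ins, not PySpark classes, and the attribute list is copied from the paragraph above.

```python
class JvmAttributeNotSupported(AttributeError):
    """Hypothetical stand-in for the JVM_ATTRIBUTE_NOT_SUPPORTED error."""


class FakeConnectSession:
    """Hypothetical stand-in for a Connect SparkSession (not the real class)."""

    _JVM_ONLY = frozenset(
        {"_jsc", "_jconf", "_jvm", "_jsparkSession", "sparkContext", "newSession"}
    )

    def __getattr__(self, name):
        # __getattr__ only runs after normal attribute lookup has failed.
        if name in self._JVM_ONLY:
            raise JvmAttributeNotSupported(f"{name} is not supported in Spark Connect")
        raise AttributeError(name)


class FakeClassicSQLContext:
    """Mimics a constructor that dereferences JVM handles eagerly."""

    def __init__(self, session):
        # Touching a JVM-only attribute fails before any wrapper state exists.
        self._jsqlContext = session._jsparkSession.sqlContext()


def try_construct():
    """Return the error message raised by the classic-style constructor, if any."""
    try:
        FakeClassicSQLContext(FakeConnectSession())
    except JvmAttributeNotSupported as exc:
        return str(exc)
    return None
```

Under these assumptions, handing a Connect-style session to a classic-style constructor fails on the first JVM attribute access, which is why a separate wrapper class is needed.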
Design approach.
- A new `pyspark.sql.connect.context.SQLContext` that wraps a Connect `SparkSession` and pure-delegates almost every method; `HiveContext` is a sentinel that raises everywhere.
- Classic `SQLContext.getOrCreate(sc=None)` checks `is_remote_only()` and dispatches to `ConnectSQLContext._get_or_create_from_session(active_session)`. `sc` becomes optional; the classic branch keeps an `assert`.
Key design decisions made by this PR.
- Dispatch keyed on `is_remote_only()`, not session type: only fires for pyspark-client installs; full installs still take the JVM path.
- `HiveContext` raises on every Connect construction: closes the direct-Connect bypass (commit c967950). The classic-dispatch path still bypasses this; see Finding #2.
- `SQLContext.newSession()` is implemented via `cloneSession()`: opposite state semantics from classic `newSession()`; see Finding #3.
- `tables()` materializes catalog rows on the driver rather than using `SHOW TABLES`, for column-name stability across catalog versions (per the 2026-05-26 follow-up to @Yicong-Huang).
Notes on the existing review thread.
- HiveContext bypass: c967950 closes the direct Connect path. The classic-dispatch path remains open; see Finding #2.
- Test reorg (@zhengruifeng, 2026-05-27): still open and the right call, because the `SQLContextTestsMixin` currently never runs on the classic side. The current file naming `test_connect_context.py` also doesn't match the parity-mixin convention (`test_parity_*.py`).
PR description. The opening sentence is broken — "...implementation in continues to work transparently...". Please fix the missing clause.
| FutureWarning, | ||
| ) | ||
| if is_remote_only(): | ||
| from pyspark.sql.connect.context import SQLContext as ConnectSQLContext |
Finding #2 — HiveContext bypass through classic dispatch.
This is_remote_only() branch hardcodes ConnectSQLContext regardless of cls. A remote-only user calling from pyspark.sql import HiveContext; HiveContext.getOrCreate() reaches here with cls=HiveContext, gets handed back a plain ConnectSQLContext, and never sees the PySparkNotImplementedError they should.
c967950 closed the direct-Connect bypass via _from_session, but the classic dispatch bypasses _from_session entirely because it routes to ConnectSQLContext, not ConnectHiveContext.
Two fixes:
- (a) route based on `cls`: `if cls is HiveContext: from pyspark.sql.connect.context import HiveContext as ConnectHiveContext; ...`; or
- (b) override `HiveContext.getOrCreate` in the classic file to raise in remote-only mode before delegating.
Fixed via your option (a): the classic SQLContext.getOrCreate now does connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext) before calling _get_or_create_from_session. When cls is HiveContext, this routes to ConnectHiveContext._get_or_create_from_session, which calls ConnectHiveContext._from_session — and that raises PySparkNotImplementedError.
| return cls._instantiatedContext | ||
|
|
||
| @classmethod | ||
| def getOrCreate(cls: Type["SQLContext"], sparkSession: "SparkSession") -> "SQLContext": |
Finding #4 — signature divergence + duplication (subsumed if Finding #1 takes path (a)).
(a) Classic getOrCreate(sc=None) makes the session optional and resolves the active one inside; this Connect signature is getOrCreate(sparkSession) — positional, required. Code that does SQLContext.getOrCreate() after importing from pyspark.sql.connect.context directly hits a TypeError.
(b) The body of this method is identical to _get_or_create_from_session except for warnings.warn — it could just call through.
Both concerns stop mattering if the class is made implementation-private (Finding #1).
Fixed as a consequence of Finding #1: with getOrCreate removed from the Connect class, both (a) the signature divergence and (b) the duplication disappear.
| a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of | ||
| column names. | ||
| samplingRatio : float, optional | ||
| the sample ratio of rows used for inferring |
Sentence is incomplete — inferring what?
| the sample ratio of rows used for inferring | |
| the sample ratio of rows used for inferring the schema. |
cloud-fan left a comment
Re-review — 11 addressed, 1 remaining, 5 new (5 new = 3 newly introduced, 2 late catches). 1 blocking, 1 non-blocking, 3 nits.
Thanks for the thorough rework — Findings #1–#4, the doc nits, and the @zhengruifeng test reorg all look correctly addressed. One gap surfaced from the reorg itself, plus a few nits.
Correctness (1)
- `context.py:168`: the `is_remote_only()` dispatch and the Finding #2 fix (`HiveContext.getOrCreate()` raising) are now untested; the reorg dropped `test_getOrCreate_emits_deprecation_warning` and `test_hive_context_get_or_create_raises` without replacement. See inline.
Design / architecture (1)
- `connect/context.py:49`: class declared internal but still exported in `__all__`. See inline.
Nits: 3 minor items (see inline comments).
PR description suggestions
- The opening sentence is still broken ("…implementation in continues to work transparently…"); flagged last round, still unfixed. Add the missing clause (e.g. "implementation in `pyspark.sql.connect.context` so that legacy code using `SQLContext` continues to work…").
- The "How was this patch tested?" section claims Connect-specific cases include "deprecation warnings on `__init__` and `getOrCreate`", but the `getOrCreate` test was dropped in the reorg and no longer exists; restore it (see the inline comment) or update the text.
Yicong-Huang left a comment
LGTM. Thanks for the thorough rework. All my earlier change requests are addressed. Approving with a few non-blocking nits left inline.
cloud-fan left a comment
5 addressed, 0 remaining, 1 new. All previous-round findings are correctly addressed (dispatch tests, __all__ removal, comment/dead-import/opening-sentence fixes all verified).
Nits (1)
- `connect/context.py:109`: `versionadded` should be `4.3.0`, not `4.2.0`; `4.2.0` is in code freeze and this PR isn't a backport candidate, so it lands in the next open feature release. Applies to all ~22 occurrences in the file. See inline.
Also still open from @Yicong-Huang (non-blocking):
- `connect/context.py:93`: the cached context isn't validated against the currently-active session. Classic `_get_or_create` re-creates when the cached context is dead (`_sc._jsc is None`); the Connect path resets only via the `stop()` hook.
- `test_sql_context.py:123`: the shared mixin doesn't cover `udf`/`udtf`/`registerFunction`/`createExternalTable`, though the PR description says it does.
PR description suggestions
- Align "How was this patch tested?" with the mixin: it does not exercise `udf`/`udtf`/`registerFunction`/`createExternalTable`. Add the cases or correct the text (same point as @Yicong-Huang's `test_sql_context.py:123`).
viirya left a comment
Solid direction — reconnecting the deprecated SQLContext for Connect is worth doing, and sharing the suite via SQLContextTestsMixin across both modes is the right structure. But a couple of the central claims in the PR description don't hold up against what's actually wired, plus some smaller items.
SQLContext(spark) still isn't wired to Connect — only getOrCreate() is. The PR description says "calling SQLContext(spark) or SQLContext.getOrCreate() ... After this PR, both calls succeed", but from pyspark.sql import SQLContext still resolves to the classic class (pyspark/sql/__init__.py:44), and only getOrCreate() got the Connect dispatch. The classic constructor dereferences self._sc._jsc and sparkSession._jsparkSession (context.py:118-124), so SQLContext(connect_spark) through the public import raises regardless. The new test_init_emits_deprecation_warning imports the internal pyspark.sql.connect.context.SQLContext directly, so it doesn't exercise the public path and the gap goes uncaught. Either add constructor-level dispatch in the classic __init__, or narrow the description/docstrings to say only getOrCreate() is supported — and add a test that goes through from pyspark.sql import SQLContext.
getOrCreate() gates on is_remote_only(), which is narrower than "running against a Connect server." is_remote_only() (util.py:891) is only true when pyspark-client is installed alone (no RDD/JVM at all). A normal full PySpark install talking to a Connect server via SPARK_REMOTE / a remote builder has is_remote_only() == False, so getOrCreate() with no sc falls through to assert sc is not None and fails — exactly the "running against a Connect server" case the PR is meant to cover. The predicate used elsewhere in PySpark for "are we in Connect mode" is is_remote() (pyspark/sql/utils.py:247, = SPARK_CONNECT_MODE_ENABLED or is_remote_only()). Switching to that covers both packagings; I traced the resolve path and _getActiveSessionOrCreate() still lands on the active Connect session in that case, so the cast stays valid.
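The difference between the two predicates can be laid out as a small truth table. This is only a sketch: the real `is_remote()` and `is_remote_only()` take no arguments, so the parameterized functions below are hypothetical stand-ins that encode the relationship quoted in this comment.

```python
def is_remote_only(client_only_install: bool) -> bool:
    # Stand-in: true only when pyspark-client is installed alone (no JVM).
    return client_only_install


def is_remote(connect_mode_enabled: bool, client_only_install: bool) -> bool:
    # Stand-in for: is_remote() = SPARK_CONNECT_MODE_ENABLED or is_remote_only()
    return connect_mode_enabled or is_remote_only(client_only_install)


# (connect_mode_enabled, client_only_install) for three packaging scenarios.
SCENARIOS = {
    "full install, classic session": (False, False),
    "full install, SPARK_REMOTE set": (True, False),
    "pyspark-client only": (True, True),
}


def dispatch_table():
    """Map each scenario to (is_remote_only, is_remote)."""
    return {
        label: (is_remote_only(client_only), is_remote(connect_mode, client_only))
        for label, (connect_mode, client_only) in SCENARIOS.items()
    }
```

The middle row is the case raised here: a gate on the narrower predicate evaluates to false there, while a gate on the broader one evaluates to true.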
Test mixin breaks the setUp/tearDown super() chain. SQLContextTestsMixin.setUp/tearDown only reset _instantiatedContext and never call super(). For SQLContextTests(Mixin, ReusedSQLTestCase) the mixin runs directly and ReusedSQLTestCase.tearDown (sqlutils.py:338, cleanupPythonWorkerLogs()) is swallowed; for SQLContextParityTests(Mixin, ReusedConnectTestCase) the subclass's super().setUp() resolves to the mixin (earlier in the MRO) which doesn't chain up, so ReusedConnectTestCase's ML-cache and worker-log cleanup never run. Best-effort cleanup so the tests likely still pass, but the chain should be preserved:
    def setUp(self) -> None:
        super().setUp()
        SQLContext._instantiatedContext = None

    def tearDown(self) -> None:
        SQLContext._instantiatedContext = None
        super().tearDown()

legacy.rst: `.. deprecated::` has no version argument. Sphinx expects `.. deprecated:: <version>`; without it the directive renders with an empty version (or warns). Suggest 3.0.0 to match the class docstring.
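The `setUp`/`tearDown` chaining point in this comment can be reproduced with plain `unittest`. This is a minimal sketch: `Base` is a hypothetical stand-in for `ReusedSQLTestCase`, and the mixins only record which hooks ran.

```python
import unittest

calls = []


class Base(unittest.TestCase):
    """Hypothetical stand-in for a shared base test case with its own hooks."""

    def setUp(self):
        calls.append("Base.setUp")

    def tearDown(self):
        calls.append("Base.tearDown")


class BrokenMixin:
    def setUp(self):
        calls.append("BrokenMixin.setUp")  # no super(): Base.setUp never runs

    def tearDown(self):
        calls.append("BrokenMixin.tearDown")  # no super(): Base.tearDown never runs


class ChainedMixin:
    def setUp(self):
        super().setUp()
        calls.append("ChainedMixin.setUp")

    def tearDown(self):
        calls.append("ChainedMixin.tearDown")
        super().tearDown()


class BrokenCase(BrokenMixin, Base):
    def test_noop(self):
        pass


class ChainedCase(ChainedMixin, Base):
    def test_noop(self):
        pass


def run_case(case_cls):
    """Run the single test of case_cls and return the hooks that fired, in order."""
    calls.clear()
    case_cls("test_noop").run(unittest.TestResult())
    return list(calls)
```

Because the mixin comes first in the MRO, its hooks are the ones `unittest` calls; the base-class hooks only run if the mixin chains up.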
Minor: in session.py, stop() runs from pyspark.sql.connect.context import SQLContext unconditionally before the _instantiatedContext is not None check — cheap (import is cached) but could sit behind the guard.
For what it's worth, the parts that are wired look correct: tables() reconstructs the (namespace, tableName, isTemporary) shape via catalog.listTables() instead of SHOW TABLES, the getOrCreate dispatch routes HiveContext to its raising _from_session, and overriding _instantiatedContext on the Connect HiveContext is necessary to avoid handing back the wrong type from the cache.
Thanks for the careful review, @viirya! Addressed in b51a754:
And thanks for confirming the wired parts (
viirya left a comment
Traced this against classic context.py, the Connect session/catalog, and the SparkSession.builder remote-dispatch path. Solid compatibility shim — the design is clean and the tricky cases are handled deliberately and correctly. A few minor things, only one of which I'd consider blocking.
The Connect dispatch premise holds: getOrCreate routes through classic SparkSession._getActiveSessionOrCreate(), and since getActiveSession is @try_remote_session_classmethod and the builder dispatches to RemoteSparkSession under connect mode, the returned session really is a Connect one.
The HiveContext handling is the part I want to call out as correct rather than accidental. Because _get_or_create_from_session is inherited and reads cls._instantiatedContext, without HiveContext's own _instantiatedContext = None class var the attribute lookup would walk up to a cached SQLContext instance, find it non-None, and return it — so HiveContext.getOrCreate() would hand back a plain SQLContext instead of raising. The override shadows that and forces the _from_session -> PySparkNotImplementedError path. The inline comment explains exactly this; good catch.
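The class-attribute lookup described here can be shown in isolation. This is a sketch under stated assumptions: the class names are hypothetical, `NotImplementedError` stands in for `PySparkNotImplementedError`, and the caching logic is simplified.

```python
class SQLContextSketch:
    _instantiatedContext = None

    @classmethod
    def _from_session(cls, session):
        ctx = cls.__new__(cls)
        ctx.sparkSession = session
        return ctx

    @classmethod
    def _get_or_create_from_session(cls, session):
        # Reads through cls, so subclasses without their own attribute
        # see whatever the parent class has cached.
        if cls._instantiatedContext is None:
            cls._instantiatedContext = cls._from_session(session)
        return cls._instantiatedContext


class HiveWithoutOverride(SQLContextSketch):
    @classmethod
    def _from_session(cls, session):
        raise NotImplementedError("HiveContext is not supported")


class HiveWithOverride(SQLContextSketch):
    # Shadows the parent's cache slot, so the lookup stops here.
    _instantiatedContext = None

    @classmethod
    def _from_session(cls, session):
        raise NotImplementedError("HiveContext is not supported")


def demo():
    session = object()
    base_ctx = SQLContextSketch._get_or_create_from_session(session)
    # Without the override, the parent's cached instance leaks through.
    leaked = HiveWithoutOverride._get_or_create_from_session(session)
    try:
        HiveWithOverride._get_or_create_from_session(session)
        raised = False
    except NotImplementedError:
        raised = True
    return leaked is base_ctx, raised
```

In this sketch the subclass without its own slot returns the parent's cached context, while the subclass with the shadowing attribute reaches `_from_session` and raises.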
tables() matches the classic schema (namespace, tableName, isTemporary), and t.namespace[-1] if t.namespace else "" correctly yields "" for temp views. The comment justifying listTables() over SHOW TABLES (stable column names across catalogs) is the right reasoning. The session-stop cache reset is well done too — under the lock, only resetting when the stopped session is the one the cached context wraps, and using sys.modules.get(...) to avoid importing the context module (and a circular import) when no SQLContext was ever created.
Things I'd change or consider:
- `assert sc is not None` for the classic path is the one I'd actually fix. `assert` is stripped under `python -O`, so in optimized mode a classic `getOrCreate()` with no `sc` skips the guard and fails later with a more cryptic error (an `AttributeError` deep in `_get_or_create` when it touches `sc._jvm`). For validating a public-API argument, prefer a `PySparkValueError`/`ValueError` so the contract holds regardless of `-O`.
- Duplicate `SQLContextTests` class name: there's already one in `test_context.py` (a minimal classic smoke test) and the new `test_sql_context.py` adds another. No runtime collision, but it's a discoverability wrinkle; consider renaming the new classic one (e.g. `SQLContextClassicTests`) or folding the old smoke test into the new mixin-based suite.
- `legacy.rst` lists `registerJavaFunction`, which raises under Connect. The docs build is fine since `currentmodule:: pyspark.sql` resolves to the classic class, but the page renders the classic docstring with no hint that this method (and `HiveContext`) is unsupported under Connect. Optional: a one-line note on the page.
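The `assert` point from the first item above can be demonstrated without restarting the interpreter, since the built-in `compile()` accepts an `optimize` level that mirrors `-O`. This is a sketch only: `guarded` and `explicit` are hypothetical functions, and `ValueError` stands in for `PySparkValueError`.

```python
SOURCE = '''
def guarded(sc=None):
    assert sc is not None, "sc is required"
    return sc.upper()
'''


def load(optimize):
    """Compile SOURCE at the given optimization level and return `guarded`."""
    namespace = {}
    exec(compile(SOURCE, "<sketch>", "exec", optimize=optimize), namespace)
    return namespace["guarded"]


def explicit(sc=None):
    # An explicit check survives every optimization level.
    if sc is None:
        raise ValueError("Argument `sc` is required.")
    return sc.upper()


def error_type(func):
    """Call func() with no arguments and report the exception type name."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None
```

At level 0 the assert fires; at level 1 it is compiled away and the failure surfaces later as an unrelated attribute error, which is the cryptic behavior the explicit check avoids.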
The shared-mixin test approach (same suite over classic and Connect via SQLContextParityTests) is exactly right for a wrapper like this and gives real parity confidence, and the Connect-specific cases (deprecation warning, public-path getOrCreate dispatch, both HiveContext rejection routes, registerJavaFunction) are all covered. One small gap: test_newSession_returns_distinct_instance only asserts ctx2 is not ctx, but since the Connect newSession semantics deliberately differ from classic (clones state via cloneSession vs. classic's shared-cache-only fresh session), a test that locks in the documented inherited-state behavior would be worth adding. Optional.
Nice work — happy to approve once the assert sc is not None becomes an explicit error; the rest are cosmetic/optional.
Thanks for tracing through it so carefully, @viirya, and for confirming the Connect-dispatch and
CI was green on the previous revision; I'll keep an eye on this run as well.
viirya left a comment
Thanks for the updates — all the points from my earlier pass are addressed:
- `getOrCreate` now raises `PySparkValueError(ARGUMENT_REQUIRED)` instead of `assert sc is not None`, so the classic-mode contract holds under `python -O` and gives a clear message. Verified the error class and params render correctly.
- The new classic suite is renamed `SQLContextClassicTests`, removing the clash with `test_context.py::SQLContextTests`.
- `legacy.rst` now carries a note that `registerJavaFunction` and `HiveContext` are unsupported under Connect and raise `PySparkNotImplementedError`.
- `test_newSession_inherits_state` now locks in the `cloneSession` semantics (temp view created on the parent is visible in the clone), and cleans up the cloned session in `finally`.
I also re-checked that stopping the cloned session in that new test cannot wrongly clear the cache (the clone is built via _from_session and never written to _instantiatedContext, and the reset only fires when _instantiatedContext.sparkSession is self). The wrapper body and the session-stop invalidation are unchanged from what I reviewed before.
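The invalidation rule checked here (reset only when the stopped session is the one the cached context wraps) can be sketched as follows. All names are hypothetical stand-ins, and the lock type is an assumption; the real hook lives in the Connect session's `stop()`.

```python
import threading


class ContextSketch:
    _lock = threading.RLock()  # assumed lock type, for illustration only
    _instantiatedContext = None

    def __init__(self, session):
        self.sparkSession = session

    @classmethod
    def get_or_create(cls, session):
        with cls._lock:
            if cls._instantiatedContext is None:
                cls._instantiatedContext = cls(session)
            return cls._instantiatedContext


class SessionSketch:
    def stop(self):
        with ContextSketch._lock:
            cached = ContextSketch._instantiatedContext
            # Identity check: an unrelated session must not clear the cache.
            if cached is not None and cached.sparkSession is self:
                ContextSketch._instantiatedContext = None


def demo():
    parent = SessionSketch()
    cached = ContextSketch.get_or_create(parent)
    clone = SessionSketch()
    ContextSketch(clone)  # built directly, never written to the cache
    clone.stop()
    survived = ContextSketch._instantiatedContext is cached
    parent.stop()
    cleared = ContextSketch._instantiatedContext is None
    return survived, cleared
```

Stopping the uncached clone leaves the cache intact; stopping the wrapped session clears it.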
LGTM.
cloud-fan left a comment
3 addressed, 0 remaining, 4 new. (4 new = 0 newly introduced, 4 late catches — my misses from earlier rounds.) All round-3 findings are correctly resolved (versionadded 4.3.0, session-cache validation, mixin coverage), and the viirya/HyukjinKwon feedback is correctly applied.
Design / architecture (2)
- `connect/context.py:126`: reinforcing @hvanhovell's open comment. `newSession()` via `cloneSession()` inverts classic fresh-session semantics, and Connect `SparkSession` itself deliberately raises on `newSession`; recommend raising `PySparkNotImplementedError` instead. See inline (blocking until the thread is settled).
- `context.py:176`: `getattr` fallback silently swaps user-defined `SQLContext` subclasses to the base Connect class. See inline (non-blocking).
Correctness (1)
- `connect/context.py:358`: `t.namespace[-1]` truncates multi-part v2 namespaces where classic emits the full quoted namespace. See inline (non-blocking).
Nits: 1 minor item (see inline comment).
| session sharing only the table cache, this uses :meth:`SparkSession.cloneSession` and | ||
| inherits the current session's state. | ||
| """ | ||
| return self._from_session(self.sparkSession.cloneSession()) |
Reinforcing @hvanhovell's comment with what I traced — I accepted the docstring-only fix last round, and I now think that was a miss. Classic newSession() returns a fresh session (separate conf/temp views/UDFs, shared cache — pyspark/sql/session.py:717-735); cloneSession() copies all of that state in, so code relying on newSession() isolation silently sees parent state. Notably, Connect's own SparkSession deliberately raises JVM_ATTRIBUTE_NOT_SUPPORTED for newSession (pyspark/sql/connect/session.py:1019) — there is no fresh-session construct bound to the same connection, and cloneSession() is a developer API with the opposite semantics. Given that, I'd raise PySparkNotImplementedError here (consistent with registerJavaFunction and HiveContext) rather than silently substituting different semantics; test_newSession_inherits_state and the Connect leg of test_newSession_returns_distinct_instance would flip accordingly, and the PR description's "the Connect equivalent of SparkSession.newSession()" claim should be dropped either way.
Thanks Wenchen. I went the other way on this one -- returning a genuinely fresh session rather than raising -- and I think the parity argument settles it: Scala Connect SparkSession already supports newSession() (sql/connect/common/.../SparkSession.scala:403), implemented as SparkSession.builder().client(client.copy()).create() -- a fresh, independent session bound to the same connection with no state copied in. So this isn't a deliberate cross-language "Connect can't do newSession" stance; the Python side was just missing the construct Scala already has.
- Added `SparkSession.newSession()` to the Connect Python `SparkSession`, mirroring Scala: it rebuilds the client against the same endpoint with the session id cleared, so a fresh UUID is generated and the server lazily creates an empty isolated session -- no `CloneSession` RPC, no state copy, and no dependency on the `cloneSession` developer API.
- Removed the now-dead `"newSession"` entry from the `__getattr__` `JVM_ATTRIBUTE_NOT_SUPPORTED` guard (the real method shadows it anyway).
- `SQLContext.newSession()` delegates to it; the test now asserts the parent's temp views are not visible in the new session.
This gives @hvanhovell the independent-session semantics he asked for, matches classic newSession(), and brings Python Connect to parity with Scala Connect rather than diverging by raising. Happy to revisit if you'd still prefer raising for the deprecated shim specifically, but raising would make df.sparkSession.newSession() behave differently across the Scala and Python Connect clients. I'll also drop the "the Connect equivalent of SparkSession.newSession()" wording from the PR description regardless.
| session = SparkSession._getActiveSessionOrCreate() | ||
| # Route to the Connect counterpart so subclasses (e.g. HiveContext) are handled | ||
| # correctly: the Connect HiveContext._from_session raises PySparkNotImplementedError. | ||
| connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext) |
The default-arg fallback silently hands a base Connect SQLContext to any user-defined classic subclass: MyContext.getOrCreate() in Connect mode returns an object that is not a MyContext, and the subclass's attributes vanish with no signal until an AttributeError. The classic branch instantiates the actual subclass (cls._get_or_create -> cls(...)), so this is a behavior divergence, not just a theoretical one. Since the only known subclass is HiveContext (already routed by name), raising PySparkNotImplementedError when cls.__name__ has no Connect counterpart would fail loudly instead — one line. Non-blocking given how rare subclassing this 1.x entry point is.
Fixed in 3c29340. The dispatch now does getattr(_connect_context, cls.__name__, None) and raises PySparkNotImplementedError when a subclass has no Connect counterpart, instead of silently handing back a base Connect SQLContext. HiveContext is unaffected (it has a Connect counterpart and is still routed by name to raise via _from_session).
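The name-based dispatch with a loud failure can be sketched like this. It is an illustration under assumptions: `_connect_context` is a `SimpleNamespace` standing in for the `pyspark.sql.connect.context` module, the classes are simplified stand-ins, and `NotImplementedError` replaces `PySparkNotImplementedError`.

```python
import types


class ConnectSQLContext:
    @classmethod
    def _get_or_create_from_session(cls, session):
        ctx = cls()
        ctx.sparkSession = session
        return ctx


class ConnectHiveContext(ConnectSQLContext):
    @classmethod
    def _get_or_create_from_session(cls, session):
        raise NotImplementedError("HiveContext is not supported in Spark Connect")


# Stand-in for the Connect context module: attribute names match classic class names.
_connect_context = types.SimpleNamespace(
    SQLContext=ConnectSQLContext, HiveContext=ConnectHiveContext
)


class SQLContext:
    @classmethod
    def getOrCreate(cls, session):
        # Look up the Connect counterpart by the classic class name.
        connect_cls = getattr(_connect_context, cls.__name__, None)
        if connect_cls is None:
            # No counterpart: fail loudly instead of returning the base class.
            raise NotImplementedError(
                f"{cls.__name__} has no Spark Connect counterpart"
            )
        return connect_cls._get_or_create_from_session(session)


class HiveContext(SQLContext):
    pass


class MyContext(SQLContext):
    """A user-defined subclass with no Connect counterpart."""


def outcome(cls):
    try:
        return type(cls.getOrCreate("session")).__name__
    except NotImplementedError as exc:
        return str(exc)
```

With the `None` default, a user-defined subclass gets an immediate, descriptive error rather than an object of a different type.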
| # (namespace, tableName, isTemporary), matching the classic implementation. | ||
| # SHOW TABLES returns "database" vs "namespace" depending on the active catalog. | ||
| rows = [ | ||
| (t.namespace[-1] if t.namespace else "", t.name, t.isTemporary) |
Small parity question: t.namespace[-1] keeps only the last namespace part, while classic SHOW TABLES emits the full quoted namespace (ShowTablesExec.scala:53, ident.namespace().quoted) — so under a v2 catalog with a multi-level namespace the column reads b where classic reads a.b. Identical for v1 single-part namespaces, which is surely the dominant case for this shim, so non-blocking — but ".".join(t.namespace) would match classic exactly if you want full parity.
Good catch -- fixed in 3c29340. Switched t.namespace[-1] to ".".join(t.namespace) so multi-level v2 catalog namespaces read a.b like classic SHOW TABLES instead of dropping to b. Added a comment noting the parity rationale.
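The difference between the two expressions is easy to see on a multi-level namespace. This is a sketch: `TableSketch` is a hypothetical stand-in for a catalog table entry, the empty-namespace guard in `joined_row` is an assumption of this example, and no quoting of namespace parts is attempted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TableSketch:
    """Hypothetical stand-in for a catalog table entry."""

    name: str
    namespace: Optional[List[str]]
    isTemporary: bool


def last_part_row(t: TableSketch) -> Tuple[str, str, bool]:
    # Original expression: keeps only the final namespace part.
    return (t.namespace[-1] if t.namespace else "", t.name, t.isTemporary)


def joined_row(t: TableSketch) -> Tuple[str, str, bool]:
    # Revised expression: keeps every part (guard for temp views assumed here).
    return (".".join(t.namespace) if t.namespace else "", t.name, t.isTemporary)


multi_level = TableSketch("events", ["a", "b"], False)
single_level = TableSketch("events", ["default"], False)
temp_view = TableSketch("tmp", None, True)
```

The two functions agree for single-part namespaces and temp views, and differ only for the multi-level case discussed in this thread.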
| from pyspark.sql.connect.udtf import UDTFRegistration | ||
| from pyspark.sql._typing import UserDefinedFunctionLike | ||
|
|
||
| # Internal module — not part of the public PySpark API surface. |
This em-dash is the only non-ASCII character the PR introduces; comments are conventionally ASCII-only.
| # Internal module — not part of the public PySpark API surface. | |
| # Internal module - not part of the public PySpark API surface. |
Fixed in 3c29340 -- replaced the em-dash with an ASCII hyphen. Confirmed with grep -P "[^\x00-\x7F]" that the file is now ASCII-only.
Ran an additional automated + manual review pass over this PR. Summary of findings and follow-up fixes (commit incoming): 1. Critical: sessions created via
The fixes described above are now pushed:
No code change for item 3 - the "newSession() drops client settings" suggestion was rejected for the reasons above.
cc @hvanhovell
Co-authored-by: Isaac
cc @viirya @cloud-fan @Yicong-Huang @hvanhovell @HyukjinKwon to review again, since there have been many changes after the approvals. Thanks.
I think there are two minor points:
I think both can be fixed in follow ups. If no objections I will merge this PR tonight.
### What changes were proposed in this pull request?

This PR adds a Spark Connect-compatible `SQLContext` (and `HiveContext`) implementation in `pyspark.sql.connect.context` so that legacy code using `SQLContext` continues to work transparently when running against a Connect server.

Key changes:

1. **New `pyspark.sql.connect.context.SQLContext`**: wraps a Connect `SparkSession` directly (no `SparkContext` required). Delegates all supported operations to the session: `sql`, `table`, `range`, `createDataFrame`, `conf`, `udf`, `udtf`, `read`, `readStream`, `streams`, and catalog operations (`cacheTable`, `uncacheTable`, `clearCache`, `tables`, `tableNames`, `registerDataFrameAsTable`, `dropTempTable`, `createExternalTable`).
   - `newSession()` uses `cloneSession()` (the Connect equivalent of `SparkSession.newSession()`).
   - JVM-only APIs (`registerJavaFunction`, `HiveContext.__init__`) raise `PySparkNotImplementedError`.
2. **Connect dispatch in classic `SQLContext.getOrCreate()`**: when running in Spark Connect mode (`is_remote()`, which covers both a remote-only pyspark-client install and a full install talking to a Connect server via `SPARK_REMOTE`), the classic `getOrCreate()` now automatically returns a Connect `SQLContext` wrapping the active Connect session, so callers do not need to import from `pyspark.sql.connect` directly. Note that only `getOrCreate()` is wired to Connect; the `SQLContext(...)` constructor remains classic-only and still requires a `SparkContext`.
3. **Shared test mixin**: `SQLContextTestsMixin` extracted to `test_sql_context.py` so the same suite runs against both the classic and Connect implementations via `SQLContextParityTests`.
4. **API reference docs**: new `python/docs/source/reference/pyspark.sql/legacy.rst` page listing `SQLContext` and `HiveContext` in the public API reference.
5. **CI registration**: `test_connect_context` registered in `modules.py`.

### Why are the changes needed?

`SQLContext` is deprecated since Spark 2.0 in favor of `SparkSession`, but many existing PySpark applications still instantiate it directly. Without this wrapper, those applications fail entirely on Spark Connect because the classic `SQLContext.__init__` requires a live `SparkContext` (JVM), which is not available in Connect mode. This patch closes that compatibility gap.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, calling `SQLContext.getOrCreate()` in a Spark Connect environment raised an error because the classic implementation requires a `SparkContext`. After this PR, in Spark Connect mode `SQLContext.getOrCreate()` succeeds and returns a fully functional (but still deprecated) `SQLContext` backed by the active Connect session.

The `SQLContext(...)` constructor is unchanged and remains classic-only; it still requires a `SparkContext`. Code that needs Connect compatibility should use `SQLContext.getOrCreate()` (or, preferably, migrate to `SparkSession`).

JVM-specific methods (`registerJavaFunction`, `HiveContext`) now raise a clear `PySparkNotImplementedError` instead of a cryptic JVM/attribute error.

### How was this patch tested?

- Added `SQLContextTestsMixin` in `python/pyspark/sql/tests/test_sql_context.py` covering: `setConf`/`getConf`, `createDataFrame`, `sql`, `table`, `tables`/`tableNames`, `cacheTable`/`uncacheTable`/`clearCache`, `registerDataFrameAsTable`/`dropTempTable`, `createExternalTable`, `range`, `read`, `readStream`, `streams`, `udf`/`udtf`, `registerFunction`, `newSession`.
- `SQLContextConnectTests` in `python/pyspark/sql/tests/connect/test_connect_context.py` adds Connect-specific cases: deprecation warning on `__init__`, `SQLContext.getOrCreate()` via the public `from pyspark.sql import SQLContext` path returns a Connect-backed context and emits a deprecation warning in Connect mode (patching `pyspark.sql.utils.is_remote`), `HiveContext.getOrCreate()` in Connect mode raises `PySparkNotImplementedError`, `registerJavaFunction` raises `PySparkNotImplementedError`, and `HiveContext.__init__` raises `PySparkNotImplementedError`.
- Registered in `dev/sparktestsupport/modules.py` so the Connect test is picked up by CI.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (claude-sonnet-4-6), via Anthropic Claude Code

Closes #55574 from dbtsai/connect-sqlcontext-wrapper.

Authored-by: DB Tsai <dbtsai@apache.org>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
(cherry picked from commit 7071ac1)
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Thanks all, merged to master/4.x. Let's handle minor issues in followups.