[SPARK-57021][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect #55574

Closed
dbtsai wants to merge 1 commit into apache:master from dbtsai:connect-sqlcontext-wrapper

dbtsai commented Apr 27, 2026
Member

What changes were proposed in this pull request?

This PR adds a Spark Connect-compatible SQLContext (and HiveContext) implementation in
pyspark.sql.connect.context so that legacy code using SQLContext continues to work
transparently when running against a Connect server.

Key changes:

  1. New pyspark.sql.connect.context.SQLContext — wraps a Connect SparkSession directly
    (no SparkContext required). Delegates all supported operations to the session:
    sql, table, range, createDataFrame, conf, udf, udtf, read,
    readStream, streams, and catalog operations (cacheTable, uncacheTable,
    clearCache, tables, tableNames, registerDataFrameAsTable, dropTempTable,
    createExternalTable).

    • newSession() uses cloneSession() (the Connect equivalent of SparkSession.newSession()).
    • JVM-only APIs (registerJavaFunction, HiveContext.__init__) raise PySparkNotImplementedError.
  2. Connect dispatch in classic SQLContext.getOrCreate() — when running in Spark Connect
    mode (is_remote(), which covers both a remote-only pyspark-client install and a full
    install talking to a Connect server via SPARK_REMOTE), the classic getOrCreate() now
    automatically returns a Connect SQLContext wrapping the active Connect session, so callers
    do not need to import from pyspark.sql.connect directly. Note that only getOrCreate()
    is wired to Connect; the SQLContext(...) constructor remains classic-only and still
    requires a SparkContext.

  3. Shared test mixin: SQLContextTestsMixin extracted to test_sql_context.py so the same
    suite runs against both the classic and Connect implementations via SQLContextParityTests.

  4. API reference docs — new python/docs/source/reference/pyspark.sql/legacy.rst page
    listing SQLContext and HiveContext in the public API reference.

  5. CI registration: test_connect_context registered in modules.py.

Why are the changes needed?

SQLContext is deprecated since Spark 2.0 in favor of SparkSession, but many existing
PySpark applications still instantiate it directly. Without this wrapper, those applications
fail entirely on Spark Connect because the classic SQLContext.__init__ requires a live
SparkContext (JVM), which is not available in Connect mode. This patch closes that
compatibility gap.

Does this PR introduce any user-facing change?

Yes. Previously, calling SQLContext.getOrCreate() in a Spark Connect environment raised an
error because the classic implementation requires a SparkContext. After this PR, in Spark
Connect mode SQLContext.getOrCreate() succeeds and returns a fully functional (but still
deprecated) SQLContext backed by the active Connect session.

The SQLContext(...) constructor is unchanged and remains classic-only — it still requires a
SparkContext. Code that needs Connect compatibility should use SQLContext.getOrCreate()
(or, preferably, migrate to SparkSession).

JVM-specific methods (registerJavaFunction, HiveContext) now raise a clear
PySparkNotImplementedError instead of a cryptic JVM/attribute error.

How was this patch tested?

  • Added SQLContextTestsMixin in python/pyspark/sql/tests/test_sql_context.py covering:
    setConf/getConf, createDataFrame, sql, table, tables/tableNames,
    cacheTable/uncacheTable/clearCache, registerDataFrameAsTable/dropTempTable,
    createExternalTable, range, read, readStream, streams, udf/udtf,
    registerFunction, newSession.
  • SQLContextConnectTests in python/pyspark/sql/tests/connect/test_connect_context.py
    adds Connect-specific cases: deprecation warning on __init__, SQLContext.getOrCreate()
    via the public from pyspark.sql import SQLContext path returns a Connect-backed context
    and emits a deprecation warning in Connect mode (patching pyspark.sql.utils.is_remote),
    HiveContext.getOrCreate() in Connect mode raises PySparkNotImplementedError,
    registerJavaFunction raises PySparkNotImplementedError, and HiveContext.__init__
    raises PySparkNotImplementedError.
  • Registered in dev/sparktestsupport/modules.py so the Connect test is picked up by CI.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (claude-sonnet-4-6), via Anthropic Claude Code

Comment thread python/pyspark/sql/connect/context.py
Comment thread python/pyspark/sql/tests/connect/test_connect_context.py
Comment thread python/pyspark/sql/connect/context.py
Comment thread python/pyspark/sql/connect/context.py Outdated
dbtsai force-pushed the connect-sqlcontext-wrapper branch from 0d26795 to d99ee95 on April 28, 2026 05:14
dbtsai force-pushed the connect-sqlcontext-wrapper branch from 3496379 to 8febe99 on May 22, 2026 20:50
dbtsai changed the title from "[WIP][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect" to "[SPARK-57021][CONNECT][PYTHON] Add SQLContext wrapper for Spark Connect" on May 23, 2026

Yicong-Huang left a comment

Contributor

I think there are some changes needed, please see inline comments.

Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/connect/context.py
Comment thread python/pyspark/sql/tests/test_context.py Outdated
Comment thread python/pyspark/sql/context.py Outdated
dbtsai commented May 26, 2026

Member Author

Comment thread python/pyspark/sql/tests/test_context.py Outdated

cloud-fan left a comment

Contributor

Summary

Prior state and problem. SQLContext has been deprecated since Spark 2.0 in favor of SparkSession, but legacy code still constructs it directly via SQLContext(sc) or SQLContext.getOrCreate(). Both classic paths hard-require a JVM SparkContext (__init__ dereferences sparkContext._jsc, _jvm, and sparkSession._jsparkSession.sqlContext()). In a Spark Connect remote-only install (pyspark-client) there is no SparkContext, so any of these calls crashes. The Connect SparkSession itself goes further — __getattr__ raises JVM_ATTRIBUTE_NOT_SUPPORTED for _jsc, _jconf, _jvm, _jsparkSession, sparkContext, and newSession — so the classic class can't be made to work by handing it a Connect session.

Design approach.

  1. A new pyspark.sql.connect.context.SQLContext that wraps a Connect SparkSession and pure-delegates almost every method; HiveContext is a sentinel that raises everywhere.
  2. Classic SQLContext.getOrCreate(sc=None) checks is_remote_only() and dispatches to ConnectSQLContext._get_or_create_from_session(active_session). sc becomes optional; the classic branch keeps an assert.

Key design decisions made by this PR.

  • Dispatch keyed on is_remote_only(), not session type: only fires for pyspark-client installs; full installs still take the JVM path.
  • HiveContext raises on every Connect construction: closes the direct-Connect bypass (commit c967950). The classic-dispatch path still bypasses this — see Finding #2.
  • SQLContext.newSession() is implemented via cloneSession(): opposite state semantics from classic newSession() — see Finding #3.
  • tables() materializes catalog rows on the driver rather than using SHOW TABLES, for column-name stability across catalog versions (per the 2026-05-26 follow-up to @Yicong-Huang).

Notes on the existing review thread.

  • HiveContext bypass: c967950 closes the direct Connect path. The classic-dispatch path remains open — see Finding #2.
  • Test reorg (@zhengruifeng, 2026-05-27): still open and the right call — the SQLContextTestsMixin currently never runs on the classic side. The current file naming test_connect_context.py also doesn't match the parity-mixin convention (test_parity_*.py).

PR description. The opening sentence is broken — "...implementation in continues to work transparently...". Please fix the missing clause.

Comment thread python/pyspark/sql/connect/context.py
Comment thread python/pyspark/sql/context.py Outdated
        FutureWarning,
    )
    if is_remote_only():
        from pyspark.sql.connect.context import SQLContext as ConnectSQLContext

Contributor

Finding #2: HiveContext bypass through classic dispatch.

This is_remote_only() branch hardcodes ConnectSQLContext regardless of cls. A remote-only user calling from pyspark.sql import HiveContext; HiveContext.getOrCreate() reaches here with cls=HiveContext, gets handed back a plain ConnectSQLContext, and never sees the PySparkNotImplementedError they should.

c967950 closed the direct-Connect bypass via _from_session, but the classic dispatch bypasses _from_session entirely because it routes to ConnectSQLContext, not ConnectHiveContext.

Two fixes:

  • (a) route based on cls: if cls is HiveContext: from pyspark.sql.connect.context import HiveContext as ConnectHiveContext; ...; or
  • (b) override HiveContext.getOrCreate in the classic file to raise in remote-only mode before delegating.

Member Author

Fixed via your option (a): the classic SQLContext.getOrCreate now does connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext) before calling _get_or_create_from_session. When cls is HiveContext, this routes to ConnectHiveContext._get_or_create_from_session, which calls ConnectHiveContext._from_session — and that raises PySparkNotImplementedError.
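
The name-based routing described in this reply can be sketched in isolation. The following is a minimal toy model, not the PR's code: every class here is a stand-in, `NotImplementedInConnect` is a hypothetical substitute for PySparkNotImplementedError, and a `SimpleNamespace` plays the role of the `pyspark.sql.connect.context` module.

```python
from types import SimpleNamespace


class NotImplementedInConnect(Exception):
    """Hypothetical stand-in for PySparkNotImplementedError."""


class ConnectSQLContext:
    """Toy Connect-side context that just remembers its session."""

    def __init__(self, session):
        self.sparkSession = session

    @classmethod
    def _from_session(cls, session):
        return cls(session)

    @classmethod
    def _get_or_create_from_session(cls, session):
        # Caching is omitted here; only the routing is being illustrated.
        return cls._from_session(session)


class ConnectHiveContext(ConnectSQLContext):
    @classmethod
    def _from_session(cls, session):
        raise NotImplementedInConnect("HiveContext is not supported in Spark Connect")


# Stand-in for the Connect context module namespace.
_connect_context = SimpleNamespace(
    SQLContext=ConnectSQLContext, HiveContext=ConnectHiveContext
)


class SQLContext:
    """Toy classic-side class; only the Connect branch of getOrCreate is modeled."""

    @classmethod
    def getOrCreate(cls, session):
        # Look up the Connect counterpart by the classic class's own name.
        connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext)
        return connect_cls._get_or_create_from_session(session)


class HiveContext(SQLContext):
    pass
```

In this model `SQLContext.getOrCreate(...)` returns a `ConnectSQLContext`, while `HiveContext.getOrCreate(...)` reaches `ConnectHiveContext._from_session` and raises, which is the behavior the fix is meant to guarantee.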

Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/connect/context.py Outdated
        return cls._instantiatedContext

    @classmethod
    def getOrCreate(cls: Type["SQLContext"], sparkSession: "SparkSession") -> "SQLContext":

Contributor

Finding #4 — signature divergence + duplication (subsumed if Finding #1 takes path (a)).

(a) Classic getOrCreate(sc=None) makes the session optional and resolves the active one inside; this Connect signature is getOrCreate(sparkSession) — positional, required. Code that does SQLContext.getOrCreate() after importing from pyspark.sql.connect.context directly hits a TypeError.

(b) The body of this method is identical to _get_or_create_from_session except for warnings.warn — it could just call through.

Both concerns stop mattering if the class is made implementation-private (Finding #1).

Member Author

Fixed as a consequence of Finding #1: with getOrCreate removed from the Connect class, both (a) the signature divergence and (b) the duplication disappear.

Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/connect/context.py Outdated
    a :class:`~pyspark.sql.types.DataType` or a datatype string or a list of
    column names.
samplingRatio : float, optional
    the sample ratio of rows used for inferring

Contributor

Sentence is incomplete — inferring what?

Suggested change
- the sample ratio of rows used for inferring
+ the sample ratio of rows used for inferring the schema.

Member Author

Fixed.

Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/connect/context.py

cloud-fan left a comment

Contributor

Re-review — 11 addressed, 1 remaining, 5 new (5 new = 3 newly introduced, 2 late catches). 1 blocking, 1 non-blocking, 3 nits.

Thanks for the thorough rework. Findings #1 through #4, the doc nits, and the @zhengruifeng test reorg all look correctly addressed. One gap surfaced from the reorg itself, plus a few nits.

Correctness (1)

  • context.py:168: the is_remote_only() dispatch and the Finding #2 fix (HiveContext.getOrCreate() raising) are now untested — the reorg dropped test_getOrCreate_emits_deprecation_warning and test_hive_context_get_or_create_raises without replacement. See inline.

Design / architecture (1)

  • connect/context.py:49: class declared internal but still exported in __all__. See inline.

Nits: 3 minor items (see inline comments).

PR description suggestions

  • The opening sentence is still broken ("…implementation in continues to work transparently…") — flagged last round, still unfixed. Add the missing clause (e.g. "implementation in pyspark.sql.connect.context so that legacy code using SQLContext continues to work…").
  • The "How was this patch tested?" section claims Connect-specific cases include "deprecation warnings on __init__ and getOrCreate", but the getOrCreate test was dropped in the reorg and no longer exists — restore it (see the inline comment) or update the text.

Comment thread python/pyspark/sql/context.py Outdated
Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/tests/test_context.py Outdated
Comment thread python/pyspark/sql/connect/context.py Outdated
Comment thread python/pyspark/sql/context.py Outdated

Yicong-Huang left a comment

Contributor

LGTM. Thanks for the thorough rework. All my earlier change requests are addressed. Approving with a few non-blocking nits left inline.

Comment thread python/pyspark/sql/connect/context.py
Comment thread python/pyspark/sql/tests/test_sql_context.py

cloud-fan left a comment

Contributor

5 addressed, 0 remaining, 1 new. All previous-round findings are correctly addressed (dispatch tests, __all__ removal, comment/dead-import/opening-sentence fixes all verified).

Nits (1)

  • connect/context.py:109: versionadded should be 4.3.0, not 4.2.0. 4.2.0 is in code freeze and this PR isn't a backport candidate, so it lands in the next open feature release. Applies to all ~22 occurrences in the file. See inline.

Also still open from @Yicong-Huang (non-blocking):

  • connect/context.py:93: the cached context isn't validated against the currently-active session — classic _get_or_create re-creates when the cached context is dead (_sc._jsc is None); the Connect path resets only via the stop() hook.
  • test_sql_context.py:123: the shared mixin doesn't cover udf/udtf/registerFunction/createExternalTable, though the PR description says it does.

PR description suggestions

  • Align "How was this patch tested?" with the mixin: it does not exercise udf/udtf/registerFunction/createExternalTable. Add the cases or correct the text (same point as @Yicong-Huang's test_sql_context.py:123).

Comment thread python/pyspark/sql/connect/context.py Outdated

viirya left a comment

Member

Solid direction — reconnecting the deprecated SQLContext for Connect is worth doing, and sharing the suite via SQLContextTestsMixin across both modes is the right structure. But a couple of the central claims in the PR description don't hold up against what's actually wired, plus some smaller items.

SQLContext(spark) still isn't wired to Connect — only getOrCreate() is. The PR description says "calling SQLContext(spark) or SQLContext.getOrCreate() ... After this PR, both calls succeed", but from pyspark.sql import SQLContext still resolves to the classic class (pyspark/sql/__init__.py:44), and only getOrCreate() got the Connect dispatch. The classic constructor dereferences self._sc._jsc and sparkSession._jsparkSession (context.py:118-124), so SQLContext(connect_spark) through the public import raises regardless. The new test_init_emits_deprecation_warning imports the internal pyspark.sql.connect.context.SQLContext directly, so it doesn't exercise the public path and the gap goes uncaught. Either add constructor-level dispatch in the classic __init__, or narrow the description/docstrings to say only getOrCreate() is supported — and add a test that goes through from pyspark.sql import SQLContext.

getOrCreate() gates on is_remote_only(), which is narrower than "running against a Connect server." is_remote_only() (util.py:891) is only true when pyspark-client is installed alone (no RDD/JVM at all). A normal full PySpark install talking to a Connect server via SPARK_REMOTE / a remote builder has is_remote_only() == False, so getOrCreate() with no sc falls through to assert sc is not None and fails — exactly the "running against a Connect server" case the PR is meant to cover. The predicate used elsewhere in PySpark for "are we in Connect mode" is is_remote() (pyspark/sql/utils.py:247, = SPARK_CONNECT_MODE_ENABLED or is_remote_only()). Switching to that covers both packagings; I traced the resolve path and _getActiveSessionOrCreate() still lands on the active Connect session in that case, so the cast stays valid.
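
The difference between the two predicates can be laid out as a small truth table. This is a toy model only: `toy_is_remote_only` and `toy_is_remote` are hypothetical functions that take the two conditions as plain booleans, following the "SPARK_CONNECT_MODE_ENABLED or is_remote_only()" relationship quoted above. They do not call or reproduce the real PySpark helpers.

```python
def toy_is_remote_only(client_only_install: bool) -> bool:
    """Toy predicate: true only for a pyspark-client install with no JVM side."""
    return client_only_install


def toy_is_remote(connect_mode_enabled: bool, client_only_install: bool) -> bool:
    """Toy predicate: Connect mode enabled, or a client-only install."""
    return connect_mode_enabled or toy_is_remote_only(client_only_install)


# (connect_mode_enabled, client_only_install) for three illustrative deployments.
DEPLOYMENTS = {
    "classic JVM session": (False, False),
    "full install talking to a Connect server": (True, False),
    "pyspark-client only": (False, True),
}

for name, (connect_mode, client_only) in DEPLOYMENTS.items():
    narrow = toy_is_remote_only(client_only)
    broad = toy_is_remote(connect_mode, client_only)
    print(f"{name}: is_remote_only={narrow}, is_remote={broad}")
```

The middle row is the case raised in this comment: the narrow predicate is false there, so a dispatch gated on it would fall through to the classic branch.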

Test mixin breaks the setUp/tearDown super() chain. SQLContextTestsMixin.setUp/tearDown only reset _instantiatedContext and never call super(). For SQLContextTests(Mixin, ReusedSQLTestCase) the mixin runs directly and ReusedSQLTestCase.tearDown (sqlutils.py:338, cleanupPythonWorkerLogs()) is swallowed; for SQLContextParityTests(Mixin, ReusedConnectTestCase) the subclass's super().setUp() resolves to the mixin (earlier in the MRO) which doesn't chain up, so ReusedConnectTestCase's ML-cache and worker-log cleanup never run. Best-effort cleanup so the tests likely still pass, but the chain should be preserved:

def setUp(self) -> None:
    super().setUp()
    SQLContext._instantiatedContext = None

def tearDown(self) -> None:
    SQLContext._instantiatedContext = None
    super().tearDown()
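
The effect of the missing super() calls can be reproduced with plain unittest. The sketch below uses hypothetical class names (`ReusedBase` stands in for ReusedSQLTestCase or ReusedConnectTestCase) and records which hooks actually run.

```python
import unittest

events = []


class ReusedBase(unittest.TestCase):
    """Stand-in for a shared base test case with its own setup and cleanup."""

    def setUp(self):
        events.append("base.setUp")

    def tearDown(self):
        events.append("base.tearDown")


class BrokenMixin:
    """Mixin that never chains up, so the base hooks are skipped."""

    def setUp(self):
        events.append("mixin.setUp")

    def tearDown(self):
        events.append("mixin.tearDown")


class ChainedMixin:
    """Mixin that preserves the cooperative super() chain."""

    def setUp(self):
        super().setUp()
        events.append("mixin.setUp")

    def tearDown(self):
        events.append("mixin.tearDown")
        super().tearDown()


class BrokenTests(BrokenMixin, ReusedBase):
    def test_noop(self):
        pass


class ChainedTests(ChainedMixin, ReusedBase):
    def test_noop(self):
        pass


def run_and_record(case_cls):
    """Run the single test on case_cls and return the hooks that fired, in order."""
    events.clear()
    case_cls("test_noop").run(unittest.TestResult())
    return list(events)
```

With the mixin listed first in the bases, `run_and_record(BrokenTests)` records only the mixin's hooks, while `run_and_record(ChainedTests)` also records the base hooks, with the base tearDown running last.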

legacy.rst: the .. deprecated:: directive has no version argument. Sphinx expects .. deprecated:: <version>; without it the directive renders with an empty version (or warns). Suggest 3.0.0 to match the class docstring.

Minor: in session.py, stop() runs from pyspark.sql.connect.context import SQLContext unconditionally before the _instantiatedContext is not None check — cheap (import is cached) but could sit behind the guard.

For what it's worth, the parts that are wired look correct: tables() reconstructs the (namespace, tableName, isTemporary) shape via catalog.listTables() instead of SHOW TABLES, the getOrCreate dispatch routes HiveContext to its raising _from_session, and overriding _instantiatedContext on the Connect HiveContext is necessary to avoid handing back the wrong type from the cache.

dbtsai commented Jun 4, 2026

Member Author

Thanks for the careful review, @viirya! Addressed in b51a754:

  1. SQLContext(spark) constructor vs getOrCreate(). Narrowed the scope rather than wiring the constructor; this keeps the Connect SQLContext an internal implementation detail (per @cloud-fan's Finding #1) instead of re-exposing a public Connect constructor. Updated the PR description and the change notes to state explicitly that only SQLContext.getOrCreate() is wired to Connect; SQLContext(...) stays classic-only and still requires a SparkContext. The public path through from pyspark.sql import SQLContext is already covered by test_getOrCreate_emits_deprecation_and_returns_connect_context (imports from pyspark.sql import SQLContext and calls getOrCreate()).

  2. is_remote_only() → is_remote(). Done. getOrCreate() now gates on is_remote() (pyspark.sql.utils), so a full PySpark install talking to a Connect server via SPARK_REMOTE is covered, not just the remote-only pyspark-client packaging. Thanks for tracing the resolve path; I confirmed _getActiveSessionOrCreate() lands on the active Connect session, so the cast stays valid. Test patch targets updated to pyspark.sql.utils.is_remote.

  3. Mixin setUp/tearDown super() chain. Fixed — SQLContextTestsMixin.setUp now calls super().setUp() first and tearDown calls super().tearDown() last, so ReusedSQLTestCase / ReusedConnectTestCase ML-cache and worker-log cleanup runs in both SQLContextTests and SQLContextParityTests.

  4. legacy.rst .. deprecated:: version. Fixed — now .. deprecated:: 3.0.0 to match the class docstring.

  5. session.py stop() unconditional import. Fixed — guarded behind sys.modules.get("pyspark.sql.connect.context"), so if no SQLContext was ever created (module never imported) there's nothing to reset and we don't force the import.
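
The guard in item 5 relies on the fact that `sys.modules.get(name)` returns None for a module that was never imported, without triggering an import. The following is a minimal sketch under stated assumptions: the module name `toy_connect_context`, the helper `install_toy_module`, and the function `reset_cached_context` are all hypothetical, and the locking mentioned elsewhere in this thread is left out.

```python
import sys
import types

MODULE_NAME = "toy_connect_context"  # hypothetical module name, not a real package


def reset_cached_context(stopped_session) -> bool:
    """Clear the cached context only if its module is loaded and it wraps this session."""
    module = sys.modules.get(MODULE_NAME)
    if module is None:
        # The module was never imported, so no context can have been cached.
        return False
    cached = module.SQLContext._instantiatedContext
    if cached is not None and cached.sparkSession is stopped_session:
        module.SQLContext._instantiatedContext = None
        return True
    return False


def install_toy_module():
    """Register a fake module in sys.modules so the loaded-module path can be exercised."""
    module = types.ModuleType(MODULE_NAME)

    class SQLContext:
        _instantiatedContext = None

        def __init__(self, session):
            self.sparkSession = session

    module.SQLContext = SQLContext
    sys.modules[MODULE_NAME] = module
    return module
```

Stopping an unrelated session leaves the cache alone in this model, which mirrors the "only when the stopped session is the one the cached context wraps" condition.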

And thanks for confirming the wired parts (tables() via catalog.listTables(), the HiveContext getOrCreate routing, and the Connect HiveContext._instantiatedContext override).

viirya left a comment

Member

Traced this against classic context.py, the Connect session/catalog, and the SparkSession.builder remote-dispatch path. Solid compatibility shim — the design is clean and the tricky cases are handled deliberately and correctly. A few minor things, only one of which I'd consider blocking.

The Connect dispatch premise holds: getOrCreate routes through classic SparkSession._getActiveSessionOrCreate(), and since getActiveSession is @try_remote_session_classmethod and the builder dispatches to RemoteSparkSession under connect mode, the returned session really is a Connect one.

The HiveContext handling is the part I want to call out as correct rather than accidental. Because _get_or_create_from_session is inherited and reads cls._instantiatedContext, without HiveContext's own _instantiatedContext = None class var the attribute lookup would walk up to a cached SQLContext instance, find it non-None, and return it — so HiveContext.getOrCreate() would hand back a plain SQLContext instead of raising. The override shadows that and forces the _from_session -> PySparkNotImplementedError path. The inline comment explains exactly this; good catch.
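
The attribute-lookup behavior described here is ordinary Python class-attribute inheritance and can be shown with a toy hierarchy. The classes below are stand-ins, the built-in NotImplementedError substitutes for PySparkNotImplementedError, and the caching logic is an assumed simplification of the real method.

```python
class SQLContext:
    _instantiatedContext = None

    @classmethod
    def _from_session(cls, session):
        return cls()

    @classmethod
    def _get_or_create_from_session(cls, session):
        # Reads through normal attribute lookup, so a subclass without its own
        # _instantiatedContext sees whatever the parent class has cached.
        if cls._instantiatedContext is None:
            cls._instantiatedContext = cls._from_session(session)
        return cls._instantiatedContext


class HiveContextNoOverride(SQLContext):
    @classmethod
    def _from_session(cls, session):
        raise NotImplementedError("HiveContext is not supported")


class HiveContextWithOverride(SQLContext):
    _instantiatedContext = None  # shadows the parent's cached instance

    @classmethod
    def _from_session(cls, session):
        raise NotImplementedError("HiveContext is not supported")


# Populate the parent's cache first, as a prior SQLContext.getOrCreate() would.
cached = SQLContext._get_or_create_from_session("toy-session")

# Without the override, the lookup finds the parent's cached instance.
leaked = HiveContextNoOverride._get_or_create_from_session("toy-session")
print(leaked is cached)
```

With the override in place, the same call on `HiveContextWithOverride` finds its own None, falls into `_from_session`, and raises.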

tables() matches the classic schema (namespace, tableName, isTemporary), and t.namespace[-1] if t.namespace else "" correctly yields "" for temp views. The comment justifying listTables() over SHOW TABLES (stable column names across catalogs) is the right reasoning. The session-stop cache reset is well done too — under the lock, only resetting when the stopped session is the one the cached context wraps, and using sys.modules.get(...) to avoid importing the context module (and a circular import) when no SQLContext was ever created.

Things I'd change or consider:

  • assert sc is not None for the classic path is the one I'd actually fix. assert is stripped under python -O, so in optimized mode a classic getOrCreate() with no sc skips the guard and fails later with a more cryptic error (an AttributeError deep in _get_or_create when it touches sc._jvm). For validating a public-API argument, prefer a PySparkValueError/ValueError so the contract holds regardless of -O.
  • Duplicate SQLContextTests class name — there's already one in test_context.py (a minimal classic smoke test) and the new test_sql_context.py adds another. No runtime collision, but it's a discoverability wrinkle; consider renaming the new classic one (e.g. SQLContextClassicTests) or folding the old smoke test into the new mixin-based suite.
  • legacy.rst lists registerJavaFunction, which raises under Connect. The docs build is fine since currentmodule:: pyspark.sql resolves to the classic class, but the page renders the classic docstring with no hint that this method (and HiveContext) is unsupported under Connect. Optional: a one-line note on the page.
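
The assert concern in the first bullet can be demonstrated without restarting the interpreter, because the built-in `compile()` accepts an `optimize` argument that strips assert statements the same way `python -O` does. The function names and the `sc.name` access below are hypothetical stand-ins for the real guard and the later `sc._jvm` access, and ValueError stands in for PySparkValueError.

```python
SOURCE = '''
def get_or_create_with_assert(sc=None):
    assert sc is not None
    return sc.name  # stands in for a later attribute access on sc

def get_or_create_with_raise(sc=None):
    if sc is None:
        raise ValueError("sc is required in classic mode")
    return sc.name
'''


def load(optimize: int) -> dict:
    """Compile SOURCE at the given optimization level and return its namespace."""
    namespace = {}
    exec(compile(SOURCE, "<demo>", "exec", optimize=optimize), namespace)
    return namespace


def error_type(func):
    """Call func() with no arguments and return the name of the exception raised."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


normal = load(optimize=0)     # asserts kept
optimized = load(optimize=1)  # asserts stripped, as under python -O

for label, ns in (("normal", normal), ("optimized", optimized)):
    print(label, error_type(ns["get_or_create_with_assert"]),
          error_type(ns["get_or_create_with_raise"]))
```

The explicit check reports the same error type at both levels, while the assert version changes from an AssertionError to a later AttributeError once asserts are stripped.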

The shared-mixin test approach (same suite over classic and Connect via SQLContextParityTests) is exactly right for a wrapper like this and gives real parity confidence, and the Connect-specific cases (deprecation warning, public-path getOrCreate dispatch, both HiveContext rejection routes, registerJavaFunction) are all covered. One small gap: test_newSession_returns_distinct_instance only asserts ctx2 is not ctx, but since the Connect newSession semantics deliberately differ from classic (clones state via cloneSession vs. classic's shared-cache-only fresh session), a test that locks in the documented inherited-state behavior would be worth adding. Optional.

Nice work — happy to approve once the assert sc is not None becomes an explicit error; the rest are cosmetic/optional.

dbtsai commented Jun 4, 2026

Member Author

Thanks for tracing through it so carefully, @viirya — and for confirming the Connect-dispatch and HiveContext reasoning. Addressed in 1b105b6:

  • assert sc is not None (blocking). Replaced with an explicit PySparkValueError(errorClass="ARGUMENT_REQUIRED", ...) so the guard holds under python -O. Good call — the stripped-assert path would have failed with a cryptic AttributeError on sc._jvm deep in _get_or_create.

  • Duplicate SQLContextTests class name. Renamed the new mixin-based classic suite to SQLContextClassicTests. Kept the existing SQLContextTests smoke test in test_context.py since it covers a distinct path (getOrCreate(sc) against a real local SparkContext, which the constructor-based mixin doesn't).

  • legacy.rst note. Added a .. note:: calling out that registerJavaFunction and HiveContext are unsupported under Spark Connect and raise PySparkNotImplementedError.

  • newSession inherited-state test. Added test_newSession_inherits_state to the Connect-specific suite (test_connect_context.py). It registers a temp view, calls newSession(), and asserts the clone sees it, locking in the documented cloneSession inherited-state behavior. It lives in the Connect-only suite rather than the shared mixin precisely because these semantics intentionally differ from classic newSession (fresh session, shared-cache-only), so it can't be a parity assertion.

CI was green on the previous revision; I'll keep an eye on this run as well.

viirya left a comment

Member

Thanks for the updates — all the points from my earlier pass are addressed:

  • getOrCreate now raises PySparkValueError(ARGUMENT_REQUIRED) instead of assert sc is not None, so the classic-mode contract holds under python -O and gives a clear message. Verified the error class and params render correctly.
  • The new classic suite is renamed SQLContextClassicTests, removing the clash with test_context.py::SQLContextTests.
  • legacy.rst now carries a note that registerJavaFunction and HiveContext are unsupported under Connect and raise PySparkNotImplementedError.
  • test_newSession_inherits_state now locks in the cloneSession semantics (temp view created on the parent is visible in the clone), and cleans up the cloned session in finally.

I also re-checked that stopping the cloned session in that new test cannot wrongly clear the cache (the clone is built via _from_session and never written to _instantiatedContext, and the reset only fires when _instantiatedContext.sparkSession is self). The wrapper body and the session-stop invalidation are unchanged from what I reviewed before.

LGTM.

Comment thread python/pyspark/sql/connect/context.py Outdated

cloud-fan left a comment

Contributor

3 addressed, 0 remaining, 4 new. (4 new = 0 newly introduced, 4 late catches — my misses from earlier rounds.) All round-3 findings are correctly resolved (versionadded 4.3.0, session-cache validation, mixin coverage), and the viirya/HyukjinKwon feedback is correctly applied.

Design / architecture (2)

  • connect/context.py:126: reinforcing @hvanhovell's open comment — newSession() via cloneSession() inverts classic fresh-session semantics, and Connect SparkSession itself deliberately raises on newSession; recommend raising PySparkNotImplementedError instead — see inline (blocking until the thread is settled)
  • context.py:176: getattr fallback silently swaps user-defined SQLContext subclasses to the base Connect class — see inline (non-blocking)

Correctness (1)

  • connect/context.py:358: t.namespace[-1] truncates multi-part v2 namespaces where classic emits the full quoted namespace — see inline (non-blocking)
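
The truncation in the Correctness item is easy to see on a multi-part namespace. In this sketch `TableInfo` is a hypothetical stand-in for a catalog listing entry, `last_part` reproduces the `t.namespace[-1] if t.namespace else ""` expression quoted earlier in the thread, and `full_namespace` is only an assumed illustration of keeping every part (identifier quoting is deliberately omitted).

```python
from typing import List, NamedTuple, Optional


class TableInfo(NamedTuple):
    """Hypothetical stand-in for one catalog listing entry."""

    name: str
    namespace: Optional[List[str]]
    isTemporary: bool


def last_part(t: TableInfo) -> str:
    # The expression under review: keeps only the final namespace component.
    return t.namespace[-1] if t.namespace else ""


def full_namespace(t: TableInfo) -> str:
    # Assumed alternative for illustration: join all parts (no quoting applied).
    return ".".join(t.namespace) if t.namespace else ""


nested = TableInfo("events", ["analytics", "raw"], False)
temp_view = TableInfo("tmp_v", None, True)

print(last_part(nested), full_namespace(nested))
```

For single-part namespaces and temp views the two helpers agree; they differ only once the namespace has more than one component.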

Nits: 1 minor item (see inline comment).

Comment thread python/pyspark/sql/connect/context.py Outdated
session sharing only the table cache, this uses :meth:`SparkSession.cloneSession` and
inherits the current session's state.
"""
return self._from_session(self.sparkSession.cloneSession())

Contributor

Reinforcing @hvanhovell's comment with what I traced — I accepted the docstring-only fix last round, and I now think that was a miss. Classic newSession() returns a fresh session (separate conf/temp views/UDFs, shared cache — pyspark/sql/session.py:717-735); cloneSession() copies all of that state in, so code relying on newSession() isolation silently sees parent state. Notably, Connect's own SparkSession deliberately raises JVM_ATTRIBUTE_NOT_SUPPORTED for newSession (pyspark/sql/connect/session.py:1019) — there is no fresh-session construct bound to the same connection, and cloneSession() is a developer API with the opposite semantics. Given that, I'd raise PySparkNotImplementedError here (consistent with registerJavaFunction and HiveContext) rather than silently substituting different semantics; test_newSession_inherits_state and the Connect leg of test_newSession_returns_distinct_instance would flip accordingly, and the PR description's "the Connect equivalent of SparkSession.newSession()" claim should be dropped either way.

Member Author

Thanks Wenchen. I went the other way on this one -- returning a genuinely fresh session rather than raising -- and I think the parity argument settles it: Scala Connect SparkSession already supports newSession() (sql/connect/common/.../SparkSession.scala:403), implemented as SparkSession.builder().client(client.copy()).create() -- a fresh, independent session bound to the same connection with no state copied in. So this isn't a deliberate cross-language "Connect can't do newSession" stance; the Python side was just missing the construct Scala already has.

Pushed in 3c29340 / aaa51c8:

  • Added SparkSession.newSession() to the Connect Python SparkSession, mirroring Scala: it rebuilds the client against the same endpoint with the session id cleared, so a fresh UUID is generated and the server lazily creates an empty isolated session -- no CloneSession RPC, no state copy, and no dependency on the cloneSession developer API.
  • Removed the now-dead "newSession" entry from the __getattr__ JVM_ATTRIBUTE_NOT_SUPPORTED guard (the real method shadows it anyway).
  • SQLContext.newSession() delegates to it; the test now asserts the parent's temp views are not visible in the new session.

This gives @hvanhovell the independent-session semantics he asked for, matches classic newSession(), and brings Python Connect to parity with Scala Connect rather than diverging by raising. Happy to revisit if you'd still prefer raising for the deprecated shim specifically, but raising would make df.sparkSession.newSession() behave differently across the Scala and Python Connect clients. I'll also drop the "the Connect equivalent of SparkSession.newSession()" wording from the PR description regardless.
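As a rough illustration of the "same endpoint, session id cleared" approach, here is a toy sketch in plain Python. `ToyClient`, `SESSION_ID_KEY`, and the example endpoint are assumptions made up for this sketch; it does not reproduce the actual `SparkConnectClient` code, only the idea that dropping the session id lets the constructor generate a fresh UUID.

```python
import copy
import uuid
from typing import Dict, Optional

SESSION_ID_KEY = "session_id"  # hypothetical parameter name for this sketch


class ToyClient:
    """Hypothetical stand-in for a Connect client; not the PySpark class."""

    def __init__(self, endpoint: str, params: Optional[Dict[str, str]] = None) -> None:
        self.endpoint = endpoint
        self.params = dict(params or {})
        # When no session id is supplied, generate a fresh one.
        self.session_id = self.params.get(SESSION_ID_KEY) or str(uuid.uuid4())

    def new_session(self) -> "ToyClient":
        # Keep the endpoint and other options, but drop the session id so the
        # constructor generates a new UUID. No state is carried over.
        params = copy.deepcopy(self.params)
        params.pop(SESSION_ID_KEY, None)
        return ToyClient(self.endpoint, params)


parent = ToyClient(
    "sc://localhost:15002",  # example endpoint, assumed for illustration
    {"user_id": "alice", SESSION_ID_KEY: "fixed-id"},
)
child = parent.new_session()

print(child.endpoint == parent.endpoint)      # True
print(child.session_id != parent.session_id)  # True
```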

Comment thread python/pyspark/sql/context.py Outdated
session = SparkSession._getActiveSessionOrCreate()
# Route to the Connect counterpart so subclasses (e.g. HiveContext) are handled
# correctly: the Connect HiveContext._from_session raises PySparkNotImplementedError.
connect_cls = getattr(_connect_context, cls.__name__, _connect_context.SQLContext)

Contributor

The default-arg fallback silently hands a base Connect SQLContext to any user-defined classic subclass: MyContext.getOrCreate() in Connect mode returns an object that is not a MyContext, and the subclass's attributes vanish with no signal until an AttributeError. The classic branch instantiates the actual subclass (cls._get_or_create -> cls(...)), so this is a behavior divergence, not just a theoretical one. Since the only known subclass is HiveContext (already routed by name), raising PySparkNotImplementedError when cls.__name__ has no Connect counterpart would fail loudly instead — one line. Non-blocking given how rare subclassing this 1.x entry point is.

Member Author

Fixed in 3c29340. The dispatch now does getattr(_connect_context, cls.__name__, None) and raises PySparkNotImplementedError when a subclass has no Connect counterpart, instead of silently handing back a base Connect SQLContext. HiveContext is unaffected (it has a Connect counterpart and is still routed by name to raise via _from_session).
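The fail-loudly dispatch can be sketched in isolation as follows. Everything here is a toy: the classes, the `connect_context` namespace, and `resolve_connect_class` are hypothetical stand-ins, and the built-in `NotImplementedError` is used in place of `PySparkNotImplementedError` so the snippet runs without PySpark.

```python
import types


class SQLContext:
    """Toy classic base class."""


class HiveContext(SQLContext):
    """Toy classic subclass that has a Connect counterpart."""


class MyContext(SQLContext):
    """Toy user-defined subclass with no Connect counterpart."""


class ConnectSQLContext:
    """Toy Connect counterpart of SQLContext."""


class ConnectHiveContext:
    """Toy Connect counterpart of HiveContext."""


# Toy stand-in for the Connect module: class name -> Connect counterpart.
connect_context = types.SimpleNamespace(
    SQLContext=ConnectSQLContext,
    HiveContext=ConnectHiveContext,
)


def resolve_connect_class(cls: type) -> type:
    # Look up the counterpart by name; default to None instead of a base class.
    connect_cls = getattr(connect_context, cls.__name__, None)
    if connect_cls is None:
        # Fail loudly rather than hand back an object of the wrong type.
        raise NotImplementedError(f"{cls.__name__} has no Connect counterpart")
    return connect_cls


print(resolve_connect_class(HiveContext).__name__)  # ConnectHiveContext
```

With the earlier default-argument form, `resolve_connect_class(MyContext)` would have returned the base Connect class; here it raises.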

Comment thread python/pyspark/sql/connect/context.py Outdated
# (namespace, tableName, isTemporary), matching the classic implementation.
# SHOW TABLES returns "database" vs "namespace" depending on the active catalog.
rows = [
(t.namespace[-1] if t.namespace else "", t.name, t.isTemporary)

Contributor

Small parity question: t.namespace[-1] keeps only the last namespace part, while classic SHOW TABLES emits the full quoted namespace (ShowTablesExec.scala:53, ident.namespace().quoted) — so under a v2 catalog with a multi-level namespace the column reads b where classic reads a.b. Identical for v1 single-part namespaces, which is surely the dominant case for this shim, so non-blocking — but ".".join(t.namespace) would match classic exactly if you want full parity.

Member Author

Good catch -- fixed in 3c29340. Switched t.namespace[-1] to ".".join(t.namespace) so multi-level v2 catalog namespaces read a.b like classic SHOW TABLES instead of dropping to b. Added a comment noting the parity rationale.
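For reference, the difference between the two expressions on a multi-level namespace, using a toy record type. `ToyTable` is a hypothetical stand-in for the catalog entries, and the sketch does not model the quoting that classic applies to namespace parts containing special characters.

```python
from typing import NamedTuple, Optional, Sequence


class ToyTable(NamedTuple):
    """Hypothetical stand-in for a catalog table entry."""

    name: str
    namespace: Optional[Sequence[str]]
    isTemporary: bool


def last_part(t: ToyTable) -> str:
    # Earlier behavior: keep only the final namespace component.
    return t.namespace[-1] if t.namespace else ""


def full_namespace(t: ToyTable) -> str:
    # Updated behavior: join all components with dots.
    return ".".join(t.namespace) if t.namespace else ""


v1 = ToyTable("events", ["default"], False)
v2 = ToyTable("events", ["a", "b"], False)
temp = ToyTable("tmp_view", None, True)

print(last_part(v2), full_namespace(v2))  # b a.b
print(last_part(v1), full_namespace(v1))  # default default
```

Single-part namespaces and temporary views (empty namespace) give the same result either way, so only multi-level v2 namespaces are affected.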

Comment thread python/pyspark/sql/connect/context.py Outdated
from pyspark.sql.connect.udtf import UDTFRegistration
from pyspark.sql._typing import UserDefinedFunctionLike

# Internal module — not part of the public PySpark API surface.

Contributor

This em-dash is the only non-ASCII character the PR introduces; comments are conventionally ASCII-only.

Suggested change
- # Internal module — not part of the public PySpark API surface.
+ # Internal module - not part of the public PySpark API surface.

Member Author

Fixed in 3c29340 -- replaced the em-dash with an ASCII hyphen. Confirmed with grep -P "[^\x00-\x7F]" that the file is now ASCII-only.
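For anyone without a PCRE-enabled grep, roughly the same check can be sketched in Python. `find_non_ascii` is a hypothetical helper written for this comment, and the two sample strings are made up; the non-ASCII character is written as an escape so the snippet itself stays ASCII-only.

```python
import re
from typing import List, Tuple

# Same character class as the grep pattern above: anything outside 0x00-0x7F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def find_non_ascii(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines with non-ASCII characters."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if NON_ASCII.search(line)
    ]


before = "import os\n# Internal module \u2014 not part of the public API.\n"
after = "import os\n# Internal module - not part of the public API.\n"

print(len(find_non_ascii(before)))  # 1
print(len(find_non_ascii(after)))   # 0
```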

@dbtsai

dbtsai commented Jun 9, 2026

Member Author

Ran an additional automated + manual review pass over this PR. Summary of findings and follow-up fixes (commit incoming):

1. Critical: sessions created via object.__new__ are missing release_session_on_close

newSession() (added in this PR) builds the new SparkSession with object.__new__(SparkSession) and only sets _client and _session_id. Since __init__ is bypassed, release_session_on_close (set at session.py:312) is never set, and stop() reads it unconditionally at session.py:951 - so calling stop() on a session returned by newSession() raises AttributeError and breaks the normal cleanup path.

cloneSession() has the identical pre-existing bug (it predates this PR), so the fix covers both:

  • Set release_session_on_close = True in both newSession() and cloneSession().
  • Regression assertions added in test_connect_context.py and test_connect_clone_session.py (the local-remote test harness cannot safely call stop() on a child session, so the tests assert the attribute directly).
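The failure mode in item 1 can be reproduced without Spark. The sketch below uses a hypothetical `ToySession` class and helper functions made up for illustration; it only shows why bypassing `__init__` with `object.__new__` leaves an attribute unset, and how setting it explicitly avoids the `AttributeError`.

```python
class ToySession:
    """Hypothetical stand-in for a session; not the PySpark class."""

    def __init__(self, client: str) -> None:
        self._client = client
        self.release_session_on_close = True  # only ever set in __init__

    def stop(self) -> str:
        # Reads the attribute unconditionally, like the stop() described above.
        return "released" if self.release_session_on_close else "kept"


def make_child_buggy(parent: ToySession) -> ToySession:
    child = object.__new__(ToySession)  # bypasses __init__ entirely
    child._client = parent._client
    return child


def make_child_fixed(parent: ToySession) -> ToySession:
    child = object.__new__(ToySession)
    child._client = parent._client
    child.release_session_on_close = True  # the missing assignment
    return child


parent = ToySession("client-1")

try:
    make_child_buggy(parent).stop()
except AttributeError as exc:
    print(type(exc).__name__)  # AttributeError

print(make_child_fixed(parent).stop())  # released
```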

2. Connect client leak in the parity test

SQLContextTestsMixin.test_newSession_returns_distinct_instance runs under SQLContextParityTests(ReusedConnectTestCase), where ctx.newSession() creates a brand-new SparkConnectClient (which registers an atexit hook at core.py:760) that the test never releases or closes. This is exactly the hang-after-tearDownClass failure mode already documented in test_connect_context.py. Fixed by overriding the test in the Connect parity subclass with the same release_session() + close() cleanup; the classic mixin is untouched.
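The cleanup pattern for item 2 looks roughly like this toy `unittest` sketch. `ToyClient`, `ToyContext`, and `ToyParityTests` are hypothetical classes made up for illustration; only the `release_session()` then `close()` ordering in a `finally` block reflects the fix described above.

```python
import unittest
from typing import List


class ToyClient:
    """Hypothetical client that tracks which instances are still open."""

    open_clients: List["ToyClient"] = []

    def __init__(self) -> None:
        self.session_released = False
        ToyClient.open_clients.append(self)

    def release_session(self) -> None:
        self.session_released = True

    def close(self) -> None:
        ToyClient.open_clients.remove(self)


class ToyContext:
    """Hypothetical context that owns one client."""

    def __init__(self) -> None:
        self.client = ToyClient()

    def newSession(self) -> "ToyContext":
        # Each new session gets its own brand-new client.
        return ToyContext()


class ToyParityTests(unittest.TestCase):
    def setUp(self) -> None:
        self.ctx = ToyContext()
        self.addCleanup(self.ctx.client.close)

    def test_newSession_returns_distinct_instance(self) -> None:
        child = self.ctx.newSession()
        try:
            self.assertIsNot(child, self.ctx)
        finally:
            # Release the session first, then close the client, even on failure.
            child.client.release_session()
            child.client.close()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToyParityTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

After the run, no toy client remains in `ToyClient.open_clients`, which is the property the real override is meant to guarantee for the Connect client.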

3. Reviewed and rejected: "newSession() drops client settings that clone() preserves"

A reviewer tool suggested SparkConnectClient.newSession() should pass through session_hooks / retry settings like the clone path. This is not accurate: clone() constructs its new client with exactly the same three arguments (connection = deepcopy of the builder, user_id, use_reattachable_execute). The builder deepcopy already preserves the endpoint, channel options, and metadata, and session_hooks cannot be propagated since hook factories are consumed in SparkSession.__init__ and bound to that specific session. newSession() intentionally mirrors the clone() constructor pattern; any improvement there is a shared, pre-existing concern and out of scope here.

@dbtsai

dbtsai commented Jun 9, 2026

Member Author

The fixes described above are now pushed:

  • 9b96499 sets release_session_on_close = True in both newSession() and cloneSession() (the latter was a pre-existing copy of the same bug), with regression assertions in test_connect_context.py and test_connect_clone_session.py.
  • 8caf289 overrides test_newSession_returns_distinct_instance in SQLContextParityTests to release the new server-side session and close its client, fixing the Connect client leak in the parity run.

No code change for item 3 - the "newSession() drops client settings" suggestion was rejected for the reasons above.

@dbtsai dbtsai force-pushed the connect-sqlcontext-wrapper branch from 8caf289 to d25ff09 Compare June 10, 2026 18:05
@dbtsai

dbtsai commented Jun 10, 2026

Member Author

cc @hvanhovell

@dbtsai dbtsai force-pushed the connect-sqlcontext-wrapper branch 4 times, most recently from 815543d to c940d2d Compare June 12, 2026 17:44
@dbtsai dbtsai force-pushed the connect-sqlcontext-wrapper branch from c940d2d to 923443c Compare June 12, 2026 17:56
@dbtsai

dbtsai commented Jun 12, 2026

Member Author

cc @viirya @cloud-fan @Yicong-Huang @hvanhovell @HyukjinKwon to review again, since there have been many changes since the approvals. Thanks.

@Yicong-Huang

Contributor

I think there are two minor points:

  1. SparkConnectClient.newSession() reaches into the private _params dict. newSession() drops the session id with new_connection._params.pop(ChannelBuilder.PARAM_SESSION_ID, None), whereas the sibling clone() uses the public new_connection.set(...). Maybe we can use the public API for newSession() as well.

  2. In SQLContext.getOrCreate(), the is_remote() branch obtains the session through the classic SparkSession._getActiveSessionOrCreate() and then casts it to a Connect SparkSession before wrapping it. I am not sure whether this is the intended behavior; @dbtsai, please confirm.

I think both can be fixed in follow-ups. If there are no objections, I will merge this PR tonight.

Yicong-Huang pushed a commit that referenced this pull request Jun 16, 2026
### What changes were proposed in this pull request?

This PR adds a Spark Connect-compatible `SQLContext` (and `HiveContext`) implementation in
`pyspark.sql.connect.context` so that legacy code using `SQLContext` continues to work
transparently when running against a Connect server.

Key changes:

1. **New `pyspark.sql.connect.context.SQLContext`** — wraps a Connect `SparkSession` directly
   (no `SparkContext` required). Delegates all supported operations to the session:
   `sql`, `table`, `range`, `createDataFrame`, `conf`, `udf`, `udtf`, `read`,
   `readStream`, `streams`, and catalog operations (`cacheTable`, `uncacheTable`,
   `clearCache`, `tables`, `tableNames`, `registerDataFrameAsTable`, `dropTempTable`,
   `createExternalTable`).
   - `newSession()` delegates to the new Connect `SparkSession.newSession()`, which returns a fresh, isolated session with no state copied from the parent.
   - JVM-only APIs (`registerJavaFunction`, `HiveContext.__init__`) raise `PySparkNotImplementedError`.

2. **Connect dispatch in classic `SQLContext.getOrCreate()`** — when running in Spark Connect
   mode (`is_remote()`, which covers both a remote-only pyspark-client install and a full
   install talking to a Connect server via `SPARK_REMOTE`), the classic `getOrCreate()` now
   automatically returns a Connect `SQLContext` wrapping the active Connect session, so callers
   do not need to import from `pyspark.sql.connect` directly. Note that only `getOrCreate()`
   is wired to Connect; the `SQLContext(...)` constructor remains classic-only and still
   requires a `SparkContext`.

3. **Shared test mixin** — `SQLContextTestsMixin` extracted to `test_sql_context.py` so the same
   suite runs against both the classic and Connect implementations via `SQLContextParityTests`.

4. **API reference docs** — new `python/docs/source/reference/pyspark.sql/legacy.rst` page
   listing `SQLContext` and `HiveContext` in the public API reference.

5. **CI registration** — `test_connect_context` registered in `modules.py`.

### Why are the changes needed?

`SQLContext` has been deprecated since Spark 2.0 in favor of `SparkSession`, but many existing
PySpark applications still instantiate it directly. Without this wrapper, those applications
fail entirely on Spark Connect because the classic `SQLContext.__init__` requires a live
`SparkContext` (JVM), which is not available in Connect mode. This patch closes that
compatibility gap.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, calling `SQLContext.getOrCreate()` in a Spark Connect environment raised an
error because the classic implementation requires a `SparkContext`. After this PR, in Spark
Connect mode `SQLContext.getOrCreate()` succeeds and returns a fully functional (but still
deprecated) `SQLContext` backed by the active Connect session.

The `SQLContext(...)` constructor is unchanged and remains classic-only — it still requires a
`SparkContext`. Code that needs Connect compatibility should use `SQLContext.getOrCreate()`
(or, preferably, migrate to `SparkSession`).

JVM-specific methods (`registerJavaFunction`, `HiveContext`) now raise a clear
`PySparkNotImplementedError` instead of a cryptic JVM/attribute error.

### How was this patch tested?

- Added `SQLContextTestsMixin` in `python/pyspark/sql/tests/test_sql_context.py` covering:
  `setConf`/`getConf`, `createDataFrame`, `sql`, `table`, `tables`/`tableNames`,
  `cacheTable`/`uncacheTable`/`clearCache`, `registerDataFrameAsTable`/`dropTempTable`,
  `createExternalTable`, `range`, `read`, `readStream`, `streams`, `udf`/`udtf`,
  `registerFunction`, `newSession`.
- `SQLContextConnectTests` in `python/pyspark/sql/tests/connect/test_connect_context.py`
  adds Connect-specific cases: deprecation warning on `__init__`, `SQLContext.getOrCreate()`
  via the public `from pyspark.sql import SQLContext` path returns a Connect-backed context
  and emits a deprecation warning in Connect mode (patching `pyspark.sql.utils.is_remote`),
  `HiveContext.getOrCreate()` in Connect mode raises `PySparkNotImplementedError`,
  `registerJavaFunction` raises `PySparkNotImplementedError`, and `HiveContext.__init__`
  raises `PySparkNotImplementedError`.
- Registered in `dev/sparktestsupport/modules.py` so the Connect test is picked up by CI.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (claude-sonnet-4-6), via Anthropic Claude Code

Closes #55574 from dbtsai/connect-sqlcontext-wrapper.

Authored-by: DB Tsai <dbtsai@apache.org>
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
(cherry picked from commit 7071ac1)
Signed-off-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
@Yicong-Huang

Contributor

Thanks all, merged to master/4.x. Let's handle minor issues in followups.
