
feat: experimental Spark JSON support via codegen dispatcher #4305

Draft
andygrove wants to merge 4 commits into apache:main from andygrove:worktree-json-jvm-udf

Conversation

@andygrove
Member

@andygrove andygrove commented May 12, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

The native Rust JSON expressions in Comet have known compatibility gaps and feature restrictions: from_json only supports PERMISSIVE mode with simple schemas, to_json does not handle map/array at the top level, and get_json_object differs from Spark on certain path expressions. Routing through Spark's own expression classes guarantees byte-exact compatibility, at the cost of a JNI roundtrip per batch.

This PR adds a spark.comet.exec.json.engine selector that routes the JSON expressions through the Arrow-direct codegen dispatcher introduced in #4417. The dispatcher Janino-compiles Spark's own doGenCode (or eval(row) for CodegenFallback expressions) so the JSON family inherits Spark-identical semantics with no per-expression glue code. The existing native path remains the default.

Configs

  • spark.comet.exec.json.engine in {rust, java}, default rust
    • rust: native DataFusion implementation. Fast, but has known compatibility gaps.
    • java (experimental): routes through the codegen dispatcher so Spark's own doGenCode runs inside the Comet pipeline. Requires spark.comet.exec.scalaUDF.codegen.enabled=true; otherwise the operator falls back to Spark.

The java engine reuses the existing spark.comet.exec.scalaUDF.codegen.enabled gate (default false) that was introduced for the codegen dispatcher; this PR adds no second enable flag.
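The interaction between the two configs can be sketched as a small decision function. This is a minimal, self-contained model and not the PR's `JsonRoute` code: the `Route` enum, the `decide` method, and the choice to reject unknown engine values with an exception are all assumptions made for illustration. Only the two config keys and their defaults come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

public class JsonRouteSketch {
    // Config keys and defaults as named in this PR description.
    static final String ENGINE_KEY = "spark.comet.exec.json.engine";
    static final String CODEGEN_KEY = "spark.comet.exec.scalaUDF.codegen.enabled";

    /** Hypothetical names for the three outcomes the description mentions. */
    public enum Route { NATIVE, CODEGEN_DISPATCH, SPARK_FALLBACK }

    /** Picks a route from a plain key/value view of the session config. */
    public static Route decide(Map<String, String> conf) {
        String engine = conf.getOrDefault(ENGINE_KEY, "rust");           // default: rust
        boolean codegen =
            Boolean.parseBoolean(conf.getOrDefault(CODEGEN_KEY, "false")); // default: false
        if (engine.equals("rust")) {
            return Route.NATIVE;                       // native DataFusion path
        }
        if (engine.equals("java")) {
            // java needs the codegen gate; without it the operator falls back to Spark.
            return codegen ? Route.CODEGEN_DISPATCH : Route.SPARK_FALLBACK;
        }
        // Assumption: how the real config validates unknown values is not stated in the PR.
        throw new IllegalArgumentException("unknown JSON engine: " + engine);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(decide(conf)); // NATIVE
        conf.put(ENGINE_KEY, "java");
        System.out.println(decide(conf)); // SPARK_FALLBACK
        conf.put(CODEGEN_KEY, "true");
        System.out.println(decide(conf)); // CODEGEN_DISPATCH
    }
}
```

In this model, setting only `engine=java` does not fail. It sends the operator back to Spark, which matches the fallback behavior described for the java engine.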

What changes are included in this PR?

  • Add a JsonRoute helper in structs.scala that each JSON serde delegates to. It picks between the native path and the codegen dispatcher based on engine and scalaUDF.codegen.enabled.
  • Refactor CometGetJsonObject (strings.scala), CometStructsToJson, and CometJsonToStructs (structs.scala) to use the route helper. The JVM arm delegates to CometScalaUDF.emitJvmCodegenDispatch.
  • Add the spark.comet.exec.json.engine config in CometConf.
  • Two enabling changes in the codegen dispatcher so the JSON expressions can ride the same path:
    • CometBatchKernelCodegen.canHandle now accepts CodegenFallback expressions. CodegenFallback.doGenCode emits references[N].eval(row), the same mechanism the existing HigherOrderFunction carve-out relies on; lifting the rejection lets non-HOF CodegenFallback expressions (here JsonToStructs and StructsToJson) ride the same path.
    • CometScalaUDF.emitJvmCodegenDispatch now unwraps RuntimeReplaceable before binding. Spark 4's StructsToJson is RuntimeReplaceable and its doGenCode always throws; calling .replacement gives the Invoke(StructsToJsonEvaluator, ...) form that does codegen.
  • Document the model in docs/source/user-guide/latest/compatibility/json.md.
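The two dispatcher changes above can be illustrated with a toy model. The sketch below does not use Spark's classes: `Expr`, `CodegenFallbackLike`, `RuntimeReplaceableLike`, `genCode`, and `dispatch` are hypothetical stand-ins, and the generated-code strings are placeholders. It only shows why unwrapping has to happen before codegen, and that a fallback expression can be admitted because its generated code is just an `eval(row)` call. Unwrapping in a loop (rather than once) is an assumption.

```java
public class DispatchUnwrapSketch {
    /** Toy stand-in for a Catalyst expression. */
    public interface Expr { String name(); }

    /** Models an expression whose codegen is the eval(row) fallback. */
    public interface CodegenFallbackLike extends Expr {}

    /** Models an expression that must be swapped for its replacement before codegen. */
    public interface RuntimeReplaceableLike extends Expr { Expr replacement(); }

    public static final class Plain implements Expr {
        private final String name;
        public Plain(String name) { this.name = name; }
        public String name() { return name; }
    }

    public static final class Fallback implements CodegenFallbackLike {
        private final String name;
        public Fallback(String name) { this.name = name; }
        public String name() { return name; }
    }

    public static final class Replaceable implements RuntimeReplaceableLike {
        private final String name;
        private final Expr replacement;
        public Replaceable(String name, Expr replacement) {
            this.name = name;
            this.replacement = replacement;
        }
        public String name() { return name; }
        public Expr replacement() { return replacement; }
    }

    /** Follows replacement links until a non-replaceable expression is reached. */
    public static Expr unwrap(Expr e) {
        Expr current = e;
        while (current instanceof RuntimeReplaceableLike) {
            current = ((RuntimeReplaceableLike) current).replacement();
        }
        return current;
    }

    /** Placeholder for code generation; returns a description of the emitted shape. */
    public static String genCode(Expr e) {
        if (e instanceof RuntimeReplaceableLike) {
            // Mirrors the failure described above for the un-replaced form.
            throw new IllegalStateException("Cannot generate code for expression: " + e.name());
        }
        if (e instanceof CodegenFallbackLike) {
            return "references[0].eval(row)"; // fallback shape: call eval on a captured reference
        }
        return e.name() + ".doGenCode(ctx, ev)";
    }

    /** Unwrap first, then generate code, as the dispatcher change describes. */
    public static String dispatch(Expr e) {
        return genCode(unwrap(e));
    }
}
```

Calling `genCode` directly on a `Replaceable` throws, while `dispatch` on the same expression succeeds because it operates on the replacement.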

json_array_length and json_object_keys are intentionally out of scope. Both are RuntimeReplaceable in Spark 4.x and Catalyst's ReplaceExpressions rewrites them to StaticInvoke before Comet sees the plan, so classOf[LengthOfJsonArray] / classOf[JsonObjectKeys] registrations never match. Adding support requires recognizing the rewritten StaticInvoke form in Comet's serde dispatch and is left to a follow-up.
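To make the mismatch concrete, here is a toy model of a class-keyed serde registry. None of these types are Spark's or Comet's: the `Expr` hierarchy, the `SERDES` map, the `replaceExpressions` method, and the `"lengthOfJsonArray"` function name are hypothetical stand-ins. The point is only that a lookup keyed on the original expression class misses once an optimizer rule has rewritten the node into a different class.

```java
import java.util.HashMap;
import java.util.Map;

public class SerdeLookupSketch {
    /** Toy expression hierarchy; not Catalyst's real classes. */
    public static class Expr {}
    public static class LengthOfJsonArray extends Expr {}
    public static class StaticInvoke extends Expr {
        public final String functionName;
        public StaticInvoke(String functionName) { this.functionName = functionName; }
    }

    /** Registry keyed by expression class, standing in for classOf[...] registrations. */
    static final Map<Class<? extends Expr>, String> SERDES = new HashMap<>();
    static {
        SERDES.put(LengthOfJsonArray.class, "serde for LengthOfJsonArray (hypothetical)");
    }

    /** Models the optimizer rewriting the expression before the registry is consulted. */
    public static Expr replaceExpressions(Expr e) {
        if (e instanceof LengthOfJsonArray) {
            return new StaticInvoke("lengthOfJsonArray"); // hypothetical function name
        }
        return e;
    }

    /** Exact-class lookup; returns null when nothing is registered for the node's class. */
    public static String lookup(Expr e) {
        return SERDES.get(e.getClass());
    }

    /** Sketch of the follow-up idea: also recognize the rewritten StaticInvoke form. */
    public static String lookupWithRewrittenForm(Expr e) {
        String direct = lookup(e);
        if (direct != null) {
            return direct;
        }
        if (e instanceof StaticInvoke
                && ((StaticInvoke) e).functionName.equals("lengthOfJsonArray")) {
            return SERDES.get(LengthOfJsonArray.class);
        }
        return null;
    }
}
```

In this model, `lookup(replaceExpressions(new LengthOfJsonArray()))` returns `null` because the node's class is now `StaticInvoke`, while `lookupWithRewrittenForm` finds the serde by inspecting the rewritten node.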

How are these changes tested?

  • CometJsonJvmSuite: 3 tests covering get_json_object, from_json, and to_json(from_json(...)) round-trip with engine=java and the codegen flag enabled.
  • CometJsonExpressionSuite: 8 native-path tests continue to pass on default engine=rust.
  • CometCodegenSuite, CometCodegenSourceSuite, CometStringExpressionSuite, and CometSqlFileTestSuite continue to pass. CometCodegenSourceSuite has an updated test asserting the new CodegenFallback admission.

This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.

Add `spark.comet.exec.json.engine` (default `rust`, experimental `java`)
that routes the JSON expressions in scope through the JVM UDF framework
introduced in apache#4232, delegating to Spark's own expression classes for
byte-exact compatibility at the cost of JNI roundtrips per batch.

Expressions in scope when `engine=java`:

- `get_json_object` -> `GetJsonObjectUDF`
- `from_json` -> `FromJsonUDF`
- `to_json` -> `ToJsonUDF`

A fresh Spark expression is built per `evaluate` call. Spark's JSON
evaluators (`GetJsonObjectEvaluator`, `StructsToJsonEvaluator`,
`JsonToStructsEvaluator`) hold mutable per-row state, and the JVM UDF
framework shares one UDF instance across native worker threads, so a
cached cross-thread expression races on its evaluator state.
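A minimal sketch of the fresh-per-call pattern, using toy classes rather than Spark's evaluators: `ToyEvaluator`, `ToyUdf`, and `runConcurrently` are hypothetical names. The evaluator keeps mutable scratch state that is not thread-safe; the shared UDF instance avoids the race by building a new evaluator inside each `evaluate` call instead of caching one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

public class FreshEvaluatorSketch {
    /** Toy evaluator with mutable per-row scratch state; not safe to share across threads. */
    public static final class ToyEvaluator {
        private final StringBuilder scratch = new StringBuilder();
        public String eval(String input) {
            scratch.setLength(0);
            scratch.append('<').append(input).append('>');
            return scratch.toString();
        }
    }

    /** One UDF instance shared by all threads; the evaluator is rebuilt per call. */
    public static final class ToyUdf {
        private final Supplier<ToyEvaluator> factory;
        public ToyUdf(Supplier<ToyEvaluator> factory) { this.factory = factory; }
        public List<String> evaluate(List<String> batch) {
            ToyEvaluator evaluator = factory.get(); // fresh state, local to this call
            List<String> out = new ArrayList<>(batch.size());
            for (String row : batch) {
                out.add(evaluator.eval(row));
            }
            return out;
        }
    }

    /** Runs one batch per thread against a single shared UDF and checks every result. */
    public static boolean runConcurrently(int threads) {
        final ToyUdf udf = new ToyUdf(ToyEvaluator::new);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final String tag = "t" + t;
                Callable<List<String>> task =
                    () -> udf.evaluate(Arrays.asList(tag + "-a", tag + "-b"));
                futures.add(pool.submit(task));
            }
            for (int t = 0; t < threads; t++) {
                List<String> want = Arrays.asList("<t" + t + "-a>", "<t" + t + "-b>");
                if (!futures.get(t).get().equals(want)) {
                    return false;
                }
            }
            return true;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The trade-off is an allocation per batch in exchange for not needing locks or thread-local caches around the evaluator.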

`from_json` / `to_json` use a serde-side `CometLambdaRegistry` to pass
the configured Spark expression (schema, options, timezone) to the UDF.
The serde rebinds the child to `BoundReference(0)` so the UDF can call
`eval(row)` against a single-column wrapper row.

`json_array_length` and `json_object_keys` are out of scope: both are
`RuntimeReplaceable` in Spark 4.x and Catalyst's `ReplaceExpressions`
rule rewrites them to `StaticInvoke` before Comet sees the plan, so
`classOf[LengthOfJsonArray]` / `classOf[JsonObjectKeys]` serde
registrations never match. Adding support requires recognizing the
rewritten `StaticInvoke` form in Comet's serde dispatch.

This PR was scaffolded with the project's brainstorming, writing-plans,
and subagent-driven-development skills.
@andygrove andygrove marked this pull request as draft May 12, 2026 17:23
@andygrove andygrove changed the title feat: add JVM UDF engine for Spark JSON expressions feat: add experimental support fro JSON expressions using Comet JVM UDF Framework May 12, 2026
@andygrove andygrove changed the title feat: add experimental support fro JSON expressions using Comet JVM UDF Framework feat: add experimental support for JSON expressions using Comet JVM UDF Framework May 12, 2026
andygrove added 2 commits May 26, 2026 16:10
Resolve conflicts:

- The common module rename (PR apache#4325) moved the UDF files added by this
  branch under common/.../udf/ to spark/.../udf/. Git's location-conflict
  resolver guessed the wrong destination (.../shims/); fix manually by
  placing FromJsonUDF / GetJsonObjectUDF / ToJsonUDF under
  spark/src/main/scala/org/apache/comet/udf/. (Follow-up commit retires
  them in favor of the codegen dispatcher anyway.)

- CometConf.scala: keep both COMET_JSON_ENGINE (this PR) and
  COMET_SCALA_UDF_CODEGEN_ENABLED (main).

- serde/structs.scala: combine the JSON engine selector with main's
  improved native ignoreNullFields / options handling. engine=java keeps
  routing through convertViaJvmUdf; engine=rust uses main's tightened
  native path.
…f hand-written UDFs

Replace the three hand-written `GetJsonObjectUDF` / `FromJsonUDF` /
`ToJsonUDF` JVM UDF implementations and the `CometLambdaRegistry`
indirection with the Arrow-direct codegen dispatcher introduced in
PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher
Janino-compiles Spark's own `doGenCode` (or `eval(row)` for
CodegenFallback expressions) so the JSON family inherits Spark-identical
semantics with no per-expression glue.

Changes:

- Delete the three hand-written UDF files under
  `spark/src/main/scala/org/apache/comet/udf/` and their unit-test
  suites. The codegen dispatcher's per-task `kernelCache` provides the
  same per-thread isolation that `CometLambdaRegistry` was working
  around.
- Rewrite the JSON serdes (`CometGetJsonObject` in `strings.scala`,
  `CometStructsToJson` and `CometJsonToStructs` in `structs.scala`) to
  go through a new `JsonRoute` helper. `engine=rust` keeps the native
  path; `engine=java` delegates to
  `CometScalaUDF.emitJvmCodegenDispatch` when
  `spark.comet.exec.scalaUDF.codegen.enabled=true`.
- Generalize the codegen dispatcher to accept `CodegenFallback`
  expressions. `CodegenFallback.doGenCode` emits
  `references[N].eval(row)`, the same shape the `HigherOrderFunction`
  carve-out already relied on; lifting the rejection lets `JsonToStructs`
  and `StructsToJson` (which are `CodegenFallback` in Spark 4) ride the
  same path.
- Unwrap `RuntimeReplaceable` expressions inside
  `CometScalaUDF.emitJvmCodegenDispatch` before binding. Spark 4's
  `StructsToJson` is `RuntimeReplaceable` and its `doGenCode` throws
  "Cannot generate code for expression"; calling `.replacement` gives
  the `Invoke(StructsToJsonEvaluator, ...)` form that does codegen.
- Update the JSON compatibility doc and the `CometJsonJvmSuite` config
  to reference the codegen flag.

Test plan:
- `CometJsonJvmSuite`: 3/3 pass (get_json_object, from_json,
  to_json round-trip via the codegen dispatcher).
- `CometJsonExpressionSuite`: 8/8 pass on the unchanged native path.
- `CometStringExpressionSuite`: 33/33, `CometCodegenSuite`: 60/60,
  `CometCodegenSourceSuite`: 50/50, `CometSqlFileTestSuite`: 284/284.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.
@andygrove andygrove changed the title feat: add experimental support for JSON expressions using Comet JVM UDF Framework feat: experimental Spark JSON support via codegen dispatcher May 26, 2026
- Add CometJsonJvmSuite to pr_build_linux.yml so check-missing-suites
  passes.
- Remove unused HigherOrderFunction, LambdaFunction, NamedLambdaVariable
  imports from CometBatchKernelCodegen.scala (referenced only in
  comments).
- Remove unused serializeDataType import from strings.scala.