feat: experimental Spark JSON support via codegen dispatcher by andygrove · Pull Request #4305 · apache/datafusion-comet

andygrove · 2026-05-12T17:23:04Z

Which issue does this PR close?

Closes #.

Rationale for this change

The native Rust JSON expressions in Comet have known compatibility gaps and feature restrictions: from_json only supports PERMISSIVE mode with simple schemas, to_json does not handle map/array at the top level, and get_json_object differs from Spark on certain path expressions. Routing through Spark's own expression classes guarantees byte-exact compatibility, at the cost of a JNI roundtrip per batch.

This PR adds a spark.comet.exec.json.engine selector that routes the JSON expressions through the Arrow-direct codegen dispatcher introduced in #4417. The dispatcher Janino-compiles Spark's own doGenCode (or eval(row) for CodegenFallback expressions) so the JSON family inherits Spark-identical semantics with no per-expression glue code. The existing native path remains the default.

Configs

spark.comet.exec.json.engine in {rust, java}, default rust
- rust: native DataFusion implementation. Fast, but has known compatibility gaps.
- java (experimental): routes through the codegen dispatcher so Spark's own doGenCode runs inside the Comet pipeline. Requires spark.comet.exec.scalaUDF.codegen.enabled=true; otherwise the operator falls back to Spark.

Reuses the existing spark.comet.exec.scalaUDF.codegen.enabled (default false) gate introduced for the codegen dispatcher.
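
The selection rule described in this section can be sketched as a small decision function. This is an illustrative Python model, not Comet's Scala code: `Route`, `select_json_route`, and the dict-based config are hypothetical names, and the exact-string comparison of config values is an assumption. Only the two config keys, their defaults, and the three outcomes come from the description above.

```python
from enum import Enum

# Config keys and defaults as stated in the PR description.
ENGINE_KEY = "spark.comet.exec.json.engine"
CODEGEN_KEY = "spark.comet.exec.scalaUDF.codegen.enabled"


class Route(Enum):
    """Hypothetical labels for the three possible outcomes."""
    NATIVE = "native Rust/DataFusion expression"
    CODEGEN_DISPATCH = "Spark expression via the codegen dispatcher"
    SPARK_FALLBACK = "operator falls back to Spark"


def select_json_route(conf: dict) -> Route:
    """Pick a route from string-valued config entries (assumed representation)."""
    engine = conf.get(ENGINE_KEY, "rust")                        # default: rust
    codegen_enabled = conf.get(CODEGEN_KEY, "false") == "true"   # default: false
    if engine == "rust":
        # The native path does not depend on the codegen flag.
        return Route.NATIVE
    if engine == "java":
        # The java engine needs the codegen dispatcher gate to be on.
        return Route.CODEGEN_DISPATCH if codegen_enabled else Route.SPARK_FALLBACK
    raise ValueError(f"unsupported value for {ENGINE_KEY}: {engine!r}")


if __name__ == "__main__":
    for conf in ({}, {ENGINE_KEY: "java"}, {ENGINE_KEY: "java", CODEGEN_KEY: "true"}):
        print(conf, "->", select_json_route(conf).name)
```

Setting only engine=java without the codegen flag therefore does not reach the dispatcher; in this model it yields the Spark fallback outcome.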

What changes are included in this PR?

- Add a JsonRoute helper in structs.scala that each JSON serde delegates to. It picks between the native path and the codegen dispatcher based on engine and scalaUDF.codegen.enabled.
- Refactor CometGetJsonObject (strings.scala), CometStructsToJson, and CometJsonToStructs (structs.scala) to use the route helper. The JVM arm delegates to CometScalaUDF.emitJvmCodegenDispatch.
- Add the spark.comet.exec.json.engine config in CometConf.
- Two enabling changes in the codegen dispatcher so the JSON expressions can ride the same path:
  - CometBatchKernelCodegen.canHandle now accepts CodegenFallback expressions. CodegenFallback.doGenCode emits references[N].eval(row), the same mechanism the existing HigherOrderFunction carve-out relies on; lifting the rejection lets non-HOF CodegenFallback expressions (here JsonToStructs and StructsToJson) ride the same path.
  - CometScalaUDF.emitJvmCodegenDispatch now unwraps RuntimeReplaceable before binding. Spark 4's StructsToJson is RuntimeReplaceable and its doGenCode always throws; calling .replacement gives the Invoke(StructsToJsonEvaluator, ...) form that does codegen.
- Document the model in docs/source/user-guide/latest/compatibility/json.md.
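
The two enabling changes can be illustrated with a toy model. This is a sketch in Python under loud assumptions, not Spark's or Comet's API: every class and function below (`Expression`, `ReverseFallback`, `InvokeUpper`, `unwrap`, `compile_kernel`) is hypothetical, Python's built-in `eval` stands in for Janino compilation, and unwrapping in a loop is an assumption (the description only says `.replacement` is called). What it borrows from the description is the shape: a fallback expression emits `references[N].eval(row)`, and a replaceable expression refuses to generate code until it is unwrapped.

```python
class Expression:
    """Toy stand-in for a Catalyst expression (hypothetical)."""

    def gen_code(self, references: list) -> str:
        raise NotImplementedError

    def eval(self, row):
        raise NotImplementedError


class CodegenFallback(Expression):
    """Has no generated code of its own: it registers itself in `references`
    and emits a call back into its interpreted eval(row)."""

    def gen_code(self, references: list) -> str:
        references.append(self)
        return f"references[{len(references) - 1}].eval(row)"


class ReverseFallback(CodegenFallback):
    """Hypothetical fallback expression: reverses the first column."""

    def eval(self, row):
        return row[0][::-1]


class InvokeUpper(Expression):
    """Hypothetical expression that generates code directly."""

    def gen_code(self, references: list) -> str:
        return "row[0].upper()"


class RuntimeReplaceable(Expression):
    """Wrapper whose own code generation always fails."""

    def __init__(self, replacement: Expression):
        self.replacement = replacement

    def gen_code(self, references: list) -> str:
        raise RuntimeError("Cannot generate code for expression")


def unwrap(expr: Expression) -> Expression:
    # Assumption: keep unwrapping until a non-replaceable expression is reached.
    while isinstance(expr, RuntimeReplaceable):
        expr = expr.replacement
    return expr


def compile_kernel(expr: Expression):
    """Unwrap first, then generate source and 'compile' it into a callable."""
    references: list = []
    source = unwrap(expr).gen_code(references)
    # Built-in eval stands in for compiling the generated source.
    return lambda row: eval(source, {"references": references, "row": row})


if __name__ == "__main__":
    print(compile_kernel(ReverseFallback())(["abc"]))
    print(compile_kernel(RuntimeReplaceable(InvokeUpper()))(["abc"]))
```

In this model, skipping `unwrap` makes the second call fail with the same "Cannot generate code for expression" error the PR description attributes to `StructsToJson`.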

json_array_length and json_object_keys are intentionally out of scope. Both are RuntimeReplaceable in Spark 4.x and Catalyst's ReplaceExpressions rewrites them to StaticInvoke before Comet sees the plan, so classOf[LengthOfJsonArray] / classOf[JsonObjectKeys] registrations never match. Adding support requires recognizing the rewritten StaticInvoke form in Comet's serde dispatch and is left to a follow-up.
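
Why the class-based registrations never match can be shown with a toy model. This is a Python sketch under assumptions, not Comet's serde code: the registry dict, `replace_expressions`, `lookup_serde`, and the `method` string are hypothetical placeholders, and the final lookup is only one conceivable shape for the follow-up, not a description of it.

```python
class LengthOfJsonArray:
    """Toy stand-in for the original expression node."""

    def __init__(self, child):
        self.child = child


class StaticInvoke:
    """Toy stand-in for the rewritten form produced by the optimizer."""

    def __init__(self, method: str, args: list):
        self.method = method
        self.args = args


def replace_expressions(expr):
    """Model of the rewrite that runs before Comet sees the plan."""
    if isinstance(expr, LengthOfJsonArray):
        # "length_of_json_array_impl" is a placeholder, not Spark's real method name.
        return StaticInvoke("length_of_json_array_impl", [expr.child])
    return expr


# Serde registry keyed by expression class (hypothetical contents).
SERDE_BY_CLASS = {LengthOfJsonArray: "serde for LengthOfJsonArray"}

# One possible follow-up shape: a second registry keyed by the invoked method.
SERDE_BY_STATIC_METHOD = {"length_of_json_array_impl": "serde for LengthOfJsonArray"}


def lookup_serde(expr):
    """Class-keyed lookup, as in the registrations the PR says never match."""
    return SERDE_BY_CLASS.get(type(expr))


def lookup_serde_with_static_invoke(expr):
    """Also recognizes the rewritten StaticInvoke form (assumed design)."""
    if isinstance(expr, StaticInvoke):
        return SERDE_BY_STATIC_METHOD.get(expr.method)
    return lookup_serde(expr)


if __name__ == "__main__":
    rewritten = replace_expressions(LengthOfJsonArray("json_col"))
    print(lookup_serde(rewritten))
    print(lookup_serde_with_static_invoke(rewritten))
```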

How are these changes tested?

- CometJsonJvmSuite: 3 tests covering get_json_object, from_json, and to_json(from_json(...)) round-trip with engine=java and the codegen flag enabled.
- CometJsonExpressionSuite: 8 native-path tests continue to pass on default engine=rust.
- CometCodegenSuite, CometCodegenSourceSuite, CometStringExpressionSuite, and CometSqlFileTestSuite continue to pass. CometCodegenSourceSuite has an updated test asserting the new CodegenFallback admission.

This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.

Add `spark.comet.exec.json.engine` (default `rust`, experimental `java`) that routes the JSON expressions in scope through the JVM UDF framework introduced in apache#4232, delegating to Spark's own expression classes for byte-exact compatibility at the cost of JNI roundtrips per batch.

Expressions in scope when `engine=java`:

- `get_json_object` -> `GetJsonObjectUDF`
- `from_json` -> `FromJsonUDF`
- `to_json` -> `ToJsonUDF`

A fresh Spark expression is built per `evaluate` call. Spark's JSON evaluators (`GetJsonObjectEvaluator`, `StructsToJsonEvaluator`, `JsonToStructsEvaluator`) hold mutable per-row state, and the JVM UDF framework shares one UDF instance across native worker threads, so a cached cross-thread expression races on its evaluator state.

`from_json` / `to_json` use a serde-side `CometLambdaRegistry` to pass the configured Spark expression (schema, options, timezone) to the UDF. The serde rebinds the child to `BoundReference(0)` so the UDF can call `eval(row)` against a single-column wrapper row.

`json_array_length` and `json_object_keys` are out of scope: both are `RuntimeReplaceable` in Spark 4.x and Catalyst's `ReplaceExpressions` rule rewrites them to `StaticInvoke` before Comet sees the plan, so `classOf[LengthOfJsonArray]` / `classOf[JsonObjectKeys]` serde registrations never match. Adding support requires recognizing the rewritten `StaticInvoke` form in Comet's serde dispatch.

This PR was scaffolded with the project's brainstorming, writing-plans, and subagent-driven-development skills.

Resolve conflicts:

- The common module rename (PR apache#4325) moved the UDF files added by this branch under common/.../udf/ to spark/.../udf/. Git's location-conflict resolver guessed the wrong destination (.../shims/); fix manually by placing FromJsonUDF / GetJsonObjectUDF / ToJsonUDF under spark/src/main/scala/org/apache/comet/udf/. (Follow-up commit retires them in favor of the codegen dispatcher anyway.)
- CometConf.scala: keep both COMET_JSON_ENGINE (this PR) and COMET_SCALA_UDF_CODEGEN_ENABLED (main).
- serde/structs.scala: combine the JSON engine selector with main's improved native ignoreNullFields / options handling. engine=java keeps routing through convertViaJvmUdf; engine=rust uses main's tightened native path.

…f hand-written UDFs

Replace the three hand-written `GetJsonObjectUDF` / `FromJsonUDF` / `ToJsonUDF` JVM UDF implementations and the `CometLambdaRegistry` indirection with the Arrow-direct codegen dispatcher introduced in PR apache#4417 (`CometScalaUDF.emitJvmCodegenDispatch`). The dispatcher Janino-compiles Spark's own `doGenCode` (or `eval(row)` for CodegenFallback expressions) so the JSON family inherits Spark-identical semantics with no per-expression glue.

Changes:

- Delete the three hand-written UDF files under `spark/src/main/scala/org/apache/comet/udf/` and their unit-test suites. The codegen dispatcher's per-task `kernelCache` provides the same per-thread isolation that `CometLambdaRegistry` was working around.
- Rewrite the JSON serdes (`CometGetJsonObject` in `strings.scala`, `CometStructsToJson` and `CometJsonToStructs` in `structs.scala`) to go through a new `JsonRoute` helper. `engine=rust` keeps the native path; `engine=java` delegates to `CometScalaUDF.emitJvmCodegenDispatch` when `spark.comet.exec.scalaUDF.codegen.enabled=true`.
- Generalize the codegen dispatcher to accept `CodegenFallback` expressions. `CodegenFallback.doGenCode` emits `references[N].eval(row)`, the same shape the `HigherOrderFunction` carve-out already relied on; lifting the rejection lets `JsonToStructs` and `StructsToJson` (which are `CodegenFallback` in Spark 4) ride the same path.
- Unwrap `RuntimeReplaceable` expressions inside `CometScalaUDF.emitJvmCodegenDispatch` before binding. Spark 4's `StructsToJson` is `RuntimeReplaceable` and its `doGenCode` throws "Cannot generate code for expression"; calling `.replacement` gives the `Invoke(StructsToJsonEvaluator, ...)` form that does codegen.
- Update the JSON compatibility doc and the `CometJsonJvmSuite` config to reference the codegen flag.

Test plan:

- `CometJsonJvmSuite`: 3/3 pass (get_json_object, from_json, to_json round-trip via the codegen dispatcher).
- `CometJsonExpressionSuite`: 8/8 pass on the unchanged native path.
- `CometStringExpressionSuite`: 33/33, `CometCodegenSuite`: 60/60, `CometCodegenSourceSuite`: 50/50, `CometSqlFileTestSuite`: 284/284.
- `cargo clippy --all-targets --workspace -- -D warnings`: clean.

- Add CometJsonJvmSuite to pr_build_linux.yml so check-missing-suites passes.
- Remove unused HigherOrderFunction, LambdaFunction, NamedLambdaVariable imports from CometBatchKernelCodegen.scala (referenced only in comments).
- Remove unused serializeDataType import from strings.scala.

andygrove marked this pull request as draft May 12, 2026 17:23

andygrove changed the title ~~feat: add JVM UDF engine for Spark JSON expressions~~ feat: add experimental support fro JSON expressions using Comet JVM UDF Framework May 12, 2026

andygrove changed the title ~~feat: add experimental support fro JSON expressions using Comet JVM UDF Framework~~ feat: add experimental support for JSON expressions using Comet JVM UDF Framework May 12, 2026

andygrove modified the milestones: 0.18.0 (July 2026), 0.18.0, 0.17.0 May 13, 2026

This was referenced May 14, 2026

feat(experimental): ScalaUDF and Java UDF support via Janino codegen #4267

Merged

feat: support stateful CometUDFs #4345

Merged

kazuyukitanimura mentioned this pull request May 18, 2026

Implement JVM UDFs for JSON expressions #4313

Open

andygrove added 2 commits May 26, 2026 16:10

andygrove changed the title ~~feat: add experimental support for JSON expressions using Comet JVM UDF Framework~~ feat: experimental Spark JSON support via codegen dispatcher May 26, 2026

andygrove wants to merge 4 commits into apache:main from andygrove:worktree-json-jvm-udf
