Open
58 commits
9cd1566
docs: add implement-comet-expression Claude skill
andygrove Apr 30, 2026
953cb86
docs: reference PR template and add skill-acknowledgement note
andygrove Apr 30, 2026
422d2b3
docs: check datafusion-spark crate before writing native code
andygrove Apr 30, 2026
88f2331
Merge branch 'add-implement-expression-skill'
andygrove Apr 30, 2026
eb8aa14
feat: add CometUDF trait for JVM-side scalar UDFs
andygrove May 1, 2026
60a2ecd
feat: add RegExpLikeUDF using java.util.regex.Pattern
andygrove May 1, 2026
633b75e
feat: add CometUdfBridge JNI entry point for native UDF dispatch
andygrove May 1, 2026
1c64070
feat: add JvmScalarUdf proto message for JVM UDF dispatch
andygrove May 1, 2026
8f78436
feat: register CometUdfBridge in JVMClasses for native UDF dispatch
andygrove May 1, 2026
cf233d5
feat: add JvmScalarUdfExpr PhysicalExpr that dispatches to JVM via JNI
andygrove May 1, 2026
d8ab411
feat: wire JvmScalarUdf proto into native planner
andygrove May 1, 2026
4970c9c
feat: add spark.comet.exec.regexp.useJVM config
andygrove May 1, 2026
54ddd50
feat: route RLike through JVM UDF when spark.comet.exec.regexp.useJVM…
andygrove May 1, 2026
0a942ad
test: add end-to-end suite for JVM-backed RLike
andygrove May 1, 2026
fbfc158
fix: use project-wide CometArrowAllocator in RegExpLikeUDF
andygrove May 1, 2026
909ab91
docs: correct CometUdfBridge thread cache lifetime comment
andygrove May 1, 2026
862ed2e
docs: document from_ffi consumption invariant in JvmScalarUdfExpr
andygrove May 1, 2026
a943de5
style: apply make format
andygrove May 1, 2026
e1b9b2a
docs: mark spark.comet.exec.regexp.useJVM experimental and generalize…
andygrove May 1, 2026
76418c6
test: add CometRegExpBenchmark covering all rlike modes
andygrove May 1, 2026
8ac45be
ci: register new RLike JVM-bridge test suites in PR workflows
andygrove May 1, 2026
a1f8ecf
build: exclude docs/superpowers from rat and git
andygrove May 1, 2026
23a9e52
remove skill
andygrove May 1, 2026
1c66f44
refactor: rename regexp.useJVM boolean to regexp.engine enum (rust|java)
andygrove May 1, 2026
56327ed
fix: ensure UDF bridge inputs/result close on every path and resolve …
andygrove May 1, 2026
fee5ab2
fix: validate regex pattern at convert time so invalid or null patter…
andygrove May 1, 2026
7d0f25c
fix: tolerate missing CometUdfBridge class at JVMClasses init
andygrove May 1, 2026
2a43867
refactor: introduce REGEXP_ENGINE_RUST/REGEXP_ENGINE_JAVA constants
andygrove May 1, 2026
760cd94
perf: send scalar UDF arguments as length-1 vectors
andygrove May 1, 2026
85029c5
test: cover empty and all-null subject vectors in RegExpLikeUDF unit …
andygrove May 1, 2026
a16f336
feat: propagate result nullability through JvmScalarUdf proto
andygrove May 1, 2026
5937650
fix: validate UDF result row count matches longest input
andygrove May 1, 2026
1dd81fb
fix: qualify CometRLike incompat reasons by engine config
andygrove May 1, 2026
42462c3
fix: bound UDF and pattern caches with LRU eviction
andygrove May 1, 2026
8073cf3
test: stop using per-test RootAllocator in RegExpLikeUDFSuite
andygrove May 1, 2026
ce01339
test: remove RegExpLikeUDFSuite due to shading boundary
andygrove May 1, 2026
eb544d6
Merge remote-tracking branch 'apache/main' into prototype-jvm-scalar-udf
andygrove May 6, 2026
4683199
feat: add all Spark regexp expressions via JVM UDF framework
andygrove May 6, 2026
6cac094
docs: update regexp compatibility guide for java vs rust engine
andygrove May 6, 2026
1ad838b
Merge remote-tracking branch 'apache/main' into java-regexp
andygrove May 8, 2026
250b469
fix: use ConcurrentHashMap for pattern cache in regexp UDFs
andygrove May 8, 2026
941d9c7
refactor: use computeIfAbsent for pattern cache lookup
andygrove May 8, 2026
336ec6e
Merge remote-tracking branch 'apache/main' into worktree-pr-4239-rege…
andygrove May 12, 2026
ea939ce
fix: default regexp engine back to rust, mark java engine experimental
andygrove May 12, 2026
5e18c62
style: prettier format regex compatibility docs
andygrove May 12, 2026
8b92370
style: drop unused idx binding in RegExpInStrUDF to fix scalafix lint
andygrove May 12, 2026
ca6628b
style: drop unused idx bindings in regexp serde to fix scalafix lint
andygrove May 12, 2026
c4e88fb
test: set regexp engine to java in SQL tests that need it
andygrove May 13, 2026
b55adb0
Merge remote-tracking branch 'apache/main' into java-regexp
andygrove May 13, 2026
0fa237f
fix: update regexp UDFs to new CometUDF.evaluate(inputs, numRows) sig…
andygrove May 14, 2026
f6b4096
Merge branch 'main' of github.com:apache/datafusion-comet into worktr…
andygrove May 19, 2026
2eb06c9
feat: gate JVM UDF framework behind spark.comet.jvmUdf.enabled
andygrove May 19, 2026
5dd2398
refactor: simplify regexp engine config to {rust, java}, default java
andygrove May 20, 2026
be487f1
refactor: surface engine=rust as the optedInBy opt-in for regex
andygrove May 20, 2026
0f21e19
fix: address CI failures for java-regexp PR
andygrove May 21, 2026
29428e5
Merge apache/main into java-regexp
andygrove May 26, 2026
bec171e
refactor: route regex expressions through codegen dispatcher instead …
andygrove May 26, 2026
3a13aa7
test: route rlike non-scalar-pattern fallback test through engine=rust
andygrove May 27, 2026
1 change: 1 addition & 0 deletions .github/workflows/pr_build_linux.yml
@@ -407,6 +407,7 @@ jobs:
org.apache.comet.expressions.conditional.CometIfSuite
org.apache.comet.expressions.conditional.CometCoalesceSuite
org.apache.comet.expressions.conditional.CometCaseWhenSuite
org.apache.comet.CometRegExpJvmSuite
org.apache.comet.CometCodegenSuite
org.apache.comet.CometCodegenSourceSuite
org.apache.comet.CometCodegenHOFSuite
1 change: 1 addition & 0 deletions .github/workflows/pr_build_macos.yml
@@ -247,6 +247,7 @@ jobs:
org.apache.comet.expressions.conditional.CometIfSuite
org.apache.comet.expressions.conditional.CometCoalesceSuite
org.apache.comet.expressions.conditional.CometCaseWhenSuite
org.apache.comet.CometRegExpJvmSuite
org.apache.comet.CometCodegenSuite
org.apache.comet.CometCodegenSourceSuite
org.apache.comet.CometCodegenHOFSuite
1 change: 1 addition & 0 deletions .gitignore
@@ -27,3 +27,4 @@ output
docs/comet-*/
docs/build/
docs/temp/
docs/superpowers/
101 changes: 98 additions & 3 deletions docs/source/user-guide/latest/compatibility/regex.md
@@ -19,6 +19,101 @@ under the License.

# Regular Expressions

-Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
-regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
-this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`.
Comet provides two regexp engines for evaluating regular expressions: a **Rust engine** that uses the Rust
[`regex`] crate natively, and an experimental **Java engine** that runs Spark's own `doGenCode` for the
expression inside Comet's Arrow-direct codegen dispatcher (the same dispatcher used by Comet's
`ScalaUDF` codegen path). The engine is selected with `spark.comet.exec.regexp.engine`, which accepts:

- `java` (default) — route through the Java engine for full Spark compatibility. Requires
`spark.comet.exec.scalaUDF.codegen.enabled=true`; otherwise regex expressions fall back to Spark with
an explanatory message.
- `rust` — run the Rust engine when an expression has a native implementation. Setting this is itself
the opt-in for the semantic differences between Java and Rust regex (no separate `allowIncompatible`
flag needed). Expressions without a native Rust implementation (`regexp_extract`,
`regexp_extract_all`, `regexp_instr`) fall back to Spark.

The codegen dispatcher is experimental and disabled by default. With pure defaults
(`engine=java`, `scalaUDF.codegen.enabled=false`), all regex expressions fall back to Spark.
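
The routing rules described above can be modeled in a few lines of plain Scala. This is only an illustrative sketch of the documented behavior, not Comet's implementation: the `RegexpEngineRouting` object, the `Route` type, and the `route` function are hypothetical names, and the configuration is passed as a plain `Map` rather than read from Spark's `SQLConf`.

```scala
import java.util.Locale

// Hypothetical model of the engine selection described in this guide.
object RegexpEngineRouting {
  sealed trait Route
  case object CometJavaEngine extends Route
  case object CometRustEngine extends Route
  case object SparkFallback extends Route

  // Expressions the guide lists as having a native Rust implementation.
  val rustNative: Set[String] = Set("rlike", "regexp_replace", "split")

  def route(conf: Map[String, String], expr: String): Route = {
    // Documented defaults: engine=java, codegen dispatcher disabled.
    val engine = conf
      .getOrElse("spark.comet.exec.regexp.engine", "java")
      .toLowerCase(Locale.ROOT)
    val codegenEnabled = conf
      .getOrElse("spark.comet.exec.scalaUDF.codegen.enabled", "false")
      .toBoolean

    engine match {
      case "java" if codegenEnabled => CometJavaEngine
      case "java" => SparkFallback // dispatcher disabled
      case "rust" if rustNative(expr) => CometRustEngine
      // No native Rust path. The real config rejects unknown engine names;
      // this sketch simply treats them as a fallback.
      case _ => SparkFallback
    }
  }
}
```

With an empty map (pure defaults) every expression routes to `SparkFallback`, which matches the statement above that all regex expressions fall back to Spark until the dispatcher is enabled or the Rust engine is selected.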

## Choosing an engine

| | Rust engine | Java engine (experimental, default) |
| -------------------- | --------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| **Compatibility** | Differs from Java regex (see below) | 100% compatible with Spark |
| **Feature coverage** | `rlike`, `regexp_replace`, `split` only | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
| **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
| **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |

The **Rust engine** is faster but cannot match Java regex semantics for every pattern. Because the engine
choice is itself the opt-in, setting `spark.comet.exec.regexp.engine=rust` declares acceptance of those
differences without a separate per-expression flag.

The **Java engine** is the default but the underlying codegen dispatcher is experimental and gated behind
`spark.comet.exec.scalaUDF.codegen.enabled=true`; the behavior, configuration, and supported expressions
may change in future releases.

## Why the engines differ

Java's `java.util.regex` is a backtracking engine in the Perl/PCRE family. It supports the full range of
features that style of engine provides, including some whose worst-case running time grows exponentially with
the input.

Rust's [`regex`] crate is a finite-automaton engine in the [RE2] family. It deliberately omits features that
cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs
in time linear in the size of the input. This is the same trade-off RE2, Go's `regexp`, and several other
engines make.

The practical consequence is that Java accepts a strictly larger set of patterns than the Rust engine, and
several constructs that look the same in source have different semantics on the two sides.

## Features supported by Java but not by the Rust engine

Patterns that use any of the following will not compile in Comet's Rust engine and must run on Spark (or use
the Java engine):

- **Backreferences** such as `\1`, `\2`, or `\k<name>`. The Rust engine has no backtracking and cannot match
a previously captured group.
- **Lookaround**, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`).
- **Atomic groups** (`(?>...)`).
- **Possessive quantifiers** (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not
possessive.
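- **Embedded code, conditionals, and recursion** such as `(?(cond)yes|no)` or `(?R)`. Rust accepts none of
these. These are Perl/PCRE constructs that `java.util.regex` also rejects, so such patterns fail under
either engine.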
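
As a small illustration of the first few items, the following sketch compiles a backreference, a lookahead, and a possessive quantifier with `java.util.regex.Pattern`, which is the library the Java engine ultimately relies on. The object and value names are hypothetical; only `Pattern.compile` and `Matcher.find` are real API calls.

```scala
import java.util.regex.Pattern

object JavaOnlyRegexFeatures {
  // Backreference: \1 must repeat exactly what group 1 captured.
  val doubledWord: Pattern = Pattern.compile("""(\w+) \1""")

  // Lookahead: match "foo" only when "bar" follows, without consuming "bar".
  val fooBeforeBar: Pattern = Pattern.compile("""foo(?=bar)""")

  // Possessive quantifier: a++ never gives characters back,
  // so the trailing "a" can never be satisfied.
  val possessive: Pattern = Pattern.compile("""a++a""")

  // Greedy counterpart for comparison: a+ backtracks and lets the final "a" match.
  val greedy: Pattern = Pattern.compile("""a+a""")

  def found(p: Pattern, s: String): Boolean = p.matcher(s).find()
}
```

For example, `found(doubledWord, "hello hello")` is true while `found(doubledWord, "hello world")` is false, and `found(possessive, "aaa")` is false even though `found(greedy, "aaa")` is true.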

## Features that exist on both sides but behave differently

Even where both engines accept a construct, the matching behavior is not always the same.

- **Unicode-aware character classes.** In the Rust engine, `\d`, `\w`, `\s`, and `.` are Unicode-aware by
default, so `\d` matches every digit codepoint defined by Unicode rather than only `0`-`9`. Java's defaults
match ASCII only and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to switch to Unicode
semantics. The same pattern can therefore match a different set of characters on each side.
- **Line terminators.** In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line
separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF
mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition.
- **Case-insensitive matching.** Both engines support `(?i)`, but Java's default is ASCII case folding while
the Rust engine uses full Unicode simple case folding when Unicode mode is on. Patterns that match characters
outside ASCII can produce different results.
- **POSIX character classes.** The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket
expressions but not Java's `\p{Alpha}` shorthand. Java supports only the `\p{Alpha}` form; it parses
`[[:alpha:]]` as a nested character class of literal characters rather than as a POSIX class. Unicode
property escapes (`\p{L}`, `\p{Greek}`, etc.) are supported by both engines but cover slightly different
sets of properties.
- **Octal and Unicode escapes.** Java accepts `\0nnn` for octal and `\uXXXX` for a BMP codepoint. Rust uses
`\x{...}` for arbitrary codepoints and does not accept Java's bare `\uXXXX` form.
- **Empty matches in `split`.** Spark's `StringSplit`, which is built on Java's regex, includes leading empty
strings produced by zero-width matches at the start of the input. The Rust engine's `split` follows different
rules, so split results can differ in edge cases involving empty matches even when the pattern itself is
identical on both sides.
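
The Java side of the first and third items can be checked directly against `java.util.regex.Pattern`. The sketch below shows only Java's behavior (the Rust side is not exercised here), and the object and method names are hypothetical.

```scala
import java.util.regex.Pattern

object JavaRegexDefaults {
  // ARABIC-INDIC DIGIT THREE: a decimal digit in Unicode, but not ASCII.
  val arabicIndicThree = "\u0663"

  // Default flags: \d matches only 0-9.
  def asciiDigit(s: String): Boolean =
    Pattern.compile("\\d").matcher(s).matches()

  // UNICODE_CHARACTER_CLASS switches \d, \w, \s to Unicode semantics.
  def unicodeDigit(s: String): Boolean =
    Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches()

  // CASE_INSENSITIVE alone folds case for US-ASCII characters only.
  def asciiFold(pattern: String, s: String): Boolean =
    Pattern.compile(pattern, Pattern.CASE_INSENSITIVE).matcher(s).matches()

  // Adding UNICODE_CASE extends case folding beyond ASCII.
  def unicodeFold(pattern: String, s: String): Boolean =
    Pattern
      .compile(pattern, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
      .matcher(s)
      .matches()
}
```

Here `asciiDigit(arabicIndicThree)` is false and `unicodeDigit(arabicIndicThree)` is true. Likewise `asciiFold("\u00e9", "\u00c9")` (lowercase and uppercase e-acute) is false, while `unicodeFold` with the same arguments is true.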

## When the Rust engine is safe

For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and
ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this
shape and want to avoid the JNI overhead of the Java engine, switching to the Rust engine with
`spark.comet.exec.regexp.engine=rust` is generally safe.

For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
defaults, use the experimental Java engine.

[`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
[`regex`]: https://docs.rs/regex/latest/regex/
[RE2]: https://github.com/google/re2/wiki/Syntax
1 change: 1 addition & 0 deletions pom.xml
@@ -1153,6 +1153,7 @@ under the License.
<exclude>native/proto/src/generated/**</exclude>
<exclude>benchmarks/tpc/queries/**</exclude>
<exclude>.claude/**</exclude>
<exclude>docs/superpowers/**</exclude>
</excludes>
</configuration>
</plugin>
25 changes: 24 additions & 1 deletion spark/src/main/scala/org/apache/comet/CometConf.scala
@@ -369,10 +369,33 @@ object CometConf extends ShimCometConf {
"Arrow-direct codegen dispatcher. When enabled, a supported ScalaUDF is compiled into " +
"a per-batch kernel that reads and writes Arrow vectors directly from native " +
"execution. When disabled, plans containing a ScalaUDF fall back to Spark for the " +
-"enclosing operator.")
+"enclosing operator. The same dispatcher backs `spark.comet.exec.regexp.engine=java` " +
+"so the regex family routes through it as well.")
.booleanConf
.createWithDefault(false)

val REGEXP_ENGINE_RUST = "rust"
val REGEXP_ENGINE_JAVA = "java"

val COMET_REGEXP_ENGINE: ConfigEntry[String] =
conf("spark.comet.exec.regexp.engine")
.category(CATEGORY_EXEC)
.doc(
"Selects the engine used to evaluate Spark regular-expression expressions. " +
s"`$REGEXP_ENGINE_JAVA` (default) routes through the Arrow-direct codegen dispatcher " +
"so Spark's own `doGenCode` (backed by `java.util.regex.Pattern`) runs inside the " +
s"Comet pipeline; this requires ${COMET_SCALA_UDF_CODEGEN_ENABLED.key}=true and " +
s"falls back to Spark otherwise. `$REGEXP_ENGINE_RUST` runs the native DataFusion " +
"regexp engine when an implementation exists; setting this is itself the opt-in " +
"for the semantic differences between Java and Rust regex. Affected expressions: " +
"rlike, regexp_extract, regexp_extract_all, regexp_replace, regexp_instr, and " +
"split (the extract/instr family has no native Rust path; they fall back to Spark " +
s"under `$REGEXP_ENGINE_RUST`).")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set(REGEXP_ENGINE_RUST, REGEXP_ENGINE_JAVA))
.createWithDefault(REGEXP_ENGINE_JAVA)

val COMET_EXEC_SHUFFLE_WITH_HASH_PARTITIONING_ENABLED: ConfigEntry[Boolean] =
conf("spark.comet.native.shuffle.partitioning.hash.enabled")
.category(CATEGORY_SHUFFLE)
2 changes: 1 addition & 1 deletion spark/src/main/scala/org/apache/comet/GenerateDocs.scala
@@ -313,7 +313,7 @@ object GenerateDocs {
annotations += ((fromTypeName, toTypeName, note.trim.replace("(10,2)", "")))
}
"C"
-case Incompatible(notes) =>
+case Incompatible(notes, _) =>
notes.filter(_.trim.nonEmpty).foreach { note =>
annotations += ((fromTypeName, toTypeName, note.trim.replace("(10,2)", "")))
}
32 changes: 0 additions & 32 deletions spark/src/main/scala/org/apache/comet/expressions/RegExp.scala

This file was deleted.

@@ -723,7 +723,7 @@ case class CometExecRule(session: SparkSession)
case Unsupported(notes) =>
withInfo(op, notes.getOrElse(""))
false
-case Incompatible(notes) =>
+case Incompatible(notes, _) =>
val allowIncompat = CometConf.isOperatorAllowIncompat(opName)
val incompatConf = CometConf.getOperatorAllowIncompatConfigKey(opName)
if (allowIncompat) {
62 changes: 46 additions & 16 deletions spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
@@ -185,6 +185,9 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
classOf[Like] -> CometLike,
classOf[Lower] -> CometLower,
classOf[OctetLength] -> CometScalarFunction("octet_length"),
classOf[RegExpExtract] -> CometRegExpExtract,
classOf[RegExpExtractAll] -> CometRegExpExtractAll,
classOf[RegExpInStr] -> CometRegExpInStr,
classOf[RegExpReplace] -> CometRegExpReplace,
classOf[Reverse] -> CometReverse,
classOf[RLike] -> CometRLike,
@@ -578,23 +581,29 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
case Unsupported(notes) =>
withInfo(fn, notes.getOrElse(""))
None
-case Incompatible(notes) =>
+case Incompatible(notes, optedInBy) =>
val exprAllowIncompat = CometConf.isExprAllowIncompat(exprConfName)
-if (exprAllowIncompat) {
+val namedConfOptIn = optedInBy.exists(isOptedInVia)
+if (exprAllowIncompat || namedConfOptIn) {
if (notes.isDefined) {
-logWarning(
-s"Comet supports $fn when " +
-s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true " +
-s"but has notes: ${notes.get}")
+val optInDesc = if (namedConfOptIn) {
+optedInBy.get
+} else {
+s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true"
+}
+logWarning(s"Comet supports $fn when $optInDesc but has notes: ${notes.get}")
}
aggHandler.convert(aggExpr, fn, inputs, binding, conf)
} else {
val optionalNotes = notes.map(str => s" ($str)").getOrElse("")
+val extraOptIn = optedInBy
+.map(kv => s" or by setting $kv")
+.getOrElse("")
withInfo(
fn,
s"$fn is not fully compatible with Spark$optionalNotes. To enable it anyway, " +
-s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true. " +
-s"${CometConf.COMPAT_GUIDE}.")
+s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true" +
+s"$extraOptIn. ${CometConf.COMPAT_GUIDE}.")
None
}
case Compatible(notes) =>
@@ -670,6 +679,21 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
exprToProtoInternal(newExpr, inputs, binding)
}

/**
* True when the current SQLConf has the named config set to the given value. The argument is a
* `key=value` string used by `Incompatible.optedInBy` to declare which config opts the user
* into running an otherwise-incompatible expression. The configured value is compared
* case-insensitively after splitting on the first `=`.
*/
private def isOptedInVia(keyEqualsValue: String): Boolean = {
keyEqualsValue.split("=", 2) match {
case Array(key, expected) =>
Option(SQLConf.get.getConfString(key, null))
.exists(_.equalsIgnoreCase(expected))
case _ => false
}
}
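
The `key=value` check above can be tried outside Spark with a simplified stand-in. In this sketch the configuration is a plain `Map[String, String]` instead of `SQLConf.get`, and the `OptInCheck` object is a hypothetical name; the splitting and case-insensitive comparison mirror the method in the diff.

```scala
object OptInCheck {
  // Stand-in for the SQLConf-backed helper: `conf` replaces SQLConf.get.
  def isOptedInVia(conf: Map[String, String], keyEqualsValue: String): Boolean =
    keyEqualsValue.split("=", 2) match {
      // Limit 2 splits on the first '=' only, so values may contain '='.
      case Array(key, expected) =>
        conf.get(key).exists(_.equalsIgnoreCase(expected))
      // No '=' present: the string cannot name a config value.
      case _ => false
    }
}
```

With `Map("spark.comet.exec.regexp.engine" -> "RUST")`, the string `spark.comet.exec.regexp.engine=rust` opts in because the comparison ignores case, while a string without `=` never does.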

/**
* Convert a Spark expression to a protocol-buffer representation of a native Comet/DataFusion
* expression.
@@ -703,23 +727,29 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
case Unsupported(notes) =>
withInfo(expr, notes.getOrElse(""))
None
-case Incompatible(notes) =>
+case Incompatible(notes, optedInBy) =>
val exprAllowIncompat = CometConf.isExprAllowIncompat(exprConfName)
-if (exprAllowIncompat) {
+val namedConfOptIn = optedInBy.exists(isOptedInVia)
+if (exprAllowIncompat || namedConfOptIn) {
if (notes.isDefined) {
-logWarning(
-s"Comet supports $expr when " +
-s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true " +
-s"but has notes: ${notes.get}")
+val optInDesc = if (namedConfOptIn) {
+optedInBy.get
+} else {
+s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true"
+}
+logWarning(s"Comet supports $expr when $optInDesc but has notes: ${notes.get}")
}
handler.convert(expr, inputs, binding)
} else {
val optionalNotes = notes.map(str => s" ($str)").getOrElse("")
+val extraOptIn = optedInBy
+.map(kv => s" or by setting $kv")
+.getOrElse("")
withInfo(
expr,
s"$expr is not fully compatible with Spark$optionalNotes. To enable it anyway, " +
-s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true. " +
-s"${CometConf.COMPAT_GUIDE}.")
+s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true" +
+s"$extraOptIn. ${CometConf.COMPAT_GUIDE}.")
None
}
case Compatible(notes) =>
11 changes: 10 additions & 1 deletion spark/src/main/scala/org/apache/comet/serde/SupportLevel.scala
@@ -37,8 +37,17 @@ case class Compatible(notes: Option[String] = None) extends SupportLevel
*
* Any compatibility differences are noted in the
* [[https://datafusion.apache.org/comet/user-guide/compatibility.html Comet Compatibility Guide]].
*
* @param notes
* Optional human-readable notes about the incompatibility.
* @param optedInBy
* Optional `key=value` pair naming a SQLConf entry that, when set to `value`, opts the user
* into running this expression despite the incompatibility, in addition to the per-expression
* `spark.comet.expression.<Name>.allowIncompatible` flag. Use this when a broader config (for
* example, an engine selector) already encodes the user's acceptance of the trade-off.
*/
-case class Incompatible(notes: Option[String] = None) extends SupportLevel
+case class Incompatible(notes: Option[String] = None, optedInBy: Option[String] = None)
+extends SupportLevel
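
A standalone sketch of how the new `optedInBy` field feeds the fallback message may help when reading the `QueryPlanSerde` hunks. The types are redeclared locally so the snippet runs without Comet on the classpath. The `fallbackMessage` helper, the notes text, and the `RLike` config name in the usage note are illustrative assumptions, and the trailing compatibility-guide text that the real message appends is left out.

```scala
object SupportLevelSketch {
  sealed trait SupportLevel
  // Local copy of the case class shape introduced in this diff.
  case class Incompatible(
      notes: Option[String] = None,
      optedInBy: Option[String] = None)
      extends SupportLevel

  // Hypothetical helper that mirrors the message assembly in the diff.
  def fallbackMessage(expr: String, allowKey: String, level: Incompatible): String = {
    val optionalNotes = level.notes.map(str => s" ($str)").getOrElse("")
    val extraOptIn = level.optedInBy.map(kv => s" or by setting $kv").getOrElse("")
    s"$expr is not fully compatible with Spark$optionalNotes. To enable it anyway, " +
      s"set $allowKey=true$extraOptIn."
  }
}
```

Calling `fallbackMessage("rlike", "spark.comet.expression.RLike.allowIncompatible", Incompatible(Some("Rust regex semantics"), Some("spark.comet.exec.regexp.engine=rust")))` yields a message that names both opt-in routes. With `optedInBy = None` only the per-expression flag is mentioned, which matches the behavior before this change.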

/** Comet does not support this feature */
case class Unsupported(notes: Option[String] = None) extends SupportLevel