SPANDigital · richardwooding · Apr 28, 2026 · Apr 27, 2026 · Apr 28, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## What This Project Does
 
-pycel2sql converts CEL (Common Expression Language) expressions into SQL WHERE clauses. It supports five SQL dialects: PostgreSQL, DuckDB, BigQuery, MySQL, and SQLite.
+pycel2sql converts CEL (Common Expression Language) expressions into SQL WHERE clauses. It supports six SQL dialects: PostgreSQL, DuckDB, BigQuery, MySQL, SQLite, and Apache Spark.
 
 ## Commands
 
@@ -68,7 +68,7 @@ Lark grammar rule names encode operators: `relation_eq`, `addition_add`, `multip
 | `__init__.py` | Public API: `convert()`, `convert_parameterized()`, `analyze()`, `introspect()` |
 | `_converter.py` | Core Converter — Lark Interpreter with visitor methods for every grammar rule |
 | `dialect/_base.py` | `Dialect` ABC (40+ abstract methods), `WriteFunc` type alias, `IndexAdvisor` protocol |
-| `dialect/{postgres,duckdb,bigquery,mysql,sqlite}.py` | Concrete dialect implementations |
+| `dialect/{postgres,duckdb,bigquery,mysql,sqlite,spark}.py` | Concrete dialect implementations |
 | `schema.py` | `Schema` / `FieldSchema` for JSON/array field detection |
 | `_analysis.py` | `IndexAnalyzer` — second-pass tree walker for index recommendations |
 | `_utils.py` | Validation, escaping, RE2→SQL regex conversion |
@@ -78,11 +78,12 @@ Lark grammar rule names encode operators: `relation_eq`, `addition_add`, `multip
 
 ### Dialect Differences
 
-- **PostgreSQL**: `$N` params, `ARRAY[...]`, `~ / ~*` regex, `->>/->` JSON, `POSITION()` for contains
-- **DuckDB**: `$N` params, `[...]` arrays, RE2 regex, `CONTAINS()`, `STRING_SPLIT()`
-- **BigQuery**: `@pN` params, `[...]` arrays, `REGEXP_CONTAINS()`, `JSON_VALUE()`, `TIMESTAMP_ADD/SUB()`
-- **MySQL**: `?` params, `JSON_ARRAY()`, `REGEXP`, `JSON_TABLE()` for unnest
-- **SQLite**: `?` params, `json_array()`, no regex/split/join, `json_each()` for unnest
+- **PostgreSQL**: `$N` params, `ARRAY[...]`, `~ / ~*` regex, `->>/->` JSON, `POSITION()` for contains, `FORMAT()`
+- **DuckDB**: `$N` params, `[...]` arrays, RE2 regex, `CONTAINS()`, `STRING_SPLIT()`, `printf()`
+- **BigQuery**: `@pN` params, `[...]` arrays, `REGEXP_CONTAINS()`, `JSON_VALUE()`, `TIMESTAMP_ADD/SUB()`, `FORMAT()`
+- **MySQL**: `?` params, `JSON_ARRAY()`, `REGEXP`, `JSON_TABLE()` for unnest, `format()` raises `UnsupportedDialectFeatureError`
+- **SQLite**: `?` params, `json_array()`, no regex/split/join, `json_each()` for unnest, `printf()`
+- **Apache Spark**: `?` positional params, `array(...)`, `RLIKE`, `get_json_object()`, `concat()`, `array_contains(arr, elem)` (arg order swap), `EXPLODE` / `(SELECT collect_list(...))`, `format_string()`, `(dayofweek(t) - 1)` for day-of-week, JSON array membership raises (no boolean predicate available)
 
 ### Test Organization
 
@@ -97,4 +98,8 @@ Unit tests (`tests/test_*.py`) cover each feature area per dialect. Integration
 - Depth tracking: `_visit_child()` increments/decrements `_depth` and checks limits
 - Error types use dual messaging pattern to prevent information disclosure (CWE-209)
 - `validate_schema` parameter: opt-in strict validation on `convert()`/`convert_parameterized()`/`analyze()`. Validates `table.field` references exist in schemas; skips comprehension variables, bare identifiers, and nested JSON keys beyond the first field. Raises `InvalidSchemaError` (with dual messaging). Requires schemas to be provided.
+- `json_variables` parameter: opt-in declaration that named CEL variables are flat JSONB columns. Field access (dot or bracket) emits dialect-specific JSON extraction. Takes precedence over schema-declared JSON. Comprehension iter vars shadow `json_variables` (collisions are not treated as JSON inside the comprehension body).
+- `column_aliases` parameter: maps CEL identifier names to SQL column names. The alias is validated against the dialect's identifier rules; the original CEL name remains the schema key (alias is output-only).
+- `param_start_index` parameter (only on `convert_parameterized()`): shifts the placeholder counter so the first parameter is `$N` / `@pN` instead of `$1` / `@p1`. Values < 1 are clamped to 1. Positional-`?` dialects (MySQL, SQLite, Spark) ignore the index in placeholder text but still preserve parameter ordering.
+- `format()` is dispatched per-dialect via `Dialect.write_format`: PostgreSQL/BigQuery emit `FORMAT(...)`, SQLite/DuckDB emit `printf(...)`, Apache Spark emits `format_string(...)`, MySQL raises `UnsupportedDialectFeatureError`.
 - Ruff for linting, mypy strict for type checking, line length 100, target Python 3.12+
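Review note on the `param_start_index` convention above: the documented rules (start at the given index, clamp values below 1, ignore the index for positional `?` dialects) can be modelled in isolation. The helper names `clamp_start` and `placeholders` and the style labels are hypothetical and not part of pycel2sql; this is a minimal sketch of the described behaviour, not the converter's implementation.

```python
from __future__ import annotations

# Placeholder style per dialect, as listed in the README placeholder table.
# The style labels ("dollar", "at_p", "qmark") are illustrative names only.
STYLES = {
    "postgres": "dollar",
    "duckdb": "dollar",
    "bigquery": "at_p",
    "mysql": "qmark",
    "sqlite": "qmark",
    "spark": "qmark",
}


def clamp_start(param_start_index: int | None) -> int:
    """Documented rule: a missing index means 1, values below 1 clamp to 1."""
    return 1 if param_start_index is None else max(1, param_start_index)


def placeholders(
    dialect: str, count: int, param_start_index: int | None = None
) -> list[str]:
    """Return the placeholder text for `count` consecutive parameters."""
    start = clamp_start(param_start_index)
    style = STYLES[dialect]
    result = []
    for index in range(start, start + count):
        if style == "dollar":
            result.append(f"${index}")
        elif style == "at_p":
            result.append(f"@p{index}")
        else:
            # Positional dialects always emit "?"; only the ordering matters.
            result.append("?")
    return result


print(placeholders("postgres", 2, 5))  # ['$5', '$6']
print(placeholders("bigquery", 2, 0))  # ['@p1', '@p2']
print(placeholders("spark", 2, 5))     # ['?', '?']
```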
diff --git a/README.md b/README.md
@@ -10,6 +10,7 @@
 [![BigQuery](https://img.shields.io/badge/BigQuery-669DF6?logo=googlebigquery&logoColor=white)](https://cloud.google.com/bigquery)
 [![MySQL](https://img.shields.io/badge/MySQL-4479A1?logo=mysql&logoColor=white)](https://www.mysql.com/)
 [![SQLite](https://img.shields.io/badge/SQLite-003B57?logo=sqlite&logoColor=white)](https://www.sqlite.org/)
+[![Apache Spark](https://img.shields.io/badge/Apache%20Spark-E25A1C?logo=apachespark&logoColor=white)](https://spark.apache.org/)
 
 Convert [CEL (Common Expression Language)](https://cel.dev/) expressions to SQL WHERE clauses.
 
@@ -38,7 +39,7 @@ sql = convert('status == "active" || tags.size() > 0')
 
 ## Dialects
 
-Five SQL dialects are supported:
+Six SQL dialects are supported:
 
 ```python
 from pycel2sql import convert
@@ -50,9 +51,13 @@ sql = convert('name == "alice"', dialect=get_dialect("mysql"))
 sql = convert('name == "alice"', dialect=get_dialect("sqlite"))
 sql = convert('name == "alice"', dialect=get_dialect("duckdb"))
 sql = convert('name == "alice"', dialect=get_dialect("bigquery"))
+sql = convert('name == "alice"', dialect=get_dialect("spark"))
 
 # Or instantiate directly
-from pycel2sql import PostgresDialect, MySQLDialect, SQLiteDialect, DuckDBDialect, BigQueryDialect
+from pycel2sql import (
+    PostgresDialect, MySQLDialect, SQLiteDialect, DuckDBDialect,
+    BigQueryDialect, SparkDialect,
+)
 
 sql = convert('name == "alice"', dialect=MySQLDialect())
 ```
@@ -75,13 +80,76 @@ result = convert_parameterized('name == "alice"', dialect=MySQLDialect())
 
 Placeholder styles per dialect:
 
-| Dialect    | Placeholder |
-|------------|-------------|
-| PostgreSQL | `$1`, `$2`, ... |
-| DuckDB     | `$1`, `$2`, ... |
-| BigQuery   | `@p1`, `@p2`, ... |
-| MySQL      | `?` |
-| SQLite     | `?` |
+| Dialect       | Placeholder        |
+|---------------|--------------------|
+| PostgreSQL    | `$1`, `$2`, ...    |
+| DuckDB        | `$1`, `$2`, ...    |
+| BigQuery      | `@p1`, `@p2`, ...  |
+| MySQL         | `?` (positional)   |
+| SQLite        | `?` (positional)   |
+| Apache Spark  | `?` (positional)   |
+
+## Conversion Options
+
+### `json_variables`
+
+Declare CEL variable names that correspond to flat JSONB columns. Field access via dot notation or bracket notation emits dialect-specific JSON extraction:
+
+```python
+from pycel2sql import convert
+
+# PostgreSQL: dot and bracket notation both produce ->> operators
+sql = convert("context.host == 'a'", json_variables={"context"})
+# => context->>'host' = 'a'
+
+sql = convert('context["host"] == "a"', json_variables={"context"})
+# => context->>'host' = 'a'
+
+# Nested paths: intermediate keys use ->, final key uses ->>
+sql = convert("tags.corpus.section == 'x'", json_variables={"tags"})
+# => tags->'corpus'->>'section' = 'x'
+```
+
+`json_variables` takes precedence over schema-declared JSON. Comprehension iteration variables shadow `json_variables`: a name that collides with one is not treated as JSON inside the comprehension body.
+
+### `column_aliases`
+
+Map CEL identifier names to SQL column names. Useful when database columns use prefixed names while user-facing CEL expressions use clean names:
+
+```python
+sql = convert("name == 'a'", column_aliases={"name": "usr_name"})
+# => usr_name = 'a'
+```
+
+The alias is validated against the dialect's identifier rules. The original CEL name remains the schema key; the alias only affects the emitted SQL.
+
+### `param_start_index`
+
+Shift the placeholder counter for `convert_parameterized()` when embedding the generated fragment into a larger pre-parameterized query:
+
+```python
+result = convert_parameterized(
+    "name == 'a' && age > 30",
+    param_start_index=5,
+)
+# result.sql => 'name = $5 AND age > $6'
+# result.parameters => ['a', 30]
+```
+
+Values less than 1 are clamped to 1. For positional-`?` dialects (MySQL, SQLite, Apache Spark), the placeholder text is always `?`, so the index has no visible effect; parameter ordering is still preserved.
+
+### `format()` per-dialect mapping
+
+CEL's `string.format(args)` dispatches to dialect-specific SQL:
+
+| Dialect       | Output                                  |
+|---------------|-----------------------------------------|
+| PostgreSQL    | `FORMAT('...', ...)`                    |
+| BigQuery      | `FORMAT('...', ...)`                    |
+| SQLite        | `printf('...', ...)`                    |
+| DuckDB        | `printf('...', ...)`                    |
+| Apache Spark  | `format_string('...', ...)`             |
+| MySQL         | raises `UnsupportedDialectFeatureError` |
 
 ## JSON Fields
 
@@ -173,7 +241,7 @@ schemas = introspect_sqlite(
 )
 ```
 
-All five dialects are supported: `introspect_postgres`, `introspect_duckdb`, `introspect_bigquery`, `introspect_mysql`, `introspect_sqlite`.
+Introspection is available for five of the six dialects: `introspect_postgres`, `introspect_duckdb`, `introspect_bigquery`, `introspect_mysql`, `introspect_sqlite`. Apache Spark introspection is not provided; construct `Schema` directly.
 
 ## Supported CEL Features
 
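Review note on the `json_variables` README section: the PostgreSQL rule it states (intermediate keys use `->`, the final key uses `->>`) can be sketched as a small standalone function. `pg_json_path` is a hypothetical helper, not a pycel2sql API, and the quote-doubling escape is an assumption about how keys would be made safe; the library's own escaping lives in `_utils.py` and may differ.

```python
def pg_json_path(column: str, keys: list[str]) -> str:
    """Build a PostgreSQL JSON text-extraction path for nested keys."""
    if not keys:
        raise ValueError("at least one key is required")

    def quote(key: str) -> str:
        # Assumed escaping: double any single quote inside the key.
        return "'" + key.replace("'", "''") + "'"

    parts = [column]
    # Every key except the last keeps the JSON type via ->.
    for key in keys[:-1]:
        parts.append("->" + quote(key))
    # The last key extracts text via ->>.
    parts.append("->>" + quote(keys[-1]))
    return "".join(parts)


print(pg_json_path("context", ["host"]))            # context->>'host'
print(pg_json_path("tags", ["corpus", "section"]))  # tags->'corpus'->>'section'
```

Both printed paths match the left-hand sides of the README examples in this PR.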

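Review note on the `param_start_index` README section: the motivating use case is splicing the generated fragment into a query that already has numbered parameters. The sketch below hard-codes the fragment and parameter list shown in the README example rather than calling pycel2sql; the outer query, its table and column names, and the `embed` helper are all hypothetical.

```python
# A hypothetical outer query that already uses $1..$4.
outer_sql = (
    "SELECT * FROM users "
    "WHERE tenant_id = $1 AND region = $2 AND created_at >= $3 AND deleted = $4"
)
outer_params = [42, "eu", "2026-01-01", False]

# The next free placeholder index is what would be passed as param_start_index.
next_index = len(outer_params) + 1

# Values copied from the README example for param_start_index=5.
fragment_sql = "name = $5 AND age > $6"
fragment_params = ["a", 30]


def embed(
    outer_sql: str, outer_params: list, fragment_sql: str, fragment_params: list
) -> tuple[str, list]:
    """Append the fragment as an extra AND condition and merge the parameters."""
    combined_sql = f"{outer_sql} AND ({fragment_sql})"
    combined_params = [*outer_params, *fragment_params]
    return combined_sql, combined_params


sql, params = embed(outer_sql, outer_params, fragment_sql, fragment_params)
print(next_index)  # 5
print(sql)
print(params)      # [42, 'eu', '2026-01-01', False, 'a', 30]
```

Because the fragment's placeholders continue from the outer query's last index, the merged parameter list lines up with `$1` through `$6` without any renumbering.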
diff --git a/src/pycel2sql/__init__.py b/src/pycel2sql/__init__.py
@@ -19,6 +19,7 @@
 from pycel2sql.dialect.duckdb import DuckDBDialect
 from pycel2sql.dialect.mysql import MySQLDialect
 from pycel2sql.dialect.postgres import PostgresDialect
+from pycel2sql.dialect.spark import SparkDialect
 from pycel2sql.dialect.sqlite import SQLiteDialect
 from pycel2sql.introspect import introspect
 from pycel2sql.schema import Schema
@@ -38,6 +39,7 @@
     "DuckDBDialect",
     "MySQLDialect",
     "PostgresDialect",
+    "SparkDialect",
     "SQLiteDialect",
 ]
 
@@ -60,6 +62,8 @@ def convert(
     max_depth: int | None = None,
     max_output_length: int | None = None,
     validate_schema: bool = False,
+    json_variables: set[str] | frozenset[str] | list[str] | None = None,
+    column_aliases: dict[str, str] | None = None,
 ) -> str:
     """Convert a CEL expression to an inline SQL WHERE clause string.
 
@@ -71,6 +75,13 @@ def convert(
         max_output_length: Maximum SQL output length. Defaults to 50000.
         validate_schema: If True, raise InvalidSchemaError for unrecognized
             table or field references. Requires schemas to be provided.
+        json_variables: CEL variable names that correspond to flat JSONB
+            columns. Field access (dot or bracket notation) against these
+            variables emits dialect-specific JSON extraction instead of
+            plain dot notation.
+        column_aliases: Map CEL identifier names to SQL column names. When a
+            CEL identifier matches a key, the alias is emitted (and validated
+            against the dialect's identifier rules).
 
     Returns:
         The SQL WHERE clause string.
@@ -84,6 +95,30 @@ def convert(
 
     tree = _parser.parse(cel_expr)
 
+    kwargs: dict[str, Any] = _build_kwargs(
+        schemas=schemas,
+        max_depth=max_depth,
+        max_output_length=max_output_length,
+        validate_schema=validate_schema,
+        json_variables=json_variables,
+        column_aliases=column_aliases,
+    )
+
+    converter = Converter(dialect, **kwargs)
+    converter.visit(tree)
+    return converter.result
+
+
+def _build_kwargs(
+    *,
+    schemas: dict[str, Schema] | None = None,
+    max_depth: int | None = None,
+    max_output_length: int | None = None,
+    validate_schema: bool = False,
+    json_variables: set[str] | frozenset[str] | list[str] | None = None,
+    column_aliases: dict[str, str] | None = None,
+    param_start_index: int | None = None,
+) -> dict[str, Any]:
     kwargs: dict[str, Any] = {}
     if schemas is not None:
         kwargs["schemas"] = schemas
@@ -93,10 +128,13 @@ def convert(
         kwargs["max_output_length"] = max_output_length
     if validate_schema:
         kwargs["validate_schema"] = validate_schema
-
-    converter = Converter(dialect, **kwargs)
-    converter.visit(tree)
-    return converter.result
+    if json_variables is not None:
+        kwargs["json_variables"] = frozenset(json_variables)
+    if column_aliases is not None:
+        kwargs["column_aliases"] = dict(column_aliases)
+    if param_start_index is not None:
+        kwargs["param_start_index"] = max(1, param_start_index)
+    return kwargs
 
 
 def convert_parameterized(
@@ -107,6 +145,9 @@ def convert_parameterized(
     max_depth: int | None = None,
     max_output_length: int | None = None,
     validate_schema: bool = False,
+    json_variables: set[str] | frozenset[str] | list[str] | None = None,
+    column_aliases: dict[str, str] | None = None,
+    param_start_index: int | None = None,
 ) -> Result:
     """Convert a CEL expression to a parameterized SQL WHERE clause.
 
@@ -118,6 +159,11 @@ def convert_parameterized(
         max_output_length: Maximum SQL output length. Defaults to 50000.
         validate_schema: If True, raise InvalidSchemaError for unrecognized
             table or field references. Requires schemas to be provided.
+        json_variables: CEL variable names that correspond to flat JSONB columns.
+        column_aliases: Map CEL identifier names to SQL column names.
+        param_start_index: First placeholder index. Defaults to 1. Useful when
+            embedding the generated fragment in a larger parameterized query.
+            Values less than 1 are clamped to 1.
 
     Returns:
         Result with SQL containing $1, $2, ... placeholders and parameter list.
@@ -131,15 +177,16 @@ def convert_parameterized(
 
     tree = _parser.parse(cel_expr)
 
-    kwargs: dict[str, Any] = {"parameterize": True}
-    if schemas is not None:
-        kwargs["schemas"] = schemas
-    if max_depth is not None:
-        kwargs["max_depth"] = max_depth
-    if max_output_length is not None:
-        kwargs["max_output_length"] = max_output_length
-    if validate_schema:
-        kwargs["validate_schema"] = validate_schema
+    kwargs: dict[str, Any] = _build_kwargs(
+        schemas=schemas,
+        max_depth=max_depth,
+        max_output_length=max_output_length,
+        validate_schema=validate_schema,
+        json_variables=json_variables,
+        column_aliases=column_aliases,
+        param_start_index=param_start_index,
+    )
+    kwargs["parameterize"] = True
 
     converter = Converter(dialect, **kwargs)
     converter.visit(tree)
@@ -162,6 +209,8 @@ def analyze(
     max_depth: int | None = None,
     max_output_length: int | None = None,
     validate_schema: bool = False,
+    json_variables: set[str] | frozenset[str] | list[str] | None = None,
+    column_aliases: dict[str, str] | None = None,
 ) -> AnalysisResult:
     """Analyze a CEL expression for SQL conversion and index recommendations.
 
@@ -173,6 +222,8 @@ def analyze(
         max_output_length: Maximum SQL output length.
         validate_schema: If True, raise InvalidSchemaError for unrecognized
             table or field references. Requires schemas to be provided.
+        json_variables: CEL variable names that correspond to flat JSONB columns.
+        column_aliases: Map CEL identifier names to SQL column names.
 
     Returns:
         AnalysisResult with SQL and index recommendations.
@@ -189,15 +240,14 @@ def analyze(
     tree = _parser.parse(cel_expr)
 
     # Pass 1: Generate SQL
-    kwargs: dict[str, Any] = {}
-    if schemas is not None:
-        kwargs["schemas"] = schemas
-    if max_depth is not None:
-        kwargs["max_depth"] = max_depth
-    if max_output_length is not None:
-        kwargs["max_output_length"] = max_output_length
-    if validate_schema:
-        kwargs["validate_schema"] = validate_schema
+    kwargs: dict[str, Any] = _build_kwargs(
+        schemas=schemas,
+        max_depth=max_depth,
+        max_output_length=max_output_length,
+        validate_schema=validate_schema,
+        json_variables=json_variables,
+        column_aliases=column_aliases,
+    )
 
     converter = Converter(dialect, **kwargs)
     converter.visit(tree)
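Review note on `_build_kwargs`: the three new options are normalised before they reach `Converter`. The standalone sketch below copies only that part of the helper from this diff so the behaviour can be checked without the package; `normalize_new_options` is a hypothetical name, and the remaining options (`schemas`, `max_depth`, and so on) are left out.

```python
from __future__ import annotations

from typing import Any


def normalize_new_options(
    *,
    json_variables: set[str] | frozenset[str] | list[str] | None = None,
    column_aliases: dict[str, str] | None = None,
    param_start_index: int | None = None,
) -> dict[str, Any]:
    """Mirror the normalisation that _build_kwargs applies to the new options."""
    kwargs: dict[str, Any] = {}
    if json_variables is not None:
        # Any accepted collection becomes an immutable frozenset.
        kwargs["json_variables"] = frozenset(json_variables)
    if column_aliases is not None:
        # Shallow copy, so later changes to the caller's dict are not seen.
        kwargs["column_aliases"] = dict(column_aliases)
    if param_start_index is not None:
        # Values below 1 are clamped to 1.
        kwargs["param_start_index"] = max(1, param_start_index)
    return kwargs


aliases = {"name": "usr_name"}
options = normalize_new_options(
    json_variables=["context", "context"],
    column_aliases=aliases,
    param_start_index=0,
)
aliases["age"] = "usr_age"  # mutate the caller's dict after the call

print(options["json_variables"])     # frozenset({'context'})
print(options["column_aliases"])     # {'name': 'usr_name'}
print(options["param_start_index"])  # 1
print(normalize_new_options())       # {}
```

Options left at `None` are omitted entirely, which lets `Converter` keep its own defaults.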