19 changes: 12 additions & 7 deletions CLAUDE.md
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

## What This Project Does

pycel2sql converts CEL (Common Expression Language) expressions into SQL WHERE clauses. It supports five SQL dialects: PostgreSQL, DuckDB, BigQuery, MySQL, and SQLite.
pycel2sql converts CEL (Common Expression Language) expressions into SQL WHERE clauses. It supports six SQL dialects: PostgreSQL, DuckDB, BigQuery, MySQL, SQLite, and Apache Spark.

## Commands

@@ -68,7 +68,7 @@ Lark grammar rule names encode operators: `relation_eq`, `addition_add`, `multip
| `__init__.py` | Public API: `convert()`, `convert_parameterized()`, `analyze()`, `introspect()` |
| `_converter.py` | Core Converter — Lark Interpreter with visitor methods for every grammar rule |
| `dialect/_base.py` | `Dialect` ABC (40+ abstract methods), `WriteFunc` type alias, `IndexAdvisor` protocol |
| `dialect/{postgres,duckdb,bigquery,mysql,sqlite}.py` | Concrete dialect implementations |
| `dialect/{postgres,duckdb,bigquery,mysql,sqlite,spark}.py` | Concrete dialect implementations |
| `schema.py` | `Schema` / `FieldSchema` for JSON/array field detection |
| `_analysis.py` | `IndexAnalyzer` — second-pass tree walker for index recommendations |
| `_utils.py` | Validation, escaping, RE2→SQL regex conversion |
@@ -78,11 +78,12 @@ Lark grammar rule names encode operators: `relation_eq`, `addition_add`, `multip

### Dialect Differences

- **PostgreSQL**: `$N` params, `ARRAY[...]`, `~ / ~*` regex, `->>/->` JSON, `POSITION()` for contains
- **DuckDB**: `$N` params, `[...]` arrays, RE2 regex, `CONTAINS()`, `STRING_SPLIT()`
- **BigQuery**: `@pN` params, `[...]` arrays, `REGEXP_CONTAINS()`, `JSON_VALUE()`, `TIMESTAMP_ADD/SUB()`
- **MySQL**: `?` params, `JSON_ARRAY()`, `REGEXP`, `JSON_TABLE()` for unnest
- **SQLite**: `?` params, `json_array()`, no regex/split/join, `json_each()` for unnest
- **PostgreSQL**: `$N` params, `ARRAY[...]`, `~ / ~*` regex, `->>/->` JSON, `POSITION()` for contains, `FORMAT()`
- **DuckDB**: `$N` params, `[...]` arrays, RE2 regex, `CONTAINS()`, `STRING_SPLIT()`, `printf()`
- **BigQuery**: `@pN` params, `[...]` arrays, `REGEXP_CONTAINS()`, `JSON_VALUE()`, `TIMESTAMP_ADD/SUB()`, `FORMAT()`
- **MySQL**: `?` params, `JSON_ARRAY()`, `REGEXP`, `JSON_TABLE()` for unnest, `format()` raises `UnsupportedDialectFeatureError`
- **SQLite**: `?` params, `json_array()`, no regex/split/join, `json_each()` for unnest, `printf()`
- **Apache Spark**: `?` positional params, `array(...)`, `RLIKE`, `get_json_object()`, `concat()`, `array_contains(arr, elem)` (arg order swap), `EXPLODE` / `(SELECT collect_list(...))`, `format_string()`, `(dayofweek(t) - 1)` for day-of-week, JSON array membership raises (no boolean predicate available)
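
The Spark day-of-week offset in the list above can be checked with a small sketch. It assumes CEL's `getDayOfWeek()` is zero-based with Sunday as 0, and that Spark's `dayofweek()` returns 1 for Sunday through 7 for Saturday. The helper names are hypothetical and not part of pycel2sql.

```python
from datetime import date


def spark_dayofweek(d: date) -> int:
    """Mimic Spark SQL dayofweek(): 1 = Sunday ... 7 = Saturday."""
    # isoweekday() is 1 = Monday ... 7 = Sunday, so rotate Sunday to the front.
    return d.isoweekday() % 7 + 1


def cel_day_of_week(d: date) -> int:
    """Zero-based day of week (0 = Sunday), i.e. (dayofweek(t) - 1)."""
    return spark_dayofweek(d) - 1


sunday = date(2024, 1, 7)
saturday = date(2024, 1, 6)
print(cel_day_of_week(sunday))    # 0
print(cel_day_of_week(saturday))  # 6
```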

### Test Organization

@@ -97,4 +98,8 @@ Unit tests (`tests/test_*.py`) cover each feature area per dialect. Integration
- Depth tracking: `_visit_child()` increments/decrements `_depth` and checks limits
- Error types use dual messaging pattern to prevent information disclosure (CWE-209)
- `validate_schema` parameter: opt-in strict validation on `convert()`/`convert_parameterized()`/`analyze()`. Validates `table.field` references exist in schemas; skips comprehension variables, bare identifiers, and nested JSON keys beyond the first field. Raises `InvalidSchemaError` (with dual messaging). Requires schemas to be provided.
- `json_variables` parameter: opt-in declaration that named CEL variables are flat JSONB columns. Field access (dot or bracket) emits dialect-specific JSON extraction. Takes precedence over schema-declared JSON. Comprehension iter vars shadow `json_variables` (collisions are not treated as JSON inside the comprehension body).
- `column_aliases` parameter: maps CEL identifier names to SQL column names. The alias is validated against the dialect's identifier rules; the original CEL name remains the schema key (alias is output-only).
- `param_start_index` parameter (only on `convert_parameterized()`): shifts the placeholder counter so the first parameter is `$N` / `@pN` instead of `$1` / `@p1`. Values < 1 are clamped to 1. Positional-`?` dialects (MySQL, SQLite, Spark) ignore the index in placeholder text but still preserve parameter ordering.
- `format()` is dispatched per-dialect via `Dialect.write_format`: PostgreSQL/BigQuery emit `FORMAT(...)`, SQLite/DuckDB emit `printf(...)`, Apache Spark emits `format_string(...)`, MySQL raises `UnsupportedDialectFeatureError`.
- Ruff for linting, mypy strict for type checking, line length 100, target Python 3.12+
88 changes: 78 additions & 10 deletions README.md
@@ -10,6 +10,7 @@
[![BigQuery](https://img.shields.io/badge/BigQuery-669DF6?logo=googlebigquery&logoColor=white)](https://cloud.google.com/bigquery)
[![MySQL](https://img.shields.io/badge/MySQL-4479A1?logo=mysql&logoColor=white)](https://www.mysql.com/)
[![SQLite](https://img.shields.io/badge/SQLite-003B57?logo=sqlite&logoColor=white)](https://www.sqlite.org/)
[![Apache Spark](https://img.shields.io/badge/Apache%20Spark-E25A1C?logo=apachespark&logoColor=white)](https://spark.apache.org/)

Convert [CEL (Common Expression Language)](https://cel.dev/) expressions to SQL WHERE clauses.

@@ -38,7 +39,7 @@ sql = convert('status == "active" || tags.size() > 0')

## Dialects

Five SQL dialects are supported:
Six SQL dialects are supported:

```python
from pycel2sql import convert
@@ -50,9 +51,13 @@ sql = convert('name == "alice"', dialect=get_dialect("mysql"))
sql = convert('name == "alice"', dialect=get_dialect("sqlite"))
sql = convert('name == "alice"', dialect=get_dialect("duckdb"))
sql = convert('name == "alice"', dialect=get_dialect("bigquery"))
sql = convert('name == "alice"', dialect=get_dialect("spark"))

# Or instantiate directly
from pycel2sql import PostgresDialect, MySQLDialect, SQLiteDialect, DuckDBDialect, BigQueryDialect
from pycel2sql import (
    PostgresDialect, MySQLDialect, SQLiteDialect, DuckDBDialect,
    BigQueryDialect, SparkDialect,
)

sql = convert('name == "alice"', dialect=MySQLDialect())
```
@@ -75,13 +80,76 @@ result = convert_parameterized('name == "alice"', dialect=MySQLDialect())

Placeholder styles per dialect:

| Dialect | Placeholder |
|------------|-------------|
| PostgreSQL | `$1`, `$2`, ... |
| DuckDB | `$1`, `$2`, ... |
| BigQuery | `@p1`, `@p2`, ... |
| MySQL | `?` |
| SQLite | `?` |
| Dialect | Placeholder |
|---------------|--------------------|
| PostgreSQL | `$1`, `$2`, ... |
| DuckDB | `$1`, `$2`, ... |
| BigQuery | `@p1`, `@p2`, ... |
| MySQL | `?` (positional) |
| SQLite | `?` (positional) |
| Apache Spark | `?` (positional) |

## Conversion Options

### `json_variables`

Declare CEL variable names that correspond to flat JSONB columns. Field access via dot notation or bracket notation emits dialect-specific JSON extraction:

```python
from pycel2sql import convert

# PostgreSQL: dot and bracket notation both produce ->> operators
sql = convert("context.host == 'a'", json_variables={"context"})
# => context->>'host' = 'a'

sql = convert('context["host"] == "a"', json_variables={"context"})
# => context->>'host' = 'a'

# Nested paths: intermediate keys use ->, final key uses ->>
sql = convert("tags.corpus.section == 'x'", json_variables={"tags"})
# => tags->'corpus'->>'section' = 'x'
```

`json_variables` takes precedence over schema-declared JSON. Comprehension iter vars shadow `json_variables` (collisions are not treated as JSON inside the comprehension body).
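
The nested-path rule above (intermediate keys use `->`, the final key uses `->>`) can be sketched as a standalone function. This is only an illustration of the PostgreSQL output shape shown in the examples. `pg_json_path` is a hypothetical name, not a pycel2sql API, and the sketch assumes keys contain no quotes, so it does no escaping.

```python
def pg_json_path(variable: str, keys: list[str]) -> str:
    """Build a PostgreSQL JSON extraction path for a JSON variable.

    Hypothetical helper: intermediate keys use ->, the final key uses ->>.
    Assumes keys are plain identifiers (no escaping is performed).
    """
    if not keys:
        return variable
    *intermediate, last = keys
    parts = [variable]
    parts += [f"->'{key}'" for key in intermediate]
    parts.append(f"->>'{last}'")
    return "".join(parts)


print(pg_json_path("context", ["host"]))            # context->>'host'
print(pg_json_path("tags", ["corpus", "section"]))  # tags->'corpus'->>'section'
```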

### `column_aliases`

Map CEL identifier names to SQL column names. Useful when database columns use prefixed names while user-facing CEL expressions use clean names:

```python
sql = convert("name == 'a'", column_aliases={"name": "usr_name"})
# => usr_name = 'a'
```

The alias is validated against the dialect's identifier rules. The original CEL name remains the schema key — alias is output-only.
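
A minimal sketch of the alias lookup described above, under stated assumptions: the identifier rule here is a generic regular expression chosen for illustration (the actual per-dialect rules are not shown in this document), and `resolve_column` is a hypothetical name rather than a library function.

```python
import re

# Assumed identifier rule for illustration only; real dialect rules may differ.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def resolve_column(cel_name: str, aliases: dict[str, str]) -> str:
    """Return the SQL column name to emit for a CEL identifier.

    The CEL name stays the lookup key; the alias only affects output.
    """
    sql_name = aliases.get(cel_name, cel_name)
    if not _IDENTIFIER.match(sql_name):
        raise ValueError("invalid column alias")
    return sql_name


print(resolve_column("name", {"name": "usr_name"}))  # usr_name
print(resolve_column("age", {"name": "usr_name"}))   # age
```

Validating the alias rather than the CEL name matters because the alias is the text that reaches the generated SQL.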

### `param_start_index`

Shift the placeholder counter for `convert_parameterized()` when embedding the generated fragment into a larger pre-parameterized query:

```python
result = convert_parameterized(
    "name == 'a' && age > 30",
    param_start_index=5,
)
# result.sql => 'name = $5 AND age > $6'
# result.parameters => ['a', 30]
```

Values less than 1 are clamped to 1. For positional-`?` dialects (MySQL, SQLite, Apache Spark) the placeholder text is unchanged but the parameter ordering is preserved.
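
The numbering and clamping behaviour described above can be sketched independently of the library. The function name and the style labels (`"dollar"`, `"at_p"`, `"qmark"`) are hypothetical and exist only for this illustration.

```python
def placeholders(style: str, count: int, start: int = 1) -> list[str]:
    """Generate placeholder text for `count` parameters.

    Hypothetical helper. `start` mirrors param_start_index: values below 1
    are clamped to 1, and positional "?" placeholders ignore the index.
    """
    start = max(1, start)
    if style == "dollar":   # PostgreSQL / DuckDB style
        return [f"${i}" for i in range(start, start + count)]
    if style == "at_p":     # BigQuery style
        return [f"@p{i}" for i in range(start, start + count)]
    if style == "qmark":    # MySQL / SQLite / Spark style
        return ["?"] * count
    raise ValueError(f"unknown placeholder style: {style}")


print(placeholders("dollar", 2, start=5))  # ['$5', '$6']
print(placeholders("at_p", 2, start=0))    # ['@p1', '@p2']
print(placeholders("qmark", 2, start=5))   # ['?', '?']
```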

### `format()` per-dialect mapping

CEL's `string.format(args)` dispatches to dialect-specific SQL:

| Dialect | Output |
|---------------|-------------------------|
| PostgreSQL | `FORMAT('...', ...)` |
| BigQuery | `FORMAT('...', ...)` |
| SQLite | `printf('...', ...)` |
| DuckDB | `printf('...', ...)` |
| Apache Spark | `format_string('...', ...)` |
| MySQL | raises `UnsupportedDialectFeatureError` |
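
The mapping in this table can be pictured as a lookup keyed by dialect. The sketch below is not the library's `Dialect.write_format` implementation. The dictionary keys, `format_call`, and the `UnsupportedFeature` class are hypothetical stand-ins, and the arguments are assumed to be already-rendered SQL fragments.

```python
# Illustrative dialect keys; the library's actual dialect names may differ.
FORMAT_FUNCTIONS = {
    "postgresql": "FORMAT",
    "bigquery": "FORMAT",
    "sqlite": "printf",
    "duckdb": "printf",
    "spark": "format_string",
}


class UnsupportedFeature(Exception):
    """Stand-in for the library's UnsupportedDialectFeatureError."""


def format_call(dialect: str, template_sql: str, args_sql: list[str]) -> str:
    """Render a format call from SQL fragments that are already escaped."""
    try:
        func = FORMAT_FUNCTIONS[dialect]
    except KeyError:
        raise UnsupportedFeature(f"format() is not supported for {dialect}") from None
    return f"{func}({', '.join([template_sql, *args_sql])})"


print(format_call("spark", "'%s is %d'", ["name", "age"]))
# format_string('%s is %d', name, age)
```

In this sketch MySQL is simply absent from the mapping, so the lookup fails and the stand-in exception is raised.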

## JSON Fields

@@ -173,7 +241,7 @@ schemas = introspect_sqlite(
)
```

All five dialects are supported: `introspect_postgres`, `introspect_duckdb`, `introspect_bigquery`, `introspect_mysql`, `introspect_sqlite`.
All five JDBC-style dialects are supported: `introspect_postgres`, `introspect_duckdb`, `introspect_bigquery`, `introspect_mysql`, `introspect_sqlite`. Apache Spark introspection is not provided — construct `Schema` directly.

## Supported CEL Features

94 changes: 72 additions & 22 deletions src/pycel2sql/__init__.py
@@ -19,6 +19,7 @@
from pycel2sql.dialect.duckdb import DuckDBDialect
from pycel2sql.dialect.mysql import MySQLDialect
from pycel2sql.dialect.postgres import PostgresDialect
from pycel2sql.dialect.spark import SparkDialect
from pycel2sql.dialect.sqlite import SQLiteDialect
from pycel2sql.introspect import introspect
from pycel2sql.schema import Schema
@@ -38,6 +39,7 @@
    "DuckDBDialect",
    "MySQLDialect",
    "PostgresDialect",
    "SparkDialect",
    "SQLiteDialect",
]

@@ -60,6 +62,8 @@ def convert(
    max_depth: int | None = None,
    max_output_length: int | None = None,
    validate_schema: bool = False,
    json_variables: set[str] | frozenset[str] | list[str] | None = None,
    column_aliases: dict[str, str] | None = None,
) -> str:
    """Convert a CEL expression to an inline SQL WHERE clause string.

@@ -71,6 +75,13 @@ def convert(
        max_output_length: Maximum SQL output length. Defaults to 50000.
        validate_schema: If True, raise InvalidSchemaError for unrecognized
            table or field references. Requires schemas to be provided.
        json_variables: CEL variable names that correspond to flat JSONB
            columns. Field access (dot or bracket notation) against these
            variables emits dialect-specific JSON extraction instead of
            plain dot notation.
        column_aliases: Map CEL identifier names to SQL column names. When a
            CEL identifier matches a key, the alias is emitted (and validated
            against the dialect's identifier rules).

    Returns:
        The SQL WHERE clause string.
@@ -84,6 +95,30 @@ def convert(

    tree = _parser.parse(cel_expr)

    kwargs: dict[str, Any] = _build_kwargs(
        schemas=schemas,
        max_depth=max_depth,
        max_output_length=max_output_length,
        validate_schema=validate_schema,
        json_variables=json_variables,
        column_aliases=column_aliases,
    )

    converter = Converter(dialect, **kwargs)
    converter.visit(tree)
    return converter.result


def _build_kwargs(
    *,
    schemas: dict[str, Schema] | None = None,
    max_depth: int | None = None,
    max_output_length: int | None = None,
    validate_schema: bool = False,
    json_variables: set[str] | frozenset[str] | list[str] | None = None,
    column_aliases: dict[str, str] | None = None,
    param_start_index: int | None = None,
) -> dict[str, Any]:
    kwargs: dict[str, Any] = {}
    if schemas is not None:
        kwargs["schemas"] = schemas
@@ -93,10 +128,13 @@ def convert(
        kwargs["max_output_length"] = max_output_length
    if validate_schema:
        kwargs["validate_schema"] = validate_schema

    converter = Converter(dialect, **kwargs)
    converter.visit(tree)
    return converter.result
    if json_variables is not None:
        kwargs["json_variables"] = frozenset(json_variables)
    if column_aliases is not None:
        kwargs["column_aliases"] = dict(column_aliases)
    if param_start_index is not None:
        kwargs["param_start_index"] = max(1, param_start_index)
    return kwargs


def convert_parameterized(
@@ -107,6 +145,9 @@ def convert_parameterized(
    max_depth: int | None = None,
    max_output_length: int | None = None,
    validate_schema: bool = False,
    json_variables: set[str] | frozenset[str] | list[str] | None = None,
    column_aliases: dict[str, str] | None = None,
    param_start_index: int | None = None,
) -> Result:
    """Convert a CEL expression to a parameterized SQL WHERE clause.

@@ -118,6 +159,11 @@ def convert_parameterized(
        max_output_length: Maximum SQL output length. Defaults to 50000.
        validate_schema: If True, raise InvalidSchemaError for unrecognized
            table or field references. Requires schemas to be provided.
        json_variables: CEL variable names that correspond to flat JSONB columns.
        column_aliases: Map CEL identifier names to SQL column names.
        param_start_index: First placeholder index. Defaults to 1. Useful when
            embedding the generated fragment in a larger parameterized query.
            Values less than 1 are clamped to 1.

    Returns:
        Result with SQL containing $1, $2, ... placeholders and parameter list.
@@ -131,15 +177,16 @@ def convert_parameterized(

    tree = _parser.parse(cel_expr)

    kwargs: dict[str, Any] = {"parameterize": True}
    if schemas is not None:
        kwargs["schemas"] = schemas
    if max_depth is not None:
        kwargs["max_depth"] = max_depth
    if max_output_length is not None:
        kwargs["max_output_length"] = max_output_length
    if validate_schema:
        kwargs["validate_schema"] = validate_schema
    kwargs: dict[str, Any] = _build_kwargs(
        schemas=schemas,
        max_depth=max_depth,
        max_output_length=max_output_length,
        validate_schema=validate_schema,
        json_variables=json_variables,
        column_aliases=column_aliases,
        param_start_index=param_start_index,
    )
    kwargs["parameterize"] = True

    converter = Converter(dialect, **kwargs)
    converter.visit(tree)
@@ -162,6 +209,8 @@ def analyze(
    max_depth: int | None = None,
    max_output_length: int | None = None,
    validate_schema: bool = False,
    json_variables: set[str] | frozenset[str] | list[str] | None = None,
    column_aliases: dict[str, str] | None = None,
) -> AnalysisResult:
    """Analyze a CEL expression for SQL conversion and index recommendations.

@@ -173,6 +222,8 @@ def analyze(
        max_output_length: Maximum SQL output length.
        validate_schema: If True, raise InvalidSchemaError for unrecognized
            table or field references. Requires schemas to be provided.
        json_variables: CEL variable names that correspond to flat JSONB columns.
        column_aliases: Map CEL identifier names to SQL column names.

    Returns:
        AnalysisResult with SQL and index recommendations.
@@ -189,15 +240,14 @@ def analyze(
    tree = _parser.parse(cel_expr)

    # Pass 1: Generate SQL
    kwargs: dict[str, Any] = {}
    if schemas is not None:
        kwargs["schemas"] = schemas
    if max_depth is not None:
        kwargs["max_depth"] = max_depth
    if max_output_length is not None:
        kwargs["max_output_length"] = max_output_length
    if validate_schema:
        kwargs["validate_schema"] = validate_schema
    kwargs: dict[str, Any] = _build_kwargs(
        schemas=schemas,
        max_depth=max_depth,
        max_output_length=max_output_length,
        validate_schema=validate_schema,
        json_variables=json_variables,
        column_aliases=column_aliases,
    )

    converter = Converter(dialect, **kwargs)
    converter.visit(tree)