Merged
63 changes: 63 additions & 0 deletions .github/ISSUE_TEMPLATE/contentious-construct.yml
@@ -0,0 +1,63 @@
name: Contentious construct (a parser may reasonably reject)
description: Propose a statement that the engine accepts but that a parser may reasonably decline to support.
title: "[contentious] "
labels: ["contentious"]
body:
  - type: markdown
    attributes:
      value: |
        Use this form when the reference engine accepts a statement but you think a parser may reasonably reject it anyway (for example an engine-specific quirk, a non-standard extension, or a construct with surprising semantics). This feeds the contentious-construct rule registry described in [docs/contentious-constructs.md](https://github.com/LucaCappelletti94/sql_ast_benchmark/blob/main/docs/contentious-constructs.md). The strict, engine-graded recall is never changed by this. Most fields are prefilled when you arrive here from the failures view.
  - type: input
    id: dialect
    attributes:
      label: Dialect
      description: The reference dialect whose engine accepts the statement.
    validations:
      required: true
  - type: input
    id: parser
    attributes:
      label: Parser
      description: The parser whose failures view you came from (optional context).
    validations:
      required: false
  - type: textarea
    id: statement
    attributes:
      label: Statement
      description: The SQL the engine accepts but a parser may reasonably reject. Prefilled from the failures view.
    validations:
      required: true
  - type: textarea
    id: parser_error
    attributes:
      label: Parser error
      description: The error the parser returned (prefilled from the failures view).
    validations:
      required: false
  - type: dropdown
    id: category
    attributes:
      label: Category
      description: Why is this construct contentious? See the design doc for the meaning of each category.
      options:
        - engine-specific
        - non-standard
        - lossy-or-ambiguous
        - deprecated
    validations:
      required: true
  - type: textarea
    id: rationale
    attributes:
      label: Why a parser may reasonably reject it
      description: Explain the divergence and why declining to support it is defensible.
    validations:
      required: true
  - type: textarea
    id: references
    attributes:
      label: References
      description: Links to the SQL standard, engine documentation, or peer-parser behavior that support the category.
    validations:
      required: false
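Reviewer note on the "prefilled from the failures view" wording: GitHub issue forms can be prefilled through URL query parameters keyed by each field's `id`. The diff does not show how the site builds these links, so the following is only a minimal sketch of what such a link could look like. The helper name `prefilled_issue_url` and the sample field values are assumptions for illustration; the repository URL and the field ids come from the template above.

```python
from urllib.parse import urlencode

# Repository URL taken from the link inside the issue template.
REPO = "https://github.com/LucaCappelletti94/sql_ast_benchmark"


def prefilled_issue_url(template: str, fields: dict[str, str]) -> str:
    """Build a new-issue URL whose query keys are the form's field ids.

    Hypothetical helper: the actual link builder is not part of this diff.
    """
    query = urlencode({"template": template, **fields})
    return f"{REPO}/issues/new?{query}"


url = prefilled_issue_url(
    "contentious-construct.yml",
    {
        "dialect": "sqlite",
        "parser": "sqlite3-parser",
        "statement": "select $testnamespace::xyz",
    },
)
print(url)
```

`urlencode` percent-encodes the statement, so characters such as `$` and `:` survive the round trip through the URL.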
51 changes: 51 additions & 0 deletions .github/ISSUE_TEMPLATE/not-contentious.yml
@@ -0,0 +1,51 @@
name: Dispute a contentious classification
description: Argue that a statement currently marked as a contentious construct is not actually contentious and a parser should be expected to accept it.
title: "[not-contentious] "
labels: ["contentious-dispute"]
body:
  - type: markdown
    attributes:
      value: |
        Use this form to push back on a statement the site marks as a contentious construct (an intentional divergence). If you think it is a normal statement that a parser should accept, not a quirk a parser may reasonably reject, say so here. This challenges the rule, not any parser. See [docs/contentious-constructs.md](https://github.com/LucaCappelletti94/sql_ast_benchmark/blob/main/docs/contentious-constructs.md). Most fields are prefilled when you arrive here from the failures view.
  - type: input
    id: rule
    attributes:
      label: Contentious rule being disputed
      description: The rule that currently tags this statement (prefilled from the failures view).
    validations:
      required: true
  - type: input
    id: dialect
    attributes:
      label: Dialect
      description: The reference dialect the statement is from.
    validations:
      required: true
  - type: input
    id: parser
    attributes:
      label: Parser
      description: The parser whose failures view you came from (optional context).
    validations:
      required: false
  - type: textarea
    id: statement
    attributes:
      label: Statement
      description: The statement tagged contentious. Prefilled from the failures view.
    validations:
      required: true
  - type: textarea
    id: parser_error
    attributes:
      label: Parser error
      description: The error the parser returned (prefilled from the failures view).
    validations:
      required: false
  - type: textarea
    id: rationale
    attributes:
      label: Why it is not contentious
      description: Explain why this is a normal statement a parser should accept, not a quirk a parser may reasonably reject.
    validations:
      required: true
51 changes: 51 additions & 0 deletions .github/ISSUE_TEMPLATE/should-be-rejected.yml
@@ -0,0 +1,51 @@
name: Statement should be rejected (invalid syntax)
description: Report a statement the reference engine accepted but that is actually invalid, so its valid label is wrong.
title: "[invalid-label] "
labels: ["oracle-label"]
body:
  - type: markdown
    attributes:
      value: |
        Use this form when a statement is shown as a parser miss but you believe the statement is actually invalid and should be rejected. The benchmark labels a statement valid only when the reference database engine accepts it, so this report is about the engine's label, not about any parser. Most fields are prefilled when you arrive here from the failures view.
  - type: input
    id: dialect
    attributes:
      label: Dialect
      description: The reference dialect whose engine labeled the statement valid.
    validations:
      required: true
  - type: input
    id: parser
    attributes:
      label: Parser
      description: The parser whose failures view you came from (optional context).
    validations:
      required: false
  - type: textarea
    id: statement
    attributes:
      label: Statement
      description: The SQL the engine accepted. Prefilled from the failures view.
    validations:
      required: true
  - type: textarea
    id: parser_error
    attributes:
      label: Parser error
      description: The error the parser returned (prefilled from the failures view).
    validations:
      required: false
  - type: textarea
    id: rationale
    attributes:
      label: Why it should be rejected
      description: Explain why this statement is invalid SQL despite the engine accepting it.
    validations:
      required: true
  - type: textarea
    id: references
    attributes:
      label: References
      description: Links to the SQL standard, the engine's own documentation, or anything showing the statement should be rejected.
    validations:
      required: false
18 changes: 18 additions & 0 deletions CONTRIBUTING.md
@@ -34,6 +34,24 @@ cargo run --bin sqlbench -- export # read all of the above, write web/asse

The charts are rendered in the browser from the JSON by the shared `viz` crate (plotters, SVG backend), so no chart images are committed.

## Contentious constructs

A contentious construct is one the reference engine accepts but a parser may reasonably decline to support (a niche engine quirk, a non-standard extension, a lossy or deprecated form). The benchmark keeps strict, oracle-graded recall as the headline number and adds a secondary "recall excluding contentious" beside it, plus a per-statement badge on the failures view. The design is written up in [docs/contentious-constructs.md](docs/contentious-constructs.md).

Rules are data: one TOML file per rule under `contentious/`. A regex rule needs no Rust (the pattern is matched against a masked form of the statement, so string literals and comments cannot trigger it). A structural rule (for a property a regex cannot express, like a repeated identifier) names a built-in predicate added in `src/contentious.rs`. Each file declares `id`, `title`, `category` (`engine-specific`, `non-standard`, `lossy-or-ambiguous`, or `deprecated`), the `dialects` it may fire in, `kind` (`regex` with a `pattern`, or `structural` with a `predicate`), `description`, `references`, and `matches` / `non_matches` example statements. See the existing files under `contentious/` for the shape.

To add a rule:

```bash
# 1. write contentious/<id>.toml
# 2. run the guards: examples match/non-match, ids unique,
#    each rule covers >=1 engine-valid corpus statement
cargo test -p sql_ast_benchmark --lib contentious
# 3. refresh web/assets/bench.json.zst
cargo run --release --bin sqlbench -- export
```
```

Then open a PR. A regex rule is a data-only change. Review is a data review: is the construct genuinely engine-valid, is the category honest, are the references real. The classifier only ever runs on engine-valid statements, so a rule can never change strict recall or excuse genuinely-invalid SQL.
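
To make the guards concrete, here is a minimal sketch of two of them (unique ids, and `matches` / `non_matches` examples behaving as declared). It is written in Python with the standard `re` module purely as a stand-in; the real checks are Rust tests that this diff does not show, and the function name `check_rules` and the dict layout are assumptions. The corpus-coverage guard is left out because it needs the benchmark corpus.

```python
import re
from collections import Counter

# Pattern and examples copied from contentious/sqlite-nonstandard-alter.toml.
ALTER_RULE = {
    "id": "sqlite-nonstandard-alter",
    "pattern": r"(?i)\balter\s+table\b.*(\b(add|drop)\s+constraint\b"
               r"|\badd\s+check\b"
               r"|\balter\s+(column\s+)?\S+\s+(set|drop)\s+not\s+null\b)",
    "matches": ["ALTER TABLE abc DROP CONSTRAINT two"],
    "non_matches": ["ALTER TABLE t DROP COLUMN c"],
}


def check_rules(rules):
    """Return a list of problems; an empty list means the guards pass."""
    problems = []
    # Guard: every rule id appears exactly once.
    for rule_id, count in Counter(r["id"] for r in rules).items():
        if count > 1:
            problems.append(f"duplicate id: {rule_id}")
    # Guard: declared examples match, declared counter-examples do not.
    for rule in rules:
        pattern = re.compile(rule["pattern"])
        for stmt in rule["matches"]:
            if not pattern.search(stmt):
                problems.append(f"{rule['id']}: expected match: {stmt}")
        for stmt in rule["non_matches"]:
            if pattern.search(stmt):
                problems.append(f"{rule['id']}: unexpected match: {stmt}")
    return problems


# A deliberately broken copy (hypothetical) to show what a failure looks like.
broken = dict(ALTER_RULE, non_matches=["ALTER TABLE abc ADD CHECK (z>=0)"])

print(check_rules([ALTER_RULE]))          # no problems expected
print(check_rules([ALTER_RULE, broken]))  # duplicate id + unexpected match
```

Returning a list of problems rather than failing on the first one lets a contributor fix every bad example in a single pass.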

## Time machine (per-version history)

The `timemachine` crate benchmarks several historical versions of each pure-Rust parser and writes `web/assets/history.json.zst` (committed, embedded and decompressed in wasm with `ruzstd`, so the site still does no runtime fetch). It hosts many versions of one crate at once with `package`-rename aliases, which works because different `0.x` minors are semver-incompatible. The FFI parsers (`pg_query`) are excluded: two libpg_query builds export the same C symbols and collide at link time.
5 changes: 5 additions & 0 deletions Cargo.toml
@@ -39,13 +39,18 @@ sqlite3-parser = "0.16.0"
turso_parser = "0.6.1"
fallible-iterator = "0.3.0"
viz = { path = "viz" }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
# Compress the per-(dialect, parser) rejected-statement TSV downloads.
zstd = "0.13"
# Pre-render the failing-statement previews to static highlighted HTML at export
# time, so the web viewer ships no runtime syntax highlighter. `default-fancy`
# uses the pure-Rust regex engine, avoiding the oniguruma C dependency.
syntect = { version = "5", default-features = false, features = ["default-fancy"] }
# Contentious-construct rule registry: regex rules matched in guaranteed linear
# time (no backreferences) and one TOML file per rule under `contentious/`.
regex = "1"
toml = "0.8"

[[bench]]
name = "parsing"
23 changes: 23 additions & 0 deletions contentious/duplicate-target-columns.toml
@@ -0,0 +1,23 @@
id = "duplicate-target-columns"
title = "Duplicate columns in a target list"
category = "non-standard"
dialects = ["sqlite"]
# A repeated identifier inside one group cannot be a regex (Rust's regex has no
# backreferences), so this is a structural rule backed by a named predicate.
kind = "structural"
predicate = "duplicate_columns"
description = "A column named more than once in an INSERT target list, an UPDATE SET column group, or a USING clause. SQLite accepts these (INSERT keeps the first value for a repeated column, UPDATE ... SET keeps the last), but PostgreSQL rejects them and a parser that does the same is arguably safer."
references = [
"https://github.com/gwenn/lemon-rs/issues/89",
"https://www.postgresql.org/docs/current/sql-insert.html",
]
matches = [
"INSERT INTO dup1(a,b,c,a,b,c) VALUES(1,2,3,4,5,6)",
"UPDATE t1 SET (a,a,a,b)=(SELECT 99,100,101,102)",
"SELECT y FROM t1 JOIN t1 USING (y,y)",
]
non_matches = [
"INSERT INTO t(a,b,c) VALUES(1,2,3)",
"UPDATE t1 SET (a,b)=(SELECT 1,2)",
"SELECT y FROM t1 JOIN t2 USING (y)",
]
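
Reviewer note: the `duplicate_columns` predicate itself lives in `src/contentious.rs` and is not part of this diff. As a rough illustration of the property it checks, here is a Python approximation that reproduces the `matches` / `non_matches` above. The function name `has_duplicate_columns`, the three context patterns, and the restriction to bare (unquoted) identifiers are all assumptions of this sketch, not the project's implementation.

```python
import re

IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
# A parenthesised, comma-separated list of bare identifiers; group 1 is the list.
IDENT_LIST = rf"\(\s*({IDENT}(?:\s*,\s*{IDENT})*)\s*\)"

# The three places the rule description names: INSERT target list,
# UPDATE ... SET column group, and USING clause.
CONTEXTS = [
    re.compile(rf"(?i)\binsert\s+into\s+{IDENT}\s*{IDENT_LIST}"),
    re.compile(rf"(?i)\bset\s*{IDENT_LIST}\s*="),
    re.compile(rf"(?i)\busing\s*{IDENT_LIST}"),
]


def has_duplicate_columns(stmt: str) -> bool:
    """True if any recognised column group names a column more than once."""
    for context in CONTEXTS:
        for m in context.finditer(stmt):
            # SQL identifiers are case-insensitive, so compare lowercased.
            names = [name.strip().lower() for name in m.group(1).split(",")]
            if len(names) != len(set(names)):
                return True
    return False


print(has_duplicate_columns("INSERT INTO dup1(a,b,c,a,b,c) VALUES(1,2,3,4,5,6)"))
print(has_duplicate_columns("INSERT INTO t(a,b,c) VALUES(1,2,3)"))
```

Note that the duplicate check happens in ordinary code after the regex has isolated a group: the regex only finds candidate lists, and comparing the names against a set does the work a backreference would otherwise be needed for.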
27 changes: 27 additions & 0 deletions contentious/sqlite-nonstandard-alter.toml
@@ -0,0 +1,27 @@
id = "sqlite-nonstandard-alter"
title = "Non-standard SQLite ALTER TABLE operations"
category = "non-standard"
dialects = ["sqlite"]
kind = "regex"
# SQLite's documented ALTER TABLE supports only RENAME, ADD COLUMN, RENAME COLUMN,
# and DROP COLUMN. Its grammar nonetheless tolerates ADD/DROP CONSTRAINT, ADD
# CHECK, and ALTER [COLUMN] <col> SET/DROP NOT NULL without a syntax error, so the
# engine labels them valid. The alternation is anchored on ALTER TABLE and targets
# only the unsupported operations, so the supported forms never match.
pattern = '(?i)\balter\s+table\b.*(\b(add|drop)\s+constraint\b|\badd\s+check\b|\balter\s+(column\s+)?\S+\s+(set|drop)\s+not\s+null\b)'
description = "ALTER TABLE operations the SQLite grammar accepts without a syntax error but that are not part of supported SQLite (ADD/DROP CONSTRAINT, ADD CHECK, ALTER COLUMN SET/DROP NOT NULL). A parser modeling real SQLite may reasonably reject them."
references = [
"https://www.sqlite.org/lang_altertable.html",
"https://github.com/tursodatabase/turso",
]
matches = [
"ALTER TABLE t1 ADD CONSTRAINT c1 CHECK(a=b)",
"ALTER TABLE t2 ALTER x SET NOT NULL",
"ALTER TABLE abc DROP CONSTRAINT two",
"ALTER TABLE abc ADD CHECK (z>=0)",
]
non_matches = [
"ALTER TABLE t ADD COLUMN c INTEGER",
"ALTER TABLE t RENAME TO t2",
"ALTER TABLE t DROP COLUMN c",
]
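
Reviewer note on the `.*` in this pattern: in Rust's `regex` crate, as in Python's `re`, `.` does not match a newline unless the `s` flag is set. Whether the classifier ever sees multi-line statements (or whether the masked form normalizes whitespace first) is not visible in this diff, so treat the following as a possible caveat rather than a confirmed bug. The sketch uses Python's `re` as a stand-in and the helper name `fires` is hypothetical.

```python
import re

# Pattern copied from the rule file above.
PATTERN = (
    r"(?i)\balter\s+table\b.*(\b(add|drop)\s+constraint\b"
    r"|\badd\s+check\b"
    r"|\balter\s+(column\s+)?\S+\s+(set|drop)\s+not\s+null\b)"
)

single_line = "ALTER TABLE t1 ADD CONSTRAINT c1 CHECK(a=b)"
multi_line = "ALTER TABLE t1\n  ADD CONSTRAINT c1 CHECK(a=b)"


def fires(stmt: str, flags: int = 0) -> bool:
    """True if the rule pattern is found anywhere in the statement."""
    return re.search(PATTERN, stmt, flags) is not None


print(fires(single_line))            # the rule's own example
print(fires(multi_line))             # `.*` stops at the newline
print(fires(multi_line, re.DOTALL))  # with dot-matches-newline enabled
```

If multi-line statements can reach the classifier, prefixing the pattern with `(?is)` instead of `(?i)` would be one way to cover them.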
24 changes: 24 additions & 0 deletions contentious/tcl-variables.toml
@@ -0,0 +1,24 @@
id = "tcl-variables"
title = "TCL bind variables"
category = "engine-specific"
dialects = ["sqlite"]
kind = "regex"
# Matched against the masked form (string and blob literals and comments
# replaced), so `$::x` inside a string literal does not trigger. Covers `$::name`
# and `$namespace::name`.
pattern = '\$\w*::'
description = "Bind variables using TCL :: namespaces, valid only because SQLite began as a TCL extension and meaningless outside a TCL interpreter."
references = [
"https://sqlite.org/lang_expr.html#parameters",
"https://github.com/gwenn/lemon-rs/issues/102",
]
matches = [
"INSERT INTO t1 VALUES($::w,$::x,$::y,$::z)",
"select $testnamespace::xyz",
"SELECT strftime($::FMT,$::TS,'unixepoch')",
]
non_matches = [
"SELECT :w, :x",
"SELECT '$::x' AS lit",
"SELECT x::int",
]
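
Reviewer note: the `SELECT '$::x' AS lit` non-match only holds because the pattern runs on the masked form. The masking code is not in this diff, so here is a minimal sketch of the idea in Python, assuming masking means overwriting string literals, blob literals, and comments with spaces. The names `mask` and `is_tcl_variable` and the exact masking regex are assumptions for illustration.

```python
import re

# Spans to blank out before rule patterns run (assumed set, see lead-in).
MASKABLE = re.compile(
    r"""
    [xX]?'(?:[^']|'')*'     # string or blob literal, '' is an escaped quote
  | --[^\n]*                # line comment
  | /\*.*?\*/               # block comment
    """,
    re.VERBOSE | re.DOTALL,
)

# Pattern copied from the rule file above.
TCL_VARIABLE = re.compile(r"\$\w*::")


def mask(stmt: str) -> str:
    """Replace each maskable span with spaces, preserving statement length."""
    return MASKABLE.sub(lambda m: " " * len(m.group(0)), stmt)


def is_tcl_variable(stmt: str) -> bool:
    """Apply the rule pattern to the masked statement."""
    return TCL_VARIABLE.search(mask(stmt)) is not None


print(is_tcl_variable("SELECT strftime($::FMT,$::TS,'unixepoch')"))
print(is_tcl_variable("SELECT '$::x' AS lit"))
```

Replacing masked spans with spaces of equal length keeps byte offsets stable, which would matter if match positions were ever reported back against the original statement.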