Repair corpus and harden oracle (issue #22), plus sampled batch axis, fidelity removal, and overall score by LucaCappelletti94 · Pull Request #24 · LucaCappelletti94/sql_ast_benchmark

LucaCappelletti94 · 2026-06-14T12:49:23Z

Two things in one PR. First, the issue-#22 corpus and oracle fix: the old extractor split compound statements on inner ;, shredding CREATE TRIGGER ... BEGIN ... END bodies, PL/SQL blocks, and GO/DELIMITER batch separators into invalid fragments (raised on gwenn/lemon-rs#102). The SQLite, Spark, and Oracle suites are rebuilt from source with dialect-aware splitters (build_sqlite_suite, build_proc_suites), repair_corpus cleans the rest in place, and a guard test checks every CREATE TRIGGER line parses. The real-engine oracle is hardened so a transport error or engine crash is never recorded as valid: ClickHouse 23.3 and the Postgres backend each segfault on a specific statement, and both now reconnect, resume, and skip the poison statement. All six validity caches are regenerated, correcting 2411 Postgres, 767 ClickHouse, and 380 SQLite labels, all from valid to invalid.

Second, three parked features consolidated on top: a sampled batch axis (k=200 batches of m=128 drawn from each parser's accepted set, reported as a batch parse-accuracy plus per-statement time and memory), removal of the fidelity metric across grading, export, schema, UI, and docs, and a per-parser overall score (0 to 100) in web/src/score.rs shown as a leaderboard and a per-parser stat. The merge kept batch-accuracy's src/batch.rs (already a superset of drop-fidelity's batch fixes) and reconciled the score with the fidelity removal. bench.json and history.json are regenerated against the repaired corpus and corrected labels. One open question carried over from the parked work: whether provenance-only dialects over-reward permissive parsers in the overall score.

Fixes #22.

Add a single overall score per parser (0 to 100) blending correctness, robustness, project health, speed, and memory, with the five sub-scores shown alongside it. Scoring lives in web/src/score.rs and is computed from the existing bundle, featurescan, and metadata, so no benchmark re-run is needed. Surfaced as a ranked leaderboard on the overview and an overall-score section plus hero stat on each parser page. Weights: correctness 45, robustness 20, health 15, speed 12, memory 8. Each parser is scored only over the dialects it models; correctness and health are absolute while speed and memory are ranked against the field within each dialect. Parked as work in progress: open question is whether provenance-only dialects over-reward permissive parsers.
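The weighting described above can be sketched roughly as follows. This is an illustration only, not the code in web/src/score.rs: the `SubScores` struct, the `overall` and `rank_score` functions, and the assumption that every sub-score is normalized to 0.0..=1.0 are all hypothetical. Only the five weights come from the description.

```rust
/// Hypothetical container: one parser's sub-scores, each assumed to be
/// normalized to the range 0.0..=1.0 before blending.
#[derive(Debug, Clone, Copy)]
pub struct SubScores {
    pub correctness: f64,
    pub robustness: f64,
    pub health: f64,
    pub speed: f64,
    pub memory: f64,
}

/// Weights from the description, in the same order as the fields above.
/// They sum to 100, so a perfect parser scores exactly 100.
pub const WEIGHTS: [f64; 5] = [45.0, 20.0, 15.0, 12.0, 8.0];

/// Blend the five sub-scores into a single 0..=100 value.
pub fn overall(s: &SubScores) -> f64 {
    let parts = [s.correctness, s.robustness, s.health, s.speed, s.memory];
    parts
        .iter()
        .zip(WEIGHTS.iter())
        .map(|(&p, &w)| p.clamp(0.0, 1.0) * w)
        .sum()
}

/// Assumed rank-based normalization for "lower is better" metrics such as
/// time or memory: the share of the other parsers in `field` that are
/// strictly worse than `value`. `field` is expected to include `value`.
pub fn rank_score(value: f64, field: &[f64]) -> f64 {
    if field.len() < 2 {
        return 1.0; // a parser alone in its dialect has nobody to lose to
    }
    let worse = field.iter().filter(|&&other| other > value).count();
    worse as f64 / (field.len() - 1) as f64
}
```

Under these assumptions a parser with perfect correctness and nothing else would score 45, which shows how strongly the blend leans on correctness relative to speed and memory.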

…R bug report Parked work in progress. Includes the fidelity-metric removal (grading, export, schema, UI, docs), exploratory batch fixes in src/batch.rs (terminator on its own line, COPY-FROM-STDIN exclusion), the regenerated bench.json, and the standalone sqlparser CREATE USER / ALTER USER terminator-swallow bug report.

Replace the all-or-nothing whole-script batch with k=200 random batches of m=128 statements drawn from the set each parser individually accepts. Report batch accuracy (share of batches that reparse to the exact statement count) plus per-statement time and memory over the correctly parsed batches. Sampling localizes a terminator bug to its batch instead of voiding the whole measurement, so real but narrow parser bugs surface at their true rate (sqlparser CREATE USER/ALTER USER) while clean parsers sit at 100 percent. Shared sampler in src/batch.rs (seeded SplitMix64), wired into the bench, membench, time machine, schema, export, and the web UI (batch ok% column). pg_query summary is excluded since it reports statement types, not a count.
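A minimal sketch of the sampling idea, assuming indices are drawn independently (with replacement) from the accepted set and that statements are joined with `;` terminators. The function names, the replacement policy, and the `count_statements` callback are assumptions made for illustration; the actual sampler in src/batch.rs may differ. The generator follows the standard SplitMix64 recurrence.

```rust
/// Seeded SplitMix64 generator (standard constants and mixing steps).
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Draw `k` batches of `m` indices into an accepted set of size `accepted`.
/// Assumption: indices are sampled with replacement.
pub fn sample_batches(accepted: usize, k: usize, m: usize, seed: u64) -> Vec<Vec<usize>> {
    let mut batches = Vec::with_capacity(k);
    if accepted == 0 {
        return batches;
    }
    let mut rng = SplitMix64::new(seed);
    for _ in 0..k {
        let mut batch = Vec::with_capacity(m);
        for _ in 0..m {
            batch.push((rng.next_u64() % accepted as u64) as usize);
        }
        batches.push(batch);
    }
    batches
}

/// Share of batches whose joined script reparses to exactly the number of
/// statements that went in. `count_statements` stands in for a parser and
/// returns `None` when the script fails to parse.
pub fn batch_accuracy<F>(statements: &[&str], batches: &[Vec<usize>], count_statements: F) -> f64
where
    F: Fn(&str) -> Option<usize>,
{
    if batches.is_empty() {
        return 0.0;
    }
    let ok = batches
        .iter()
        .filter(|batch| {
            let script: String = batch
                .iter()
                .map(|&i| format!("{};\n", statements[i]))
                .collect();
            count_statements(&script) == Some(batch.len())
        })
        .count();
    ok as f64 / batches.len() as f64
}
```

Because the seed is fixed, the same batches are drawn on every run, so a parser that swallows a terminator fails only the batches containing the offending statement, and fails them reproducibly.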

The old extractor split compound statements on inner semicolons, shredding trigger bodies, PL/SQL blocks, and batch separators into invalid fragments. build_sqlite_suite rebuilds the SQLite official suite with a trigger- and CASE-aware splitter that keeps BEGIN..END bodies intact. build_proc_suites rebuilds Spark and Oracle from source, keeping whole PL/SQL blocks (Oracle anonymous blocks isolated under datasets/special, with their inner DML harvested as individual statements). repair_corpus cleans the remaining files in place: it splits T-SQL GO batch separators, drops procedural fragments and DELIMITER wreckage, and removes leaked non-SQL lines. Adds a guard test asserting every CREATE TRIGGER line parses, registers the new binaries, and repacks datasets.tar.zst.
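To illustrate what a trigger- and CASE-aware split involves, here is a simplified sketch. It is not the build_sqlite_suite implementation: the function name `split_statements` is hypothetical, and the sketch assumes single-quoted string literals only, no comments, and that BEGIN opens a body only inside a CREATE ... TRIGGER statement (so a bare `BEGIN;` transaction still splits normally).

```rust
/// Split a script on `;`, keeping CREATE TRIGGER ... BEGIN ... END bodies
/// whole. Simplified: ignores comments and double-quoted identifiers.
pub fn split_statements(sql: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new(); // text of the statement being built
    let mut word = String::new(); // keyword/identifier being scanned
    let mut depth = 0usize; // open BEGIN/CASE blocks inside a trigger
    let mut words_seen = 0usize;
    let mut is_create = false;
    let mut saw_trigger = false;
    let mut in_string = false;

    for ch in sql.chars() {
        if in_string {
            // Inside '...': copy verbatim; a doubled '' re-enters on the next quote.
            current.push(ch);
            if ch == '\'' {
                in_string = false;
            }
            continue;
        }
        if ch.is_alphanumeric() || ch == '_' {
            word.push(ch);
            current.push(ch);
            continue;
        }
        // Word boundary: update block depth before looking at `ch`.
        if !word.is_empty() {
            match word.to_ascii_uppercase().as_str() {
                "CREATE" if words_seen == 0 => is_create = true,
                "TRIGGER" if is_create => saw_trigger = true,
                "BEGIN" if saw_trigger => depth += 1,
                "CASE" if depth > 0 => depth += 1,
                "END" if depth > 0 => depth -= 1,
                _ => {}
            }
            words_seen += 1;
            word.clear();
        }
        match ch {
            '\'' => {
                in_string = true;
                current.push(ch);
            }
            ';' if depth == 0 => {
                // Top-level terminator: emit the statement and reset state.
                let stmt = current.trim();
                if !stmt.is_empty() {
                    out.push(stmt.to_string());
                }
                current.clear();
                words_seen = 0;
                is_create = false;
                saw_trigger = false;
            }
            _ => current.push(ch),
        }
    }
    let tail = current.trim();
    if !tail.is_empty() {
        out.push(tail.to_string());
    }
    out
}
```

Counting CASE matters because a `CASE ... END` expression inside the trigger body would otherwise close the BEGIN block early and let the next inner semicolon split the trigger.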

The labelers defaulted transport errors and engine crashes to valid, which silently mislabeled large tails. ClickHouse 23.3 segfaults parsing one statement and the Postgres backend crashes on another, so the old error-to-valid fallback marked every statement after the crash valid (deterministic, so re-running reproduced it exactly). Each labeler now treats only a real engine answer as a verdict, drains responses, retries transient failures, and either reconnects and resumes past a confirmed poison statement (ClickHouse, Postgres) or aborts without writing a cache (MySQL, T-SQL, SQLite). Postgres also special-cases COPY TO STDOUT, which is valid but breaks the simple-query protocol. Regenerates all six caches against the repaired corpus, correcting 2411 Postgres, 767 ClickHouse, and 380 SQLite labels, all from valid to invalid.
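The labeling rule above (only a real engine answer counts as a verdict) can be sketched like this. All names here are hypothetical, including the `Skipped` label used for a poison statement; the description does not say how the real labelers record skipped statements, and the reconnect step is reduced to a comment.

```rust
/// What one round trip to the engine produced.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Valid,     // engine accepted the statement
    Invalid,   // engine rejected it with a real error
    Transport, // connection dropped, timeout, or engine crash: not a verdict
}

/// What to do when a statement never yields an engine answer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OnPoison {
    SkipAndResume, // ClickHouse and Postgres style
    Abort,         // MySQL, T-SQL, and SQLite style: write no cache
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Label {
    Valid,
    Invalid,
    Skipped, // assumed marker for a confirmed poison statement
}

/// Label every statement, retrying transport failures up to `retries` times.
/// A transport failure is never turned into `Label::Valid`.
pub fn label_all<F>(
    stmts: &[&str],
    mut ask: F,
    retries: usize,
    policy: OnPoison,
) -> Result<Vec<Label>, String>
where
    F: FnMut(&str) -> Outcome,
{
    let mut labels = Vec::with_capacity(stmts.len());
    for &stmt in stmts {
        let mut verdict = None;
        for _ in 0..=retries {
            match ask(stmt) {
                Outcome::Valid => {
                    verdict = Some(Label::Valid);
                    break;
                }
                Outcome::Invalid => {
                    verdict = Some(Label::Invalid);
                    break;
                }
                // A real labeler would reconnect here before retrying.
                Outcome::Transport => {}
            }
        }
        match (verdict, policy) {
            (Some(label), _) => labels.push(label),
            (None, OnPoison::SkipAndResume) => labels.push(Label::Skipped),
            (None, OnPoison::Abort) => {
                return Err(format!("no engine answer for: {stmt}"));
            }
        }
    }
    Ok(labels)
}
```

The key property is that the statements after a crash are each asked again, so one poison statement can no longer decide the labels of everything that follows it.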

…abels Full `sqlbench regen` run (featurescan, depth probe, benches, membench, time machine --full, export) against the reconstructed corpus and the hardened oracle labels. Updates bench.json.zst and history.json.zst plus the per-parser failing-statement downloads. The fixes show through: SQLite native parsers are back near 100 percent recall (the spurious truncated-trigger failures are gone), T-SQL sqlparser recall rises with the GO batch-separator split, and PostgreSQL and ClickHouse are graded against the corrected validity labels. Also gitignores the local regen.log.

LucaCappelletti94 added 11 commits June 11, 2026 09:31

Merge batch-accuracy: sampled-batch axis with parse-accuracy metric

c4dd02e

Merge composite-score: per-parser overall score and leaderboard

c4c60c0

Merge drop-fidelity: remove fidelity metric, keep batch-accuracy batch axis, add CREATE USER bug report

dacadb0

Reconcile composite score with fidelity removal (drop fidelity from correctness sub-score)

144280e

Regenerate benchmark results and history for the combined parked work

07400ad

LucaCappelletti94 changed the base branch from fix-corpus-triggers to main June 14, 2026 13:11

LucaCappelletti94 changed the title ~~Sampled batch axis, fidelity removal, and per-parser overall score~~ Repair corpus and harden oracle (issue #22), plus sampled batch axis, fidelity removal, and overall score Jun 14, 2026

LucaCappelletti94 merged commit c0ff21b into main Jun 14, 2026

LucaCappelletti94 merged 11 commits into main from combined-parked
