Repair corpus and harden oracle (issue #22), plus sampled batch axis, fidelity removal, and overall score#24

Merged
LucaCappelletti94 merged 11 commits into main from combined-parked
Jun 14, 2026

LucaCappelletti94 (Owner) commented Jun 14, 2026

Two things in one PR. First, the issue-#22 corpus and oracle fix: the old extractor split compound statements on inner ;, shredding CREATE TRIGGER ... BEGIN ... END bodies, PL/SQL blocks, and GO/DELIMITER batch separators into invalid fragments (raised on gwenn/lemon-rs#102). The SQLite, Spark, and Oracle suites are rebuilt from source with dialect-aware splitters (build_sqlite_suite, build_proc_suites), repair_corpus cleans the rest in place, and a guard test checks every CREATE TRIGGER line parses. The real-engine oracle is hardened so a transport error or engine crash is never recorded as valid: ClickHouse 23.3 and the Postgres backend each segfault on a specific statement, and both now reconnect, resume, and skip the poison statement. All six validity caches are regenerated, correcting 2411 Postgres, 767 ClickHouse, and 380 SQLite labels, all from valid to invalid.

Second, three parked features consolidated on top: a sampled batch axis (k=200 batches of m=128 drawn from each parser's accepted set, reported as a batch parse-accuracy plus per-statement time and memory), removal of the fidelity metric across grading, export, schema, UI, and docs, and a per-parser overall score (0 to 100) in web/src/score.rs shown as a leaderboard and a per-parser stat. The merge kept batch-accuracy's src/batch.rs (already a superset of drop-fidelity's batch fixes) and reconciled the score with the fidelity removal. bench.json and history.json are regenerated against the repaired corpus and corrected labels. One open question carried over from the parked work: whether provenance-only dialects over-reward permissive parsers in the overall score.

Fixes #22.

Add a single overall score per parser (0 to 100) blending correctness, robustness, project health, speed, and memory, with the five sub-scores shown alongside it. Scoring lives in web/src/score.rs and is computed from the existing bundle, featurescan, and metadata, so no benchmark re-run is needed. Surfaced as a ranked leaderboard on the overview and an overall-score section plus hero stat on each parser page.

Weights: correctness 45, robustness 20, health 15, speed 12, memory 8. Each parser is scored only over the dialects it models; correctness and health are absolute, while speed and memory are ranked against the field within each dialect. Parked as work in progress; the open question is whether provenance-only dialects over-reward permissive parsers.
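
A minimal sketch of how the five weights above could blend into one 0 to 100 number. The names `SubScores` and `overall_score` are hypothetical and are not taken from web/src/score.rs; the sketch also assumes every sub-score has already been normalized to the 0 to 100 range, which the description does not spell out.

```rust
/// Hypothetical container for the five sub-scores, each assumed to be
/// normalized to 0..=100 before blending.
#[derive(Debug, Clone, Copy)]
pub struct SubScores {
    pub correctness: f64,
    pub robustness: f64,
    pub health: f64,
    pub speed: f64,
    pub memory: f64,
}

/// Weights quoted in the commit message, in the same order as the fields
/// above: correctness 45, robustness 20, health 15, speed 12, memory 8.
const WEIGHTS: [f64; 5] = [45.0, 20.0, 15.0, 12.0, 8.0];

/// Weighted average of the sub-scores. Out-of-range inputs are clamped
/// (an assumption of this sketch, not a documented behavior).
pub fn overall_score(s: SubScores) -> f64 {
    let parts = [s.correctness, s.robustness, s.health, s.speed, s.memory];
    let weighted: f64 = parts
        .iter()
        .zip(WEIGHTS.iter())
        .map(|(p, w)| p.clamp(0.0, 100.0) * w)
        .sum();
    weighted / WEIGHTS.iter().sum::<f64>()
}
```

Under these assumptions, a parser with perfect correctness and zero everywhere else lands at 45, and one with every sub-score at 100 lands at 100.
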
…R bug report

Parked work in progress. Includes the fidelity-metric removal (grading, export, schema, UI, docs), exploratory batch fixes in src/batch.rs (terminator on its own line, COPY-FROM-STDIN exclusion), the regenerated bench.json, and the standalone sqlparser CREATE USER / ALTER USER terminator-swallow bug report.
Replace the all-or-nothing whole-script batch with k=200 random batches of m=128 statements drawn from the set each parser individually accepts. Report batch accuracy (share of batches that reparse to the exact statement count) plus per-statement time and memory over the correctly parsed batches. Sampling localizes a terminator bug to its batch instead of voiding the whole measurement, so real but narrow parser bugs surface at their true rate (sqlparser CREATE USER/ALTER USER) while clean parsers sit at 100 percent. Shared sampler in src/batch.rs (seeded SplitMix64), wired into the bench, membench, time machine, schema, export, and the web UI (batch ok% column). pg_query summary is excluded since it reports statement types, not a count.
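
The sampling scheme above can be sketched roughly as follows. SplitMix64 itself is the standard published generator; everything else (`sample_batches`, `batch_accuracy`, drawing indices with replacement, and the modulo reduction) is an assumption made for illustration and may differ from src/batch.rs.

```rust
/// Standard SplitMix64 generator (seeded, deterministic).
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Draw `k` batches of `m` indices into a parser's accepted set.
/// Assumption: indices are drawn with replacement, and the modulo
/// reduction accepts a negligible bias for the sake of brevity.
pub fn sample_batches(accepted: usize, k: usize, m: usize, seed: u64) -> Vec<Vec<usize>> {
    if accepted == 0 {
        return Vec::new();
    }
    let mut rng = SplitMix64::new(seed);
    (0..k)
        .map(|_| {
            (0..m)
                .map(|_| (rng.next_u64() % accepted as u64) as usize)
                .collect()
        })
        .collect()
}

/// Share of batches whose reparse yields exactly as many statements as
/// were put in. `count_statements` stands in for "join the batch, parse
/// it, and count the statements"; it is a hypothetical callback.
pub fn batch_accuracy<F>(batches: &[Vec<usize>], mut count_statements: F) -> f64
where
    F: FnMut(&[usize]) -> usize,
{
    if batches.is_empty() {
        return 0.0;
    }
    let ok = batches
        .iter()
        .filter(|b| count_statements(b.as_slice()) == b.len())
        .count();
    ok as f64 / batches.len() as f64
}
```

Because the generator is seeded, the same seed reproduces the same batches, and a terminator bug only fails the batches that happen to contain the offending statement.
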
The old extractor split compound statements on inner semicolons, shredding trigger bodies, PL/SQL blocks, and batch separators into invalid fragments. build_sqlite_suite rebuilds the SQLite official suite with a trigger and CASE aware splitter that keeps BEGIN..END bodies intact. build_proc_suites rebuilds Spark and Oracle from source, keeping whole PL/SQL blocks (Oracle anonymous blocks isolated under datasets/special, with their inner DML harvested as individual statements). repair_corpus cleans the remaining files in place: it splits T-SQL GO batch separators, drops procedural fragments and DELIMITER wreckage, and removes leaked non-SQL lines. Adds a guard test asserting every CREATE TRIGGER line parses, registers the new binaries, and repacks datasets.tar.zst.
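
To illustrate the idea of a trigger and CASE aware splitter, here is a deliberately small sketch. It is not the code in build_sqlite_suite: the function name `split_statements` is hypothetical, comments and quoted identifiers are not handled, and any word spelled END is treated as a block terminator.

```rust
/// Split a script on `;`, but keep `CREATE TRIGGER ... BEGIN ... END`
/// bodies in one piece. A plain `BEGIN;` (transaction start) still splits,
/// because BEGIN only opens a block once TRIGGER has been seen in the
/// current statement.
pub fn split_statements(script: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    let mut word = String::new();
    let mut depth = 0usize; // nesting of BEGIN..END and CASE..END
    let mut in_trigger = false;
    let mut in_string = false; // inside a '...' literal

    for ch in script.chars() {
        current.push(ch);
        if in_string {
            // A doubled quote ('') closes and immediately reopens the literal.
            if ch == '\'' {
                in_string = false;
            }
            continue;
        }
        if ch.is_alphanumeric() || ch == '_' {
            word.push(ch);
            continue;
        }
        // Word boundary: update the nesting state from the finished word.
        match word.to_ascii_uppercase().as_str() {
            "TRIGGER" => in_trigger = true,
            "BEGIN" if in_trigger => depth += 1,
            "CASE" => depth += 1,
            "END" => depth = depth.saturating_sub(1),
            _ => {}
        }
        word.clear();
        match ch {
            '\'' => in_string = true,
            ';' if depth == 0 => {
                let stmt = current.trim();
                if stmt != ";" {
                    out.push(stmt.to_string());
                }
                current.clear();
                in_trigger = false;
            }
            _ => {}
        }
    }
    // Keep a trailing statement that has no terminator.
    let rest = current.trim();
    if !rest.is_empty() {
        out.push(rest.to_string());
    }
    out
}
```

The inner semicolons of a trigger body are seen at depth 1 or deeper and therefore never split, which is the property the guard test on CREATE TRIGGER lines is meant to protect.
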
The labelers defaulted transport errors and engine crashes to valid, which silently mislabeled large tails. ClickHouse 23.3 segfaults parsing one statement and the Postgres backend crashes on another, so the old error-to-valid fallback marked every statement after the crash valid (deterministic, so re-running reproduced it exactly). Each labeler now treats only a real engine answer as a verdict, drains responses, retries transient failures, and either reconnects and resumes past a confirmed poison statement (ClickHouse, Postgres) or aborts without writing a cache (MySQL, T-SQL, SQLite). Postgres also special-cases COPY TO STDOUT, which is valid but breaks the simple-query protocol. Regenerates all six caches against the repaired corpus, correcting 2411 Postgres, 767 ClickHouse, and 380 SQLite labels, all from valid to invalid.
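
The rule that only a real engine answer counts as a verdict can be sketched as below. All names here (`EngineReply`, `Label`, `OnPoison`, `label_statements`) are hypothetical, the engine is replaced by a callback, and recording a poison statement as `Skipped` is an assumption of the sketch rather than a description of what the caches store.

```rust
/// What one attempt against the engine produced.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EngineReply {
    Accepted,       // the engine parsed the statement
    Rejected,       // the engine returned a syntax error
    TransportError, // dropped connection, crash, timeout: not a verdict
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Label {
    Valid,
    Invalid,
    Skipped, // confirmed poison statement, never recorded as valid
}

/// Policy once retries are exhausted without an engine answer.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OnPoison {
    SkipAndResume, // reconnect-and-resume style
    Abort,         // give up so that no cache is written
}

pub fn label_statements<F>(
    stmts: &[&str],
    max_retries: usize,
    on_poison: OnPoison,
    mut ask: F,
) -> Result<Vec<Label>, String>
where
    F: FnMut(&str) -> EngineReply,
{
    let mut labels = Vec::with_capacity(stmts.len());
    for &stmt in stmts {
        let mut verdict = None;
        for _ in 0..=max_retries {
            match ask(stmt) {
                EngineReply::Accepted => {
                    verdict = Some(Label::Valid);
                    break;
                }
                EngineReply::Rejected => {
                    verdict = Some(Label::Invalid);
                    break;
                }
                // A real labeler would reconnect here before retrying.
                EngineReply::TransportError => {}
            }
        }
        match (verdict, on_poison) {
            (Some(label), _) => labels.push(label),
            (None, OnPoison::SkipAndResume) => labels.push(Label::Skipped),
            (None, OnPoison::Abort) => {
                return Err(format!("no engine answer for: {stmt}"));
            }
        }
    }
    Ok(labels)
}
```

The point of the shape is that a transport failure has no path to `Label::Valid`: it either gets retried, is skipped explicitly, or aborts the whole run.
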
…abels

Full `sqlbench regen` run (featurescan, depth probe, benches, membench, time machine --full, export) against the reconstructed corpus and the hardened oracle labels. Updates bench.json.zst and history.json.zst plus the per-parser failing-statement downloads. The fixes show through: SQLite native parsers are back near 100 percent recall (the spurious truncated-trigger failures are gone), T-SQL sqlparser recall rises with the GO batch-separator split, and PostgreSQL and ClickHouse are graded against the corrected validity labels. Also gitignores the local regen.log.
@LucaCappelletti94 LucaCappelletti94 changed the base branch from fix-corpus-triggers to main June 14, 2026 13:11
@LucaCappelletti94 LucaCappelletti94 changed the title from "Sampled batch axis, fidelity removal, and per-parser overall score" to "Repair corpus and harden oracle (issue #22), plus sampled batch axis, fidelity removal, and overall score" Jun 14, 2026
@LucaCappelletti94 LucaCappelletti94 merged commit c0ff21b into main Jun 14, 2026
Development

Successfully merging this pull request may close these issues.

SQLite corpus splits CREATE TRIGGER BEGIN...END bodies, and the oracle mislabels the fragments valid

1 participant