A robust, single-node SQL query engine that reads and writes Apache Iceberg tables, built in Rust on Apache DataFusion + apache/iceberg-rust, with MinIO (S3-compatible) object storage. Apache Arrow is the in-memory format.
See the design/roadmap in the approved plan for the full architecture. This README tracks the current state.
| Layer | Choice |
|---|---|
| Language | Rust (Tokio async) |
| Query engine | Apache DataFusion (52.2, matching iceberg-datafusion 0.9.1) |
| Iceberg | apache/iceberg-rust 0.9.1 (iceberg, iceberg-catalog-rest, iceberg-datafusion) |
| Catalog | Iceberg REST catalog |
| Storage | MinIO (S3-compatible) |
| File format | Parquet |

    crates/
      engine/      # config, catalog connectivity, SessionContext builder, CoW DML
      extensions/  # custom scalar/aggregate UDFs (replaces "Gandiva expressions")
      cli/         # `snt` binary
    docker/        # local Postgres-backed Iceberg REST + MinIO stack
    seeder/        # PyIceberg NYC-taxi data seeder

- Start the local stack: `docker compose -f docker/docker-compose.yml up -d`
- Build and run the CLI (lists catalog namespaces/tables): `cargo run -p cli`
Configuration comes from environment variables (see .env.example); defaults match
the docker-compose stack.
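
As a rough illustration of the "environment variable with a default" pattern described above, here is a minimal sketch using only the standard library. The struct, the `SNT_*` variable names, and the default values are all hypothetical placeholders chosen for this example; the real names and defaults are the ones in `.env.example` and the `engine` crate.

```rust
use std::env;

/// Hypothetical connection settings; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct EngineConfig {
    pub catalog_uri: String,
    pub warehouse: String,
    pub s3_endpoint: String,
    pub s3_region: String,
}

/// Return the value for `key`, or `default` when it is unset or blank.
fn get_or<F>(lookup: &F, key: &str, default: &str) -> String
where
    F: Fn(&str) -> Option<String>,
{
    lookup(key)
        .filter(|v| !v.trim().is_empty())
        .unwrap_or_else(|| default.to_string())
}

impl EngineConfig {
    /// Build the config from any key/value source, which keeps it testable
    /// without touching the process environment.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        // Variable names and defaults below are assumptions, not the project's.
        EngineConfig {
            catalog_uri: get_or(&lookup, "SNT_CATALOG_URI", "http://localhost:8181"),
            warehouse: get_or(&lookup, "SNT_WAREHOUSE", "s3://warehouse/"),
            s3_endpoint: get_or(&lookup, "SNT_S3_ENDPOINT", "http://localhost:9000"),
            s3_region: get_or(&lookup, "SNT_S3_REGION", "us-east-1"),
        }
    }

    /// Build the config from the process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key: &str| env::var(key).ok())
    }
}
```

Calling `EngineConfig::from_env()` with nothing exported yields the placeholder defaults; exporting any one variable overrides just that field.
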
- M0 — scaffold + connect to REST catalog (list namespaces/tables). ✅ done
- M1 — read path via DataFusion (`SELECT`, projection + predicate pushdown). ✅ done
  - Verified against `nyc.taxi` (NYC yellow-taxi data): exact `count(*)`, group-by, aggregations, and `EXPLAIN` showing pushdown into `IcebergTableScan`.
  - Caveat: the seed delete was applied as copy-on-write by PyIceberg (no delete files), so merge-on-read delete merging is not yet exercised; see M1-follow-up.
- M2 — append writes (`CREATE TABLE`, `INSERT INTO`, `DROP TABLE`). ✅ done
  - Verified: empty `CREATE TABLE`, three `INSERT INTO ... SELECT` appends → 3 real Iceberg snapshots (added-records 57/75/361) + 3 Parquet data files in MinIO; `DROP` removes the table.
  - REST catalog now backed by Postgres (the demo's in-memory SQLite threw `SQLITE_BUSY` on concurrent commits), so it is durable across restarts.
  - Limitation: `CREATE TABLE AS SELECT` (CTAS) is not supported by `iceberg-datafusion 0.9.1` (`register_table does not support tables with data`); use `CREATE TABLE` + `INSERT INTO` instead.
- M3 — extensions (custom UDFs). ✅ done
  - `extensions` crate registers scalar (`payment_label`, `miles_to_km`) and aggregate (`geo_mean`) UDFs on the `SessionContext`; verified via SQL against `nyc.taxi`. This is the DataFusion-native replacement for "custom Gandiva expressions".
- M4 — DML (`DELETE`/`UPDATE`). ✅ done
  - The released `iceberg-datafusion`/`iceberg-rust` stack has no native DML and an append-only transaction API. So `DELETE`/`UPDATE` are implemented as copy-on-write in `engine::dml`: rewrite the table into a temp table (created with the source's exact Iceberg schema), then swap via catalog `drop` + `rename`.
  - Verified on a test table: `DELETE` removed the right rows; `UPDATE` applied literal and expression assignments scoped by predicate, with row counts preserved.
  - Limitations: swap is two catalog calls (not one atomic commit); snapshot history restarts; single-level namespaces; unpartitioned tables; `MERGE` unsupported (DataFusion 52 can't parse it). `table_exists` is avoided (the REST adapter rejects its HTTP HEAD with 400).
- M5 — hardening (observability, commit-conflict retry, docs). ← next
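
The copy-on-write DML flow from M4 (rewrite into a temp table, then `drop` + `rename`) can be pictured with a toy in-memory model. This is only a sketch of the control flow under stated assumptions: `ToyCatalog`, `Row`, `swap_in`, `cow_delete`, `cow_update`, and the `__cow_tmp` suffix are hypothetical names invented here. It does not use the iceberg-rust catalog API and is not the code in `engine::dml`.

```rust
use std::collections::HashMap;

/// Toy row: (id, fare). A real table would carry an Iceberg schema.
type Row = (i64, f64);

/// Hypothetical in-memory stand-in for a catalog: table name -> rows.
#[derive(Default)]
struct ToyCatalog {
    tables: HashMap<String, Vec<Row>>,
}

impl ToyCatalog {
    fn create(&mut self, name: &str, rows: Vec<Row>) -> Result<(), String> {
        if self.tables.contains_key(name) {
            return Err(format!("table already exists: {}", name));
        }
        self.tables.insert(name.to_string(), rows);
        Ok(())
    }

    fn rows(&self, name: &str) -> Result<Vec<Row>, String> {
        self.tables
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no such table: {}", name))
    }

    fn drop_table(&mut self, name: &str) -> Result<(), String> {
        self.tables
            .remove(name)
            .map(|_| ())
            .ok_or_else(|| format!("no such table: {}", name))
    }

    fn rename(&mut self, from: &str, to: &str) -> Result<(), String> {
        if self.tables.contains_key(to) {
            return Err(format!("table already exists: {}", to));
        }
        let rows = self
            .tables
            .remove(from)
            .ok_or_else(|| format!("no such table: {}", from))?;
        self.tables.insert(to.to_string(), rows);
        Ok(())
    }
}

/// Write the rewritten rows to a temp table, then swap it into place.
/// The swap is two separate calls, so the table is briefly absent between them.
fn swap_in(catalog: &mut ToyCatalog, table: &str, rewritten: Vec<Row>) -> Result<(), String> {
    let tmp = format!("{}__cow_tmp", table);
    catalog.create(&tmp, rewritten)?;
    catalog.drop_table(table)?; // call 1: remove the original
    catalog.rename(&tmp, table) // call 2: move the rewrite into its place
}

/// DELETE: keep every row that does not match the predicate. Returns rows removed.
fn cow_delete(
    catalog: &mut ToyCatalog,
    table: &str,
    pred: impl Fn(&Row) -> bool,
) -> Result<usize, String> {
    let source = catalog.rows(table)?;
    let kept: Vec<Row> = source.iter().copied().filter(|r| !pred(r)).collect();
    let deleted = source.len() - kept.len();
    swap_in(catalog, table, kept)?;
    Ok(deleted)
}

/// UPDATE: apply `assign` to matching rows, copy the rest unchanged. Returns rows updated.
fn cow_update(
    catalog: &mut ToyCatalog,
    table: &str,
    pred: impl Fn(&Row) -> bool,
    assign: impl Fn(&Row) -> Row,
) -> Result<usize, String> {
    let source = catalog.rows(table)?;
    let mut updated = 0;
    let rewritten: Vec<Row> = source
        .iter()
        .map(|r| if pred(r) { updated += 1; assign(r) } else { *r })
        .collect();
    swap_in(catalog, table, rewritten)?;
    Ok(updated)
}
```

In this model an `UPDATE` keeps the row count unchanged and a `DELETE` only shrinks it. Because the swap is two calls rather than one commit, a failure between them would leave only the temp table behind, which mirrors the non-atomic-swap limitation listed under M4.
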