marun224/query_engine

SingleNodeTransformation

A robust, single-node SQL query engine that reads and writes Apache Iceberg tables, built in Rust on Apache DataFusion + apache/iceberg-rust, with MinIO (S3-compatible) object storage. Apache Arrow is the in-memory format.

See the design/roadmap in the approved plan for the full architecture. This README tracks the current state.

Stack

Layer         Choice
Language      Rust (Tokio async)
Query engine  Apache DataFusion (52.2, matching iceberg-datafusion 0.9.1)
Iceberg       apache/iceberg-rust 0.9.1 (iceberg, iceberg-catalog-rest, iceberg-datafusion)
Catalog       Iceberg REST catalog
Storage       MinIO (S3-compatible)
File format   Parquet

Workspace layout

crates/
  engine/      # config, catalog connectivity, SessionContext builder, CoW DML
  extensions/  # custom scalar/aggregate UDFs (replaces "Gandiva expressions")
  cli/         # `snt` binary
docker/        # local Postgres-backed Iceberg REST + MinIO stack
seeder/        # PyIceberg NYC-taxi data seeder

Quickstart

  1. Start the local stack:
    docker compose -f docker/docker-compose.yml up -d
  2. Build and run the CLI (lists catalog namespaces/tables):
    cargo run -p cli

Configuration comes from environment variables (see .env.example); defaults match the docker-compose stack.
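
As an illustration of the "environment variables with defaults" pattern, here is a minimal std-only sketch. The variable names (SNT_CATALOG_URI, SNT_S3_ENDPOINT, SNT_WAREHOUSE), the default values, and the EngineConfig type are all hypothetical placeholders, not the engine crate's real API; the authoritative names and defaults are in .env.example.

```rust
/// Connection settings for the REST catalog and the S3-compatible store.
/// Hypothetical shape: the real field names live in the `engine` crate.
#[derive(Debug, Clone, PartialEq)]
struct EngineConfig {
    catalog_uri: String,
    s3_endpoint: String,
    warehouse: String,
}

impl EngineConfig {
    /// Build a config from any key lookup, falling back to a default per key.
    /// Taking the lookup as a closure keeps this testable without mutating
    /// the process environment.
    fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &str, default: &str| {
            lookup(key).unwrap_or_else(|| default.to_string())
        };
        EngineConfig {
            // Placeholder variable names and defaults (assumptions, see lead-in).
            catalog_uri: get("SNT_CATALOG_URI", "http://localhost:8181"),
            s3_endpoint: get("SNT_S3_ENDPOINT", "http://localhost:9000"),
            warehouse: get("SNT_WAREHOUSE", "s3://warehouse/"),
        }
    }

    /// Read the same keys from the real process environment.
    fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

With this shape, a binary would call EngineConfig::from_env() once at startup, while tests pass a closure to from_lookup to simulate overrides.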

Milestones

  • M0 — scaffold + connect to REST catalog (list namespaces/tables). ✅ done
  • M1 — read path via DataFusion (SELECT, projection + predicate pushdown). ✅ done
    • Verified against nyc.taxi (NYC yellow-taxi data): exact count(*), group-by, aggregations, and EXPLAIN showing pushdown into IcebergTableScan.
    • Caveat: the seed delete was applied as copy-on-write by PyIceberg (no delete files), so merge-on-read delete merging is not yet exercised — see M1-follow-up.
  • M2 — append writes (CREATE TABLE, INSERT INTO, DROP TABLE). ✅ done
    • Verified: empty CREATE TABLE, three INSERT INTO ... SELECT appends → 3 real Iceberg snapshots (added-records 57/75/361) + 3 Parquet data files in MinIO; DROP removes the table.
    • REST catalog now backed by Postgres (the demo's in-memory SQLite threw SQLITE_BUSY on concurrent commits) — durable across restarts.
    • Limitation: CREATE TABLE AS SELECT (CTAS) is not supported by iceberg-datafusion 0.9.1 (register_table does not support tables with data); use CREATE TABLE + INSERT INTO instead.
  • M3 — extensions (custom UDFs). ✅ done
    • extensions crate registers scalar (payment_label, miles_to_km) and aggregate (geo_mean) UDFs on the SessionContext; verified via SQL against nyc.taxi. This is the DataFusion-native replacement for "custom Gandiva expressions".
  • M4 — DML (DELETE/UPDATE). ✅ done
    • The released iceberg-datafusion/iceberg-rust stack has no native DML and an append-only transaction API. So DELETE/UPDATE are implemented as copy-on-write in engine::dml: rewrite the table into a temp table (created with the source's exact Iceberg schema), then swap via catalog drop + rename.
    • Verified on a test table: DELETE removed the right rows; UPDATE applied literal and expression assignments scoped by predicate, with row counts preserved.
    • Limitations: the swap is two catalog calls (drop, then rename), not one atomic commit; snapshot history restarts with each rewrite; only single-level namespaces and unpartitioned tables are supported; MERGE is unsupported (DataFusion 52 can't parse it). table_exists is avoided because the REST adapter rejects its HTTP HEAD request with 400.
  • M5 — hardening (observability, commit-conflict retry, docs). ← next
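
The copy-on-write DELETE/UPDATE flow described under M4 (rewrite into a temp table, then drop + rename) can be sketched with an in-memory stand-in for the catalog. This is a toy model only: ToyCatalog, Row, the `__cow_tmp` suffix, and all method names are hypothetical, and no DataFusion or iceberg-rust calls are involved. The real logic lives in engine::dml.

```rust
use std::collections::HashMap;

/// A toy row: (id, fare). Stands in for Arrow record batches.
type Row = (u32, f64);

/// In-memory stand-in for the catalog: table name -> rows (hypothetical).
#[derive(Default)]
struct ToyCatalog {
    tables: HashMap<String, Vec<Row>>,
}

impl ToyCatalog {
    /// Copy-on-write rewrite. `rewrite` returns `None` to drop a row (DELETE)
    /// or `Some(row)` to keep or change it (UPDATE).
    /// Returns the number of rows in the rewritten table.
    fn rewrite_table<F>(&mut self, name: &str, rewrite: F) -> Result<usize, String>
    where
        F: Fn(&Row) -> Option<Row>,
    {
        let source = self
            .tables
            .get(name)
            .ok_or_else(|| format!("no such table: {}", name))?;

        // Step 1: write the surviving/updated rows into a temp table.
        let rewritten: Vec<Row> = source.iter().filter_map(|row| rewrite(row)).collect();
        let count = rewritten.len();
        let temp_name = format!("{}__cow_tmp", name);
        self.tables.insert(temp_name.clone(), rewritten);

        // Step 2: swap via drop + rename. These are separate steps, so a
        // failure between them would leave only the temp table behind.
        self.tables.remove(name);
        let temp = self
            .tables
            .remove(&temp_name)
            .expect("temp table was just created");
        self.tables.insert(name.to_string(), temp);
        Ok(count)
    }

    /// DELETE ... WHERE pred: rows matching `pred` are not copied.
    fn delete_where<P>(&mut self, name: &str, pred: P) -> Result<usize, String>
    where
        P: Fn(&Row) -> bool,
    {
        self.rewrite_table(name, |row| if pred(row) { None } else { Some(*row) })
    }

    /// UPDATE ... SET ... WHERE pred: matching rows are replaced by
    /// `assign(row)`, all others are copied unchanged.
    fn update_where<P, A>(&mut self, name: &str, pred: P, assign: A) -> Result<usize, String>
    where
        P: Fn(&Row) -> bool,
        A: Fn(&Row) -> Row,
    {
        self.rewrite_table(name, |row| Some(if pred(row) { assign(row) } else { *row }))
    }
}
```

Routing both operations through a single rewrite function mirrors why UPDATE preserves row counts while DELETE shrinks them: the only difference is whether the per-row function ever returns `None`.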

About

Single-node SQL query engine for Apache Iceberg (Rust + DataFusion + iceberg-rust, MinIO/S3)
