Experimental MVP: a high-performance Arrow Flight ML inference server for batch, streaming, and data-native ML workloads.
This project is an experimental foundation, not a production-ready inference platform. The current MVP focuses on Arrow RecordBatch in, schema validation, resource limits, ONNX Runtime inference, and Arrow RecordBatch out over Apache Arrow Flight.
Iskander is a Rust inference server built around Apache Arrow Flight. Clients send typed Arrow batches, the server validates a model schema contract, applies security and resource limits, dispatches to a pluggable backend, and returns typed Arrow batches.
For ONNX models, the server can load an Arrow schema manifest next to the model artifact. That manifest defines the Arrow-facing input/output contract, lets the backend compile tensor mappings at model load time, and gives clients a stable schema contract instead of relying on runtime shape guessing.
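The contract check described above can be illustrated with a small, library-free sketch. It is not the server's implementation: the `Field` class, the `validate_schema` helper, and the rule set (exact name and type match, no extra columns, no nullable column where the contract forbids nulls) are assumptions made for illustration, and plain strings stand in for Arrow data types. The real manifest format is not shown in this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Stand-in for an Arrow field: name, type string, and nullability."""
    name: str
    dtype: str
    nullable: bool = False


def validate_schema(contract: list, incoming: list) -> list:
    """Return human-readable mismatches; an empty list means the batch is accepted."""
    errors = []
    incoming_by_name = {f.name: f for f in incoming}
    for expected in contract:
        actual = incoming_by_name.get(expected.name)
        if actual is None:
            errors.append(f"missing column: {expected.name}")
        elif actual.dtype != expected.dtype:
            errors.append(
                f"type mismatch for {expected.name}: "
                f"expected {expected.dtype}, got {actual.dtype}"
            )
        elif actual.nullable and not expected.nullable:
            errors.append(f"column {expected.name} must be non-nullable")
    contract_names = {f.name for f in contract}
    for f in incoming:
        if f.name not in contract_names:
            errors.append(f"unexpected column: {f.name}")
    return errors


# Hypothetical input contract for an embedding model.
CONTRACT = [Field("id", "utf8"), Field("text", "utf8")]

ok = validate_schema(CONTRACT, [Field("id", "utf8"), Field("text", "utf8")])
bad = validate_schema(CONTRACT, [Field("id", "int64"), Field("extra", "utf8")])
print(ok)   # no mismatches
print(bad)  # one entry per violated rule
```

Rejecting a batch before it reaches the backend is what lets tensor mappings be fixed at model load time: the backend never has to guess a layout per request.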
The target workloads are embeddings, recommendations, tabular scoring, micro-batch streaming, lakehouse inference, ETL feature generation, and scientific or industrial batch analytics.
It is not a generic Triton replacement, REST-first prediction API, LLM token streaming server, hard real-time robotics control loop, or universal inference server for every model family.
Arrow Flight carries columnar data over gRPC without converting every request into JSON or row-oriented payloads. It fits systems that already produce Arrow, Parquet, DataFusion, Polars, Spark, Iceberg, Delta, feature-store, or vector indexing pipelines.
Rust gives this project memory safety, typed errors, async networking, strong Arrow ecosystem support, and clean integration points for model backends. ONNX Runtime support is implemented through the ort crate, while safetensors is used for safe tensor artifact metadata and loading rather than compute.
The design keeps Arrow buffers as the transport and validation representation. It does not claim full zero-copy inference. The ONNX backend has an input fast path that borrows non-null Arrow float32 / int64 primitive and fixed-size-list buffers as ORT tensor views, and it uses ONNX Runtime I/O binding with preallocated float32 outputs for known output shapes. Copies may still happen during Flight decode, nullable input handling, CPU-to-GPU transfer, and final Arrow output materialization. Copy boundaries are documented in docs/memory_model.md.
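The difference between the borrowed fast path and the copying fallback can be shown with a Python analogy using only the standard library. This is a sketch of the idea, not the Rust/ORT code: `to_tensor_view` is a hypothetical helper, `array` stands in for an Arrow values buffer, and the choice to copy whenever `null_count > 0` is an assumption about how a fallback might behave.

```python
from array import array


def to_tensor_view(values: array, null_count: int, list_size: int = 1):
    """Return (buffer, shape, copied) for a flat values buffer.

    Fast path: with no nulls, hand out a memoryview over the existing
    buffer, so no bytes are copied. Fallback: with nulls, return a view
    over a fresh copy, since null slots would need handling before use.
    """
    if len(values) % list_size != 0:
        raise ValueError("values length must be a multiple of list_size")
    rows = len(values) // list_size
    shape = (rows,) if list_size == 1 else (rows, list_size)
    if null_count == 0:
        return memoryview(values), shape, False
    return memoryview(array(values.typecode, values)), shape, True


# Two rows of a fixed-size list of three float32 values, stored contiguously.
flat = array("f", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

view, shape, copied = to_tensor_view(flat, null_count=0, list_size=3)
flat[0] = 9.0  # the borrowed view observes this write
print(shape, copied, view[0])

copy_view, _, copied_slow = to_tensor_view(flat, null_count=1, list_size=3)
flat[0] = 1.0  # the copy keeps the value it had when it was taken
print(copied_slow, view[0], copy_view[0])
```

The fixed-size-list case works because the child values are stored back to back, so a shape of `(rows, list_size)` can be laid over the same buffer without rearranging it.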
The current runtime is optimized for data-native batch and micro-batch inference. It is strongest when clients already hold Arrow-compatible columnar data and want Arrow-compatible results back. It is not optimized for single-row REST latency.
Recommended quickstart (MovieLens two-tower model from devmodels):
uv run devmodels/movielens_two_tower/train_two_tower.py
cargo run --release -p iskander-server --features onnx -- \
--config examples/movielens-two-tower/config.toml
Default server address:
127.0.0.1:50051
The config points the Rust server directly at the exported ONNX artifact.
[[models]]
name = "movielens-two-tower"
backend = "onnx"
path = "devmodels/movielens_two_tower/artifacts/two_tower.onnx"
execution_providers = ["cpu"]
intra_threads = 4
inter_threads = 1
optimization_level = "level3"
parallel_execution = false
memory_pattern = true
As a local CPU-only proof of concept, the repository now includes a matched-baseline open-loop benchmark harness in crates/iskander-bench. One representative movielens-two-tower run used:
- batch size: 128
- offered load: 3000 req/s
- duration: 90s
- zero drops and zero errors for all three SUTs
| SUT | Transport | p50 | p95 | p99 | p99.9 | avg |
|---|---|---|---|---|---|---|
| Iskander | Arrow Flight | 709 us | 1181 us | 3427 us | 8746 us | 815 us |
| Iskander | OIP v2 gRPC | 941 us | 1533 us | 5285 us | 14011 us | 1144 us |
| Triton + ORT | OIP v2 gRPC | 1125 us | 3550 us | 7698 us | 16555 us | 1536 us |
In this snapshot, all three systems sustained the same fixed offered load, and Iskander over Arrow Flight showed the lowest client-observed latency. Iskander over OIP v2 also showed lower latency than Triton at every reported percentile on the same request shape and ORT-aligned CPU baseline.
These numbers are a proof-of-concept snapshot, not a universal claim. They come from one local matched-baseline setup and should be read together with the methodology in BENCHMARK.md and the raw artifacts in results/benchmarks/2026-05-08-matched-baseline/.
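For readers who want to put the snapshot in proportion, the short script below derives request volume, row throughput, and latency ratios from the figures reported above. It assumes each request carries exactly one batch of 128 rows and that the load generator hit the offered rate exactly; neither is stated explicitly here, so treat the derived numbers as approximate.

```python
# Run parameters and latencies copied from the snapshot above.
BATCH_SIZE = 128      # rows per request (assumed: one batch per request)
OFFERED_LOAD = 3000   # requests per second
DURATION_S = 90

LATENCY_US = {
    "Iskander / Arrow Flight": {"p50": 709, "p95": 1181, "p99": 3427, "p99.9": 8746},
    "Iskander / OIP v2 gRPC": {"p50": 941, "p95": 1533, "p99": 5285, "p99.9": 14011},
    "Triton + ORT / OIP v2 gRPC": {"p50": 1125, "p95": 3550, "p99": 7698, "p99.9": 16555},
}

total_requests = OFFERED_LOAD * DURATION_S
rows_per_second = OFFERED_LOAD * BATCH_SIZE


def ratio_to_baseline(percentile, baseline="Iskander / Arrow Flight"):
    """Latency of each SUT divided by the baseline's latency at one percentile."""
    base = LATENCY_US[baseline][percentile]
    return {sut: round(vals[percentile] / base, 2) for sut, vals in LATENCY_US.items()}


print("requests per SUT:", total_requests)
print("rows per second:", rows_per_second)
for p in ("p50", "p99"):
    print(p, ratio_to_baseline(p))
```

Under those assumptions each system handled about 270,000 requests, or 384,000 rows per second, during the run. The ratios only describe this one local setup.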
Generate embeddings for documents, products, users, or images; run batch re-indexing; feed vector databases, feature stores, or lakehouse tables.
Input: id: utf8, text: utf8
Output: id: utf8, embedding: fixed_size_list<float32>[768]
Use for churn prediction, fraud scoring, credit risk, lead scoring, price prediction, and demand forecasting.
Input: entity_id: utf8, feature_1: float32, feature_2: float32, feature_n: float32
Output: entity_id: utf8, score: float32, label: utf8
Score user-item candidates for feeds, ads, marketplaces, and matching systems.
Input: user_id: utf8, item_id: utf8, user_features: struct, item_features: struct
Output: user_id: utf8, item_id: utf8, relevance_score: float32, rank: int32
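To make the output contract concrete, here is a minimal sketch of how a `rank` column could be derived from `relevance_score` per user. Whether ranking happens inside the model, in the server, or in the client is not specified in this README; `add_ranks` is a hypothetical helper, and 1-based descending rank is an assumption.

```python
from collections import defaultdict


def add_ranks(rows):
    """rows: iterable of (user_id, item_id, relevance_score).

    Returns (user_id, item_id, relevance_score, rank) tuples, where rank is
    1-based within each user and higher scores rank first.
    """
    by_user = defaultdict(list)
    for user_id, item_id, score in rows:
        by_user[user_id].append((item_id, score))

    ranked = []
    for user_id, items in by_user.items():
        # Stable sort: items with equal scores keep their input order.
        items.sort(key=lambda pair: pair[1], reverse=True)
        for rank, (item_id, score) in enumerate(items, start=1):
            ranked.append((user_id, item_id, score, rank))
    return ranked


scored = [
    ("u1", "i1", 0.20),
    ("u1", "i2", 0.90),
    ("u2", "i1", 0.50),
    ("u1", "i3", 0.55),
]
for row in add_ranks(scored):
    print(row)
```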
Apply inference to fraud detection, anomaly detection, IoT telemetry, clickstream scoring, and monitoring streams.
This server is designed for micro-batches, not single-event REST latency.
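A client-side micro-batcher is one way to turn an event stream into the batches this server expects. The sketch below is an assumption about how a caller might do that, not part of Iskander: `MicroBatcher`, `max_rows`, and `max_wait_s` are hypothetical names, and the time limit is only checked when a new event arrives, so a real implementation would also need a timer to flush an idle buffer.

```python
import time


class MicroBatcher:
    """Collect events and release them as a batch on a size or age limit."""

    def __init__(self, max_rows, max_wait_s, clock=time.monotonic):
        self.max_rows = max_rows
        self.max_wait_s = max_wait_s
        self._clock = clock
        self._rows = []
        self._first_at = 0.0

    def add(self, event):
        """Buffer one event; return a batch if a limit was reached, else None."""
        now = self._clock()
        if not self._rows:
            self._first_at = now  # age is measured from the first buffered event
        self._rows.append(event)
        too_big = len(self._rows) >= self.max_rows
        too_old = now - self._first_at >= self.max_wait_s
        return self.flush() if (too_big or too_old) else None

    def flush(self):
        """Return whatever is buffered (possibly empty) and reset the buffer."""
        batch, self._rows = self._rows, []
        return batch


# A fake clock keeps the demonstration deterministic.
ticks = iter([0.0, 0.001, 0.002, 0.010, 0.030])
batcher = MicroBatcher(max_rows=3, max_wait_s=0.015, clock=lambda: next(ticks))

batches = []
for event in ["a", "b", "c", "d", "e"]:
    batch = batcher.add(event)
    if batch:
        batches.append(batch)
print(batches)
```

The first batch is released by the row limit and the second by the age limit. Each released batch would then be converted to an Arrow RecordBatch and sent in a single request.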
Run offline inference over Parquet, Iceberg, Delta, and Arrow pipelines. Add prediction columns, generate features, and write enriched datasets.
Handle lidar point cloud batches, sensor windows, telemetry anomaly detection, fleet analytics, and perception pipeline support.
Not intended for hard real-time control loops.
Support genomics, manufacturing sensors, energy grids, finance time series, climate data, and simulation outputs.
| Backend | Status | Notes |
|---|---|---|
| onnx | Implemented / feature-gated | Uses ort 2.0.0-rc.12; supports Float32/Int64 inputs with borrowed Arrow-buffer fast paths for compatible non-null layouts. |
| torch-worker | Planned | External Python/C++ worker over IPC/gRPC/Arrow IPC planned. |
| python-worker | Planned | Mosec-style worker process for pickle/sklearn/Python models. |
TBD.