aagumin/iskander

Iskander

Experimental MVP: a high-performance Arrow Flight ML inference server for batch, streaming, and data-native workloads.

Status

This project is an experimental foundation, not a production-ready inference platform. The current MVP covers a single path over Apache Arrow Flight: Arrow RecordBatch in, schema validation, resource limits, ONNX Runtime inference, and Arrow RecordBatch out.

What It Is

Iskander is a Rust inference server built around Apache Arrow Flight. Clients send typed Arrow batches, the server validates a model schema contract, applies security and resource limits, dispatches to a pluggable backend, and returns typed Arrow batches.
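
As a rough illustration of the validation step described above, the following sketch checks an incoming batch's columns against a model contract and a row limit. It uses only the Rust standard library. `ColumnType`, `Column`, `ContractError`, `validate_batch`, and the `max_rows` limit are hypothetical names invented for this example; they are not Iskander's actual types, and the real server presumably works with the `arrow` crate's schema types instead.

```rust
/// Simplified stand-in for an Arrow data type (hypothetical, for illustration).
#[derive(Debug, Clone, PartialEq)]
pub enum ColumnType {
    Utf8,
    Float32,
    Int64,
    FixedSizeListF32(usize),
}

/// One column of a schema: name, type, and whether nulls are allowed.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub name: String,
    pub ty: ColumnType,
    pub nullable: bool,
}

/// Typed rejection reasons (hypothetical error set).
#[derive(Debug, PartialEq)]
pub enum ContractError {
    MissingColumn(String),
    TypeMismatch { column: String, expected: ColumnType, found: ColumnType },
    UnexpectedNulls(String),
    TooManyRows { limit: usize, found: usize },
}

/// Small constructor to keep call sites short.
pub fn col(name: &str, ty: ColumnType, nullable: bool) -> Column {
    Column { name: name.to_string(), ty, nullable }
}

/// Check the resource limit first, then every contract column by name.
/// Extra columns in `batch` are ignored in this sketch.
pub fn validate_batch(
    contract: &[Column],
    batch: &[Column],
    rows: usize,
    max_rows: usize,
) -> Result<(), ContractError> {
    if rows > max_rows {
        return Err(ContractError::TooManyRows { limit: max_rows, found: rows });
    }
    for expected in contract {
        let found = batch
            .iter()
            .find(|c| c.name == expected.name)
            .ok_or_else(|| ContractError::MissingColumn(expected.name.clone()))?;
        if found.ty != expected.ty {
            return Err(ContractError::TypeMismatch {
                column: expected.name.clone(),
                expected: expected.ty.clone(),
                found: found.ty.clone(),
            });
        }
        if found.nullable && !expected.nullable {
            return Err(ContractError::UnexpectedNulls(expected.name.clone()));
        }
    }
    Ok(())
}
```

Rejecting a batch before it reaches the backend keeps failures typed and cheap; whether Iskander ignores or rejects extra columns is not stated here, so that choice is an assumption of the sketch.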

For ONNX models, the server can load an Arrow schema manifest next to the model artifact. That manifest defines the Arrow-facing input/output contract, lets the backend compile tensor mappings at model load time, and gives clients a stable schema contract instead of relying on runtime shape guessing.

The target workloads are embeddings, recommendations, tabular scoring, micro-batch streaming, lakehouse inference, ETL feature generation, and scientific or industrial batch analytics.

What It Is Not

It is not a generic Triton replacement, REST-first prediction API, LLM token streaming server, hard real-time robotics control loop, or universal inference server for every model family.

Why Arrow Flight

Arrow Flight carries columnar data over gRPC without converting every request into JSON or row-oriented payloads. It fits systems already built on Arrow, Parquet, DataFusion, Polars, Spark, Iceberg, Delta, feature stores, or vector-indexing pipelines.

Why Rust

Rust gives this project memory safety, typed errors, async networking, strong Arrow ecosystem support, and clean integration points for model backends. ONNX Runtime support is implemented through the ort crate, while safetensors is used for safe tensor artifact metadata and loading rather than compute.

Zero-Copy Aware Design

The design keeps Arrow buffers as the transport and validation representation. It does not claim full zero-copy inference. The ONNX backend has an input fast path that borrows non-null Arrow float32 / int64 primitive and fixed-size-list buffers as ORT tensor views, and it uses ONNX Runtime I/O binding with preallocated float32 outputs for known output shapes. Copies may still happen during Flight decode, nullable input handling, CPU-to-GPU transfer, and final Arrow output materialization. Copy boundaries are documented in docs/memory_model.md.
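
The borrow-or-copy decision described above can be sketched with `std::borrow::Cow`. This is a simplified model, not the backend's code: `F32Column`, `tensor_view`, the boolean validity mask, and the idea of replacing nulls with a `fill` value are all assumptions made for illustration. Real Arrow validity is a packed bitmap, and the actual null-handling policy is documented in docs/memory_model.md.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for an Arrow float32 column: a values buffer plus
/// an optional validity mask (`None` means the column carries no nulls).
pub struct F32Column<'a> {
    pub values: &'a [f32],
    pub validity: Option<&'a [bool]>,
}

/// Return tensor-ready data. Columns without nulls are borrowed as-is
/// (no copy); columns with nulls are copied with nulls replaced by `fill`.
pub fn tensor_view<'a>(col: &F32Column<'a>, fill: f32) -> Cow<'a, [f32]> {
    match col.validity {
        // Fast path: no validity mask at all.
        None => Cow::Borrowed(col.values),
        // Fast path: a mask exists but every slot is valid.
        Some(mask) if mask.iter().all(|&valid| valid) => Cow::Borrowed(col.values),
        // Slow path: materialize an owned buffer with nulls filled in.
        Some(mask) => Cow::Owned(
            col.values
                .iter()
                .zip(mask)
                .map(|(&value, &valid)| if valid { value } else { fill })
                .collect(),
        ),
    }
}
```

Returning `Cow` makes the copy boundary visible in the type: callers can tell from the variant whether the data still aliases the original buffer.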

The current runtime is optimized for data-native batch and micro-batch inference. It is strongest when clients already hold Arrow-compatible columnar data and want Arrow-compatible results back. It is not optimized for single-row REST latency.

Quickstart

Recommended quickstart (MovieLens two-tower model from devmodels):

uv run devmodels/movielens_two_tower/train_two_tower.py
cargo run --release -p iskander-server --features onnx -- \
  --config examples/movielens-two-tower/config.toml

Default server address:

127.0.0.1:50051

The config points the Rust server directly at the exported ONNX artifact.

[[models]]
name = "movielens-two-tower"
backend = "onnx"
path = "devmodels/movielens_two_tower/artifacts/two_tower.onnx"
execution_providers = ["cpu"]
intra_threads = 4
inter_threads = 1
optimization_level = "level3"
parallel_execution = false
memory_pattern = true

Proof Of Concept Benchmark Snapshot

As a local CPU-only proof of concept, the repository now includes a matched-baseline open-loop benchmark harness in crates/iskander-bench. One representative movielens-two-tower run used:

  • batch size: 128
  • offered load: 3000 req/s
  • duration: 90s
  • zero drops and zero errors for all three SUTs

| SUT | Transport | p50 (us) | p95 (us) | p99 (us) | p99.9 (us) | avg (us) |
|---|---|---|---|---|---|---|
| Iskander | Arrow Flight | 709 | 1181 | 3427 | 8746 | 815 |
| Iskander | OIP v2 gRPC | 941 | 1533 | 5285 | 14011 | 1144 |
| Triton + ORT | OIP v2 gRPC | 1125 | 3550 | 7698 | 16555 | 1536 |

In this snapshot, all three systems sustained the same fixed offered load, and Iskander over Arrow Flight showed the lowest client-observed latency. Iskander over OIP v2 also stayed below Triton on the same request shape and ORT-aligned CPU baseline.

These numbers are a proof-of-concept snapshot, not a universal claim. They come from one local matched-baseline setup and should be read together with the methodology in BENCHMARK.md and the raw artifacts in results/benchmarks/2026-05-08-matched-baseline/.
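
For readers unfamiliar with tail-latency columns, here is a minimal sketch of how percentiles such as p50 and p99 can be derived from raw latency samples, together with the arithmetic implied by the run parameters. It uses the nearest-rank method, which is an assumption: the harness in crates/iskander-bench may use a different estimator (for example a histogram), and BENCHMARK.md remains the authoritative description. The function names are hypothetical.

```rust
/// Nearest-rank percentile over latency samples in microseconds.
/// `q` must be in (0, 100]; returns `None` for empty input or invalid `q`.
pub fn percentile_us(samples: &[u64], q: f64) -> Option<u64> {
    if samples.is_empty() || !(q > 0.0 && q <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: the smallest 1-based rank covering q percent of samples.
    let rank = (q * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Requests issued by an open-loop generator at a fixed rate.
pub fn total_requests(rate_per_s: u64, duration_s: u64) -> u64 {
    rate_per_s * duration_s
}

/// Rows offered per second if every request carries a full batch.
pub fn rows_per_second(rate_per_s: u64, batch_size: u64) -> u64 {
    rate_per_s * batch_size
}
```

Under these assumptions, 3000 req/s for 90 s corresponds to 270,000 requests per system, and a batch size of 128 corresponds to 384,000 rows offered per second.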

Use Cases

A. Batch Embeddings

Generate embeddings for documents, products, users, or images; run batch re-indexing; feed vector databases, feature stores, or lakehouse tables.

Input: id: utf8, text: utf8

Output: id: utf8, embedding: fixed_size_list<float32>[768]
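
A fixed-size-list column such as the `embedding` output stores its child values in one contiguous buffer, so row `i` occupies elements `i * dim` through `(i + 1) * dim - 1`. The sketch below shows that indexing over a plain `&[f32]`; `embedding_row` and `row_count` are hypothetical helpers written for this example, and the `arrow` crate exposes its own accessors for the same layout.

```rust
/// Number of rows in a flat fixed-size-list values buffer, or `None` if the
/// buffer length is not a whole multiple of `dim`.
pub fn row_count(values: &[f32], dim: usize) -> Option<usize> {
    if dim == 0 || values.len() % dim != 0 {
        return None;
    }
    Some(values.len() / dim)
}

/// Borrow row `row` of a fixed_size_list<float32>[dim] column without copying.
/// Returns `None` when the layout is invalid or the row is out of range.
pub fn embedding_row(values: &[f32], dim: usize, row: usize) -> Option<&[f32]> {
    row_count(values, dim)?;
    let start = row.checked_mul(dim)?;
    let end = start.checked_add(dim)?;
    values.get(start..end)
}
```

With `dim = 768` and float32 values, each row is 3072 bytes, which is why borrowing the buffer rather than copying it matters for large batches.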

B. Tabular Batch Scoring

Use for churn prediction, fraud scoring, credit risk, lead scoring, price prediction, and demand forecasting.

Input: entity_id: utf8, feature_1: float32, feature_2: float32, feature_n: float32

Output: entity_id: utf8, score: float32, label: utf8

C. Recommendation Reranking

Score user-item candidates for feeds, ads, marketplaces, and matching systems.

Input: user_id: utf8, item_id: utf8, user_features: struct, item_features: struct

Output: user_id: utf8, item_id: utf8, relevance_score: float32, rank: int32
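
One plausible way to derive the `rank` output column from `relevance_score` is a 1-based, per-user ranking in descending score order. The README does not specify how `rank` is computed, so the semantics below (per-user grouping, ties broken by input order) and the function name `rank_per_user` are assumptions for illustration only.

```rust
use std::collections::HashMap;

/// Assign a 1-based rank to each row within its user, highest score first.
/// `user_ids` and `scores` are parallel columns of equal length.
pub fn rank_per_user(user_ids: &[&str], scores: &[f32]) -> Vec<i32> {
    assert_eq!(user_ids.len(), scores.len(), "columns must have equal length");

    // Group row indices by user.
    let mut rows_by_user: HashMap<&str, Vec<usize>> = HashMap::new();
    for (row, user) in user_ids.iter().enumerate() {
        rows_by_user.entry(*user).or_default().push(row);
    }

    let mut ranks = vec![0i32; scores.len()];
    for rows in rows_by_user.values_mut() {
        // Stable sort, descending by score; ties keep input order.
        rows.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
        for (position, &row) in rows.iter().enumerate() {
            ranks[row] = position as i32 + 1;
        }
    }
    ranks
}
```

The result stays aligned with the input rows, so it can be appended as a new column without reordering the batch.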

D. Streaming Micro-Batch Inference

Apply inference to fraud detection, anomaly detection, IoT telemetry, clickstream scoring, and monitoring streams.

This server is designed for micro-batches, not single-event REST latency.
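
On the client side, micro-batching usually means buffering events until either a row threshold or a wait deadline is reached, then sending the buffered rows as one batch. The sketch below shows that policy with the standard library only. `MicroBatcher`, `max_rows`, and `max_wait` are hypothetical names, and this is not part of Iskander's API; a real client would convert each flushed `Vec` into an Arrow RecordBatch before sending it.

```rust
use std::time::{Duration, Instant};

/// Buffers events and releases them as batches by size or by age.
pub struct MicroBatcher<T> {
    max_rows: usize,
    max_wait: Duration,
    pending: Vec<T>,
    first_at: Option<Instant>,
}

impl<T> MicroBatcher<T> {
    pub fn new(max_rows: usize, max_wait: Duration) -> Self {
        Self { max_rows, max_wait, pending: Vec::new(), first_at: None }
    }

    /// Add one event; returns a batch once `max_rows` events are buffered.
    pub fn push(&mut self, event: T, now: Instant) -> Option<Vec<T>> {
        if self.pending.is_empty() {
            // Remember when the oldest buffered event arrived.
            self.first_at = Some(now);
        }
        self.pending.push(event);
        if self.pending.len() >= self.max_rows {
            self.take()
        } else {
            None
        }
    }

    /// Flush a partial batch once the oldest event has waited `max_wait`.
    pub fn poll(&mut self, now: Instant) -> Option<Vec<T>> {
        match self.first_at {
            Some(first) if now.duration_since(first) >= self.max_wait => self.take(),
            _ => None,
        }
    }

    /// Hand out the buffered events and reset the timer.
    fn take(&mut self) -> Option<Vec<T>> {
        self.first_at = None;
        if self.pending.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.pending))
        }
    }
}
```

Passing `now` in explicitly keeps the batcher deterministic and easy to test; the deadline bounds how long a single event can wait when traffic is sparse.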

E. Lakehouse / ETL Inference

Run offline inference over Parquet, Iceberg, Delta, and Arrow pipelines. Add prediction columns, generate features, and write enriched datasets.

F. Robotics / Sensor Analytics

Handle lidar point cloud batches, sensor windows, telemetry anomaly detection, fleet analytics, and perception pipeline support.

Not intended for hard real-time control loops.

G. Scientific / Industrial Data

Support genomics, manufacturing sensors, energy grids, finance time series, climate data, and simulation outputs.

Backend Support Matrix

| Backend | Status | Notes |
|---|---|---|
| onnx | Implemented / feature-gated | Uses ort 2.0.0-rc.12; supports Float32/Int64 inputs with borrowed Arrow-buffer fast paths for compatible non-null layouts. |
| torch-worker | Planned | External Python/C++ worker over IPC/gRPC/Arrow IPC planned. |
| python-worker | Planned | Mosec-style worker process for pickle/sklearn/Python models. |

License

TBD.
