A C++ task scheduler for AI inference on Apple Silicon. Prioritizes real-time LLM requests over batch work, overlaps SSD reads with GPU compute, and hot-swaps models without downtime.

Concurrent request scheduling (Llama-3.2-1B-Instruct-4bit, 6 clients):

| Metric | Naive FIFO | Rais | Speedup |
|---|---|---|---|
| Interactive TTFT | 4,829 ms | 1,438 ms | 3.4x |
| Interactive E2E | 5,653 ms | 2,254 ms | 2.5x |

Layer-streaming throughput (IO/compute overlapped):

| Model | Naive | Rais | Speedup |
|---|---|---|---|
| SmolLM2-135M (257 MB) | 157 tok/s | 188 tok/s | 1.20x |
| TinyLlama-1.1B (2.1 GB) | 15.5 tok/s | 17.8 tok/s | 1.15x |
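
The overlap behind these numbers can be illustrated with a minimal sketch: while layer `i` is being computed, the read for layer `i + 1` is already in flight on another thread. This is not the project's code. It uses a single lookahead (double buffering) via `std::async`, whereas Rais is described as triple-buffered, and `load_layer`, `compute_layer`, and `run_streamed` are hypothetical stand-ins for SSD reads and GPU dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

namespace sketch {

// Hypothetical stand-in: a "layer" is just a small vector of weights.
using Layer = std::vector<float>;

// Pretends to read layer `index` from disk (a real loader would do file IO).
inline Layer load_layer(std::size_t index) {
    return Layer(4, static_cast<float>(index + 1));
}

// Pretends to run one layer: adds every weight to the activation.
inline float compute_layer(const Layer& weights, float activation) {
    assert(!weights.empty());  // an empty layer would mean a failed read
    for (float w : weights) activation += w;
    return activation;
}

// Runs `layer_count` layers, prefetching layer i+1 while layer i computes.
inline float run_streamed(std::size_t layer_count, float activation) {
    if (layer_count == 0) return activation;
    // Issue the first read before entering the loop.
    std::future<Layer> pending =
        std::async(std::launch::async, load_layer, std::size_t{0});
    for (std::size_t i = 0; i < layer_count; ++i) {
        Layer current = pending.get();  // wait for the read issued earlier
        if (i + 1 < layer_count) {
            // Start the next read so it overlaps with the compute below.
            pending = std::async(std::launch::async, load_layer, i + 1);
        }
        activation = compute_layer(current, activation);
    }
    return activation;
}

}  // namespace sketch
```

With one lookahead, the per-layer cost approaches the larger of the read time and the compute time rather than their sum, which is why the gain depends on how evenly the two are balanced.
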

Build and run the priority example:

```sh
git clone https://github.com/deepsoftworks/rais.git && cd rais
./install.sh
cmake --build build --target priority_example
./build/priority_example
```

Submit work to a lane from C++:

```cpp
rais::Scheduler sched;
sched.submit([&] {
    generate(prompt);
}, rais::Lane::Interactive);
```

Build with Python bindings and check that the module imports:

```sh
WITH_PYTHON=1 ./install.sh
PYTHONPATH=build python3 -c "import rais; print(rais.Scheduler)"
```

Five priority lanes:

| Lane | Purpose |
|---|---|
| Interactive | Real-time user requests (< 5 ms submit-to-start) |
| Background | Model hot-swap, logging, embeddings |
| Bulk | Batch jobs, eval runs |
| GPU | Metal compute dispatch |
| IO | Dedicated threads for SSD weight reads |

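
To make the lane idea concrete, here is a minimal single-threaded sketch of strict lane priority combined with starvation promotion, which the project lists among its internals. It is an illustration under assumptions, not Rais's implementation: it models only the three CPU lanes, and `sketch::LaneQueue`, `promote_after`, and the skip-count policy are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

namespace sketch {

// Hypothetical subset of lanes, highest priority first (not rais's enum).
enum class Lane : std::size_t { Interactive = 0, Background = 1, Bulk = 2 };
constexpr std::size_t kLaneCount = 3;

class LaneQueue {
public:
    // promote_after: how many times a non-empty lane may be passed over
    // before it is served ahead of higher-priority work (assumed policy).
    explicit LaneQueue(std::size_t promote_after)
        : promote_after_(promote_after) {}

    void submit(std::function<void()> task, Lane lane) {
        queues_[static_cast<std::size_t>(lane)].push_back(std::move(task));
    }

    bool empty() const {
        for (const auto& q : queues_) {
            if (!q.empty()) return false;
        }
        return true;
    }

    // Runs one task and returns the lane it came from. Requires !empty().
    Lane run_one() {
        assert(!empty());
        std::size_t pick = kLaneCount;
        // 1. A lane that has been skipped too often is served first.
        for (std::size_t i = 0; i < kLaneCount; ++i) {
            if (!queues_[i].empty() && skipped_[i] >= promote_after_) {
                pick = i;
                break;
            }
        }
        // 2. Otherwise take the highest-priority non-empty lane.
        if (pick == kLaneCount) {
            for (std::size_t i = 0; i < kLaneCount; ++i) {
                if (!queues_[i].empty()) { pick = i; break; }
            }
        }
        // 3. Every other lane with waiting work was passed over once more.
        for (std::size_t i = 0; i < kLaneCount; ++i) {
            if (i != pick && !queues_[i].empty()) ++skipped_[i];
        }
        skipped_[pick] = 0;
        std::function<void()> task = std::move(queues_[pick].front());
        queues_[pick].pop_front();
        task();
        return static_cast<Lane>(pick);
    }

private:
    std::array<std::deque<std::function<void()>>, kLaneCount> queues_;
    std::array<std::size_t, kLaneCount> skipped_{};
    std::size_t promote_after_;
};

}  // namespace sketch
```

With `promote_after = 2`, four Interactive tasks, and one Bulk task, the Bulk task runs third instead of last: it is passed over twice and then promoted. A production scheduler would more likely promote on elapsed wait time than on a skip count.
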
Key internals: lock-free MPMC ring + Chase-Lev work-stealing deques, earliest-deadline-first scheduling, starvation promotion, triple-buffered layer streaming, slab allocator (~83ns/alloc).
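
As an illustration of earliest-deadline-first ordering, the sketch below keeps tasks in a min-heap keyed by deadline and always runs the one due soonest. It is a single-threaded approximation under assumptions: `sketch::EdfQueue` and its methods are hypothetical names, and it uses `std::priority_queue` rather than the lock-free structures listed above.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

namespace sketch {

using Clock = std::chrono::steady_clock;

struct Task {
    Clock::time_point deadline;   // when the result is needed
    std::function<void()> work;
};

// Comparator that puts the task with the earliest deadline on top of the heap.
struct LaterDeadline {
    bool operator()(const Task& a, const Task& b) const {
        return a.deadline > b.deadline;
    }
};

class EdfQueue {
public:
    void submit(Clock::time_point deadline, std::function<void()> work) {
        heap_.push(Task{deadline, std::move(work)});
    }

    bool empty() const { return heap_.empty(); }

    // Runs the task whose deadline is soonest. Requires !empty().
    void run_next() {
        assert(!empty());
        // top() returns a const reference, so copy the callable before popping.
        std::function<void()> work = heap_.top().work;
        heap_.pop();
        work();
    }

private:
    std::priority_queue<Task, std::vector<Task>, LaterDeadline> heap_;
};

}  // namespace sketch
```

Tasks submitted with deadlines of 50 ms, 5 ms, and 20 ms from now run in the order 5, 20, 50, regardless of submission order.
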
Works with MLX/mlx-lm, llama.cpp, and PyTorch. See examples/ for integration patterns:

- `examples/minimal_submit.cpp` -- basic scheduler usage
- `examples/llama_cpp_integration.cpp` -- llama.cpp integration
- `examples/rais_server.cpp` -- server mode

Requires macOS on Apple Silicon (M1+), CMake 3.20+, Xcode CLI tools, Catch2 v3.

```sh
brew install catch2
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
ctest --test-dir build --output-on-failure
```

License: MIT
