The MVP applies request limits before backend execution:
- max_batch_rows
- max_batch_bytes
- max_columns
- max_string_bytes
- max_nested_depth
- request_timeout_ms
These limits reduce accidental OOM risk and reject malformed or unexpectedly large Arrow payloads before they reach model runtimes. The byte estimate is derived from Arrow array memory size, so treat it as a guardrail rather than exact allocator accounting.
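As a rough illustration of a pre-execution check, the sketch below compares summary statistics of a decoded batch against the configured limits and reports the first violation. The `RequestLimits`, `BatchStats`, and `LimitViolation` types and the `check_limits` function are hypothetical names chosen for this example, not Iskander's actual API. How the statistics are computed from an Arrow batch is left out.

```rust
/// Hypothetical limits struct; field names mirror the configured limits.
#[derive(Debug, Clone)]
pub struct RequestLimits {
    pub max_batch_rows: usize,
    pub max_batch_bytes: usize,
    pub max_columns: usize,
    pub max_string_bytes: usize,
    pub max_nested_depth: usize,
}

/// Hypothetical summary of a decoded batch. `estimated_bytes` stands in for
/// the Arrow array memory size estimate, which is only a guardrail.
#[derive(Debug, Clone)]
pub struct BatchStats {
    pub rows: usize,
    pub estimated_bytes: usize,
    pub columns: usize,
    pub longest_string_bytes: usize,
    pub nested_depth: usize,
}

/// Names the limit that was exceeded, with the observed and allowed values.
#[derive(Debug, PartialEq, Eq)]
pub struct LimitViolation {
    pub limit: &'static str,
    pub got: usize,
    pub max: usize,
}

/// Returns the first violated limit, or `Ok(())` if the batch is within bounds.
pub fn check_limits(stats: &BatchStats, limits: &RequestLimits) -> Result<(), LimitViolation> {
    let checks = [
        ("max_batch_rows", stats.rows, limits.max_batch_rows),
        ("max_batch_bytes", stats.estimated_bytes, limits.max_batch_bytes),
        ("max_columns", stats.columns, limits.max_columns),
        ("max_string_bytes", stats.longest_string_bytes, limits.max_string_bytes),
        ("max_nested_depth", stats.nested_depth, limits.max_nested_depth),
    ];
    for (limit, got, max) in checks {
        if got > max {
            return Err(LimitViolation { limit, got, max });
        }
    }
    Ok(())
}
```

Returning the name of the violated limit along with the observed value makes rejected requests easier to diagnose and to count per limit.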
request_timeout_ms is a scheduler wait timeout. It bounds how long the async server path waits for a backend result. For ONNX, Iskander also passes the timeout into ORT RunOptions and triggers RunOptions::terminate() when the timer expires. This is cooperative cancellation inside ONNX Runtime, not a hard kill. External worker backends should use process-level cancellation or worker recycling for stronger isolation.
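The difference between a wait timeout and a hard kill can be sketched with the standard library alone. In the example below, the caller waits a bounded time for a result and, on expiry, only sets a flag that the work is expected to poll. This is an analogy for the cooperative pattern, not the ONNX Runtime integration itself: `run_with_timeout` and `RunError` are hypothetical names, and a plain thread and `AtomicBool` stand in for the async server path and the terminate signal.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq, Eq)]
pub enum RunError {
    /// The caller stopped waiting and requested cancellation.
    TimedOut,
    /// The worker exited (for example, panicked) without sending a result.
    WorkerGone,
}

/// Runs `work` on a worker thread and waits at most `timeout` for its result.
/// On expiry the cancel flag is set; the work must poll it to stop early.
/// Nothing is forcibly killed, so work that ignores the flag keeps running.
pub fn run_with_timeout<T, F>(timeout: Duration, work: F) -> Result<T, RunError>
where
    T: Send + 'static,
    F: FnOnce(Arc<AtomicBool>) -> T + Send + 'static,
{
    let cancel = Arc::new(AtomicBool::new(false));
    let worker_cancel = Arc::clone(&cancel);
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        // A send error only means the caller already stopped waiting.
        let _ = tx.send(work(worker_cancel));
    });

    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => {
            cancel.store(true, Ordering::SeqCst);
            Err(RunError::TimedOut)
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(RunError::WorkerGone),
    }
}
```

Because the worker thread is never killed, the caller's latency is bounded but the worker's resource usage is not. That gap is why process-level cancellation or worker recycling gives stronger isolation.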
Malformed Arrow payload handling currently relies on Arrow Flight and Arrow IPC decoding errors. Future hardening should add fuzz tests and stricter error reporting.
Planned controls:
- TLS and mTLS.
- Authentication and authorization.
- Model-level permissions.
- Model manifest signing or checksum verification.
- Per-tenant quotas.
- Per-model concurrency limits.
- Backpressure and queue limits.
- Metrics for rejected requests and timeout rates.
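None of these controls exist yet, so the following is only one possible shape for the backpressure and queue limit item. It assumes a bounded queue that rejects new work immediately when full instead of blocking the caller. `BoundedQueue` and `SubmitError` are hypothetical names, and the standard library's `sync_channel` stands in for whatever scheduler queue the server ends up using.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

#[derive(Debug, PartialEq, Eq)]
pub enum SubmitError {
    /// The queue is at capacity; the request should be rejected, not parked.
    QueueFull,
    /// The consumer side has been dropped.
    ShuttingDown,
}

/// Producer handle for a fixed-capacity job queue.
pub struct BoundedQueue<T> {
    tx: SyncSender<T>,
}

impl<T> BoundedQueue<T> {
    /// Creates the queue and returns the receiving end for the worker.
    /// `capacity` should be at least 1; a capacity of 0 makes `sync_channel`
    /// a rendezvous channel, where `try_submit` succeeds only while a
    /// receiver is actively waiting.
    pub fn new(capacity: usize) -> (Self, Receiver<T>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx }, rx)
    }

    /// Enqueues a job without blocking, or reports why it was refused.
    pub fn try_submit(&self, job: T) -> Result<(), SubmitError> {
        match self.tx.try_send(job) {
            Ok(()) => Ok(()),
            Err(TrySendError::Full(_)) => Err(SubmitError::QueueFull),
            Err(TrySendError::Disconnected(_)) => Err(SubmitError::ShuttingDown),
        }
    }
}
```

Rejecting at submission time keeps queue memory bounded and gives the rejected-request metric a single place to be counted.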
ONNX Runtime runs in-process. It should be treated as trusted native code loaded by the server operator. The Arrow schema manifest constrains the Arrow-facing contract but does not sandbox the model runtime.
SafeTensors support is metadata/artifact loading, not execution. Torch and arbitrary Python/pickle models should run in external worker processes so Python runtime failures, C++ ABI issues, and GPU library conflicts do not compromise the Rust server process.