tinyorca is a minimal implementation of an Orca-style LLM serving engine.
It focuses on iteration-level scheduling and selective batching for mixed prefill and decode workloads.
Both demos below use the same setup:
- max_batch_size=2
- 5 concurrent requests
- 2 requests (req-0, req-2) intentionally much shorter than the others
In the baseline, the first admitted batch is effectively pinned until its slowest request completes. Even if one request finishes early, that vacant spot is not turned into useful work right away, so later requests keep waiting.
In tinyorca, scheduling happens at iteration granularity instead of request granularity.
When a short request (e.g. "Hi") finishes, its slot is reused on the next iteration, so waiting requests can join without waiting for the longest request in the current batch to finish. This keeps each iteration closer to the maximum batch size, which improves throughput.
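The difference between the two policies can be illustrated with a toy simulation. This is a minimal sketch, not tinyorca's actual scheduler: the function names `static_batching` and `iteration_level` and the request lengths are hypothetical, every iteration is assumed to cost one time unit regardless of batch size, and prefill is ignored.

```python
from collections import deque


def static_batching(lengths, max_batch_size):
    """Request-level baseline: a batch is held until its slowest request ends."""
    requests = list(enumerate(lengths))
    finish, now = {}, 0
    for start in range(0, len(requests), max_batch_size):
        batch = requests[start:start + max_batch_size]
        for rid, n in batch:
            finish[rid] = now + n          # a short request may finish early...
        now += max(n for _, n in batch)    # ...but its slot stays idle until here
    return now, finish


def iteration_level(lengths, max_batch_size):
    """Iteration-level scheduling: free slots are refilled before every step."""
    waiting = deque(enumerate(lengths))
    running = {}                           # request id -> remaining decode steps
    finish, now = {}, 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < max_batch_size:
            rid, n = waiting.popleft()
            running[rid] = n
        now += 1                           # one model iteration for the whole batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finish[rid] = now
                del running[rid]
    return now, finish


# Hypothetical decode lengths: req-0 and req-2 are much shorter than the rest.
lengths = [2, 8, 2, 8, 8]
print(static_batching(lengths, max_batch_size=2)[0])   # 24 iterations
print(iteration_level(lengths, max_batch_size=2)[0])   # 16 iterations
```

Under these assumptions the same 28 token-steps of work take 24 iterations with request-level batching and 16 with iteration-level scheduling, because the slots freed by req-0 and req-2 are refilled immediately.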
For a deeper walkthrough of the paper and this implementation, see: Understanding Orca through tinyorca
uv venv
uv sync
uv run python -m tinyorca.example

from tinyorca import OrcaConfig, OrcaServe, SamplingConfig

serve = OrcaServe(
    OrcaConfig(
        model="Qwen/Qwen3-0.6B",
        max_batch_size=2,
        sampling=SamplingConfig(max_new_tokens=32),
    )
)
for event in serve.generate(["Hello", "Hi."]):
    print(event.request.request_id, event.token_id)

uv run python -m bench

By default, the benchmark runs two synthetic workloads:
- equal_size: 8 requests of (128, 128)
- short_long_mix: interleaved short (32, 32) and long (512, 128) requests
To run just one workload:
uv run python -m bench --workload short_long_mix

