MLX Batch Server

High-performance local AI inference server for Apple Silicon with batch processing

MLX Batch Server is a production-grade inference server optimized for Apple Silicon, featuring concurrent batch processing, the OpenAI Responses API, and a Harmony parser for GPT-OSS models.

Features • Quick Start • API Reference • Configuration

Origin & Acknowledgments

This project is a standalone fork of mlx-omni-server by @madroidmaq, whose excellent work laid the foundation for local MLX inference with OpenAI/Anthropic API compatibility.

LibraxisAI extended the original project with:

Batch inference coordinator (10+ concurrent requests)
Full OpenAI Responses API (/v1/responses)
Streaming Harmony parser for GPT-OSS models
Production hardening for 24/7 operation

We maintain this as a separate project due to significant architectural divergence, while continuing to contribute improvements back to the upstream project where applicable.

Features

Feature	Description
Batch Processing	Handle 10+ concurrent requests via mlx-lm BatchGenerator
Responses API	Full OpenAI `/v1/responses` with SSE streaming
Harmony Parser	Native GPT-OSS model support with channel parsing
Dual API	Compatible with OpenAI and Anthropic SDKs
Model Management	Dynamic load/unload endpoints
Privacy-First	All processing happens locally on your Mac

What's Different From Upstream

├── Batch Coordinator      → Concurrent request batching (NEW)
├── /v1/responses          → OpenAI Responses API (NEW)
├── Harmony Streaming      → GPT-OSS channel parser (NEW)
├── /v1/models/load        → Dynamic model loading (NEW)
├── /v1/models/unload      → Model unloading (NEW)
└── Production Config      → Environment-based settings (NEW)

Quick Start

Installation

# Clone
git clone https://github.com/LibraxisAI/mlx-batch-server.git
cd mlx-batch-server

# Core install (inference only)
uv sync
# Or
pip install -e .

# Full surface (auth + operator UI)
uv sync --extra auth --extra operator
# Or
pip install -e ".[auth,operator]"

Extra	Pulls	Enables
`auth`	`redis`, `pyjwt`	Session auth + Redis-backed API keys/HMAC + rate limiting
`operator`	`click`, `jinja2`, `python-multipart`, `ruamel.yaml`	`mlx-batch-operator` CLI + htmx admin UI

Local development uses the editable sibling dependency ../mlx-vlm-local, so upstream-facing mlx-vlm fixes land in the server immediately after uv sync.

Run the Server

# Default (port 10240)
mlx-batch-server

# Explicit port (10240 is the default; substitute your own)
mlx-batch-server --port 10240

# With debug logging
MLX_BATCH_LOG_LEVEL=debug mlx-batch-server

Test It

# Chat completion
curl http://localhost:10240/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Responses API (streaming)
curl http://localhost:10240/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3-0.6B-4bit",
    "input": [{"role": "user", "content": [{"type": "input_text", "text": "Hello!"}]}],
    "stream": true
  }'
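
For callers who prefer Python to curl, the two requests above can be sketched with only the standard library. The helper names (`build_chat_request`, `build_responses_request`, `post_json`) are illustrative and not part of the server; the base URL assumes the default port, and `post_json` handles only non-streaming JSON replies.

```python
import json
import urllib.request

BASE_URL = "http://localhost:10240"  # default port from "Run the Server"


def build_chat_request(model: str, prompt: str) -> dict:
    """Body for POST /v1/chat/completions, mirroring the curl example."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def build_responses_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Body for POST /v1/responses, mirroring the curl example."""
    return {
        "model": model,
        "input": [
            {"role": "user", "content": [{"type": "input_text", "text": prompt}]}
        ],
        "stream": stream,
    }


def post_json(path: str, body: dict, base_url: str = BASE_URL) -> dict:
    """POST a JSON body and decode a non-streaming JSON reply (hypothetical helper)."""
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `post_json("/v1/chat/completions", build_chat_request("mlx-community/Qwen3-0.6B-4bit", "Hello!"))` should return the decoded completion. Streaming responses arrive as SSE and need a line-by-line reader instead.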

Preparing Qwen3.6-VL-30B

The local mlx-vlm dependency already recognizes the qwen3.6-vl and qwen3.6-vl-moe aliases during conversion. To expose a converted qwen3.6-vl-30b model through the server, point MODEL_ALIASES (or an in-process runtime alias) at the converted model's path or repo id.

API Reference

OpenAI Compatible (`/v1/*`)

Endpoint	Description	Status
`POST /v1/responses`	Responses API with SSE streaming	Stable
`POST /v1/chat/completions`	Chat with tools, streaming, structured output	Stable
`GET /v1/batch/stats`	Batch coordinator statistics	Stable
`POST /v1/models/load`	Dynamic model loading	Stable
`POST /v1/models/unload`	Model unloading	Stable
`POST /v1/audio/speech`	Text-to-Speech	Stable
`POST /v1/audio/transcriptions`	Speech-to-Text (Whisper)	Stable
`POST /v1/images/generations`	Image Generation	Stable
`POST /v1/embeddings`	Text Embeddings	Stable
`GET /v1/models`	List available models	Stable

Anthropic Compatible (`/anthropic/v1/*`)

Endpoint	Description	Status
`POST /anthropic/v1/messages`	Messages with tools, streaming, thinking	Stable
`GET /anthropic/v1/models`	Model listing with pagination	Stable

Configuration

Environment Variables

Variable	Description	Default
`MLX_BATCH_LOG_LEVEL`	Logging level (`debug`, `info`, `warning`)	`info`
`MLX_BATCH_CORS`	CORS origins (comma-separated)	`*`
`MLX_BATCH_ENABLE_BATCH`	Enable batch inference	`true`
`MLX_BATCH_BATCH_WINDOW_MS`	Batch collection window (ms)	`50`
`MLX_BATCH_MAX_BATCH_SIZE`	Maximum concurrent requests	`10`
`MLX_BATCH_DEFAULT_MODEL`	Model to load on startup	-
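
To make the defaults in the table concrete, here is a minimal sketch of how these variables could be read. The `BatchServerSettings` class and `load_settings` function are hypothetical and do not describe the server's actual configuration code; the boolean parsing rule in particular is an assumption.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class BatchServerSettings:
    log_level: str
    cors_origins: List[str]
    enable_batch: bool
    batch_window_ms: int
    max_batch_size: int
    default_model: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> BatchServerSettings:
    """Read the documented variables, falling back to the documented defaults."""
    cors = env.get("MLX_BATCH_CORS", "*")
    return BatchServerSettings(
        log_level=env.get("MLX_BATCH_LOG_LEVEL", "info").lower(),
        # Comma-separated origins, with surrounding whitespace trimmed.
        cors_origins=[origin.strip() for origin in cors.split(",") if origin.strip()],
        # Assumption: only "true" (case-insensitive) enables batching.
        enable_batch=env.get("MLX_BATCH_ENABLE_BATCH", "true").strip().lower() == "true",
        batch_window_ms=int(env.get("MLX_BATCH_BATCH_WINDOW_MS", "50")),
        max_batch_size=int(env.get("MLX_BATCH_MAX_BATCH_SIZE", "10")),
        # No default: an unset or empty value means no model is preloaded.
        default_model=env.get("MLX_BATCH_DEFAULT_MODEL") or None,
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to check in isolation.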

Batch Processing

Batch processing collects incoming requests within a time window and processes them together, significantly improving throughput on Apple Silicon:

# Tune for your workload
MLX_BATCH_BATCH_WINDOW_MS=100 \
MLX_BATCH_MAX_BATCH_SIZE=16 \
mlx-batch-server
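
The interaction between the collection window and the maximum batch size can be illustrated with a small asyncio sketch. This is not the server's coordinator (which builds on the mlx-lm BatchGenerator); the `WindowBatcher` class and its `process_batch` callback are hypothetical, and the callback runs synchronously here for brevity.

```python
import asyncio


class WindowBatcher:
    """Collect submitted items for up to window_ms, then process them as one batch."""

    def __init__(self, process_batch, window_ms=50, max_batch_size=10):
        self._process_batch = process_batch  # callable: list of items -> list of results
        self._window = window_ms / 1000.0
        self._max = max_batch_size
        self._pending = []  # (item, future) pairs waiting for the next flush
        self._timer = None  # handle for the scheduled end of the current window

    async def submit(self, item):
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._pending.append((item, future))
        if len(self._pending) >= self._max:
            self._flush()  # batch is full: do not wait for the window to close
        elif self._timer is None:
            # First item of a new batch opens the collection window.
            self._timer = loop.call_later(self._window, self._flush)
        return await future

    def _flush(self):
        if self._timer is not None:
            self._timer.cancel()  # harmless if the timer has already fired
            self._timer = None
        batch, self._pending = self._pending, []
        if not batch:
            return
        results = self._process_batch([item for item, _ in batch])
        for (_, future), result in zip(batch, results):
            if not future.done():
                future.set_result(result)
```

A larger window gathers more requests per batch at the cost of added latency for the first request in each window; a larger maximum batch size raises the ceiling on how many requests share one pass.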

Performance (M3 Ultra, 512 GB):

Single request: ~50 tok/s
Batched (10 requests): ~35 tok/s per request, ~350 tok/s total
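
The trade-off in these figures is plain arithmetic: each request in a batch runs somewhat slower, but the aggregate rate is several times higher. The snippet below only restates the numbers quoted above; the variable names are illustrative.

```python
single_tok_s = 50.0          # one request at a time
batched_per_request = 35.0   # each request inside a batch of 10
batch_size = 10

total_tok_s = batched_per_request * batch_size          # aggregate throughput
aggregate_speedup = total_tok_s / single_tok_s          # versus a single request
per_request_ratio = batched_per_request / single_tok_s  # speed each caller sees

print(total_tok_s)        # 350.0
print(aggregate_speedup)  # 7.0
print(per_request_ratio)  # 0.7
```

In other words, each caller sees roughly 70% of the single-request speed while the server delivers about 7x the total throughput.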

Security

The server ships open by default (SECURITY_LEVEL=0) so existing deployments keep working. Set a higher level to lock the surface down:

Level	Behavior
`0`	Open. No auth, every request maps to a stable pseudo-owner. Default.
`1`	Deprecated. Treated internally as `2` with a warning.
`2`	HMAC, session token, or API key (any one is sufficient).
`3`	Session token only (HMAC + API key fallback disabled).

When the level is greater than 0, every protected route, including /api/admin/models/{load,unload,alias}, requires a credential. /health and /v1/ready stay open at all levels for load balancers.
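
The level table and the open-route rule can be summarized as a small decision function. This is a sketch of the documented behavior, not the server's middleware: the function names and credential labels are hypothetical, and it assumes every path other than the two health endpoints counts as protected.

```python
# Credential kinds accepted at each level, as listed in the table above.
_ACCEPTED = {
    0: frozenset(),                               # open: nothing required
    2: frozenset({"hmac", "session", "api_key"}),
    3: frozenset({"session"}),
}

# Routes documented as open at every level.
OPEN_PATHS = frozenset({"/health", "/v1/ready"})


def effective_level(level: int) -> int:
    """Level 1 is deprecated and treated as level 2."""
    if level == 1:
        return 2
    if level not in _ACCEPTED:
        raise ValueError(f"unknown SECURITY_LEVEL: {level}")
    return level


def requires_credential(level: int, path: str) -> bool:
    """Assumption: all paths outside OPEN_PATHS are protected routes."""
    return effective_level(level) > 0 and path not in OPEN_PATHS


def is_accepted(level: int, credential_kind: str) -> bool:
    """True if this kind of credential is honored at the given level."""
    accepted = _ACCEPTED[effective_level(level)]
    return not accepted or credential_kind in accepted
```

For example, `is_accepted(3, "api_key")` is false because level 3 disables the HMAC and API key fallbacks, while `requires_credential(3, "/health")` is false because health checks stay open.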

# Static API key (simplest, single-secret deploys)
SECURITY_LEVEL=2 API_KEY=sk-mlx-… mlx-batch-server

# HMAC clients (machine-to-machine, /hmac/register issues secrets)
SECURITY_LEVEL=2 API_KEY=sk-… mlx-batch-server
curl -H "x-api-key: sk-…" -X POST http://127.0.0.1:10240/hmac/register \
     -d '{"client_id":"node-1","description":"build agent"}' \
     -H "Content-Type: application/json"

# Session-only (browser sessions via /auth/login)
SECURITY_LEVEL=3 SESSION_AUTH_ENABLED=true mlx-batch-server

Auxiliary auth environment variables:

API_KEY_HEADER (default x-api-key)
REDIS_URL (Redis-backed sessions, API keys, and rate limiting)
SESSION_AUTH_ENABLED
SESSION_PROVIDER
SESSION_TTL_HOURS
RATE_LIMIT_ENABLED
ACCESS_REGISTRATION_SECRET (enables the /access HTML registration page)
MLX_BATCH_HMAC_SECRETS_FILE (XDG path by default)
HMAC_TIMESTAMP_TOLERANCE

The auth router family (/auth/*, /hmac/*, /access) is opt-in: it mounts only when at least one auth-related environment variable is configured.

Operator UI

A standalone htmx admin UI lives in mlx_batch_server.operator and runs as a sibling app on port 10241:

# Inference (port 10240)
mlx-batch-server &

# Operator UI (port 10241) — connects back to inference at 10240
mlx-batch-operator serve

# Custom inference URL / port
MLX_BATCH_OPERATOR_INFERENCE_BASE_URL=http://localhost:10240 \
    mlx-batch-operator serve --port 10241

Tabs: Fleet (live runtime + model summary), Sessions (recent playground sessions with delete-guard), Logs (tail + SSE follow), Lifecycle (status + restart/stop), Playground (in-browser SSE prompt with response chaining).

The operator inherits the inference server's auth posture: if you start inference with SECURITY_LEVEL=2, the operator UI requires the same credential. Override either side with:

Variable	Effect
`MLX_BATCH_OPERATOR_SECURITY_LEVEL`	Force a different operator level than inference.
`MLX_BATCH_OPERATOR_REQUIRE_AUTH=true`	Force operator auth even when inference is open (useful behind a public proxy).
`MLX_BATCH_INTERNAL_API_KEY`	Key the operator forwards on the loopback playground proxy.
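
One way to read the inheritance and override rules is as a single decision over the environment. The sketch below is an interpretation, not the operator's code: the function name is hypothetical, it assumes the operator can see the same SECURITY_LEVEL as inference, and the precedence between the two overrides is an assumption.

```python
from typing import Mapping


def _normalize(value: str) -> int:
    level = int(value)
    return 2 if level == 1 else level  # level 1 is deprecated and treated as 2


def operator_requires_auth(env: Mapping[str, str]) -> bool:
    """Decide whether the operator UI should demand a credential."""
    # An explicit operator level wins over the level inherited from inference.
    raw = env.get("MLX_BATCH_OPERATOR_SECURITY_LEVEL", env.get("SECURITY_LEVEL", "0"))
    # Assumption: REQUIRE_AUTH=true forces auth regardless of either level.
    forced = env.get("MLX_BATCH_OPERATOR_REQUIRE_AUTH", "").strip().lower() == "true"
    return forced or _normalize(raw) > 0
```

Under these assumptions, an open inference server (level 0) with `MLX_BATCH_OPERATOR_REQUIRE_AUTH=true` still yields a locked operator UI, which matches the public-proxy use case in the table.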

The operator's /health and /api/health stay open at all levels so monitoring keeps working.

There is also a thin landing page at http://127.0.0.1:10240/admin on the inference port. It links to the richer operator UI on port 10241 and is gated by the same SECURITY_LEVEL.

Readiness

The server exposes two health endpoints that serve different purposes:

Endpoint	Purpose	Auth	Body
`GET /health`	Lightweight liveness for load balancers	open	`{"status":"ok", …}`
`GET /v1/ready`	Rich readiness — process, models loaded, batch coordinators, config, auth backends	open	`{"ready":bool, "checks":{…}}`

/v1/ready returns 200 only when every check passes; otherwise 503 with the failing check called out. When SECURITY_LEVEL>0 the readiness payload also includes an auth_backends block reporting Redis connectivity for sessions/API keys.
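
A deployment script can use the status code alone to wait for readiness. The sketch below polls until it sees 200 and treats anything else (503 or an unreachable server) as not ready yet. The function names and retry parameters are illustrative; the `fetch` and `sleep` arguments exist only so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # for example 503 while a readiness check is failing
    except (urllib.error.URLError, OSError):
        return 0


def wait_until_ready(url, attempts=30, delay_s=1.0, fetch=fetch_status, sleep=time.sleep):
    """Poll the readiness endpoint; True once it answers 200, False if attempts run out."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # back off before the next probe
    return False
```

A typical call would be `wait_until_ready("http://localhost:10240/v1/ready")` before sending the first inference request.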

HF Model Cards

Tooling for keeping the LibraxisAI Hugging Face model cards consistent. Sources of truth:

templates/HF_MODEL_CARD.md — canonical card template with placeholders, fixed ## Inference tested on section pointing here, and the canonical Vibecrafted footer.
scripts/rewrite_hf_model_cards.py — full rewrite of every LibraxisAI card from the template, preserving metrics and base lineage when present.
scripts/backfill_hf_inference_section.py — conservative patch that only adds ## Inference tested on to cards that don't have it yet.
scripts/backfill_hf_canonical_footer.py — conservative patch that only appends the canonical Vibecrafted footer to cards that don't have any form of it yet.

All scripts default to dry-run (list which cards would change without pushing). Add --apply to actually push, or use the *-apply Make targets:

# One-time auth
hf auth login

# Dry-run a full rewrite
make hf-rewrite

# Push a full rewrite (idempotent commit message: "card: full rewrite from canonical template")
make hf-rewrite-apply

# Backfill only the inference section across cards that lack it
make hf-backfill-inference         # dry-run
make hf-backfill-inference-apply   # push

# Backfill only the canonical footer across cards that lack any form of it
make hf-backfill-footer            # dry-run
make hf-backfill-footer-apply      # push

# Filters (work with all the above)
make hf-rewrite HF_LIMIT=5
make hf-backfill-inference HF_ONLY="Bielik Qwen"

The backfill scripts are intentionally conservative: they never delete content, never replace existing variants, and skip any card that already has the target section. Use hf-rewrite when you want a full normalisation pass.

Development

# Setup
make setup           # Install deps + pre-commit hooks

# Run
make dev             # Start with hot-reload
make dev PORT=10240  # Explicit port (10240 is the default)

# Test
make test            # All tests
make test-responses  # Responses API tests
make test-fast       # Skip slow tests

# Quality
make lint            # Run linters
make format          # Format code
make check           # Full CI check

# Model management
make load MODEL=mlx-community/Qwen3-0.6B-4bit
make unload
make ps              # List loaded models
make batch-stats     # Coordinator stats

Documentation

Resource	Description
Responses API Guide	Full Responses API reference
Batch Processing Guide	Batch inference configuration
Harmony Parser	GPT-OSS channel parsing
OpenAI API Guide	OpenAI compatibility reference
Anthropic API Guide	Anthropic compatibility reference
Examples	Practical usage examples

Requirements

macOS with Apple Silicon (M1/M2/M3/M4)
Python 3.11+
MLX framework (auto-installed)

Contributing

git clone https://github.com/LibraxisAI/mlx-batch-server.git
cd mlx-batch-server
make setup && make test

Pull requests welcome! For major changes, please open an issue first.

License

MIT License

Original project: mlx-omni-server by @madroidmaq

Maintained by: LibraxisAI

Built with MLX by Apple • FastAPI • MLX-LM

Not affiliated with OpenAI, Anthropic, or Apple
