Skip to content

Latest commit

 

History

History
69 lines (52 loc) · 2.9 KB

File metadata and controls

69 lines (52 loc) · 2.9 KB

Miles Diffusion Router

Load-balances requests across multiple sglang-diffusion worker instances using least-request routing with background health checks and worker quarantine.

Quick Start

# Start the router with two diffusion backends
python examples/diffusion_router/demo.py --port 30080 \
    --worker-urls http://localhost:10090 http://localhost:10091

# Or start empty and add workers dynamically
python examples/diffusion_router/demo.py --port 30080
curl -X POST 'http://localhost:30080/add_worker?url=http://localhost:10090'

API Endpoints

Method Path Description
POST /generate Image generation (forwards to /v1/images/generations)
POST /generate_video Video generation (forwards to /v1/videos/generations)
GET /health Aggregated router health
GET /health_workers Per-worker health and load info
POST /add_worker Register a diffusion worker (?url=... or JSON body)
GET /list_workers List registered workers
POST /update_weights_from_disk Broadcast weight reload to all workers
* /{path} Catch-all proxy to least-loaded worker

Load Balancing

The router uses a least-request strategy: each incoming request is forwarded to the worker with the fewest in-flight requests. This is workload-aware and avoids hot-spotting compared to round-robin. When a request completes, the worker's count is decremented, keeping the load state accurate in real time.

Workers that fail consecutive health checks (default: 3) are quarantined and excluded from the routing pool. A background loop pings each worker's GET /health endpoint at a configurable interval (default: 10s).

Notes

  • Health check endpoint follows Miles/SGLang convention: GET /health.
  • Responses are fully buffered; streaming and large-response handling are not supported yet (planned for a follow-up PR).

Example Requests

# Check health
curl http://localhost:30080/health

# Generate an image
curl -X POST http://localhost:30080/generate \
    -H 'Content-Type: application/json' \
    -d '{"model": "stabilityai/stable-diffusion-3", "prompt": "a cat", "n": 1, "size": "1024x1024"}'

# Reload weights on all workers
curl -X POST http://localhost:30080/update_weights_from_disk \
    -H 'Content-Type: application/json' \
    -d '{"model_path": "/path/to/new/weights"}'

CLI Options

--host                          Bind address (default: 0.0.0.0)
--port                          Port (default: 30080)
--worker-urls                   Initial worker URLs
--max-connections               Max concurrent connections (default: 100)
--timeout                       Request timeout in seconds
--health-check-interval         Seconds between health checks (default: 10)
--health-check-failure-threshold  Failures before quarantine (default: 3)
--verbose                       Enable verbose logging