Mitigating Semantic Hotspots in Mixture-of-Experts LLM Inference via Router-Assisted Batch Selection
CAPS addresses the "semantic hotspot" problem in MoE (Mixture-of-Experts) model inference. When users submit semantically similar requests in bursts (e.g., a stream of coding questions), traditional FCFS (First-Come-First-Served) scheduling batches those similar requests together, overloading a few experts while others sit idle and inflating tail latency through straggler effects.
CAPS solves this by:
- Predicting expert usage before execution using the MoE model's own gating mechanism (router dry-run)
- Selecting complementary requests from the queue that balance load across experts
- Minimizing load variance to reduce the straggler effect in batch execution
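The first step, the router dry-run, can be sketched as follows. This is an illustrative sketch, not the repository's `predictor.py`: `predict_load_vector` is a hypothetical name, and the raw gate logits are assumed to be available without executing the expert FFNs.

```python
import numpy as np

def predict_load_vector(gate_logits: np.ndarray, top_k: int = 4) -> np.ndarray:
    """Estimate per-expert load from router logits alone (no expert execution).

    gate_logits: (num_tokens, num_experts) raw router outputs.
    Returns a (num_experts,) vector counting tokens routed to each expert.
    """
    num_experts = gate_logits.shape[1]
    # Top-k expert indices per token, as the MoE router would select them
    top_experts = np.argsort(gate_logits, axis=1)[:, -top_k:]
    load = np.zeros(num_experts)
    for token_experts in top_experts:
        load[token_experts] += 1
    return load

# Toy example: 3 tokens, 8 experts, top-2 routing
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 8))
load = predict_load_vector(logits, top_k=2)
assert load.sum() == 3 * 2  # each token contributes top_k units of load
```

Because only the gating layers run, the dry-run costs a small fraction of a full forward pass.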
Results from discrete-event simulation comparing 12 batch selection strategies across 4 arrival rates, 4 random seeds, and 2 arrival patterns (3000 requests per experiment).
Bursty arrivals are the primary use case for CAPS: they simulate real-world semantic clustering (e.g., many coding questions arriving together).
| Strategy | P99 Improvement | Throughput Gain | Imbalance Reduction |
|---|---|---|---|
| greedy_dot_product | +18.6% to +46.9% | +12.6% to +12.8% | +11.3% to +11.4% |
| norm_minimizing | +18.7% to +46.9% | +12.6% to +12.8% | +11.3% to +11.4% |
| best_fit | +18.6% to +46.9% | +12.7% to +12.8% | +11.3% to +11.4% |
| greedy_variance | +18.6% to +46.9% | +12.7% to +12.8% | +11.3% to +11.4% |
| power_of_d | +18.3% to +47.0% | +12.5% to +12.6% | +11.1% to +11.2% |
| lpt_balanced | +18.4% to +45.6% | +12.5% to +12.7% | +11.1% to +11.3% |
| random | +15.4% to +39.6% | +10.3% to +10.6% | +9.4% to +9.6% |
| karmarkar_karp | +10.6% to +20.6% | +7.2% to +7.3% | +6.7% to +6.8% |
| lpt | +7.4% to +8.7% | +5.1% to +5.2% | +4.9% to +5.0% |
Observation: The top 6 strategies (greedy_dot_product, norm_minimizing, best_fit, greedy_variance, power_of_d, lpt_balanced) perform nearly identically. This confirms the literature finding that variance minimization, L2 norm minimization, and dot product minimization are mathematically equivalent for this problem.
| Arrival Rate | Bursty P99 Improvement | Poisson P99 Improvement |
|---|---|---|
| 150 req/s | +46.9% | -16% to -19% (FCFS better) |
| 200 req/s | +26.5% | +27.5% to +28.0% |
| 250 req/s | +21.1% | +15.9% to +16.4% |
| 300 req/s | +18.6% | +12.6% to +12.8% |
Observation: At low load with uniform (Poisson) arrivals, FCFS performs better because queue depth is insufficient for optimization. CAPS benefits increase with load and are most pronounced under bursty traffic.
| Strategy | P99 Improvement (200-300 req/s) |
|---|---|
| greedy_dot_product | +12.7% to +28.0% |
| best_fit | +12.8% to +27.9% |
| greedy_variance | +12.8% to +27.9% |
| power_of_d | +8.5% to +18.4% |
```
CAPS/
├── configs/
│   └── l4_config.json            # Hardware and scheduler configuration
├── data/
│   └── prompts.csv               # Sample prompts dataset
├── src/
│   ├── __init__.py               # Package init with lazy imports
│   ├── model_engine.py           # 4-bit quantized Qwen MoE model loader
│   ├── predictor.py              # Expert load estimation from prompts
│   ├── scheduler.py              # Pluggable batch selection scheduler
│   ├── simulator.py              # Simple batch simulator
│   ├── strategies/               # Batch selection algorithms (12 total)
│   │   ├── base.py               # Abstract SelectionStrategy interface
│   │   ├── fcfs.py               # First-Come-First-Served baseline
│   │   ├── greedy_variance.py    # Original CAPS: minimize Var(L)
│   │   ├── greedy_dot_product.py # Literature optimal: minimize v·L
│   │   ├── norm_minimizing.py    # Minimize ‖L+v‖² directly
│   │   ├── best_fit.py           # VBP residual minimization
│   │   ├── power_of_d.py         # Sample k, pick min variance
│   │   ├── power_of_d_dot.py     # Sample k, pick min dot product
│   │   ├── lpt.py                # Longest Processing Time variants
│   │   ├── karmarkar_karp.py     # Differencing algorithms
│   │   └── random_sample.py      # Random selection baseline
│   └── des/                      # Discrete-Event Simulation framework
│       ├── events.py             # Event types and priority queue
│       ├── arrivals.py           # Arrival generators (Poisson, Bursty)
│       ├── execution.py          # MoE-aware execution time model
│       ├── engine.py             # Simulation engine and event loop
│       └── metrics.py            # Latency percentiles and throughput
├── scripts/
│   ├── setup_env.sh              # Environment setup script
│   ├── run_inference.py          # Simple simulation entry point
│   ├── run_simulation.py         # DES simulation entry point
│   └── run_paper_evaluation.py   # Multi-strategy comparison suite
├── paper_results/                # Evaluation outputs (source of truth)
│   ├── results.json              # Machine-readable metrics
│   ├── evaluation.log            # Human-readable results
│   └── fig*.pdf                  # Publication figures
├── requirements.txt
└── README.md
```
```bash
cd CAPS

# Option A: use the setup script
bash scripts/setup_env.sh
source .venv/bin/activate

# Option B: manual install
pip install -r requirements.txt

# Quick test (500 requests, 1 seed, 1 rate)
python scripts/run_paper_evaluation.py --requests 500 --seeds 42 --rates 200

# Full evaluation (3000 requests, 4 seeds, 4 rates, all 12 strategies)
python scripts/run_paper_evaluation.py --requests 3000 --seeds 42 123 456 789 --rates 150 200 250 300

# Compare specific strategies only
python scripts/run_paper_evaluation.py --strategies greedy_variance power_of_d fcfs

# Run the simple simulator with the actual model
python scripts/run_inference.py --with-model
```

These strategies achieve nearly identical performance because they optimize equivalent objectives:
| Strategy | Algorithm | Complexity | Reference |
|---|---|---|---|
| greedy_dot_product | Minimize v·L (dot product) | O(B×W×d) | Panigrahy et al. |
| norm_minimizing | Minimize ‖L+v‖² | O(B×W×d) | Chakrabarty & Swamy |
| best_fit | Minimize distance to uniform | O(B×W×d) | VBP Best Fit |
| greedy_variance | Minimize Var(L+v) | O(B×W×d) | Original CAPS |
| differencing | Multi-dim Karmarkar-Karp | O(B×W×d) | - |
| Strategy | Algorithm | Complexity | Notes |
|---|---|---|---|
| power_of_d | Sample k candidates, min variance | O(B×k×d) | k=8 default |
| power_of_d_dot | Sample k candidates, min dot product | O(B×k×d) | Literature recommended |
| lpt_balanced | LPT with load balancing | O(W log W + B×d) | Good for heavy requests |
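The sampling idea behind power_of_d can be sketched in a few lines. This is an illustrative sketch with hypothetical names (`power_of_d_pick`), not the repository implementation:

```python
import random
import statistics

def power_of_d_pick(candidates, current_load, d=8, seed=None):
    """Power-of-d-choices: sample d candidate load vectors and return the
    index whose addition minimizes the variance of the aggregate load."""
    rng = random.Random(seed)
    sample = rng.sample(range(len(candidates)), min(d, len(candidates)))

    def variance_if_added(i):
        combined = [a + b for a, b in zip(current_load, candidates[i])]
        return statistics.pvariance(combined)

    return min(sample, key=variance_if_added)

# Toy example: the current batch is hot on expert 0
current = [5.0, 1.0, 1.0, 1.0]
cands = [[4.0, 0.0, 0.0, 0.0],   # makes the hotspot worse
         [0.0, 2.0, 1.0, 1.0]]   # fills the cold experts
assert power_of_d_pick(cands, current, d=2, seed=0) == 1
```

Sampling k candidates instead of scanning the whole window trades a small amount of balance quality for an O(W/k) reduction in scheduling cost.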
| Strategy | Algorithm | Complexity | Notes |
|---|---|---|---|
| fcfs | First-come-first-served | O(B) | Baseline, no optimization |
| random | Random selection | O(B) | Lower bound on benefits |
| lpt | Longest Processing Time | O(W log W) | Harmful for load balance |
| karmarkar_karp | Number partitioning | O(W log W) | Suboptimal for vectors |
In MoE models, each token routes to a subset of experts (e.g., 4 of 60). Semantically similar requests activate the same experts:
```
Burst of coding questions:
  Request 1 (Python)     -> Experts [12, 15, 23, 45]
  Request 2 (JavaScript) -> Experts [12, 15, 23, 47]
  Request 3 (Rust)       -> Experts [12, 15, 23, 44]

Result: Experts 12, 15, 23 are overloaded
Batch latency = max(expert_load) -> high due to stragglers
```
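Aggregating the routing decisions above into a per-expert load vector makes the hotspot explicit (a minimal sketch using Python's `collections.Counter`):

```python
from collections import Counter

# Expert assignments from the burst example above
batch_routes = {
    "python_q":     [12, 15, 23, 45],
    "javascript_q": [12, 15, 23, 47],
    "rust_q":       [12, 15, 23, 44],
}

# Per-expert load: how many requests in the batch hit each expert
load = Counter()
for experts in batch_routes.values():
    load.update(experts)

assert load[12] == load[15] == load[23] == 3   # overloaded experts
assert load[45] == load[47] == load[44] == 1   # lightly loaded experts
# Batch latency tracks the hottest expert: 3x the load of the light ones
assert max(load.values()) == 3
```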
CAPS reorders the queue to batch semantically diverse requests together:
1. Router dry-run: Extract expert activation vectors via gating layers only
2. Queue management: Tag each request with its load vector
3. Batch formation:
a. Seed with oldest request (anti-starvation)
b. Greedily add requests that minimize aggregate load variance
c. Result: Load balanced across all experts
Mixed batch (coding + history + math) -> Even expert load -> Lower peak -> Lower latency
The discrete-event simulation uses an MoE-aware execution model:
```
execution_time = base_time × imbalance_factor

where:
  base_time        = prefill_time + decode_time
  imbalance_factor = 1 + CV(load_vector) × sensitivity
  CV               = coefficient of variation (std / mean)
```
This captures the straggler effect: batch latency is determined by the most loaded expert.
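As a worked example of the model above (the `sensitivity` value here is illustrative, not the simulator's default):

```python
import statistics

def execution_time(base_time_ms, load_vector, sensitivity=0.5):
    """MoE-aware execution model: latency grows with expert load imbalance."""
    mean = statistics.mean(load_vector)
    cv = statistics.pstdev(load_vector) / mean  # coefficient of variation
    return base_time_ms * (1 + cv * sensitivity)

# Both batches carry total load 12 across 4 experts, spread differently
hot_batch   = [9, 1, 1, 1]   # hotspot on one expert
mixed_batch = [3, 3, 3, 3]   # perfectly balanced

t_hot = execution_time(100.0, hot_batch)
t_mixed = execution_time(100.0, mixed_batch)
assert t_mixed == 100.0      # CV = 0 -> no straggler penalty
assert t_hot > t_mixed       # same work, slower due to the hot expert
```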
Edit `configs/l4_config.json`:

```json
{
  "model_name": "Qwen/Qwen1.5-MoE-A2.7B-Chat",
  "quantization": "nf4",
  "num_experts": 60,
  "top_k_experts": 4,
  "scheduler": {
    "strategy": "greedy_variance",
    "max_batch_size": 8,
    "window_size": 32,
    "min_batch_trigger": 16,
    "scheduling_interval_ms": 100
  }
}
```

| Parameter | Description | Default |
|---|---|---|
| strategy | Batch selection algorithm (see Available Strategies) | greedy_variance |
| max_batch_size | Maximum requests per batch | 8 |
| window_size | Lookahead window for candidate selection | 32 |
| min_batch_trigger | Minimum queue depth before scheduling | 16 |
| num_experts | Number of experts in the MoE model | 60 |
| top_k_experts | Experts activated per token | 4 |
`paper_results/results.json` is the authoritative source for all metrics:

```json
{
  "config": {
    "requests": 3000,
    "seeds": [42, 123, 456, 789],
    "rates": [150.0, 200.0, 250.0, 300.0],
    "generated": "2025-12-05T12:42:17"
  },
  "strategies": ["best_fit", "differencing", "fcfs", ...],
  "bursty": {
    "150": {
      "greedy_variance": {
        "p50_mean": 1575.48, "p50_std": 312.78,
        "p90_mean": 2598.68, "p90_std": 505.77,
        "p99_mean": 3031.60, "p99_std": 528.69,
        "throughput_mean": 131.61, "throughput_std": 0.76,
        "imbalance_mean": 1.593, "imbalance_std": 0.003
      },
      "_improvements": {
        "greedy_variance": {
          "p99_reduction_pct": 46.9,
          "throughput_gain_pct": 12.7,
          "imbalance_reduction_pct": 11.3
        }
      }
    }
  },
  "poisson": { ... }
}
```

The core greedy selection loop:

```python
def select_batch(candidates: list[Request], batch_size: int) -> list[Request]:
    # Anti-starvation: always include the oldest request
    selected = [candidates[0]]
    current_load = candidates[0].load_vector.clone()
    remaining = set(range(1, len(candidates)))
    while len(selected) < batch_size and remaining:
        # Find the request that minimizes variance when added
        best_idx = min(
            remaining,
            key=lambda i: (current_load + candidates[i].load_vector).var()
        )
        selected.append(candidates[best_idx])
        current_load += candidates[best_idx].load_vector
        remaining.remove(best_idx)
    return selected
```

For non-negative load vectors with fixed total load:
```
Var(L) = E[L²] - (E[L])²
```

With the total load fixed, E[L] is constant, so minimizing Var(L) is the same as minimizing E[L²], which is proportional to ‖L‖² (the squared L2 norm). Minimizing variance is therefore equivalent to:
- Minimizing the L2 norm of the aggregate load
- Minimizing the maximum expert load (approximately)
- Finding orthogonal/complementary load vectors
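This equivalence can be checked numerically: for candidate aggregate loads with the same total, ranking by variance and ranking by squared L2 norm agree (a quick sanity check, not a proof):

```python
import statistics

def var(v):
    return statistics.pvariance(v)

def sq_norm(v):
    return sum(x * x for x in v)

# Three aggregate load vectors with identical total load (sum = 12)
batches = [[9, 1, 1, 1], [5, 3, 2, 2], [3, 3, 3, 3]]

by_variance = sorted(batches, key=var)
by_norm = sorted(batches, key=sq_norm)
assert by_variance == by_norm          # identical ranking when totals match
assert by_variance[0] == [3, 3, 3, 3]  # the balanced batch wins under both
```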
| Operation | Time | Space |
|---|---|---|
| Greedy selection | O(B × W × d) | O(W × d) |
| Power of d | O(B × k × d) | O(k × d) |
| FCFS | O(B) | O(1) |
Where: B = batch size, W = window size, d = number of experts, k = sample size
Practical overhead: <1ms per scheduling decision for typical parameters.
- Single-GPU focus: Implementation targets single-GPU inference. Multi-node expert placement is not addressed.
- Synthetic workloads: The bursty arrival generator simulates semantic clustering, but production traces would strengthen validation.
- Simplified execution model: The MoE execution model uses the coefficient of variation as a proxy for straggler effects. Real systems have additional factors (network latency, memory bandwidth, load-balancing overhead).
- Low-load overhead: At very low arrival rates with uniform traffic, the optimization overhead may exceed benefits. FCFS is preferable when queue depth is consistently below min_batch_trigger.
Core (simulation only):
- Python 3.9+
- PyTorch 2.0+
- NumPy, Matplotlib, tqdm
Model inference (optional):
- NVIDIA GPU with 24GB VRAM (L4, A10, RTX 4090)
- transformers, accelerate, bitsandbytes
- Panigrahy et al., "Heuristics for Vector Bin Packing" (2011)
- Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing" (2001)
- Qwen1.5-MoE Architecture: https://qwenlm.github.io/blog/qwen-moe/
- Switch Transformer: Fedus et al. (2021)
MIT License