Mitigating Semantic Hotspots in Mixture-of-Experts LLM Inference via Router-Assisted Batch Selection
CAPS addresses the "semantic hotspot" problem in MoE (Mixture-of-Experts) model inference. When users submit semantically similar requests in bursts (e.g., a stream of coding questions), traditional FCFS (First-Come-First-Served) scheduling batches those similar requests together, overloading a few experts while others sit idle and inflating tail latency through straggler effects.
CAPS solves this by:
- Predicting expert usage before execution using the MoE model's own gating mechanism (router dry-run)
- Selecting complementary requests from the queue that balance load across experts
- Minimizing load variance to reduce the straggler effect in batch execution
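The first step, the router dry-run, can be sketched as follows. This is an illustrative sketch, not the repository's `predictor.py`: `predict_load_vector` is a hypothetical name, and the raw gate logits are assumed to be available without executing the expert FFNs.

```python
import numpy as np

def predict_load_vector(gate_logits: np.ndarray, top_k: int = 4) -> np.ndarray:
    """Estimate per-expert load from router logits alone (no expert execution).

    gate_logits: (num_tokens, num_experts) raw router outputs.
    Returns a (num_experts,) vector counting tokens routed to each expert.
    """
    num_experts = gate_logits.shape[1]
    # Top-k expert indices per token, as the MoE router would select them
    top_experts = np.argsort(gate_logits, axis=1)[:, -top_k:]
    load = np.zeros(num_experts)
    for token_experts in top_experts:
        load[token_experts] += 1
    return load

# Toy example: 3 tokens, 8 experts, top-2 routing
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 8))
load = predict_load_vector(logits, top_k=2)
assert load.sum() == 3 * 2  # each token contributes top_k units of load
```

Because only the gating layers run, the dry-run costs a small fraction of a full forward pass.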
Results from discrete-event simulation comparing 12 batch selection strategies across 4 arrival rates, 4 random seeds, and 2 arrival patterns (3000 requests per experiment).
Bursty arrivals are the primary use case for CAPS: they simulate real-world semantic clustering (e.g., many coding questions arriving together).
| Strategy | P99 Improvement | Throughput Gain | Imbalance Reduction |
|---|---|---|---|
| greedy_dot_product | +18.6% to +46.9% | +12.6% to +12.8% | +11.3% to +11.4% |
| norm_minimizing | +18.7% to +46.9% | +12.6% to +12.8% | +11.3% to +11.4% |
| best_fit | +18.6% to +46.9% | +12.7% to +12.8% | +11.3% to +11.4% |
| greedy_variance | +18.6% to +46.9% | +12.7% to +12.8% | +11.3% to +11.4% |
| power_of_d | +18.3% to +47.0% | +12.5% to +12.6% | +11.1% to +11.2% |
| lpt_balanced | +18.4% to +45.6% | +12.5% to +12.7% | +11.1% to +11.3% |
| random | +15.4% to +39.6% | +10.3% to +10.6% | +9.4% to +9.6% |
| karmarkar_karp | +10.6% to +20.6% | +7.2% to +7.3% | +6.7% to +6.8% |
| lpt | +7.4% to +8.7% | +5.1% to +5.2% | +4.9% to +5.0% |
Observation: The top 6 strategies (greedy_dot_product, norm_minimizing, best_fit, greedy_variance, power_of_d, lpt_balanced) perform nearly identically. This confirms the literature finding that variance minimization, L2 norm minimization, and dot product minimization are mathematically equivalent for this problem.
| Arrival Rate | Bursty P99 Improvement | Poisson P99 Improvement |
|---|---|---|
| 150 req/s | +46.9% | -16% to -19% (FCFS better) |
| 200 req/s | +26.5% | +27.5% to +28.0% |
| 250 req/s | +21.1% | +15.9% to +16.4% |
| 300 req/s | +18.6% | +12.6% to +12.8% |
Observation: At low load with uniform (Poisson) arrivals, FCFS performs better because queue depth is insufficient for optimization. CAPS benefits increase with load and are most pronounced under bursty traffic.
| Strategy | P99 Improvement (200-300 req/s) |
|---|---|
| greedy_dot_product | +12.7% to +28.0% |
| best_fit | +12.8% to +27.9% |
| greedy_variance | +12.8% to +27.9% |
| power_of_d | +8.5% to +18.4% |
```
CAPS/
├── configs/
│   └── l4_config.json            # Hardware and scheduler configuration
├── data/
│   └── prompts.csv               # Sample prompts dataset
├── src/
│   ├── __init__.py               # Package init with lazy imports
│   ├── model_engine.py           # 4-bit quantized Qwen MoE model loader
│   ├── predictor.py              # Expert load estimation from prompts
│   ├── scheduler.py              # Pluggable batch selection scheduler
│   ├── simulator.py              # Simple batch simulator
│   ├── strategies/               # Batch selection algorithms (12 total)
│   │   ├── base.py               # Abstract SelectionStrategy interface
│   │   ├── fcfs.py               # First-Come-First-Served baseline
│   │   ├── greedy_variance.py    # Original CAPS: minimize Var(L)
│   │   ├── greedy_dot_product.py # Literature optimal: minimize v·L
│   │   ├── norm_minimizing.py    # Minimize ‖L+v‖² directly
│   │   ├── best_fit.py           # VBP residual minimization
│   │   ├── power_of_d.py         # Sample k, pick min variance
│   │   ├── power_of_d_dot.py     # Sample k, pick min dot product
│   │   ├── lpt.py                # Longest Processing Time variants
│   │   ├── karmarkar_karp.py     # Differencing algorithms
│   │   └── random_sample.py      # Random selection baseline
│   └── des/                      # Discrete-Event Simulation framework
│       ├── events.py             # Event types and priority queue
│       ├── arrivals.py           # Arrival generators (Poisson, Bursty)
│       ├── execution.py          # MoE-aware execution time model
│       ├── engine.py             # Simulation engine and event loop
│       └── metrics.py            # Latency percentiles and throughput
├── scripts/
│   ├── setup_env.sh              # Environment setup script
│   ├── run_inference.py          # Simple simulation entry point
│   ├── run_simulation.py         # DES simulation entry point
│   └── run_paper_evaluation.py   # Multi-strategy comparison suite
├── paper_results/                # Evaluation outputs (source of truth)
│   ├── results.json              # Machine-readable metrics
│   ├── evaluation.log            # Human-readable results
│   └── fig*.pdf                  # Publication figures
├── requirements.txt
└── README.md
```
```bash
cd CAPS

# Option A: use the setup script
bash scripts/setup_env.sh
source .venv/bin/activate

# Option B: manual install
pip install -r requirements.txt

# Quick test (500 requests, 1 seed, 1 rate)
python scripts/run_paper_evaluation.py --requests 500 --seeds 42 --rates 200

# Full evaluation (3000 requests, 4 seeds, 4 rates, all 12 strategies)
python scripts/run_paper_evaluation.py --requests 3000 --seeds 42 123 456 789 --rates 150 200 250 300

# Compare specific strategies only
python scripts/run_paper_evaluation.py --strategies greedy_variance power_of_d fcfs

# Run the simple simulator with the actual model
python scripts/run_inference.py --with-model
```

These strategies achieve nearly identical performance because they optimize equivalent objectives:
| Strategy | Algorithm | Complexity | Reference |
|---|---|---|---|
| greedy_dot_product | Minimize v·L (dot product) | O(B×W×d) | Panigrahy et al. |
| norm_minimizing | Minimize ‖L+v‖² | O(B×W×d) | Chakrabarty & Swamy |
| best_fit | Minimize distance to uniform | O(B×W×d) | VBP Best Fit |
| greedy_variance | Minimize Var(L+v) | O(B×W×d) | Original CAPS |
| differencing | Multi-dim Karmarkar-Karp | O(B×W×d) | - |
| Strategy | Algorithm | Complexity | Notes |
|---|---|---|---|
| power_of_d | Sample k candidates, min variance | O(B×k×d) | k=8 default |
| power_of_d_dot | Sample k candidates, min dot product | O(B×k×d) | Literature recommended |
| lpt_balanced | LPT with load balancing | O(W log W + B×d) | Good for heavy requests |
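The sampling idea behind power_of_d can be sketched in a few lines. This is an illustrative sketch with hypothetical names (`power_of_d_pick`), not the repository implementation:

```python
import random
import statistics

def power_of_d_pick(candidates, current_load, d=8, seed=None):
    """Power-of-d-choices: sample d candidate load vectors and return the
    index whose addition minimizes the variance of the aggregate load."""
    rng = random.Random(seed)
    sample = rng.sample(range(len(candidates)), min(d, len(candidates)))

    def variance_if_added(i):
        combined = [a + b for a, b in zip(current_load, candidates[i])]
        return statistics.pvariance(combined)

    return min(sample, key=variance_if_added)

# Toy example: the current batch is hot on expert 0
current = [5.0, 1.0, 1.0, 1.0]
cands = [[4.0, 0.0, 0.0, 0.0],   # makes the hotspot worse
         [0.0, 2.0, 1.0, 1.0]]   # fills the cold experts
assert power_of_d_pick(cands, current, d=2, seed=0) == 1
```

Sampling k candidates instead of scanning the whole window trades a small amount of balance quality for an O(W/k) reduction in scheduling cost.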
| Strategy | Algorithm | Complexity | Notes |
|---|---|---|---|
| fcfs | First-come-first-served | O(B) | Baseline, no optimization |
| random | Random selection | O(B) | Lower bound on benefits |
| lpt | Longest Processing Time | O(W log W) | Harmful for load balance |
| karmarkar_karp | Number partitioning | O(W log W) | Suboptimal for vectors |
In MoE models, each token routes to a subset of experts (e.g., 4 of 60). Semantically similar requests activate the same experts:
```
Burst of coding questions:
  Request 1 (Python)     -> Experts [12, 15, 23, 45]
  Request 2 (JavaScript) -> Experts [12, 15, 23, 47]
  Request 3 (Rust)       -> Experts [12, 15, 23, 44]

Result: Experts 12, 15, 23 are overloaded
Batch latency = max(expert_load) -> high due to stragglers
```
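Aggregating the routing decisions above into a per-expert load vector makes the hotspot explicit (a minimal sketch using Python's `collections.Counter`):

```python
from collections import Counter

# Expert assignments from the burst example above
batch_routes = {
    "python_q":     [12, 15, 23, 45],
    "javascript_q": [12, 15, 23, 47],
    "rust_q":       [12, 15, 23, 44],
}

# Per-expert load: how many requests in the batch hit each expert
load = Counter()
for experts in batch_routes.values():
    load.update(experts)

assert load[12] == load[15] == load[23] == 3   # overloaded experts
assert load[45] == load[47] == load[44] == 1   # lightly loaded experts
# Batch latency tracks the hottest expert: 3x the load of the light ones
assert max(load.values()) == 3
```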
CAPS reorders the queue to batch semantically diverse requests together:
1. Router dry-run: Extract expert activation vectors via gating layers only
2. Queue management: Tag each request with its load vector
3. Batch formation:
a. Seed with oldest request (anti-starvation)
b. Greedily add requests that minimize aggregate load variance
c. Result: Load balanced across all experts
Mixed batch (coding + history + math) -> Even expert load -> Lower peak -> Lower latency
The discrete-event simulation uses an MoE-aware execution model:
```
execution_time = base_time × imbalance_factor

where:
  base_time        = prefill_time + decode_time
  imbalance_factor = 1 + CV(load_vector) × sensitivity
  CV               = coefficient of variation (std / mean)
```
This captures the straggler effect: batch latency is determined by the most loaded expert.
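As a worked example of the model above (the `sensitivity` value here is illustrative, not the simulator's default):

```python
import statistics

def execution_time(base_time_ms, load_vector, sensitivity=0.5):
    """MoE-aware execution model: latency grows with expert load imbalance."""
    mean = statistics.mean(load_vector)
    cv = statistics.pstdev(load_vector) / mean  # coefficient of variation
    return base_time_ms * (1 + cv * sensitivity)

# Both batches carry total load 12 across 4 experts, spread differently
hot_batch   = [9, 1, 1, 1]   # hotspot on one expert
mixed_batch = [3, 3, 3, 3]   # perfectly balanced

t_hot = execution_time(100.0, hot_batch)
t_mixed = execution_time(100.0, mixed_batch)
assert t_mixed == 100.0      # CV = 0 -> no straggler penalty
assert t_hot > t_mixed       # same work, slower due to the hot expert
```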
Edit `configs/l4_config.json`:

```json
{
  "model_name": "Qwen/Qwen1.5-MoE-A2.7B-Chat",
  "quantization": "nf4",
  "num_experts": 60,
  "top_k_experts": 4,
  "scheduler": {
    "strategy": "greedy_variance",
    "max_batch_size": 8,
    "window_size": 32,
    "min_batch_trigger": 16,
    "scheduling_interval_ms": 100
  }
}
```

| Parameter | Description | Default |
|---|---|---|
| strategy | Batch selection algorithm (see Available Strategies) | greedy_variance |
| max_batch_size | Maximum requests per batch | 8 |
| window_size | Lookahead window for candidate selection | 32 |
| min_batch_trigger | Minimum queue depth before scheduling | 16 |
| num_experts | Number of experts in the MoE model | 60 |
| top_k_experts | Experts activated per token | 4 |
`paper_results/results.json` is the authoritative source for all metrics:

```json
{
  "config": {
    "requests": 3000,
    "seeds": [42, 123, 456, 789],
    "rates": [150.0, 200.0, 250.0, 300.0],
    "generated": "2025-12-05T12:42:17"
  },
  "strategies": ["best_fit", "differencing", "fcfs", ...],
  "bursty": {
    "150": {
      "greedy_variance": {
        "p50_mean": 1575.48, "p50_std": 312.78,
        "p90_mean": 2598.68, "p90_std": 505.77,
        "p99_mean": 3031.60, "p99_std": 528.69,
        "throughput_mean": 131.61, "throughput_std": 0.76,
        "imbalance_mean": 1.593, "imbalance_std": 0.003
      },
      "_improvements": {
        "greedy_variance": {
          "p99_reduction_pct": 46.9,
          "throughput_gain_pct": 12.7,
          "imbalance_reduction_pct": 11.3
        }
      }
    }
  },
  "poisson": { ... }
}
```

The core greedy selection loop:

```python
def select_batch(candidates: list[Request], batch_size: int) -> list[Request]:
    # Anti-starvation: always include the oldest request
    selected = [candidates[0]]
    current_load = candidates[0].load_vector.clone()
    remaining = set(range(1, len(candidates)))
    while len(selected) < batch_size and remaining:
        # Find the request that minimizes variance when added
        best_idx = min(
            remaining,
            key=lambda i: (current_load + candidates[i].load_vector).var()
        )
        selected.append(candidates[best_idx])
        current_load += candidates[best_idx].load_vector
        remaining.remove(best_idx)
    return selected
```

For non-negative load vectors with fixed total load:
```
Var(L) = E[L²] - (E[L])²
```

With the total load fixed, E[L] is constant, so minimizing Var(L) is the same as minimizing E[L²], which is proportional to ‖L‖² (the squared L2 norm). Minimizing variance is therefore equivalent to:
- Minimizing the L2 norm of the aggregate load
- Minimizing the maximum expert load (approximately)
- Finding orthogonal/complementary load vectors
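This equivalence can be checked numerically: for candidate aggregate loads with the same total, ranking by variance and ranking by squared L2 norm agree (a quick sanity check, not a proof):

```python
import statistics

def var(v):
    return statistics.pvariance(v)

def sq_norm(v):
    return sum(x * x for x in v)

# Three aggregate load vectors with identical total load (sum = 12)
batches = [[9, 1, 1, 1], [5, 3, 2, 2], [3, 3, 3, 3]]

by_variance = sorted(batches, key=var)
by_norm = sorted(batches, key=sq_norm)
assert by_variance == by_norm          # identical ranking when totals match
assert by_variance[0] == [3, 3, 3, 3]  # the balanced batch wins under both
```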
| Operation | Time | Space |
|---|---|---|
| Greedy selection | O(B × W × d) | O(W × d) |
| Power of d | O(B × k × d) | O(k × d) |
| FCFS | O(B) | O(1) |
Where: B = batch size, W = window size, d = number of experts, k = sample size
Practical overhead: <1ms per scheduling decision for typical parameters.
- Single-GPU focus: Implementation targets single-GPU inference. Multi-node expert placement is not addressed.
- Synthetic workloads: The bursty arrival generator simulates semantic clustering, but production traces would strengthen validation.
- Simplified execution model: The MoE execution model uses the coefficient of variation as a proxy for straggler effects. Real systems have additional factors (network latency, memory bandwidth, load-balancing overhead).
- Low-load overhead: At very low arrival rates with uniform traffic, the optimization overhead may exceed benefits. FCFS is preferable when queue depth is consistently below min_batch_trigger.
Core (simulation only):
- Python 3.9+
- PyTorch 2.0+
- NumPy, Matplotlib, tqdm
Model inference (optional):
- NVIDIA GPU with 24GB VRAM (L4, A10, RTX 4090)
- transformers, accelerate, bitsandbytes
- Panigrahy et al., "Heuristics for Vector Bin Packing" (2011)
- Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing" (2001)
- Qwen1.5-MoE Architecture: https://qwenlm.github.io/blog/qwen-moe/
- Switch Transformer: Fedus et al. (2021)
MIT License