CAPS: Content-Aware Prioritized Scheduling

Mitigating Semantic Hotspots in Mixture-of-Experts LLM Inference via Router-Assisted Batch Selection

Overview

CAPS addresses the "semantic hotspot" problem in MoE (Mixture-of-Experts) model inference. When users submit semantically similar requests in bursts (e.g., a stream of coding questions), traditional FCFS (First-Come-First-Served) scheduling causes certain experts to become overloaded while others sit idle, increasing tail latency due to straggler effects.

CAPS solves this by:

  1. Predicting expert usage before execution using the MoE model's own gating mechanism (router dry-run)
  2. Selecting complementary requests from the queue that balance load across experts
  3. Minimizing load variance to reduce the straggler effect in batch execution

Key Results

Results from discrete-event simulation comparing 12 batch selection strategies across 4 arrival rates, 4 random seeds, and 2 arrival patterns (3000 requests per experiment).

Bursty Arrivals (Semantic Hotspots)

The primary use case for CAPS. Bursty arrivals simulate real-world semantic clustering (e.g., many coding questions arriving together).

| Strategy | P99 Improvement | Throughput Gain | Imbalance Reduction |
|---|---|---|---|
| greedy_dot_product | +18.6% to +46.9% | +12.6% to +12.8% | +11.3% to +11.4% |
| norm_minimizing | +18.7% to +46.9% | +12.6% to +12.8% | +11.3% to +11.4% |
| best_fit | +18.6% to +46.9% | +12.7% to +12.8% | +11.3% to +11.4% |
| greedy_variance | +18.6% to +46.9% | +12.7% to +12.8% | +11.3% to +11.4% |
| power_of_d | +18.3% to +47.0% | +12.5% to +12.6% | +11.1% to +11.2% |
| lpt_balanced | +18.4% to +45.6% | +12.5% to +12.7% | +11.1% to +11.3% |
| random | +15.4% to +39.6% | +10.3% to +10.6% | +9.4% to +9.6% |
| karmarkar_karp | +10.6% to +20.6% | +7.2% to +7.3% | +6.7% to +6.8% |
| lpt | +7.4% to +8.7% | +5.1% to +5.2% | +4.9% to +5.0% |

Observation: The top 6 strategies (greedy_dot_product, norm_minimizing, best_fit, greedy_variance, power_of_d, lpt_balanced) perform nearly identically. This confirms the literature finding that variance minimization, L2 norm minimization, and dot product minimization are mathematically equivalent for this problem.

Load Sensitivity (Best Strategies)

| Arrival Rate | Bursty P99 Improvement | Poisson P99 Improvement |
|---|---|---|
| 150 req/s | +46.9% | -16% to -19% (FCFS better) |
| 200 req/s | +26.5% | +27.5% to +28.0% |
| 250 req/s | +21.1% | +15.9% to +16.4% |
| 300 req/s | +18.6% | +12.6% to +12.8% |

Observation: At low load with uniform (Poisson) arrivals, FCFS performs better because queue depth is insufficient for optimization. CAPS benefits increase with load and are most pronounced under bursty traffic.

Poisson Arrivals (Uniform Traffic)

| Strategy | P99 Improvement (200-300 req/s) |
|---|---|
| greedy_dot_product | +12.7% to +28.0% |
| best_fit | +12.8% to +27.9% |
| greedy_variance | +12.8% to +27.9% |
| power_of_d | +8.5% to +18.4% |

Project Structure

CAPS/
├── configs/
│   └── l4_config.json          # Hardware and scheduler configuration
├── data/
│   └── prompts.csv             # Sample prompts dataset
├── src/
│   ├── __init__.py             # Package init with lazy imports
│   ├── model_engine.py         # 4-bit quantized Qwen MoE model loader
│   ├── predictor.py            # Expert load estimation from prompts
│   ├── scheduler.py            # Pluggable batch selection scheduler
│   ├── simulator.py            # Simple batch simulator
│   ├── strategies/             # Batch selection algorithms (12 total)
│   │   ├── base.py             # Abstract SelectionStrategy interface
│   │   ├── fcfs.py             # First-Come-First-Served baseline
│   │   ├── greedy_variance.py  # Original CAPS: minimize Var(L)
│   │   ├── greedy_dot_product.py # Literature optimal: minimize v·L
│   │   ├── norm_minimizing.py  # Minimize ||L+v||² directly
│   │   ├── best_fit.py         # VBP residual minimization
│   │   ├── power_of_d.py       # Sample k, pick min variance
│   │   ├── power_of_d_dot.py   # Sample k, pick min dot product
│   │   ├── lpt.py              # Longest Processing Time variants
│   │   ├── karmarkar_karp.py   # Differencing algorithms
│   │   └── random_sample.py    # Random selection baseline
│   └── des/                    # Discrete-Event Simulation framework
│       ├── events.py           # Event types and priority queue
│       ├── arrivals.py         # Arrival generators (Poisson, Bursty)
│       ├── execution.py        # MoE-aware execution time model
│       ├── engine.py           # Simulation engine and event loop
│       └── metrics.py          # Latency percentiles and throughput
├── scripts/
│   ├── setup_env.sh            # Environment setup script
│   ├── run_inference.py        # Simple simulation entry point
│   ├── run_simulation.py       # DES simulation entry point
│   └── run_paper_evaluation.py # Multi-strategy comparison suite
├── paper_results/              # Evaluation outputs (source of truth)
│   ├── results.json            # Machine-readable metrics
│   ├── evaluation.log          # Human-readable results
│   └── fig*.pdf                # Publication figures
├── requirements.txt
└── README.md

Quick Start

1. Setup Environment

cd CAPS

# Option A: Use setup script
bash scripts/setup_env.sh
source .venv/bin/activate

# Option B: Manual install
pip install -r requirements.txt

2. Run Evaluation

# Quick test (500 requests, 1 seed, 1 rate)
python scripts/run_paper_evaluation.py --requests 500 --seeds 42 --rates 200

# Full evaluation (3000 requests, 4 seeds, 4 rates, all 12 strategies)
python scripts/run_paper_evaluation.py --requests 3000 --seeds 42 123 456 789 --rates 150 200 250 300

# Compare specific strategies only
python scripts/run_paper_evaluation.py --strategies greedy_variance power_of_d fcfs

3. Run with Actual Model (Requires 24GB GPU)

python scripts/run_inference.py --with-model

Available Strategies

Tier 1: Optimal (Mathematically Equivalent)

These strategies achieve identical performance because they optimize equivalent objectives:

| Strategy | Algorithm | Complexity | Reference |
|---|---|---|---|
| greedy_dot_product | Minimize v·L (dot product) | O(B×W×d) | Panigrahy et al. |
| norm_minimizing | Minimize \|\|L+v\|\|² | O(B×W×d) | Chakrabarty & Swamy |
| best_fit | Minimize distance to uniform | O(B×W×d) | VBP Best Fit |
| greedy_variance | Minimize Var(L+v) | O(B×W×d) | Original CAPS |
| differencing | Multi-dim Karmarkar-Karp | O(B×W×d) | - |

Tier 2: Near-Optimal (Faster)

| Strategy | Algorithm | Complexity | Notes |
|---|---|---|---|
| power_of_d | Sample k candidates, min variance | O(B×k×d) | k=8 default |
| power_of_d_dot | Sample k candidates, min dot | O(B×k×d) | Literature recommended |
| lpt_balanced | LPT with load balancing | O(W log W + B×d) | Good for heavy requests |

Tier 3: Baselines

| Strategy | Algorithm | Complexity | Notes |
|---|---|---|---|
| fcfs | First-come-first-served | O(B) | Baseline, no optimization |
| random | Random selection | O(B) | Lower bound on benefits |
| lpt | Longest Processing Time | O(W log W) | Harmful for load balance |
| karmarkar_karp | Number partitioning | O(W log W) | Suboptimal for vectors |

Architecture

The Semantic Hotspot Problem

In MoE models, each token routes to a subset of experts (e.g., 4 of 60). Semantically similar requests activate the same experts:

Burst of coding questions:
  Request 1 (Python)     -> Experts [12, 15, 23, 45]
  Request 2 (JavaScript) -> Experts [12, 15, 23, 47]
  Request 3 (Rust)       -> Experts [12, 15, 23, 44]

Result: Experts 12, 15, 23 are overloaded
Batch latency = max(expert_load) -> High due to stragglers
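For the toy burst above, the overload is easy to tally; a minimal sketch (expert IDs copied from the example):

```python
from collections import Counter

# Expert activations from the burst above (top-4 routing per request)
activations = {
    "python":     [12, 15, 23, 45],
    "javascript": [12, 15, 23, 47],
    "rust":       [12, 15, 23, 44],
}

# Tally per-expert load across the batch
load = Counter(e for experts in activations.values() for e in experts)

# Experts hit by every request in the burst
hot = sorted(e for e, n in load.items() if n == len(activations))
print(hot)                 # [12, 15, 23]
print(max(load.values()))  # batch latency tracks the most loaded expert: 3
```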

CAPS Solution

CAPS reorders the queue to batch semantically diverse requests together:

1. Router dry-run: Extract expert activation vectors via gating layers only
2. Queue management: Tag each request with its load vector
3. Batch formation:
   a. Seed with oldest request (anti-starvation)
   b. Greedily add requests that minimize aggregate load variance
   c. Result: Load balanced across all experts

Mixed batch (coding + history + math) -> Even expert load -> Lower peak -> Lower latency

Execution Model

The discrete-event simulation uses an MoE-aware execution model:

execution_time = base_time × imbalance_factor

where:
  base_time = prefill_time + decode_time
  imbalance_factor = 1 + CV(load_vector) × sensitivity
  CV = coefficient of variation (std/mean)

This captures the straggler effect: batch latency is determined by the most loaded expert.
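A minimal sketch of this execution model (base_time and sensitivity values are illustrative, not taken from the repo):

```python
import numpy as np

def execution_time(load_vector, base_time=1.0, sensitivity=0.5):
    """MoE-aware latency model: base time inflated by expert load imbalance."""
    load = np.asarray(load_vector, dtype=float)
    cv = load.std() / load.mean()  # coefficient of variation (std/mean)
    return base_time * (1.0 + cv * sensitivity)

balanced = execution_time([2, 2, 2, 2])  # CV = 0 -> no straggler penalty
skewed   = execution_time([5, 1, 1, 1])  # same total load, CV ~ 0.87 -> ~1.43x
```

Note that both batches carry the same total load; only the skewed one pays the imbalance penalty.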

Configuration

Edit configs/l4_config.json:

{
    "model_name": "Qwen/Qwen1.5-MoE-A2.7B-Chat",
    "quantization": "nf4",
    "num_experts": 60,
    "top_k_experts": 4,
    "scheduler": {
        "strategy": "greedy_variance",
        "max_batch_size": 8,
        "window_size": 32,
        "min_batch_trigger": 16,
        "scheduling_interval_ms": 100
    }
}

| Parameter | Description | Default |
|---|---|---|
| strategy | Batch selection algorithm (see Available Strategies) | greedy_variance |
| max_batch_size | Maximum requests per batch | 8 |
| window_size | Lookahead window for candidate selection | 32 |
| min_batch_trigger | Minimum queue depth before scheduling | 16 |
| num_experts | Number of experts in the MoE model | 60 |
| top_k_experts | Experts activated per token | 4 |
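The scheduler block is plain JSON and can be read with the standard json module; a minimal sketch using the values above (the batch ≤ trigger ≤ window ordering is an inference from the parameter descriptions, not verified against the code):

```python
import json

# Inline copy of the scheduler-relevant fields from configs/l4_config.json
l4_config = json.loads("""
{
    "num_experts": 60,
    "top_k_experts": 4,
    "scheduler": {
        "strategy": "greedy_variance",
        "max_batch_size": 8,
        "window_size": 32,
        "min_batch_trigger": 16
    }
}
""")

sched = l4_config["scheduler"]
# Batches are drawn from a lookahead window, and scheduling waits until
# the queue is at least min_batch_trigger deep (inferred relationship)
assert sched["max_batch_size"] <= sched["min_batch_trigger"] <= sched["window_size"]
```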

Results Schema

paper_results/results.json is the authoritative source for all metrics:

{
  "config": {
    "requests": 3000,
    "seeds": [42, 123, 456, 789],
    "rates": [150.0, 200.0, 250.0, 300.0],
    "generated": "2025-12-05T12:42:17"
  },
  "strategies": ["best_fit", "differencing", "fcfs", ...],
  "bursty": {
    "150": {
      "greedy_variance": {
        "p50_mean": 1575.48, "p50_std": 312.78,
        "p90_mean": 2598.68, "p90_std": 505.77,
        "p99_mean": 3031.60, "p99_std": 528.69,
        "throughput_mean": 131.61, "throughput_std": 0.76,
        "imbalance_mean": 1.593, "imbalance_std": 0.003
      },
      "_improvements": {
        "greedy_variance": {
          "p99_reduction_pct": 46.9,
          "throughput_gain_pct": 12.7,
          "imbalance_reduction_pct": 11.3
        }
      }
    }
  },
  "poisson": { ... }
}
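Assuming the schema above, headline metrics can be pulled out with the standard json module (a hedged sketch; key names follow the excerpt, not a verified API):

```python
import json

def p99_improvement(results: dict, pattern: str, rate: str, strategy: str) -> float:
    """Read the relative P99 reduction for one (pattern, rate, strategy) cell."""
    return results[pattern][rate]["_improvements"][strategy]["p99_reduction_pct"]

# With the real file:
#   results = json.load(open("paper_results/results.json"))
#   p99_improvement(results, "bursty", "150", "greedy_variance")  # 46.9 per the excerpt
```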

Algorithm Details

Core Algorithm (Greedy Variance Minimization)

def select_batch(candidates: list[Request], batch_size: int) -> list[Request]:
    # Anti-starvation: always seed the batch with the oldest request
    selected = [candidates[0]]
    # load_vector: per-expert load tensor from the router dry-run
    current_load = candidates[0].load_vector.clone()

    remaining = set(range(1, len(candidates)))

    while len(selected) < batch_size and remaining:
        # Greedy step: add the request whose load vector minimizes
        # the variance of the aggregate batch load
        best_idx = min(
            remaining,
            key=lambda i: (current_load + candidates[i].load_vector).var()
        )
        selected.append(candidates[best_idx])
        current_load += candidates[best_idx].load_vector
        remaining.remove(best_idx)

    return selected

Why Variance Minimization Works

For a load vector L over d experts:

Var(L) = E[L²] - E[L]²

With top-k routing, every request activates the same number of experts, so each candidate adds the same total load. The aggregate's mean E[L] is therefore fixed across choices, and minimizing Var(L) reduces to minimizing E[L²] = ||L||²/d, the squared L2 norm.

Minimizing variance is thus equivalent to:

  • Minimizing the L2 norm of the aggregate load
  • Approximately minimizing the maximum expert load
  • Selecting complementary (near-orthogonal) load vectors
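A quick numerical check of the equivalence (illustrative vectors; each candidate activates the same number of experts, so totals are fixed as with top-k routing):

```python
import numpy as np

current = np.array([3.0, 1.0, 0.0, 0.0])  # aggregate batch load so far
# Each candidate adds the same total load (two activated experts)
candidates = [
    np.array([1.0, 1.0, 0.0, 0.0]),  # piles onto the hot experts
    np.array([0.0, 0.0, 1.0, 1.0]),  # complementary: fills idle experts
    np.array([1.0, 0.0, 1.0, 0.0]),
]

by_variance = min(range(3), key=lambda i: (current + candidates[i]).var())
by_norm     = min(range(3), key=lambda i: ((current + candidates[i]) ** 2).sum())

assert by_variance == by_norm == 1  # both objectives pick the complementary request
```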

Complexity Analysis

| Operation | Time | Space |
|---|---|---|
| Greedy selection | O(B × W × d) | O(W × d) |
| Power of d | O(B × k × d) | O(k × d) |
| FCFS | O(B) | O(1) |

Where: B = batch size, W = window size, d = number of experts, k = sample size

Practical overhead: <1ms per scheduling decision for typical parameters.
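The overhead claim can be sanity-checked with a quick synthetic micro-benchmark of one greedy scheduling decision (illustrative; not the repo's benchmark harness):

```python
import time
import numpy as np

rng = np.random.default_rng(42)
d, W, B = 60, 32, 8            # experts, window size, batch size
loads = rng.random((W, d))     # one synthetic load vector per queued request

start = time.perf_counter()
selected = [0]                 # seed with the oldest request
agg = loads[0].copy()
pool = set(range(1, W))
while len(selected) < B and pool:
    nxt = min(pool, key=lambda i: (agg + loads[i]).var())
    selected.append(nxt)
    agg += loads[nxt]
    pool.remove(nxt)
elapsed_ms = (time.perf_counter() - start) * 1e3
# On typical hardware this completes well under a millisecond
```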

Limitations

  1. Single-GPU focus: Implementation targets single-GPU inference. Multi-node expert placement is not addressed.

  2. Synthetic workloads: The bursty arrival generator simulates semantic clustering but production traces would strengthen validation.

  3. Simplified execution model: The MoE execution model uses coefficient of variation as a proxy for straggler effects. Real systems have additional factors (network latency, memory bandwidth, load balancing overhead).

  4. Low-load overhead: At very low arrival rates with uniform traffic, the optimization overhead may exceed benefits. FCFS is preferable when queue depth is consistently < min_batch_trigger.

Requirements

Core (simulation only):

  • Python 3.9+
  • PyTorch 2.0+
  • NumPy, Matplotlib, tqdm

Model inference (optional):

  • NVIDIA GPU with 24GB VRAM (L4, A10, RTX 4090)
  • transformers, accelerate, bitsandbytes

References

  • Panigrahy et al., "Heuristics for Vector Bin Packing" (2011)
  • Mitzenmacher, "The Power of Two Choices in Randomized Load Balancing" (2001)
  • Qwen1.5-MoE Architecture: https://qwenlm.github.io/blog/qwen-moe/
  • Switch Transformer: Fedus et al. (2021)

License

MIT License
