LLM GPU Benchmark Suite

Automated LLM inference benchmarking on consumer GPUs via vast.ai

Spin up GPU instances, run comprehensive benchmarks, collect results, and tear down - all with a single command.


Features

  • One-command benchmarking: python benchmark_runner.py --suite config.yaml
  • Consumer GPU focus: RTX 5060 Ti, 5070 Ti, 5090 (and extensible to others)
  • Multiple workloads: RAG (long context), API (high concurrency), Agentic (LoRA multi-tenant)
  • Comprehensive metrics: Throughput, latency (TTFT, ITL, P95/P99), power, energy efficiency
  • Cost-efficient: Uses vast.ai spot instances, auto-terminates after completion
  • Paper-ready output: Auto-generates LaTeX tables and CSV exports

Quick Start

Prerequisites

  • Python 3.10+
  • vast.ai account with API key
  • AWS S3 bucket for results (optional but recommended)
  • HuggingFace account with token (for gated models)

Installation

git clone https://github.com/yourusername/llm-gpu-benchmark.git
cd llm-gpu-benchmark

# Install dependencies
pip install -r requirements.txt

# Configure credentials
cp .env.example .env
# Edit .env with your API keys

Your First Benchmark

# Run using one of our benchmark configs (reproduces paper results)
python benchmark_runner.py --suite research_results/results_config/rtx5090_1x.yaml

# Or create your own config from the template
cp configs/template.yaml configs/my_benchmark.yaml
# Edit configs/my_benchmark.yaml with your settings
python benchmark_runner.py --suite configs/my_benchmark.yaml

# The script will:
# 1. Find and rent a GPU on vast.ai
# 2. Set up the environment (vLLM, models)
# 3. Run all benchmarks in the config
# 4. Upload results to S3
# 5. Terminate the instance
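
For orientation, here is a minimal Python sketch of that lifecycle driven through the vastai CLI via subprocess. The command invocations and placeholder IDs are assumptions about typical vast.ai usage, not the actual internals of benchmark_runner.py.

# Simplified sketch of the rent -> benchmark -> terminate lifecycle.
# The vastai CLI invocations below are assumptions; benchmark_runner.py
# handles offer selection, setup, and upload itself.
import subprocess

# 1. Find an offer for the requested GPU (offer-ID parsing omitted)
subprocess.run(["vastai", "search", "offers", "gpu_name=RTX_5090"], check=True)

offer_id = "12345"  # hypothetical placeholder for a parsed offer ID

# 2. Rent the instance with the prebuilt benchmark image
subprocess.run(["vastai", "create", "instance", offer_id,
                "--image", "holtmann/llm-benchmark:latest",
                "--disk", "100"], check=True)

# 3.-4. Environment setup, the benchmark loop, and the S3 upload run
#       on the instance, driven by the suite config.

instance_id = "67890"  # hypothetical placeholder

# 5. Destroy the instance so billing stops
subprocess.run(["vastai", "destroy", "instance", instance_id], check=True)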

Configuration

Benchmarks are defined in YAML files. Here's the structure:

name: RTX 5090 1x GPU
description: Benchmark suite for RTX 5090 32GB single GPU

instance:
  gpu_type: RTX 5090          # GPU to rent on vast.ai
  gpu_count: 1                # Number of GPUs
  disk_space_gb: 100          # Disk space for models
  image: holtmann/llm-benchmark:latest
  max_bid_price: 2.0          # Max $/hr for spot instance

s3:
  bucket: ${S3_BUCKET_NAME}   # From .env
  prefix: benchmarks/rtx5090_1x
  upload_json: true
  upload_csv: true

benchmarks:
  - name: qwen3-8b-nvfp4-rag-8k-c8
    model: nvidia/Qwen3-8B-NVFP4
    vllm:
      max_model_len: 9216
      gpu_memory_utilization: 0.9
      dtype: auto
    aiperf:
      endpoint_type: chat
      streaming: true
      concurrency: 8
      synthetic_input_tokens_mean: 8192
      output_tokens_mean: 512
      request_count: 500
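
To sanity-check a config before renting a GPU, a short PyYAML sketch like the one below can catch missing keys early. The required field names are inferred from the example above; the runner's own validation may differ.

# Minimal config sanity check, assuming PyYAML (pip install pyyaml).
# Required keys are inferred from the example config above.
import sys
import yaml

def check_config(path: str) -> None:
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for key in ("name", "instance", "benchmarks"):
        if key not in cfg:
            sys.exit(f"missing top-level key: {key}")
    for bench in cfg["benchmarks"]:
        if "model" not in bench:
            sys.exit(f"benchmark {bench.get('name', '?')} has no model")
    print(f"{path}: {len(cfg['benchmarks'])} benchmark(s) look well-formed")

check_config("configs/my_benchmark.yaml")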

Workload Types

RAG (Retrieval-Augmented Generation)

Long context, moderate concurrency - typical enterprise RAG pipelines.

- name: qwen3-8b-nvfp4-rag-16k-c4
  aiperf:
    concurrency: 4
    synthetic_input_tokens_mean: 16384  # 16k context
    output_tokens_mean: 512

API (High-Concurrency)

Short context, high concurrency - chatbot/API serving.

- name: qwen3-8b-nvfp4-api-c128
  aiperf:
    concurrency: 128
    synthetic_input_tokens_mean: 256   # Short prompts
    output_tokens_mean: 256

Agentic (Multi-LoRA)

LoRA adapter switching for multi-tenant deployments.

- name: qwen3-8b-nvfp4-agentic-lora-c32
  vllm:
    enable_lora: true
    max_loras: 3
    lora_modules:
      - name: customer-support
        path: /models/loras/customer-support
      - name: code-assistant
        path: /models/loras/code-assistant
  aiperf:
    model_selection_strategy: random  # Randomly select adapter per request

Metrics Collected

Metric                             Description                                    Unit
output_token_throughput            Total tokens generated per second              tok/s
output_token_throughput_per_user   Per-request throughput                         tok/s
time_to_first_token                Latency to first token (avg, P50, P95, P99)    ms
inter_token_latency                Streaming speed (avg, P50, P95, P99)           ms
avg_power_w                        Average GPU power during benchmark             W
wh_per_mtok                        Energy efficiency                              Wh/million tokens
max_temp_c                         Peak GPU temperature                           °C
throttle_pct                       Thermal throttling percentage                  %
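
As a worked example of the energy-efficiency unit: wh_per_mtok is average power multiplied by run duration, expressed per million output tokens. A minimal sketch follows; whether the suite computes it in exactly this way is an assumption.

# Energy efficiency in Wh per million output tokens.
# Inputs correspond to the metrics above; the exact formula used by
# the suite is an assumption.
def wh_per_mtok(avg_power_w: float, duration_s: float, output_tokens: int) -> float:
    energy_wh = avg_power_w * duration_s / 3600.0  # W * h = Wh
    return energy_wh / (output_tokens / 1e6)

# e.g. 450 W for 600 s producing 250k tokens -> 75 Wh / 0.25 Mtok = 300 Wh/Mtok
print(wh_per_mtok(450, 600, 250_000))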

Running Multiple Configs in Parallel

Each config runs independently - perfect for parallel execution:

# Terminal 1
python benchmark_runner.py --suite research_results/results_config/rtx5060ti_1x.yaml

# Terminal 2
python benchmark_runner.py --suite research_results/results_config/rtx5070ti_1x.yaml

# Terminal 3
python benchmark_runner.py --suite research_results/results_config/rtx5090_1x.yaml

# Results go to separate S3 prefixes, no conflicts
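
If you prefer a single shell, a small launcher can start all three suites as subprocesses. This is a sketch, not a built-in feature of the runner; the configs are the same ones shown above.

# Launch several independent suites in parallel from one shell.
# Each config writes to its own S3 prefix, so runs do not conflict.
import subprocess

configs = [
    "research_results/results_config/rtx5060ti_1x.yaml",
    "research_results/results_config/rtx5070ti_1x.yaml",
    "research_results/results_config/rtx5090_1x.yaml",
]

procs = [
    subprocess.Popen(["python", "benchmark_runner.py", "--suite", cfg])
    for cfg in configs
]
for p in procs:
    p.wait()  # block until every suite has finished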

Cost Estimates

GPU           Config   Benchmarks   Est. Time   Est. Cost
RTX 5060 Ti   1x       33           ~4-5 hrs    ~$2-4
RTX 5060 Ti   2x       22           ~3 hrs      ~$3-5
RTX 5070 Ti   1x       34           ~4-5 hrs    ~$3-5
RTX 5070 Ti   2x       26           ~3.5 hrs    ~$5-8
RTX 5090      1x       38           ~5-6 hrs    ~$5-10
RTX 5090      2x       30           ~4 hrs      ~$8-15

Costs vary with current vast.ai spot pricing.
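
The estimates are simply instance hours multiplied by the spot rate. For example, at a hypothetical $1.50/hr, a ~5.5 hr RTX 5090 1x run comes to about $8.25, inside the ~$5-10 band above.

# Back-of-the-envelope cost: hours on the instance * spot $/hr.
# The $1.50/hr rate is a hypothetical example, not a quoted price.
def estimate_cost(est_hours: float, spot_price_per_hr: float) -> float:
    return est_hours * spot_price_per_hr

print(estimate_cost(5.5, 1.50))  # -> 8.25 (USD)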

Output Structure

Results are organized per benchmark run:

results/
└── 20241217_143052_RTX_5090_1x/
    ├── qwen3-8b-nvfp4-rag-8k-c8/
    │   ├── metadata.json           # Config and status
    │   ├── profile_export_aiperf.json  # Detailed metrics
    │   ├── gpu_metrics.log         # Power/temp timeline
    │   └── vllm.log               # Server logs
    ├── qwen3-8b-nvfp4-rag-16k-c4/
    │   └── ...
    └── summary.json               # Suite-level summary
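
A small sketch for collecting each benchmark's metadata.json under a run directory; the "status" field is an assumed key, so inspect your own files for the actual schema.

# Walk a run directory and print each benchmark's status.
# The "status" key is an assumption about metadata.json's schema.
import json
from pathlib import Path

run_dir = Path("results/20241217_143052_RTX_5090_1x")
for meta_path in sorted(run_dir.glob("*/metadata.json")):
    meta = json.loads(meta_path.read_text())
    print(meta_path.parent.name, "->", meta.get("status", "unknown"))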

Generating Paper Tables

Convert results to LaTeX tables:

python scripts/generate_paper_tables.py --results-dir results/ --output paper_tables.tex

Output:

\begin{table}[H]
  \caption{RAG workload throughput on RTX 5090}
  \begin{tabular}{l l c c c c}
    Model & Precision & Context & TPS & TTFT & ITL P95 \\
    \midrule
    Qwen3-8B & NVFP4 & 8k & 422.4 & 565 & 18.4 \\
    Qwen3-8B & NVFP4 & 16k & 225.3 & 1474 & 38.4 \\
    ...
  \end{tabular}
\end{table}
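
If you need a table layout the script does not produce, formatting rows yourself is straightforward. The dict keys below mirror the columns in the sample table; how they map onto profile_export_aiperf.json is an assumption.

# Format one LaTeX table row from a metrics dict (illustrative keys).
row = {
    "model": "Qwen3-8B", "precision": "NVFP4", "context": "8k",
    "tps": 422.4, "ttft_ms": 565, "itl_p95_ms": 18.4,
}
print(f"{row['model']} & {row['precision']} & {row['context']} & "
      f"{row['tps']:.1f} & {row['ttft_ms']} & {row['itl_p95_ms']:.1f} \\\\")
# -> Qwen3-8B & NVFP4 & 8k & 422.4 & 565 & 18.4 \\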

Environment Variables

Create a .env file:

# Required
VAST_API_KEY=your_vast_ai_api_key

# For gated models (Gemma, etc.)
HF_TOKEN=your_huggingface_token

# For S3 upload
AWS_ACCESS_KEY_ID=your_aws_key
AWS_SECRET_ACCESS_KEY=your_aws_secret
S3_BUCKET_NAME=your-bucket-name

# Optional: Custom LoRA adapters
LORA_CUSTOMER_SUPPORT=holtmann/qwen3-8b-customer-support-lora
LORA_TECHNICAL_DOCS=holtmann/qwen3-8b-technical-docs-lora
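
These values feed the runner through .env. If you script around it, they can be loaded the same way with python-dotenv (an assumption about tooling; install with pip install python-dotenv).

# Load .env into the process environment, assuming python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["VAST_API_KEY"]       # required
hf_token = os.environ.get("HF_TOKEN")      # only needed for gated models
bucket = os.environ.get("S3_BUCKET_NAME")  # only needed for S3 upload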

Adding New GPUs

  1. Create a new config file:

cp configs/template.yaml configs/rtx4090_1x.yaml

  2. Update the instance section:

instance:
  gpu_type: RTX 4090
  gpu_count: 1

  3. Adjust benchmarks to the available VRAM (24 GB on the 4090):

    • Remove models that won't fit
    • Adjust max context lengths
    • Set appropriate concurrency levels

  4. Run:

python benchmark_runner.py --suite configs/rtx4090_1x.yaml

Model Compatibility

This suite works with any model supported by vLLM, including:

  • Standard HuggingFace models (Llama, Mistral, Qwen, Gemma, etc.)
  • Quantized models (GPTQ, AWQ, NVFP4, W4A16, etc.)
  • MoE models (Mixtral, GPT-OSS, etc.)
  • Any model with LoRA adapters

Simply specify the HuggingFace model ID in your config:

model: Qwen/Qwen3-8B                              # Qwen3 8B
model: mistralai/Mistral-Small-24B-Instruct-2501  # Mistral Small 24B Instruct
model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B    # DeepSeek R1 distill of Qwen 7B
model: google/gemma-3-12b-it                      # Gemma 3 (gated; requires HF_TOKEN)

See vLLM Supported Models for the full list.

Docker Image

The benchmark environment is pre-built:

docker pull holtmann/llm-benchmark:latest

Includes:

  • vLLM with NVFP4/MXFP4 support
  • aiperf benchmarking tool
  • CUDA 12.x + cuDNN
  • HuggingFace Hub CLI
  • AWS CLI for S3 uploads

Build your own:

docker build -t my-benchmark:latest -f Dockerfile .

Troubleshooting

OOM Errors

  • Reduce max_model_len in config
  • Lower concurrency
  • Use more aggressive quantization (NVFP4 vs BF16)

vast.ai Instance Not Found

  • Increase max_bid_price
  • Try different GPU type
  • Check vast.ai availability

Model Download Fails

  • Verify HF_TOKEN is set
  • Check model exists on HuggingFace
  • Ensure sufficient disk space

LoRA Adapter Not Found

  • Check adapter name matches config
  • Verify HuggingFace repo exists
  • Check LORA_* env variables

Citation

If you use this benchmark suite in your research, please cite:

@misc{knoop2026privatellminferenceconsumer,
      title={Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs}, 
      author={Jonathan Knoop and Hendrik Holtmann},
      year={2026},
      eprint={2601.09527},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.09527}, 
}

License

MIT License - see LICENSE for details.

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add your GPU configs or improvements
  4. Submit a pull request

Ideas for contribution:

  • New GPU configurations (RTX 4090, A6000, etc.)
  • Additional workload types
  • Improved metrics collection
  • Documentation improvements

Acknowledgments

  • vLLM - High-throughput LLM serving
  • aiperf - LLM benchmarking tool
  • vast.ai - GPU cloud marketplace
  • NVIDIA - NVFP4 quantization support
