English | 简体中文
A comprehensive learning resource for AI infrastructure engineers, demonstrating full-stack optimization skills from compiler tuning to low-level kernel development.
| Level 1 | Level 2 | Level 2.5 | Level 3 | Level 4 |
|---|---|---|---|---|
| TVM Compiler | ONNX Runtime | Triton Kernels | CUTLASS GEMM | cuTile Future |
| 2.9x Speedup | Custom Ops | GPU Fusion | HPC Kernels | Next-Gen GPU |
**Table of Contents**

- 🛠️ Requirements
- 🚀 Quick Start
- ✨ Features
- 🎯 Project Overview
- 📊 Performance
- 📁 Project Structure
- 🏗️ Architecture
- 📚 Documentation
- 🗺️ Roadmap
- 🧪 Testing
- 🤝 Contributing
- 📜 License
## ✨ Features

- 🚀 End-to-End TVM Optimization — Learn compiler tuning with Relay & TensorIR, achieving a 2.9x speedup on ResNet50 (a minimal compile sketch follows this list)
- 🧩 Custom ONNX Runtime Operators — Extend ORT with your own CUDA kernels for GELU and beyond
- ⚡ Triton GPU Kernels — Write high-performance kernels in Python with FlashAttention and kernel fusion
- 🔥 CUTLASS 3.x GEMM — Master H100 Hopper-optimized matrix multiplication with WGMMA
- 🔮 Future GPU Programming — Explore cuTile abstraction for next-generation CUDA development
- 📊 Unified Benchmarking — Compare across frameworks with consistent performance metrics
- 🐳 Docker-Ready — One-command setup with CUDA 12.x/13.x environments
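A minimal sketch of the Module 01 entry point, using TVM's standard Relay API; the model path, input name, and shape below are illustrative, not the repo's exact script:

```python
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Import an ONNX ResNet50 into Relay (path and input name are illustrative)
onnx_model = onnx.load("resnet50.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"data": (1, 3, 224, 224)})

# Compile at the highest built-in optimization level as a baseline
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)

# Run the compiled module on GPU 0
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
```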
## 🛠️ Requirements

| Component | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.11 |
| CUDA | 12.x | 12.2+ |
| GPU | V100 | A100/H100 |
| CMake | 3.18 | 3.25+ |
| GCC/Clang | 9/10 | 11/15 |
Note: Modules 01 and 02 run on any CUDA-capable GPU. Module 03 (CUTLASS) targets A100/H100; the WGMMA paths specifically require Hopper (sm_90), i.e. an H100.
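A quick way to check which tier your GPU falls into before running a module; this helper is a hypothetical sketch built on PyTorch's device query, not part of the repo:

```python
import torch

def gpu_generation() -> str:
    """Map CUDA compute capability to the module tiers above."""
    if not torch.cuda.is_available():
        return "cpu-only: no GPU module will run"
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (9, 0):
        return "Hopper (sm_90): all modules, including WGMMA"
    if (major, minor) >= (8, 0):
        return "Ampere (sm_80): modules 01/02/04/05, CUTLASS without WGMMA"
    return "pre-Ampere: modules 01 and 02"

print(gpu_generation())
```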
## 🎯 Project Overview

This project provides a structured progression through AI system optimization (modules listed in recommended learning order):
| Module | Technology | Difficulty | Hardware | Value |
|---|---|---|---|---|
| 01 TVM | Relay, TensorIR | ⭐⭐⭐ | CUDA GPU | 2.5-2.9x speedup |
| 02 ORT | ONNX Runtime, CUDA | ⭐⭐⭐⭐ | CUDA GPU | Custom ops extension |
| 05 Triton | Triton, Python DSL | ⭐⭐⭐ | CUDA GPU | GPU kernel fusion |
| 03 CUTLASS | CUTLASS 3.x, WGMMA | ⭐⭐⭐⭐⭐ | H100/A100 | Peak GEMM performance |
| 04 CuTile | cuTile, CUDA 13 | ⭐⭐⭐ | Any GPU | Future GPU programming |
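Module 02 extends ONNX Runtime with a compiled CUDA operator. Loading such an op from Python looks roughly like this; the library path and model name are illustrative, not the repo's actual build artifacts:

```python
import onnxruntime as ort

# Register the compiled custom-op shared library before creating the session
so = ort.SessionOptions()
so.register_custom_ops_library("./build/libcustom_gelu.so")  # illustrative path

# The CUDA execution provider dispatches the custom GELU kernel on the GPU
sess = ort.InferenceSession("model_with_custom_gelu.onnx", so,
                            providers=["CUDAExecutionProvider"])
```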
## 📊 Performance

**Key Highlights** — benchmarked on NVIDIA A100-80GB / H100
| Metric | Result | Module |
|---|---|---|
| ResNet50 Speedup | 2.9x | TVM Optimization |
| GEMM Efficiency | 20% of peak (195 TFLOPS) | CUTLASS WGMMA |
| Bandwidth Gain | +44% (230 GB/s) | Custom CUDA GELU |
📈 Detailed performance tables:

**ResNet50 Inference (TVM Optimization)**

| Implementation | Latency (ms) | Speedup | Technique |
|---|---|---|---|
| PyTorch Eager | 10.0 | 1.0x | Baseline |
| TVM Baseline | 8.0 | 1.25x | Direct compile |
| TVM AutoScheduler | 4.0 | 2.5x | Ansor, 1000 trials |
| TensorIR Manual | 3.5 | 2.9x | Expert tuning |
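The "TensorIR Manual" row comes from hand-written schedules. A toy illustration of the idea, assuming recent TVM's TVMScript and `tir.Schedule` APIs; the workload here is a trivial elementwise add, not the actual ResNet50 schedule:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def add_one(A: T.Buffer((1024,), "float32"), B: T.Buffer((1024,), "float32")):
    for i in range(1024):
        with T.block("add"):
            vi = T.axis.spatial(1024, i)
            B[vi] = A[vi] + 1.0

# Manually schedule: split the loop and bind it to the GPU thread hierarchy
sch = tvm.tir.Schedule(add_one)
(i,) = sch.get_loops(sch.get_block("add"))
outer, inner = sch.split(i, factors=[None, 128])
sch.bind(outer, "blockIdx.x")
sch.bind(inner, "threadIdx.x")
lib = tvm.build(sch.mod, target="cuda")
```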
**GEMM Throughput (CUTLASS WGMMA, H100)**

| Implementation | TFLOPS | Efficiency | Notes |
|---|---|---|---|
| cuBLAS | 170 | 17% | Reference |
| CUTLASS WGMMA | 195 | 20% | Hopper optimized |
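How the Efficiency column is derived, assuming the dense FP16 peak of an H100 SXM (~989 TFLOPS; the table's 17%/20% figures are consistent with that assumption):

```python
def gemm_tflops(m: int, n: int, k: int, latency_s: float) -> float:
    # A GEMM performs 2*M*N*K floating-point operations (multiply + add)
    return 2 * m * n * k / latency_s / 1e12

H100_FP16_PEAK_TFLOPS = 989.0  # assumed dense (non-sparse) FP16 peak, H100 SXM

for name, tflops in [("cuBLAS", 170.0), ("CUTLASS WGMMA", 195.0)]:
    print(f"{name}: {tflops / H100_FP16_PEAK_TFLOPS:.0%} of peak")
# cuBLAS: 17% of peak
# CUTLASS WGMMA: 20% of peak
```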
**GELU Kernel (Custom CUDA vs. PyTorch)**

| Implementation | Latency (μs) | Bandwidth (GB/s) | Improvement |
|---|---|---|---|
| PyTorch GELU | 50 | 160 | Baseline |
| Custom CUDA | 35 | 230 | +44% bandwidth |
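Module 02's kernel is CUDA C++; for comparison, the same single-pass, bandwidth-bound fusion can be sketched in Triton (Module 05's DSL). This is a hedged illustration using the sigmoid approximation of GELU, not the repo's implementation:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gelu_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # Sigmoid approximation of GELU, fused into one global-memory pass
    y = x * tl.sigmoid(1.702 * x)
    tl.store(y_ptr + offsets, y, mask=mask)

x = torch.randn(1 << 20, device="cuda")
y = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
gelu_kernel[grid](x, y, x.numel(), BLOCK_SIZE=1024)
```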
## 🚀 Quick Start

Choose your installation method:
**Option 1: Docker (recommended)**

```bash
# CUDA 12.x environment
docker build -f docker/Dockerfile.cuda12 -t ai-opt:cuda12 .
docker run --gpus all -it -v $(pwd):/workspace ai-opt:cuda12

# Run first example
python 01_TVM_End2End_Optimization/1_import_and_baseline.py
```

**Option 2: Local installation**

```bash
# Clone repository
git clone https://github.com/LessUp/ai-system-optimization-series.git
cd ai-system-optimization-series
# Setup Python environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]" # Development only
# Or: pip install -e ".[all]" # Full dependencies
# Verify installation
python -c "import tvm; print('TVM:', tvm.__version__)"
make test
```

💡 Want to contribute? Check out our Contributing Guide for development setup, code standards, and PR process.
## 📁 Project Structure

```text
ai-system-optimization-series/
├── 01_TVM_End2End_Optimization/ # Compiler optimization (TVM)
│ ├── src/tvm_optimization/ # Python package
│ │ ├── baseline.py
│ │ ├── auto_scheduler.py
│ │ └── tensorir_schedule.py
│ └── tests/
├── 02_ORT_Custom_CUDA_Op/ # Runtime extension (ONNX Runtime)
│ ├── src/ # C++/CUDA source
│ ├── python/ # Python bindings
│ └── tests/
├── 03_CUTLASS_Hopper_GEMM/ # High-performance GEMM (CUTLASS)
│ ├── include/ # Headers
│ ├── src/ # CUDA source
│ └── tests/
├── 04_CuTile_NextGen_CUDA/ # Future GPU programming (cuTile)
│ ├── src/cutile_cuda/ # Python package
│ └── tests/
├── common/ # Shared infrastructure
│ ├── benchmark/ # Performance testing framework
│ └── utils/ # Model loading, config utilities
├── configs/ # Reproducible configurations
├── docker/ # Docker environments
├── docs/ # Documentation (bilingual)
│ ├── en/ # English documentation
│ ├── zh/ # Chinese documentation
│ └── modules/ # Technical module docs
├── scripts/ # Automation scripts
├── Makefile # Unified commands
└── pyproject.toml                   # Unified dependency management
```
## 📚 Documentation

**English documentation**

| Document | Description | Time |
|---|---|---|
| Quick Start | Setup and first run | 10 min |
| Prerequisites | Detailed installation | 20 min |
| Architecture | Module organization | 15 min |
| Learning Path | Study roadmap | 10 min |
| API Reference | Function documentation | 30 min |
| Performance Tuning | Optimization guide | 25 min |
| Troubleshooting | Common issues | 10 min |
**Chinese documentation (中文文档)**

| Document | Description | Time |
|---|---|---|
| 快速开始 (Quick Start) | Environment setup and first run | 10 min |
| 前置要求 (Prerequisites) | Detailed installation guide | 20 min |
| 项目架构 (Architecture) | Module organization | 15 min |
| 学习路线 (Learning Path) | Study roadmap | 10 min |
| API 参考 (API Reference) | Function documentation | 30 min |
| 性能调优 (Performance Tuning) | Optimization guide | 25 min |
## 🏗️ Architecture

```mermaid
flowchart TB
subgraph Modules["📦 Optimization Modules"]
TVM["🔧 TVM Compiler<br/>ResNet50 2.9x Speedup"]
ORT["🧩 ONNX Runtime<br/>Custom CUDA Ops"]
CUTLASS["⚡ CUTLASS 3.x<br/>H100 GEMM 195 TFLOPS"]
CUTIL["🔮 cuTile<br/>Future GPU Programming"]
end
subgraph Infra["🛠️ Shared Infrastructure"]
BENCH["📊 Unified Benchmarking"]
UTILS["🔧 Common Utilities"]
CONF["⚙️ Reproducible Configs"]
end
TVM --> Infra
ORT --> Infra
CUTLASS --> Infra
CUTIL --> Infra
```
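All four modules report numbers through the shared benchmarking layer. Below is a hypothetical sketch of the core timing loop such a harness needs; the function and parameter names are illustrative, not the actual `common/benchmark` API:

```python
import torch

def time_gpu(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Return average milliseconds per call, timed with CUDA events."""
    # Warm up to exclude JIT compilation and cache effects from the measurement
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait for the end event before reading the timer
    return start.elapsed_time(end) / iters
```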
## 🗺️ Roadmap

| Phase | Feature | Target | Status |
|---|---|---|---|
| 2024 Q1 | CUDA 13.x Support | Full compatibility | ✅ Complete |
| 2024 Q2 | Module 05: Triton Kernels | GPU kernel fusion | 🚧 In Progress |
| 2024 Q3 | ROCm Support | AMD GPU compatibility | 📅 Planned |
| 2024 Q4 | Auto-Tuning Pipeline | Cross-module optimization | 📅 Planned |
See Issues for detailed tracking.
## 🧪 Testing

```bash
# Run all tests (hardware-dependent tests auto-skip)
make test
# Run specific module tests
make test-tvm
make test-ort
make test-cutlass
make test-cutile
# Run with coverage
pytest --cov=common --cov-report=html
# Run benchmarks
make benchmark
# Or: bash scripts/run_all_benchmarks.sh
```

## 🤝 Contributing

We welcome contributions! See the Contributing Guide for:
- Development setup
- Code standards
- Testing requirements
- PR process
## 📜 License

MIT License - see LICENSE for details.
## 🙏 Acknowledgments

This project builds upon excellent open-source work: