
# AI System Optimization Series


English | 简体中文

A comprehensive learning resource for AI infrastructure engineers, demonstrating full-stack optimization skills from compiler tuning to low-level kernel development.

```text
TVM Compiler → ONNX Runtime → Triton Kernels → CUTLASS GEMM → cuTile Future
    (Level 1)    (Level 2)      (Level 2.5)      (Level 3)      (Level 4)
   2.9x Speedup   Custom Ops    GPU Fusion       HPC Kernels   Next-Gen GPU
```

## ✨ Features

- 🚀 **End-to-End TVM Optimization** — learn compiler tuning with Relay & TensorIR, achieving a 2.9x speedup on ResNet50
- 🧩 **Custom ONNX Runtime Operators** — extend ORT with your own CUDA kernels for GELU and beyond
- ⚡ **Triton GPU Kernels** — write high-performance kernels in Python, with FlashAttention and kernel-fusion examples
- 🔥 **CUTLASS 3.x GEMM** — master Hopper-optimized matrix multiplication on H100 with WGMMA
- 🔮 **Future GPU Programming** — explore the cuTile abstraction for next-generation CUDA development
- 📊 **Unified Benchmarking** — compare across frameworks with consistent performance metrics
- 🐳 **Docker-Ready** — one-command setup with CUDA 12.x/13.x environments

## 🛠️ Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| Python | 3.10 | 3.11 |
| CUDA | 12.x | 12.2+ |
| GPU | V100 | A100/H100 |
| CMake | 3.18 | 3.25+ |
| GCC/Clang | 9/10 | 11/15 |

> **Note:** Modules 01 and 02 work on any CUDA-capable GPU. Module 03 (CUTLASS) requires H100/A100 for full WGMMA features.

## 🎯 Project Overview

This project provides a structured progression through AI system optimization:

| Module | Technology | Difficulty | Hardware | Value |
| --- | --- | --- | --- | --- |
| 01 TVM | Relay, TensorIR | ⭐⭐⭐ | CUDA GPU | 2.5-2.9x speedup |
| 02 ORT | ONNX Runtime, CUDA | ⭐⭐⭐⭐ | CUDA GPU | Custom op extension |
| 05 Triton | Triton, Python DSL | ⭐⭐⭐ | CUDA GPU | GPU kernel fusion |
| 03 CUTLASS | CUTLASS 3.x, WGMMA | ⭐⭐⭐⭐⭐ | H100/A100 | Peak GEMM performance |
| 04 CuTile | cuTile, CUDA 13 | ⭐⭐⭐ | Any GPU | Future GPU programming |

## 📊 Performance

**Key Highlights** — benchmarked on NVIDIA A100-80GB / H100

| Metric | Result | Module |
| --- | --- | --- |
| ResNet50 Speedup | 2.9x | TVM Optimization |
| GEMM Efficiency | 20% (195 TFLOPS) | CUTLASS WGMMA |
| Bandwidth Gain | +44% (230 GB/s) | Custom CUDA GELU |

📈 Detailed performance tables:

### TVM Optimization (ResNet50)

| Implementation | Latency (ms) | Speedup | Technique |
| --- | --- | --- | --- |
| PyTorch Eager | 10.0 | 1.0x | Baseline |
| TVM Baseline | 8.0 | 1.25x | Direct compile |
| TVM AutoScheduler | 4.0 | 2.5x | Ansor, 1000 trials |
| TensorIR Manual | 3.5 | 2.9x | Expert tuning |
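The speedup column follows directly from the latency column (baseline latency divided by optimized latency). As a quick sanity check:

```python
# Speedup = baseline latency / optimized latency, using the table above.
latencies_ms = {
    "PyTorch Eager": 10.0,       # baseline
    "TVM Baseline": 8.0,
    "TVM AutoScheduler": 4.0,
    "TensorIR Manual": 3.5,
}

baseline = latencies_ms["PyTorch Eager"]
for name, ms in latencies_ms.items():
    print(f"{name}: {baseline / ms:.2f}x")
# TensorIR Manual comes out at 10.0 / 3.5 ≈ 2.86x, reported as 2.9x.
```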

### CUTLASS GEMM (4096³, FP16, H100)

| Implementation | TFLOPS | Efficiency | Notes |
| --- | --- | --- | --- |
| cuBLAS | 170 | 17% | Reference |
| CUTLASS WGMMA | 195 | 20% | Hopper-optimized |
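Efficiency here is achieved throughput over the device peak. A sketch of the arithmetic — the ≈989 TFLOPS dense-FP16 peak for H100 SXM is an assumption for illustration, not a number from this repo:

```python
# FLOP count for a square GEMM C = A @ B is 2 * M * N * K
# (one multiply plus one add per inner-product term).
M = N = K = 4096
gemm_flops = 2 * M * N * K                   # ≈ 1.37e11 FLOPs

achieved_tflops = 195                        # CUTLASS WGMMA row above
peak_tflops = 989                            # assumed H100 SXM dense FP16 peak
efficiency = achieved_tflops / peak_tflops   # ≈ 0.197 → the ~20% in the table

# Implied kernel runtime at 195 TFLOPS:
runtime_us = gemm_flops / (achieved_tflops * 1e12) * 1e6
print(f"{efficiency:.1%}, runtime ≈ {runtime_us:.0f} µs")
```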

### Custom CUDA GELU

| Implementation | Latency | Bandwidth | Improvement |
| --- | --- | --- | --- |
| PyTorch GELU | 50 μs | 160 GB/s | Baseline |
| Custom CUDA | 35 μs | 230 GB/s | +44% bandwidth |
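GELU is elementwise and memory-bound at these sizes, which is why bandwidth is the headline metric. A pure-Python reference for the tanh-approximation form of GELU — a sketch for clarity, not the repo's kernel:

```python
import math

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU:
    0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    """
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

print(f"{gelu_tanh(1.0):.4f}")  # ≈ 0.8412

# Because the op reads and writes each element exactly once, the +44% in
# the table is simply the bandwidth ratio: 230 / 160 - 1 ≈ 0.44.
```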

## 🚀 Quick Start

Choose your installation method:

### Option 1: Docker (Recommended for Beginners)

```bash
# CUDA 12.x environment
docker build -f docker/Dockerfile.cuda12 -t ai-opt:cuda12 .
docker run --gpus all -it -v $(pwd):/workspace ai-opt:cuda12

# Run the first example
python 01_TVM_End2End_Optimization/1_import_and_baseline.py
```

### Option 2: Local Installation

```bash
# Clone the repository
git clone https://github.com/LessUp/ai-system-optimization-series.git
cd ai-system-optimization-series

# Set up the Python environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -e ".[dev]"        # Development dependencies only
# Or: pip install -e ".[all]"  # Full dependencies

# Verify the installation
python -c "import tvm; print('TVM:', tvm.__version__)"
make test
```

### Option 3: GitHub Codespaces

Open in GitHub Codespaces

> 💡 Want to contribute? Check out our Contributing Guide for development setup, code standards, and the PR process.

## 📁 Project Structure

```text
ai-system-optimization-series/
├── 01_TVM_End2End_Optimization/    # Compiler optimization (TVM)
│   ├── src/tvm_optimization/       # Python package
│   │   ├── baseline.py
│   │   ├── auto_scheduler.py
│   │   └── tensorir_schedule.py
│   └── tests/
├── 02_ORT_Custom_CUDA_Op/          # Runtime extension (ONNX Runtime)
│   ├── src/                        # C++/CUDA source
│   ├── python/                     # Python bindings
│   └── tests/
├── 03_CUTLASS_Hopper_GEMM/         # High-performance GEMM (CUTLASS)
│   ├── include/                    # Headers
│   ├── src/                        # CUDA source
│   └── tests/
├── 04_CuTile_NextGen_CUDA/         # Future GPU programming (cuTile)
│   ├── src/cutile_cuda/            # Python package
│   └── tests/
├── common/                         # Shared infrastructure
│   ├── benchmark/                  # Performance testing framework
│   └── utils/                      # Model loading, config utilities
├── configs/                        # Reproducible configurations
├── docker/                         # Docker environments
├── docs/                           # Documentation (bilingual)
│   ├── en/                         # English documentation
│   ├── zh/                         # Chinese documentation
│   └── modules/                    # Technical module docs
├── scripts/                        # Automation scripts
├── Makefile                        # Unified commands
└── pyproject.toml                  # Unified dependency management
```

## 📚 Documentation

### English

| Document | Description | Time |
| --- | --- | --- |
| Quick Start | Setup and first run | 10 min |
| Prerequisites | Detailed installation | 20 min |
| Architecture | Module organization | 15 min |
| Learning Path | Study roadmap | 10 min |
| API Reference | Function documentation | 30 min |
| Performance Tuning | Optimization guide | 25 min |
| Troubleshooting | Common issues | 10 min |

### Chinese (简体中文)

| Document | Description | Time |
| --- | --- | --- |
| Quick Start (快速开始) | Environment setup and first run | 10 min |
| Prerequisites (前置要求) | Detailed installation guide | 20 min |
| Architecture (项目架构) | Module organization | 15 min |
| Learning Path (学习路线) | Study roadmap | 10 min |
| API Reference (API 参考) | Function documentation | 30 min |
| Performance Tuning (性能调优) | Optimization guide | 25 min |

## 🏗️ Architecture

```mermaid
flowchart TB
    subgraph Modules["📦 Optimization Modules"]
        TVM["🔧 TVM Compiler<br/>ResNet50 2.9x Speedup"]
        ORT["🧩 ONNX Runtime<br/>Custom CUDA Ops"]
        CUTLASS["⚡ CUTLASS 3.x<br/>H100 GEMM 195 TFLOPS"]
        CUTIL["🔮 cuTile<br/>Future GPU Programming"]
    end

    subgraph Infra["🛠️ Shared Infrastructure"]
        BENCH["📊 Unified Benchmarking"]
        UTILS["🔧 Common Utilities"]
        CONF["⚙️ Reproducible Configs"]
    end

    TVM --> Infra
    ORT --> Infra
    CUTLASS --> Infra
    CUTIL --> Infra
```

## 🗺️ Roadmap

| Phase | Feature | Target | Status |
| --- | --- | --- | --- |
| 2024 Q1 | CUDA 13.x Support | Full compatibility | ✅ Complete |
| 2024 Q2 | Module 05: Triton Kernels | GPU kernel fusion | 🚧 In Progress |
| 2024 Q3 | ROCm Support | AMD GPU compatibility | 📅 Planned |
| 2024 Q4 | Auto-Tuning Pipeline | Cross-module optimization | 📅 Planned |

See Issues for detailed tracking.

## 🧪 Testing

```bash
# Run all tests (hardware-dependent tests are skipped automatically)
make test

# Run tests for a specific module
make test-tvm
make test-ort
make test-cutlass
make test-cutile

# Run with coverage
pytest --cov=common --cov-report=html

# Run benchmarks
make benchmark
# Or: bash scripts/run_all_benchmarks.sh
```
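Under the hood, a unified benchmark harness boils down to a warmup-then-measure loop. A minimal stdlib sketch of that pattern — the `benchmark` name and its signature are illustrative, not the repo's API:

```python
import statistics
import time

def benchmark(fn, *, warmup: int = 5, repeats: int = 20) -> dict:
    """Return latency statistics (in ms) for fn(), after warmup runs."""
    for _ in range(warmup):                  # exclude cold-start effects
        fn()
    samples_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1e3)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
    }

stats = benchmark(lambda: sum(range(100_000)))
print(sorted(stats))  # ['mean_ms', 'median_ms', 'min_ms']
```

For GPU kernels, a real harness must also synchronize the device around each timed call (e.g. `torch.cuda.synchronize()`), since CUDA launches are asynchronous and wall-clock timing would otherwise measure only the launch overhead.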

## 🤝 Contributing

We welcome contributions! See the Contributing Guide for:

- Development setup
- Code standards
- Testing requirements
- PR process

## 📜 License

MIT License - see LICENSE for details.

## 🙏 Acknowledgments

This project builds upon excellent open-source work, including Apache TVM, ONNX Runtime, NVIDIA CUTLASS, OpenAI Triton, and PyTorch.


➡️ Start Learning | 📖 Full Documentation