English | 简体中文
A comprehensive learning resource for AI infrastructure engineers, demonstrating full-stack optimization skills from compiler tuning to low-level kernel development.
| Level 1 | Level 2 | Level 2.5 | Level 3 | Level 4 |
|---|---|---|---|---|
| TVM Compiler | ONNX Runtime | Triton Kernels | CUTLASS GEMM | cuTile Future |
| 2.9x Speedup | Custom Ops | GPU Fusion | HPC Kernels | Next-Gen GPU |
**Table of Contents**

- 🛠️ Requirements
- 🚀 Quick Start
- ✨ Features
- 🎯 Project Overview
- 📊 Performance
- 📁 Project Structure
- 🏗️ Architecture
- 📚 Documentation
- 🗺️ Roadmap
- 🧪 Testing
- 🤝 Contributing
- 📜 License
## ✨ Features

- 🚀 End-to-End TVM Optimization — Learn compiler tuning with Relay & TensorIR, achieving a 2.9x speedup on ResNet50 (a minimal compile sketch follows this list)
- 🧩 Custom ONNX Runtime Operators — Extend ORT with your own CUDA kernels for GELU and beyond
- ⚡ Triton GPU Kernels — Write high-performance kernels in Python with FlashAttention and kernel fusion
- 🔥 CUTLASS 3.x GEMM — Master H100 Hopper-optimized matrix multiplication with WGMMA
- 🔮 Future GPU Programming — Explore cuTile abstraction for next-generation CUDA development
- 📊 Unified Benchmarking — Compare across frameworks with consistent performance metrics
- 🐳 Docker-Ready — One-command setup with CUDA 12.x/13.x environments
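A minimal sketch of the Module 01 entry point, using TVM's standard Relay API; the model path, input name, and shape below are illustrative, not the repo's exact script:

```python
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Import an ONNX ResNet50 into Relay (path and input name are illustrative)
onnx_model = onnx.load("resnet50.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, shape={"data": (1, 3, 224, 224)})

# Compile at the highest built-in optimization level as a baseline
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="cuda", params=params)

# Run the compiled module on GPU 0
dev = tvm.cuda(0)
module = graph_executor.GraphModule(lib["default"](dev))
```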
## 🛠️ Requirements

| Component | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.11 |
| CUDA | 12.x | 12.2+ |
| GPU | V100 | A100/H100 |
| CMake | 3.18 | 3.25+ |
| GCC/Clang | 9/10 | 11/15 |
Note: Modules 01 and 02 run on any CUDA-capable GPU. Module 03 (CUTLASS) targets A100/H100; the WGMMA paths specifically require Hopper (sm_90), i.e. an H100.
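A quick way to check which tier your GPU falls into before running a module; this helper is a hypothetical sketch built on PyTorch's device query, not part of the repo:

```python
import torch

def gpu_generation() -> str:
    """Map CUDA compute capability to the module tiers above."""
    if not torch.cuda.is_available():
        return "cpu-only: no GPU module will run"
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (9, 0):
        return "Hopper (sm_90): all modules, including WGMMA"
    if (major, minor) >= (8, 0):
        return "Ampere (sm_80): modules 01/02/04/05, CUTLASS without WGMMA"
    return "pre-Ampere: modules 01 and 02"

print(gpu_generation())
```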
## 🎯 Project Overview

This project provides a structured progression through AI system optimization (modules listed in recommended learning order):
| Module | Technology | Difficulty | Hardware | Value |
|---|---|---|---|---|
| 01 TVM | Relay, TensorIR | ⭐⭐⭐ | CUDA GPU | 2.5-2.9x speedup |
| 02 ORT | ONNX Runtime, CUDA | ⭐⭐⭐⭐ | CUDA GPU | Custom ops extension |
| 05 Triton | Triton, Python DSL | ⭐⭐⭐ | CUDA GPU | GPU kernel fusion |
| 03 CUTLASS | CUTLASS 3.x, WGMMA | ⭐⭐⭐⭐⭐ | H100/A100 | Peak GEMM performance |
| 04 CuTile | cuTile, CUDA 13 | ⭐⭐⭐ | Any GPU | Future GPU programming |
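Module 02 extends ONNX Runtime with a compiled CUDA operator. Loading such an op from Python looks roughly like this; the library path and model name are illustrative, not the repo's actual build artifacts:

```python
import onnxruntime as ort

# Register the compiled custom-op shared library before creating the session
so = ort.SessionOptions()
so.register_custom_ops_library("./build/libcustom_gelu.so")  # illustrative path

# The CUDA execution provider dispatches the custom GELU kernel on the GPU
sess = ort.InferenceSession("model_with_custom_gelu.onnx", so,
                            providers=["CUDAExecutionProvider"])
```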
## 📊 Performance

**Key Highlights** — benchmarked on NVIDIA A100-80GB / H100
| Metric | Result | Module |
|---|---|---|
| ResNet50 Speedup | 2.9x | TVM Optimization |
| GEMM Efficiency | 20% of peak (195 TFLOPS) | CUTLASS WGMMA |
| Bandwidth Gain | +44% (230 GB/s) | Custom CUDA GELU |
📈 Detailed performance tables:

**ResNet50 Inference (TVM Optimization)**

| Implementation | Latency (ms) | Speedup | Technique |
|---|---|---|---|
| PyTorch Eager | 10.0 | 1.0x | Baseline |
| TVM Baseline | 8.0 | 1.25x | Direct compile |
| TVM AutoScheduler | 4.0 | 2.5x | Ansor, 1000 trials |
| TensorIR Manual | 3.5 | 2.9x | Expert tuning |
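The "TensorIR Manual" row comes from hand-written schedules. A toy illustration of the idea, assuming recent TVM's TVMScript and `tir.Schedule` APIs; the workload here is a trivial elementwise add, not the actual ResNet50 schedule:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def add_one(A: T.Buffer((1024,), "float32"), B: T.Buffer((1024,), "float32")):
    for i in range(1024):
        with T.block("add"):
            vi = T.axis.spatial(1024, i)
            B[vi] = A[vi] + 1.0

# Manually schedule: split the loop and bind it to the GPU thread hierarchy
sch = tvm.tir.Schedule(add_one)
(i,) = sch.get_loops(sch.get_block("add"))
outer, inner = sch.split(i, factors=[None, 128])
sch.bind(outer, "blockIdx.x")
sch.bind(inner, "threadIdx.x")
lib = tvm.build(sch.mod, target="cuda")
```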
**GEMM Throughput (CUTLASS WGMMA, H100)**

| Implementation | TFLOPS | Efficiency | Notes |
|---|---|---|---|
| cuBLAS | 170 | 17% | Reference |
| CUTLASS WGMMA | 195 | 20% | Hopper optimized |
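How the Efficiency column is derived, assuming the dense FP16 peak of an H100 SXM (~989 TFLOPS; the table's 17%/20% figures are consistent with that assumption):

```python
def gemm_tflops(m: int, n: int, k: int, latency_s: float) -> float:
    # A GEMM performs 2*M*N*K floating-point operations (multiply + add)
    return 2 * m * n * k / latency_s / 1e12

H100_FP16_PEAK_TFLOPS = 989.0  # assumed dense (non-sparse) FP16 peak, H100 SXM

for name, tflops in [("cuBLAS", 170.0), ("CUTLASS WGMMA", 195.0)]:
    print(f"{name}: {tflops / H100_FP16_PEAK_TFLOPS:.0%} of peak")
# cuBLAS: 17% of peak
# CUTLASS WGMMA: 20% of peak
```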
**GELU Kernel (Custom CUDA vs. PyTorch)**

| Implementation | Latency (μs) | Bandwidth (GB/s) | Improvement |
|---|---|---|---|
| PyTorch GELU | 50 | 160 | Baseline |
| Custom CUDA | 35 | 230 | +44% bandwidth |
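Module 02's kernel is CUDA C++; for comparison, the same single-pass, bandwidth-bound fusion can be sketched in Triton (Module 05's DSL). This is a hedged illustration using the sigmoid approximation of GELU, not the repo's implementation:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gelu_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # Sigmoid approximation of GELU, fused into one global-memory pass
    y = x * tl.sigmoid(1.702 * x)
    tl.store(y_ptr + offsets, y, mask=mask)

x = torch.randn(1 << 20, device="cuda")
y = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
gelu_kernel[grid](x, y, x.numel(), BLOCK_SIZE=1024)
```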
## 🚀 Quick Start

Choose your installation method:
**Option 1: Docker (recommended)**

```bash
# CUDA 12.x environment
docker build -f docker/Dockerfile.cuda12 -t ai-opt:cuda12 .
docker run --gpus all -it -v $(pwd):/workspace ai-opt:cuda12

# Run first example
python 01_TVM_End2End_Optimization/1_import_and_baseline.py
```

**Option 2: Local installation**

```bash
# Clone repository
git clone https://github.com/LessUp/ai-system-optimization-series.git
cd ai-system-optimization-series
# Setup Python environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -e ".[dev]" # Development only
# Or: pip install -e ".[all]" # Full dependencies
# Verify installation
python -c "import tvm; print('TVM:', tvm.__version__)"
make test
```

💡 Want to contribute? Check out our Contributing Guide for development setup, code standards, and PR process.
## 📁 Project Structure

```text
ai-system-optimization-series/
├── 01_TVM_End2End_Optimization/ # Compiler optimization (TVM)
│ ├── src/tvm_optimization/ # Python package
│ │ ├── baseline.py
│ │ ├── auto_scheduler.py
│ │ └── tensorir_schedule.py
│ └── tests/
├── 02_ORT_Custom_CUDA_Op/ # Runtime extension (ONNX Runtime)
│ ├── src/ # C++/CUDA source
│ ├── python/ # Python bindings
│ └── tests/
├── 03_CUTLASS_Hopper_GEMM/ # High-performance GEMM (CUTLASS)
│ ├── include/ # Headers
│ ├── src/ # CUDA source
│ └── tests/
├── 04_CuTile_NextGen_CUDA/ # Future GPU programming (cuTile)
│ ├── src/cutile_cuda/ # Python package
│ └── tests/
├── common/ # Shared infrastructure
│ ├── benchmark/ # Performance testing framework
│ └── utils/ # Model loading, config utilities
├── configs/ # Reproducible configurations
├── docker/ # Docker environments
├── docs/ # Documentation (bilingual)
│ ├── en/ # English documentation
│ ├── zh/ # Chinese documentation
│ └── modules/ # Technical module docs
├── scripts/ # Automation scripts
├── Makefile # Unified commands
└── pyproject.toml                   # Unified dependency management
```
## 📚 Documentation

**English documentation**

| Document | Description | Time |
|---|---|---|
| Quick Start | Setup and first run | 10 min |
| Prerequisites | Detailed installation | 20 min |
| Architecture | Module organization | 15 min |
| Learning Path | Study roadmap | 10 min |
| API Reference | Function documentation | 30 min |
| Performance Tuning | Optimization guide | 25 min |
| Troubleshooting | Common issues | 10 min |
**Chinese documentation (中文文档)**

| Document | Description | Time |
|---|---|---|
| 快速开始 (Quick Start) | Environment setup and first run | 10 min |
| 前置要求 (Prerequisites) | Detailed installation guide | 20 min |
| 项目架构 (Architecture) | Module organization | 15 min |
| 学习路线 (Learning Path) | Study roadmap | 10 min |
| API 参考 (API Reference) | Function documentation | 30 min |
| 性能调优 (Performance Tuning) | Optimization guide | 25 min |
## 🏗️ Architecture

```mermaid
flowchart TB
subgraph Modules["📦 Optimization Modules"]
TVM["🔧 TVM Compiler<br/>ResNet50 2.9x Speedup"]
ORT["🧩 ONNX Runtime<br/>Custom CUDA Ops"]
CUTLASS["⚡ CUTLASS 3.x<br/>H100 GEMM 195 TFLOPS"]
CUTIL["🔮 cuTile<br/>Future GPU Programming"]
end
subgraph Infra["🛠️ Shared Infrastructure"]
BENCH["📊 Unified Benchmarking"]
UTILS["🔧 Common Utilities"]
CONF["⚙️ Reproducible Configs"]
end
TVM --> Infra
ORT --> Infra
CUTLASS --> Infra
CUTIL --> Infra
```
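All four modules report numbers through the shared benchmarking layer. Below is a hypothetical sketch of the core timing loop such a harness needs; the function and parameter names are illustrative, not the actual `common/benchmark` API:

```python
import torch

def time_gpu(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Return average milliseconds per call, timed with CUDA events."""
    # Warm up to exclude JIT compilation and cache effects from the measurement
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()  # wait for the end event before reading the timer
    return start.elapsed_time(end) / iters
```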
## 🗺️ Roadmap

| Phase | Feature | Target | Status |
|---|---|---|---|
| 2024 Q1 | CUDA 13.x Support | Full compatibility | ✅ Complete |
| 2024 Q2 | Module 05: Triton Kernels | GPU kernel fusion | 🚧 In Progress |
| 2024 Q3 | ROCm Support | AMD GPU compatibility | 📅 Planned |
| 2024 Q4 | Auto-Tuning Pipeline | Cross-module optimization | 📅 Planned |
See Issues for detailed tracking.
## 🧪 Testing

```bash
# Run all tests (hardware-dependent tests auto-skip)
make test
# Run specific module tests
make test-tvm
make test-ort
make test-cutlass
make test-cutile
# Run with coverage
pytest --cov=common --cov-report=html
# Run benchmarks
make benchmark
# Or: bash scripts/run_all_benchmarks.sh
```

## 🤝 Contributing

We welcome contributions! See the Contributing Guide for:
- Development setup
- Code standards
- Testing requirements
- PR process
## 📜 License

MIT License - see LICENSE for details.
## 🙏 Acknowledgments

This project builds upon excellent open-source work: