cuTile learning

Tutorials

Benchmark

All benchmarks were run with Torch 2.9.1, Triton 3.5.1, cuTile (cuda-tile) 1.0.0, and tileiras, using CUDA compilation tools 13.1 (V13.1.80).
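To reproduce these numbers it helps to confirm that the local environment matches the versions listed above. The following is a minimal sketch, not part of the repository: it assumes the packages are installed under the distribution names `torch`, `triton`, and `cuda-tile`, and the helper names `installed_version` and `compare_versions` are hypothetical. It does not check tileiras or the CUDA compiler version.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported for the benchmarks above. The distribution names
# ("torch", "triton", "cuda-tile") are assumptions about how the
# packages are published, not something this README confirms.
EXPECTED = {"torch": "2.9.1", "triton": "3.5.1", "cuda-tile": "1.0.0"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def compare_versions(expected):
    """Map each package name to (expected, installed, matches)."""
    report = {}
    for name, want in expected.items():
        have = installed_version(name)
        # Local build suffixes such as "+cu130" are ignored in the comparison.
        matches = have is not None and have.split("+")[0] == want
        report[name] = (want, have, matches)
    return report


if __name__ == "__main__":
    for name, (want, have, ok) in compare_versions(EXPECTED).items():
        status = "ok" if ok else "differs"
        print(f"{name}: expected {want}, installed {have or 'not installed'} ({status})")
```

A mismatch does not make a run invalid, but results from a different Torch, Triton, or cuTile version may not be directly comparable with the plots below.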

So far I only have results from an RTX 5090 (sm_120); the data is in benchmark/5090. Contributions from Blackwell B200 (sm_100) users are very welcome!

5090 Transformers Inference

These results use NVIDIA/TileGym/tree/main/modeling/transformers; the profile data is in the profile-data repository.

[Figure: Transformers Inference]

5090 attention fwd

[Figure: 5090 attention]

5090 softmax

[Figure: softmax performance]

5090 layer norm

[Figure: 5090 layer norm]

5090 matmul

[Figure: 5090 matmul]

My Zhihu articles

What do you think of cuTile? (BobHuang's answer)

A brief analysis of the cuTile execution flow

Documents

GitHub repositories

NVIDIA/cutile-python

NVIDIA/TileGym

YouTube videos

Deep Dive: How to Use cuTile Python

THE FUTURE IS TILED: using cuTile and CUDA Tile IR to write portable, high-performance GPU Kernels