
cuTile learning

Tutorials

Benchmark

All benchmarks were run with Torch 2.9.1, Triton 3.5.1, cuTile (cuda-tile) 1.0.0, and tileiras, using CUDA compilation tools 13.1 (V13.1.80).

Currently, I only have results from an RTX 5090 (sm_120); the data is in benchmark/5090. Contributions from Blackwell B200 (sm_100) users are very welcome!

5090 Transformers Inference

These runs use NVIDIA/TileGym/tree/main/modeling/transformers; the profile data is in the profile-data repository.

5090 attention fwd

5090 softmax

5090 layer norm

5090 matmul
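
When reading matmul timings, it can help to convert a measured kernel time into throughput. The following is a minimal sketch, assuming the usual 2·M·N·K FLOP count for an (M×K) by (K×N) product. The function name `matmul_tflops` and the sample shape and time are hypothetical and are not taken from the benchmark data.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOP/s for an (m x k) @ (k x n) matmul.

    Assumes 2 * m * n * k floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * m * n * k
    return flops / seconds / 1e12


if __name__ == "__main__":
    # Hypothetical numbers: a 4096 x 4096 x 4096 matmul measured at 1 ms.
    print(f"{matmul_tflops(4096, 4096, 4096, 1e-3):.1f} TFLOP/s")
```

With these made-up inputs the result is roughly 137 TFLOP/s. Comparing that figure against the GPU's peak for the data type in use gives a quick sense of how close a kernel is to the hardware limit.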

My Zhihu articles

How would you evaluate cuTile? (BobHuang's answer)

A brief analysis of the cuTile execution flow

Documents

GitHub repositories

NVIDIA/cutile-python

YouTube videos