[MetaSchedule] Introduce MMA Tensor Core Multilevel Tiling#14673
Hzfengsy merged 12 commits into apache:main
Conversation
TensorIntrin.register("m16n8k8_sync", m16n8k8_sync_desc, m16n8k8_sync_impl)
TensorIntrin.register(
    "m16n8k8_store_C_row_major", m16n8k8_store_C_row_major_desc, m16n8k8_store_C_row_major_impl
)
Why can't the existing intrinsic definitions for m16n16k16 be used?
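One practical difference between the m16n8k8 `mma.sync` shape and the wmma-style m16n16k16 shape is how many elements of each operand a thread holds in registers. The small calculation below is illustrative only (the helper name is made up); it assumes a 32-thread warp with fragments distributed evenly across threads:

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def fragment_elems_per_thread(m, n, k):
    """Elements of each operand held per thread for an m x n x k warp-level MMA."""
    return {
        "A": m * k // WARP_SIZE,  # A fragment is m x k
        "B": k * n // WARP_SIZE,  # B fragment is k x n
        "C": m * n // WARP_SIZE,  # C accumulator is m x n
    }

# PTX mma.sync m16n8k8 (this PR) vs the wmma-style m16n16k16 shape
print(fragment_elems_per_thread(16, 8, 8))    # {'A': 4, 'B': 2, 'C': 4}
print(fragment_elems_per_thread(16, 16, 16))  # {'A': 8, 'B': 8, 'C': 8}
```

The smaller m16n8k8 tile gives the scheduler finer-grained fragments to work with, which is one reason a separate set of intrinsic definitions is needed.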
How can I reproduce it? I would like to test it on A100. @cblmemo
You can find all the scripts you need at https://github.com/cblmemo/TVMGemmAsync/tree/main/mma 🫡 @FrozenGene
@cblmemo Small bug: https://github.com/cblmemo/TVMGemmAsync/blob/main/mma/GemmRuleGenerate.py#L135 should be
@cblmemo Great job! I have tested it on A100 with 1024x1024x1024; it achieves the same level of performance as CUTLASS. If we add more MMA variants, we could likely achieve even better results.
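As a sanity check on benchmark numbers like these, achieved throughput can be derived from the GEMM flop count, 2·M·N·K. The helper below is an illustrative sketch (the function name and the 0.05 ms timing are made up, not measurements from this PR):

```python
def gemm_tflops(m, n, k, time_ms):
    # A GEMM performs 2*M*N*K floating-point operations
    # (one multiply and one add per inner-product term).
    flops = 2 * m * n * k
    return flops / (time_ms * 1e-3) / 1e12

# e.g. a hypothetical 0.05 ms run of the 1024^3 benchmark
print(round(gemm_tflops(1024, 1024, 1024, 0.05), 2))  # 42.95
```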
@FrozenGene Thanks for pointing that out! I use a lot of directory names and it appears that I made a mistake when uploading my script 🙂
@FrozenGene Sure. The m16n8k16 and fp32-accumulator variants are WIP now 🧐
@cblmemo Sounds great! Also consider supporting signed (and unsigned) i8 * i8 -> i32 (accumulator) (m16n8k32 / m8n8k16), which is commonly used in quantized models. If we have this, we could do more interesting benchmarks!
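For reference, the i8 * i8 -> i32 semantics requested here can be expressed in NumPy (an illustrative reference check only, not the hardware instruction; shapes follow the m16n8k32 tile):

```python
import numpy as np

# Reference semantics of an i8 x i8 -> i32 warp-level MMA (m16n8k32 shape):
# multiply int8 operands, accumulate in int32 so the sum cannot overflow.
m, n, k = 16, 8, 32
rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(m, k), dtype=np.int8)
b = rng.integers(-128, 128, size=(k, n), dtype=np.int8)
c = a.astype(np.int32) @ b.astype(np.int32)  # int32 accumulator
assert c.dtype == np.int32 and c.shape == (m, n)
```

With k = 32 the worst-case partial sum is 32 * 128 * 128 = 524288, well within int32, which is why the int32 accumulator is safe here.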
Please take another look if you are interested @spectrometerHBH @masahi @vinx13 @FrozenGene @junrushao
Co-authored-by: Siyuan Feng <Hzfengsy@sjtu.edu.cn>
Thanks @cblmemo for the excellent work, together with the reviews from @spectrometerHBH and @FrozenGene!
This PR introduces MMA support into the multi-level tiling Tensor Core schedule rule.
For benchmark results, please refer to https://docs.google.com/spreadsheets/d/1thf1jsbX87WokRfESXO14fx40H3vYHDk6EWkb_wnv5Y
For all tuning logs, best-performance scripts, and Python tuning & benchmarking scripts, please refer to https://github.com/cblmemo/TVMGemmAsync/tree/main/mma