
[MetaSchedule] Introduce MMA Tensor Core Multilevel Tiling #14673

Merged
Hzfengsy merged 12 commits into apache:main from cblmemo:mma-auto on Jun 28, 2023

Conversation

cblmemo (Contributor) commented Apr 19, 2023

This PR introduces MMA intrinsics into the multilevel tiling tensor core schedule rule.

For the benchmark results, please refer to https://docs.google.com/spreadsheets/d/1thf1jsbX87WokRfESXO14fx40H3vYHDk6EWkb_wnv5Y

For all tuning logs, the best-performing schedule scripts, and the Python tuning & benchmarking scripts, please refer to https://github.com/cblmemo/TVMGemmAsync/tree/main/mma
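
As a rough guide, a tuning run that exercises this rule might look like the sketch below; the workload, shapes, trial budget, and work directory are illustrative, and it assumes the new MMA rule participates in the default CUDA tensor core search space.

import tvm
from tvm import te, meta_schedule as ms

# Hypothetical fp16 GEMM workload; names and shapes are illustrative.
def matmul_fp16(n, m, k):
    A = te.placeholder((n, k), name="A", dtype="float16")
    B = te.placeholder((k, m), name="B", dtype="float16")
    r = te.reduce_axis((0, k), name="r")
    C = te.compute((n, m), lambda i, j: te.sum(A[i, r] * B[r, j], axis=r), name="C")
    return te.create_prim_func([A, B, C])

mod = matmul_fp16(1024, 1024, 1024)
target = tvm.target.Target("nvidia/nvidia-a100")

# Search, then compile the best record found for this workload.
database = ms.tune_tir(mod=mod, target=target, work_dir="./tune_mma", max_trials_global=1000)
sch = ms.tir_integration.compile_tir(database, mod, target)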

tvm-bot (Collaborator) commented Apr 19, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

# The *_desc / *_impl TIR functions are defined alongside these registrations (elided here).
TensorIntrin.register("m16n8k8_sync", m16n8k8_sync_desc, m16n8k8_sync_impl)
TensorIntrin.register(
    "m16n8k8_store_C_row_major", m16n8k8_store_C_row_major_desc, m16n8k8_store_C_row_major_impl
)
Member commented:
Why can't the existing intrinsic definitions for m16n16k16 be used?
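
For context, names registered via TensorIntrin.register, for any mma shape, are later consumed by schedule primitives such as sch.tensorize. A minimal sketch, assuming a hypothetical tir.Schedule over a matmul already tiled down to a 16x8x8 fragment; the block name and loop structure are not from this PR's tests:

from tvm import tir

# Assume `sch` is a tir.Schedule over a matmul whose innermost loops already
# form a 16x8x8 fragment (hypothetical setup).
def tensorize_fragment(sch: tir.Schedule) -> None:
    block = sch.get_block("C")  # "C" is a hypothetical block name
    i16, j8, k8 = sch.get_loops(block)[-3:]
    # Apply the intrinsic registered above to the 16x8x8 tile.
    sch.tensorize(i16, "m16n8k8_sync")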

cblmemo marked this pull request as ready for review on Jun 1, 2023
cblmemo changed the title from "[WIP][MetaSchedule] Introduce MMA Tensor Core Multilevel Tiling" to "[MetaSchedule] Introduce MMA Tensor Core Multilevel Tiling" on Jun 1, 2023
FrozenGene (Member) commented
How can I reproduce it? I would like to test it on an A100. @cblmemo

cblmemo (Contributor, Author) commented Jun 2, 2023

> How can I reproduce it? I would like to test it on an A100. @cblmemo

You can find all the scripts you need at https://github.com/cblmemo/TVMGemmAsync/tree/main/mma 🫡 @FrozenGene

FrozenGene (Member) commented

> You can find all the scripts you need at https://github.com/cblmemo/TVMGemmAsync/tree/main/mma 🫡 @FrozenGene

@cblmemo Small bug: https://github.com/cblmemo/TVMGemmAsync/blob/main/mma/GemmRuleGenerate.py#L135 should be tensorcore_outputs, not outputs.

FrozenGene (Member) commented

@cblmemo Great job! I have tested it on an A100 with 1024x1024x1024; it achieves the same performance level as CUTLASS. If we could add more MMA variants, maybe we could achieve even better results.

cblmemo (Contributor, Author) commented Jun 3, 2023

> @cblmemo Small bug: https://github.com/cblmemo/TVMGemmAsync/blob/main/mma/GemmRuleGenerate.py#L135 should be tensorcore_outputs, not outputs.

@FrozenGene Thanks for pointing that out! I use a lot of directory names, and it appears that I made a mistake when uploading my script 🙂

cblmemo (Contributor, Author) commented Jun 3, 2023

> @cblmemo Great job! I have tested it on an A100 with 1024x1024x1024; it achieves the same performance level as CUTLASS. If we could add more MMA variants, maybe we could achieve even better results.

@FrozenGene Sure. m16n8k16 and the fp32 accumulator are WIP now 🧐

FrozenGene (Member) commented

> @FrozenGene Sure. m16n8k16 and the fp32 accumulator are WIP now 🧐

@cblmemo Sounds great! Also consider supporting signed (and unsigned) i8 * i8 -> i32 (accumulator) (m16n8k32/m8n8k16), which is commonly used in quantized models. If we have this, we could run more interesting benchmarks!
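
For reference, the s8 * s8 -> s32 workload suggested here can be written in TE roughly as below; this is a sketch with illustrative names and shapes, and the int32 accumulation matches the m16n8k32 mma shape mentioned above.

from tvm import te

n = m = k = 1024  # illustrative shapes

A = te.placeholder((n, k), name="A", dtype="int8")
B = te.placeholder((k, m), name="B", dtype="int8")
r = te.reduce_axis((0, k), name="r")
# s8 * s8 multiply with an s32 accumulator, as used in quantized models.
C = te.compute(
    (n, m),
    lambda i, j: te.sum(A[i, r].astype("int32") * B[r, j].astype("int32"), axis=r),
    name="C",
)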

Review comment threads:
- include/tvm/tir/transform.h (outdated)
- python/tvm/tir/schedule/schedule.py
- python/tvm/tir/tensor_intrin/cuda.py (outdated)
- src/driver/driver_api.cc
cblmemo force-pushed the mma-auto branch 2 times, most recently from 996063c to 8ad16d9 on Jun 25, 2023
Hzfengsy (Member) left a review:

LGTM

Review comment threads:
- src/tir/transforms/inject_permuted_layout.cc (outdated, 2 threads)
Hzfengsy (Member) commented

Please take another look if you are interested: @spectrometerHBH @masahi @vinx13 @FrozenGene @junrushao

cblmemo and others added 5 commits on Jun 26, 2023
Hzfengsy merged commit c8f5595 into apache:main on Jun 28, 2023
Hzfengsy (Member) commented

Thanks @cblmemo for the excellent work, and thanks to @spectrometerHBH and @FrozenGene for the reviews!
