Merged
Conversation
lvhan028 reviewed on Mar 4, 2026
```diff
 set_property(TARGET attention PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
 target_compile_options(attention PRIVATE -O3
-    $<$<COMPILE_LANGUAGE:CUDA>:-use_fast_math --expt-relaxed-constexpr>)
+    $<$<COMPILE_LANGUAGE:CUDA>:-use_fast_math --expt-relaxed-constexpr -Xptxas=-v --threads 16>)
```
Collaborator
Cool! Now the compilation time of target kv_cache_utils_v2 dropped from 600+ s to 110+ s.
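For context, the speedup most likely comes from nvcc's `--threads` option, which allows parallel compilation jobs within a single nvcc invocation; `-Xptxas=-v` additionally makes ptxas report per-kernel register and spill counts, which is useful for verifying the retuned hyperparameters. A sketch of the flag change shown in the diff above:

```cmake
# Sketch of the change in src/turbomind/kernels/attention/CMakeLists.txt
# (reconstructed from the review diff, not copied from the merged file):
#   -Xptxas=-v    -> print per-kernel register/spill statistics
#   --threads 16  -> allow nvcc up to 16 parallel compilation jobs
target_compile_options(attention PRIVATE -O3
    $<$<COMPILE_LANGUAGE:CUDA>:-use_fast_math --expt-relaxed-constexpr -Xptxas=-v --threads 16>)
```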
Contributor
Pull request overview
This PR speeds up Turbomind attention/decoding for MLA (HeadDim=576) by adding an Sm80 cp.async MLA mainloop, relaxing CTA/WARP head tiling constraints, and retuning kernel hyperparameters to reduce register pressure/spills.
Changes:
- Add an Sm80 MLA mainloop path (selected via Impl::MLA) and introduce 576-specific decoding/attention configs.
- Generalize warp mapping/tiling to allow CTA_H to be a multiple of WARP_H, updating thread maps and shared-memory indexing accordingly.
- Retune KV-cache processing/flattening for HeadDim=576 (e.g., smaller CTA_S) and adjust build flags for CUDA compilation.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/turbomind/kernels/core/thread_map.h | Extends RakedThreadMap to support configurable warp partitioning and uses cdiv() for iteration counts. |
| src/turbomind/kernels/attention/mainloop_sm80.h | Adds MLA-specific Sm80 cp.async mainloop and routes selection via a new Impl::MLA boolean. |
| src/turbomind/kernels/attention/kv_cache_utils_v2.cu | Adjusts CTA shape for HeadDim=576 (share-KV) and updates checks/grid math. |
| src/turbomind/kernels/attention/impl_simt.h | Allows multi-warp head tiling (kWarpCntH>1) with updated warp-id mapping and sync behavior; adds MLA flag. |
| src/turbomind/kernels/attention/impl_884.h | Adds Impl::MLA = false for compatibility with Sm80 mainloop dispatch. |
| src/turbomind/kernels/attention/impl_81616.h | Updates warp-counting/shared-memory layout for multi-warp head tiling; adds MLA flag and warp-id helper. |
| src/turbomind/kernels/attention/impl_1688.h | Adds Impl::MLA flag for Sm75 tensorcore implementation. |
| src/turbomind/kernels/attention/impl_16816.h | Adds Impl::MLA flag for Sm80 tensorcore implementation. |
| src/turbomind/kernels/attention/decoding_config.h | Adds HeadDim=576 specializations and modifies Sm80 dispatch to avoid generic paths for 576. |
| src/turbomind/kernels/attention/codegen/*.cu | Removes some unused decoding instantiations for 576 variants (primarily Qh != 8 cases). |
| src/turbomind/kernels/attention/attention_config.h | Adds linear-cache attention configs for HeadDim=576 on Sm80 and Sm75. |
| src/turbomind/kernels/attention/CMakeLists.txt | Adds CUDA compile flags (-Xptxas=-v --threads 16) for the attention target. |
lvhan028 approved these changes on Mar 5, 2026
minor fix Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>