
Faster MLA kernels#4391

Merged
lvhan028 merged 3 commits into InternLM:main from lzhangzz:faster-mla
Mar 5, 2026

Conversation

@lzhangzz
Collaborator

@lzhangzz lzhangzz commented Mar 3, 2026

  • Add MLA mainloop for sm_80
  • Allow CTA_H to be multiple of WARP_H
  • Adjust hyper-parameters to lower register-spilling

set_property(TARGET attention PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
target_compile_options(attention PRIVATE -O3
-  $<$<COMPILE_LANGUAGE:CUDA>:-use_fast_math --expt-relaxed-constexpr>)
+  $<$<COMPILE_LANGUAGE:CUDA>:-use_fast_math --expt-relaxed-constexpr -Xptxas=-v --threads 16>)
Collaborator


Cool! Now the compilation time of target kv_cache_utils_v2 dropped from 600+s to 110+s

Contributor

Copilot AI left a comment


Pull request overview

This PR speeds up Turbomind attention/decoding for MLA (HeadDim=576) by adding an Sm80 cp.async MLA mainloop, relaxing CTA/WARP head tiling constraints, and retuning kernel hyperparameters to reduce register pressure/spills.

Changes:

  • Add an Sm80 MLA mainloop path (selected via Impl::MLA) and introduce 576-specific decoding/attention configs.
  • Generalize warp mapping/tiling to allow CTA_H to be a multiple of WARP_H, updating thread maps and shared-memory indexing accordingly.
  • Retune KV-cache processing/flattening for HeadDim=576 (e.g., smaller CTA_S) and adjust build flags for CUDA compilation.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.

Summary per file:

  • src/turbomind/kernels/core/thread_map.h: Extends RakedThreadMap to support configurable warp partitioning and uses cdiv() for iteration counts.
  • src/turbomind/kernels/attention/mainloop_sm80.h: Adds MLA-specific Sm80 cp.async mainloop and routes selection via a new Impl::MLA boolean.
  • src/turbomind/kernels/attention/kv_cache_utils_v2.cu: Adjusts CTA shape for HeadDim=576 (share-KV) and updates checks/grid math.
  • src/turbomind/kernels/attention/impl_simt.h: Allows multi-warp head tiling (kWarpCntH>1) with updated warp-id mapping and sync behavior; adds MLA flag.
  • src/turbomind/kernels/attention/impl_884.h: Adds Impl::MLA = false for compatibility with Sm80 mainloop dispatch.
  • src/turbomind/kernels/attention/impl_81616.h: Updates warp-counting/shared-memory layout for multi-warp head tiling; adds MLA flag and warp-id helper.
  • src/turbomind/kernels/attention/impl_1688.h: Adds Impl::MLA flag for the Sm75 tensorcore implementation.
  • src/turbomind/kernels/attention/impl_16816.h: Adds Impl::MLA flag for the Sm80 tensorcore implementation.
  • src/turbomind/kernels/attention/decoding_config.h: Adds HeadDim=576 specializations and modifies Sm80 dispatch to avoid generic paths for 576.
  • src/turbomind/kernels/attention/codegen/*.cu: Removes some unused decoding instantiations for 576 variants (primarily Qh != 8 cases).
  • src/turbomind/kernels/attention/attention_config.h: Adds linear-cache attention configs for HeadDim=576 on Sm80 and Sm75.
  • src/turbomind/kernels/attention/CMakeLists.txt: Adds CUDA compile flags (-Xptxas=-v --threads 16) for the attention target.


minor fix

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@lvhan028 lvhan028 merged commit 035dd4e into InternLM:main Mar 5, 2026
1 of 9 checks passed