Merged
Pull request overview
This PR refactors the attention mechanism to align the custom kernel interface with FlashAttention (FA) and improves FA3 scheduler management. The changes rename core functions, standardize parameter names, and reorganize the attention implementation into a modular structure.
Key changes:
- Renamed `paged_attention_fwd` to `flash_attn_with_kvcache` and `flash_attention_fwd` to `flash_attn_varlen_func` to align with FA naming conventions
- Standardized parameter names: `kv_seqlens` → `cache_seqlens`, `sm_scale` → `softmax_scale`, `logit_softcapping` → `softcap`, `max_seqlen` → `max_seqlen_q` (a migration sketch follows this list)
- Changed the default value of `logit_softcapping` from `None` to `0.0` across attention implementations
- Refactored attention backends into a modular structure with `default.py`, `fa3.py`, and `mla.py` implementations
- Improved FA3 scheduler metadata management by extracting a `_get_meta_flashattn` function and updating buffer handling in cudagraph
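For call sites being updated, the rename map above is mechanical. A minimal migration sketch: the name mapping comes from this PR, but the `migrate_kwargs` helper is hypothetical and not part of lmdeploy.

```python
# Parameter renames introduced by this PR (old name -> FA-aligned name).
_RENAMED = {
    "kv_seqlens": "cache_seqlens",
    "sm_scale": "softmax_scale",
    "logit_softcapping": "softcap",
    "max_seqlen": "max_seqlen_q",
}

def migrate_kwargs(old_kwargs: dict) -> dict:
    """Translate pre-refactor keyword arguments to the FA-aligned names."""
    return {_RENAMED.get(k, k): v for k, v in old_kwargs.items()}

assert migrate_kwargs({"kv_seqlens": [32], "sm_scale": 0.125}) == {
    "cache_seqlens": [32],
    "softmax_scale": 0.125,
}
```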
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| tests/pytorch/kernel/test_paged_attention.py | Updated tests to use new flash_attn_with_kvcache interface and renamed parameters |
| tests/pytorch/kernel/test_flash_attention.py | Updated tests to use flash_attn_varlen_func with new parameter names and cu_seqlens fixtures |
| tests/pytorch/kernel/test_fill_kv_cache.py | Added initialization fixture for FP8 test class |
| lmdeploy/pytorch/nn/attention.py | Changed logit_softcapping default from None to 0.0 |
| lmdeploy/pytorch/models/utils/cudagraph.py | Moved _get_meta_flashattn function and improved FA3 scheduler metadata buffer management |
| lmdeploy/pytorch/kernels/cuda/pagedattention.py | Renamed function to flash_attn_with_kvcache, aligned parameters with FA interface, returns output tensor |
| lmdeploy/pytorch/kernels/cuda/flashattention.py | Renamed to flash_attn_varlen_func, added cu_seqlens support, standardized parameter names |
| `lmdeploy/pytorch/kernels/cuda/__init__.py` | Updated exports to reflect renamed functions |
| lmdeploy/pytorch/backends/cuda/op_backend.py | Removed _get_meta_flashattn (moved to cudagraph), updated metadata construction |
| lmdeploy/pytorch/backends/cuda/flash_attention.py | Updated to use renamed function and parameter names |
| lmdeploy/pytorch/backends/cuda/attention/mla.py | New file: FlashMLA attention implementation extracted from monolithic attention.py |
| lmdeploy/pytorch/backends/cuda/attention/fa3.py | New file: FA3 attention implementation extracted from monolithic attention.py |
| lmdeploy/pytorch/backends/cuda/attention/default.py | New file: Default Triton attention implementation and metadata |
| `lmdeploy/pytorch/backends/cuda/attention/__init__.py` | New file: Attention builder with automatic selection between implementations |
| lmdeploy/pytorch/backends/cuda/attention.py | Deleted: Split into modular structure |
| lmdeploy/pytorch/backends/attention.py | Changed logical_softcapping default from None to 0.0 |
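For context on the `None` → `0.0` default change: the FlashAttention convention treats `softcap=0.0` as disabled, which lets `0.0` replace the old `None` sentinel without changing behavior. A minimal reference of that semantic, assuming the FA convention (this is not code from the PR):

```python
import torch

def apply_softcap(scores: torch.Tensor, softcap: float) -> torch.Tensor:
    # FA convention: softcap == 0.0 means "no capping", so 0.0 can serve
    # as the default instead of None.
    if softcap > 0.0:
        return softcap * torch.tanh(scores / softcap)
    return scores

scores = torch.randn(2, 8, 16, 16)
assert torch.equal(apply_softcap(scores, 0.0), scores)   # disabled: identity
assert apply_softcap(scores, 30.0).abs().max() <= 30.0   # logits bounded by cap
```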
lvhan028 approved these changes on Dec 31, 2025
RunningLeon reviewed on Dec 31, 2025
Align the interface of our custom kernels with FA.
Better FA3 scheduler management.
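For illustration, the new `attention/` package implies a builder that picks one implementation depending on platform and feature support. A hypothetical sketch of that selection; the actual classes and conditions in `lmdeploy/pytorch/backends/cuda/attention/__init__.py` may differ.

```python
from dataclasses import dataclass

@dataclass
class AttentionSetup:
    # Hypothetical feature flags standing in for the real capability checks.
    use_flash_mla: bool = False
    use_fa3: bool = False

def select_attention_impl(setup: AttentionSetup) -> str:
    # Prefer specialized kernels when enabled; fall back to Triton.
    if setup.use_flash_mla:
        return "mla.py"      # FlashMLA implementation
    if setup.use_fa3:
        return "fa3.py"      # FlashAttention-3 implementation
    return "default.py"      # default Triton implementation

assert select_attention_impl(AttentionSetup(use_fa3=True)) == "fa3.py"
assert select_attention_impl(AttentionSetup()) == "default.py"
```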