Refactor attn #4238

Merged
lvhan028 merged 18 commits into InternLM:main from grimoire:refactor-attn
Dec 31, 2025

Conversation

@grimoire
Collaborator

Align the interface of our custom kernels with FlashAttention (FA).
Better FA3 scheduler management.

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the attention mechanism to align the custom kernel interface with FlashAttention (FA) and improves FA3 scheduler management. The changes rename core functions, standardize parameter names, and reorganize the attention implementation into a modular structure.

Key changes:

  • Renamed paged_attention_fwd to flash_attn_with_kvcache and flash_attention_fwd to flash_attn_varlen_func to align with FA naming conventions
  • Standardized parameter names: kv_seqlens → cache_seqlens, sm_scale → softmax_scale, logit_softcapping → softcap, max_seqlen → max_seqlen_q (see the sketch after this list)
  • Changed default value of logit_softcapping from None to 0.0 across attention implementations
  • Refactored attention backends into modular structure with default.py, fa3.py, and mla.py implementations
  • Improved FA3 scheduler metadata management by extracting _get_meta_flashattn function and updating buffer handling in cudagraph
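
The two renamed entry points and their parameters can be pictured as below. This is a hedged sketch of the FA-style calling convention described above, not the actual lmdeploy signatures; the tensor shapes, the exact set of keyword arguments, and the wrapper bodies are assumptions for illustration only.

```python
# Hypothetical sketch: these wrappers only mirror the FA-style names and
# parameters described in this PR; the real lmdeploy kernel signatures may differ.
from typing import Optional

import torch


def flash_attn_with_kvcache(
    q: torch.Tensor,              # query tokens, e.g. [num_tokens, num_heads, head_dim]
    k_cache: torch.Tensor,        # paged key cache
    v_cache: torch.Tensor,        # paged value cache
    cache_seqlens: torch.Tensor,  # was kv_seqlens: cached length per sequence
    *,
    softmax_scale: Optional[float] = None,  # was sm_scale
    softcap: float = 0.0,                   # was logit_softcapping=None; 0.0 means disabled
) -> torch.Tensor:
    """Attention against a paged KV cache (formerly paged_attention_fwd)."""
    out = torch.empty_like(q)
    # ... kernel launch elided in this sketch ...
    return out  # the refactored kernel returns the output tensor


def flash_attn_varlen_func(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens_q: torch.Tensor,  # cumulative query lengths, shape [batch + 1]
    cu_seqlens_k: torch.Tensor,  # cumulative key lengths, shape [batch + 1]
    max_seqlen_q: int,           # was max_seqlen
    *,
    softmax_scale: Optional[float] = None,
    softcap: float = 0.0,
) -> torch.Tensor:
    """Varlen (prefill) attention (formerly flash_attention_fwd)."""
    out = torch.empty_like(q)
    # ... kernel launch elided in this sketch ...
    return out
```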

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Summary per file:

tests/pytorch/kernel/test_paged_attention.py: Updated tests to use the new flash_attn_with_kvcache interface and renamed parameters
tests/pytorch/kernel/test_flash_attention.py: Updated tests to use flash_attn_varlen_func with new parameter names and cu_seqlens fixtures
tests/pytorch/kernel/test_fill_kv_cache.py: Added an initialization fixture for the FP8 test class
lmdeploy/pytorch/nn/attention.py: Changed the logit_softcapping default from None to 0.0
lmdeploy/pytorch/models/utils/cudagraph.py: Moved the _get_meta_flashattn function and improved FA3 scheduler metadata buffer management
lmdeploy/pytorch/kernels/cuda/pagedattention.py: Renamed function to flash_attn_with_kvcache, aligned parameters with the FA interface, returns the output tensor
lmdeploy/pytorch/kernels/cuda/flashattention.py: Renamed to flash_attn_varlen_func, added cu_seqlens support, standardized parameter names
lmdeploy/pytorch/kernels/cuda/__init__.py: Updated exports to reflect the renamed functions
lmdeploy/pytorch/backends/cuda/op_backend.py: Removed _get_meta_flashattn (moved to cudagraph), updated metadata construction
lmdeploy/pytorch/backends/cuda/flash_attention.py: Updated to use the renamed function and parameter names
lmdeploy/pytorch/backends/cuda/attention/mla.py: New file: FlashMLA attention implementation extracted from the monolithic attention.py
lmdeploy/pytorch/backends/cuda/attention/fa3.py: New file: FA3 attention implementation extracted from the monolithic attention.py
lmdeploy/pytorch/backends/cuda/attention/default.py: New file: default Triton attention implementation and metadata
lmdeploy/pytorch/backends/cuda/attention/__init__.py: New file: attention builder with automatic selection between implementations (see the sketch after this list)
lmdeploy/pytorch/backends/cuda/attention.py: Deleted: split into the modular structure above
lmdeploy/pytorch/backends/attention.py: Changed the logical_softcapping default from None to 0.0

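As an illustration of what the new attention/__init__.py builder could look like, here is a minimal, self-contained selection sketch. The class names and selection flags below are assumptions made for this example; only the idea of choosing between the mla.py, fa3.py, and default.py backends comes from the PR.

```python
# Hypothetical sketch only: the class and parameter names below are
# illustrative assumptions, not the actual lmdeploy builder code.
from dataclasses import dataclass


@dataclass
class TritonAttentionImpl:   # stands in for default.py (default Triton kernels)
    num_heads: int
    head_dim: int


@dataclass
class FA3Impl:               # stands in for fa3.py (FlashAttention-3 backend)
    num_heads: int
    head_dim: int


@dataclass
class FlashMLAImpl:          # stands in for mla.py (FlashMLA backend)
    num_heads: int
    head_dim: int


def build_attention_impl(num_heads: int, head_dim: int, *, use_mla: bool, fa3_available: bool):
    """Pick one of the modular backends, roughly as the new __init__.py might."""
    if use_mla:
        return FlashMLAImpl(num_heads, head_dim)
    if fa3_available:
        return FA3Impl(num_heads, head_dim)
    return TritonAttentionImpl(num_heads, head_dim)


# Usage example:
impl = build_attention_impl(32, 128, use_mla=False, fa3_available=True)
print(type(impl).__name__)  # FA3Impl
```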

Collaborator

@RunningLeon left a comment

LGTM

@lvhan028 merged commit 2686053 into InternLM:main on Dec 31, 2025
5 checks passed