Merged
Pull request overview
This PR refactors the attention mechanism to align the custom kernel interface with FlashAttention (FA) and improves FA3 scheduler management. The changes rename core functions, standardize parameter names, and reorganize the attention implementation into a modular structure.
Key changes:
- Renamed `paged_attention_fwd` to `flash_attn_with_kvcache` and `flash_attention_fwd` to `flash_attn_varlen_func` to align with FA naming conventions
- Standardized parameter names: `kv_seqlens` → `cache_seqlens`, `sm_scale` → `softmax_scale`, `logit_softcapping` → `softcap`, `max_seqlen` → `max_seqlen_q` (a migration sketch follows this list)
- Changed the default value of `logit_softcapping` from `None` to `0.0` across attention implementations
- Refactored attention backends into a modular structure with `default.py`, `fa3.py`, and `mla.py` implementations
- Improved FA3 scheduler metadata management by extracting a `_get_meta_flashattn` function and updating buffer handling in cudagraph
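For call sites being updated, the rename map above is mechanical. A minimal migration sketch: the name mapping comes from this PR, but the `migrate_kwargs` helper is hypothetical and not part of lmdeploy.

```python
# Parameter renames introduced by this PR (old name -> FA-aligned name).
_RENAMED = {
    "kv_seqlens": "cache_seqlens",
    "sm_scale": "softmax_scale",
    "logit_softcapping": "softcap",
    "max_seqlen": "max_seqlen_q",
}

def migrate_kwargs(old_kwargs: dict) -> dict:
    """Translate pre-refactor keyword arguments to the FA-aligned names."""
    return {_RENAMED.get(k, k): v for k, v in old_kwargs.items()}

assert migrate_kwargs({"kv_seqlens": [32], "sm_scale": 0.125}) == {
    "cache_seqlens": [32],
    "softmax_scale": 0.125,
}
```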
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| tests/pytorch/kernel/test_paged_attention.py | Updated tests to use new flash_attn_with_kvcache interface and renamed parameters |
| tests/pytorch/kernel/test_flash_attention.py | Updated tests to use flash_attn_varlen_func with new parameter names and cu_seqlens fixtures |
| tests/pytorch/kernel/test_fill_kv_cache.py | Added initialization fixture for FP8 test class |
| lmdeploy/pytorch/nn/attention.py | Changed logit_softcapping default from None to 0.0 |
| lmdeploy/pytorch/models/utils/cudagraph.py | Moved _get_meta_flashattn function and improved FA3 scheduler metadata buffer management |
| lmdeploy/pytorch/kernels/cuda/pagedattention.py | Renamed function to flash_attn_with_kvcache, aligned parameters with FA interface, returns output tensor |
| lmdeploy/pytorch/kernels/cuda/flashattention.py | Renamed to flash_attn_varlen_func, added cu_seqlens support, standardized parameter names |
| `lmdeploy/pytorch/kernels/cuda/__init__.py` | Updated exports to reflect renamed functions |
| lmdeploy/pytorch/backends/cuda/op_backend.py | Removed _get_meta_flashattn (moved to cudagraph), updated metadata construction |
| lmdeploy/pytorch/backends/cuda/flash_attention.py | Updated to use renamed function and parameter names |
| lmdeploy/pytorch/backends/cuda/attention/mla.py | New file: FlashMLA attention implementation extracted from monolithic attention.py |
| lmdeploy/pytorch/backends/cuda/attention/fa3.py | New file: FA3 attention implementation extracted from monolithic attention.py |
| lmdeploy/pytorch/backends/cuda/attention/default.py | New file: Default Triton attention implementation and metadata |
| `lmdeploy/pytorch/backends/cuda/attention/__init__.py` | New file: Attention builder with automatic selection between implementations |
| lmdeploy/pytorch/backends/cuda/attention.py | Deleted: Split into modular structure |
| lmdeploy/pytorch/backends/attention.py | Changed logical_softcapping default from None to 0.0 |
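For context on the `None` → `0.0` default change: the FlashAttention convention treats `softcap=0.0` as disabled, which lets `0.0` replace the old `None` sentinel without changing behavior. A minimal reference of that semantic, assuming the FA convention (this is not code from the PR):

```python
import torch

def apply_softcap(scores: torch.Tensor, softcap: float) -> torch.Tensor:
    # FA convention: softcap == 0.0 means "no capping", so 0.0 can serve
    # as the default instead of None.
    if softcap > 0.0:
        return softcap * torch.tanh(scores / softcap)
    return scores

scores = torch.randn(2, 8, 16, 16)
assert torch.equal(apply_softcap(scores, 0.0), scores)   # disabled: identity
assert apply_softcap(scores, 30.0).abs().max() <= 30.0   # logits bounded by cap
```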
lvhan028 approved these changes on Dec 31, 2025
RunningLeon reviewed on Dec 31, 2025
Align the interface of our custom kernels with FA.
Better FA3 scheduler management.
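For illustration, the new `attention/` package implies a builder that picks one implementation depending on platform and feature support. A hypothetical sketch of that selection; the actual classes and conditions in `lmdeploy/pytorch/backends/cuda/attention/__init__.py` may differ.

```python
from dataclasses import dataclass

@dataclass
class AttentionSetup:
    # Hypothetical feature flags standing in for the real capability checks.
    use_flash_mla: bool = False
    use_fa3: bool = False

def select_attention_impl(setup: AttentionSetup) -> str:
    # Prefer specialized kernels when enabled; fall back to Triton.
    if setup.use_flash_mla:
        return "mla.py"      # FlashMLA implementation
    if setup.use_fa3:
        return "fa3.py"      # FlashAttention-3 implementation
    return "default.py"      # default Triton implementation

assert select_attention_impl(AttentionSetup(use_fa3=True)) == "fa3.py"
assert select_attention_impl(AttentionSetup()) == "default.py"
```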