
Move fa4 from sgl-kernel to jit kernel #17353

Merged
BBuf merged 18 commits into main from try_to_move_fa4_to_jit_kernel on Jan 24, 2026

Conversation

BBuf (Collaborator) commented Jan 19, 2026

Motivation

Flash Attention V4 is implemented using the cute-dsl and is inherently JIT-compiled, so moving it to jit_kernel is a fairly natural transition.

The python/sglang/jit_kernel/flash_attention/cute directory is copied from https://github.com/Dao-AILab/flash-attention/tree/main/flash_attn/cute , so this directory does not need review.

[Image]

Unit tests passed.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

github-actions bot added the documentation, dependencies, sgl-kernel, and diffusion labels Jan 19, 2026
Fridge003 (Collaborator) commented

Nice, we definitely need this.
This way, FA4 updates won't be blocked by sgl-kernel updates.

gemini-code-assist (Contributor) commented

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the project by migrating the FlashAttention v4 (FA4) implementation to a new, more appropriate location within the sglang/jit_kernel module. This change streamlines the project's architecture, making the JIT-compiled components more discoverable and logically grouped. The core functionality of FA4 remains unchanged, with the primary focus being on improving the structural integrity of the codebase.

Highlights

  • FlashAttention v4 (FA4) Migration: The entire FlashAttention v4 implementation has been relocated from the sgl-kernel directory to python/sglang/jit_kernel/flash_attention/cute/. This includes all core CuTe DSL files, helper functions, and related utilities.
  • Codebase Centralization: This move centralizes the FA4 codebase within the sglang project structure, improving organization and maintainability by placing JIT-compiled kernels in a dedicated jit_kernel module.
  • Import Path Updates: Dependent modules, including sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py and sglang/srt/layers/attention/flashattention_backend.py, have been updated to reflect the new import paths for FA4 functions.
  • Test File Relocation: The associated test file sgl-kernel/tests/test_flash_attention_4.py has been renamed and moved to python/sglang/jit_kernel/tests/test_flash_attention_4.py, aligning tests with the new module structure.


gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request moves the FlashAttention v4 implementation from sgl-kernel to a new jit_kernel directory within the sglang package. As requested, the review focuses on the integration points and new interface code, while skipping the vendored code from the flash-attention repository.

I've identified a few critical issues in the new interface and integration code that will cause runtime errors due to incorrect function calls and undefined variables. These need to be addressed to ensure the new FA4 implementation can be correctly invoked. Please see the detailed comments for suggestions on how to fix them.

BBuf (Collaborator, Author) commented Jan 19, 2026

/tag-and-rerun-ci

@merrymercy merrymercy requested a review from zhyncs January 20, 2026 07:45
merrymercy (Contributor) commented

We do not guarantee any backward compatibility for an experimental API (early FA4) in sgl-kernel for other non-sglang projects, so I believe we can merge this and delete the old code.

It should be easy for other projects to migrate (copy code and change a few lines of imports). It does not make sense to sacrifice our code's cleanness for other projects. Unfortunately, the other projects have to pay the maintenance overhead.
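
The migration mentioned above (changing a few import lines) could be sketched as an import fallback in a downstream project. This is only an illustration: `import_first_available` is a hypothetical helper, and the module paths and the function name in `FA4_CANDIDATES` are assumptions that this PR does not confirm.

```python
import importlib
from typing import Any, List, Sequence, Tuple


def import_first_available(candidates: Sequence[Tuple[str, str]]) -> Any:
    """Return the first attribute that imports cleanly from (module, attribute) pairs."""
    errors: List[str] = []
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{module_name}.{attr_name}: {exc}")
    raise ImportError("no candidate could be imported:\n" + "\n".join(errors))


# Hypothetical candidates, newest location first. Both module paths and the
# function name are assumptions made for illustration only.
FA4_CANDIDATES = [
    ("sglang.jit_kernel.flash_attention", "flash_attn_varlen_func"),
    ("sgl_kernel.flash_attn", "flash_attn_varlen_func"),
]
```

A downstream project would call `import_first_available(FA4_CANDIDATES)` once at start-up and use the returned function. The lookup is deliberately not run at import time here, so the module stays importable where neither package is installed.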

@sgl-project sgl-project deleted a comment from zhyncs Jan 20, 2026
BBuf (Collaborator, Author) commented Jan 20, 2026

> @BBuf
>
> Thanks for the explanation. I agree that FA4 should evolve independently and should not be upgraded or released from sgl-kernel.
>
> My point about keeping FA4 in sgl-kernel is only for backward compatibility. sgl-kernel is depended on by multiple downstream projects (not just sglang), and some users do not rely on PyPI installs. Removing it would be a breaking change.
>
> Since the new FA4 uses a different import path (JIT or a future standalone package), keeping the existing FA4 in sgl-kernel does not constrain or interfere with future FA4 upgrades. It can remain frozen as a legacy interface.
>
> For this reason, I’d prefer to keep it and follow the same approach for future kernel migrations unless we explicitly plan a breaking change.

Agreed. I'll add the sgl-kernel FA4 back. Thanks for your explanation.

johnnynunez (Contributor) commented Jan 20, 2026

I think that at some point we can move everything to the CuTe DSL, so it would be nice to remove the C++ FA from sglang.

It saves compilation time, which means less pressure on CI and lower costs.

The FA CuTe DSL has FA2 and FA3 (but I don't know if it has all the features).

I am now close to the CUTLASS team, so I'll share future changes.

BBuf (Collaborator, Author) commented Jan 21, 2026

> I think that at some point we can move everything to the CuTe DSL, so it would be nice to remove the C++ FA from sglang.
>
> It saves compilation time, which means less pressure on CI and lower costs.
>
> The FA CuTe DSL has FA2 and FA3 (but I don't know if it has all the features).
>
> I am now close to the CUTLASS team, so I'll share future changes.

Indeed, this can significantly reduce the size of the sgl-kernel package.

johnnynunez (Contributor) commented

> > I think that at some point we can move everything to the CuTe DSL, so it would be nice to remove the C++ FA from sglang.
> > It saves compilation time, which means less pressure on CI and lower costs.
> > The FA CuTe DSL has FA2 and FA3 (but I don't know if it has all the features).
> > I am now close to the CUTLASS team, so I'll share future changes.
>
> Indeed, this can significantly reduce the size of the sgl-kernel package.

Yes, and we can incorporate more archs.

BBuf (Collaborator, Author) commented Jan 21, 2026

/rerun-failed-ci

johnnynunez (Contributor) commented

The problem I see here is that the API is still beta and not all features are available. In sgl-kernel we point to an exact commit in CMakeLists.txt, so we should have the same behavior here, because they are constantly changing the API.
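
One way to keep the "exact commit" behavior for a vendored directory is to record the upstream commit next to the copied code and check it in a test. The following is a minimal sketch under stated assumptions: the `UPSTREAM_COMMIT` marker file and both helper functions are hypothetical and are not part of sglang or flash-attention.

```python
import re
from pathlib import Path

# A full Git SHA-1 commit id: exactly 40 lowercase hexadecimal characters.
_SHA_RE = re.compile(r"[0-9a-f]{40}")


def read_pinned_commit(vendored_dir: Path, marker_name: str = "UPSTREAM_COMMIT") -> str:
    """Read the upstream commit recorded in a (hypothetical) marker file."""
    text = (vendored_dir / marker_name).read_text(encoding="utf-8").strip()
    if not _SHA_RE.fullmatch(text):
        raise ValueError(f"{marker_name} does not contain a full commit SHA: {text!r}")
    return text


def check_pin(vendored_dir: Path, expected_commit: str) -> None:
    """Raise if the vendored copy was taken from a different upstream commit."""
    actual = read_pinned_commit(vendored_dir)
    if actual != expected_commit:
        raise RuntimeError(
            f"vendored code is pinned to {actual}, expected {expected_commit}"
        )
```

With a check like this, an update of the vendored copy has to change the recorded commit in the same change, which makes API drift visible in review.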

BBuf (Collaborator, Author) commented Jan 21, 2026

> The problem I see here is that the API is still beta and not all features are available. In sgl-kernel we point to an exact commit in CMakeLists.txt, so we should have the same behavior here, because they are constantly changing the API.

If the interface changes later, we can modify our code accordingly, and we won't have to go through the cumbersome process of releasing a new version through sgl-kernel.

BBuf (Collaborator, Author) commented Jan 21, 2026

/rerun-failed-ci

BBuf (Collaborator, Author) commented Jan 23, 2026

/rerun-failed-ci

BBuf (Collaborator, Author) commented Jan 23, 2026

/rerun-failed-ci

BBuf (Collaborator, Author) commented Jan 24, 2026

/rerun-failed-ci

BBuf (Collaborator, Author) commented Jan 24, 2026

/rerun-failed-ci

BBuf (Collaborator, Author) commented Jan 24, 2026

BBuf (Collaborator, Author) commented Jan 24, 2026

@zhyncs The requested change has been resolved and CI has passed too. Could you approve it? Thanks.

BBuf (Collaborator, Author) commented Jan 24, 2026

@BBuf BBuf merged commit 3992a02 into main Jan 24, 2026
293 of 314 checks passed
@BBuf BBuf deleted the try_to_move_fa4_to_jit_kernel branch January 24, 2026 07:25
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
@rainj-me rainj-me mentioned this pull request Feb 26, 2026
8 tasks

Labels

dependencies, diffusion, documentation, run-ci, sgl-kernel

5 participants