Move fa4 from sgl-kernel to jit kernel by BBuf · Pull Request #17353 · sgl-project/sglang

BBuf · 2026-01-19T12:37:42Z

Motivation

Flash Attention V4 is implemented using the cute-dsl and is inherently JIT-compiled, so moving it to jit_kernel is a fairly natural transition.

The python/sglang/jit_kernel/flash_attention/cute directory is copied from https://github.com/Dao-AILab/flash-attention/tree/main/flash_attn/cute , so this directory does not need review.

Unit tests passed.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

Fridge003 · 2026-01-19T12:44:58Z

Nice, we definitely need this.
This way, FA4 updates won't be blocked by sgl-kernel updates.

gemini-code-assist · 2026-01-19T12:49:00Z

Summary of Changes

Hello @BBuf, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the project by migrating the FlashAttention v4 (FA4) implementation to a new, more appropriate location within the sglang/jit_kernel module. This change streamlines the project's architecture, making the JIT-compiled components more discoverable and logically grouped. The core functionality of FA4 remains unchanged, with the primary focus being on improving the structural integrity of the codebase.

Highlights

FlashAttention v4 (FA4) Migration: The entire FlashAttention v4 implementation has been relocated from the sgl-kernel directory to python/sglang/jit_kernel/flash_attention/cute/. This includes all core CuTe DSL files, helper functions, and related utilities.
Codebase Centralization: This move centralizes the FA4 codebase within the sglang project structure, improving organization and maintainability by placing JIT-compiled kernels in a dedicated jit_kernel module.
Import Path Updates: Dependent modules, including sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py and sglang/srt/layers/attention/flashattention_backend.py, have been updated to reflect the new import paths for FA4 functions.
Test File Relocation: The associated test file sgl-kernel/tests/test_flash_attention_4.py has been renamed and moved to python/sglang/jit_kernel/tests/test_flash_attention_4.py, aligning tests with the new module structure.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request moves the FlashAttention v4 implementation from sgl-kernel to a new jit_kernel directory within the sglang package. As requested, the review focuses on the integration points and new interface code, while skipping the vendored code from the flash-attention repository.

I've identified a few critical issues in the new interface and integration code that will cause runtime errors due to incorrect function calls and undefined variables. These need to be addressed to ensure the new FA4 implementation can be correctly invoked. Please see the detailed comments for suggestions on how to fix them.

python/sglang/jit_kernel/flash_attention_v4.py

python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py

python/sglang/srt/layers/attention/flashattention_backend.py

BBuf · 2026-01-19T13:21:40Z

/tag-and-rerun-ci

merrymercy · 2026-01-20T07:49:52Z

We do not guarantee any backward compatibility for an experimental API (early FA4) in sgl-kernel for non-sglang projects, so I believe we can merge this and delete the old code.

It should be easy for other projects to migrate (copy the code and change a few lines of imports). It does not make sense to sacrifice our code's cleanliness for other projects. Unfortunately, those projects will have to bear the maintenance overhead.

BBuf · 2026-01-20T09:04:13Z

> @BBuf
>
> Thanks for the explanation. I agree that FA4 should evolve independently and should not be upgraded or released from sgl-kernel.
>
> My point about keeping FA4 in sgl-kernel is only for backward compatibility. sgl-kernel is depended on by multiple downstream projects (not just sglang), and some users do not rely on PyPI installs. Removing it would be a breaking change.
>
> Since the new FA4 uses a different import path (JIT or a future standalone package), keeping the existing FA4 in sgl-kernel does not constrain or interfere with future FA4 upgrades. It can remain frozen as a legacy interface.
>
> For this reason, I’d prefer to keep it and follow the same approach for future kernel migrations unless we explicitly plan a breaking change.

Agreed. I'll add the sgl-kernel FA4 back. Thanks for the explanation.
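
For downstream projects following this discussion, the "different import path" point could be handled with a small import shim that prefers the new JIT location and falls back to the frozen legacy one. This is only a sketch: the first module path is derived from the file python/sglang/jit_kernel/flash_attention_v4.py named in this PR, the second is a placeholder assumption for whatever sgl-kernel path a project imports today, and `import_first_available` is a hypothetical helper, not part of sglang or sgl-kernel.

```python
import importlib
from types import ModuleType
from typing import Iterable

# Candidate module paths, preferred first.
# The first entry is derived from the file path in this PR; the second is an
# ASSUMED legacy path used purely for illustration.
FA4_CANDIDATES = (
    "sglang.jit_kernel.flash_attention_v4",
    "sgl_kernel.flash_attn",  # assumption: replace with the path you import today
)


def import_first_available(candidates: Iterable[str]) -> ModuleType:
    """Return the first module in `candidates` that imports successfully."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported:\n" + "\n".join(errors)
    )
```

A caller would then write something like `fa4 = import_first_available(FA4_CANDIDATES)` and look up the functions it needs on `fa4`; the function names exposed by either module are not stated in this thread, so they are left out here.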

johnnynunez · 2026-01-20T11:48:02Z

I think that at some point we can move everything to the CuTe DSL, so it would be nice to remove the C++ FA from sglang.

It saves compilation time, which means less pressure on CI and lower costs.

The FA CuTe DSL implementation has FA2 and FA3 (but I don't know if it has all the features).

I'm now close to the CUTLASS team, so I'll share future changes.

BBuf · 2026-01-21T01:49:16Z

Indeed, this can significantly reduce the size of the sgl-kernel package.

johnnynunez · 2026-01-21T02:52:56Z

Yes, and we can incorporate more architectures.

BBuf · 2026-01-21T03:53:49Z

/rerun-failed-ci

johnnynunez · 2026-01-21T12:31:03Z

The problem I see here is that the API is still in beta and not all features are available. In sgl-kernel we point to an exact commit in CMakeLists.txt, so we should have the same behavior here, because they are constantly changing the API.

BBuf · 2026-01-21T12:33:54Z

If there are changes to the interface later, we can modify it accordingly, and we won't have to go through the cumbersome process of releasing a new version through sgl-kernel.
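
The pinning concern in this exchange could, as one possible approach, be handled for a copied directory by recording the upstream commit next to the vendored code and checking it in a test. The sketch below is an assumption-laden illustration: the `UPSTREAM_COMMIT` marker file and both helper functions are hypothetical and are not something this PR adds.

```python
from pathlib import Path

# Hypothetical marker file stored inside the vendored directory; it holds the
# upstream flash-attention commit the copy was taken from.
PIN_FILE = "UPSTREAM_COMMIT"


def read_pinned_commit(vendored_dir: Path) -> str:
    """Read the recorded upstream commit hash from the vendored directory."""
    return (vendored_dir / PIN_FILE).read_text().strip()


def check_pin(vendored_dir: Path, expected: str) -> bool:
    """Return True if the recorded commit starts with the expected hash prefix."""
    return read_pinned_commit(vendored_dir).startswith(expected)
```

Updating the vendored copy would then mean changing the marker file in the same commit as the code, which keeps the reviewable record of which upstream revision is in use without requiring an sgl-kernel release.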

BBuf · 2026-01-21T15:39:37Z

/rerun-failed-ci

BBuf · 2026-01-23T15:45:31Z

/rerun-failed-ci

BBuf · 2026-01-23T15:49:59Z

/rerun-failed-ci

BBuf · 2026-01-24T00:18:12Z

/rerun-failed-ci

BBuf · 2026-01-24T01:46:28Z

/rerun-failed-ci

BBuf · 2026-01-24T06:29:47Z

Merged with CI green. https://github.com/sgl-project/sglang/actions/runs/21291016853/job/61339174344?pr=17353

BBuf · 2026-01-24T07:18:07Z

@zhyncs The requested change has been addressed and CI has passed too. Could you approve this? Thanks.

BBuf · 2026-01-24T07:24:55Z

Merged with CI green. https://github.com/sgl-project/sglang/actions/runs/21291016853/job/61339174355?pr=17353

BBuf added 8 commits January 18, 2026 18:16

a9a5104 ud
8e29998 ud
de8a2e5 ud
0e049c8 ud
ddfef72 ud
6a17983 ud
7cf1f7f ud
830671d ud

BBuf requested review from DarkSharpness, FlamingoPg, Fridge003, HaiShaw, Qiaolin-Yu, hebiao064, ispobock, merrymercy, mickqian, yhyang201, yizhang2077 and zhyncs as code owners January 19, 2026 12:37

github-actions bot added the documentation, dependencies, sgl-kernel, and diffusion labels Jan 19, 2026

gemini-code-assist bot reviewed Jan 19, 2026

python/sglang/jit_kernel/flash_attention_v4.py (outdated, resolved)

python/sglang/multimodal_gen/runtime/layers/attention/backends/flash_attn.py (outdated, resolved)

python/sglang/srt/layers/attention/flashattention_backend.py (outdated, resolved)

BBuf added 2 commits January 19, 2026 20:56

c6d1f80 format
3c3cb31 ud

merrymercy approved these changes Jan 20, 2026

merrymercy requested a review from zhyncs January 20, 2026 07:45

sgl-project deleted a comment from zhyncs Jan 20, 2026

BBuf and others added 2 commits January 21, 2026 09:44

9136f65 ud
51c735c Merge branch 'main' into try_to_move_fa4_to_jit_kernel
a1fda15 fix ci

BBuf and others added 3 commits January 23, 2026 20:33

24effe8 ud
3e20b76 Merge branch 'main' into try_to_move_fa4_to_jit_kernel
735ada6 ud

zhyncs approved these changes Jan 24, 2026

BBuf merged commit 3992a02 into main Jan 24, 2026
293 of 314 checks passed

BBuf deleted the try_to_move_fa4_to_jit_kernel branch January 24, 2026 07:25

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Move fa4 from sgl-kernel to jit kernel (sgl-project#17353)

42cafbb

rainj-me mentioned this pull request Feb 26, 2026

[Feature] Sglang FA4 Refactor #19447

Open

8 tasks

5 participants
