fuse qkvbfg linear into one gemm and f_b g_b into batched gemm. #17801
ispobock merged 4 commits into sgl-project:main
Conversation
/tag-and-rerun-ci
@yizhang2077 @ispobock The benchmark is now ready here.
/rerun-failed-ci
```python
        self.num_heads,
    ]
    self.fg_sizes = [self.head_dim, self.head_dim]
    self.fused_qkvbfg_proj = MergedColumnParallelRepeatedLinear(
```
It seems that, for large batch sizes, this will cause higher TTFT, since prefill is already compute bound.
A GEMM with large M should not cause higher latency; I think this benchmark result may be caused by noise. I'll run it a few more times and do a separate test.
@ispobock I ran the benchmark, but the TTFT result is not stable, so I use throughput at output_size=1 to measure prefill performance at batch_size=128.
Before the optimization it is 12.87 and after it is 12.96; this result is quite stable.
@ispobock For the separate test, I use the following code, mainly for m=4096 to 16384, because these values are commonly used as chunked_prefill_size.
```python
import torch

# Shapes modeled on the kda layer: 6 projections (q, k, v, b, f_a, g_a)
# share the same input x; f_b and g_b have the same input/output sizes.
m = 8192  # also tested with m = 4096 and m = 16384
k = 2304
n1s = [1024, 1024, 1024, 8, 128, 128]
n2 = 1024

x = torch.randn([m, k], device='cuda', dtype=torch.bfloat16)
w1s = [torch.randn([n, k], device='cuda', dtype=torch.bfloat16) for n in n1s]
w2s = [torch.randn([n2, 128], device='cuda', dtype=torch.bfloat16) for _ in range(2)]
merged_w1 = torch.cat(w1s, dim=0)    # fuse the 6 weights along the output dim
merged_w2 = torch.stack(w2s, dim=0)  # stack f_b/g_b weights for the batched gemm

def forward_before():
    # 6 separate gemms, then 2 small gemms on the f_a/g_a outputs
    y1s = [x @ w.T for w in w1s]
    y21 = y1s[-2] @ w2s[0].T
    y22 = y1s[-1] @ w2s[1].T

def forward_after():
    # 1 fused gemm, then 1 batched gemm (bmm) over the last 256 columns,
    # which hold the two 128-wide f_a/g_a outputs
    merged_y1 = x @ merged_w1.T
    merged_y2 = torch.bmm(
        merged_y1[:, -256:].view(m, 2, 128).transpose(0, 1),
        merged_w2.transpose(-1, -2),
    )

forward_before()
forward_after()
```
For m=4096: 0.57ms vs 0.50ms; for m=8192: 1.03ms vs 0.98ms; for m=16384: 1.99ms vs 1.91ms.
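The snippet above does not show how these millisecond figures were collected; a minimal, framework-free sketch of one way to time such a function (a hypothetical `time_fn` helper, not part of the PR) could look like this:

```python
import time

def time_fn(fn, iters=20, warmup=5):
    """Hypothetical helper: average wall-clock milliseconds per call of fn.

    Note: for GPU kernels you would also need to synchronize the device
    (e.g. torch.cuda.synchronize()) before reading the clock, because kernel
    launches are asynchronous; that is omitted here to keep the sketch
    framework-free.
    """
    for _ in range(warmup):  # warm up caches / autotuners first
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iters * 1e3  # ms per call

# usage: ms_before = time_fn(forward_before); ms_after = time_fn(forward_after)
ms = time_fn(lambda: sum(i * i for i in range(10_000)))
```

Per-kernel numbers like the ones quoted earlier would instead come from a profiler trace rather than wall-clock timing.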
/rerun-failed-ci
@ispobock It seems most CI has passed and the rest is unrelated, and the rerun bot seems not to be working.
/rerun-failed-ci
@ispobock Now it has all passed.
Motivation
There are 8 GEMMs in kda. 6 of them (q_proj, k_proj, v_proj, b_proj, f_a_proj, g_a_proj) share the same input, so they can be fused into a single GEMM; the other 2 (f_b_proj, g_b_proj) have the same input/output sizes, so they can be fused into a batched GEMM.
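As a sanity check on this idea: fusing GEMMs that share an input amounts to concatenating the weights and splitting the output, and the batched GEMM is a stack of the two equally-shaped small matmuls. A minimal numpy sketch verifying the equivalence (shapes are illustrative, not the real layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 4, 16              # illustrative shapes only
n1s = [8, 8, 8, 2, 4, 4]  # stand-ins for q, k, v, b, f_a, g_a output sizes

x = rng.standard_normal((m, k))
w1s = [rng.standard_normal((n, k)) for n in n1s]

# 6 separate gemms vs one fused gemm over concatenated weights
ys = [x @ w.T for w in w1s]
fused_y = x @ np.concatenate(w1s, axis=0).T
splits = np.split(fused_y, np.cumsum(n1s)[:-1], axis=1)
for y, s in zip(ys, splits):
    assert np.allclose(y, s)

# 2 small gemms with identical shapes vs one batched gemm
w2s = [rng.standard_normal((8, 4)) for _ in range(2)]
z0, z1 = ys[-2] @ w2s[0].T, ys[-1] @ w2s[1].T
batched = np.stack([ys[-2], ys[-1]]) @ np.stack(w2s).transpose(0, 2, 1)
assert np.allclose(batched[0], z0) and np.allclose(batched[1], z1)
```

The fusion does not change the math; the win is replacing many small kernel launches with one large GEMM and one bmm, which also tends to use the hardware better.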
With a 4k-input, 1k-output decode test, the profile looks like the following:

Kernel duration decreases from 5.9+5.7+5.6+5.6+2.5+5.2+5.7+2.6 = 38.8us to 8.7+3 = 11.7us.
For prefill, kernel duration goes from 286+286+286+45+20+14+45+19 = 1001us to 934+35 = 969us, so the optimization is much smaller than for decode.
Modifications
Accuracy Tests
Tested with gsm8k on all questions:
before:
after:
Benchmarking and Profiling
before:
after:
TPOT decreases by about 5%~10%.
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci