
Kernel: optimize decoding metadata in NSA multi-spec backend with fused kernels#17554

Merged
Fridge003 merged 9 commits into sgl-project:main from Johnsonms:nsa-metadata-copy-kernel-v2 on Feb 14, 2026

Conversation

@Johnsonms (Contributor) commented Jan 22, 2026

Motivation

Implement fused CUDA kernels that eliminate redundant metadata copies in the Native Sparse Attention (NSA) backend during CUDA graph replay for speculative decoding. This optimization yields a 3-5x speedup for multi-backend metadata operations.

Changes

Core Implementation

  • Add fused_metadata_copy_cuda: Single-backend fused kernel supporting DECODE, TARGET_VERIFY, and DRAFT_EXTEND forward modes
  • Add fused_metadata_copy_multi_cuda: Multi-backend kernel that copies metadata to 3 backends simultaneously in a single kernel launch

Runtime Optimization

  • Update nsa_backend.py to use the fused kernels intelligently:
    • speculative_num_steps >= 3: use the fused kernel for the first 3 backends, then copy the remaining backends individually
    • speculative_num_steps < 3: use individual copies (the fusion overhead is not worth it)
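The dispatch rule above can be sketched in a few lines of Python (a minimal stand-in, not the actual nsa_backend.py code; `fused_copy` and `single_copy` are hypothetical stand-ins for the kernel calls):

```python
# Illustrative sketch of the dispatch described above; function names and
# signatures are hypothetical, not the real nsa_backend.py API.
def copy_metadata(backends, speculative_num_steps, fused_copy, single_copy):
    """Copy decode metadata to each per-step attention backend.

    fused_copy(backends) handles up to 3 backends in one kernel launch;
    single_copy(backend) copies one backend's metadata on its own.
    """
    if speculative_num_steps >= 3:
        fused_copy(backends[:3])        # one launch covers the first 3 backends
        for backend in backends[3:]:    # any remaining backends, one by one
            single_copy(backend)
    else:
        # With fewer than 3 backends, the fusion setup overhead outweighs
        # the saved launches, so copy individually.
        for backend in backends:
            single_copy(backend)
```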

Testing

  • Create a comprehensive test suite covering:
    • Single-backend kernel: all forward modes, optional tensors
    • Multi-backend kernel: 3-backend simultaneous copy
    • Performance benchmarks with timing and speedup measurements
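The central correctness check in a suite like this asserts that the fused path produces output identical to per-backend copies; a pure-Python sketch of that shape (stand-in helpers, not the actual test code):

```python
# Stand-ins for the two copy paths; names are illustrative, not the real
# sgl_kernel test helpers.
def individual_copy(src, dst):
    """Reference path: one copy per backend."""
    for key, value in src.items():
        dst[key] = list(value)

def fused_copy_stand_in(srcs, dsts):
    """Emulates the fused kernel: all backends handled in one call."""
    for src, dst in zip(srcs, dsts):
        individual_copy(src, dst)

def fused_matches_individual(srcs):
    """Fused and individual paths must produce identical metadata."""
    dsts_fused = [{} for _ in srcs]
    dsts_ref = [{} for _ in srcs]
    fused_copy_stand_in(srcs, dsts_fused)
    for src, dst in zip(srcs, dsts_ref):
        individual_copy(src, dst)
    return dsts_fused == dsts_ref
```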

Performance Impact

CPU side

Before and after CPU-side profiles (screenshots omitted).

GPU side

Before and after GPU-side profiles (screenshots omitted).

Performance Improvements

CPU-side metadata processing:

  • Before: 193μs (24 kernel launches)
  • After: 12μs (1 fused kernel launch)
  • Speedup: 16x faster (193μs → 12μs)

Kernel execution time:

  • Before: 177μs
  • After: 4.7μs
  • Speedup: 37.6x faster (177μs → 4.7μs)

End-to-end throughput:

  • Before: 151 tokens/sec
  • After: 179 tokens/sec
  • Improvement: +18.5% (+28 tokens/sec)
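The quoted ratios follow directly from the measured timings:

```python
# Sanity check of the speedup arithmetic quoted above.
cpu_before_us, cpu_after_us = 193, 12
gpu_before_us, gpu_after_us = 177, 4.7
tok_before, tok_after = 151, 179

cpu_speedup = cpu_before_us / cpu_after_us                      # ~16.1x ("16x")
kernel_speedup = gpu_before_us / gpu_after_us                   # ~37.7x ("37.6x")
throughput_gain = (tok_after - tok_before) / tok_before * 100   # ~18.5%
```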

Technical Details

The fused kernels handle:

  • Basic metadata: cache_seqlens, cu_seqlens_k, page_table, nsa_cache_seqlens
  • Optional tensors: real_page_table, flashmla_num_splits, flashmla_metadata
  • All copies are performed in a single kernel launch, reducing launch and memory-traffic overhead
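A plain-Python sketch of the copy layout the fused kernel performs in one launch (the field names follow the PR description; everything else is illustrative):

```python
# Emulates the single-launch copy: basic metadata always, optionals only
# when present. This is a stand-in, not the CUDA kernel itself.
BASIC_KEYS = ("cache_seqlens", "cu_seqlens_k", "page_table", "nsa_cache_seqlens")
OPTIONAL_KEYS = ("real_page_table", "flashmla_num_splits", "flashmla_metadata")

def fused_metadata_copy(src, dst):
    for key in BASIC_KEYS:                 # always copied
        dst[key] = list(src[key])
    for key in OPTIONAL_KEYS:              # skipped when absent or None
        if src.get(key) is not None:
            dst[key] = list(src[key])
    return dst
```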

Files Modified

  • sgl-kernel/csrc/elementwise/fused_metadata_copy.cu: CUDA kernel impl
  • sgl-kernel/python/sgl_kernel/elementwise.py: Python bindings
  • python/sglang/srt/layers/attention/nsa_backend.py: Runtime integration
  • sgl-kernel/tests/test_fused_metadata_copy.py: Comprehensive tests

Tested: All 40 tests passing (100% pass rate)

Modifications

Accuracy Tests

  1. Accuracy Test with gsm8k
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Exp \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3 \
  --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

  2. Accuracy Test with gpqa-diamond
    Service:
    python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3

  3. Accuracy Test with aime 2025
    python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3
#!/bin/bash
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

ns prepare_data aime25

PORT=30000
BACKEND=sglang
MODEL="deepseek-ai/DeepSeek-V3.2-Exp" # Should be changed to the model name
MODEL_NAME="dsv32-fp8"

echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..."
ns eval \
  --benchmarks=aime25:4 \
  --server_type=$BACKEND \
  --model=$MODEL \
  --server_address=http://localhost:${PORT}/v1 \
  --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \
  ++chat_template_kwargs.thinking=true \
  ++inference.temperature=1.0 \
  ++inference.top_p=0.95 \
  ++inference.tokens_to_generate=64000
  # ++inference.tokens_to_generate=120000 for Speciale model
  4. Correctness
    After running the above tests with both kernels, every step was verified.

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@Fridge003 (Collaborator)

Can you please move the kernels to the jit folder, thanks~

@Johnsonms force-pushed the nsa-metadata-copy-kernel-v2 branch from f1744cc to b59ce0b on January 26, 2026 19:17
@Johnsonms force-pushed the nsa-metadata-copy-kernel-v2 branch from 8533df9 to f9cd4d8 on January 26, 2026 23:38
@Johnsonms (Contributor, Author) commented Jan 27, 2026

Re-ran the accuracy tests.

Accuracy Tests

  1. Accuracy Test with gsm8k
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3.2-Exp \
  --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --tool-call-parser deepseekv31 \
  --reasoning-parser deepseek-v3 \
  --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja

python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319

  2. Accuracy Test with gpqa-diamond
    Service:
    python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
    python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 120000 --repeat 8 --thinking-mode deepseek-v3
  3. Accuracy Test with aime 2025
    python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3
#!/bin/bash
export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1

ns prepare_data aime25

PORT=30000
BACKEND=sglang
MODEL="deepseek-ai/DeepSeek-V3.2-Exp" # Should be changed to the model name
MODEL_NAME="dsv32-fp8"

echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..."
ns eval \
  --benchmarks=aime25:4 \
  --server_type=$BACKEND \
  --model=$MODEL \
  --server_address=http://localhost:${PORT}/v1 \
  --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \
  ++chat_template_kwargs.thinking=true \
  ++inference.temperature=1.0 \
  ++inference.top_p=0.95 \
  ++inference.tokens_to_generate=64000
  # ++inference.tokens_to_generate=120000 for Speciale model

@Johnsonms force-pushed the nsa-metadata-copy-kernel-v2 branch from d920fe3 to 094323f on January 27, 2026 23:36
@Johnsonms (Contributor, Author)

> Can you please move the kernels to jit folder, thanks~

Done, and re-ran the accuracy tests. Thanks @Fridge003!

@Johnsonms force-pushed the nsa-metadata-copy-kernel-v2 branch from af349b7 to ceec97a on January 28, 2026 18:50
@DarkSharpness (Collaborator)

/tag-and-rerun-ci

@github-actions bot added labels: documentation, quant, amd, dependencies, lora, Multi-modal, deepseek, speculative-decoding, hicache, blackwell, npu, deterministic, piecewise-cuda-graph, diffusion, model-gateway, mthreads on Feb 14, 2026
… parameters

Consolidate three separate CUDA kernels (decode, target_verify, draft_extend)
into a single unified kernel with runtime mode selection and structured
parameter passing.

Key changes:
- Merge fused_metadata_copy_{decode,target_verify,draft_extend}_kernel into
  single fused_metadata_copy_kernel with runtime forward_mode branching
- Introduce structured parameter passing via SourcePointers, DestinationPointers,
  FusedMetadataCopyParams, and FusedMetadataCopyMultiParams structs
- Use __grid_constant__ attribute for efficient constant memory parameter access
- Reduce parameter count from 29 individual parameters to 1 struct
- Maintain compile-time optimization via template parameters (HAS_REAL_PAGE_TABLE,
  HAS_FLASHMLA)
- Update multi-backend kernel with same structured parameter pattern

Benefits:
- Code reduction: 233 fewer lines
- Eliminates code duplication across three nearly-identical kernels
- Improves maintainability: modify logic once instead of three times
- Cleaner API: structured parameters group related pointers logically
- Better alignment with other kernels (qknorm.cuh, rmsnorm.cuh patterns)

No performance regression expected as hot paths remain optimized via template
specialization and __grid_constant__ provides efficient parameter access.
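A Python analogue of the structured-parameter refactor this commit describes (the struct names mirror the commit message; the fields and the launch shim are illustrative):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SourcePointers:
    # A few representative fields; the real struct groups device pointers.
    cache_seqlens: List[int]
    page_table: List[int]
    real_page_table: Optional[List[int]] = None  # optional, like HAS_REAL_PAGE_TABLE

@dataclass
class FusedMetadataCopyParams:
    src: SourcePointers
    forward_mode: str  # "decode" | "target_verify" | "draft_extend"

def launch(params: FusedMetadataCopyParams):
    # One structured argument replaces 29 loose parameters.
    assert params.forward_mode in ("decode", "target_verify", "draft_extend")
    return params.src.cache_seqlens
```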
…used metadata copy

Replace throw statements with RuntimeCheck and add type-safe helper functions
for tensor data pointer extraction. Switch from relying on torch.empty(0) null
pointer behavior to proper tvm::ffi::Optional<TensorView> for optional tensors.

Changes:
- Add unwrap_data_ptr/unwrap_optional_data_ptr helper functions with integrated
  dtype validation using RuntimeCheck
- Update function signatures to use Optional<TensorView> for seqlens_expanded,
  real_page_table, and flashmla tensors
- Replace .data_ptr() null checks with .has_value()/.value() pattern
- Update Python wrapper to pass None directly instead of empty tensors
- Remove _make_empty_tensor_if_none helper (no longer needed)

Addresses review feedback from @DarkSharpness
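The calling-convention change described here can be illustrated with a small sketch (hypothetical wrapper names; the real binding lives in sgl_kernel):

```python
def fused_copy_old(kernel, required, real_page_table=None):
    # Old pattern: fabricate an empty sentinel tensor for absent inputs.
    sentinel = real_page_table if real_page_table is not None else []
    return kernel(required, sentinel)

def fused_copy_new(kernel, required, real_page_table=None):
    # New pattern: forward None as-is; the C++ side receives
    # tvm::ffi::Optional<TensorView> and checks has_value().
    return kernel(required, real_page_table)
```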
- Add linear_metadata.py for linear attention metadata handling
- Update communicator_nsa_cp.py for context parallel support
- Add test_nsa_pool_host_unit.py for HiCache NSA pool testing
- Update test_nsa_indexer.py with new test cases
- Update discover_metadata.rs for Rust gateway support
- Improve error messages in fused_metadata_copy.cuh
@Johnsonms force-pushed the nsa-metadata-copy-kernel-v2 branch from 3e7e878 to 50b1638 on February 14, 2026 02:09
@Fridge003 Fridge003 merged commit 34132d6 into sgl-project:main Feb 14, 2026
82 of 94 checks passed
@yuan-luo (Collaborator)

Awesome job!
