[Bug] NPU Qwen3-32B eagle3 failed with sgl_kernel_npu error. #343

@chz34

Description

Bug Report

The server fails to start with the following error:

[2026-01-22 08:26:23] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
[2026-01-22 08:26:23 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/home/chz/sglang/python/sglang/srt/managers/scheduler.py", line 2957, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/managers/scheduler.py", line 1126, in event_loop_overlap
    batch_result = self.run_batch(batch)
                   ^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/managers/scheduler.py", line 2278, in run_batch
    batch_result = self.model_worker.forward_batch_generation(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 656, in forward_batch_generation
    self.draft_worker._draft_extend_for_prefill(
  File "/home/chz/sglang/python/sglang/srt/speculative/eagle_worker_v2.py", line 500, in _draft_extend_for_prefill
    logits_output = self.draft_runner.forward(forward_batch).logits_output
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/model_executor/model_runner.py", line 2234, in forward
    output = self._forward_raw(
             ^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/model_executor/model_runner.py", line 2333, in _forward_raw
    ret = self.forward_extend(
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/model_executor/model_runner.py", line 2173, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/models/llama.py", line 508, in forward
    hidden_states = self.model(
                    ^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/models/llama_eagle3.py", line 170, in forward
    hidden_states, residual = self.midlayer(
                              ^^^^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/models/llama_eagle3.py", line 90, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/models/llama.py", line 230, in forward
    q, k, v = self.forward_prepare_npu(
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chz/sglang/python/sglang/srt/models/llama.py", line 208, in forward_prepare_npu
    q, k, v = split_qkv_rmsnorm_rope(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/Python-3.11/lib/python3.11/site-packages/sgl_kernel_npu/norm/split_qkv_rmsnorm_rope.py", line 218, in split_qkv_rmsnorm_rope
    assert KV_BLOCK_SIZE == head_dim
           ^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

server_eagle.sh: line 24: 576903 Killed                  python3 -m sglang.launch_server --model-path /home/chz/Qwen3-32B --host 0.0.0.0 --port 37654 --device npu --attention-backend ascend --tp-size 4 --dp-size 1 --mem-fraction-static 0.7 --disable-cuda-graph --speculative-draft-model-path "/home/ckpt/Qwen3-32B_eagle3" --speculative-algorithm "EAGLE3" --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
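The assertion is raised inside the fused `split_qkv_rmsnorm_rope` kernel from sgl_kernel_npu. It is reached from the EAGLE3 draft model's attention layer (`llama_eagle3.py` → `llama.py: forward_prepare_npu`), not from the Qwen3 target model. The traceback does not show how `KV_BLOCK_SIZE` is computed. A common pattern in Triton-style kernels is to round the head size up to the next power of two. If that is what happens here (an unverified assumption), the assert reduces to "the draft's `head_dim` must be a power of two". The sketch below reproduces that check. The helper names are hypothetical. The sizes 5120 / 64 / 128 are Qwen3-32B's published `hidden_size` / `num_attention_heads` / `head_dim`, used only for illustration. The draft checkpoint's own config is not shown in this report, so the "missing `head_dim`" case is a hypothesis, not a diagnosis.

```python
from typing import Optional


def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n, for n >= 1 (same result as triton.next_power_of_2)."""
    return 1 << (n - 1).bit_length()


def effective_head_dim(hidden_size: int, num_attention_heads: int,
                       head_dim: Optional[int] = None) -> int:
    """Use an explicit head_dim if the config has one; otherwise derive it.

    ASSUMPTION: the fallback hidden_size // num_attention_heads mirrors how
    many Llama-style model implementations fill in a missing head_dim.
    """
    if head_dim is not None:
        return head_dim
    return hidden_size // num_attention_heads


def passes_kv_block_check(head_dim: int) -> bool:
    """Hypothetical re-creation of `assert KV_BLOCK_SIZE == head_dim`,
    ASSUMING the kernel sets KV_BLOCK_SIZE = next_power_of_2(head_dim)."""
    return next_power_of_2(head_dim) == head_dim


if __name__ == "__main__":
    cases = {
        # Qwen3-32B sets head_dim explicitly in its config.
        "explicit head_dim=128": effective_head_dim(5120, 64, 128),
        # Hypothetical draft config with the same sizes but no head_dim key.
        "derived 5120 // 64": effective_head_dim(5120, 64),
    }
    for label, hd in cases.items():
        status = "passes" if passes_kv_block_check(hd) else "fails"
        print(f"{label}: head_dim={hd}, KV_BLOCK_SIZE={next_power_of_2(hd)} -> assert {status}")
```

Under these assumptions, an explicit `head_dim` of 128 passes. A derived value of 80 rounds up to a `KV_BLOCK_SIZE` of 128 and would trip the same assertion shown in the traceback.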

Reproduction

Launched with the following script, adapted from https://docs.sglang.io/platforms/ascend_npu_qwen3_examples.html:

export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_OP_EXPANSION_MODE=AIV
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

python3 -m sglang.launch_server \
    --model-path /home/chz/Qwen3-32B \
    --host 0.0.0.0 \
    --port 37654 \
    --device npu \
    --attention-backend ascend \
    --tp-size 4 \
    --dp-size 1 \
    --mem-fraction-static 0.7 \
    --disable-cuda-graph \
    --speculative-draft-model-path "/home/ckpt/Qwen3-32B_eagle3" \
    --speculative-algorithm "EAGLE3" \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 
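The assertion in the traceback compares `KV_BLOCK_SIZE` against the draft model's `head_dim`. A useful first check is therefore the head size each checkpoint declares. The script below is a diagnostic sketch, not part of sglang. It reads `config.json` from the target and draft directories used in the launch command above and reports `head_dim`. If that key is missing, it reports the value of `hidden_size // num_attention_heads` instead. Two details are assumptions: the fallback rule (how a Llama-style code path may fill in a missing `head_dim`) and the `power_of_two` flag (on the unverified premise that the kernel rounds `head_dim` up to a power of two).

```python
import json
import sys
from pathlib import Path
from typing import Optional


def summarize_head_dim(cfg: dict) -> dict:
    """Collect the config fields that decide the attention head size."""
    hidden = cfg.get("hidden_size")
    heads = cfg.get("num_attention_heads")
    explicit = cfg.get("head_dim")
    derived = hidden // heads if hidden and heads else None
    # ASSUMPTION: an explicit head_dim wins; otherwise hidden_size // num_attention_heads is used.
    effective: Optional[int] = explicit if explicit is not None else derived
    # ASSUMPTION: the kernel needs a power-of-two head_dim (see the assertion in the traceback).
    is_pow2 = effective is not None and effective > 0 and (effective & (effective - 1)) == 0
    return {
        "architectures": cfg.get("architectures"),
        "hidden_size": hidden,
        "num_attention_heads": heads,
        "num_key_value_heads": cfg.get("num_key_value_heads"),
        "head_dim": explicit,
        "effective_head_dim": effective,
        "power_of_two": is_pow2,
    }


def main(paths: list) -> None:
    for p in paths:
        path = Path(p)
        if not path.is_file():
            print(f"{p}: not found, skipping")
            continue
        cfg = json.loads(path.read_text())
        print(p, json.dumps(summarize_head_dim(cfg), indent=2))


if __name__ == "__main__":
    # Defaults are the model paths from the launch command; override on the command line.
    main(sys.argv[1:] or [
        "/home/chz/Qwen3-32B/config.json",
        "/home/ckpt/Qwen3-32B_eagle3/config.json",
    ])
```

Save it under any name (for example, the hypothetical `check_head_dim.py`) and run it on the serving host. To inspect other checkpoints, pass their `config.json` paths as arguments.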

Environment

Python: 3.11.11 (main, May 29 2025, 13:02:50) [Clang 17.0.6 ( 17.0.6-7.oe2203sp4)]
NPU available: True
NPU 0,1,2,3,4,5,6,7: Ascend910B1
CANN_HOME: /usr/local/Ascend/ascend-toolkit/latest
CANN: 8.3.0.1.200:8.3.RC1
BiSheng: 2025-10-24T18:53:37+08:00 clang version 15.0.5 (clang-5c68a1cb1231 flang-5c68a1cb1231)
Ascend Driver Version: 25.2.0
PyTorch: 2.7.1+cpu
sglang: 0.5.5.post3
sgl_kernel: Module Not Found
flashinfer_python: Module Not Found
flashinfer_cubin: Module Not Found
flashinfer_jit_cache: Module Not Found
triton: Module Not Found
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.13
fastapi: 0.115.14
hf_transfer: 0.1.9
huggingface_hub: 0.34.3
interegular: 0.3.3
modelscope: 1.28.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.0
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: 0.8.4.dev0+g296c657.d20250701.empty
xgrammar: 0.1.25
openai: 1.99.1
tiktoken: 0.9.0
anthropic: 0.57.1
litellm: 1.74.0.post1
decord2: 2.0.0
torch_npu: 2.7.1.post1.dev20251107
sgl-kernel-npu: 2026.1.12
deep_ep: Module Not Found
Ascend Topology:
           NPU0       NPU1       NPU2       NPU3       NPU4       NPU5       NPU6       NPU7       CPU Affinity
NPU0       X          HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       144-167
NPU1       HCCS       X          HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       0-23
NPU2       HCCS       HCCS       X          HCCS       HCCS       HCCS       HCCS       HCCS       144-167
NPU3       HCCS       HCCS       HCCS       X          HCCS       HCCS       HCCS       HCCS       0-23
NPU4       HCCS       HCCS       HCCS       HCCS       X          HCCS       HCCS       HCCS       96-119
NPU5       HCCS       HCCS       HCCS       HCCS       HCCS       X          HCCS       HCCS       48-71
NPU6       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       X          HCCS       96-119
NPU7       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       HCCS       X          48-71

Legend:

  X    = Self
  SYS  = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
  PHB  = Path traversing PCIe and the PCIe host bridge of a CPU.
  PIX  = Path traversing a single PCIe switch
  PXB  = Path traversing multiple PCIe switches
  HCCS = Connection traversing HCCS.
  NA   = Unknown relationship.

ulimit soft: 1048576
