Cover the workflows cases on a3 by zhuyutong332 · Pull Request #321 · sgl-project/sgl-kernel-npu

zhuyutong332 · 2026-01-15T13:18:44Z

Cover the a3 workflow cases, including intranode, low latency, fused deep MoE, and normal and low latency.
a2:
test_intranode (enable-dynamic-tokens enabled)
[tuning] Dispatch (BF16) 67.22 GB/s (HCCS), avg_t: 7039.39 us
[tuning] Combine 80.31 GB/s (HCCS), avg_t: 5892.55 us

test_low_latency (enable-dynamic-tokens enabled)
Average Dispatch bandwidth: 24.27 GB/s, avg_t=622.38 us
Average Combine bandwidth: 66.99 GB/s, avg_t=436.33 us

test_normal_and_low_latency (enable-dynamic-tokens enabled)
Start executing normal test loop 99 ...
End executing normal test loop 99 ...
Start executing low latency test loop 99 ...
End executing low latency test loop 99 ...

a3:
test_intranode (enable-dynamic-tokens enabled)
[tuning] Dispatch (BF16) 95.95 GB/s (HCCS), avg_t: 4948.90 us
[tuning] Combine 95.89 GB/s (HCCS), avg_t: 4951.82 us

test_low_latency (enable-dynamic-tokens enabled)
Average Dispatch bandwidth: 27.38 GB/s, avg_t=560.26 us
Average Combine bandwidth: 89.44 GB/s, avg_t=331.96 us

test_normal_and_low_latency (enable-dynamic-tokens enabled)
Start executing normal test loop 99 ...
End executing normal test loop 99 ...
Start executing low latency test loop 99 ...
End executing low latency test loop 99 ...
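
To make the a2 and a3 numbers above easier to compare, here is a minimal sketch that parses the quoted log lines and prints the a3/a2 ratios. The log lines are copied from this description. The `RESULTS` grouping and the `parse_line` and `speedups` helpers are hypothetical names for illustration only and are not part of the test scripts.

```python
import re

# Log lines copied verbatim from the PR description. The (platform, test)
# grouping is our own bookkeeping, not something the tests emit.
RESULTS = {
    ("a2", "test_intranode"): [
        "[tuning] Dispatch (BF16) 67.22 GB/s (HCCS), avg_t: 7039.39 us",
        "[tuning] Combine 80.31 GB/s (HCCS), avg_t: 5892.55 us",
    ],
    ("a2", "test_low_latency"): [
        "Average Dispatch bandwidth: 24.27 GB/s, avg_t=622.38 us",
        "Average Combine bandwidth: 66.99 GB/s, avg_t=436.33 us",
    ],
    ("a3", "test_intranode"): [
        "[tuning] Dispatch (BF16) 95.95 GB/s (HCCS), avg_t: 4948.90 us",
        "[tuning] Combine 95.89 GB/s (HCCS), avg_t: 4951.82 us",
    ],
    ("a3", "test_low_latency"): [
        "Average Dispatch bandwidth: 27.38 GB/s, avg_t=560.26 us",
        "Average Combine bandwidth: 89.44 GB/s, avg_t=331.96 us",
    ],
}

# Matches both log formats: operation name, bandwidth in GB/s, time in us.
LINE_RE = re.compile(
    r"(Dispatch|Combine)\b.*?([\d.]+) GB/s.*?avg_t[:=]\s*([\d.]+) us"
)


def parse_line(line):
    """Return (operation, bandwidth_gb_per_s, avg_time_us) for one log line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return match.group(1), float(match.group(2)), float(match.group(3))


def speedups(results):
    """Map (test, operation) to (a3/a2 bandwidth ratio, a2/a3 time ratio)."""
    table = {}
    for (platform, test), lines in results.items():
        for line in lines:
            op, gbps, usec = parse_line(line)
            table[(test, op, platform)] = (gbps, usec)
    ratios = {}
    for (test, op, platform), (gbps, usec) in table.items():
        if platform != "a3":
            continue
        base_gbps, base_usec = table[(test, op, "a2")]
        ratios[(test, op)] = (gbps / base_gbps, base_usec / usec)
    return ratios


for (test, op), (bw_ratio, time_ratio) in sorted(speedups(RESULTS).items()):
    print(f"{test:18s} {op:9s} bandwidth x{bw_ratio:.2f}  time x{time_ratio:.2f}")
```

Both ratios are written so that a value above 1 means a3 is faster than a2 for that operation.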


gemini-code-assist · 2026-01-15T13:18:50Z

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

.github/workflows/pr-test-npu.yml


* upstream/main:
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  [Doc] add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
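
The DEEPEP_HCCL_BUFFSIZE entry above (sgl-project#329) says that variable takes priority over HCCL_BUFFSIZE. A minimal sketch of that lookup order follows, assuming the rule is simply "use the first variable that is set". The `resolve_buffsize` helper is hypothetical and is not the library's actual code, and the sketch makes no claim about units or default values.

```python
import os
from typing import Mapping, Optional


def resolve_buffsize(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the buffer-size setting, preferring DEEPEP_HCCL_BUFFSIZE.

    Hypothetical helper: it only illustrates the precedence stated in the
    commit title and returns the raw string, or None if neither is set.
    """
    for name in ("DEEPEP_HCCL_BUFFSIZE", "HCCL_BUFFSIZE"):
        value = env.get(name)
        if value:  # treat an unset or empty variable as absent
            return value
    return None


# With both variables present, the DEEPEP-specific one is returned.
print(resolve_buffsize({"DEEPEP_HCCL_BUFFSIZE": "512", "HCCL_BUFFSIZE": "256"}))
```

Passing the environment as a mapping keeps the helper easy to exercise in CI without mutating `os.environ`.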


…-npu into sgl-cmake2

* 'sgl-cmake2' of https://github.com/1329009851/sgl-kernel-npu:
  CI execution requirements for separating a2 and a3 (sgl-project#367)
  Fix the bug that total expert num greater than 256 or local expert num is less than 8 (sgl-project#364)
  adapt ant moving to A2 single machine (sgl-project#362)
  reset ci -- run test mixed running for experts on a2. (sgl-project#365)
  Revert "Build the deepep package with the chip model included. (sgl-project#274)" (sgl-project#363)
  fix:buffer control (sgl-project#361)
  Build the deepep package with the chip model included. (sgl-project#274)
  bugfix wrong packages build dir (sgl-project#360)
  bump version to 2026.02.01 (sgl-project#359)
  Cover the workflows cases on a3 (sgl-project#321)
  release follows naming convention (sgl-project#356)
  Modify notifydispatch to support DEEPEP_NORMAL_LONG_SEQ_ROUND up to 128. (sgl-project#352)
  fix the hanging bug (sgl-project#355)
  [Bugfix] Fix build script working with cann 8.5.0 (sgl-project#354)
  Modify the description of DeepEP in the README file. (sgl-project#348)
  Revert "Add scripts for building CMake files (sgl-project#344)" (sgl-project#353)
  Add scripts for building CMake files (sgl-project#344)
  Support x86_64 and aarch64 binary release (sgl-project#325)
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)

* Cover the cases on a3, include intranode, low latency, fused deep moe, normal and low latency
* fix BUFFSIZE of intranode
* cleancode error
* add test-build-deep
* fix low latency assert error
* cleancode
* fix intranode stuck when num-tokens=1
* fix min buffsize
* fix min buffsize
* fix min buffsize
* delete intranode test with little bs.
* rerun after pipeline error
* rerun pipeline
* Separate the tests under the two test_intranode conditions.
* rerun pipeline
* del low latency test when topk=1
* When topk=1, no data (na) is available for low latency. Therefore, the case where topk=1 is excluded.
* create a2 ci.
* add num-processes when run on a2
* fix syntax errors.
* rerun the pipeline.
* Replacing the image of a2 test.
* rerun the pipeline
* rerun the pipeline
* fix buffsize and edit of a2
* fix buffsize.
* The value of MOE_ENABLE_TOPK_NEG_ONE is changed to match the new dynamic token code.
* add args enable-dynamic-tokens
* a2 rerun pipeline
* replace a2 images, rerun a2 pipeline
* replace a2 image, rerun pipeline
* rerun a2 pipeline
* rerun a2 pipeline
* rerun a3 pipeline
* linting
* replace a2 image, rerun a2 pipeline
* rerun a2 pipeline
* add mix test on a2, rerun a2 pipeline
* add continue-on-error: true. rerun a2 pipeline.
* add normal test on a2, rerun a2 pipeline. (Test the command currently reported as an error by a2.)
* rerun a2 pipeline
* rerun a2 pipeline
* rerun a2 pipeline
* rerun pipeline
* adjust buffsize and add intranode for rerunning the pipeline
* add mix test on a2
* only rerun intranode on a2
* add intranode on a2
* fix intranode num-tokens, rerun a2 pipeline
* rerun a3 for intranode experts
* rerun a3
* rerun a3
* rerun a3
* rerun a3
* rerun a3
* a3
* a3
* delete all experts test

Cover the cases on a3, include intranode, low latency, fused deep moe, normal and low latency

fe8878a

zhuyutong332 added 12 commits January 16, 2026 09:16

fix BUFFSIZE of intranode

1c080b9

cleancode error

d42dbd2

add test-build-deep

050a825

fix low latency assert error

5df9a9d

cleancode

64a09c9

fix intranode stuck when num-tokens=1

d747c0c

fix min buffsize

9200975

fix min buffsize

cbd6305

fix min buffsize

c232f43

delete intranode test with little bs.

082cc73

rerun after pipeline error

5c98ced

rerun pipeline

49f9fb3

Yael-X reviewed Jan 23, 2026

.github/workflows/pr-test-npu.yml (outdated, resolved)

zhuyutong332 added 15 commits January 23, 2026 14:43

Separate the tests under the two test_intranode conditions.

baaecb8

rerun pipeline

c894ece

del low latency test when topk=1

7ab8de2

When topk=1, no data (na) is available for low latency. Therefore, the case where topk=1 is excluded.

79900c8

create a2 ci.

0728780

add num-processes when run on a2

be5b1c5

fix syntax errors.

32d65fd

rerun the pipeline.

7c70d82

Replacing the image of a2 test.

21e5521

rerun the pipeline

cebf0bd

rerun the pipeline

70a1602

fix buffsize and edit of a2

ea63069

fix buffsize.

965ad8d

The value of MOE_ENABLE_TOPK_NEG_ONE is changed to match the new dynamic token code.

ffb48a9

zhuyutong332 added 25 commits January 29, 2026 09:15

rerun a3 pipeline

ad045ee

linting

887a2ac

replace a2 image, rerun a2 pipeline

fa77fb6

rerun a2 pipeline

6322ad1

add mix test on a2, rerun a2 pipeline

7b275c3

add continue-on-error: true. rerun a2 pipeline.

0e7fb82

add normal test on a2, rerun a2 pipeline. (Test the command currently reported as an error by a2.)

dcb0f49

rerun a2 pipeline

0afaeff

rerun a2 pipeline

af13057

rerun a2 pipeline

90434cb

rerun pipeline

e99d181

adjust buffsize and add intranode for rerunning the pipeline

bde63e2

add mix test on a2

9ab118d

only rerun intranode on a2

68ee520

add intranode on a2

16804d5

fix intranode num-tokens, rerun a2 pipeline

caa79e6

rerun a3 for intranode experts

9033065

rerun a3

c6838bf

rerun a3

1ada5c9

rerun a3

7912b1d

rerun a3

f9895ec

rerun a3

fc7753e

a3

cbf7dbb

a3

0f72fa6

delete all experts test

da98fac

Yael-X approved these changes Feb 2, 2026


Yael-X merged commit 7f8f943 into sgl-project:main Feb 2, 2026
5 checks passed

zhuyutong332 deleted the update_ci branch February 9, 2026 02:42

Yael-X merged 59 commits into sgl-project:main from zhuyutong332:update_ci