Cover the workflows cases on a3 #321
Merged
Yael-X merged 59 commits into sgl-project:main on Feb 2, 2026
Yael-X reviewed on Jan 23, 2026
* upstream/main:
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
  (test) add solve_tril from upstream (sgl-project#339)
  Add AscendC triangular inverse (sgl-project#332)
  support the situation that topk maybe -1 on machine A3 (sgl-project#313)
  chunk_gated_delta_rule_npu output final state (sgl-project#341)
  The environment variable DEEPEP_HCCL_BUFFSIZE is added, and the priority of DEEPEP_HCCL_BUFFSIZE is higher than that of HCCL_BUFFSIZE. (sgl-project#329)
  Added the low_latency operator API documentation. (sgl-project#337)
  Added the verification of num_max_dispatch_tokens_per_rank to the decode operator adaptation layer. (sgl-project#330)
  Document get_dispatch_layout API (sgl-project#338)
  [Doc] add fused deep moe doc (sgl-project#335)
  add deepep normal api doc (sgl-project#336)
  remove the limit that A2 internode only support topk 8 (sgl-project#323)
  Optimize the performance of the Combine Ant Moving function and the use of HCCL buffer (sgl-project#314)
  deepep adapt custom cann installation path (sgl-project#327)
  [Chore] CANN version bump to 8.5.0 (sgl-project#326)
  add dfx for operator FusedDeepMoe (sgl-project#317)
  Integrate ccache for faster compilation (sgl-project#318)
Yael-X approved these changes on Feb 2, 2026
1329009851 added a commit to 1329009851/sgl-kernel-npu that referenced this pull request on Feb 11, 2026
…-npu into sgl-cmake2
* 'sgl-cmake2' of https://github.com/1329009851/sgl-kernel-npu:
  CI execution requirements for separating a2 and a3 (sgl-project#367)
  Fix the bug that total expert num greater than 256 or local expert num is less than 8 (sgl-project#364)
  adapt ant moving to A2 single machine (sgl-project#362)
  reset ci -- run test mixed running for experts on a2. (sgl-project#365)
  Revert "Build the deepep package with the chip model included. (sgl-project#274)" (sgl-project#363)
  fix:buffer control (sgl-project#361)
  Build the deepep package with the chip model included. (sgl-project#274)
  bugfix wrong packages build dir (sgl-project#360)
  bump version to 2026.02.01 (sgl-project#359)
  Cover the workflows cases on a3 (sgl-project#321)
  release follows naming convention (sgl-project#356)
  Modify notifydispatch to support DEEPEP_NORMAL_LONG_SEQ_ROUND up to 128. (sgl-project#352)
  fix the hanging bug (sgl-project#355)
  [Bugfix] Fix build script working with cann 8.5.0 (sgl-project#354)
  Modify the description of DeepEP in the README file. (sgl-project#348)
  Revert "Add scripts for building CMake files (sgl-project#344)" (sgl-project#353)
  Add scripts for building CMake files (sgl-project#344)
  Support x86_64 and aarch64 binary release (sgl-project#325)
  add function for deep-ep tests (sgl-project#301)
  [Doc] Improved README.md content and English grammar and integrated the DeepWiki badge for Ask AI (sgl-project#345)
zzx-study pushed a commit to zzx-study/sgl-kernel-npu that referenced this pull request on Feb 28, 2026
* Cover the cases on a3, include intranode, low latency, fused deep moe, normal and low latency
* fix BUFFSIZE of intranode
* cleancode error
* add test-build-deep
* fix low latency assert error
* cleancode
* fix intranode stuck when num-tokens=1
* fix min buffsize
* fix min buffsize
* fix min buffsize
* delete intranode test with little bs.
* rerun after pipeline error
* rerun pipeline
* Separate the tests under the two test_intranode conditions.
* rerun pipeline
* del low latency test when topk=1
* When topk=1, no data (na) is available for low latency. Therefore, the case where topk=1 is excluded.
* create a2 ci.
* add num-processes when run on a2
* fix syntax errors.
* rerun the pipeline.
* Replacing the image of a2 test.
* rerun the pipeline
* rerun the pipeline
* fix buffsize and edit of a2
* fix buffsize.
* The value of MOE_ENABLE_TOPK_NEG_ONE is changed to match the new dynamic token code.
* add args enable-dynamic-tokens
* a2 rerun pipeline
* replace a2 images, rerun a2 pipeline
* replace a2 image, rerun pipeline
* rerun a2 pipeline
* rerun a2 pipeline
* rerun a3 pipeline
* linting
* replace a2 image, rerun a2 pipeline
* rerun a2 pipeline
* add mix test on a2, rerun a2 pipeline
* add continue-on-error: true. rerun a2 pipeline.
* add normal test on a2, rerun a2 pipeline. (Test the command currently reported as an error by a2.)
* rerun a2 pipeline
* rerun a2 pipeline
* rerun a2 pipeline
* rerun pipeline
* adjust buffsize and add intranode for rerunning the pipeline
* add mix test on a2
* only rerun intranode on a2
* add intranode on a2
* fix intranode num-tokens, rerun a2 pipeline
* rerun a3 for intranode experts
* rerun a3
* rerun a3
* rerun a3
* rerun a3
* rerun a3
* a3
* a3
* delete all experts test
Cover the test cases on a3, including intranode, low latency, fused deep moe, and normal and low latency.
a2:
test_intranode, with enable-dynamic-tokens enabled
[tuning] Dispatch (BF16) 67.22 GB/s (HCCS), avg_t: 7039.39 us
[tuning] Combine 80.31 GB/s (HCCS), avg_t: 5892.55 us
test_low_latency, with enable-dynamic-tokens enabled
Average Dispatch bandwidth: 24.27 GB/s, avg_t=622.38 us
Average Combine bandwidth: 66.99 GB/s, avg_t=436.33 us
test_normal_and_low_latency, with enable-dynamic-tokens enabled
Start executing normal test loop 99 ...
End executing normal test loop 99 ...
Start executing low latency test loop 99 ...
End executing low latency test loop 99 ...
a3:
test_intranode, with enable-dynamic-tokens enabled
[tuning] Dispatch (BF16) 95.95 GB/s (HCCS), avg_t: 4948.90 us
[tuning] Combine 95.89 GB/s (HCCS), avg_t: 4951.82 us
test_low_latency, with enable-dynamic-tokens enabled
Average Dispatch bandwidth: 27.38 GB/s, avg_t=560.26 us
Average Combine bandwidth: 89.44 GB/s, avg_t=331.96 us
test_normal_and_low_latency, with enable-dynamic-tokens enabled
Start executing normal test loop 99 ...
End executing normal test loop 99 ...
Start executing low latency test loop 99 ...
End executing low latency test loop 99 ...
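
To compare the a2 and a3 numbers above side by side, the bandwidth lines can be parsed with a small script. This is a minimal sketch, not part of the repository: the helper names (parse_line, parse_platform, bandwidth_ratio) are hypothetical, the log lines are copied verbatim from the results above, and labelling the "[tuning]" lines as intranode and the "Average ..." lines as low_latency is an assumption based on the test each one appears under.

```python
import re

# Log lines copied verbatim from the a2 and a3 results above.
LOGS = {
    "a2": [
        "[tuning] Dispatch (BF16) 67.22 GB/s (HCCS), avg_t: 7039.39 us",
        "[tuning] Combine 80.31 GB/s (HCCS), avg_t: 5892.55 us",
        "Average Dispatch bandwidth: 24.27 GB/s, avg_t=622.38 us",
        "Average Combine bandwidth: 66.99 GB/s, avg_t=436.33 us",
    ],
    "a3": [
        "[tuning] Dispatch (BF16) 95.95 GB/s (HCCS), avg_t: 4948.90 us",
        "[tuning] Combine 95.89 GB/s (HCCS), avg_t: 4951.82 us",
        "Average Dispatch bandwidth: 27.38 GB/s, avg_t=560.26 us",
        "Average Combine bandwidth: 89.44 GB/s, avg_t=331.96 us",
    ],
}

# One pattern for both line shapes: the optional "[tuning]" prefix marks
# the intranode output, and avg_t is followed by either ":" or "=".
PATTERN = re.compile(
    r"(?P<tuning>\[tuning\]\s+)?(?:Average\s+)?(?P<op>Dispatch|Combine)\b.*?"
    r"(?P<bw>\d+(?:\.\d+)?) GB/s.*?avg_t[:=]\s*(?P<t>\d+(?:\.\d+)?) us"
)


def parse_line(line):
    """Return ((test, op), (bandwidth_gbps, avg_time_us)), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    test = "intranode" if match.group("tuning") else "low_latency"
    key = (test, match.group("op").lower())
    return key, (float(match.group("bw")), float(match.group("t")))


def parse_platform(lines):
    """Collect every parsable bandwidth line for one platform into a dict."""
    results = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            key, value = parsed
            results[key] = value
    return results


def bandwidth_ratio(a2, a3):
    """a3 bandwidth divided by a2 bandwidth, for keys present on both."""
    return {key: a3[key][0] / a2[key][0] for key in a2 if key in a3}


a2 = parse_platform(LOGS["a2"])
a3 = parse_platform(LOGS["a3"])
for (test, op), ratio in sorted(bandwidth_ratio(a2, a3).items()):
    print(f"{test:>12} {op:<8} a3/a2 bandwidth ratio: {ratio:.2f}")
```

The "Start executing ..." and "End executing ..." loop lines carry no bandwidth figure, so parse_line returns None for them and they are skipped.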