Releases: sgl-project/sgl-kernel-npu
20251206
What's Changed
- add catlass ops demo by @ltcs11 in #200
- Add release package dependencies by @monkeyLoveding in #215
- bugfix: The rint interface causes UB to be unaligned by @chenxu140 in #208
- add pybind11 by @monkeyLoveding in #217
- Upload wheel to release by @BourneSun0527 in #218
- Upload wheel to release (#218) by @BourneSun0527 in #220
- optimize internode_dispatch host-bound overhead by @zuje123 in #211
- Release workflow switch to cpu by @BourneSun0527 in #222
- switch release machine by @BourneSun0527 in #224
- Add cann path to LD_LIBRARY_PATH by @BourneSun0527 in #225
- [Feat] Lightning indexer op & GE helper engineering by @randgun in #203
- Fix missing .so files by @monkeyLoveding in #226
- Try fixing missing .so files by @monkeyLoveding in #228
- add sinks_attention for GPT-OSS by @Todobe in #216
- add zero_experts_compute_identity by @Todobe in #214
- Add performance testing section to the moe script by @goosj in #198
- normal_dispatch num_recv_tokens_per_expert_list support prefixSum by @zuje123 in #221
- A2 dispatch/combine layered operator adaptation for SGLang interface by @oagniqgnat in #209
- Add swiglu_oai for GPT-OSS by @Todobe in #233
- [DFX] Adaptable to multiple model validations for fused moe by @kaniel-outis in #229
- [Bugfix] add padding cases for causal_conv1d_update by @ltcs11 in #235
- [Feat] add chunk_gated_delta_rule triton support by @ltcs11 in #232
- Add two mixed tests: normal + low latency, and normal + fused deep moe by @goosj in #206
- debug deepep build by @BourneSun0527 in #231
- rework release build by @iforgetmyname in #237
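Several entries above touch MoE token dispatch (e.g. the prefixSum support for num_recv_tokens_per_expert_list). As background only, here is a minimal sketch of the idea; `expert_offsets` is a hypothetical name, not this repo's API:

```python
def expert_offsets(num_recv_tokens_per_expert):
    # Exclusive prefix sum: offsets[e] is where expert e's tokens start
    # in the packed receive buffer; the final entry is the total count.
    offsets = [0]
    for n in num_recv_tokens_per_expert:
        offsets.append(offsets[-1] + n)
    return offsets
```

Emitting offsets instead of raw counts lets the consumer index each expert's slice without a second pass.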
New Contributors
- @ltcs11 made their first contribution in #200
- @monkeyLoveding made their first contribution in #215
Full Changelog: 2025112...2025120
20251128
What's Changed
- add test internode for deepep by @zuje123 in #193
- Support run normal mode deepep on a single A2 machine by @luanyundu in #201
- [Test] Testing the generalization of fused moe by @kaniel-outis in #167
- Add whl packages to Github Release by @BourneSun0527 in #204
- Add two scripts by @DubhepPan in #119
- support long cat on a3 by @luanyundu in #182
- calculate dispatch normal input parameters using npu instead of cpu by @lih827 in #177
- Add alloc_extend_kernel by @hw-csong in #196
- Modify deepep README_CN.md by @oagniqgnat in #187
- notify_dispatch kernel change magic from int32_t to uint64_t by @zuje123 in #202
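The dispatch-parameter work above (computing dispatch inputs on the NPU rather than the CPU) revolves around per-expert token counting. A hedged host-side sketch of that counting step, with a hypothetical helper name and the common convention that -1 marks a masked top-k slot:

```python
def count_tokens_per_expert(topk_idx, num_experts):
    # Count how many (token, expert) pairs target each expert;
    # a negative index marks a slot that should not be dispatched.
    counts = [0] * num_experts
    for token_experts in topk_idx:
        for e in token_experts:
            if e >= 0:
                counts[e] += 1
    return counts
```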
Full Changelog: 2025112...2025112
20251120
What's Changed
- dispatch and combine batchsize support 4096 for A2 by @ruiqiangworking in #173
- remove redundant check by @ruiqiangworking in #175
- optimize deepep setup, package name with cann version by @zuje123 in #178
- deepep low_latency d&c support a2 single server by @zuje123 in #176
- Add README files for mlapo and batch_transpose_matmul by @randgun in #104
- Support device with different counts of AICore (FusedDeepMoe operator) by @wangqiankun13 in #180
- Add triton decode attention kernels by @RuixuanZhang06 in #184
- fix cann version check by @hustmf in #188
- Update the HCCL_BUFFSIZE verification for moe by @goosj in #183
- Fix bug in the transfer_kv op by @husf1130 in #194
- add_norm_bias and split_qkv_norm_rope for qwen3 by @chenxu140 in #157
- [Chore] Upgrade CANN to 8.3.RC1 by @iforgetmyname in #195
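The HCCL_BUFFSIZE verification mentioned above amounts to an environment check before launching communication kernels. A minimal sketch, assuming the variable holds a size in MB; `hccl_buffsize_ok` is a hypothetical helper, not this repo's function:

```python
import os

def hccl_buffsize_ok(min_mb):
    # Hypothetical check: treat an unset or non-numeric HCCL_BUFFSIZE,
    # or one below the required minimum (in MB), as a failure.
    val = os.environ.get("HCCL_BUFFSIZE")
    if val is None or not val.isdigit():
        return False
    return int(val) >= min_mb
```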
New Contributors
- @hustmf made their first contribution in #188
- @chenxu140 made their first contribution in #157
Full Changelog: 2025111...2025112
20251110
What's Changed
- Added custom low_latency operators for dispatch/combine in the A2 dec… by @oagniqgnat in #166
- deepep support internode api by @zuje123 in #169
- add layout to ops2 directory by @luanyundu in #171
- Modified the deep_ep README and add A2 operator performance data. by @oagniqgnat in #168
- feat: add verify_tree_greedy_kernel triton kernel by @ranjiewen in #165
- optimize a2 layered combine kernel code by @ruiqiangworking in #172
- feat: tiny bugfix & performance optimization by @Yael-X in #170
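The verify_tree_greedy_kernel above belongs to speculative-decoding verification. A hedged, linear-chain simplification of the idea (the real kernel verifies a token tree on-device; `verify_greedy` is an illustrative name, not the kernel's API):

```python
def verify_greedy(draft_tokens, target_greedy_tokens):
    # Accept draft tokens left to right while each matches the target
    # model's greedy (argmax) choice at that position; stop at the
    # first mismatch and return how many were accepted.
    accepted = 0
    for d, t in zip(draft_tokens, target_greedy_tokens):
        if d != t:
            break
        accepted += 1
    return accepted
```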
New Contributors
- @ruiqiangworking made their first contribution in #172
Full Changelog: 2025110...2025111
20251106
What's Changed
- Add dependency on the moe header file of CANN by @DubhepPan in #152
- support small bs = 1 or 2 by @wangyibo1005 in #150
- feat: adapt x86_64 compilation by @Yael-X in #143
- [DFX] Compatible with CANN 8.2 and CANN 8.3 by @kaniel-outis in #158
- add mla_preprocess test script by @LinyuanLi0046 in #153
- [DFX] adapt cann8.3 by @kaniel-outis in #159
- [bugfix] swiglu quant by @Liwansi in #162
- [New Ops] build tree efficient by @hw-csong in #161
- support shallow fused topk=-1 by @wangyibo1005 in #160
- support kvcacheio by @husf1130 in #163
- improve layout kernel on a2 by @luanyundu in #164
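The swiglu quant bugfix above concerns the SwiGLU activation used throughout these MoE kernels. As a reference point only, the standard (unquantized, unfused) formula in plain Python:

```python
import math

def swiglu(gate, up):
    # SwiGLU: silu(gate) * up, elementwise,
    # where silu(g) = g * sigmoid(g) = g / (1 + exp(-g)).
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
```

The fused device kernels compute this together with quantization in one pass; this sketch shows only the math.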
New Contributors
- @DubhepPan made their first contribution in #152
- @Liwansi made their first contribution in #162
- @hw-csong made their first contribution in #161
Full Changelog: 2025103...2025110
20251030
What's Changed
- add a2 dispatch layout and update its test by @luanyundu in #149
- support topk=-1 by @wangyibo1005 in #132
- add env to decide whether send out prefix sum or not by @luanyundu in #151
- refactor: make hiddenStateDim a class member in MlaTilingData, Follow up closed PR#82 by @LinyuanLi0046 in #133
- support cachemode int8_nzcache with bf16 in mla_preprocess by @LinyuanLi0046 in #135
- add op transfer_kv_dim_exchange by @husf1130 in #148
- impl fused_swiglu_quant with group_list for deepep-low-latency by @xiaobaicxy in #155
- [Kernel] add Flash-Linear-Attention/layernorm_gated Triton op by @iforgetmyname in #154
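The fused_swiglu_quant entry above pairs the activation with low-precision output. A hedged sketch of the quantization half, using common symmetric per-token int8 scaling (`quant_int8_per_token` is an illustrative name, not this repo's API):

```python
def quant_int8_per_token(x):
    # Symmetric per-token int8 quantization: one scale per token vector,
    # chosen so the largest magnitude maps to 127.
    amax = max(abs(v) for v in x)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in x]
    return q, scale
```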
New Contributors
- @LinyuanLi0046 made their first contribution in #133
- @husf1130 made their first contribution in #148
- @xiaobaicxy made their first contribution in #155
Full Changelog: 2025102...2025103
20251023
What's Changed
- Change the padding generation from randperm back to arange by @oagniqgnat in #140
- LoRA: moving kernels from vllm-ascend repo by @vlserov in #128
- Update README.md of DeepEp by @goosj in #144
Full Changelog: 2025102...2025102
20251022
What's Changed
- Update README.md: Add performance of normal and low latency dispatch/combine by @oagniqgnat in #106
- Support debug info for build by @jia-rundong in #99
- Update README by @oagniqgnat in #115
- Synchronous fusion moe by @kaniel-outis in #108
- Fix the severe performance degradation issue of the top9 dispatch in normal mode compared to top8. by @oagniqgnat in #117
- feat: add moe fused operator test draft by @Yael-X in #120
- mlapo fit different hidden state dim by @Todobe in #82
- Do not use download.pytorch.org by @jia-rundong in #121
- EPLB for fused_deep_moe by @wangyibo1005 in #116
- [FusedDeepMoe] Support EPLB by @kaniel-outis in #118
- Support different token hidden sizes and gmm hidden sizes [FusedDeepMoe Operator] by @wangqiankun13 in #123
- Delete left useless code [FusedDeepMoe Operator] by @wangqiankun13 in #129
- update qwen3-next performance kernels by @iforgetmyname in #130
- [Bugfix] Remove unused code that causes split failure in Qwen3-Next by @iforgetmyname in #142
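Two entries above add EPLB (expert-parallel load balancing) support. The core routing idea is picking among physical replicas of a hot logical expert; a minimal sketch under that assumption, with a hypothetical helper name:

```python
def eplb_route(logical_expert, replica_map, token_id):
    # Pick one physical replica of a logical expert; round-robin by
    # token id spreads a hot expert's load across its replicas.
    replicas = replica_map[logical_expert]
    return replicas[token_id % len(replicas)]
```

Real EPLB implementations choose the replica placement from measured load; only the per-token selection step is shown here.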
New Contributors
- @Todobe made their first contribution in #82
- @wangqiankun13 made their first contribution in #123
Full Changelog: 2025092...2025102
20250926
What's Changed
- Reapply fix for hccl buffer use and verify by @zuje123 in #91
- Clean up pending compilation warnings by @oagniqgnat in #86
- Add CI to test args -a deepep by @jia-rundong in #84
- [Feature] Add diagnostic modules to dispatch and combine by @oagniqgnat in #95
- add FusedDeepMoe by @wangyibo1005 in #92
- fused_moe_for_sglang by @kaniel-outis in #94
- fix some bug for fused moe by @kaniel-outis in #102
- unfold layout expert limit and fix bug by @luanyundu in #107
- Added --pressure-test function in test_low_latency by @oagniqgnat in #101
- fix for sglang verl and readme by @lbk-sys in #98
- [feat] add batch_matmul_transpose op by @randgun in #77
- update fused moe readme by @kaniel-outis in #110
- Modify test_low_latency to support int8 quantization testing. by @oagniqgnat in #109
- feat: add env var to switch quant by @Yael-X in #112
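The batch_matmul_transpose op above computes, per batch, a matmul against a transposed right operand. A hedged reference sketch of that contract in plain Python (the actual op runs on the NPU; names and shapes here are illustrative):

```python
def batch_matmul_transpose(a, b):
    # Per batch i: C[i] = A[i] @ B[i]^T, with A[i] of shape (m, k)
    # and B[i] of shape (n, k), so C[i] has shape (m, n).
    out = []
    for A, B in zip(a, b):
        out.append([[sum(x * y for x, y in zip(row, col)) for col in B]
                    for row in A])
    return out
```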
New Contributors
- @oagniqgnat made their first contribution in #86
- @wangyibo1005 made their first contribution in #92
- @kaniel-outis made their first contribution in #94
Full Changelog: 2025091...2025092
20250913
What's Changed
- mlapo support bf16 KV Cache NZ format by @shengzhaotian in #79
- Fix the memory verification issue within intranode dispatch by @lih827 in #83
- [Feature] add fla and mamba kernels by @iforgetmyname in #87
- Revert "Fix the memory verification issue within intranode dispatch" by @iforgetmyname in #88
- Revert "Separate the buffers used by D/C and notify_dispatch to avoid conflicts" by @iforgetmyname in #89
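The fla and mamba kernels added above center on a gated linear recurrence. As background, its sequential form in plain Python (the device kernels compute this in parallel with chunking; this sketch is only the defining recurrence):

```python
def gated_linear_scan(a, b, x):
    # Sequential form of the gated recurrence h_t = a_t * h_{t-1} + b_t * x_t
    # underlying mamba/FLA-style linear-attention kernels.
    h = 0.0
    out = []
    for at, bt, xt in zip(a, b, x):
        h = at * h + bt * xt
        out.append(h)
    return out
```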
Full Changelog: 2025090...2025091