[Feat] add chunk_gated_delta_rule triton support #232
iforgetmyname merged 4 commits into sgl-project:main
Conversation
Code Review
This pull request introduces Triton kernel implementations for chunk_gated_delta_rule to support NPU hardware, which is a significant feature addition. The changes include new Triton kernels for various sub-operations, utility functions, and corresponding tests.
My review identified several critical bugs in the Triton kernels related to incorrect handling of batching and parallelism, which would lead to runtime errors or incorrect outputs for batch sizes greater than one. I have also pointed out some areas for improvement in code clarity, such as removing debug prints and documenting magic numbers used for performance tuning. The overall structure is good, but the identified bugs need to be addressed to ensure correctness and performance.
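For reviewers less familiar with the operator, the semantics the chunked Triton kernels are expected to reproduce can be sketched as a sequential (non-chunked) reference in plain NumPy. This is a hedged sketch of the standard gated delta rule recurrence, not the PR's actual kernel code; the function name, argument layout (single head, time-major), and use of `exp(g)` as the decay gate are assumptions for illustration.

```python
import numpy as np

def gated_delta_rule_ref(q, k, v, g, beta):
    """Sequential reference for the gated delta rule (single head).

    q, k: (T, d_k)   queries / keys
    v:    (T, d_v)   values
    g:    (T,)       log-space decay gate, state is scaled by exp(g[t])
    beta: (T,)       per-step write strength for the delta update
    Returns o: (T, d_v) with o_t = S_t^T q_t.
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))          # recurrent memory state
    o = np.empty((T, d_v))
    for t in range(T):
        S = np.exp(g[t]) * S          # gated decay of the old state
        # delta rule: correct the state toward v_t along direction k_t
        S = S + beta[t] * np.outer(k[t], v[t] - S.T @ k[t])
        o[t] = S.T @ q[t]             # read out with the query
    return o
```

A chunked kernel partitions the T dimension into fixed-size chunks (64 here), computes intra-chunk interactions with matmuls, and carries `S` across chunk boundaries; the batching bugs flagged above concern how independent sequences map onto the Triton launch grid, since each batch element must get its own `S`.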
…unning * upstream/main:
- rework release build (sgl-project#237)
- release build (sgl-project#231)
- Add two mixed-race tests: normal and low latency, normal and fused deep moe. (sgl-project#206)
- [Feat] add chunk_gated_delta_rule triton support (sgl-project#232)
- [Bugfix] add padding cases for causal_conv1d_update (sgl-project#235)
- [DFX] Adaptable to multiple model validations for fused moe (sgl-project#229)
- Add swiglu_oai for GPT-OSS (sgl-project#233)
…into main * 'main' of https://github.com/sgl-project/sgl-kernel-npu: (44 commits)
- fix a2 deepep doc (sgl-project#279)
- prepare build and release for a2 (sgl-project#273)
- add deepep a2 doc (sgl-project#277)
- Add the long-sequence ant migration feature for the prefill combine operator. (sgl-project#267)
- Fixing Chinese character encoding issues (sgl-project#275)
- [Bugfix] fix TorchNpuHelper rename bugs (sgl-project#265)
- qwen3-next op optimize (sgl-project#257)
- fixup bug in conv1d_update_fn (sgl-project#259)
- add long sequence feature for normal deep_ep (sgl-project#254)
- modify md file (sgl-project#255)
- sgl-kernel-npu add release version (sgl-project#253)
- add a script for generalize test (sgl-project#131)
- fixing release ci (sgl-project#248)
- fix normal and low_latency layerd rdma_data_size when mixed running (sgl-project#246)
- Fixing the issue where the A2 notify_dispatch operator gets stuck on cann8.3 (sgl-project#245)
- rework release build (sgl-project#237)
- release build (sgl-project#231)
- Add two mixed-race tests: normal and low latency, normal and fused deep moe. (sgl-project#206)
- [Feat] add chunk_gated_delta_rule triton support (sgl-project#232)
- [Bugfix] add padding cases for causal_conv1d_update (sgl-project#235)
- ...
[TODO] Currently these merged ops only support VARLEN with head_dim 128 and chunk_size 64 (for Qwen-Next-80B).
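Given that limitation, callers may want to fail fast on unsupported shapes before dispatching to the kernels. The sketch below is a hypothetical guard: the function name `check_chunk_gated_delta_rule_inputs` and its signature are invented for illustration and do not exist in this PR.

```python
def check_chunk_gated_delta_rule_inputs(head_dim: int, chunk_size: int) -> None:
    """Hypothetical shape guard reflecting the TODO above.

    Only head_dim=128 with chunk_size=64 (the Qwen-Next-80B
    configuration) is currently supported by the merged ops.
    """
    if head_dim != 128 or chunk_size != 64:
        raise NotImplementedError(
            "chunk_gated_delta_rule currently supports only head_dim=128 "
            f"and chunk_size=64; got head_dim={head_dim}, chunk_size={chunk_size}"
        )
```

Raising `NotImplementedError` at the Python entry point gives a clear message instead of a Triton-level failure or silently wrong output.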