[AMD] support two batch overlapping for mori ep #17953
HaiShaw merged 23 commits into sgl-project:main from
Conversation
Summary of Changes (Gemini Code Assist): This pull request upgrades the Mori Expert Parallelism (EP) backend with features for performance optimization. It introduces an asynchronous API to reduce latency and leverages multiple HIP streams to enable efficient overlapping of communication and computation. These enhancements improve the overall throughput and responsiveness of models using Mori EP, particularly in a two-batch overlapping context.
Code Review
This pull request introduces support for two-batch overlapping for the mori expert parallelism backend, primarily targeting AMD GPUs. The changes are extensive, adding an async API for low latency scenarios and multi-hip stream support to overlap communication and computation. Overall, the implementation is solid and aligns with the PR's objectives. I've identified a critical bug that could cause crashes on non-CUDA platforms and have also included a few suggestions to improve code style and maintainability.
@billishyahao conflicts?
@kkHuang-amd Please review the aiter backend for performance implications.
/tag-and-rerun-ci
```python
warp_num_per_block=8,
block_num=64,
rdma_block_num=32,
),
```
Can these parameters be made tunable for different chips (all 3 sections above)?
Theoretically yes! But so far we only provide the end user two ways: (1) manually decide the parameters, as done here; (2) let mori decide them automatically by setting the env var MORI_EP_LAUNCH_CONFIG_MODE=auto.
For reference:
https://github.com/ROCm/mori/blob/b95cdbd6e0f36a61ae75a570da97d5c308a3fa85/python/mori/ops/dispatch_combine.py#L160C51-L181
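The two configuration paths described above can be sketched as follows. This is a hypothetical illustration, not mori's actual API: the parameter names come from the reviewed snippet, `MORI_EP_LAUNCH_CONFIG_MODE` is the env var mentioned in the reply, and `resolve_launch_config` is an invented helper name.

```python
import os

# Default values taken from the reviewed snippet above.
DEFAULT_CONFIG = {"warp_num_per_block": 8, "block_num": 64, "rdma_block_num": 32}


def resolve_launch_config(user_config=None):
    """Hypothetical sketch: pick launch parameters.

    In "auto" mode, mori selects chip-appropriate values internally, so we
    return None to signal "let mori decide"; otherwise use the user's
    explicit values, falling back to the defaults.
    """
    if os.environ.get("MORI_EP_LAUNCH_CONFIG_MODE") == "auto":
        return None
    return user_config or DEFAULT_CONFIG
```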
…e_fused_experts

The `op_output` method (TBO path) introduced in PR #17953 used only `_use_aiter` to decide whether to skip `routed_scaling_factor`, while `_forward_moe_fused_experts` (non-TBO path) uses the more complete condition `self.experts.should_fuse_routed_scaling_factor_in_topk or _use_aiter`. This inconsistency could cause incorrect scaling behavior when TBO is enabled with certain quantization backends. Align `op_output` to use the same condition as `_forward_moe_fused_experts` so both paths handle `routed_scaling_factor` identically.

Co-authored-by: Cursor <cursoragent@cursor.com>
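The condition alignment in the commit message above can be sketched as a tiny helper. The attribute names are taken from that message; this is not the actual sglang code, just the boolean logic both paths should share:

```python
def should_skip_routed_scaling_factor(fuse_in_topk: bool, use_aiter: bool) -> bool:
    # Non-TBO path (_forward_moe_fused_experts) already used this full
    # condition; the fix makes the TBO path (op_output) use it as well,
    # instead of checking use_aiter alone.
    return fuse_in_topk or use_aiter
```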
PR sgl-project#17953 added EpDispatchCombineKernelType.AsyncLL but the CI Docker image's mori package doesn't have it yet, causing an AttributeError on every mori EP test. Guard the LOW_LATENCY config behind a hasattr check and fall back to INTRA_NODE/INTER_NODE mode when AsyncLL is unavailable.
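The `hasattr` guard described above can be sketched like this. The enum is a minimal stand-in for mori's `EpDispatchCombineKernelType` with `AsyncLL` deliberately absent (as in the older CI image); `pick_low_latency_kernel` is a hypothetical helper, not sglang's actual code:

```python
from enum import Enum


class EpDispatchCombineKernelType(Enum):
    # Stand-in: older mori releases lack the AsyncLL member.
    IntraNode = 0
    InterNode = 1


def pick_low_latency_kernel(inter_node: bool):
    # Guard the optional member; fall back to INTRA_NODE/INTER_NODE
    # mode when AsyncLL is unavailable.
    if hasattr(EpDispatchCombineKernelType, "AsyncLL"):
        return EpDispatchCombineKernelType.AsyncLL
    return (EpDispatchCombineKernelType.InterNode if inter_node
            else EpDispatchCombineKernelType.IntraNode)
```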
Hi @billishyahao @HaiShaw, I need to revert this PR in #19161 since it will break the current CI.
Hi @billishyahao, since I saw the commands you ran in the experiment used … CC: @HaiShaw
Hi @hubertlu-tw, thanks for the inquiry. Please check out the post-init process in server_args.py:

```python
# Handle data parallelism.
self._handle_data_parallelism()  # self.chunked_prefill_size is adjusted to chunked_prefill_size / dp_size
...
self._handle_a2a_moe()  # asserts self.chunked_prefill_size <= SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK
```

(sglang/python/sglang/srt/server_args.py, lines 763 to 771 at 9c11a7a)

So for your case, with SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=8192 and --chunked-prefill-size 65536, the new chunked_prefill_size will be 65536 / dp_size (8) = 8192, so it is legit.
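The arithmetic above can be sketched with two tiny helpers. The function names are invented for illustration; the underlying logic mirrors what `_handle_data_parallelism` and `_handle_a2a_moe` are described as doing:

```python
def effective_chunked_prefill_size(chunked_prefill_size: int, dp_size: int) -> int:
    # _handle_data_parallelism splits the global chunk across DP ranks.
    return chunked_prefill_size // dp_size


def passes_mori_check(size: int, max_dispatch_tokens_per_rank: int) -> bool:
    # _handle_a2a_moe asserts the per-rank size does not exceed
    # SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
    return size <= max_dispatch_tokens_per_rank
```

With the values from the conversation, 65536 split across 8 DP ranks gives 8192, which sits exactly at the 8192-token limit.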
Thank you for answering my question!
Motivation
Co-authored with @kkHuang-amd @ZhaiFeiyue @Duyi-Wang
cc @HaiShaw
This patch supports TBO (two-batch overlapping) for mori EP. It can be divided into the following changes:
(1) We introduce the MORI async API to support a CU-free method for the low-latency scenario.
(2) We introduce multiple HIP streams to enable communication-computation overlapping for the high-throughput scenario.
(3) The relation between the sglang arguments and the underlying configs covers the following combinations:
- --deepep-mode normal
- --deepep-mode normal --two-batch-overlap
- --deepep-mode low_latency
- --deepep-mode low_latency --two-batch-overlap
- --deepep-mode auto

Unit tests are to be added.
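The overlapping idea in change (2) can be illustrated with a small sketch. This is purely conceptual: the real backend overlaps work on separate HIP streams, whereas here a Python thread stands in for the side stream, and `overlap_two_batches`, `comm_fn`, and `compute_fn` are invented names:

```python
import threading


def overlap_two_batches(comm_fn, compute_fn, mb0, mb1):
    """Conceptual TBO sketch: while micro-batch 0 is computing, micro-batch
    1's communication (e.g. dispatch all-to-all) is in flight."""
    result = {}

    def _comm():
        result["comm"] = comm_fn(mb1)

    side = threading.Thread(target=_comm)
    side.start()                          # mb1 communication in flight
    result["compute"] = compute_fn(mb0)   # mb0 computation overlaps it
    side.join()
    return result["compute"], result["comm"]
```

For example, with a toy `comm_fn = lambda x: x + 1` and `compute_fn = lambda x: x * 2`, both results are produced while the two phases run concurrently.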
Accuracy Tests
Accuracy check passes on the gsm8k dataset for the following configurations (the server command for each is collapsed under "Click to expand" in the original post):
- DSR1 FP8 EP8: aiter backend + Mori normal mode + fp8 dispatch + eager
- DSR1 FP8 EP8: aiter backend + Mori normal mode + fp8 dispatch + graph
- DSR1 FP8 EP8: aiter backend + Mori normal mode + fp8 dispatch + non-persist mla
- DSR1 FP8 EP8: aiter backend + Mori low_latency mode + fp8 dispatch + eager
- DSR1 FP8 EP8: aiter backend + Mori low_latency mode + fp8 dispatch + graph
- DSR1 FP8 EP8: aiter backend + Mori low_latency mode + fp8 dispatch + two-batch-overlap + eager
- DSR1 FP8 EP8: aiter backend + Mori low_latency mode + fp8 dispatch + two-batch-overlap + graph
- DSR1 FP8 EP8: aiter backend + Mori low_latency mode + fp8 dispatch + two-batch-overlap + graph + non-persist mla
- DSR1 FP8 EP8: aiter backend + Mori low_latency mode + fp8 dispatch + two-batch-overlap + graph + MTP
Benchmarking, Performance Gain and Profiling
DSR1 FP8 1k/1k stress test: +25%

Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci