
[diffusion] hardware: support diffusion models on MTGPU (multi-GPU, 5/N)#17318

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from yeahdongcn:xd/diffusion_multi_gpus
Feb 3, 2026

Conversation


@yeahdongcn yeahdongcn commented Jan 19, 2026

Motivation

This PR is the 5th in a series of pull requests (tracked in #16565) to add full support for Moore Threads GPUs, leveraging MUSA (Meta-computing Unified System Architecture) to accelerate LLM inference.

Modifications

  • python/sglang/multimodal_gen/runtime/distributed/device_communicators/pynccl_wrapper.py:

    • Update error message to include MTHREADS GPUs
  • python/sglang/multimodal_gen/utils.py:

    • Add MUSA backend support in find_nccl_library() to return libmccl.so.2
    • Update comment and error message to include MUSA/MCCL
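The backend-to-library mapping this PR adds to `find_nccl_library()` can be sketched as follows. This is an illustrative sketch, not the actual implementation: the real function in `python/sglang/multimodal_gen/utils.py` probes `torch.version` (e.g. `torch.version.musa`) rather than taking a parameter, and may also honor environment overrides; the toy version below takes the backend name as an argument so it runs standalone.

```python
def find_nccl_library(backend: str) -> str:
    """Map a device backend to its NCCL-compatible collective library.

    Simplified stand-in for the logic this PR extends: the real
    function inspects torch.version instead of taking a parameter.
    """
    libraries = {
        "cuda": "libnccl.so.2",  # NVIDIA GPUs -> NCCL
        "hip": "librccl.so.1",   # AMD GPUs -> RCCL
        "musa": "libmccl.so.2",  # MTHREADS GPUs -> MCCL (added in this PR)
    }
    try:
        return libraries[backend]
    except KeyError:
        raise ValueError(
            f"No NCCL-compatible library known for backend {backend!r}; "
            "supported backends are CUDA (NCCL), ROCm (RCCL), and MUSA (MCCL)."
        ) from None
```

This matches the `Found nccl from library libmccl.so.2` lines in the test log below, where the MUSA backend resolves to MCCL.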

Testing Done

Tested in a clean torch_musa container.

[01-19 11:42:25] Initializing distributed environment with world_size=2, device=musa:0
[01-19 11:42:25] Found nccl from library libmccl.so.2
[01-19 11:42:25] sglang-diffusion is using nccl==2.11.4
[01-19 11:42:45] Found nccl from library libmccl.so.2
[01-19 11:42:45] sglang-diffusion is using nccl==2.11.4

Verified that the video is generated correctly; the full log follows:

root@worker3218:/ws# cat test.py 
from sglang.multimodal_gen import DiffGenerator

def main():
    # Create a diff generator from a pre-trained model
    generator = DiffGenerator.from_pretrained(
        model_path="/home/dist/Wan2.1-T2V-1.3B-Diffusers/",
        num_gpus=2,  # Adjust based on your hardware
    )

    # Generate the video
    video = generator.generate(
        sampling_params_kwargs=dict(
            prompt="A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest.",
            return_frames=True,  # Also return frames from this call (defaults to False)
            output_path="my_videos/",  # Controls where videos are saved
            save_output=True
        )
    )

if __name__ == '__main__':
    main()
root@worker3218:/ws# python test.py 
2026-01-19 11:47:09 | hf_diffusers_utils | 140576816678016 | INFO : Diffusers version: 0.33.0.dev0
2026-01-19 11:47:09 | hf_diffusers_utils | 140576816678016 | INFO : Diffusers version: 0.33.0.dev0
2026-01-19 11:47:09 | registry | 140576816678016 | INFO : Using native sglang backend for model '/home/dist/Wan2.1-T2V-1.3B-Diffusers/'
2026-01-19 11:47:09 | registry | 140576816678016 | INFO : Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_pipeline.WanPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanT2V_1_3B_SamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanT2V480PConfig'>)
[01-19 11:47:09] Automatically enable dit_layerwise_offload for Wan for best performance
[01-19 11:47:09] dit_layerwise_offload is enabled, automatically disabling dit_cpu_offload.
[01-19 11:47:09] Automatically set ulysses_degree=sp_degree=2 for best performance
[01-19 11:47:09] server_args: {"model_path": "/home/dist/Wan2.1-T2V-1.3B-Diffusers/", "backend": "auto", "attention_backend": null, "diffusers_attention_backend": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 2, "tp_size": 1, "sp_degree": 2, "ulysses_degree": 2, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 2, "dist_timeout": null, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": false, "dit_layerwise_offload": true, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30059, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5596, "output_path": "outputs/", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[01-19 11:47:09] Local mode: True
[01-19 11:47:09] Starting server...
[01-19 11:47:17] Scheduler bind at endpoint: tcp://127.0.0.1:5596
[01-19 11:47:17] Initializing distributed environment with world_size=2, device=musa:0
[01-19 11:47:18] Found nccl from library libmccl.so.2
[01-19 11:47:18] sglang-diffusion is using nccl==2.11.4
[01-19 11:47:21] Found nccl from library libmccl.so.2
[01-19 11:47:21] sglang-diffusion is using nccl==2.11.4
[01-19 11:47:21] No pipeline_class_name specified, using model_index.json
[01-19 11:47:21] Diffusers version: 0.33.0.dev0
[01-19 11:47:21] Diffusers version: 0.33.0.dev0
[01-19 11:47:21] Using native sglang backend for model '/home/dist/Wan2.1-T2V-1.3B-Diffusers/'
[01-19 11:47:21] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.wan_pipeline.WanPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.wan.WanT2V_1_3B_SamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.wan.WanT2V480PConfig'>)
[01-19 11:47:21] Using pipeline from model_index.json: WanPipeline
[01-19 11:47:21] Loading pipeline modules...
Loading required modules:   0%|                                    | 0/5 [00:00<?, ?it/s][01-19 11:47:21] Model already exists locally and is complete
[01-19 11:47:21] Model path: /home/dist/Wan2.1-T2V-1.3B-Diffusers/
[01-19 11:47:21] Diffusers version: 0.33.0.dev0
[01-19 11:47:21] Loading pipeline modules from config: {'_class_name': 'WanPipeline', '_diffusers_version': '0.33.0.dev0', 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[01-19 11:47:21] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                                    | 0/5 [00:00<?, ?it/s][01-19 11:47:21] Loading text_encoder from /home/dist/Wan2.1-T2V-1.3B-Diffusers/text_encoder. avail mem: 78.81 GB
[01-19 11:47:29] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 8.03s, 2.6 GiB/s
[01-19 11:47:29] [RunAI Streamer] Overall time to stream 21.2 GiB of all files to cpu: 8.13s, 2.6 GiB/s
[01-19 11:47:56] Loaded text_encoder: FSDPUMT5EncoderModel (sgl-diffusion version). model size: 21.16 GB, avail mem: 78.51 GB
Loading required modules:  20%|█████▌                      | 1/5 [00:34<02:17, 34.42s/it][01-19 11:47:56] Loading tokenizer from /home/dist/Wan2.1-T2V-1.3B-Diffusers/tokenizer. avail mem: 78.51 GB
[01-19 11:47:56] Loaded tokenizer: T5TokenizerFast (sgl-diffusion version). model size: 0.00 GB, avail mem: 78.51 GB
Loading required modules:  40%|███████████▏                | 2/5 [00:35<00:43, 14.53s/it][01-19 11:47:56] Loading vae from /home/dist/Wan2.1-T2V-1.3B-Diffusers/vae. avail mem: 78.51 GB
[01-19 11:47:56] Loaded vae: AutoencoderKLWan (sgl-diffusion version). model size: 0.27 GB, avail mem: 78.51 GB
[01-19 11:47:56] Loading transformer from /home/dist/Wan2.1-T2V-1.3B-Diffusers/transformer. avail mem: 78.51 GB
[01-19 11:47:56] Loading WanTransformer3DModel from 2 safetensors files, default_dtype: torch.bfloat16
[01-19 11:47:56] Using Torch SDPA backend.
Loading required modules:  40%|███████████▏                | 2/5 [00:40<00:49, 16.60s/it][01-19 11:48:03] [RunAI Streamer] Overall time to stream 5.3 GiB of all files to cpu: 6.1s, 888.0 MiB/s
[01-19 11:48:03] [RunAI Streamer] Overall time to stream 5.3 GiB of all files to cpu: 1.48s, 3.6 GiB/s
Loading required modules: 100%|████████████████████████████| 5/5 [00:44<00:00,  8.91s/it]
[01-19 11:48:06] Loaded model with 1.42B parameters
[01-19 11:48:06] Loaded transformer: WanTransformer3DModel (sgl-diffusion version). model size: 2.64 GB, avail mem: 75.66 GB
Loading required modules:  80%|██████████████████████▍     | 4/5 [00:44<00:08,  8.40s/it][01-19 11:48:06] Loading scheduler from /home/dist/Wan2.1-T2V-1.3B-Diffusers/scheduler. avail mem: 75.66 GB
[01-19 11:48:06] Loaded scheduler: UniPCMultistepScheduler (sgl-diffusion version). model size: 0.00 GB, avail mem: 75.66 GB
Loading required modules: 100%|████████████████████████████| 5/5 [00:44<00:00,  8.91s/it]
[01-19 11:48:06] Creating pipeline stages...
[01-19 11:48:06] Using Torch SDPA backend.
[01-19 11:48:06] Pipeline instantiated
[01-19 11:48:06] Enabled layerwise offload for WanTransformer3DModel on modules: ['blocks']
[01-19 11:48:06] Worker 0: Initialized device, model, and distributed environment.
[01-19 11:48:06] Worker 0: Scheduler loop started.
[01-19 11:48:06] Adjusting number of frames from 81 to 85 based on number of GPUs (2)
[01-19 11:48:06] Processing prompt 1/1: A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest.
[01-19 11:48:06] Sampling params:
                       width: 832
                      height: 480
                  num_frames: 85
                      prompt: A curious raccoon peers through a vibrant field of yellow sunflowers, its eyes wide with interest.
                  neg_prompt: Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
                        seed: 42
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 3.0
     embedded_guidance_scale: 6.0
                    n_tokens: None
                  flow_shift: 3.0
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_curious_raccoon_peers_through_a_vibrant_field_of_yellow_sunflowers_its_eyes_wide_with_interest._20260119-114806_f5034b65.mp4
        
[01-19 11:48:06] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[01-19 11:48:06] [InputValidationStage] started...
[01-19 11:48:06] [InputValidationStage] finished in 0.0001 seconds
[01-19 11:48:06] [TextEncodingStage] started...
[01-19 11:48:29] /ws/python/sglang/multimodal_gen/runtime/models/dits/wanvideo.py:453: UserWarning: FlashInfer not available, using Triton fallback for RoPE
  query, key = apply_flashinfer_rope_qk_inplace(

[01-19 11:48:30] [TextEncodingStage] finished in 24.1123 seconds
[01-19 11:48:30] [ConditioningStage] started...
[01-19 11:48:30] [ConditioningStage] finished in 0.0000 seconds
[01-19 11:48:30] [TimestepPreparationStage] started...
[01-19 11:48:30] [TimestepPreparationStage] finished in 0.0006 seconds
[01-19 11:48:30] [LatentPreparationStage] started...
[01-19 11:48:30] [LatentPreparationStage] finished in 0.0040 seconds
[01-19 11:48:30] [DenoisingStage] started...
  0%|                                                             | 0/50 [00:00<?, ?it/s][01-19 11:48:31] /ws/python/sglang/multimodal_gen/runtime/models/dits/wanvideo.py:453: UserWarning: FlashInfer not available, using Triton fallback for RoPE
  query, key = apply_flashinfer_rope_qk_inplace(

100%|████████████████████████████████████████████████████| 50/50 [02:23<00:00,  2.86s/it]
[01-19 11:50:53] [DenoisingStage] average time per step: 2.8612 seconds
[01-19 11:50:53] [DenoisingStage] finished in 143.1266 seconds
[01-19 11:50:53] [DecodingStage] started...
[01-19 11:50:53] /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:321: UserWarning: In musa autocast, but the target dtype is not supported. Disabling autocast.
 musa Autocast only supports dtypes of torch.float16, torch.bfloat16 currently.
  warnings.warn(error_message)

[01-19 11:50:53] /ws/python/sglang/multimodal_gen/runtime/models/vaes/wanvae.py:533: UserWarning: Unsupported qk_head_dim: 384 v_head_dim: 384 for FlashAttention in MUSA backend (Triggered internally at /home/torch_musa/torch_musa/csrc/aten/ops/attention/mudnn/SDPUtils.h:129.)
  x = F.scaled_dot_product_attention(q, k, v)

[01-19 11:50:53] /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:321: UserWarning: In musa autocast, but the target dtype is not supported. Disabling autocast.
 musa Autocast only supports dtypes of torch.float16, torch.bfloat16 currently.
  warnings.warn(error_message)

[01-19 11:50:53] /ws/python/sglang/multimodal_gen/runtime/models/vaes/wanvae.py:533: UserWarning: Unsupported qk_head_dim: 384 v_head_dim: 384 for FlashAttention in MUSA backend (Triggered internally at /home/torch_musa/torch_musa/csrc/aten/ops/attention/mudnn/SDPUtils.h:129.)
  x = F.scaled_dot_product_attention(q, k, v)

[01-19 11:51:00] [DecodingStage] finished in 7.1403 seconds
[01-19 11:51:00] Peak GPU memory: 11.77 GB, Remaining GPU memory at peak: 68.23 GB. Components that can stay resident: ['text_encoder', 'vae', 'transformer']
[01-19 11:51:06] Output saved to outputs/A_curious_raccoon_peers_through_a_vibrant_field_of_yellow_sunflowers_its_eyes_wide_with_interest._20260119-114806_f5034b65.mp4
[01-19 11:51:06] Pixel data generated successfully in 179.78 seconds
[01-19 11:51:06] Completed batch processing. Generated 1 outputs in 179.78 seconds
[01-19 11:51:06] Memory usage - Max peak: 12054.81 MB, Avg peak: 12054.81 MB
[01-19 11:51:06] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
root@worker3218:/ws# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the diffusion SGLang Diffusion label Jan 19, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @yeahdongcn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request initiates support for Moore Threads (MTGPU) GPUs within the diffusion framework. It integrates the MUSA (Meta-computing Unified System Architecture) backend by adapting the NCCL communication library wrapper to recognize and utilize MCCL. The changes ensure that the system can correctly identify and interact with Moore Threads' collective communication library, laying the groundwork for accelerated LLM inference on their hardware.

Highlights

  • MUSA/MCCL Integration: Introduced a new helper function _nccl_to_mccl_func_name to map standard NCCL function names to their MCCL (Moore Threads Collective Communication Library) equivalents, enabling compatibility with MUSA-based GPUs.
  • Dynamic Library Loading: Enhanced the NCCLLibrary class to detect if the loaded library is MCCL (via _is_mccl flag) and dynamically call the appropriate MCCL functions when NCCL functions are invoked.
  • MUSA Backend Detection: Updated the find_nccl_library() utility to recognize the MUSA backend (torch.version.musa) and correctly return the libmccl.so.2 library path.
  • Improved Error Messaging: Modified error messages and comments to explicitly include MTHREADS GPUs and MUSA/MCCL, providing clearer guidance for users encountering library loading issues.


@yeahdongcn yeahdongcn marked this pull request as ready for review January 19, 2026 03:53

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Moore Threads (MTGPU) GPUs by integrating the MUSA backend. The changes primarily involve detecting the MUSA environment to load the appropriate MCCL library (libmccl.so.2) and dynamically mapping NCCL function calls to their MCCL equivalents. The implementation is clean and follows existing patterns in the codebase. I've suggested one improvement to make the MCCL library detection more robust. Overall, this is a good contribution towards expanding hardware support.

@yeahdongcn yeahdongcn changed the title from [diffusion] hardware: support diffusion on MTGPU (multi-GPU, 5/N) to [diffusion] hardware: support diffusion models on MTGPU (multi-GPU, 5/N) on Jan 19, 2026
@yeahdongcn yeahdongcn force-pushed the xd/diffusion_multi_gpus branch from b02eb5b to c9a9d4a Compare January 28, 2026 06:35
@yeahdongcn
Collaborator Author

Rebased onto upstream/main.

@yeahdongcn yeahdongcn force-pushed the xd/diffusion_multi_gpus branch from c9a9d4a to b5317b5 Compare February 2, 2026 02:55
@yeahdongcn
Collaborator Author

Rebased onto upstream/main.

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@yeahdongcn yeahdongcn force-pushed the xd/diffusion_multi_gpus branch from b5317b5 to 7605599 Compare February 2, 2026 10:30
@Kangyan-Zhou Kangyan-Zhou merged commit ec2461b into sgl-project:main Feb 3, 2026
55 of 60 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
…/N) (sgl-project#17318)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
…/N) (sgl-project#17318)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
…/N) (sgl-project#17318)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
…/N) (sgl-project#17318)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

Labels

diffusion SGLang Diffusion mthreads
