
[diffusion] hardware: support diffusion models on MTGPU (doc, 6/N)#17346

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from yeahdongcn:xd/diffusion_doc
Feb 3, 2026
Conversation

@yeahdongcn (Collaborator)

Motivation

This PR is the 6th in a series of pull requests (tracked in #16565) to add full support for Moore Threads GPUs, leveraging MUSA (Meta-computing Unified System Architecture) to accelerate LLM inference.

Modifications

Add Moore Threads (MThreads) / MUSA documentation.

Note: #15592 is required for MUSA to run black-forest-labs/FLUX.1-dev.

Testing Done

Tested in a clean torch_musa container.
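
Before running the generate command below in a fresh container, it can be useful to confirm that the MUSA backend is importable at all. The following is a minimal sketch (not part of this PR): `pick_device` is a hypothetical helper, and it only checks that the `torch_musa` package is importable, not that a Moore Threads GPU is actually present.

```python
import importlib.util


def pick_device() -> str:
    """Return "musa" when the torch_musa extension is importable, else "cpu".

    torch_musa registers the "musa" device type with PyTorch on import.
    NOTE: this sketch only checks package importability, not whether a
    Moore Threads GPU is actually visible to the runtime.
    """
    if importlib.util.find_spec("torch_musa") is not None:
        return "musa"
    return "cpu"


print(pick_device())
```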

root@worker3218:/ws# sglang generate --model-path /home/dist/FLUX.1-dev/ --prompt "A logo With Bold Large text: SGL Diffusion" --save-output
2026-01-19 17:22:06 | hf_diffusers_utils | 139713722844288 | INFO : Diffusers version: 0.30.0.dev0
2026-01-19 17:22:06 | hf_diffusers_utils | 139713722844288 | INFO : Diffusers version: 0.30.0.dev0
2026-01-19 17:22:06 | registry | 139713722844288 | INFO : Using native sglang backend for model '/home/dist/FLUX.1-dev/'
2026-01-19 17:22:06 | registry | 139713722844288 | INFO : Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.flux.FluxPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.flux.FluxSamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.flux.FluxPipelineConfig'>)
[01-19 17:22:06] Disabling some offloading (except dit, text_encoder) for image generation model
[01-19 17:22:06] Port 5555 was unavailable, using port 5597 instead
[01-19 17:22:06] server_args: {"model_path": "/home/dist/FLUX.1-dev/", "backend": "auto", "attention_backend": null, "diffusers_attention_backend": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30067, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5597, "output_path": "outputs/", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[01-19 17:22:06] Local mode: True
[01-19 17:22:06] Starting server...
[01-19 17:22:14] Scheduler bind at endpoint: tcp://127.0.0.1:5597
[01-19 17:22:14] Initializing distributed environment with world_size=1, device=musa:0
[01-19 17:22:14] No pipeline_class_name specified, using model_index.json
[01-19 17:22:14] Diffusers version: 0.30.0.dev0
[01-19 17:22:14] Diffusers version: 0.30.0.dev0
[01-19 17:22:14] Using native sglang backend for model '/home/dist/FLUX.1-dev/'
[01-19 17:22:14] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.flux.FluxPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.flux.FluxSamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.flux.FluxPipelineConfig'>)
[01-19 17:22:14] Using pipeline from model_index.json: FluxPipeline
[01-19 17:22:14] Loading pipeline modules...
[01-19 17:22:14] Model already exists locally and is complete
[01-19 17:22:14] Model path: /home/dist/FLUX.1-dev/
[01-19 17:22:14] Diffusers version: 0.30.0.dev0
[01-19 17:22:14] Loading pipeline modules from config: {'_class_name': 'FluxPipeline', '_diffusers_version': '0.30.0.dev0', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'CLIPTextModel'], 'text_encoder_2': ['transformers', 'T5EncoderModel'], 'tokenizer': ['transformers', 'CLIPTokenizer'], 'tokenizer_2': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'FluxTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[01-19 17:22:14] Loading required components: ['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                               | 0/7 [00:00<?, ?it/s][01-19 17:22:14] Loading text_encoder from /home/dist/FLUX.1-dev/text_encoder. avail mem: 79.30 GB
[01-19 17:22:16] Using Torch SDPA backend.
[01-19 17:22:16] [RunAI Streamer] Overall time to stream 234.7 MiB of all files to cpu: 0.05s, 4.5 GiB/s
[01-19 17:22:20] Loaded text_encoder: FSDPCLIPTextModel (sgl-diffusion version). model size: 0.23 GB, avail mem: 79.05 GB
Loading required modules:  14%|███▎                   | 1/7 [00:05<00:31,  5.29s/it][01-19 17:22:20] Loading text_encoder_2 from /home/dist/FLUX.1-dev/text_encoder_2. avail mem: 79.05 GB
[01-19 17:22:22] [RunAI Streamer] Overall time to stream 8.9 GiB of all files to cpu: 2.02s, 4.4 GiB/s
[01-19 17:22:34] Loaded text_encoder_2: FSDPT5EncoderModel (sgl-diffusion version). model size: 8.87 GB, avail mem: 79.03 GB
Loading required modules:  29%|██████▌                | 2/7 [00:19<00:51, 10.36s/it][01-19 17:22:34] Loading tokenizer from /home/dist/FLUX.1-dev/tokenizer. avail mem: 79.03 GB
[01-19 17:22:34] Loaded tokenizer: CLIPTokenizerFast (sgl-diffusion version). model size: 0.00 GB, avail mem: 79.03 GB
[01-19 17:22:34] Loading tokenizer_2 from /home/dist/FLUX.1-dev/tokenizer_2. avail mem: 79.03 GB
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
[01-19 17:22:34] Loaded tokenizer_2: T5TokenizerFast (sgl-diffusion version). model size: 0.00 GB, avail mem: 79.03 GB
Loading required modules:  57%|█████████████▏         | 4/7 [00:19<00:11,  3.97s/it][01-19 17:22:34] Loading vae from /home/dist/FLUX.1-dev/vae. avail mem: 79.03 GB
[01-19 17:22:34] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.31 GB, avail mem: 78.68 GB
Loading required modules:  71%|████████████████▍      | 5/7 [00:19<00:05,  2.81s/it][01-19 17:22:34] Loading transformer from /home/dist/FLUX.1-dev/transformer. avail mem: 78.68 GB
[01-19 17:22:34] Loading FluxTransformer2DModel from 3 safetensors files, default_dtype: torch.bfloat16
[01-19 17:22:34] Using Torch SDPA backend.
[01-19 17:22:36] [RunAI Streamer] Overall time to stream 22.2 GiB of all files to cpu: 1.64s, 13.5 GiB/s
[01-19 17:22:41] Loaded model with 11.90B parameters
[01-19 17:22:41] Loaded transformer: FluxTransformer2DModel (sgl-diffusion version). model size: 22.17 GB, avail mem: 56.47 GB
Loading required modules:  86%|███████████████████▋   | 6/7 [00:26<00:03,  3.93s/it][01-19 17:22:41] Loading scheduler from /home/dist/FLUX.1-dev/scheduler. avail mem: 56.47 GB
[01-19 17:22:41] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: 0.00 GB, avail mem: 56.47 GB
Loading required modules: 100%|███████████████████████| 7/7 [00:26<00:00,  3.75s/it]
[01-19 17:22:41] Creating pipeline stages...
[01-19 17:22:41] Using Torch SDPA backend.
[01-19 17:22:41] Pipeline instantiated
[01-19 17:22:41] Worker 0: Initialized device, model, and distributed environment.
[01-19 17:22:41] Worker 0: Scheduler loop started.
[01-19 17:22:41] Processing prompt 1/1: A logo With Bold Large text: SGL Diffusion
[01-19 17:22:41] Sampling params:
                       width: -1
                      height: -1
                  num_frames: 1
                      prompt: A logo With Bold Large text: SGL Diffusion
                  neg_prompt: None
                        seed: 42
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 1.0
     embedded_guidance_scale: 3.5
                    n_tokens: None
                  flow_shift: None
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260119-172241_2cb639af.png
        
[01-19 17:22:41] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage_primary', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[01-19 17:22:41] [InputValidationStage] started...
[01-19 17:22:41] [InputValidationStage] finished in 0.0001 seconds
[01-19 17:22:41] [TextEncodingStage] started...
[01-19 17:23:01] [TextEncodingStage] finished in 19.9630 seconds
[01-19 17:23:01] [ConditioningStage] started...
[01-19 17:23:01] [ConditioningStage] finished in 0.0000 seconds
[01-19 17:23:01] [TimestepPreparationStage] started...
[01-19 17:23:01] [TimestepPreparationStage] finished in 0.0013 seconds
[01-19 17:23:01] [LatentPreparationStage] started...
[01-19 17:23:01] [LatentPreparationStage] finished in 0.0054 seconds
[01-19 17:23:01] [DenoisingStage] started...
  0%|                                                        | 0/50 [00:00<?, ?it/s][01-19 17:23:02] Failed to load JIT QK-Norm kernel: No module named 'tvm_ffi'
[01-19 17:23:02] /ws/python/sglang/multimodal_gen/runtime/models/dits/flux.py:225: UserWarning: FlashInfer not available, using Triton fallback for RoPE
  query, key = apply_flashinfer_rope_qk_inplace(

100%|███████████████████████████████████████████████| 50/50 [00:18<00:00,  2.70it/s]
[01-19 17:23:19] [DenoisingStage] average time per step: 0.3708 seconds
[01-19 17:23:19] [DenoisingStage] finished in 18.5531 seconds
[01-19 17:23:19] [DecodingStage] started...
[01-19 17:23:19] /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:321: UserWarning: In musa autocast, but the target dtype is not supported. Disabling autocast.
 musa Autocast only supports dtypes of torch.float16, torch.bfloat16 currently.
  warnings.warn(error_message)

[01-19 17:23:19] /usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py:2765: UserWarning: Unsupported qk_head_dim: 512 v_head_dim: 512 for FlashAttention in MUSA backend (Triggered internally at /home/torch_musa/torch_musa/csrc/aten/ops/attention/mudnn/SDPUtils.h:129.)
  hidden_states = F.scaled_dot_product_attention(

[01-19 17:23:19] [DecodingStage] finished in 0.0445 seconds
[01-19 17:23:19] Peak GPU memory: 26.67 GB, Remaining GPU memory at peak: 53.33 GB. Components that can stay resident: ['text_encoder', 'text_encoder_2', 'transformer']
[01-19 17:23:24] Output saved to outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260119-172241_2cb639af.png
[01-19 17:23:24] Pixel data generated successfully in 42.96 seconds
[01-19 17:23:24] Completed batch processing. Generated 1 outputs in 42.97 seconds
[01-19 17:23:24] Memory usage - Max peak: 27306.64 MB, Avg peak: 27306.64 MB
[01-19 17:23:24] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
root@worker3218:/ws# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
A_logo_With_Bold_Large_text_SGL_Diffusion_20260119-172241_2cb639af

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions bot added the documentation (Improvements or additions to documentation) and diffusion (SGLang Diffusion) labels Jan 19, 2026
@yeahdongcn yeahdongcn marked this pull request as ready for review January 19, 2026 09:54
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@yeahdongcn (Collaborator, Author)

Rebased onto upstream/main.

@Kangyan-Zhou Kangyan-Zhou merged commit 7de650c into sgl-project:main Feb 3, 2026
56 of 60 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

diffusion (SGLang Diffusion), documentation (Improvements or additions to documentation), mthreads
