
[diffusion] hardware: support diffusion models on MTGPU (doc, 6/N)#17346

Merged
Kangyan-Zhou merged 1 commit into sgl-project:main from yeahdongcn:xd/diffusion_doc
Feb 3, 2026
Conversation

@yeahdongcn (Collaborator)

Motivation

This PR is the 6th in a series of pull requests (tracked in #16565) to add full support for Moore Threads GPUs, leveraging MUSA (Meta-computing Unified System Architecture) to accelerate LLM inference.

Modifications

Add Moore Threads (MThreads) / MUSA documentation.

Note: #15592 is required for MUSA to run black-forest-labs/FLUX.1-dev.

Testing Done

Tested in a clean torch_musa container.
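
Before running the generate command below in a fresh container, it can be useful to confirm that the MUSA backend is importable at all. The following is a minimal sketch (not part of this PR): `pick_device` is a hypothetical helper, and it only checks that the `torch_musa` package is importable, not that a Moore Threads GPU is actually present.

```python
import importlib.util


def pick_device() -> str:
    """Return "musa" when the torch_musa extension is importable, else "cpu".

    torch_musa registers the "musa" device type with PyTorch on import.
    NOTE: this sketch only checks package importability, not whether a
    Moore Threads GPU is actually visible to the runtime.
    """
    if importlib.util.find_spec("torch_musa") is not None:
        return "musa"
    return "cpu"


print(pick_device())
```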

root@worker3218:/ws# sglang generate --model-path /home/dist/FLUX.1-dev/ --prompt "A logo With Bold Large text: SGL Diffusion" --save-output
2026-01-19 17:22:06 | hf_diffusers_utils | 139713722844288 | INFO : Diffusers version: 0.30.0.dev0
2026-01-19 17:22:06 | hf_diffusers_utils | 139713722844288 | INFO : Diffusers version: 0.30.0.dev0
2026-01-19 17:22:06 | registry | 139713722844288 | INFO : Using native sglang backend for model '/home/dist/FLUX.1-dev/'
2026-01-19 17:22:06 | registry | 139713722844288 | INFO : Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.flux.FluxPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.flux.FluxSamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.flux.FluxPipelineConfig'>)
[01-19 17:22:06] Disabling some offloading (except dit, text_encoder) for image generation model
[01-19 17:22:06] Port 5555 was unavailable, using port 5597 instead
[01-19 17:22:06] server_args: {"model_path": "/home/dist/FLUX.1-dev/", "backend": "auto", "attention_backend": null, "diffusers_attention_backend": null, "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "pipeline_class_name": null, "lora_path": null, "lora_nickname": "default", "vae_path": null, "lora_target_modules": null, "dit_cpu_offload": true, "dit_layerwise_offload": null, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": false, "vae_cpu_offload": false, "use_fsdp_inference": false, "pin_cpu_memory": true, "comfyui_mode": false, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "warmup": false, "warmup_resolutions": null, "disable_autocast": true, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30067, "host": "127.0.0.1", "port": 30000, "webui": false, "webui_port": 12312, "scheduler_port": 5597, "output_path": "outputs/", "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "boundary_ratio": null, "log_level": "info"}
[01-19 17:22:06] Local mode: True
[01-19 17:22:06] Starting server...
[01-19 17:22:14] Scheduler bind at endpoint: tcp://127.0.0.1:5597
[01-19 17:22:14] Initializing distributed environment with world_size=1, device=musa:0
[01-19 17:22:14] No pipeline_class_name specified, using model_index.json
[01-19 17:22:14] Diffusers version: 0.30.0.dev0
[01-19 17:22:14] Diffusers version: 0.30.0.dev0
[01-19 17:22:14] Using native sglang backend for model '/home/dist/FLUX.1-dev/'
[01-19 17:22:14] Found model info: ModelInfo(pipeline_cls=<class 'sglang.multimodal_gen.runtime.pipelines.flux.FluxPipeline'>, sampling_param_cls=<class 'sglang.multimodal_gen.configs.sample.flux.FluxSamplingParams'>, pipeline_config_cls=<class 'sglang.multimodal_gen.configs.pipeline_configs.flux.FluxPipelineConfig'>)
[01-19 17:22:14] Using pipeline from model_index.json: FluxPipeline
[01-19 17:22:14] Loading pipeline modules...
[01-19 17:22:14] Model already exists locally and is complete
[01-19 17:22:14] Model path: /home/dist/FLUX.1-dev/
[01-19 17:22:14] Diffusers version: 0.30.0.dev0
[01-19 17:22:14] Loading pipeline modules from config: {'_class_name': 'FluxPipeline', '_diffusers_version': '0.30.0.dev0', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'CLIPTextModel'], 'text_encoder_2': ['transformers', 'T5EncoderModel'], 'tokenizer': ['transformers', 'CLIPTokenizer'], 'tokenizer_2': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'FluxTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}
[01-19 17:22:14] Loading required components: ['text_encoder', 'text_encoder_2', 'tokenizer', 'tokenizer_2', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                               | 0/7 [00:00<?, ?it/s][01-19 17:22:14] Loading text_encoder from /home/dist/FLUX.1-dev/text_encoder. avail mem: 79.30 GB
[01-19 17:22:16] Using Torch SDPA backend.
[01-19 17:22:16] [RunAI Streamer] Overall time to stream 234.7 MiB of all files to cpu: 0.05s, 4.5 GiB/s
[01-19 17:22:20] Loaded text_encoder: FSDPCLIPTextModel (sgl-diffusion version). model size: 0.23 GB, avail mem: 79.05 GB
Loading required modules:  14%|███▎                   | 1/7 [00:05<00:31,  5.29s/it][01-19 17:22:20] Loading text_encoder_2 from /home/dist/FLUX.1-dev/text_encoder_2. avail mem: 79.05 GB
[01-19 17:22:22] [RunAI Streamer] Overall time to stream 8.9 GiB of all files to cpu: 2.02s, 4.4 GiB/s
[01-19 17:22:34] Loaded text_encoder_2: FSDPT5EncoderModel (sgl-diffusion version). model size: 8.87 GB, avail mem: 79.03 GB
Loading required modules:  29%|██████▌                | 2/7 [00:19<00:51, 10.36s/it][01-19 17:22:34] Loading tokenizer from /home/dist/FLUX.1-dev/tokenizer. avail mem: 79.03 GB
[01-19 17:22:34] Loaded tokenizer: CLIPTokenizerFast (sgl-diffusion version). model size: 0.00 GB, avail mem: 79.03 GB
[01-19 17:22:34] Loading tokenizer_2 from /home/dist/FLUX.1-dev/tokenizer_2. avail mem: 79.03 GB
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
[01-19 17:22:34] Loaded tokenizer_2: T5TokenizerFast (sgl-diffusion version). model size: 0.00 GB, avail mem: 79.03 GB
Loading required modules:  57%|█████████████▏         | 4/7 [00:19<00:11,  3.97s/it][01-19 17:22:34] Loading vae from /home/dist/FLUX.1-dev/vae. avail mem: 79.03 GB
[01-19 17:22:34] Loaded vae: AutoencoderKL (sgl-diffusion version). model size: 0.31 GB, avail mem: 78.68 GB
Loading required modules:  71%|████████████████▍      | 5/7 [00:19<00:05,  2.81s/it][01-19 17:22:34] Loading transformer from /home/dist/FLUX.1-dev/transformer. avail mem: 78.68 GB
[01-19 17:22:34] Loading FluxTransformer2DModel from 3 safetensors files, default_dtype: torch.bfloat16
[01-19 17:22:34] Using Torch SDPA backend.
[01-19 17:22:36] [RunAI Streamer] Overall time to stream 22.2 GiB of all files to cpu: 1.64s, 13.5 GiB/s
[01-19 17:22:41] Loaded model with 11.90B parameters
[01-19 17:22:41] Loaded transformer: FluxTransformer2DModel (sgl-diffusion version). model size: 22.17 GB, avail mem: 56.47 GB
Loading required modules:  86%|███████████████████▋   | 6/7 [00:26<00:03,  3.93s/it][01-19 17:22:41] Loading scheduler from /home/dist/FLUX.1-dev/scheduler. avail mem: 56.47 GB
[01-19 17:22:41] Loaded scheduler: FlowMatchEulerDiscreteScheduler (sgl-diffusion version). model size: 0.00 GB, avail mem: 56.47 GB
Loading required modules: 100%|███████████████████████| 7/7 [00:26<00:00,  3.75s/it]
[01-19 17:22:41] Creating pipeline stages...
[01-19 17:22:41] Using Torch SDPA backend.
[01-19 17:22:41] Pipeline instantiated
[01-19 17:22:41] Worker 0: Initialized device, model, and distributed environment.
[01-19 17:22:41] Worker 0: Scheduler loop started.
[01-19 17:22:41] Processing prompt 1/1: A logo With Bold Large text: SGL Diffusion
[01-19 17:22:41] Sampling params:
                       width: -1
                      height: -1
                  num_frames: 1
                      prompt: A logo With Bold Large text: SGL Diffusion
                  neg_prompt: None
                        seed: 42
                 infer_steps: 50
      num_outputs_per_prompt: 1
              guidance_scale: 1.0
     embedded_guidance_scale: 3.5
                    n_tokens: None
                  flow_shift: None
                  image_path: None
                 save_output: True
            output_file_path: outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260119-172241_2cb639af.png
        
[01-19 17:22:41] Running pipeline stages: ['input_validation_stage', 'prompt_encoding_stage_primary', 'conditioning_stage', 'timestep_preparation_stage', 'latent_preparation_stage', 'denoising_stage', 'decoding_stage']
[01-19 17:22:41] [InputValidationStage] started...
[01-19 17:22:41] [InputValidationStage] finished in 0.0001 seconds
[01-19 17:22:41] [TextEncodingStage] started...
[01-19 17:23:01] [TextEncodingStage] finished in 19.9630 seconds
[01-19 17:23:01] [ConditioningStage] started...
[01-19 17:23:01] [ConditioningStage] finished in 0.0000 seconds
[01-19 17:23:01] [TimestepPreparationStage] started...
[01-19 17:23:01] [TimestepPreparationStage] finished in 0.0013 seconds
[01-19 17:23:01] [LatentPreparationStage] started...
[01-19 17:23:01] [LatentPreparationStage] finished in 0.0054 seconds
[01-19 17:23:01] [DenoisingStage] started...
  0%|                                                        | 0/50 [00:00<?, ?it/s][01-19 17:23:02] Failed to load JIT QK-Norm kernel: No module named 'tvm_ffi'
[01-19 17:23:02] /ws/python/sglang/multimodal_gen/runtime/models/dits/flux.py:225: UserWarning: FlashInfer not available, using Triton fallback for RoPE
  query, key = apply_flashinfer_rope_qk_inplace(

100%|███████████████████████████████████████████████| 50/50 [00:18<00:00,  2.70it/s]
[01-19 17:23:19] [DenoisingStage] average time per step: 0.3708 seconds
[01-19 17:23:19] [DenoisingStage] finished in 18.5531 seconds
[01-19 17:23:19] [DecodingStage] started...
[01-19 17:23:19] /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:321: UserWarning: In musa autocast, but the target dtype is not supported. Disabling autocast.
 musa Autocast only supports dtypes of torch.float16, torch.bfloat16 currently.
  warnings.warn(error_message)

[01-19 17:23:19] /usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py:2765: UserWarning: Unsupported qk_head_dim: 512 v_head_dim: 512 for FlashAttention in MUSA backend (Triggered internally at /home/torch_musa/torch_musa/csrc/aten/ops/attention/mudnn/SDPUtils.h:129.)
  hidden_states = F.scaled_dot_product_attention(

[01-19 17:23:19] [DecodingStage] finished in 0.0445 seconds
[01-19 17:23:19] Peak GPU memory: 26.67 GB, Remaining GPU memory at peak: 53.33 GB. Components that can stay resident: ['text_encoder', 'text_encoder_2', 'transformer']
[01-19 17:23:24] Output saved to outputs/A_logo_With_Bold_Large_text_SGL_Diffusion_20260119-172241_2cb639af.png
[01-19 17:23:24] Pixel data generated successfully in 42.96 seconds
[01-19 17:23:24] Completed batch processing. Generated 1 outputs in 42.97 seconds
[01-19 17:23:24] Memory usage - Max peak: 27306.64 MB, Avg peak: 27306.64 MB
[01-19 17:23:24] Generator was garbage collected without being shut down. Attempting to shut down the local server and client.
root@worker3218:/ws# /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
A_logo_With_Bold_Large_text_SGL_Diffusion_20260119-172241_2cb639af

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions bot added the documentation (Improvements or additions to documentation) and diffusion (SGLang Diffusion) labels Jan 19, 2026
@yeahdongcn yeahdongcn marked this pull request as ready for review January 19, 2026 09:54
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
@yeahdongcn (Collaborator, Author)

Rebased onto upstream/main.

@Kangyan-Zhou Kangyan-Zhou merged commit 7de650c into sgl-project:main Feb 3, 2026
56 of 60 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Feb 5, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
RubiaCx pushed a commit to RubiaCx/sglang that referenced this pull request Feb 8, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

diffusion (SGLang Diffusion), documentation (Improvements or additions to documentation), mthreads
