LLM Scaler is a GenAI solution for text generation, image generation, video generation, and more, running on Intel® Arc™ Pro B60 GPUs. LLM Scaler leverages standard frameworks such as vLLM, ComfyUI, SGLang Diffusion, and Xinference, and ensures the best performance for state-of-the-art GenAI models running on Arc Pro B60 GPUs.
- 🔥 [2026.03] We released `intel/llm-scaler-vllm:0.14.0-b8.1` to support Qwen3.5-27B, Qwen3.5-35B-A3B and Qwen3.5-122B-A10B (FP8/INT4 online quantization, GPTQ).
- 🔥 [2026.03] We released `intel/llm-scaler-omni:0.1.0-b6` for ComfyUI to support CacheDiT and torch.compile(), ComfyUI-GGUF, and more model workflows, and to support FP8 for SGLang Diffusion.
- 🔥 [2026.03] We released `intel/llm-scaler-vllm:0.14.0-b8` for vLLM 0.14.0 and PyTorch 2.10 support, with support for various new models and performance improvements.
- [2026.01] We released `intel/llm-scaler-vllm:1.3` (or `intel/llm-scaler-vllm:0.11.1-b7`) for vLLM 0.11.1 and PyTorch 2.9 support, with support for various new models and performance improvements.
- [2026.01] We released `intel/llm-scaler-omni:0.1.0-b5` for Python 3.12 and PyTorch 2.9 support, various ComfyUI workflows, and more SGLang Diffusion support.
- [2025.12] We released `intel/llm-scaler-vllm:1.2`, the same image as `intel/llm-scaler-vllm:0.10.2-b6`.
- [2025.12] We released `intel/llm-scaler-omni:0.1.0-b4` to support ComfyUI workflows for Z-Image-Turbo and Hunyuan-Video-1.5 T2V/I2V with multi-XPU, and to experimentally support SGLang Diffusion.
- [2025.11] We released `intel/llm-scaler-vllm:0.10.2-b6` to support Qwen3-VL (Dense/MoE), Qwen3-Omni, Qwen3-30B-A3B (MoE Int4), MinerU 2.5, ERNIE-4.5-VL, etc.
- [2025.11] We released `intel/llm-scaler-vllm:0.10.2-b5` to support gpt-oss models, and released `intel/llm-scaler-omni:0.1.0-b3` to support more ComfyUI workflows and Windows installation.
- [2025.10] We released `intel/llm-scaler-omni:0.1.0-b2` to support more models with ComfyUI workflows and Xinference.
- [2025.09] We released `intel/llm-scaler-vllm:0.10.0-b3` to support more models (MinerU, MiniCPM-V-4.5, etc.), and released `intel/llm-scaler-omni:0.1.0-b1` to enable the first omni GenAI models using ComfyUI and Xinference on Arc Pro B60 GPUs.
- [2025.08] We released `intel/llm-scaler-vllm:1.0`.
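The images above are published on Docker Hub and can be pulled directly by tag, for example:

```bash
# Pull the latest llm-scaler-vllm and llm-scaler-omni releases by tag
docker pull intel/llm-scaler-vllm:0.14.0-b8.1
docker pull intel/llm-scaler-omni:0.1.0-b6
```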
llm-scaler-vllm supports running text generation models using vLLM, featuring:
- CCL support (P2P or USM)
- INT4 and FP8 quantized online serving
- Embedding and Reranker model support
- Multi-Modal model support
- Omni model support
- Tensor Parallel, Pipeline Parallel and Data Parallel
- Finding the maximum context length
- Multi-Modal WebUI
- BPE-Qwen tokenizer
Please follow the instructions in Getting Started to use llm-scaler-vllm; a minimal launch sketch is shown below.
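As a rough illustration only, the sketch below shows how such a container is typically launched. It assumes the image exposes the Arc Pro B60 GPUs through /dev/dri and provides the standard vLLM `vllm serve` CLI; the model path, GPU count, and port are placeholders, so follow Getting Started for the exact supported invocation.

```bash
# Minimal sketch (not the official invocation): serve Qwen3-8B with
# FP8 online quantization across 2 Arc Pro B60 GPUs.
# Assumptions: GPUs are exposed via /dev/dri, models live under
# /path/to/models on the host, and the image ships the vLLM CLI.
docker run -it --rm \
  --device /dev/dri \
  --net host \
  --shm-size 16g \
  -v /path/to/models:/llm/models \
  intel/llm-scaler-vllm:0.14.0-b8.1 \
  vllm serve /llm/models/Qwen3-8B \
    --quantization fp8 \
    --tensor-parallel-size 2 \
    --port 8000
```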
| Category | Model Name | FP16 | Dynamic Online FP8 | Dynamic Online Int4 | MXFP4 | Notes |
|---|---|---|---|---|---|---|
| Language Model | openai/gpt-oss-20b | | | | ✅ | |
| Language Model | openai/gpt-oss-120b | | | | ✅ | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B | ✅ | ✅ | ✅ | | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | ✅ | ✅ | ✅ | | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Llama-8B | ✅ | ✅ | ✅ | | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B | ✅ | ✅ | ✅ | | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | ✅ | ✅ | ✅ | | |
| Language Model | deepseek-ai/DeepSeek-R1-Distill-Llama-70B | ✅ | ✅ | ✅ | | |
| Language Model | deepseek-ai/DeepSeek-R1-0528-Qwen3-8B | ✅ | ✅ | ✅ | | |
| Language Model | deepseek-ai/DeepSeek-V2-Lite | ✅ | ✅ | | | export VLLM_MLA_DISABLE=1 |
| Language Model | deepseek-ai/deepseek-coder-33b-instruct | ✅ | ✅ | ✅ | | |
| Language Model | Qwen/Qwen3-8B | ✅ | ✅ | ✅ | | |
| Language Model | Qwen/Qwen3-14B | ✅ | ✅ | ✅ | | |
| Language Model | Qwen/Qwen3-32B | ✅ | ✅ | ✅ | | |
| Language Model | Qwen/Qwen3.5-27B | ✅ | ✅ | ✅ | | |
| Language MoE Model | Qwen/Qwen3-30B-A3B | ✅ | ✅ | ✅ | | |
| Language MoE Model | Qwen/Qwen3-235B-A22B | ✅ | | | | |
| Language MoE Model | Qwen/Qwen3-Coder-30B-A3B-Instruct | ✅ | ✅ | ✅ | | |
| Language MoE Model | Qwen/Qwen3-Coder-Next | ✅ | ✅ | ✅ | | |
| Language MoE Model | Qwen/Qwen3.5-35B-A3B | ✅ | ✅ | ✅ | | |
| Language MoE Model | Qwen/Qwen3.5-122B-A10B | ✅ | ✅ | | | |
| Language Model | Qwen/QwQ-32B | ✅ | ✅ | ✅ | | |
| Language Model | mistralai/Ministral-8B-Instruct-2410 | ✅ | ✅ | ✅ | | |
| Language Model | mistralai/Mixtral-8x7B-Instruct-v0.1 | ✅ | ✅ | ✅ | | |
| Language Model | meta-llama/Llama-3.1-8B | ✅ | ✅ | ✅ | | |
| Language Model | meta-llama/Llama-3.1-70B | ✅ | ✅ | ✅ | | |
| Language Model | baichuan-inc/Baichuan2-7B-Chat | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | baichuan-inc/Baichuan2-13B-Chat | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | THUDM/CodeGeex4-All-9B | ✅ | ✅ | ✅ | | with chat_template |
| Language Model | zai-org/GLM-4-9B-0414 | ✅ | | | | use bfloat16 |
| Language Model | zai-org/GLM-4-32B-0414 | ✅ | | | | use bfloat16 |
| Language MoE Model | zai-org/GLM-4.5-Air | ✅ | ✅ | | | |
| Language MoE Model | zai-org/GLM-4.7-Flash | ✅ | ✅ | | | |
| Language Model | ByteDance-Seed/Seed-OSS-36B-Instruct | ✅ | ✅ | ✅ | | |
| Language Model | miromind-ai/MiroThinker-v1.5-30B | ✅ | ✅ | ✅ | | |
| Language Model | tencent/Hunyuan-0.5B-Instruct | ✅ | ✅ | ✅ | | follow the guide here |
| Language Model | tencent/Hunyuan-7B-Instruct | ✅ | ✅ | ✅ | | follow the guide here |
| Multimodal Model | Qwen/Qwen2-VL-7B-Instruct | ✅ | ✅ | ✅ | | |
| Multimodal Model | Qwen/Qwen2.5-VL-7B-Instruct | ✅ | ✅ | ✅ | | |
| Multimodal Model | Qwen/Qwen2.5-VL-32B-Instruct | ✅ | ✅ | ✅ | | |
| Multimodal Model | Qwen/Qwen2.5-VL-72B-Instruct | ✅ | ✅ | ✅ | | |
| Multimodal Model | Qwen/Qwen3-VL-4B-Instruct | ✅ | ✅ | ✅ | | |
| Multimodal Model | Qwen/Qwen3-VL-8B-Instruct | ✅ | ✅ | ✅ | | |
| Multimodal MoE Model | Qwen/Qwen3-VL-30B-A3B-Instruct | ✅ | ✅ | ✅ | | |
| Multimodal Model | openbmb/MiniCPM-V-2_6 | ✅ | ✅ | ✅ | | |
| Multimodal Model | openbmb/MiniCPM-V-4 | ✅ | ✅ | ✅ | | |
| Multimodal Model | openbmb/MiniCPM-V-4_5 | ✅ | ✅ | ✅ | | |
| Multimodal Model | OpenGVLab/InternVL2-8B | ✅ | ✅ | ✅ | | |
| Multimodal Model | OpenGVLab/InternVL3-8B | ✅ | ✅ | ✅ | | |
| Multimodal Model | OpenGVLab/InternVL3_5-8B | ✅ | ✅ | ✅ | | |
| Multimodal MoE Model | OpenGVLab/InternVL3_5-30B-A3B | ✅ | ✅ | ✅ | | |
| Multimodal Model | rednote-hilab/dots.ocr | ✅ | ✅ | ✅ | | |
| Multimodal Model | ByteDance-Seed/UI-TARS-7B-DPO | ✅ | ✅ | ✅ | | |
| Multimodal Model | google/gemma-3-12b-it | ✅ | | | | use bfloat16 |
| Multimodal Model | google/gemma-3-27b-it | ✅ | | | | use bfloat16 |
| Multimodal Model | THUDM/GLM-4v-9B | ✅ | ✅ | ✅ | | with --hf-overrides and chat_template |
| Multimodal Model | zai-org/GLM-4.1V-9B-Base | ✅ | ✅ | ✅ | | |
| Multimodal Model | zai-org/GLM-4.1V-9B-Thinking | ✅ | ✅ | ✅ | | |
| Multimodal Model | zai-org/Glyph | ✅ | ✅ | ✅ | | |
| Multimodal Model | opendatalab/MinerU2.5-2509-1.2B | ✅ | ✅ | ✅ | | |
| Multimodal Model | baidu/ERNIE-4.5-VL-28B-A3B-Thinking | ✅ | ✅ | ✅ | | |
| Multimodal Model | zai-org/GLM-4.6V-Flash | ✅ | ✅ | ✅ | | `pip install transformers==5.0.0rc0` first |
| Multimodal Model | PaddlePaddle/PaddleOCR-VL | ✅ | ✅ | ✅ | | follow the guide here |
| Multimodal Model | deepseek-ai/DeepSeek-OCR | ✅ | ✅ | ✅ | | |
| Multimodal Model | deepseek-ai/DeepSeek-OCR-2 | ✅ | ✅ | ✅ | | There may be accuracy issues when using --quantization fp8 |
| Multimodal Model | moonshotai/Kimi-VL-A3B-Thinking-2506 | ✅ | ✅ | ✅ | | |
| Omni Model | Qwen/Qwen2.5-Omni-7B | ✅ | ✅ | ✅ | | |
| Omni Model | Qwen/Qwen3-Omni-30B-A3B-Instruct | ✅ | ✅ | ✅ | | |
| Audio Model | openai/whisper-medium | ✅ | ✅ | ✅ | | |
| Audio Model | openai/whisper-large-v3 | ✅ | ✅ | ✅ | | |
| Embedding Model | Qwen/Qwen3-Embedding-8B | ✅ | ✅ | ✅ | | |
| VL Embedding Model | Qwen3-VL-Embedding-2B/8B | ✅ | ✅ | ✅ | | follow the guide here |
| Embedding Model | BAAI/bge-m3 | ✅ | ✅ | ✅ | | |
| Embedding Model | BAAI/bge-large-en-v1.5 | ✅ | ✅ | ✅ | | |
| Reranker Model | Qwen/Qwen3-Reranker-8B | ✅ | ✅ | ✅ | | |
| VL Reranker Model | Qwen3-VL-Reranker-2B/8B | ✅ | ✅ | ✅ | | follow the guide here |
| Reranker Model | BAAI/bge-reranker-large | ✅ | ✅ | ✅ | | |
| Reranker Model | BAAI/bge-reranker-v2-m3 | ✅ | ✅ | ✅ | | |
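Once one of the models above is being served, it can be exercised through vLLM's OpenAI-compatible HTTP API. A small request sketch, assuming the server from the launch example above is listening on port 8000 (the `model` value must match whatever name the server registered):

```bash
# Query the OpenAI-compatible /v1/chat/completions endpoint.
# Port and model name are assumptions carried over from the launch sketch.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/llm/models/Qwen3-8B",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
      }'
```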
llm-scaler-omni supports running image, voice, and video generation workloads, among others, featuring an Omni Studio mode (using ComfyUI) and an Omni Serving mode (via SGLang Diffusion or Xinference).
Please follow the instructions in the Getting Started to use llm-scaler-omni.
| Qwen-Image | Multi B60 Wan2.2-T2V-14B |
|---|---|
| ![]() | ![]() |
Omni Studio supports Image Generation/Editing, Video Generation, Audio Generation, 3D Generation, etc.
| Model Category | Model | Type |
|---|---|---|
| Image Generation | Qwen-Image, Qwen-Image-Edit | Text-to-Image, Image Editing |
| Image Generation | Stable Diffusion 3.5 | Text-to-Image, ControlNet |
| Image Generation | Z-Image-Turbo | Text-to-Image |
| Image Generation | Flux.1, Flux.1 Kontext dev | Text-to-Image, Multi-Image Reference, ControlNet |
| Image Generation | FireRed-Image-Edit-1.1 | Image Editing |
| Video Generation | Wan2.2 TI2V 5B, Wan2.2 T2V 14B, Wan2.2 I2V 14B | Text-to-Video, Image-to-Video |
| Video Generation | Wan2.2 Animate 14B | Video Animation |
| Video Generation | HunyuanVideo 1.5 8.3B | Text-to-Video, Image-to-Video |
| Video Generation | LTX-2 | Text-to-Video, Image-to-Video |
| 3D Generation | Hunyuan3D 2.1 | Text/Image-to-3D |
| Audio Generation | VoxCPM1.5, IndexTTS 2 | Text-to-Speech, Voice Cloning |
| Video Upscaling | SeedVR2 | Video Restoration and Upscaling |
Please check ComfyUI Support for more details.
Omni Serving supports Image Generation, Audio Generation etc.
- Image Generation (`/v1/images/generations`): Stable Diffusion 3.5, Flux.1-dev
- Text to Speech (`/v1/audio/speech`): Kokoro 82M
- Speech to Text (`/v1/audio/transcriptions`): whisper-large-v3
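These routes mirror the OpenAI API shape, so a standard OpenAI-style request works against them. A sketch for the image endpoint, assuming an Omni Serving instance on localhost:8000 with a registered Stable Diffusion 3.5 model (host, port, and model name all depend on your deployment):

```bash
# Sketch of an Omni Serving image generation request.
# The endpoint path comes from the list above; host, port, and model
# name are assumptions to adapt to your deployment.
curl http://localhost:8000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
        "model": "stable-diffusion-3.5",
        "prompt": "a lighthouse on a cliff at sunset",
        "size": "1024x1024"
      }'
```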
Please check Xinference Support for more details.
- Please check out the Docker image releases for llm-scaler-vllm and llm-scaler-omni
- Please report a bug or raise a feature request by opening a GitHub Issue

