
[Bug] Cannot serve Qwen3-VL-2B-Instruct-unsloth-bnb-4bit with vLLM #3732

@Tan-AcamarVN

  1. Did you update? pip install --upgrade unsloth unsloth_zoo: Yes
  2. Colab or Kaggle or local / cloud: Local, an A100 80G
  3. Number GPUs used, use nvidia-smi: 1
  4. Which notebook? None
  5. Which Unsloth version: latest, TRL version: 0.24, transformers version: 4.57.3, PyTorch version: 2.9.0
  6. Which trainer? None

Minimal reproduction:

hf download unsloth/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit --local-dir Qwen3-VL-2B-Instruct-unsloth-bnb-4bit
vllm serve ./Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/

While trying to reproduce issue #3560,
I realized that this model cannot be served with vLLM at all.

A similar command that serves the FP8 Instruct variant of Qwen3-VL 2B runs fine.
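To make the comparison between the working FP8 checkpoint and the failing bnb-4bit one concrete, a minimal sketch like the following could print what each checkpoint declares about its quantization. It assumes each model directory contains a Hugging Face style `config.json` with an optional `quantization_config` section; the helper name `summarize_quantization` is made up for illustration, and whether `llm_int8_skip_modules` is present in this particular checkpoint has not been verified here.

```python
import json
import sys
from pathlib import Path


def summarize_quantization(model_dir):
    """Return the quantization-related fields declared in <model_dir>/config.json.

    Assumption: the checkpoint follows the Hugging Face layout, where
    quantization settings (if any) live under the "quantization_config" key.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    qcfg = config.get("quantization_config") or {}
    return {
        "quant_method": qcfg.get("quant_method"),
        "load_in_4bit": qcfg.get("load_in_4bit"),
        # Modules listed here are left unquantized by bitsandbytes.
        "skip_modules": qcfg.get("llm_int8_skip_modules") or [],
    }


if __name__ == "__main__":
    # Pass one or more local model directories on the command line.
    for model_dir in sys.argv[1:]:
        print(model_dir, summarize_quantization(model_dir))
```

Running it on both local directories side by side would show whether the two checkpoints differ only in `quant_method` or also in a list of modules that are kept unquantized.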

So the suggestion the author made at the end of that issue may be wrong; I suspect the problem comes from the model weights themselves.
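One way to test the guess about the weights without loading the model is to read only the header of the safetensors shard. The sketch below relies on the safetensors file layout (an 8-byte little-endian header length followed by a JSON header) and lists every 2-D `.weight` tensor that is not stored as `U8`. It assumes that bitsandbytes 4-bit weights are saved as packed `U8` tensors, so anything it prints is a candidate for a layer stored unquantized; embeddings and the output head normally show up too and are expected. The function names are illustrative, not part of any library.

```python
import json
import struct
import sys


def read_safetensors_header(path):
    """Read just the JSON header of a .safetensors file (no tensor data)."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def unpacked_2d_weights(header):
    """Return {name: shape} for 2-D '.weight' tensors whose dtype is not U8.

    Assumption: packed bitsandbytes 4-bit weights are stored with dtype U8,
    so a 2-D non-U8 weight is stored at full precision.
    """
    found = {}
    for name, info in header.items():
        if name == "__metadata__":
            continue  # free-form metadata entry, not a tensor
        if (
            name.endswith(".weight")
            and len(info["shape"]) == 2
            and info["dtype"] != "U8"
        ):
            found[name] = info["shape"]
    return found


if __name__ == "__main__":
    # Pass one or more .safetensors shard paths on the command line.
    for shard in sys.argv[1:]:
        header = read_safetensors_header(shard)
        for name, shape in sorted(unpacked_2d_weights(header).items()):
            print(name, shape)
```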
Full stack trace:

uv run vllm serve ./models-weights/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/ 
(APIServer pid=1272999) INFO 12-16 08:08:12 [api_server.py:1772] vLLM API server version 0.12.0
(APIServer pid=1272999) INFO 12-16 08:08:12 [utils.py:253] non-default args: {'model_tag': './models-weights/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/', 'model': './models-weights/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/'}
(APIServer pid=1272999) INFO 12-16 08:08:12 [model.py:637] Resolved architecture: Qwen3VLForConditionalGeneration
(APIServer pid=1272999) INFO 12-16 08:08:12 [model.py:1750] Using max model len 262144
(APIServer pid=1272999) INFO 12-16 08:08:12 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.
(EngineCore_DP0 pid=1273863) INFO 12-16 08:08:22 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='./models-weights/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/', speculative_config=None, tokenizer='./models-weights/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=./models-weights/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 
'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=1273863) INFO 12-16 08:08:24 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.223.232.17:59771 backend=nccl
(EngineCore_DP0 pid=1273863) INFO 12-16 08:08:24 [parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1273863) INFO 12-16 08:08:32 [gpu_model_runner.py:3467] Starting to load model ./models-weights/Qwen3-VL-2B-Instruct-unsloth-bnb-4bit/...
(EngineCore_DP0 pid=1273863) INFO 12-16 08:08:33 [cuda.py:411] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=1273863) INFO 12-16 08:08:34 [bitsandbytes_loader.py:791] Loading weights with BitsAndBytes quantization. May take a while ...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 37.65it/s]
(EngineCore_DP0 pid=1273863) 
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1273863) INFO 12-16 08:08:34 [linear.py:1376] param_data.shape: torch.Size([6291456, 1]), loaded_weight.shape: torch.Size([2048, 6144])
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843] EngineCore failed to start.
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843] Traceback (most recent call last):
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     super().__init__(
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     self._init_executor()
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     self.driver_worker.load_model()
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3484, in load_model
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     self.load_weights(model, model_config)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/bitsandbytes_loader.py", line 799, in load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     loaded_weights = model.load_weights(qweight_iterator)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 1676, in load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     yield from self._load_module(
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 332, in load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     return loader.load_weights(weights)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     yield from self._load_module(
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 486, in load_weights
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     weight_loader(param, loaded_weight)
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1377, in weight_loader
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]     assert param_data.shape == loaded_weight.shape
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) ERROR 12-16 08:08:34 [core.py:843] AssertionError
(EngineCore_DP0 pid=1273863) Process EngineCore_DP0:
(EngineCore_DP0 pid=1273863) Traceback (most recent call last):
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=1273863)     self.run()
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=1273863)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 847, in run_engine_core
(EngineCore_DP0 pid=1273863)     raise e
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 834, in run_engine_core
(EngineCore_DP0 pid=1273863)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1273863)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 610, in __init__
(EngineCore_DP0 pid=1273863)     super().__init__(
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 102, in __init__
(EngineCore_DP0 pid=1273863)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=1273863)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=1273863)     self._init_executor()
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=1273863)     self.driver_worker.load_model()
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 273, in load_model
(EngineCore_DP0 pid=1273863)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3484, in load_model
(EngineCore_DP0 pid=1273863)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=1273863)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(EngineCore_DP0 pid=1273863)     self.load_weights(model, model_config)
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/bitsandbytes_loader.py", line 799, in load_weights
(EngineCore_DP0 pid=1273863)     loaded_weights = model.load_weights(qweight_iterator)
(EngineCore_DP0 pid=1273863)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 1676, in load_weights
(EngineCore_DP0 pid=1273863)     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore_DP0 pid=1273863)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=1273863)     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=1273863)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=1273863)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=1273863)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=1273863)     yield from self._load_module(
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=1273863)     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=1273863)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3.py", line 332, in load_weights
(EngineCore_DP0 pid=1273863)     return loader.load_weights(weights)
(EngineCore_DP0 pid=1273863)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/online_quantization.py", line 173, in patched_model_load_weights
(EngineCore_DP0 pid=1273863)     return original_load_weights(auto_weight_loader, weights, mapper=mapper)
(EngineCore_DP0 pid=1273863)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 335, in load_weights
(EngineCore_DP0 pid=1273863)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore_DP0 pid=1273863)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 288, in _load_module
(EngineCore_DP0 pid=1273863)     yield from self._load_module(
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 261, in _load_module
(EngineCore_DP0 pid=1273863)     loaded_params = module_load_weights(weights)
(EngineCore_DP0 pid=1273863)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen2.py", line 486, in load_weights
(EngineCore_DP0 pid=1273863)     weight_loader(param, loaded_weight)
(EngineCore_DP0 pid=1273863)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 1377, in weight_loader
(EngineCore_DP0 pid=1273863)     assert param_data.shape == loaded_weight.shape
(EngineCore_DP0 pid=1273863)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1273863) AssertionError
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1273863) 
[rank0]:[W1216 08:08:35.004285921 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1272999) Traceback (most recent call last):
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/bin/vllm", line 10, in <module>
(APIServer pid=1272999)     sys.exit(main())
(APIServer pid=1272999)              ^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=1272999)     args.dispatch_function(args)
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 60, in cmd
(APIServer pid=1272999)     uvloop.run(run_server(args))
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1272999)     return __asyncio.run(
(APIServer pid=1272999)            ^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1272999)     return runner.run(main)
(APIServer pid=1272999)            ^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1272999)     return self._loop.run_until_complete(task)
(APIServer pid=1272999)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1272999)     return await main
(APIServer pid=1272999)            ^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1819, in run_server
(APIServer pid=1272999)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1838, in run_server_worker
(APIServer pid=1272999)     async with build_async_engine_client(
(APIServer pid=1272999)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1272999)     return await anext(self.gen)
(APIServer pid=1272999)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 183, in build_async_engine_client
(APIServer pid=1272999)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1272999)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1272999)     return await anext(self.gen)
(APIServer pid=1272999)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 224, in build_async_engine_client_from_engine_args
(APIServer pid=1272999)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1272999)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 223, in from_vllm_config
(APIServer pid=1272999)     return cls(
(APIServer pid=1272999)            ^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=1272999)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1272999)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 121, in make_async_mp_client
(APIServer pid=1272999)     return AsyncMPClient(*client_args)
(APIServer pid=1272999)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 810, in __init__
(APIServer pid=1272999)     super().__init__(
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 471, in __init__
(APIServer pid=1272999)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=1272999)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1272999)   File "https://github.com/data0/tan/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1272999)     next(self.gen)
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 903, in launch_core_engines
(APIServer pid=1272999)     wait_for_engine_startup(
(APIServer pid=1272999)   File "https://github.com/data0/tan/mlpservice_tan/unsloth-experiment/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 960, in wait_for_engine_startup
(APIServer pid=1272999)     raise RuntimeError(
(APIServer pid=1272999) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
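
The two shapes logged by linear.py just before the assertion are related by simple arithmetic, sketched below. With 4-bit quantization and uint8 storage, two values share one byte, so a full matrix of N elements packs into a buffer of shape (ceil(N / 2), 1). The helper name `packed_4bit_shape` is made up for illustration. Reading the result as "vLLM allocated a packed 4-bit parameter but received an unpacked 2-D tensor for this layer" is an interpretation of the log, not something confirmed against the checkpoint.

```python
import math


def packed_4bit_shape(full_shape):
    """Shape of a uint8 buffer holding a tensor of `full_shape` at 4 bits per value."""
    n = math.prod(full_shape)
    # Two 4-bit values fit in one byte; round up for odd element counts.
    return ((n + 1) // 2, 1)


# Shapes taken from the log line emitted by linear.py before the AssertionError.
loaded_weight_shape = (2048, 6144)
param_data_shape = (6291456, 1)

print(math.prod(loaded_weight_shape))          # 12582912 elements
print(packed_4bit_shape(loaded_weight_shape))  # (6291456, 1)
print(packed_4bit_shape(loaded_weight_shape) == param_data_shape)  # True
```

So the parameter holds exactly the same number of values as the loaded tensor, only in packed form, which is why the strict shape comparison in `weight_loader` fails.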
