
[Bug]: rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103 has an error with flash attention #28052

@capteen-hook

Description

Your current environment

rocm/vllm:rocm7.0.0_vllm_0.11.1_20251103

(Note, this does work with rocm/vllm:rocm7.0.0_vllm_0.10.2_20251006)

Dockerfile:

ARG BASE_IMAGE=python:3.12-slim  # the rocm/vllm image is passed in at build time
FROM ${BASE_IMAGE}

WORKDIR /app

# Install Rust, required build tools, pkg-config, and OpenSSL development libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    gcc \
    build-essential \
    pkg-config \
    libssl-dev \
    poppler-utils \
    && curl https://sh.rustup.rs -sSf | sh -s -- -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*
    
ENV PATH="/root/.cargo/bin:${PATH}"

COPY ./requirements.txt /app/flask_server/requirements.txt

RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r /app/flask_server/requirements.txt

COPY . /app/flask_server

ENV PORT=8000
EXPOSE $PORT

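# Note: BASE_URL is not set anywhere in this file; the healthcheck below assumes it is supplied at runtime (it expands to empty otherwise).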
HEALTHCHECK --interval=300s --timeout=5s --start-period=10s --retries=3 \
  CMD curl --fail http://localhost:$PORT/$BASE_URL || exit 1

CMD ["python", "-m", "flask_server"]

Environment variables (some of these are likely unhelpful):

      QWEN_CHECKPOINT: ${QWEN_CHECKPOINT:-Qwen/Qwen2.5-VL-3B-Instruct}
      MAX_MODEL_LEN: ${MAX_MODEL_LEN:-45000}
      MAX_NUM_SEQS: ${MAX_NUM_SEQS:-1}
      GPU_MEMORY_UTILIZATION: ${GPU_MEMORY_UTILIZATION:-0.9}
      MAX_TOKENS: ${MAX_TOKENS:-8192}
      TEMPERATURE: ${TEMPERATURE:-0.0}

      HIP_VISIBLE_DEVICES: ${HIP_VISIBLE_DEVICES:-0,1}
      TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL: ${TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL:-1}
      VLLM_USE_TRITON_FLASH_ATTN: ${VLLM_USE_TRITON_FLASH_ATTN:-0}
      MIOPEN_DEBUG_CONV_DIRECT: ${MIOPEN_DEBUG_CONV_DIRECT:-0}
      MIOPEN_DEBUG_GCN_ASM_KERNELS: ${MIOPEN_DEBUG_GCN_ASM_KERNELS:-0}
      VLLM_CUDAGRAPH_MODE: ${VLLM_CUDAGRAPH_MODE:-0}
      USE_CHAT_API: ${USE_CHAT_API:-1}
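
For context, a hedged sketch of how these variables would typically feed a vLLM launch; the flags are standard vLLM CLI options, but the actual launch code inside flask_server is not shown in this report:

vllm serve "${QWEN_CHECKPOINT}" \
  --max-model-len "${MAX_MODEL_LEN}" \
  --max-num-seqs "${MAX_NUM_SEQS}" \
  --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION}" \
  --trust-remote-code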

Host ROCm:

rocminfo
ROCk module version 6.14.14 is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.18
Runtime Ext Version:     1.11
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
XNACK enabled:           NO
DMAbuf Support:          YES
VMM Support:             YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen Threadripper PRO 9965WX 24-Cores
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen Threadripper PRO 9965WX 24-Cores
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      49152(0xc000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   5489
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            48
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    395385652(0x17911b34) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    395385652(0x17911b34) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    395385652(0x17911b34) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 4
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    395385652(0x17911b34) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx1100
  Uuid:                    GPU-f106daae55da882a
  Marketing Name:          AMD Radeon PRO W7800
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      6144(0x1800) KB
    L3:                      65536(0x10000) KB
  Chip ID:                 29790(0x745e)
  ASIC Revision:           0(0x0)
  Cacheline Size:          128(0x80)
  Max Clock Freq. (MHz):   1895
  BDFID:                   25344
  Internal Node ID:        1
  Compute Unit:            70
  SIMDs per CU:            2
  Shader Engines:          6
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        2147483647(0x7fffffff)
    y                        65535(0xffff)
    z                        65535(0xffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 552
  SDMA engine uCode::      24
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    31440896(0x1dfc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1100
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
    ISA 2
      Name:                    amdgcn-amd-amdhsa--gfx11-generic
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
*******
Agent 3
*******
  Name:                    gfx1201
  Uuid:                    GPU-ad0031ab2e3ff4d7
  Marketing Name:          AMD Radeon AI PRO R9700
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      32(0x20) KB
    L2:                      8192(0x2000) KB
    L3:                      65536(0x10000) KB
  Chip ID:                 30033(0x7551)
  ASIC Revision:           1(0x1)
  Cacheline Size:          256(0x100)
  Max Clock Freq. (MHz):   2350
  BDFID:                   8960
  Internal Node ID:        2
  Compute Unit:            64
  SIMDs per CU:            2
  Shader Engines:          4
  Shader Arrs. per Eng.:   2
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          32(0x20)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    1024(0x400)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        2147483647(0x7fffffff)
    y                        65535(0xffff)
    z                        65535(0xffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 58
  SDMA engine uCode::      380
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    33406976(0x1fdc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx1201
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
    ISA 2
      Name:                    amdgcn-amd-amdhsa--gfx12-generic
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        2147483647(0x7fffffff)
        y                        65535(0xffff)
        z                        65535(0xffff)
      FBarrier Max Size:       32
*** Done ***

🐛 Describe the bug

Stack Trace

worker        | Visible devices: 0
worker        | Using tensor parallel size: 1
worker        | [Celery] Initializing model in worker...
worker        | INFO 11-04 15:56:04 [utils.py:243] non-default args: {'trust_remote_code': True, 'max_model_len': 10960, 'gpu_memory_utilization': 0.94, 'max_num_seqs': 1, 'disable_log_stats': True, 'limit_mm_per_prompt': {'image': 1, 'video': 0}, 'model': 'Qwen/Qwen3-VL-8B-Instruct'}
worker        | INFO 11-04 15:56:11 [model.py:658] Resolved architecture: Qwen3VLForConditionalGeneration
worker        | INFO 11-04 15:56:11 [model.py:1745] Using max model len 10960
worker        | The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
worker        | INFO 11-04 15:56:12 [scheduler.py:225] Chunked prefill is enabled with max_num_batched_tokens=8192.
worker        | WARNING 11-04 15:56:12 [vllm.py:498] No piecewise cudagraph for executing cascade attention. Will fall back to eager execution if a batch runs into cascade attentions
worker        | WARNING 11-04 15:56:13 [__init__.py:1993] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
worker        | INFO 11-04 15:56:14 [__init__.py:225] Automatically detected platform rocm.
worker        | /usr/local/lib/python3.12/dist-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
worker        |   warnings.warn(
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:56:15 [core.py:730] Waiting for init message from front-end.
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:56:15 [core.py:97] Initializing a V1 LLM engine (v0.11.1rc2.dev141+g38f225c2a) with config: model='Qwen/Qwen3-VL-8B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen3-VL-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-VL-8B-Instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['+rms_norm', '+silu_and_mul', '+quant_fp8', 'none', '+rms_norm'], 'splitting_ops': [], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL: 2>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [2, 1], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_capture_size': 2, 'local_cache_dir': None}
worker        | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
worker        | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
worker        | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
worker        | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
worker        | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
worker        | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:56:18 [parallel_state.py:1325] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:56:21 [gpu_model_runner.py:2843] Starting to load model Qwen/Qwen3-VL-8B-Instruct...
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:56:21 [rocm.py:298] Using Rocm Attention backend on V1 engine.
worker        | (EngineCore_DP0 pid=249) WARNING 11-04 15:56:21 [compilation.py:874] Op 'quant_fp8' not present in model, enabling with '+quant_fp8' has no effect
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:56:22 [weight_utils.py:419] Using model weights format ['*.safetensors']
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:58:35 [weight_utils.py:440] Time spent downloading weights for Qwen/Qwen3-VL-8B-Instruct: 133.725200 seconds
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:04,  1.43s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:03<00:03,  1.51s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.53s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.37s/it]
worker        | (EngineCore_DP0 pid=249)
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:58:41 [default_loader.py:314] Loading weights took 5.54 seconds
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:58:42 [gpu_model_runner.py:2904] Model loading took 16.8730 GiB and 139.878708 seconds
worker        | (EngineCore_DP0 pid=249) INFO 11-04 15:58:42 [gpu_model_runner.py:3670] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
worker        | (EngineCore_DP0 pid=249) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/utils.py:117: UserWarning: Failed validator: GCN_ARCH_NAME (Triggered internally at /app/pytorch/aten/src/ATen/hip/tunable/Tunable.cpp:366.)
worker        | (EngineCore_DP0 pid=249)   return torch.nn.functional.linear(x, weight, bias)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793] EngineCore failed to start.
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793] Traceback (most recent call last):
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 784, in run_engine_core
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     engine_core = EngineCoreProc(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 552, in __init__
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     super().__init__(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 113, in __init__
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 224, in _initialize_kv_caches
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     available_gpu_memory = self.model_executor.determine_available_memory()
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 88, in determine_available_memory
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return self.collective_rpc("determine_available_memory")
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 74, in collective_rpc
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return [run_method(self.driver_worker, method, args, kwargs)]
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2089, in run_method
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return func(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return func(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 280, in determine_available_memory
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     self.model_runner.profile_run()
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3686, in profile_run
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     dummy_encoder_outputs = self.model.get_multimodal_embeddings(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 1604, in get_multimodal_embeddings
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     image_embeddings = self._process_image_input(multimodal_input)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 1416, in _process_image_input
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     image_embeds = self.visual(pixel_values, grid_thw=grid_thw_list)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return self._call_impl(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return forward_call(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 547, in forward
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     hidden_states = blk(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                     ^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return self._call_impl(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return forward_call(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 231, in forward
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     x = x + self.attn(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]             ^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return self._call_impl(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return forward_call(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 409, in forward
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     output = self.flash_attn_varlen_func(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return FlashAttnVarlenFunc.apply(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 581, in apply
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return super().apply(*args, **kwargs)  # type: ignore[misc]
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/flash_attn/flash_attn_interface.py", line 925, in forward
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1254, in __call__
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return self._op(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 342, in backend_impl
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     result = self._backend_fns[device_type](*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 53, in inner
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return disable_fn(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return fn(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 375, in wrapped_fn
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     return fn(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]            ^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]   File "/usr/local/lib/python3.12/dist-packages/flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]     out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793]                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) ERROR 11-04 16:00:28 [core.py:793] RuntimeError: HIP Function Failed (/app/flash-attention/csrc/composable_kernel/include/ck_tile/host/kernel_launch_hip.hpp,65) invalid device function
worker        | (EngineCore_DP0 pid=249) Process EngineCore_DP0:
worker        | (EngineCore_DP0 pid=249) Traceback (most recent call last):
worker        | (EngineCore_DP0 pid=249)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
worker        | (EngineCore_DP0 pid=249)     self.run()
worker        | (EngineCore_DP0 pid=249)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
worker        | (EngineCore_DP0 pid=249)     self._target(*self._args, **self._kwargs)
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 797, in run_engine_core
worker        | (EngineCore_DP0 pid=249)     raise e
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 784, in run_engine_core
worker        | (EngineCore_DP0 pid=249)     engine_core = EngineCoreProc(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 552, in __init__
worker        | (EngineCore_DP0 pid=249)     super().__init__(
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 113, in __init__
worker        | (EngineCore_DP0 pid=249)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
worker        | (EngineCore_DP0 pid=249)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 224, in _initialize_kv_caches
worker        | (EngineCore_DP0 pid=249)     available_gpu_memory = self.model_executor.determine_available_memory()
worker        | (EngineCore_DP0 pid=249)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 88, in determine_available_memory
worker        | (EngineCore_DP0 pid=249)     return self.collective_rpc("determine_available_memory")
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 74, in collective_rpc
worker        | (EngineCore_DP0 pid=249)     return [run_method(self.driver_worker, method, args, kwargs)]
worker        | (EngineCore_DP0 pid=249)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2089, in run_method
worker        | (EngineCore_DP0 pid=249)     return func(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
worker        | (EngineCore_DP0 pid=249)     return func(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 280, in determine_available_memory
worker        | (EngineCore_DP0 pid=249)     self.model_runner.profile_run()
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3686, in profile_run
worker        | (EngineCore_DP0 pid=249)     dummy_encoder_outputs = self.model.get_multimodal_embeddings(
worker        | (EngineCore_DP0 pid=249)                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 1604, in get_multimodal_embeddings
worker        | (EngineCore_DP0 pid=249)     image_embeddings = self._process_image_input(multimodal_input)
worker        | (EngineCore_DP0 pid=249)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 1416, in _process_image_input
worker        | (EngineCore_DP0 pid=249)     image_embeds = self.visual(pixel_values, grid_thw=grid_thw_list)
worker        | (EngineCore_DP0 pid=249)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
worker        | (EngineCore_DP0 pid=249)     return self._call_impl(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
worker        | (EngineCore_DP0 pid=249)     return forward_call(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 547, in forward
worker        | (EngineCore_DP0 pid=249)     hidden_states = blk(
worker        | (EngineCore_DP0 pid=249)                     ^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
worker        | (EngineCore_DP0 pid=249)     return self._call_impl(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
worker        | (EngineCore_DP0 pid=249)     return forward_call(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 231, in forward
worker        | (EngineCore_DP0 pid=249)     x = x + self.attn(
worker        | (EngineCore_DP0 pid=249)             ^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
worker        | (EngineCore_DP0 pid=249)     return self._call_impl(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
worker        | (EngineCore_DP0 pid=249)     return forward_call(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 409, in forward
worker        | (EngineCore_DP0 pid=249)     output = self.flash_attn_varlen_func(
worker        | (EngineCore_DP0 pid=249)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/flash_attn/flash_attn_interface.py", line 1443, in flash_attn_varlen_func
worker        | (EngineCore_DP0 pid=249)     return FlashAttnVarlenFunc.apply(
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 581, in apply
worker        | (EngineCore_DP0 pid=249)     return super().apply(*args, **kwargs)  # type: ignore[misc]
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/flash_attn/flash_attn_interface.py", line 925, in forward
worker        | (EngineCore_DP0 pid=249)     out_padded, softmax_lse, S_dmask, rng_state = _wrapped_flash_attn_varlen_forward(
worker        | (EngineCore_DP0 pid=249)                                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1254, in __call__
worker        | (EngineCore_DP0 pid=249)     return self._op(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 342, in backend_impl
worker        | (EngineCore_DP0 pid=249)     result = self._backend_fns[device_type](*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/_compile.py", line 53, in inner
worker        | (EngineCore_DP0 pid=249)     return disable_fn(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
worker        | (EngineCore_DP0 pid=249)     return fn(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py", line 375, in wrapped_fn
worker        | (EngineCore_DP0 pid=249)     return fn(*args, **kwargs)
worker        | (EngineCore_DP0 pid=249)            ^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249)   File "/usr/local/lib/python3.12/dist-packages/flash_attn/flash_attn_interface.py", line 165, in _flash_attn_varlen_forward
worker        | (EngineCore_DP0 pid=249)     out, softmax_lse, S_dmask, rng_state = flash_attn_gpu.varlen_fwd(
worker        | (EngineCore_DP0 pid=249)                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
worker        | (EngineCore_DP0 pid=249) RuntimeError: HIP Function Failed (/app/flash-attention/csrc/composable_kernel/include/ck_tile/host/kernel_launch_hip.hpp,65) invalid device function
worker        | [rank0]:[W1104 16:00:28.050688195 ProcessGroupNCCL.cpp:1522] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
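
"invalid device function" from a HIP kernel launch generally means the binary contains no code object for the running GPU's ISA (here gfx1100 / gfx1201). A hedged way to check which ISAs the bundled flash-attention build actually targets, run inside the failing container (the .so name and path vary between image builds):

# locate the compiled flash-attention extension shipped in the image
SO=$(find /usr/local/lib/python3.12/dist-packages -name 'flash_attn*.so' | head -1)
# heuristic: list the gfx ISAs embedded in the binary
strings "$SO" | grep -o 'gfx[0-9a-f]\+' | sort -u
# the ISA reported by ROCm PyTorch for the visible device
python3 -c "import torch; print(torch.cuda.get_device_properties(0).gcnArchName)"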
