[CI Failure]: mi325_1: Quantization Test #29525

@AndreasKaratzas

Name of failing test

uv pip install --system torchao==0.13.0 && VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py

Basic information

  • Flaky test
  • Can reproduce locally
  • Caused by external libraries (e.g. bug in transformers)

🧪 Describe the failing test

Multiple Failing Tests in quantization test suite (33 failures)

Common failure patterns across quantization tests:

Pattern 1 - Unsupported on ROCm:

  • test_model_rtn_startup - ValidationError: "rtn quantization is currently not supported in rocm"
  • Expected behavior: Test should be skipped on ROCm platform
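
One way to get the expected skip is a platform-gated pytest marker. The sketch below is illustrative only and is not taken from the vLLM test suite: `running_on_rocm`, `skip_rtn_on_rocm`, and the placeholder test body are hypothetical names, and the check assumes that `torch.version.hip` is set on ROCm builds of PyTorch and is `None` otherwise.

```python
import importlib.util

import pytest


def running_on_rocm() -> bool:
    """Best-effort check for a ROCm (HIP) build of PyTorch.

    Hypothetical helper for illustration; returns False when torch is absent.
    """
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    return getattr(torch.version, "hip", None) is not None


# Reusable marker: the decorated test is reported as skipped, not failed.
skip_rtn_on_rocm = pytest.mark.skipif(
    running_on_rocm(),
    reason="rtn quantization is currently not supported in rocm",
)


@skip_rtn_on_rocm
def test_model_rtn_startup_sketch():
    # Placeholder body; the real test would start a model with RTN quantization.
    assert True
```

If the suite already exposes its own platform helper, that helper would likely be a better condition than the torch-based check assumed here.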

Pattern 2 - Engine Core Initialization Failures:

  • test_load_fp16_model (8 variants), test_ptpc_fp8_rocm (6 variants), test_pre_quantized_model, test_scaled_fp8_quant (2 variants)
  • Failure: RuntimeError during wait_for_engine_startup → engine core subprocess exits prematurely
  • Same root cause as compilation test failures: multiprocessing initialization issues in spawned workers
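
The failure shape described above (a parent waiting on a startup handshake while the child dies first) can be reproduced in isolation. The following is a minimal standard-library sketch of that pattern, not vLLM's `wait_for_engine_startup`: the function name `start_and_wait_ready` and the `READY` handshake line are assumptions made for illustration. It shows why the parent only reports a generic RuntimeError while the actual cause sits in the child's output.

```python
import subprocess
import sys


def start_and_wait_ready(child_code: str) -> subprocess.Popen:
    """Start a child interpreter and block until it prints a READY line.

    Raises RuntimeError if the child exits before signalling readiness.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams so the child's traceback is captured
        text=True,
    )
    seen = []
    for line in proc.stdout:
        if line.strip() == "READY":
            return proc
        seen.append(line)
    # EOF without READY: the child exited during startup.
    exit_code = proc.wait()
    raise RuntimeError(
        f"child exited during startup (exit code {exit_code}); "
        f"output:\n{''.join(seen)}"
    )


if __name__ == "__main__":
    # A healthy child signals readiness before doing anything else.
    healthy = start_and_wait_ready("print('READY', flush=True)")
    healthy.wait()

    # A child that fails during initialization never prints READY.
    try:
        start_and_wait_ready("raise ValueError('device init failed in worker')")
    except RuntimeError as exc:
        print(exc)
```

When triaging the real failures, the equivalent step is to read the engine core subprocess's own log output rather than the parent's RuntimeError, since the parent only observes that the child went away.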

Pattern 3 - Model/Config Issues:

  • test_auto_gptq (6 variants), test_cpu_offload_gptq, test_cpu_offload_compressed_tensors, test_gptq_with_dynamic (2 variants), test_custom_quant
  • Likely cause: AssertionErrors suggest model loading or weight quantization configuration mismatches specific to ROCm

Pattern 4 - Specific Model Failures:

  • test_experts_int8 for Jamba/Plamo models, test_compressed_tensors_nvfp4, test_compressed_tensors_fp8_block_enabled
  • Likely cause: ROCm-specific kernel or quantization format incompatibilities

Root causes:

  1. Missing platform checks - Tests not skipping unsupported quantization methods on ROCm
  2. Same multiprocessing/CUDA initialization issue affecting engine startup across multiple quantization configs
  3. ROCm-specific quantization kernel/format compatibility issues not properly handled
  4. Model architecture inspection triggering device capability checks in forked processes (similar to data_parallel.py issue)

πŸ“ History of failing test

AMD-CI Buildkite build references:

  • 1041
  • 1077
  • 1088
  • 1109
  • 1111

CC List.

No response
