Open
Labels
ci-failure: Issue about an unexpected test failure in CI
Description
Name of failing test
uv pip install --system torchao==0.13.0 && VLLM_TEST_FORCE_LOAD_FORMAT=auto pytest -v -s quantization/ --ignore quantization/test_blackwell_moe.py
Basic information
- Flaky test
- Can reproduce locally
- Caused by external libraries (e.g. bug in transformers)
🧪 Describe the failing test
Multiple Failing Tests in quantization test suite (33 failures)
Common failure patterns across quantization tests:
Pattern 1 - Unsupported on ROCm:
- test_model_rtn_startup: ValidationError: "rtn quantization is currently not supported in rocm"
- Expected behavior: Test should be skipped on ROCm platform
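A minimal sketch of the kind of platform check Pattern 1 asks for, not the actual vLLM test code. The names `UNSUPPORTED_ON_ROCM`, `running_on_rocm`, and `skip_reason` are hypothetical; only "rtn" is confirmed as unsupported by the error above. ROCm detection here relies on PyTorch ROCm builds setting `torch.version.hip`.

```python
import importlib.util
from typing import Optional

# Hypothetical set for illustration; only "rtn" is confirmed by the
# ValidationError reported in this issue.
UNSUPPORTED_ON_ROCM = {"rtn"}


def running_on_rocm() -> bool:
    """Best-effort check: ROCm builds of PyTorch set torch.version.hip."""
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        import torch
    except Exception:
        return False
    return getattr(torch.version, "hip", None) is not None


def skip_reason(method: str, on_rocm: bool) -> Optional[str]:
    """Return a skip message if `method` should not run on this platform."""
    if on_rocm and method in UNSUPPORTED_ON_ROCM:
        return f"{method} quantization is not supported on ROCm"
    return None


# Possible use in a pytest module (assumes pytest is installed):
#
#   _reason = skip_reason("rtn", running_on_rocm())
#
#   @pytest.mark.skipif(_reason is not None, reason=_reason or "")
#   def test_model_rtn_startup(...):
#       ...
```

Keeping the decision in a plain function makes the skip logic testable without a GPU; the decorator only consumes its result.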
Pattern 2 - Engine Core Initialization Failures:
- test_load_fp16_model (8 variants), test_ptpc_fp8_rocm (6 variants), test_pre_quantized_model, test_scaled_fp8_quant (2 variants)
- Failure: RuntimeError during wait_for_engine_startup: the engine core subprocess exits prematurely
- Same root cause as compilation test failures: multiprocessing initialization issues in spawned workers
Pattern 3 - Model/Config Issues:
- test_auto_gptq (6 variants), test_cpu_offload_gptq, test_cpu_offload_compressed_tensors, test_gptq_with_dynamic (2 variants), test_custom_quant
- Likely cause: AssertionErrors suggest model loading or weight quantization configuration mismatches specific to ROCm
Pattern 4 - Specific Model Failures:
- test_experts_int8 for Jamba/Plamo models, test_compressed_tensors_nvfp4, test_compressed_tensors_fp8_block_enabled
- Likely cause: ROCm-specific kernel or quantization format incompatibilities
Root causes:
- Missing platform checks - Tests not skipping unsupported quantization methods on ROCm
- Same multiprocessing/CUDA initialization issue affecting engine startup across multiple quantization configs
- ROCm-specific quantization kernel/format compatibility issues not properly handled
- Model architecture inspection triggering device capability checks in forked processes (similar to data_parallel.py issue)
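The multiprocessing root causes above can be illustrated with a small sketch. It assumes the general rule that a GPU runtime (CUDA, and likely HIP) initialized in a parent process cannot be reused in a forked child, so workers should be started with "spawn" in that case. Whether this is the actual cause of the failures here is unconfirmed. `pick_start_method`, `launch_worker`, and `_worker` are hypothetical names, not vLLM APIs.

```python
import multiprocessing as mp


def pick_start_method(gpu_runtime_initialized: bool) -> str:
    """Choose a start method for worker processes.

    A forked child inherits the parent's GPU runtime state, which cannot
    be safely reused; "spawn" starts a fresh interpreter instead.
    """
    if gpu_runtime_initialized:
        return "spawn"
    # Fall back to fork where available (not on Windows).
    return "fork" if "fork" in mp.get_all_start_methods() else "spawn"


def _worker(queue) -> None:
    # Stand-in for engine core startup; a real worker would initialize
    # the device here, inside the child process.
    queue.put("worker started")


def launch_worker(gpu_runtime_initialized: bool) -> str:
    """Start one worker with the chosen method and wait for its message."""
    ctx = mp.get_context(pick_start_method(gpu_runtime_initialized))
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(queue,))
    proc.start()
    message = queue.get(timeout=30)
    proc.join()
    return message


# Example call, to be placed under `if __name__ == "__main__":` in a script:
#   launch_worker(gpu_runtime_initialized=True)
```

The point of the sketch is that any device capability query made before workers start counts as initializing the runtime, so the start method has to be chosen with that in mind.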
History of failing test
AMD-CI build Buildkite references:
- 1041
- 1077
- 1088
- 1109
- 1111
CC List.
No response