
[ROCm/DCU] Remove BF16 workarounds: is_bfloat16_available, delete_pass, _keep_in_fp32_modules#5112

Open
oldzhu wants to merge 1 commit into PaddlePaddle:develop from oldzhu:hip-bf16-remove-rocm-workarounds

Conversation

@oldzhu

@oldzhu oldzhu commented Apr 23, 2026

Summary

Remove ROCm BF16 workarounds in PaddleX now that the root causes are fixed in upstream Paddle (PaddlePaddle/Paddle#78760).

Closes #5111

Changes

1. paddlex/inference/utils/misc.py

Add 'dcu' to the device allowlist in is_bfloat16_available(). DCU is the device_type string for ROCm/HIP hardware in PaddleX.
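A minimal sketch of what the allowlist change amounts to (names and the allowlist contents are illustrative, not the exact PaddleX source):

```python
# Hypothetical reconstruction of the is_bfloat16_available() allowlist check.
# The real function lives in paddlex/inference/utils/misc.py.
BF16_DEVICE_ALLOWLIST = {"gpu", "xpu", "dcu"}  # 'dcu' newly added for ROCm/HIP

def is_bfloat16_available(device: str) -> bool:
    """Return True when the device type of e.g. 'dcu:0' supports BF16."""
    device_type = device.split(":", 1)[0]
    return device_type in BF16_DEVICE_ALLOWLIST
```

With this change, `is_bfloat16_available('dcu:0')` returns True instead of falling through to the unsupported branch.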

2. paddlex/inference/models/common/static_infer.py

Remove four duplicated `if paddle.is_compiled_with_rocm(): config.delete_pass(...)` blocks that dropped `conv2d_add_act_fuse_pass` and `conv2d_add_fuse_pass`. Root cause fixed in Paddle: a PADDLE_WITH_HIP guard was added to InitializePatterns() in both passes (PaddlePaddle/Paddle#78760).
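The shape of the removed workaround, sketched against a stand-in for `paddle.inference.Config` (the FakeConfig class and function name are illustrative; the real blocks live in static_infer.py):

```python
class FakeConfig:
    """Minimal stand-in for paddle.inference.Config, for illustration only."""
    def __init__(self):
        self.passes = ["conv2d_add_act_fuse_pass", "conv2d_add_fuse_pass"]

    def delete_pass(self, name):
        self.passes.remove(name)

def apply_legacy_rocm_workaround(config, compiled_with_rocm):
    # The block this PR deletes: on ROCm builds, drop the two conv2d fuse
    # passes because fused_conv2d_add_act had no HIP kernel. With the
    # PADDLE_WITH_HIP guard upstream, Paddle no longer emits the fused op
    # on HIP, so PaddleX can leave the passes alone.
    if compiled_with_rocm:
        config.delete_pass("conv2d_add_act_fuse_pass")
        config.delete_pass("conv2d_add_fuse_pass")
```

This PR removes all four copies of the `if compiled_with_rocm:` branch; no replacement logic is needed.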

3. paddlex/inference/models/doc_vlm/modeling/paddleocr_vl/_paddleocr_vl.py

Two changes:

  • Remove _keep_in_fp32_modules = ['visual', 'mlp_AR']. MIOpen BF16 convolution is validated correct on gfx1100/ROCm 7.2 (SNR 44 dB vs FP32 reference, 8/8 tests PASS).
  • Add temporary LayerNorm.forward BF16 compatibility shim for ROCm. Paddle HIP wheel (<=3.4.0.dev20260408) does not register phi::bfloat16 for layer_norm. The shim casts BF16→FP32→BF16 around LayerNorm. Remove this shim after Paddle PR #78760 merges and a new wheel ships.
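The shim's cast-around pattern, sketched with a fake tensor type (FakeTensor, `cast`, and `layer_norm_fp32_only` are all illustrative; the real shim wraps `paddle.nn.LayerNorm.forward`):

```python
class FakeTensor:
    """Toy tensor carrying only a value and a dtype string, for illustration."""
    def __init__(self, value, dtype):
        self.value, self.dtype = value, dtype

def cast(t, dtype):
    return FakeTensor(t.value, dtype)

def layer_norm_fp32_only(x):
    # Stands in for the HIP layer_norm kernel, which in wheels
    # <= 3.4.0.dev20260408 is registered for FP32 but not phi::bfloat16.
    assert x.dtype == "float32", "no BF16 kernel registered"
    return FakeTensor(x.value, "float32")

def shimmed_forward(x):
    # The shim: cast BF16 -> FP32, run layer_norm, cast the result back.
    if x.dtype == "bfloat16":
        return cast(layer_norm_fp32_only(cast(x, "float32")), "bfloat16")
    return layer_norm_fp32_only(x)
```

Non-BF16 inputs pass through unchanged, so the shim is a no-op once the upstream kernel registration lands.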

4. paddlex/inference/models/common/transformers/utils.py

Add 'dcu' → 'gpu' device mapping in device_guard(). paddle.set_device() does not accept 'dcu:N'; on ROCm hardware the device string must be 'gpu:N'.
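The mapping reduces to normalizing the device string before it reaches `paddle.set_device()` (function name is illustrative; the real change is inside device_guard()):

```python
def normalize_device(device_type: str, device_id: int) -> str:
    """Build the device string paddle.set_device() expects on ROCm.

    Hypothetical helper: Paddle addresses ROCm devices as 'gpu:N', so the
    PaddleX-level 'dcu' type must be rewritten before the call.
    """
    if device_type == "dcu":
        device_type = "gpu"
    return f"{device_type}:{device_id}"
```

So `device_guard('dcu', 0)` ends up calling the equivalent of `paddle.set_device('gpu:0')` instead of raising on the unrecognized 'dcu:0' string.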

Validation

Tested on AMD Radeon RX 7900 GRE (gfx1100) + ROCm 7.2.0 + Python 3.12:

| Check | Result |
| --- | --- |
| `is_bfloat16_available('dcu:0')` | ✅ True |
| `_keep_in_fp32_modules` | ✅ None (removed) |
| BF16 conv2d SNR vs FP32 | ✅ 44 dB (8/8 tests pass) |
| `device_guard('dcu', 0)` | ✅ No error |
| PaddleOCR-VL-1.5 BF16 pipeline | ✅ Pass (load 14.6 s, inference 202.8 s, EXIT:0) |
| OCR output | ✅ Correct (5 layout blocks detected, text content verified) |

Evidence log: https://github.com/oldzhu/paddle-amd/blob/main/evidence/bf16_pipeline_validation_gfx1100.log
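For reference, a sketch of the SNR metric behind the "44 dB vs FP32 reference" figure, assuming the conventional definition 20·log10(‖ref‖ / ‖ref − out‖); the actual validation script may compute it differently:

```python
import math

def snr_db(reference, output):
    """Signal-to-noise ratio in dB of `output` against an FP32 `reference`."""
    signal = math.sqrt(sum(r * r for r in reference))
    noise = math.sqrt(sum((r - o) ** 2 for r, o in zip(reference, output)))
    # Identical tensors have zero noise: report infinite SNR.
    return float("inf") if noise == 0.0 else 20.0 * math.log10(signal / noise)
```

By this measure, 44 dB means the BF16 output's error energy is roughly 1/25000 of the reference signal energy, consistent with BF16's ~8-bit mantissa.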

Related

PaddlePaddle/Paddle#78760 ("fix: enable BF16 support for layer_norm and conv2d fuse passes on HIP"), the upstream root-cause fix.
@paddle-bot

paddle-bot Bot commented Apr 23, 2026

Thanks for your contribution!

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
