Fix flaky layer tests by disabling TF32 against eager reference by jlamypoirier · Pull Request #515 · ServiceNow/Fast-LLM

jlamypoirier · 2026-05-15T18:05:07Z

Summary

test_post_norms (in test_decoder_block.py) and test_hybrid_moe_mlp (in test_mlp.py) were flakily failing on CUDA: the block forward uses Triton RMSNorm while the eager reference uses torch.rms_norm, and the ~1e-7 FP-noise between the two gets amplified into ~1e-5 output drift through TF32 matmuls (~1e-3 mantissa precision), occasionally crossing the 1e-5 tolerance.
Fix: promote test_attention.py's local _no_tf32 context manager to tests/utils/utils.py::no_tf32 and use it in both layer tests for the forward-vs-reference comparison.
Drive-bys in test_decoder_block.py::test_post_norms:
- Switch torch.testing.assert_close → Assert.rms_close_relative to match the codebase convention used elsewhere (RMS-based comparison, robust to single-element outliers).
- Change the output_scale parametrization from 2.5 to 0.8 — a realistic Gemma 4 layer_scalar value rather than an artificially-large multiplier.

Test plan

Targeted: 5 consecutive runs of pytest -v -n 4 tests/layers/test_decoder_block.py tests/layers/test_mlp.py — all 17 tests pass each time.
Full suite: pytest -v -n 8 tests/ → 2603 passed, 99 skipped (vs. 2599 passed / 4 failed before the fix, no other regressions).

🤖 Generated with Claude Code

`test_post_norms` and `test_hybrid_moe_mlp` compare the block forward (which uses Triton RMSNorm) to an eager reference built from `torch.rms_norm`. The two norm implementations agree to ~1e-7 (FP32 noise floor), but PyTorch's default TF32 matmul has only ~1e-3 mantissa precision and amplifies that tiny input perturbation into ~1e-5 output drift through the block's matmuls — flakily crossing the 1e-5 tolerance. `test_attention` already worked around this with a local `_no_tf32` context manager. Promote it to `tests.utils.utils.no_tf32` and use it in `test_post_norms` / `test_hybrid_moe_mlp` too. Also: - `test_post_norms` now uses `Assert.rms_close_relative` instead of `torch.testing.assert_close`, matching the codebase convention (the RMS-based assertion is robust to single-element outliers from FP rounding). - The `output_scale` parametrization uses 0.8 (a realistic Gemma 4 layer_scalar value) instead of 2.5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky layer tests by disabling TF32 against eager reference#515

Fix flaky layer tests by disabling TF32 against eager reference#515
jlamypoirier wants to merge 1 commit into
mainfrom
jlp_fix-flaky-layer-tests

jlamypoirier commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jlamypoirier commented May 15, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant