Restrict BF16 backward epilogue reorder to BF16_Optimizer #1
Merged
maxyu1115 merged 1 commit on Apr 29, 2026
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Owner:
Ahhh I didn't test the other optimizers, had Claude do some code review and concluded it was fine. Thanks for fixing it!
Merged commit d5e54e8 into maxyu1115:fix/bf16-optimizer-grad-accum-boundary-leak
Follow-up for deepspeedai#7985.
PR deepspeedai#7985 moved `ZeROOptimizer.backward_epilogue()` before gradient reduction so `BF16_Optimizer` includes the boundary microbatch grad in the fp32 reduction buffer. That ordering is only needed for `BF16_Optimizer`.

This change keeps the pre-allreduce epilogue only for `BF16_Optimizer`, while preserving the previous post-allreduce epilogue ordering for normal ZeRO optimizer paths.

Also adds a focused BF16 regression test for the original issue: the final accumulation microbatch must be included in the reduced fp32 gradient buffer.
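
To illustrate the ordering described above, here is a minimal toy sketch in plain Python. It does not use DeepSpeed, and it is not the code from this PR: `ToyOptimizer`, `ToyBF16Optimizer`, `finish_backward`, `pending_grad`, and `fp32_buffer` are hypothetical names invented for the example. It also assumes that the BF16 epilogue is the step that folds the latest microbatch gradient into the fp32 buffer.

```python
class ToyOptimizer:
    """Toy stand-in for a generic ZeRO optimizer (hypothetical, not DeepSpeed's class)."""

    def __init__(self):
        self.events = []  # records the order of epilogue and allreduce

    def backward_epilogue(self):
        self.events.append("epilogue")


class ToyBF16Optimizer(ToyOptimizer):
    """Toy stand-in for BF16_Optimizer.

    Assumption for this sketch: the epilogue folds the most recent
    microbatch gradient into an fp32 accumulation buffer.
    """

    def __init__(self):
        super().__init__()
        self.pending_grad = 0.0  # gradient of the latest microbatch
        self.fp32_buffer = 0.0   # accumulated gradient that gets reduced

    def backward_epilogue(self):
        super().backward_epilogue()
        self.fp32_buffer += self.pending_grad
        self.pending_grad = 0.0


def finish_backward(optimizer, reduce_fn):
    """Hypothetical helper: run the epilogue before reduction only for the BF16 toy."""
    pre_reduce = isinstance(optimizer, ToyBF16Optimizer)
    if pre_reduce:
        optimizer.backward_epilogue()
    optimizer.events.append("allreduce")
    reduced = reduce_fn(optimizer)
    if not pre_reduce:
        optimizer.backward_epilogue()
    return reduced


# BF16 path: the boundary microbatch must be part of the reduced value.
bf16 = ToyBF16Optimizer()
bf16.fp32_buffer = 1.0   # earlier microbatches, already accumulated
bf16.pending_grad = 0.5  # final (boundary) microbatch
reduced = finish_backward(bf16, lambda opt: opt.fp32_buffer)

# Other optimizers keep the previous post-allreduce ordering.
plain = ToyOptimizer()
finish_backward(plain, lambda opt: None)
```

In this toy, `reduced` is `1.5` for the BF16 path because the epilogue runs before the reduction reads the buffer, while `plain.events` shows the epilogue still running after the allreduce. The real regression test presumably checks the same invariant against the actual fp32 gradient buffer.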