# BF16 Low-Precision Master Weights and Optimizer States
This example demonstrates DeepSpeed's [new low-precision training options](https://github.com/deepspeedai/DeepSpeed/pull/7700) that can significantly reduce memory usage:
- `bf16_master_weights_and_grads`: Keep master parameters and gradients in BF16 instead of FP32
- `bf16_optimizer_states`: Keep optimizer states (e.g., Adam moments) in BF16
These options work with ZeRO Stage 3 and `torch.autocast` to provide memory-efficient training while maintaining numerical stability.
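A minimal DeepSpeed config sketch with both options enabled might look like the following. The surrounding keys (`bf16`, `zero_optimization`) are standard DeepSpeed config sections; the exact placement of the two new flags is an assumption based on the PR linked above, so check the PR for the authoritative schema:

```json
{
  "train_batch_size": 4,
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3
  },
  "bf16_master_weights_and_grads": true,
  "bf16_optimizer_states": true
}
```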
### Running an Example
The following commands run training for 1000 steps on the Wikitext-103 dataset using both the baseline and BF16 low-precision configurations, then generate a loss comparison plot.
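The script and config file names below are hypothetical placeholders (this example's actual entry points may differ); they only illustrate the overall flow of a baseline run, a low-precision run, and a plotting step:

```shell
# Hypothetical entry points; substitute the actual scripts from this example directory.
deepspeed train.py --deepspeed_config ds_config_baseline.json --steps 1000
deepspeed train.py --deepspeed_config ds_config_bf16_low_precision.json --steps 1000
python plot_loss.py --logs baseline.log bf16_low_precision.log --out loss_comparison.png
```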
The model has approximately 6.86 billion parameters (hidden=4096, layers=32, heads=32, batch=1, seq=512).
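As a sanity check, the usual dense-transformer estimate (roughly 12 · layers · hidden² for the transformer blocks, plus embedding tables) lands close to the stated 6.86B. The vocabulary size and untied embeddings below are assumptions, not taken from the example:

```python
hidden, layers, vocab = 4096, 32, 50000  # vocab size is an assumption, not from the example

block_params = 12 * layers * hidden**2   # standard estimate for attention + MLP weights
embedding_params = 2 * vocab * hidden    # untied input/output embeddings (assumed)
total = block_params + embedding_params

print(f"{total / 1e9:.2f}B parameters")  # roughly 6.85B, close to the reported 6.86B
```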
For BF16 low-precision training, we use `torch.autocast`.
With activation checkpointing, peak memory drops significantly for both configurations, but the BF16 low-precision option shows an even larger relative improvement: nearly **40% reduction in peak memory**.
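The autocast pattern looks like the following. This is a CPU-only sketch that just shows the dtype behavior of an autocast region; the real example wraps the DeepSpeed engine's forward pass the same way on GPU:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the transformer; weights stay in their storage dtype
x = torch.randn(4, 512)

# Ops inside the autocast region run in BF16 where supported (e.g., linear/matmul).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```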
The allocated memory reflects the optimizer state memory, which is where the low-precision options provide savings. Peak memory includes activations and temporary buffers, which can vary based on execution order.
## Loss Curve Comparison
To verify that BF16 low-precision training maintains numerical stability, we trained for 1000 steps on the Wikitext-103 dataset. The loss curves show that both configurations converge similarly, demonstrating that the reduced precision does not significantly impact training quality while providing substantial memory savings.

### Memory Breakdown

For a model with N parameters, these options eliminate the FP32 copies, reducing memory by approximately 2 bytes per parameter for master weights and 4 bytes per parameter for optimizer states (for Adam, which has 2 state buffers).

This gives a theoretical ~44% reduction in optimizer-related memory. The actual savings depend on activation memory and other factors, but our results show a very close match to the theoretical savings.
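One plausible per-parameter accounting behind the ~44% figure, assuming the baseline keeps BF16 model weights alongside FP32 master weights, FP32 gradients, and two FP32 Adam moments (the source does not spell out this exact breakdown):

```python
# Bytes per parameter, baseline: BF16 param + FP32 master + FP32 grad + 2 FP32 Adam moments
baseline = 2 + 4 + 4 + 2 * 4        # 18 bytes

# With bf16_master_weights_and_grads and bf16_optimizer_states, everything is BF16
low_precision = 2 + 2 + 2 + 2 * 2   # 10 bytes

reduction = 1 - low_precision / baseline
print(f"{reduction:.1%}")  # 44.4%
```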