Commit fe10b7e (1 parent: 516082c)

add results

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

4 files changed: +2022 −164 lines

# BF16 Low-Precision Master Weights and Optimizer States

This example demonstrates DeepSpeed's [new low-precision training options](https://github.com/deepspeedai/DeepSpeed/pull/7700) that can significantly reduce memory usage:

- `bf16_master_weights_and_grads`: Keep master parameters and gradients in BF16 instead of FP32
- `bf16_optimizer_states`: Keep optimizer states (e.g., Adam moments) in BF16
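In a DeepSpeed config, both options live under the `bf16` section; a configuration along these lines (matching the example's BF16 low-precision setup) enables them together with ZeRO Stage 3 and `torch.autocast`:

```json
{
  "bf16": {
    "enabled": true,
    "bf16_master_weights_and_grads": true,
    "bf16_optimizer_states": true
  },
  "zero_optimization": {
    "stage": 3
  },
  "torch_autocast": {
    "enabled": true,
    "dtype": "torch.bfloat16"
  }
}
```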

### Running an Example

The following commands run training for 1000 steps on the Wikitext-103 dataset using both the baseline and the BF16 low-precision configuration, then generate a loss comparison plot.
The model has approximately 6.86 billion parameters (hidden=4096, layers=32, heads=32, batch=1, seq=512).
For BF16 low-precision training, we use `torch.autocast`.

```bash
# Run 1000 steps with wikitext dataset
# ...
python plot_loss.py --baseline logs/baseline_loss.csv --bf16 logs/bf16_full_loss \
    --output loss_comparison.png
```
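The `torch.autocast` usage follows the standard PyTorch pattern. A minimal sketch (illustrative only, not the example's actual `train.py`; it uses `device_type="cpu"` so it runs anywhere, whereas the example trains on GPU):

```python
import torch

# Inside the autocast region, matmul-heavy ops such as nn.Linear run in BF16,
# while parameters can remain in whatever dtype they are stored in.
model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```

With the `torch_autocast` section enabled in the DeepSpeed config, DeepSpeed applies this cast policy during the engine's forward pass.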

Here is a summary of the memory usage and training time results on 4xH100 GPUs.
Compared to the baseline, the low-precision options cut memory significantly: 9.57 GB allocated (37%) and 12.45 GB peak (39.7%).

| Configuration | Allocated Memory | Peak Memory | Avg Step Time |
|---------------|------------------|-------------|---------------|
| Baseline (fp32 master) | 25.74 GB | 31.38 GB | 0.6016s |
| BF16 low-precision (master + opt states) | **16.17 GB** | **18.93 GB** | 0.6427s |

## Loss Curve Comparison

To verify that BF16 low-precision training maintains numerical stability, we trained for 1000 steps on the Wikitext-103 dataset:

![Loss Comparison](logs/7b_loss_run/loss_comparison.png)

| Configuration | Final Loss | Mean Loss | Loss Std |
|---------------|------------|-----------|----------|
| Baseline (fp32 master) | 3.09 | 2.78 | 1.56 |
| BF16 Low-Precision | 3.12 | 2.90 | 2.37 |

The loss curves show that both configurations converge similarly, demonstrating that the reduced precision does not significantly impact training quality while providing substantial memory savings.

### Memory Breakdown

For a model with N parameters:

| Component | Baseline | BF16 Low-Precision |
|-----------|----------|--------------------|
| ... | ... | ... |
| Adam variance | 4N bytes (FP32) | 2N bytes (BF16) |
| **Total** | **18N bytes** | **10N bytes** |

This gives a theoretical ~44% reduction in optimizer-related memory. The actual savings depend on activation memory and other factors, but our measured results closely match this theoretical figure.
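As a sanity check on the ~44% figure, here is the per-parameter arithmetic implied by the table's 18N vs. 10N totals (the split into individual buffers is our assumption, chosen to be consistent with those totals):

```python
# Bytes per model parameter in each configuration.
params = 2                         # BF16 parameters, identical in both setups
baseline = params + 4 + 4 + 4 + 4  # FP32 master, grads, Adam momentum, Adam variance
bf16_low = params + 2 + 2 + 2 + 2  # the same buffers kept in BF16

assert baseline == 18 and bf16_low == 10
print(f"reduction: {1 - bf16_low / baseline:.1%}")  # reduction: 44.4%
```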

## Related Resources

- [DeepSpeed BF16 Documentation](https://www.deepspeed.ai/docs/config-json/#bf16-training-options)
- [Low-precision master params PR](https://github.com/deepspeedai/DeepSpeed/pull/7700)
