Add three major new sections to flare_272.rst covering work merged after the
initial 2.7.2 draft:
Memory Management (restructured):
- Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210)
- Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache
injected after flare.send() without training script changes (PR NVIDIA#4211)
- Retain existing TensorDownloader and server-side cleanup content
F3 Streaming Reliability and Performance (new section):
- HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall
detection/recovery with recommended env-var settings (PR NVIDIA#4206)
- Stream pool starvation fix: blob callbacks dispatched to dedicated thread
pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172)
- Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167)
- RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204)
- Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174)
Hierarchical FL Startup Stability (new section):
- Deployment timeout correctly classified as failure; min_sites check applied
at deployment phase (PR NVIDIA#4209)
- Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209)
- Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209)
- Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209)
- Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale)
Bug Fixes section updated with all streaming and hierarchical startup fixes.
Intro paragraph updated to reflect system hardening scope.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
NVIDIA FLARE 2.7.2 is a feature release that builds on the Job Recipe API introduced in 2.7.0, bringing it to general availability. This release also delivers major system hardening across the F3 streaming layer, comprehensive memory management improvements for large-model training, and startup stability fixes for large-scale hierarchical FL deployments.
Job Recipe API - Generally Available
====================================
@@ -65,14 +67,21 @@ Key Highlights

For a complete list of available recipes with code examples and links to corresponding examples, see :ref:`available_recipes`.
Memory Management
-----------------

FLARE 2.7.2 delivers a full memory management stack covering the server, the CJ relay process, and the client training process, addressing the peak memory challenges that arise when running large-model FL at scale.

Memory Management with Tensor-based Downloader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

FLARE 2.7.2 introduces the **TensorDownloader** for PyTorch models, extending the FileDownloader concept introduced in 2.7.0 specifically for tensor data. This feature addresses critical memory challenges when working with large language models (LLMs) and other large-scale models in federated learning.

Key Features
^^^^^^^^^^^^

- **Zero Code Changes Required**: Your existing PyTorch FL jobs benefit from memory optimization without any modification.
@@ -81,7 +90,7 @@ Key Features

- **Pull-based Architecture**: Unlike push-based streaming, each recipient pulls data at its own pace, making it more reliable for heterogeneous network conditions.
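The pull model can be illustrated with a toy sketch (hypothetical names, not FLARE's actual streaming code): the sender publishes chunks once, and each recipient requests them whenever it is ready, so a slow client never blocks a fast one.

```python
# Toy illustration of pull-based delivery (hypothetical, not FLARE's API):
# the sender publishes chunks once; each recipient pulls chunk i on its own
# schedule instead of having data pushed at a sender-chosen rate.
from typing import List


class ChunkStore:
    """Sender-side store: chunks are published once, pulled many times."""

    def __init__(self, payload: bytes, chunk_size: int):
        self.chunks: List[bytes] = [
            payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)
        ]

    def pull(self, index: int) -> bytes:
        # The recipient decides when (and whether) to call this.
        return self.chunks[index]


store = ChunkStore(b"x" * 10, chunk_size=4)  # 3 chunks: 4 + 4 + 2 bytes

# Two recipients progress independently at their own pace:
fast = b"".join(store.pull(i) for i in range(len(store.chunks)))
slow = store.pull(0)  # this client has only fetched chunk 0 so far
```

Because the store is passive, recipient heterogeneity costs nothing on the sender side; each client's progress is tracked only by the client itself.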
Performance Results
^^^^^^^^^^^^^^^^^^^

Based on our internal testing with a 5GB model and 4 clients using FedAvg, we observed **20% to 50% memory usage reduction** on both server and client sides.
@@ -90,7 +99,7 @@ Based on our internal testing with a 5GB model and 4 clients using FedAvg, we ob

Your results may vary depending on model size, number of clients, network conditions, and different FL algorithms and workflows.

How It Works
^^^^^^^^^^^^

The TensorDownloader operates transparently behind the scenes:
@@ -116,7 +125,7 @@ For advanced users who need direct control, the low-level API is available:

    )

Benefits for LLM Training
^^^^^^^^^^^^^^^^^^^^^^^^^

- **Reduced Memory Footprint**: 20-50% reduction, critical for large models that approach memory limits
- **Improved Scalability**: Multiple clients can download at different rates without blocking
@@ -131,9 +140,47 @@ Benefits for LLM Training

- User guide with configuration and tuning: :ref:`tensor_downloader`

Zero Tensor Copy at the CJ Process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For hierarchical and large-model deployments, the Client Job (CJ) relay process previously deserialized and re-serialized every model tensor before forwarding it to the client subprocess. This doubled the memory footprint at the relay tier for every round.

FLARE 2.7.2 introduces a **pass-through architecture** for ``ClientAPILauncherExecutor``:

- **Lazy references instead of full tensors**: The CJ process holds lightweight ``LazyDownloadRef`` placeholders rather than materializing the full model, so the CJ memory footprint is independent of model size.
- **Direct subprocess download**: The training subprocess fetches tensors directly from the FL server, eliminating the CJ as a memory bottleneck and halving network transfers between the server and CJ tier.
- **Zero code changes**: Existing jobs using ``ClientAPILauncherExecutor`` benefit automatically.

This is particularly impactful for LLM-scale models (7B-70B parameters) where CJ memory previously equaled the full model size.
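A minimal sketch of the lazy-reference idea, using a hypothetical ``LazyRef`` class and resolver rather than FLARE's actual ``LazyDownloadRef`` implementation: the relay holds only a lightweight handle, and the payload is materialized only where it is consumed.

```python
# Hypothetical sketch of the pass-through idea (not FLARE's implementation):
# the relay holds a small reference; only the consumer resolves it to data.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class LazyRef:
    """Placeholder carrying just enough metadata to fetch the payload later."""
    ref_id: str
    fetch: Callable[[str], Any]  # resolver supplied by the download layer

    def resolve(self) -> Any:
        # Only here is the full tensor materialized -- in the training
        # subprocess, never in the relay (CJ) process.
        return self.fetch(self.ref_id)


# Stand-in for server-side tensor storage:
store = {"layer0.weight": [0.1] * 4}

ref = LazyRef("layer0.weight", store.get)  # what the CJ process forwards
weights = ref.resolve()                    # what the training subprocess does
```

The relay's memory use is proportional to the number of references, not to the model size, which is the essence of the zero-copy claim above.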
Client-Side Memory Management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

FLARE 2.7.2 extends memory lifecycle control to the client training process, complementing the existing server-side cleanup:

- **Allocator-aware cleanup**: After each ``flare.send()`` call, FLARE automatically invokes ``gc.collect()`` plus allocator-specific trimming (``malloc_trim(0)`` for glibc on Linux, jemalloc arena purge where available, and ``torch.cuda.empty_cache()`` for GPU memory), returning freed pages to the OS between rounds.
- **Configurable frequency**: Cleanup runs every ``N`` rounds (default: every round), configurable via recipe parameters (``client_memory_gc_rounds``) and ``ScriptRunner``.
- **No training script changes**: Cleanup is injected transparently into the FLARE client lifecycle without touching user training code.
- **Combined with server-side cleanup**: Together with the server-side garbage collection introduced in 2.7.2, this prevents unbounded RSS growth in both the server and client processes across long-running jobs with many rounds.
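The general allocator-aware cleanup technique can be approximated in plain Python. This is a hedged sketch of the pattern (guarded for non-glibc platforms and for environments without PyTorch), not FLARE's injected hook:

```python
# Sketch of allocator-aware cleanup between rounds (not FLARE's hook).
# malloc_trim is glibc-specific and torch may be absent, so both are guarded.
import ctypes
import ctypes.util
import gc


def release_memory() -> None:
    gc.collect()  # drop unreachable Python objects first

    libc_path = ctypes.util.find_library("c")
    if libc_path:
        try:
            # glibc only: return freed heap pages to the OS
            ctypes.CDLL(libc_path).malloc_trim(0)
        except (OSError, AttributeError):
            pass  # non-glibc allocators (musl, jemalloc) lack malloc_trim

    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached GPU allocator blocks
    except ImportError:
        pass  # CPU-only environment without PyTorch


release_memory()  # e.g. called after each flare.send()
```

Calling this once per round (or every ``N`` rounds) bounds resident memory without any change to the training loop itself, which is the effect the built-in cleanup provides.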
Server-Side Memory Cleanup
~~~~~~~~~~~~~~~~~~~~~~~~~~

FLARE 2.7.2 adds automatic server-side memory management to address RSS (Resident Set Size, the actual physical memory used by a process) growth in long-running jobs:
.. code-block:: json

   // config_fed_client.json - extend sync timeout for Lustre/HPC
   {
     "runner_sync_timeout": 120,
     "max_runner_sync_timeout": 7200
   }
Comprehensive Timeout Documentation
-----------------------------------

@@ -200,7 +407,7 @@ Documentation
- **Timeout Documentation**: New :doc:`/user_guide/timeout_troubleshooting` for common job failures, and :doc:`/programming_guide/timeouts` as comprehensive reference for all 100+ timeout parameters.
- **Memory Management Guide**: New :doc:`/programming_guide/memory_management` covering server-side and client-side garbage collection, ``MALLOC_ARENA_MAX`` tuning, platform compatibility, and troubleshooting.
- **Tensor Downloader Guide**: Expanded :doc:`/programming_guide/tensor_downloader` with configuration examples, architecture details, and tuning guidance.
@@ -213,10 +420,16 @@ Documentation

Bug Fixes
~~~~~~~~~
- Fixed F3 streaming Head-of-Line stall: ``send_frame()`` no longer holds the connection lock without a timeout bound.
- Fixed RxTask self-deadlock triggered by stream error signals during active receive.
- Fixed stream thread pool starvation that prevented concurrent model downloads from completing.
- Fixed deployment timeout silent pass-through: timed-out clients are now counted against ``min_sites``.
- Fixed premature dead-job detection: clients are no longer reported missing before their first positive heartbeat.
- Fixed ``TypeError`` crash in the client job process when job metadata is absent (replaced with a descriptive ``RuntimeError``).
- Fixed Swarm Learning self-message deadlock for local result submission.
- Fixed TLS corruption by replacing ``fork`` with ``posix_spawn`` for subprocess creation.
- Fixed potential data corruption issue in the Streamer component.
- Fixed Swarm Learning controller compatibility with tensor streaming.
- Fixed XGBoost adaptor and recipe integration issues.
- Addressed client-side vulnerability for tree-based horizontal XGBoost.