
Commit 28b9f7c

chesterxgchen and claude committed

[2.7] Update 2.7.2 release notes: streaming hardening, memory management, hierarchical startup stability

Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft.

Memory Management (restructured):

- Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210)
- Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211)
- Retain existing TensorDownloader and server-side cleanup content

F3 Streaming Reliability and Performance (new section):

- HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206)
- Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172)
- Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167)
- RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204)
- Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174)

Hierarchical FL Startup Stability (new section):

- Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209)
- Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209)
- Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209)
- Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209)
- Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale)

Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 572990d commit 28b9f7c

File tree

1 file changed: +224 additions, -11 deletions

docs/release_notes/flare_272.rst

Lines changed: 224 additions & 11 deletions
@@ -2,9 +2,11 @@
 What's New in FLARE v2.7.2
 **************************
 
-NVIDIA FLARE 2.7.2 is a feature release that builds on the Job Recipe API introduced in 2.7.0, bringing it to general availability.
-This release also introduces significant memory management improvements with the new Tensor-based Downloader for efficient large model handling,
-and comprehensive timeout documentation.
+NVIDIA FLARE 2.7.2 is a feature release that builds on the Job Recipe API introduced in 2.7.0,
+bringing it to general availability.
+This release also delivers major system hardening across the F3 streaming layer, comprehensive
+memory management improvements for large-model training, and startup stability fixes for
+large-scale hierarchical FL deployments.
 
 Job Recipe API - Generally Available
 =====================================
@@ -65,14 +67,21 @@ Key Highlights
 
 For a complete list of available recipes with code examples and links to corresponding examples, see :ref:`available_recipes`.
 
+Memory Management
+-----------------
+
+FLARE 2.7.2 delivers a full memory management stack covering the server, the CJ relay process,
+and the client training process — addressing the peak memory challenges that arise when running
+large-model FL at scale.
+
 Memory Management with Tensor-based Downloader
-----------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 FLARE 2.7.2 introduces the **TensorDownloader** for PyTorch models, extending the FileDownloader concept introduced in 2.7.0 specifically for tensor data.
 This feature addresses critical memory challenges when working with large language models (LLMs) and other large-scale models in federated learning.
 
 Key Features
-~~~~~~~~~~~~
+^^^^^^^^^^^^
 
 - **Zero Code Changes Required**: Your existing PyTorch FL jobs benefit from memory optimization without any modification.
 
@@ -81,7 +90,7 @@ Key Features
 - **Pull-based Architecture**: Unlike push-based streaming, each recipient pulls data at its own pace, making it more reliable for heterogeneous network conditions.
 
 Performance Results
-~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^
 
 Based on our internal testing with a 5GB model and 4 clients using FedAvg, we observed **20% to 50% memory usage reduction** on both server and client sides.
 
@@ -90,7 +99,7 @@
 Your results may vary depending on model size, number of clients, network conditions, and different FL algorithms and workflows.
 
 How It Works
-~~~~~~~~~~~~
+^^^^^^^^^^^^
 
 The TensorDownloader operates transparently behind the scenes:
 
@@ -116,7 +125,7 @@ For advanced users who need direct control, the low-level API is available:
     )
 
 Benefits for LLM Training
-~~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^^
 
 - **Reduced Memory Footprint**: 20-50% reduction critical for large models that approach memory limits
 - **Improved Scalability**: Multiple clients can download at different rates without blocking
@@ -131,9 +140,47 @@ Benefits for LLM Training
 - User guide with configuration and tuning: :ref:`tensor_downloader`
 - FOBS decomposer architecture: :ref:`decomposer_for_large_object`
 
+Zero Tensor Copy at the CJ Process (Pass-Through)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For hierarchical and large-model deployments, the Client Job (CJ) relay process previously
+deserialized and re-serialized every model tensor before forwarding it to the client subprocess.
+This doubled the memory footprint at the relay tier for every round.
+
+FLARE 2.7.2 introduces a **pass-through architecture** for ``ClientAPILauncherExecutor``:
+
+- **Lazy references instead of full tensors**: The CJ process holds lightweight
+  ``LazyDownloadRef`` placeholders rather than materializing the full model, so the CJ
+  memory footprint is independent of model size.
+- **Direct subprocess download**: The training subprocess fetches tensors directly from the
+  FL server, eliminating the CJ as a memory bottleneck and halving network transfers between
+  the server and CJ tier.
+- **Zero code changes**: Existing jobs using ``ClientAPILauncherExecutor`` benefit
+  automatically.
+
+This is particularly impactful for LLM-scale models (7B–70B parameters) where CJ memory
+previously equalled the full model size.
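The pass-through idea can be illustrated with a minimal sketch. Note that `LazyRef`, `relay_forward`, and `materialize` are hypothetical names for illustration only; FLARE's actual `LazyDownloadRef` machinery is more involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LazyRef:
    # Hypothetical placeholder: identifies a tensor on the server
    # without holding its bytes.
    ref_id: str
    nbytes: int


def relay_forward(state: Dict[str, LazyRef]) -> Dict[str, LazyRef]:
    # The relay passes lightweight references through unchanged; its memory
    # cost is O(number of tensors), not O(model size).
    return dict(state)


def materialize(state: Dict[str, LazyRef], fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    # Only the training subprocess downloads the actual tensor payloads.
    return {name: fetch(ref.ref_id) for name, ref in state.items()}


# Demo with an in-memory stand-in for the FL server
server = {"r1": b"\x00" * 8, "r2": b"\x01" * 4}
refs = {"layer.weight": LazyRef("r1", 8), "layer.bias": LazyRef("r2", 4)}
forwarded = relay_forward(refs)          # relay tier: no tensor bytes held
tensors = materialize(forwarded, server.__getitem__)  # subprocess: real download
```

The key property is that `relay_forward` never touches tensor bytes, which is why the relay's footprint stays flat as models grow.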
+
+Client-Side Memory Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+FLARE 2.7.2 extends memory lifecycle control to the client training process, complementing
+the existing server-side cleanup:
+
+- **Allocator-aware cleanup**: After each ``flare.send()`` call, FLARE automatically
+  invokes ``gc.collect()`` plus allocator-specific trimming — ``malloc_trim(0)`` for
+  glibc (Linux), jemalloc arena purge where available, and ``torch.cuda.empty_cache()``
+  for GPU memory — returning freed pages to the OS between rounds.
+- **Configurable frequency**: Cleanup runs every ``N`` rounds (default: every round),
+  configurable via recipe parameters (``client_memory_gc_rounds``) and ``ScriptRunner``.
+- **No training script changes**: Cleanup is injected transparently into the FLARE
+  client lifecycle without touching user training code.
+- **Combined with server-side cleanup**: Together with the server-side garbage collection
+  introduced in 2.7.2, this prevents unbounded RSS growth in both the server and client
+  processes across long-running jobs with many rounds.
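The allocator-aware cleanup described above can be sketched in standalone form. This is not FLARE's injected implementation, just a best-effort illustration of the same three steps (garbage collect, glibc trim, CUDA cache release):

```python
import ctypes
import ctypes.util
import gc


def release_memory() -> bool:
    """Best-effort cleanup after a round: collect garbage, then ask the
    allocator to return freed pages to the OS. Returns True if glibc's
    malloc_trim reported that memory was released."""
    gc.collect()
    trimmed = False
    libc_name = ctypes.util.find_library("c")
    if libc_name:
        libc = ctypes.CDLL(libc_name)
        # malloc_trim exists on glibc (Linux); other libcs may lack it
        if hasattr(libc, "malloc_trim"):
            trimmed = bool(libc.malloc_trim(0))
    try:
        import torch  # optional: only if PyTorch is installed
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached GPU allocator blocks
    except ImportError:
        pass
    return trimmed
```

A script could call `release_memory()` after each send if it were not using the built-in injection; with FLARE 2.7.2's client-side management, no such call is needed in user code.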
 
 Server-Side Memory Cleanup
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 FLARE 2.7.2 adds automatic server-side memory management to address RSS (Resident Set Size — the actual physical memory used by a process) growth in long-running jobs:
 
@@ -146,6 +193,166 @@ FLARE 2.7.2 adds automatic server-side memory management to address RSS (Residen
 
 For configuration details, platform compatibility, recommended settings, and API reference, see :doc:`/programming_guide/memory_management`.
 
+F3 Streaming Reliability and Performance
+----------------------------------------
+
+A focused hardening effort on the F3 streaming layer addresses several concurrency and
+stability issues that manifested at scale, particularly in hierarchical and large-model
+deployments.
+
+Head-of-Line (HOL) Stall Mitigation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In 2.7.0/2.7.1, a slow or congested connection could hold the per-connection SFM send lock
+indefinitely, blocking all outgoing traffic on that relay — heartbeats, admin commands, and
+task requests — behind a single large frame send.
+
+FLARE 2.7.2 eliminates this with a multi-layer guard:
+
+- **Bounded send timeout**: ``send_frame()`` now has a configurable deadline
+  (``STREAMING_SEND_TIMEOUT``); a send that exceeds it raises rather than blocking forever.
+- **ACK-progress watchdog**: A background monitor checks that ACKs advance within
+  ``STREAMING_ACK_PROGRESS_TIMEOUT``; if a connection stalls, it is flagged.
+- **Stall detection and optional recovery**: Consecutive stall detections (configurable via
+  ``SFM_SEND_STALL_CONSECUTIVE_CHECKS``) can optionally trigger a connection reset
+  (``SFM_CLOSE_STALLED_CONNECTION``), unblocking all pending traffic.
+
+.. code-block:: text
+
+   # Recommended server environment settings for large hierarchical deployments
+   STREAMING_SEND_TIMEOUT=30
+   STREAMING_ACK_PROGRESS_TIMEOUT=60
+   SFM_SEND_STALL_TIMEOUT=75
+   SFM_SEND_STALL_CONSECUTIVE_CHECKS=3
+   SFM_CLOSE_STALLED_CONNECTION=true   # enable after confirming no false positives
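The ACK-progress watchdog pattern can be sketched as follows. This is a hypothetical simplification, not FLARE's monitor: progress on an ACK counter resets the stall count, and only consecutive stalled polls flag the connection:

```python
import threading
import time


class AckWatchdog:
    """Sketch of an ACK-progress watchdog: if the acked counter stops
    advancing for longer than `timeout` across `checks` consecutive polls,
    the connection is flagged as stalled."""

    def __init__(self, timeout: float, checks: int):
        self.timeout = timeout
        self.checks = checks
        self.acked = 0
        self.last_progress = time.monotonic()
        self.stall_count = 0
        self.stalled = False
        self.lock = threading.Lock()

    def on_ack(self, new_acked: int) -> None:
        # Called by the receive path whenever an ACK arrives.
        with self.lock:
            if new_acked > self.acked:
                self.acked = new_acked
                self.last_progress = time.monotonic()
                self.stall_count = 0

    def poll(self) -> bool:
        # Called periodically by a background monitor thread.
        with self.lock:
            if time.monotonic() - self.last_progress > self.timeout:
                self.stall_count += 1
            else:
                self.stall_count = 0
            self.stalled = self.stall_count >= self.checks
            return self.stalled
```

In a real deployment the `poll()` caller would, on `True`, log the stall and optionally reset the connection, mirroring the `SFM_CLOSE_STALLED_CONNECTION` behavior.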
+
+Stream Pool Starvation Fix
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A thread pool deadlock in blob streaming caused ``handle_blob_cb`` to be dispatched on the
+same stream worker pool that it was waiting to resolve — exhausting the pool and preventing
+any concurrent downloads from completing.
+
+The fix dispatches blob callbacks to a dedicated ``callback_thread_pool``, keeping stream
+workers free. An end-to-end test validates that 8 concurrent downloads complete without
+starvation.
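The dedicated-pool pattern can be illustrated with a small sketch (assumed structure, not FLARE's code): stream workers hand completed blobs off to a separate pool rather than running user callbacks on their own threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Two pools: stream workers stay free to process frames; callbacks run elsewhere.
stream_workers = ThreadPoolExecutor(max_workers=2, thread_name_prefix="stream")
callback_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="cb")


def handle_blob(blob: bytes, on_done) -> None:
    # Dispatch the (potentially blocking) callback to the dedicated pool so
    # the stream worker is released immediately for the next frame.
    callback_pool.submit(on_done, blob)


results = []
futures = [
    stream_workers.submit(handle_blob, b"x" * n, results.append) for n in range(8)
]
for f in futures:
    f.result()  # stream-side work done; callbacks may still be running
callback_pool.shutdown(wait=True)   # wait for all callbacks to finish
stream_workers.shutdown(wait=True)
```

With a single shared pool, a blocking callback could occupy the last stream worker and starve the very download it was waiting for; splitting the pools removes that cycle.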
+
+Streaming Download Retry on Timeout
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Transient timeouts during streaming downloads (particularly in LLM swarming scenarios over
+congested networks) previously resulted in silent stream loss. FLARE 2.7.2 adds structured
+retry semantics:
+
+- **Exponential-backoff retry**: Up to 3 retries with configurable backoff, capped at 60 s.
+- **Abort-signal aware**: The retry loop respects abort signals; no stale retries after job stop.
+- **State-safe**: Retry is idempotent; re-requesting the same stream is safe for the server.
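The retry semantics above can be sketched generically. The function name and parameters here are illustrative, not FLARE's API; the shape (capped exponential backoff, abort-aware sleep) matches the description:

```python
import threading


def download_with_retry(fetch, *, retries=3, base_delay=1.0, max_delay=60.0,
                        abort: threading.Event = None):
    """Sketch of retry-on-timeout: exponential backoff capped at max_delay,
    aborting early if the job's abort signal fires."""
    abort = abort or threading.Event()
    delay = base_delay
    for attempt in range(retries + 1):
        if abort.is_set():
            raise RuntimeError("download aborted")
        try:
            return fetch()  # idempotent: re-requesting the stream is safe
        except TimeoutError:
            if attempt == retries:
                raise  # retries exhausted; surface the timeout
            abort.wait(delay)  # sleep, but wake immediately if aborted
            delay = min(delay * 2, max_delay)
```

Using `Event.wait` instead of `time.sleep` is what makes the backoff abort-aware: a job stop wakes the loop instantly instead of after the full delay.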
+
+RxTask Self-Deadlock Fix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A self-deadlock in the receiver path occurred when ``find_or_create_task()`` called
+``task.stop()`` while holding ``RxTask.map_lock``, and ``stop()`` also acquired
+``map_lock``. This was triggered by stream error signals arriving for an active task.
+
+The fix defers the ``stop()`` call until after ``map_lock`` is released, eliminating the
+deadlock without changing the correctness of stream error handling.
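The deferred-stop pattern is general enough to show in isolation. This sketch uses hypothetical names (`TaskMap`, `on_stream_error`), not FLARE's actual classes: decide *which* task to stop while holding the lock, but call `stop` only after releasing it.

```python
import threading


class TaskMap:
    """Sketch of the deadlock-free pattern: stop() re-acquires map_lock to
    remove the task, so it must never be called while map_lock is held."""

    def __init__(self):
        self.map_lock = threading.Lock()  # non-reentrant, like the original
        self.tasks = {}

    def stop_task(self, task_id):
        with self.map_lock:  # would self-deadlock if called under map_lock
            self.tasks.pop(task_id, None)

    def on_stream_error(self, task_id):
        to_stop = None
        with self.map_lock:
            if task_id in self.tasks:
                to_stop = task_id  # defer: record the decision only
        if to_stop is not None:
            self.stop_task(to_stop)  # lock released: no self-deadlock
```

With a non-reentrant `Lock`, moving the `stop_task` call inside the `with` block would hang forever; deferring it preserves the same observable behavior without the deadlock.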
+
+Lock Contention Reduction in Model Downloads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``produce_item()`` call in the cacheable streaming layer previously ran inside
+``self.lock``, serializing all concurrent clients needing the same model chunk behind a
+single slow operation. The lock scope has been reduced so that item production occurs
+outside the lock, with a compare-and-store pattern for the cache write. This significantly
+reduces model-download latency when many clients (e.g., 24 per relay) request the same
+chunk concurrently.
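The compare-and-store pattern can be sketched as follows (a simplified illustration, not the actual cacheable-streaming code): the critical section shrinks to two short map operations, and the slow production runs unlocked.

```python
import threading


class ChunkCache:
    """Sketch of reduced lock scope: produce the item outside the lock, then
    compare-and-store so the first finished producer wins and any duplicate
    result is discarded."""

    def __init__(self, produce_item):
        self.produce_item = produce_item
        self.lock = threading.Lock()
        self.cache = {}

    def get(self, key):
        with self.lock:  # fast path: short critical section
            item = self.cache.get(key)
        if item is not None:
            return item
        item = self.produce_item(key)  # slow work, no lock held
        with self.lock:  # compare-and-store: keep whichever landed first
            return self.cache.setdefault(key, item)
```

The trade-off is that two threads may occasionally produce the same chunk concurrently; the `setdefault` keeps the cache consistent, and duplicated work is far cheaper than serializing every reader behind one slow producer.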
+
+Hierarchical FL Startup Stability
+---------------------------------
+
+Large-scale hierarchical FL deployments (many clients across relay tiers) are subject to
+startup race conditions that can abort jobs before training begins. FLARE 2.7.2 addresses
+these with a set of coordinated fixes and new configuration controls.
+
+.. note::
+
+   This fix set is particularly relevant for HPC deployments such as Frontier (ORNL) and
+   similar supercomputers where 100+ FL clients start via a batch scheduler (e.g., Slurm)
+   under shared filesystem (Lustre) load.
+
+Deployment Timeout Now Treated as Failure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Previously, a client that did not acknowledge job deployment within the timeout window
+(``reply=None``) was silently treated as successfully deployed. The server then started
+the job with that client in the participant list, creating a state inconsistency that
+led to premature dead-client detection and job abort.
+
+FLARE 2.7.2 correctly classifies deployment timeouts as failures, applying the existing
+``min_sites`` / ``required_sites`` tolerance check at the deployment phase. Timed-out
+clients are excluded from the job before ``start_client_job`` is called, preventing the
+state inconsistency from ever forming.
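The deployment-phase check can be summarized in a few lines. This is a hypothetical sketch of the logic, not FLARE's implementation: a `None` reply counts as a failure, and the tolerance check runs before the job starts.

```python
def check_deployment(replies: dict, min_sites: int) -> list:
    """Sketch: a None reply (deployment timeout) counts as a failure, and the
    min_sites tolerance is applied before the job starts. Returns the list of
    successfully deployed sites, or raises if too few remain."""
    deployed = [site for site, reply in replies.items() if reply is not None]
    if len(deployed) < min_sites:
        raise RuntimeError(
            f"only {len(deployed)} of required {min_sites} sites deployed"
        )
    return deployed  # only these sites proceed to start_client_job
```

Excluding timed-out sites here, rather than after start, is what prevents the participant list from ever referencing a client that never acknowledged deployment.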
+
+Startup Grace Period for Dead-Client Detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server's heartbeat monitor previously had no startup grace period: it could fire a
+dead-job notification on the very first heartbeat from a client that was not yet running
+the job. For clients that were still initializing (slow filesystem, GPU allocation,
+subprocess spawning), this caused premature dead-client classification.
+
+FLARE 2.7.2 adds a debounce mechanism: a client must first be positively observed reporting
+the job in a heartbeat before a subsequent missing report triggers a dead-job notification.
+This gives clients the time they need to start without false alarms.
+
+This behavior is now the **default** (``sync_client_jobs_require_previous_report=true``).
+Operators who need the legacy aggressive detection can opt out via configuration.
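The debounce rule can be expressed compactly. The class and method names below are illustrative, not FLARE's; the logic mirrors the description: no alarm until the client has been positively seen reporting the job once.

```python
class DeadJobDetector:
    """Sketch of the debounce rule: a missing job report only counts as
    'dead' if the client previously reported the job in a heartbeat."""

    def __init__(self, require_previous_report: bool = True):
        self.require_previous_report = require_previous_report
        self.seen = set()  # (client, job_id) pairs positively observed

    def on_heartbeat(self, client: str, running_jobs: set, job_id: str) -> bool:
        """Return True if a dead-job notification should fire."""
        if job_id in running_jobs:
            self.seen.add((client, job_id))  # positive observation
            return False
        if self.require_previous_report and (client, job_id) not in self.seen:
            return False  # still starting up: no alarm yet
        return True  # was running before, now missing: report dead
```

Setting `require_previous_report=False` reproduces the legacy aggressive behavior, where any heartbeat lacking the job triggers the notification.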
+
+Selective Client Exclusion on Start-Job Timeout
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When strict start-job reply checking is enabled
+(``strict_start_job_reply_check=true``), clients that time out at the start-job phase are
+now **excluded from the run** rather than causing a full job abort — provided the remaining
+active client count still satisfies ``min_clients``. A warning is logged identifying the
+excluded clients.
+
+This allows a job to proceed with, e.g., 142 of 144 clients when 2 stragglers fail to
+respond, rather than aborting when the training majority is ready.
+
+Hardened Client Job Metadata Parsing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a client process started after the job was already aborted, it would crash with an
+opaque ``TypeError: 'NoneType' object is not iterable`` when reading job client metadata.
+FLARE 2.7.2 replaces this with an explicit ``RuntimeError`` that names the missing field,
+making the failure actionable in logs.
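The hardening pattern itself is simple to illustrate. The field name `job_clients` and the function below are hypothetical, used only to show the TypeError-to-RuntimeError replacement described above:

```python
def get_job_clients(meta: dict) -> list:
    """Sketch: fail with a descriptive RuntimeError instead of an opaque
    TypeError when the metadata field is missing (e.g., the job was aborted
    before this client process started)."""
    clients = meta.get("job_clients") if meta else None
    if clients is None:
        raise RuntimeError(
            "missing 'job_clients' in job metadata; the job may have been "
            "aborted before this client started"
        )
    return list(clients)  # iterating None here is what raised the old TypeError
```

The payoff is purely operational: an on-call operator reading logs sees which field was absent and why, instead of a bare `'NoneType' object is not iterable` traceback.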
+
+Recommended Configuration for Large-Scale Hierarchical Deployments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: json
+
+   // config_fed_server.json — relax FedAvg tolerance
+   {
+       "workflows": [{
+           "path": "nvflare.app_common.workflows.fedavg.FedAvg",
+           "args": {
+               "num_clients": 144,
+               "min_clients": 138
+           }
+       }]
+   }
+
+.. code-block:: json
+
+   // config_fed_client.json — extend sync timeout for Lustre/HPC
+   {
+       "runner_sync_timeout": 120,
+       "max_runner_sync_timeout": 7200
+   }
+
 Comprehensive Timeout Documentation
 ------------------------------------
 
@@ -200,7 +407,7 @@ Documentation
 
 - **Timeout Documentation**: New :doc:`/user_guide/timeout_troubleshooting` for common job failures, and :doc:`/programming_guide/timeouts` as comprehensive reference for all 100+ timeout parameters.
 
-- **Memory Management Guide**: New :doc:`/programming_guide/memory_management` covering server-side garbage collection, ``MALLOC_ARENA_MAX`` tuning, platform compatibility, and troubleshooting.
+- **Memory Management Guide**: New :doc:`/programming_guide/memory_management` covering server-side and client-side garbage collection, ``MALLOC_ARENA_MAX`` tuning, platform compatibility, and troubleshooting.
 
 - **Tensor Downloader Guide**: Expanded :doc:`/programming_guide/tensor_downloader` with configuration examples, architecture details, and tuning guidance.
 
@@ -213,10 +420,16 @@ Documentation
 Bug Fixes
 ~~~~~~~~~
 
+- Fixed F3 streaming Head-of-Line stall: ``send_frame()`` no longer holds the connection lock without a timeout bound.
+- Fixed RxTask self-deadlock triggered by stream error signals during active receive.
+- Fixed stream thread pool starvation that prevented concurrent model downloads from completing.
+- Fixed deployment timeout silent pass-through: timed-out clients are now counted against ``min_sites``.
+- Fixed premature dead-job detection: clients are no longer reported missing before their first positive heartbeat.
+- Fixed ``TypeError`` crash in client job process when job metadata is absent (replaced with descriptive ``RuntimeError``).
+- Fixed Swarm Learning self-message deadlock for local result submission.
 - Fixed TLS corruption by replacing ``fork`` with ``posix_spawn`` for subprocess creation.
 - Fixed potential data corruption issue in the Streamer component.
 - Fixed Swarm Learning controller compatibility with tensor streaming.
-- Fixed Swarm Learning controller bug.
 - Fixed XGBoost adaptor and recipe integration issues.
 - Addressed client-side vulnerability for tree-based horizontal XGBoost.
 - Fixed NumPy cross-site evaluation regression.