
Commit 28b9f7c

chesterxgchen and claude committed

[2.7] Update 2.7.2 release notes: streaming hardening, memory management, hierarchical startup stability

Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft.

Memory Management (restructured):

- Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210)
- Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211)
- Retain existing TensorDownloader and server-side cleanup content

F3 Streaming Reliability and Performance (new section):

- HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206)
- Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172)
- Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167)
- RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204)
- Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174)

Hierarchical FL Startup Stability (new section):

- Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209)
- Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209)
- Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209)
- Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209)
- Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale)

Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 572990d commit 28b9f7c

File tree

1 file changed: +224 additions, -11 deletions

docs/release_notes/flare_272.rst

Lines changed: 224 additions & 11 deletions
@@ -2,9 +2,11 @@
 What's New in FLARE v2.7.2
 **************************
 
-NVIDIA FLARE 2.7.2 is a feature release that builds on the Job Recipe API introduced in 2.7.0, bringing it to general availability.
-This release also introduces significant memory management improvements with the new Tensor-based Downloader for efficient large model handling,
-and comprehensive timeout documentation.
+NVIDIA FLARE 2.7.2 is a feature release that builds on the Job Recipe API introduced in 2.7.0,
+bringing it to general availability.
+This release also delivers major system hardening across the F3 streaming layer, comprehensive
+memory management improvements for large-model training, and startup stability fixes for
+large-scale hierarchical FL deployments.
 
 Job Recipe API - Generally Available
 =====================================
@@ -65,14 +67,21 @@ Key Highlights
 
 For a complete list of available recipes with code examples and links to corresponding examples, see :ref:`available_recipes`.
 
+Memory Management
+-----------------
+
+FLARE 2.7.2 delivers a full memory management stack covering the server, the CJ relay process,
+and the client training process — addressing the peak memory challenges that arise when running
+large-model FL at scale.
+
 Memory Management with Tensor-based Downloader
-----------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 FLARE 2.7.2 introduces the **TensorDownloader** for PyTorch models, extending the FileDownloader concept introduced in 2.7.0 specifically for tensor data.
 This feature addresses critical memory challenges when working with large language models (LLMs) and other large-scale models in federated learning.
 
 Key Features
-~~~~~~~~~~~~
+^^^^^^^^^^^^
 
 - **Zero Code Changes Required**: Your existing PyTorch FL jobs benefit from memory optimization without any modification.
 
@@ -81,7 +90,7 @@ Key Features
 - **Pull-based Architecture**: Unlike push-based streaming, each recipient pulls data at its own pace, making it more reliable for heterogeneous network conditions.
 
 Performance Results
-~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^
 
 Based on our internal testing with a 5GB model and 4 clients using FedAvg, we observed **20% to 50% memory usage reduction** on both server and client sides.
 
@@ -90,7 +99,7 @@
 Your results may vary depending on model size, number of clients, network conditions, and different FL algorithms and workflows.
 
 How It Works
-~~~~~~~~~~~~
+^^^^^^^^^^^^
 
 The TensorDownloader operates transparently behind the scenes:
 
@@ -116,7 +125,7 @@ For advanced users who need direct control, the low-level API is available:
     )
 
 Benefits for LLM Training
-~~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^^
 
 - **Reduced Memory Footprint**: 20-50% reduction critical for large models that approach memory limits
 - **Improved Scalability**: Multiple clients can download at different rates without blocking
@@ -131,9 +140,47 @@ Benefits for LLM Training
 - User guide with configuration and tuning: :ref:`tensor_downloader`
 - FOBS decomposer architecture: :ref:`decomposer_for_large_object`
 
+Zero Tensor Copy at the CJ Process (Pass-Through)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For hierarchical and large-model deployments, the Client Job (CJ) relay process previously
+deserialized and re-serialized every model tensor before forwarding it to the client subprocess.
+This doubled the memory footprint at the relay tier for every round.
+
+FLARE 2.7.2 introduces a **pass-through architecture** for ``ClientAPILauncherExecutor``:
+
+- **Lazy references instead of full tensors**: The CJ process holds lightweight
+  ``LazyDownloadRef`` placeholders rather than materializing the full model, so the CJ
+  memory footprint is independent of model size.
+- **Direct subprocess download**: The training subprocess fetches tensors directly from the
+  FL server, eliminating the CJ as a memory bottleneck and halving network transfers between
+  the server and CJ tier.
+- **Zero code changes**: Existing jobs using ``ClientAPILauncherExecutor`` benefit
+  automatically.
+
+This is particularly impactful for LLM-scale models (7B–70B parameters) where CJ memory
+previously equalled the full model size.
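The pass-through idea can be illustrated with a minimal sketch. Note that `LazyRef`, `relay_forward`, and `materialize` are hypothetical names for illustration only; FLARE's actual `LazyDownloadRef` machinery is more involved.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class LazyRef:
    # Hypothetical placeholder: identifies a tensor on the server
    # without holding its bytes.
    ref_id: str
    nbytes: int


def relay_forward(state: Dict[str, LazyRef]) -> Dict[str, LazyRef]:
    # The relay passes lightweight references through unchanged; its memory
    # cost is O(number of tensors), not O(model size).
    return dict(state)


def materialize(state: Dict[str, LazyRef], fetch: Callable[[str], bytes]) -> Dict[str, bytes]:
    # Only the training subprocess downloads the actual tensor payloads.
    return {name: fetch(ref.ref_id) for name, ref in state.items()}


# Demo with an in-memory stand-in for the FL server
server = {"r1": b"\x00" * 8, "r2": b"\x01" * 4}
refs = {"layer.weight": LazyRef("r1", 8), "layer.bias": LazyRef("r2", 4)}
forwarded = relay_forward(refs)          # relay tier: no tensor bytes held
tensors = materialize(forwarded, server.__getitem__)  # subprocess: real download
```

The key property is that `relay_forward` never touches tensor bytes, which is why the relay's footprint stays flat as models grow.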
+
+Client-Side Memory Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+FLARE 2.7.2 extends memory lifecycle control to the client training process, complementing
+the existing server-side cleanup:
+
+- **Allocator-aware cleanup**: After each ``flare.send()`` call, FLARE automatically
+  invokes ``gc.collect()`` plus allocator-specific trimming — ``malloc_trim(0)`` for
+  glibc (Linux), jemalloc arena purge where available, and ``torch.cuda.empty_cache()``
+  for GPU memory — returning freed pages to the OS between rounds.
+- **Configurable frequency**: Cleanup runs every ``N`` rounds (default: every round),
+  configurable via recipe parameters (``client_memory_gc_rounds``) and ``ScriptRunner``.
+- **No training script changes**: Cleanup is injected transparently into the FLARE
+  client lifecycle without touching user training code.
+- **Combined with server-side cleanup**: Together with the server-side garbage collection
+  introduced in 2.7.2, this prevents unbounded RSS growth in both the server and client
+  processes across long-running jobs with many rounds.
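The allocator-aware cleanup described above can be sketched in standalone form. This is not FLARE's injected implementation, just a best-effort illustration of the same three steps (garbage collect, glibc trim, CUDA cache release):

```python
import ctypes
import ctypes.util
import gc


def release_memory() -> bool:
    """Best-effort cleanup after a round: collect garbage, then ask the
    allocator to return freed pages to the OS. Returns True if glibc's
    malloc_trim reported that memory was released."""
    gc.collect()
    trimmed = False
    libc_name = ctypes.util.find_library("c")
    if libc_name:
        libc = ctypes.CDLL(libc_name)
        # malloc_trim exists on glibc (Linux); other libcs may lack it
        if hasattr(libc, "malloc_trim"):
            trimmed = bool(libc.malloc_trim(0))
    try:
        import torch  # optional: only if PyTorch is installed
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached GPU allocator blocks
    except ImportError:
        pass
    return trimmed
```

A script could call `release_memory()` after each send if it were not using the built-in injection; with FLARE 2.7.2's client-side management, no such call is needed in user code.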
 
 Server-Side Memory Cleanup
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 FLARE 2.7.2 adds automatic server-side memory management to address RSS (Resident Set Size — the actual physical memory used by a process) growth in long-running jobs:
 
@@ -146,6 +193,166 @@ FLARE 2.7.2 adds automatic server-side memory management to address RSS (Residen
 
 For configuration details, platform compatibility, recommended settings, and API reference, see :doc:`/programming_guide/memory_management`.
 
+F3 Streaming Reliability and Performance
+----------------------------------------
+
+A focused hardening effort on the F3 streaming layer addresses several concurrency and
+stability issues that manifested at scale, particularly in hierarchical and large-model
+deployments.
+
+Head-of-Line (HOL) Stall Mitigation
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In 2.7.0/2.7.1, a slow or congested connection could hold the per-connection SFM send lock
+indefinitely, blocking all outgoing traffic on that relay — heartbeats, admin commands, and
+task requests — behind a single large frame send.
+
+FLARE 2.7.2 eliminates this with a multi-layer guard:
+
+- **Bounded send timeout**: ``send_frame()`` now has a configurable deadline
+  (``STREAMING_SEND_TIMEOUT``); a send that exceeds it raises rather than blocking forever.
+- **ACK-progress watchdog**: A background monitor checks that ACKs advance within
+  ``STREAMING_ACK_PROGRESS_TIMEOUT``; if a connection stalls, it is flagged.
+- **Stall detection and optional recovery**: Consecutive stall detections (configurable via
+  ``SFM_SEND_STALL_CONSECUTIVE_CHECKS``) can optionally trigger a connection reset
+  (``SFM_CLOSE_STALLED_CONNECTION``), unblocking all pending traffic.
+
+.. code-block:: text
+
+   # Recommended server environment settings for large hierarchical deployments
+   STREAMING_SEND_TIMEOUT=30
+   STREAMING_ACK_PROGRESS_TIMEOUT=60
+   SFM_SEND_STALL_TIMEOUT=75
+   SFM_SEND_STALL_CONSECUTIVE_CHECKS=3
+   SFM_CLOSE_STALLED_CONNECTION=true   # enable after confirming no false positives
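The ACK-progress watchdog pattern can be sketched as follows. This is a hypothetical simplification, not FLARE's monitor: progress on an ACK counter resets the stall count, and only consecutive stalled polls flag the connection:

```python
import threading
import time


class AckWatchdog:
    """Sketch of an ACK-progress watchdog: if the acked counter stops
    advancing for longer than `timeout` across `checks` consecutive polls,
    the connection is flagged as stalled."""

    def __init__(self, timeout: float, checks: int):
        self.timeout = timeout
        self.checks = checks
        self.acked = 0
        self.last_progress = time.monotonic()
        self.stall_count = 0
        self.stalled = False
        self.lock = threading.Lock()

    def on_ack(self, new_acked: int) -> None:
        # Called by the receive path whenever an ACK arrives.
        with self.lock:
            if new_acked > self.acked:
                self.acked = new_acked
                self.last_progress = time.monotonic()
                self.stall_count = 0

    def poll(self) -> bool:
        # Called periodically by a background monitor thread.
        with self.lock:
            if time.monotonic() - self.last_progress > self.timeout:
                self.stall_count += 1
            else:
                self.stall_count = 0
            self.stalled = self.stall_count >= self.checks
            return self.stalled
```

In a real deployment the `poll()` caller would, on `True`, log the stall and optionally reset the connection, mirroring the `SFM_CLOSE_STALLED_CONNECTION` behavior.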
+
+Stream Pool Starvation Fix
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A thread pool deadlock in blob streaming caused ``handle_blob_cb`` to be dispatched on the
+same stream worker pool that it was waiting to resolve — exhausting the pool and preventing
+any concurrent downloads from completing.
+
+The fix dispatches blob callbacks to a dedicated ``callback_thread_pool``, keeping stream
+workers free. An end-to-end test validates that 8 concurrent downloads complete without
+starvation.
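The dedicated-pool pattern can be illustrated with a small sketch (assumed structure, not FLARE's code): stream workers hand completed blobs off to a separate pool rather than running user callbacks on their own threads.

```python
from concurrent.futures import ThreadPoolExecutor

# Two pools: stream workers stay free to process frames; callbacks run elsewhere.
stream_workers = ThreadPoolExecutor(max_workers=2, thread_name_prefix="stream")
callback_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="cb")


def handle_blob(blob: bytes, on_done) -> None:
    # Dispatch the (potentially blocking) callback to the dedicated pool so
    # the stream worker is released immediately for the next frame.
    callback_pool.submit(on_done, blob)


results = []
futures = [
    stream_workers.submit(handle_blob, b"x" * n, results.append) for n in range(8)
]
for f in futures:
    f.result()  # stream-side work done; callbacks may still be running
callback_pool.shutdown(wait=True)   # wait for all callbacks to finish
stream_workers.shutdown(wait=True)
```

With a single shared pool, a blocking callback could occupy the last stream worker and starve the very download it was waiting for; splitting the pools removes that cycle.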
+
+Streaming Download Retry on Timeout
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Transient timeouts during streaming downloads (particularly in LLM swarming scenarios over
+congested networks) previously resulted in silent stream loss. FLARE 2.7.2 adds structured
+retry semantics:
+
+- **Exponential-backoff retry**: Up to 3 retries with configurable backoff, capped at 60 s.
+- **Abort-signal aware**: The retry loop respects abort signals; no stale retries after job stop.
+- **State-safe**: Retry is idempotent; re-requesting the same stream is safe for the server.
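The retry semantics above can be sketched generically. The function name and parameters here are illustrative, not FLARE's API; the shape (capped exponential backoff, abort-aware sleep) matches the description:

```python
import threading


def download_with_retry(fetch, *, retries=3, base_delay=1.0, max_delay=60.0,
                        abort: threading.Event = None):
    """Sketch of retry-on-timeout: exponential backoff capped at max_delay,
    aborting early if the job's abort signal fires."""
    abort = abort or threading.Event()
    delay = base_delay
    for attempt in range(retries + 1):
        if abort.is_set():
            raise RuntimeError("download aborted")
        try:
            return fetch()  # idempotent: re-requesting the stream is safe
        except TimeoutError:
            if attempt == retries:
                raise  # retries exhausted; surface the timeout
            abort.wait(delay)  # sleep, but wake immediately if aborted
            delay = min(delay * 2, max_delay)
```

Using `Event.wait` instead of `time.sleep` is what makes the backoff abort-aware: a job stop wakes the loop instantly instead of after the full delay.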
+
+RxTask Self-Deadlock Fix
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+A self-deadlock in the receiver path occurred when ``find_or_create_task()`` called
+``task.stop()`` while holding ``RxTask.map_lock``, and ``stop()`` also acquired
+``map_lock``. This was triggered by stream error signals arriving for an active task.
+
+The fix defers the ``stop()`` call until after ``map_lock`` is released, eliminating the
+deadlock without changing the correctness of stream error handling.
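The deferred-stop pattern is general enough to show in isolation. This sketch uses hypothetical names (`TaskMap`, `on_stream_error`), not FLARE's actual classes: decide *which* task to stop while holding the lock, but call `stop` only after releasing it.

```python
import threading


class TaskMap:
    """Sketch of the deadlock-free pattern: stop() re-acquires map_lock to
    remove the task, so it must never be called while map_lock is held."""

    def __init__(self):
        self.map_lock = threading.Lock()  # non-reentrant, like the original
        self.tasks = {}

    def stop_task(self, task_id):
        with self.map_lock:  # would self-deadlock if called under map_lock
            self.tasks.pop(task_id, None)

    def on_stream_error(self, task_id):
        to_stop = None
        with self.map_lock:
            if task_id in self.tasks:
                to_stop = task_id  # defer: record the decision only
        if to_stop is not None:
            self.stop_task(to_stop)  # lock released: no self-deadlock
```

With a non-reentrant `Lock`, moving the `stop_task` call inside the `with` block would hang forever; deferring it preserves the same observable behavior without the deadlock.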
+
+Lock Contention Reduction in Model Downloads
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``produce_item()`` call in the cacheable streaming layer previously ran inside
+``self.lock``, serializing all concurrent clients needing the same model chunk behind a
+single slow operation. The lock scope has been reduced so that item production occurs
+outside the lock, with a compare-and-store pattern for the cache write. This significantly
+reduces model-download latency when many clients (e.g., 24 per relay) request the same
+chunk concurrently.
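The compare-and-store pattern can be sketched as follows (a simplified illustration, not the actual cacheable-streaming code): the critical section shrinks to two short map operations, and the slow production runs unlocked.

```python
import threading


class ChunkCache:
    """Sketch of reduced lock scope: produce the item outside the lock, then
    compare-and-store so the first finished producer wins and any duplicate
    result is discarded."""

    def __init__(self, produce_item):
        self.produce_item = produce_item
        self.lock = threading.Lock()
        self.cache = {}

    def get(self, key):
        with self.lock:  # fast path: short critical section
            item = self.cache.get(key)
        if item is not None:
            return item
        item = self.produce_item(key)  # slow work, no lock held
        with self.lock:  # compare-and-store: keep whichever landed first
            return self.cache.setdefault(key, item)
```

The trade-off is that two threads may occasionally produce the same chunk concurrently; the `setdefault` keeps the cache consistent, and duplicated work is far cheaper than serializing every reader behind one slow producer.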
+
+Hierarchical FL Startup Stability
+---------------------------------
+
+Large-scale hierarchical FL deployments (many clients across relay tiers) are subject to
+startup race conditions that can abort jobs before training begins. FLARE 2.7.2 addresses
+these with a set of coordinated fixes and new configuration controls.
+
+.. note::
+
+   This fix set is particularly relevant for HPC deployments such as Frontier (ORNL) and
+   similar supercomputers where 100+ FL clients start via a batch scheduler (e.g., Slurm)
+   under shared filesystem (Lustre) load.
+
+Deployment Timeout Now Treated as Failure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Previously, a client that did not acknowledge job deployment within the timeout window
+(``reply=None``) was silently treated as successfully deployed. The server then started
+the job with that client in the participant list, creating a state inconsistency that
+led to premature dead-client detection and job abort.
+
+FLARE 2.7.2 correctly classifies deployment timeouts as failures, applying the existing
+``min_sites`` / ``required_sites`` tolerance check at the deployment phase. Timed-out
+clients are excluded from the job before ``start_client_job`` is called, preventing the
+state inconsistency from ever forming.
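The deployment-phase check can be summarized in a few lines. This is a hypothetical sketch of the logic, not FLARE's implementation: a `None` reply counts as a failure, and the tolerance check runs before the job starts.

```python
def check_deployment(replies: dict, min_sites: int) -> list:
    """Sketch: a None reply (deployment timeout) counts as a failure, and the
    min_sites tolerance is applied before the job starts. Returns the list of
    successfully deployed sites, or raises if too few remain."""
    deployed = [site for site, reply in replies.items() if reply is not None]
    if len(deployed) < min_sites:
        raise RuntimeError(
            f"only {len(deployed)} of required {min_sites} sites deployed"
        )
    return deployed  # only these sites proceed to start_client_job
```

Excluding timed-out sites here, rather than after start, is what prevents the participant list from ever referencing a client that never acknowledged deployment.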
+
+Startup Grace Period for Dead-Client Detection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The server's heartbeat monitor previously had no startup grace period: it could fire a
+dead-job notification on the very first heartbeat from a client that was not yet running
+the job. For clients that were still initializing (slow filesystem, GPU allocation,
+subprocess spawning), this caused premature dead-client classification.
+
+FLARE 2.7.2 adds a debounce mechanism: a client must first be positively observed reporting
+the job in a heartbeat before a subsequent missing report triggers a dead-job notification.
+This gives clients the time they need to start without false alarms.
+
+This behavior is now the **default** (``sync_client_jobs_require_previous_report=true``).
+Operators who need the legacy aggressive detection can opt out via configuration.
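The debounce rule can be expressed compactly. The class and method names below are illustrative, not FLARE's; the logic mirrors the description: no alarm until the client has been positively seen reporting the job once.

```python
class DeadJobDetector:
    """Sketch of the debounce rule: a missing job report only counts as
    'dead' if the client previously reported the job in a heartbeat."""

    def __init__(self, require_previous_report: bool = True):
        self.require_previous_report = require_previous_report
        self.seen = set()  # (client, job_id) pairs positively observed

    def on_heartbeat(self, client: str, running_jobs: set, job_id: str) -> bool:
        """Return True if a dead-job notification should fire."""
        if job_id in running_jobs:
            self.seen.add((client, job_id))  # positive observation
            return False
        if self.require_previous_report and (client, job_id) not in self.seen:
            return False  # still starting up: no alarm yet
        return True  # was running before, now missing: report dead
```

Setting `require_previous_report=False` reproduces the legacy aggressive behavior, where any heartbeat lacking the job triggers the notification.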
+
+Selective Client Exclusion on Start-Job Timeout
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When strict start-job reply checking is enabled
+(``strict_start_job_reply_check=true``), clients that time out at the start-job phase are
+now **excluded from the run** rather than causing a full job abort — provided the remaining
+active client count still satisfies ``min_clients``. A warning is logged identifying the
+excluded clients.
+
+This allows a job to proceed with, e.g., 142 of 144 clients when 2 stragglers fail to
+respond, rather than aborting when the training majority is ready.
+
+Hardened Client Job Metadata Parsing
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a client process started after the job was already aborted, it would crash with an
+opaque ``TypeError: 'NoneType' object is not iterable`` when reading job client metadata.
+FLARE 2.7.2 replaces this with an explicit ``RuntimeError`` that names the missing field,
+making the failure actionable in logs.
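The hardening pattern itself is simple to illustrate. The field name `job_clients` and the function below are hypothetical, used only to show the TypeError-to-RuntimeError replacement described above:

```python
def get_job_clients(meta: dict) -> list:
    """Sketch: fail with a descriptive RuntimeError instead of an opaque
    TypeError when the metadata field is missing (e.g., the job was aborted
    before this client process started)."""
    clients = meta.get("job_clients") if meta else None
    if clients is None:
        raise RuntimeError(
            "missing 'job_clients' in job metadata; the job may have been "
            "aborted before this client started"
        )
    return list(clients)  # iterating None here is what raised the old TypeError
```

The payoff is purely operational: an on-call operator reading logs sees which field was absent and why, instead of a bare `'NoneType' object is not iterable` traceback.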
+
+Recommended Configuration for Large-Scale Hierarchical Deployments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: json
+
+   // config_fed_server.json — relax FedAvg tolerance
+   {
+       "workflows": [{
+           "path": "nvflare.app_common.workflows.fedavg.FedAvg",
+           "args": {
+               "num_clients": 144,
+               "min_clients": 138
+           }
+       }]
+   }
+
+.. code-block:: json
+
+   // config_fed_client.json — extend sync timeout for Lustre/HPC
+   {
+       "runner_sync_timeout": 120,
+       "max_runner_sync_timeout": 7200
+   }
+
 Comprehensive Timeout Documentation
 ------------------------------------
 
@@ -200,7 +407,7 @@ Documentation
 
 - **Timeout Documentation**: New :doc:`/user_guide/timeout_troubleshooting` for common job failures, and :doc:`/programming_guide/timeouts` as comprehensive reference for all 100+ timeout parameters.
 
-- **Memory Management Guide**: New :doc:`/programming_guide/memory_management` covering server-side garbage collection, ``MALLOC_ARENA_MAX`` tuning, platform compatibility, and troubleshooting.
+- **Memory Management Guide**: New :doc:`/programming_guide/memory_management` covering server-side and client-side garbage collection, ``MALLOC_ARENA_MAX`` tuning, platform compatibility, and troubleshooting.
 
 - **Tensor Downloader Guide**: Expanded :doc:`/programming_guide/tensor_downloader` with configuration examples, architecture details, and tuning guidance.
 
@@ -213,10 +420,16 @@ Documentation
 Bug Fixes
 ~~~~~~~~~
 
+- Fixed F3 streaming Head-of-Line stall: ``send_frame()`` no longer holds the connection lock without a timeout bound.
+- Fixed RxTask self-deadlock triggered by stream error signals during active receive.
+- Fixed stream thread pool starvation that prevented concurrent model downloads from completing.
+- Fixed deployment timeout silent pass-through: timed-out clients are now counted against ``min_sites``.
+- Fixed premature dead-job detection: clients are no longer reported missing before their first positive heartbeat.
+- Fixed ``TypeError`` crash in client job process when job metadata is absent (replaced with descriptive ``RuntimeError``).
+- Fixed Swarm Learning self-message deadlock for local result submission.
 - Fixed TLS corruption by replacing ``fork`` with ``posix_spawn`` for subprocess creation.
 - Fixed potential data corruption issue in the Streamer component.
 - Fixed Swarm Learning controller compatibility with tensor streaming.
-- Fixed Swarm Learning controller bug.
 - Fixed XGBoost adaptor and recipe integration issues.
 - Addressed client-side vulnerability for tree-based horizontal XGBoost.
 - Fixed NumPy cross-site evaluation regression.