[2.7] Mitigate F3 streaming Head-of-Line (HOL) stalls and add guardrails by chesterxgchen · Pull Request #4206 · NVIDIA/NVFlare

chesterxgchen · 2026-02-20T01:20:58Z

Many Thanks to @GeorgeWang-nv to find the issues and perform the root cause analysis on this issues, suggested possible fixes

1) Root cause: connection-level Head-of-Line (HOL) blocking

Large byte streams can freeze when a blocking socket send holds the per-connection SFM send lock.
While that lock is held, unrelated traffic on the same connection (including progress and coordination frames) is serialized behind it.
This creates Head-of-Line (HOL) blocking at connection level and can manifest as long apparent stalls.

2) Proposed fixes: critical must-haves, and rationale

Critical fix A: bound low-level send blocking time

Add a bounded send timeout in the socket send path so one blocked send cannot hold the connection lock indefinitely.
Why this way: minimally invasive, directly targets lock-hold source, and fails deterministically instead of hanging.

Critical fix B: ACK-progress watchdog in streaming sender

Track ACK forward progress and fail fast when ACK offset does not advance for a configurable interval.
Why this way: protects sender flow control from waiting too long on non-progressing streams, independent of generic ACK wait.

Critical fix C: SFM stalled-send detection with guarded recovery

Monitor send stall duration in heartbeat monitor; optionally close the stalled connection to recover.
Add a consecutive-check threshold before close to reduce false positives from transient jitter.
Why this way: staged recovery lever (warn-only first, then auto-close) balances reliability and operational safety.

Configuration surface added

streaming_send_timeout
streaming_ack_progress_timeout
streaming_ack_progress_check_interval
sfm_send_stall_timeout
sfm_close_stalled_connection
sfm_send_stall_consecutive_checks

3) Test strategy and coverage

Added focused unit tests for each mitigation path with both positive and negative cases.

Socket send timeout tests

Config application, partial-write success, non-writable timeout, closed-send path, and closing-state suppression.

ACK watchdog tests

ACK progress timestamp update, progress-allowed completion, no-progress timeout, legacy ack_wait behavior, and fast-fail precedence when progress timeout is small.

Stall monitor plus guard tests

Warn-only behavior, close-enabled recovery, close-disabled non-recovery, heartbeat behavior when healthy, send-state reset on success/exception, warning emission, no-warning in healthy flow, and intermittent-stall false-alarm suppression via consecutive guard.

Validation run

pytest on the three new F3 test modules: 24 passed.

4) User/operator guide updates

Added a new section in docs/user_guide/timeout_troubleshooting.rst:
- Runtime defaults for new stall-related parameters.
- Recommended rollout: warn-only first, then auto-close only if repeated stall warnings persist.
- Suggested values from discussion: sfm_send_stall_timeout=75, sfm_send_stall_consecutive_checks=3.
- Log interpretation guidance and false-alarm suppression expectations.

5) Additional notes and risk profile

This change prefers deterministic failure/recovery over silent long hangs.
Potential side effect when auto-close is enabled: transient connection resets under severe network instability; mitigated by conservative timeout plus consecutive-check guard.
Defaults remain conservative (sfm_close_stalled_connection=false) so operators can observe first and enable recovery when needed.
Scope is intentionally focused on the critical Head-of-Line (HOL) mitigation path and operational guardrails.

Test plan

python3 -m pytest tests/unit_test/fuel/f3/drivers/socket_conn_timeout_test.py tests/unit_test/fuel/f3/streaming/byte_streamer_ack_watchdog_test.py tests/unit_test/fuel/f3/sfm/sfm_stall_monitor_test.py
python3 -m black <changed_python_files>
python3 -m isort <changed_python_files>
python3 -m flake8 <changed_python_files>

Latest Timeout Hardening Update

Close socket connections on send_frame timeout to prevent frame-boundary desync after partial writes.
Preserve specific send error codes in send_frame for common socket exceptions:
- map timeout-like exceptions to CommError.TIMEOUT (and close connection),
- map closed-socket exceptions to CommError.CLOSED,
- keep CommError.ERROR as fallback for unknown exceptions.
Add positive and negative unit tests in socket_conn_timeout_test.py:
- Positive: timeout after partial write closes the connection.
- Positive: TimeoutError -> TIMEOUT; BrokenPipeError -> CLOSED.
- Negative: successful partial-write sends do not close; non-timeout CommError.CLOSED does not force close; unknown exceptions remain ERROR.
Document stall timing relationship in timeout_troubleshooting.rst, including the practical close window formula and outer-timeout sizing guideline.

Validation

python3 -m pytest tests/unit_test/fuel/f3/drivers/socket_conn_timeout_test.py -q
Result: 10 passed

greptile-apps · 2026-02-20T01:23:29Z

Greptile Summary

This PR implements three complementary mitigation strategies to address Head-of-Line (HOL) blocking in F3 streaming, where large byte streams could freeze when a blocking socket send holds the per-connection lock. The implementation includes bounded send timeouts with connection close on partial-write timeouts (to prevent frame desync), an ACK progress watchdog in the streaming sender (to fail fast when ACK offset stops advancing), and an SFM stall monitor with consecutive-check guards (to detect and optionally close stalled connections). All three fixes work together to convert silent hangs into deterministic failures with recovery options.

Key Changes:

Socket send timeout with select-based flow control closes connection after partial writes to avoid frame-boundary desync
ACK progress tracking with configurable timeout fails fast when stream makes no progress, independent of generic ACK wait
Heartbeat monitor detects send stalls and can optionally close connections after consecutive threshold checks to reduce false positives
Conservative defaults (sfm_close_stalled_connection=false) enable warn-only mode for safe initial deployment
Comprehensive unit tests (24 tests) cover timeout paths, watchdog behavior, and consecutive-check guards
Documentation includes deployment guidelines, timing relationships, and log interpretation guidance

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation addresses a critical production issue (HOL blocking) with well-designed, minimally invasive fixes. Conservative defaults ensure safe rollout (warn-only mode), comprehensive test coverage validates all mitigation paths, and the consecutive-check guard reduces false positives. The previous review concern about error code preservation has been addressed.
No files require special attention

Important Files Changed

Filename	Overview
nvflare/fuel/f3/comm_config.py	Adds configuration parameters for streaming timeout and stall detection features with appropriate getter methods
nvflare/fuel/f3/drivers/socket_conn.py	Implements bounded socket send timeout with select-based flow control and connection close on timeout to prevent frame desync
nvflare/fuel/f3/sfm/sfm_conn.py	Adds send state tracking with monotonic timestamps to enable stall detection by heartbeat monitor
nvflare/fuel/f3/streaming/byte_streamer.py	Adds ACK progress watchdog that fails fast when offset does not advance within configured timeout
nvflare/fuel/f3/sfm/heartbeat_monitor.py	Adds send stall detection with consecutive-check guard and optional connection close for recovery
docs/user_guide/timeout_troubleshooting.rst	Comprehensive documentation for streaming stall guardrails with recommended deployment guidelines and timing relationships

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Large Byte Stream Send] --> B{Socket Send Blocks}
    B --> C[Holds SFM Connection Lock]
    C --> D[HOL Blocking: Other Traffic Stalled]
    
    D --> E[Fix A: Bounded Send Timeout]
    E --> E1[select with timeout]
    E1 --> E2{Timeout After Partial Write?}
    E2 -->|Yes| E3[Close Connection<br/>Prevent Frame Desync]
    E2 -->|No| E4[Complete Send Successfully]
    
    D --> F[Fix B: ACK Progress Watchdog]
    F --> F1[Track last_ack_progress_ts]
    F1 --> F2{ACK Offset Advancing?}
    F2 -->|No Progress| F3[Fail Fast After<br/>ack_progress_timeout]
    F2 -->|Progress| F4[Allow Stream to Continue]
    
    D --> G[Fix C: SFM Stall Monitor]
    G --> G1[Heartbeat Monitor Checks<br/>send_stall_seconds]
    G1 --> G2{Stall > Timeout?}
    G2 -->|Yes| G3[Increment Consecutive Count]
    G3 --> G4{Count >= Threshold?}
    G4 -->|Yes & Close Enabled| G5[Close Connection<br/>Force Recovery]
    G4 -->|No or Close Disabled| G6[Warn Only]
    G2 -->|No| G7[Reset Counter to 0]

_{Last reviewed commit: a6a4253}

greptile-apps

_{10 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps

_{10 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

chesterxgchen · 2026-02-20T01:39:04Z

/build

chesterxgchen · 2026-02-20T02:32:31Z

/build

greptile-apps

_{10 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

pcnudde

I'm ok with this approach. It goes a bit further then PR 4205, but still is safe. only a few comments

nvflare/fuel/f3/sfm/sfm_conn.py

tests/unit_test/fuel/f3/drivers/socket_conn_timeout_test.py

nvflare/fuel/f3/streaming/byte_streamer.py

greptile-apps

_{12 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

nvflare/fuel/f3/drivers/socket_conn.py

Bound socket send, fail fast on missing ACK progress, and detect stalled sends with a consecutive-check guard so connection-level HOL stalls are surfaced and recovered deterministically. Add focused positive/negative unit tests and user-facing comm_config guidance for safe warn-only to auto-close rollout.

Add the standard Apache 2.0 header to the new SFM unit-test package marker file to satisfy repository license checks.

Use monotonic timing for SFM send-stall tracking, clamp ACK progress poll interval to avoid busy-spin under misconfiguration, and complete Apache headers in added F3 tests with coverage for the interval clamp behavior.

Keep timeout/closed CommError codes from _send_with_timeout instead of rewrapping them as generic errors, and align socket timeout tests to assert code-specific behavior with coverage for unexpected exception wrapping.

greptile-apps

_{12 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

chesterxgchen · 2026-02-20T06:24:14Z

/build

greptile-apps

_{10 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

chesterxgchen · 2026-02-20T06:28:27Z

/build

nvflare/fuel/f3/drivers/socket_conn.py

Close F3 socket connections on send timeout to prevent frame-boundary desync after partial writes, and add positive/negative unit coverage to ensure only timeout paths force close. Document the stall auto-close timing relationship so outer timeouts can be configured safely.

chesterxgchen · 2026-02-20T17:00:48Z

/build

greptile-apps

_{10 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Map common socket exceptions in send_frame to CommError.TIMEOUT or CommError.CLOSED to preserve diagnostics, while keeping ERROR as fallback for unknown failures. Add focused positive and negative tests for timeout, closed-socket, and fallback error paths.

chesterxgchen · 2026-02-20T17:24:28Z

/build

greptile-apps

_{10 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

…ent, hierarchical startup stability Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft: Memory Management (restructured): - Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210) - Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211) - Retain existing TensorDownloader and server-side cleanup content F3 Streaming Reliability and Performance (new section): - HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206) - Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172) - Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167) - RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204) - Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174) Hierarchical FL Startup Stability (new section): - Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209) - Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209) - Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209) - Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209) - Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale) Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ctive client exclusion, and dead-detection debounce (#4209) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ctive client exclusion, and dead-detection debounce (NVIDIA#4209) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR NVIDIA#4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs NVIDIA#4206, NVIDIA#4204, NVIDIA#4174, NVIDIA#4172, NVIDIA#4186, NVIDIA#4211, NVIDIA#4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR NVIDIA#4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ent, hierarchical startup stability Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft: Memory Management (restructured): - Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210) - Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211) - Retain existing TensorDownloader and server-side cleanup content F3 Streaming Reliability and Performance (new section): - HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206) - Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172) - Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167) - RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204) - Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174) Hierarchical FL Startup Stability (new section): - Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209) - Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209) - Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209) - Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209) - Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale) Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ent, hierarchical startup stability [skip ci] (#4218) ## Merge Dependency > ⚠️ **Depends on #4209** — The *Hierarchical FL Startup Stability* section documents changes introduced by PR #4209 (currently open). **Merge PR #4209 into `2.7` before merging this PR.** All other sections cover already-merged PRs. --- ## Summary This PR updates `docs/release_notes/flare_272.rst` to reflect all major changes merged into the 2.7.x line after the initial 2.7.2 draft, covering three new areas: - **Memory Management** — restructured and expanded with Zero Tensor Copy at CJ (PR #4210) and client-side memory lifecycle management (PR #4211) - **F3 Streaming Reliability and Performance** — new section covering HOL stall mitigation (PR #4206), stream pool starvation fix (PR #4171/#4172), streaming download retry (PR #4167), RxTask self-deadlock fix (PR #4204), and lock contention reduction (PR #4174) - **Hierarchical FL Startup Stability** — new section covering deployment timeout classification, startup grace period, selective client exclusion, and metadata hardening (PR #4209 — pending merge), with recommended config snippets for HPC/Lustre environments The Bug Fixes section and intro paragraph are also updated accordingly. A source-level RST comment has been added above the Hierarchical FL section in the file to alert future maintainers to the merge dependency. ## Merged PRs Documented | PR | Area | Status | |---|---|---| | #4171 / #4172 | Stream pool starvation fix | Merged | | #4174 | Lock contention reduction | Merged | | #4167 | Streaming download retry | Merged | | #4204 | RxTask self-deadlock fix | Merged | | #4206 | HOL stall mitigation | Merged | | #4210 | Zero tensor copy at CJ | Merged | | #4211 | Client-side memory management | Merged | | #4209 | Hierarchical FL startup stability | **Open — merge before this PR** | ## Changes ### Memory Management (restructured) - **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds `LazyDownloadRef` placeholders instead of materializing full tensors, eliminating the CJ as a memory bottleneck for LLM-scale models. - **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` / jemalloc purge / `torch.cuda.empty_cache()` injected after every `flare.send()`, configurable via `client_memory_gc_rounds`. - Existing TensorDownloader and server-side cleanup content retained. ### F3 Streaming Reliability and Performance (new section) - **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress watchdog, and stall detection/recovery. Includes recommended environment variable settings for large hierarchical deployments. - **Stream Pool Starvation Fix**: Blob callbacks dispatched to a dedicated `callback_thread_pool`, keeping stream workers free for concurrent downloads. - **Streaming Download Retry**: Exponential-backoff retry (up to 3 attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware. - **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock` released, eliminating stream-error-triggered deadlock. - **Lock Contention Reduction**: `produce_item()` runs outside `self.lock`; compare-and-store for cache write. Reduces model-download latency under high client concurrency. ### Hierarchical FL Startup Stability (new section — pending PR #4209) - **Deployment Timeout as Failure**: `reply=None` correctly counted against `min_sites`; timed-out clients excluded before `start_client_job`. - **Startup Grace Period**: Dead-client detection debounced — client must be observed once before absence triggers dead-job notification. Default changed to `True`. - **Selective Client Exclusion**: Stragglers at start-job excluded rather than causing full abort, if remaining count ≥ `min_clients`. - **Metadata Hardening**: `TypeError` on absent job metadata replaced with descriptive `RuntimeError`. - Recommended `config_fed_server.json` / `config_fed_client.json` snippets for HPC (Frontier/ORNL) scale. ## Test plan - [ ] Sphinx build (`make html`) passes without RST warnings on the updated file - [ ] All new cross-references (`.. code-block::`, `.. note::`) render correctly in the docs build - [ ] Verify section hierarchy (underline characters) is consistent throughout the file - [ ] Confirm PR #4209 is merged before this PR is merged 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…meouts, selective client exclusion, and dead-detection debounce (#4209) (#4288) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Fixes # . ### Description A few sentences describing the changes proposed in this pull request. ### Types of changes  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

chesterxgchen requested review from YuanTingHsieh, nvidianz and pcnudde February 20, 2026 01:35

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

chesterxgchen mentioned this pull request Feb 20, 2026

[2.7] Add SFM slow-send diagnostics and minimal streaming timeout tuning #4205

Closed

pcnudde reviewed Feb 20, 2026

View reviewed changes

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

nvflare/fuel/f3/drivers/socket_conn.py Show resolved Hide resolved

chesterxgchen added 4 commits February 19, 2026 22:21

[2.7] Add license header for F3 SFM test package

7254839

Add the standard Apache 2.0 header to the new SFM unit-test package marker file to satisfy repository license checks.

[2.7] Address PR-4206 review feedback on F3 stall handling

1895f5a

Use monotonic timing for SFM send-stall tracking, clamp ACK progress poll interval to avoid busy-spin under misconfiguration, and complete Apache headers in added F3 tests with coverage for the interval clamp behavior.

[2.7] Preserve socket send CommError codes in send_frame

151a588

Keep timeout/closed CommError codes from _send_with_timeout instead of rewrapping them as generic errors, and align socket timeout tests to assert code-specific behavior with coverage for unexpected exception wrapping.

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

chesterxgchen force-pushed the streaming_bug branch from 117d2e6 to 151a588 Compare February 20, 2026 06:22

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

GeorgeWang-nv reviewed Feb 20, 2026

View reviewed changes

nvflare/fuel/f3/drivers/socket_conn.py Outdated Show resolved Hide resolved

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

pcnudde self-requested a review February 20, 2026 18:11

pcnudde approved these changes Feb 20, 2026

View reviewed changes

chesterxgchen merged commit a938188 into NVIDIA:2.7 Feb 20, 2026
19 checks passed

chesterxgchen deleted the streaming_bug branch February 20, 2026 19:24

chesterxgchen mentioned this pull request Feb 22, 2026

[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce #4209

Merged

6 tasks

chesterxgchen mentioned this pull request Feb 22, 2026

[2.7] Update 2.7.2 release notes: streaming hardening, memory management, hierarchical startup stability [skip ci] #4218

Merged

4 tasks

YuanTingHsieh mentioned this pull request Mar 11, 2026

Cherry-pick [2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce (#4209) #4288

Merged

12 tasks

Conversation

chesterxgchen commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1) Root cause: connection-level Head-of-Line (HOL) blocking

2) Proposed fixes: critical must-haves, and rationale

Critical fix A: bound low-level send blocking time

Critical fix B: ACK-progress watchdog in streaming sender

Critical fix C: SFM stalled-send detection with guarded recovery

Configuration surface added

3) Test strategy and coverage

Socket send timeout tests

ACK watchdog tests

Stall monitor plus guard tests

Validation run

4) User/operator guide updates

5) Additional notes and risk profile

Test plan

Latest Timeout Hardening Update

Validation

Uh oh!

greptile-apps bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

pcnudde left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

chesterxgchen commented Feb 20, 2026 •

edited

Loading

greptile-apps bot commented Feb 20, 2026 •

edited

Loading