[2.7] smaller lock in produce item by pcnudde · Pull Request #4174 · NVIDIA/NVFlare

pcnudde · 2026-02-11T17:22:14Z

Fixes # .

Description

A few sentences describing the changes proposed in this pull request.

Types of changes

Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Quick tests passed locally by running ./runtest.sh.
In-line docstrings updated.
Documentation updated.

nvflare/fuel/f3/streaming/cacheable.py

pcnudde · 2026-02-12T21:22:15Z

/build

greptile-apps · 2026-02-12T21:23:30Z

Greptile Overview

Greptile Summary

This PR optimizes the _get_item method in CacheableObject by moving the expensive produce_item() call outside the lock, significantly reducing lock contention when multiple receivers concurrently request the same item.

Key changes:

Early cache availability check and immediate lock release (lines 91-100)
produce_item() now executes without holding the lock (line 109)
Double-checked locking pattern on re-acquisition: if another thread cached the item concurrently, use that result instead (lines 111-119)
Preserves the reference count correctly when storing newly produced items (line 115)

The refactoring maintains correctness:

Handles cache cleared scenario by returning produced item directly (lines 99-100)
Prevents duplicate work when concurrent threads produce the same item
Properly preserves the count field in cache tuples

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The refactoring implements a well-known double-checked locking pattern correctly. All edge cases are handled: cache invalidation (lines 99-100, 112), concurrent production (lines 113-119), and count preservation (line 115). The change is covered by existing comprehensive tests in cacheable_test.py which validate caching behavior, concurrent access, and cache clearing scenarios.
No files require special attention

Important Files Changed

Filename	Overview
nvflare/fuel/f3/streaming/cacheable.py	Refactored `_get_item` to produce items outside lock, reducing contention when multiple receivers request same item concurrently. Uses double-checked locking pattern with proper cache invalidation handling.

Sequence Diagram

sequenceDiagram
    participant R1 as Receiver 1
    participant R2 as Receiver 2
    participant CO as CacheableObject
    participant Lock as Lock
    participant Cache as Cache

    R1->>CO: _get_item(0)
    CO->>Lock: acquire
    Lock-->>CO: acquired
    CO->>Cache: check cache[0]
    Cache-->>CO: (None, 0)
    CO->>Lock: release
    
    R2->>CO: _get_item(0)
    CO->>Lock: acquire
    Lock-->>CO: acquired
    CO->>Cache: check cache[0]
    Cache-->>CO: (None, 0)
    CO->>Lock: release
    
    Note over R1,R2: Both produce concurrently (lock-free)
    R1->>CO: produce_item(0)
    CO-->>R1: data_A
    R2->>CO: produce_item(0)
    CO-->>R2: data_B
    
    R1->>CO: re-acquire lock
    CO->>Lock: acquire
    Lock-->>CO: acquired
    CO->>Cache: check existing
    Cache-->>CO: None
    CO->>Cache: store (data_A, 0)
    CO->>Lock: release
    
    R2->>CO: re-acquire lock
    CO->>Lock: acquire
    Lock-->>CO: acquired
    CO->>Cache: check existing
    Cache-->>CO: data_A (from R1)
    Note over CO,R2: Discard data_B, use data_A
    CO->>Lock: release
    CO-->>R2: data_A

_{Last reviewed commit: 9471cfa}

greptile-apps

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

nvflare/fuel/f3/streaming/cacheable.py

pcnudde · 2026-02-12T23:09:18Z

/build

greptile-apps

_{1 file reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

### Description Do not hold the lock around produce_item. It is not needed and this operation can be slow. We do not want/need to hold up everying during this time. ### Types of changes  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated.

## Summary - Cherry-pick of #4174: reduce lock scope in `Cacheable._get_item` — `produce_item` now runs outside the lock so concurrent receivers aren't blocked - Cherry-pick of #4186: avoid self-message deadlock when swarm trainer submits learn result to itself — local submission bypasses `broadcast_and_wait`, adds unit test coverage

…ent, hierarchical startup stability Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft: Memory Management (restructured): - Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210) - Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211) - Retain existing TensorDownloader and server-side cleanup content F3 Streaming Reliability and Performance (new section): - HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206) - Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172) - Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167) - RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204) - Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174) Hierarchical FL Startup Stability (new section): - Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209) - Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209) - Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209) - Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209) - Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale) Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ctive client exclusion, and dead-detection debounce (#4209) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ctive client exclusion, and dead-detection debounce (NVIDIA#4209) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR NVIDIA#4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs NVIDIA#4206, NVIDIA#4204, NVIDIA#4174, NVIDIA#4172, NVIDIA#4186, NVIDIA#4211, NVIDIA#4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR NVIDIA#4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ent, hierarchical startup stability Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft: Memory Management (restructured): - Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210) - Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211) - Retain existing TensorDownloader and server-side cleanup content F3 Streaming Reliability and Performance (new section): - HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206) - Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172) - Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167) - RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204) - Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174) Hierarchical FL Startup Stability (new section): - Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209) - Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209) - Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209) - Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209) - Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale) Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ent, hierarchical startup stability [skip ci] (#4218) ## Merge Dependency > ⚠️ **Depends on #4209** — The *Hierarchical FL Startup Stability* section documents changes introduced by PR #4209 (currently open). **Merge PR #4209 into `2.7` before merging this PR.** All other sections cover already-merged PRs. --- ## Summary This PR updates `docs/release_notes/flare_272.rst` to reflect all major changes merged into the 2.7.x line after the initial 2.7.2 draft, covering three new areas: - **Memory Management** — restructured and expanded with Zero Tensor Copy at CJ (PR #4210) and client-side memory lifecycle management (PR #4211) - **F3 Streaming Reliability and Performance** — new section covering HOL stall mitigation (PR #4206), stream pool starvation fix (PR #4171/#4172), streaming download retry (PR #4167), RxTask self-deadlock fix (PR #4204), and lock contention reduction (PR #4174) - **Hierarchical FL Startup Stability** — new section covering deployment timeout classification, startup grace period, selective client exclusion, and metadata hardening (PR #4209 — pending merge), with recommended config snippets for HPC/Lustre environments The Bug Fixes section and intro paragraph are also updated accordingly. A source-level RST comment has been added above the Hierarchical FL section in the file to alert future maintainers to the merge dependency. ## Merged PRs Documented | PR | Area | Status | |---|---|---| | #4171 / #4172 | Stream pool starvation fix | Merged | | #4174 | Lock contention reduction | Merged | | #4167 | Streaming download retry | Merged | | #4204 | RxTask self-deadlock fix | Merged | | #4206 | HOL stall mitigation | Merged | | #4210 | Zero tensor copy at CJ | Merged | | #4211 | Client-side memory management | Merged | | #4209 | Hierarchical FL startup stability | **Open — merge before this PR** | ## Changes ### Memory Management (restructured) - **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds `LazyDownloadRef` placeholders instead of materializing full tensors, eliminating the CJ as a memory bottleneck for LLM-scale models. - **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` / jemalloc purge / `torch.cuda.empty_cache()` injected after every `flare.send()`, configurable via `client_memory_gc_rounds`. - Existing TensorDownloader and server-side cleanup content retained. ### F3 Streaming Reliability and Performance (new section) - **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress watchdog, and stall detection/recovery. Includes recommended environment variable settings for large hierarchical deployments. - **Stream Pool Starvation Fix**: Blob callbacks dispatched to a dedicated `callback_thread_pool`, keeping stream workers free for concurrent downloads. - **Streaming Download Retry**: Exponential-backoff retry (up to 3 attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware. - **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock` released, eliminating stream-error-triggered deadlock. - **Lock Contention Reduction**: `produce_item()` runs outside `self.lock`; compare-and-store for cache write. Reduces model-download latency under high client concurrency. ### Hierarchical FL Startup Stability (new section — pending PR #4209) - **Deployment Timeout as Failure**: `reply=None` correctly counted against `min_sites`; timed-out clients excluded before `start_client_job`. - **Startup Grace Period**: Dead-client detection debounced — client must be observed once before absence triggers dead-job notification. Default changed to `True`. - **Selective Client Exclusion**: Stragglers at start-job excluded rather than causing full abort, if remaining count ≥ `min_clients`. - **Metadata Hardening**: `TypeError` on absent job metadata replaced with descriptive `RuntimeError`. - Recommended `config_fed_server.json` / `config_fed_client.json` snippets for HPC (Frontier/ORNL) scale. ## Test plan - [ ] Sphinx build (`make html`) passes without RST warnings on the updated file - [ ] All new cross-references (`.. code-block::`, `.. note::`) render correctly in the docs build - [ ] Verify section hierarchy (underline characters) is consistent throughout the file - [ ] Confirm PR #4209 is merged before this PR is merged 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…meouts, selective client exclusion, and dead-detection debounce (#4209) (#4288) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Fixes # . ### Description A few sentences describing the changes proposed in this pull request. ### Types of changes  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>

smaller lock in produce item

f8fe87b

YuanTingHsieh reviewed Feb 12, 2026

View reviewed changes

nvflare/fuel/f3/streaming/cacheable.py Outdated Show resolved Hide resolved

pcnudde added 2 commits February 11, 2026 18:24

edge case was ibn lock

11b2f5d

Merge branch '2.7' into 27_debug2

0882e85

pcnudde marked this pull request as ready for review February 12, 2026 21:21

pcnudde requested review from YuanTingHsieh and chesterxgchen February 12, 2026 21:21

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

YuanTingHsieh approved these changes Feb 12, 2026

View reviewed changes

nvflare/fuel/f3/streaming/cacheable.py Show resolved Hide resolved

Merge branch '2.7' into 27_debug2

9471cfa

greptile-apps bot reviewed Feb 12, 2026

View reviewed changes

pcnudde enabled auto-merge (squash) February 12, 2026 23:11

pcnudde merged commit 88cdc87 into NVIDIA:2.7 Feb 12, 2026
19 checks passed

pcnudde mentioned this pull request Feb 13, 2026

Cherry-pick #4174 and #4186 from 2.7 to main #4193

Merged

chesterxgchen mentioned this pull request Feb 22, 2026

[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce #4209

Merged

6 tasks

chesterxgchen mentioned this pull request Feb 22, 2026

[2.7] Update 2.7.2 release notes: streaming hardening, memory management, hierarchical startup stability [skip ci] #4218

Merged

4 tasks

YuanTingHsieh mentioned this pull request Mar 11, 2026

Cherry-pick [2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce (#4209) #4288

Merged

12 tasks

pcnudde deleted the 27_debug2 branch March 11, 2026 02:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2.7] smaller lock in produce item#4174

[2.7] smaller lock in produce item#4174
pcnudde merged 4 commits intoNVIDIA:2.7from
pcnudde:27_debug2

pcnudde commented Feb 11, 2026

Uh oh!

Uh oh!

pcnudde commented Feb 12, 2026

Uh oh!

greptile-apps bot commented Feb 12, 2026 •

edited

Loading

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

pcnudde commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pcnudde commented Feb 11, 2026

Description

Types of changes

Uh oh!

Uh oh!

pcnudde commented Feb 12, 2026

Uh oh!

greptile-apps bot commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pcnudde commented Feb 12, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps bot commented Feb 12, 2026 •

edited

Loading