
[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce #4209

Merged
chesterxgchen merged 12 commits into NVIDIA:2.7 from chesterxgchen:2.7.3_hierarchical_startup_fix
Feb 24, 2026
Conversation


chesterxgchen (Collaborator) commented Feb 20, 2026

Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in
Round 0 due to a cascading startup failure chain. The root sequence is:

  1. F3 streaming HOL stall (PR #4206, "[2.7] Mitigate F3 streaming Head-of-Line (HOL) stalls and add guardrails") delays deployment ACKs from relay-connected clients
  2. _deploy_job() treats reply=None (timeout) as "unknown" — not a failure — so
    timed-out clients silently appear to have been deployed
  3. _start_run() tries to start those clients; they again time out, and
    check_client_replies() ignores the None reply
  4. _sync_client_jobs() fires dead-job notification on the very first heartbeat with
    no startup grace period
  5. FedAvg requires 144/144 — one or two missing clients → abort
  6. A late-starting CJ crashes with TypeError: 'NoneType' object is not iterable when
    get_job_clients() receives None metadata from an already-aborted job

PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the
transport layer. This PR addresses the remaining job lifecycle layer.


Fixes Included

1 — _deploy_job(): Treat deployment timeout as failure (job_runner.py)

Root bug: reply=None was logged as "unknown" and excluded from failed_clients,
so timed-out clients counted as "successfully deployed" for the min_sites check.

Fix: Add timed-out clients to failed_clients with a "deployment timeout" label.
The existing min_sites / required_sites logic then correctly decides whether to abort.
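The fixed classification can be sketched as follows. This is a hypothetical, simplified shape — `classify_deploy_replies`, the dict-of-replies input, and the `"OK"` return code are illustrative assumptions, not NVFlare's actual `_deploy_job()` signature:

```python
# Hedged sketch of fix 1: reply=None (timeout) now counts as an explicit
# failure instead of an unlogged "unknown" that inflated the deployed count.
def classify_deploy_replies(replies: dict) -> tuple[list, list]:
    """Split {client_name: reply-dict-or-None} into (deployed_ok, failed_clients)."""
    deployed_ok = []
    failed_clients = []
    for name, reply in replies.items():
        if reply is None:
            # The fix: a timed-out deployment counts against min_sites.
            failed_clients.append((name, "deployment timeout"))
        elif reply.get("rc") != "OK":
            failed_clients.append((name, reply.get("rc", "error")))
        else:
            deployed_ok.append(name)
    return deployed_ok, failed_clients
```

With this split in place, the existing min_sites / required_sites logic can operate on `failed_clients` unchanged.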

2 — check_client_replies(): Return timed-out clients instead of raising (admin.py)

Root bug: In strict mode, any timeout raised immediately, aborting the whole job even
when the remaining active clients satisfied min_sites.

Fix: In strict mode, collect timed-out clients into a return list rather than raising.
Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict
mode to use name-keyed dict lookup instead of fragile positional zip().

New signature: check_client_replies(...) -> List[str] (timed-out client names; empty = none).
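A minimal sketch of the new contract — the reply shape, the `"Error:"` body-prefix convention, and the parameter names are assumptions for illustration, not the real admin.py API:

```python
# Hedged sketch of fix 2: strict mode collects timeouts into the return
# value; explicit errors still raise; lookup is name-keyed, not positional.
def check_client_replies(replies: dict, client_names: list, strict: bool) -> list:
    """Return the names of timed-out clients (empty list = none timed out)."""
    timed_out = []
    for name in client_names:
        reply = replies.get(name)          # dict lookup replaces fragile zip()
        if reply is None:
            if strict:
                timed_out.append(name)     # collected, not raised
            continue
        if reply.get("rc") != "OK" or str(reply.get("body", "")).startswith("Error:"):
            raise RuntimeError(f"client {name} returned an error: {reply}")
    return timed_out
```

The caller then decides what a non-empty return means, which is what enables the selective exclusion in fix 3.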

3 — _start_run(): Selective exclusion with min_sites re-evaluation (job_runner.py)

Root bug: A start-job timeout under strict mode aborted the entire job with no
tolerance for stragglers within min_sites bounds.

Fix: Use the returned timed-out list from check_client_replies(). If remaining
active clients >= min_sites, log a warning and proceed. Only abort when below tolerance.
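The tolerance decision can be sketched like this (the function name and argument shapes are hypothetical; the real logic lives inline in `_start_run()`):

```python
# Hedged sketch of fix 3: decide whether a start-job timeout is tolerable.
def evaluate_after_timeouts(client_names, timed_out, min_sites, required_sites=()):
    """Return the remaining active clients, or raise when the job must abort."""
    for name in timed_out:
        if name in required_sites:
            # consistent with the deployment phase: a required site may not drop
            raise RuntimeError(f"required site {name} timed out at start-job")
    active = [c for c in client_names if c not in timed_out]
    if min_sites and len(active) < min_sites:
        raise RuntimeError(
            f"{len(timed_out)} client(s) timed out; {len(active)} < min_sites {min_sites}"
        )
    return active   # within tolerance: log a warning and proceed with these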

4 — _sync_client_jobs(): Require-prior-report default changed to True (fed_server.py)

Root bug: SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT defaulted to False, meaning
the bug fix was opt-in and the unsafe behaviour remained the default.

Fix: Default changed to True. Operators who want the aggressive legacy detection
can set it to False explicitly.

5 — _sync_client_jobs(): Move _reported_clients out of job_info dict (fed_server.py)

Root bug: Positive-observation tracking was stored as job_info["_reported_clients"],
injecting algorithm state into a data dict with no corresponding RunProcessKey constant.

Fix: Tracking moved to self._job_reported_clients: Dict[str, set] on FederatedServer.
Stale entries are purged whenever a job is no longer in run_processes.
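Fixes 4 and 5 together amount to a debounced dead-job check with instance-level tracking. This toy sketch assumes a simplified heartbeat interface; the class and method names are illustrative, not FederatedServer's real ones:

```python
# Hedged sketch of fixes 4 + 5: a dead-job notification fires only after a
# prior positive report (default debounce), and tracking lives on the
# instance (like self._job_reported_clients), not inside job_info.
class DeadJobDetector:
    def __init__(self, require_previous_report: bool = True):  # new default: True
        self.require_previous_report = require_previous_report
        self._job_reported_clients: dict[str, set] = {}

    def on_heartbeat(self, client: str, job_id: str, job_running_on_client: bool) -> bool:
        """Return True when a dead-job notification should fire for this client."""
        reported = self._job_reported_clients.setdefault(job_id, set())
        if job_running_on_client:
            reported.add(client)           # positive observation
            return False
        if self.require_previous_report and client not in reported:
            return False                   # still starting up: skip notification
        return True                        # legacy (False) path fires immediately

    def purge(self, run_processes: set) -> None:
        """Drop tracking for jobs no longer in run_processes."""
        for job_id in list(self._job_reported_clients):
            if job_id not in run_processes:
                del self._job_reported_clients[job_id]
```

A client that has never reported the job gets an implicit startup grace period; once it has reported at least once, a missing job on a later heartbeat is treated as dead.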

6 — ClientRunManager.get_job_clients(): Explicit meta validation (client_run_manager.py)

Raises RuntimeError with a descriptive message instead of an opaque TypeError when
JOB_CLIENTS is absent or the wrong type.
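A sketch of the hardened read — the literal `"JOB_CLIENTS"` string and the standalone-function shape are simplifications of the real `ClientRunManager` method:

```python
# Hedged sketch of fix 6: validate job metadata explicitly instead of letting
# None propagate into an opaque "'NoneType' object is not iterable" TypeError.
def get_job_clients(meta) -> list:
    if not isinstance(meta, dict):
        raise RuntimeError(f"job meta must be a dict, got {type(meta).__name__}")
    job_clients = meta.get("JOB_CLIENTS")
    if not isinstance(job_clients, list):
        raise RuntimeError(
            f"JOB_CLIENTS missing or invalid in job meta: "
            f"expected list, got {type(job_clients).__name__}"
        )
    return job_clients
```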


Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| FedAvg(min_clients=...) | 96-98% of num_clients | Tolerates a few startup stragglers |
| runner_sync_timeout | 120 s | Allows Lustre-backed deployments time to complete |
| strict_start_job_reply_check | true | Start-job timeouts surfaced, straggler clients excluded |
| sync_client_jobs_require_previous_report | true (now the default) | Prevents premature dead-job from startup delay |
| SFM_CLOSE_STALLED_CONNECTION (PR #4206) | true after staging | Disconnects stalled relay connections |

Files Changed

  • nvflare/private/fed/server/job_runner.py — _deploy_job() timeout as failure; _start_run() selective exclusion
  • nvflare/private/fed/server/admin.py — check_client_replies() returns timed-out list; dict-keyed non-strict path
  • nvflare/private/fed/server/fed_server.py — _sync_client_jobs() default True; _job_reported_clients attr; stale cleanup
  • nvflare/private/fed/client/client_run_manager.py — explicit meta validation in get_job_clients()

Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| admin_test.py | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK |
| job_runner_test.py | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| job_runner_deploy_test.py | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| fed_server_test.py | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup |

All 29 targeted unit tests pass.

Test Plan

  • Unit tests for each changed function (positive + negative)
  • New job_runner_deploy_test.py covering deployment timeout classification end-to-end
  • All 29 targeted unit tests pass
  • Hierarchical staging run with all flags at default
  • Hierarchical staging run with strict_start_job_reply_check=true and reduced min_clients
  • Verify no regression on standard (non-hierarchical) FL jobs

greptile-apps bot commented Feb 20, 2026

Greptile Summary

This PR addresses cascading startup failures in large-scale hierarchical FL by fixing timeout handling and client exclusion logic across the job lifecycle layer.

Key Changes:

  • _deploy_job() now treats deployment timeouts (reply=None) as failures rather than unknown status, enabling proper min_sites/required_sites validation
  • check_client_replies() refactored to return timed-out client list in strict mode instead of raising, allowing selective exclusion within tolerance bounds; also switched from fragile zip() to dict-based lookup
  • _start_run() implements selective client exclusion with required_sites and min_sites re-evaluation after timeouts, and correctly handles metadata in both strict and non-strict modes
  • _sync_client_jobs() default changed to require prior positive heartbeat before firing dead-job notifications (prevents false positives during startup), and tracking moved from job_info dict to clean instance variable with stale entry cleanup
  • get_job_clients() now validates metadata explicitly to prevent opaque TypeError crashes

All previous review concerns addressed:

  • Non-strict mode now correctly excludes timed-out clients from JOB_CLIENTS metadata (lines 307-320 in job_runner.py)
  • required_sites validation added to start phase, consistent with deployment phase
  • Double metadata assignment is intentional: set before start_client_job for serialization, then updated after timeout resolution

Test Coverage: 29 new/updated unit tests with comprehensive positive and negative cases validate all fixes.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - all fixes are well-architected, thoroughly tested, and address real production failures
  • Score reflects comprehensive fix of critical production issue with excellent test coverage (29 tests), all previous review concerns properly addressed, clean refactoring that improves maintainability, and no breaking changes to existing functionality
  • No files require special attention - all changes are well-tested and properly validated

Important Files Changed

| Filename | Overview |
|---|---|
| nvflare/private/fed/server/job_runner.py | Deployment timeout classification, selective client exclusion with min_sites/required_sites validation, and proper JOB_CLIENTS metadata management in both strict and non-strict modes |
| nvflare/private/fed/server/admin.py | Refactored check_client_replies to return timed-out clients list instead of raising, with dict-based lookup and proper type checking |
| nvflare/private/fed/server/fed_server.py | Changed default for require_previous_report to True, moved _job_reported_clients tracking to instance variable with proper cleanup, preventing premature dead-job detection |
| nvflare/private/fed/client/client_run_manager.py | Added explicit validation for JOB_CLIENTS metadata to prevent TypeError when metadata is None or wrong type |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Job Deployment Start] --> Deploy[_deploy_job: Send to clients]
    Deploy --> CheckDeploy{Check replies}
    CheckDeploy -->|reply=None| Timeout1[Add to failed_clients with 'deployment timeout']
    CheckDeploy -->|reply.rc!=OK| Fail1[Add to failed_clients with error detail]
    CheckDeploy -->|reply.rc=OK| Success1[Mark as deployed OK]
    
    Timeout1 --> Validate1{Validate min_sites/<br/>required_sites}
    Fail1 --> Validate1
    Success1 --> Validate1
    
    Validate1 -->|Failed client in<br/>required_sites| Abort1[Abort job]
    Validate1 -->|Active < min_sites| Abort1
    Validate1 -->|Within tolerance| StartPhase[_start_run: Start job]
    
    StartPhase --> SetMeta1[Set JOB_CLIENTS metadata<br/>before start_client_job]
    SetMeta1 --> StartClients[Call start_client_job]
    StartClients --> CheckStart{check_client_replies<br/>strict mode?}
    
    CheckStart -->|strict=True| StrictPath[Returns timed_out list]
    CheckStart -->|strict=False| NonStrictPath[Returns empty list]
    
    StrictPath --> HasTimeout{timed_out<br/>non-empty?}
    HasTimeout -->|Yes| CheckRequired2{Timed-out client<br/>in required_sites?}
    CheckRequired2 -->|Yes| Abort2[Abort job]
    CheckRequired2 -->|No| CheckMin2{Active >= min_sites?}
    CheckMin2 -->|No| Abort2
    CheckMin2 -->|Yes| Warn[Log warning,<br/>exclude timed-out clients]
    HasTimeout -->|No| UpdateMeta
    
    NonStrictPath --> RebuildFromReplies[Rebuild active_client_sites<br/>from actual replies]
    RebuildFromReplies --> UpdateMeta[Update JOB_CLIENTS metadata<br/>with active clients only]
    Warn --> UpdateMeta
    
    UpdateMeta --> Running[Job Running]
    Running --> Heartbeat[Client heartbeat]
    Heartbeat --> SyncJobs{_sync_client_jobs}
    
    SyncJobs --> CheckPrior{require_previous_report<br/>default=True}
    CheckPrior -->|Job on server<br/>but not client| HasReported{Client reported<br/>this job before?}
    HasReported -->|Yes| DeadJob[Fire dead-job notification]
    HasReported -->|No| SkipNotify[Skip notification<br/>still starting up]
    
    CheckPrior -->|Job on client<br/>but not server| AbortClient[Tell client to abort job]

Last reviewed commit: 71d200e

greptile-apps bot left a review: 11 files reviewed, 1 comment

chesterxgchen changed the title from "[2.7.3] Stabilize hierarchical startup checks for BERT-144" to "[2.7] Stabilize hierarchical startup checks for BERT-144" Feb 22, 2026
chesterxgchen changed the title from "[2.7] Stabilize hierarchical startup checks for BERT-144" to "[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce" Feb 22, 2026
@chesterxgchen chesterxgchen force-pushed the 2.7.3_hierarchical_startup_fix branch from 1e8b489 to 47bb81a Compare February 22, 2026 01:56
chesterxgchen: /build

greptile-apps bot left a review: 12 files reviewed, 1 comment

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 22, 2026

Additional Comments (1)

nvflare/private/fed/server/job_runner.py
job.meta[JobMetaKey.JOB_CLIENTS] is populated with all clients from client_sites (line 264-265), but after timeout exclusion (line 295), some clients may be removed from the active set. This creates a mismatch: timed-out clients are in JOB_CLIENTS metadata but won't actually start. Downstream code expecting JOB_CLIENTS to reflect active participants may break.

```python
# After timeout exclusion, rebuild job.meta[JobMetaKey.JOB_CLIENTS] with only active clients
if timed_out:
    active_count = len(client_sites_names) - len(timed_out)
    if job.min_sites and active_count < job.min_sites:
        raise RuntimeError(
            f"start job ({job_id}): {len(timed_out)} client(s) timed out and remaining "
            f"{active_count} < min_sites {job.min_sites}: {timed_out}"
        )
    self.log_warning(
        fl_ctx,
        f"start job ({job_id}): {len(timed_out)} client(s) timed out at start-job: {timed_out}; "
        f"{active_count} of {len(client_sites_names)} clients started successfully.",
    )
    client_sites_names = [c for c in client_sites_names if c not in timed_out]
    # Update JOB_CLIENTS to reflect only active participants
    active_clients = [c.to_dict() for token, c in job_clients.items() if c.name in client_sites_names]
    job.meta[JobMetaKey.JOB_CLIENTS] = active_clients
```

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 22, 2026
…ent, hierarchical startup stability

Add three major new sections to flare_272.rst covering work merged after the
initial 2.7.2 draft:

Memory Management (restructured):
- Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210)
- Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache
  injected after flare.send() without training script changes (PR NVIDIA#4211)
- Retain existing TensorDownloader and server-side cleanup content

F3 Streaming Reliability and Performance (new section):
- HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall
  detection/recovery with recommended env-var settings (PR NVIDIA#4206)
- Stream pool starvation fix: blob callbacks dispatched to dedicated thread
  pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172)
- Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167)
- RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204)
- Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174)

Hierarchical FL Startup Stability (new section):
- Deployment timeout correctly classified as failure; min_sites check applied
  at deployment phase (PR NVIDIA#4209)
- Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209)
- Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209)
- Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209)
- Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale)

Bug Fixes section updated with all streaming and hierarchical startup fixes.
Intro paragraph updated to reflect system hardening scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen (Collaborator, Author) commented Feb 22, 2026

Addressed the metadata consistency feedback in commit e0af9ac.

Changes made:

  • JobRunner._start_run() now rebuilds job.meta[JobMetaKey.JOB_CLIENTS] after timeout exclusion so metadata reflects active participants only.
  • Added unit tests to cover:
    • metadata update when clients time out (test_start_run_updates_job_clients_meta_after_timeout_exclusion)
    • metadata unchanged when no timeouts (test_start_run_keeps_job_clients_meta_when_no_timeouts)
    • deploy->start filtering path now actually executes _start_run() and verifies only deployable clients are started.

Validation:

  • pytest -q tests/unit_test/private/fed/server/job_runner_test.py
  • pytest -q tests/unit_test/private/fed/server/job_runner_deploy_test.py

greptile-apps bot left a review: 12 files reviewed, no comments

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 22, 2026
…ease notes

Add a visible source comment above the Hierarchical FL Startup Stability
section noting that its content depends on PR NVIDIA#4209 and should not be
merged before that PR lands on 2.7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen (Collaborator, Author) commented Feb 22, 2026

Follow-up added in commit df7c06b:

  • Added an integration-style _start_run test that uses the real check_client_replies timeout path (no mock for reply-check behavior) and verifies JOB_CLIENTS is reduced to active clients after timeout exclusion.

Validation on updated test scope:

  • flake8 tests/unit_test/private/fed/server/job_runner_test.py
  • pytest -q tests/unit_test/private/fed/server/job_runner_test.py tests/unit_test/private/fed/server/job_runner_deploy_test.py (16 passed)

greptile-apps bot left a review: 12 files reviewed, no comments

chesterxgchen: /build

chesterxgchen: /build

greptile-apps bot left a review: 11 files reviewed, no comments

chesterxgchen: /build

greptile-apps bot left a review: 13 files reviewed, no comments

chesterxgchen and others added 10 commits February 24, 2026 14:47
…ctive exclusion, dead-detection debounce

- _deploy_job(): treat reply=None (timeout) as deployment failure so timed-out
  clients are correctly evaluated against min_sites / required_sites, rather than
  silently counted as successfully deployed
- check_client_replies(): strict mode now returns List[str] of timed-out clients
  instead of raising; explicit errors still raise; non-strict path uses dict-keyed
  lookup instead of fragile positional zip()
- _start_run(): use returned timed-out list to selectively exclude stragglers;
  re-evaluate active count against job.min_sites before aborting
- _sync_client_jobs(): change SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT default
  False->True so safe debounced behaviour is active without explicit config;
  move _reported_clients tracking to self._job_reported_clients on FederatedServer
  (out of job_info dict); purge stale entries when jobs leave run_processes
- Add license header to tests/unit_test/private/fed/client/__init__.py
- Tests: rewrite admin_test.py; extend job_runner_test.py and fed_server_test.py;
  add new job_runner_deploy_test.py (29 tests, all positive + negative cases)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update _start_run to rebuild JOB_CLIENTS from active participants when timed-out clients are excluded, and add unit coverage to verify metadata consistency and deploy-to-start filtering behavior.
Add a start-run integration-style unit test that exercises the real timeout reply-check path and verifies JOB_CLIENTS metadata is reduced to active clients after timeout exclusion.
Drop the temporary hierarchical FL BERT-144 analysis markdown from the PR scope.
Add user-facing documentation for strict_start_job_reply_check and sync_client_jobs_require_previous_report, including defaults, usage guidance, and cross-reference from troubleshooting to the full timeout reference.
- admin.py: use missing_clients list directly in error message (no manual join)
- admin.py: replace fragile zip() positional matching in non-strict path with
  dict-keyed lookup, consistent with strict mode
- admin.py: add isinstance(r.reply.body, str) guard and use startswith() instead
  of `in` for ERROR_MSG_PREFIX check in non-strict path
- fed_server.py: remove redundant `or []` from JOB_IDS header read; the
  isinstance check on the next line already handles None/invalid types

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… docs

- admin.py: align strict-mode ERROR_MSG_PREFIX check to use startswith()
- fed_server.py: add _job_reported_clients_lock for concurrent heartbeat safety
- job_runner.py: add comment on timed-out clients and require_previous_report
- admin_test.py: fix match pattern for list repr; add startswith semantics tests
- fed_server_test.py: remove unused _make_server() helper
- job_runner_deploy_test.py: remove no-op patch.object block
- docs/timeouts.rst: clarify when to enable strict_start_job_reply_check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _start_run() now validates required_sites on timeout, consistent with
  _deploy_job(): if a timed-out client is in job.required_sites the job
  aborts even when active_count >= min_sites.
- Remove redundant early metadata assignment (line 265); JOB_CLIENTS is
  now set once after timeout exclusion so it always reflects actual
  active participants.
- Add two unit tests: required-site timeout aborts; non-required-site
  timeout proceeds and metadata is correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restore JOB_CLIENTS metadata before start_client_job so client startup headers include participating clients. Add a unit test that asserts JOB_CLIENTS is present when start_client_job is invoked.
@chesterxgchen chesterxgchen force-pushed the 2.7.3_hierarchical_startup_fix branch from 9f18728 to 1135840 Compare February 24, 2026 22:47
chesterxgchen: /build

pcnudde
pcnudde previously approved these changes Feb 24, 2026
chesterxgchen: /build

greptile-apps bot left a review: 13 files reviewed, 1 comment

Keep JOB_CLIENTS metadata aligned with actual client start replies when strict_start_reply_check is disabled by deriving active participants from non-empty replies. Add a unit test that reproduces non-strict timeout behavior and verifies timed-out clients are excluded from JOB_CLIENTS.
greptile-apps bot left a review: 13 files reviewed, no comments

chesterxgchen: /build

@chesterxgchen chesterxgchen merged commit 572990d into NVIDIA:2.7 Feb 24, 2026
18 of 19 checks passed
@chesterxgchen chesterxgchen deleted the 2.7.3_hierarchical_startup_fix branch February 24, 2026 23:27
chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ctive client exclusion, and dead-detection debounce (NVIDIA#4209)

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ent, hierarchical startup stability

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ease notes

Add a visible source comment above the Hierarchical FL Startup Stability
section noting that its content depends on PR NVIDIA#4209 and should not be
merged before that PR lands on 2.7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen added a commit that referenced this pull request Feb 25, 2026
…ent, hierarchical startup stability [skip ci] (#4218)

## Merge Dependency

> ⚠️ **Depends on #4209** — The *Hierarchical FL Startup Stability*
section documents changes introduced by PR #4209 (currently open).
**Merge PR #4209 into `2.7` before merging this PR.** All other sections
cover already-merged PRs.

---

## Summary

This PR updates `docs/release_notes/flare_272.rst` to reflect all major
changes merged into the 2.7.x line after the initial 2.7.2 draft,
covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor
Copy at CJ (PR #4210) and client-side memory lifecycle management (PR
#4211)
- **F3 Streaming Reliability and Performance** — new section covering
HOL stall mitigation (PR #4206), stream pool starvation fix (PR
#4171/#4172), streaming download retry (PR #4167), RxTask self-deadlock
fix (PR #4204), and lock contention reduction (PR #4174)
- **Hierarchical FL Startup Stability** — new section covering
deployment timeout classification, startup grace period, selective
client exclusion, and metadata hardening (PR #4209 — pending merge),
with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly.

A source-level RST comment has been added above the Hierarchical FL
section in the file to alert future maintainers to the merge dependency.

## Merged PRs Documented

| PR | Area | Status |
|---|---|---|
| #4171 / #4172 | Stream pool starvation fix | Merged |
| #4174 | Lock contention reduction | Merged |
| #4167 | Streaming download retry | Merged |
| #4204 | RxTask self-deadlock fix | Merged |
| #4206 | HOL stall mitigation | Merged |
| #4210 | Zero tensor copy at CJ | Merged |
| #4211 | Client-side memory management | Merged |
| #4209 | Hierarchical FL startup stability | **Open — merge before this PR** |

## Changes

### Memory Management (restructured)

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds
`LazyDownloadRef` placeholders instead of materializing full tensors,
eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` /
jemalloc purge / `torch.cuda.empty_cache()` injected after every
`flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.
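
The post-send memory release described above can be sketched as a standalone function. This is an illustrative sketch, not the NVFlare implementation: the injection point (after `flare.send()`) and the `client_memory_gc_rounds` knob come from the PR text, but the function name and structure here are assumptions.

```python
# Hedged sketch of the client-side memory release the release notes describe.
# Illustrative only: release_memory_after_send() is not an NVFlare API.
import ctypes
import ctypes.util
import gc


def release_memory_after_send() -> None:
    gc.collect()  # drop Python-level references first
    libc_name = ctypes.util.find_library("c")
    if libc_name:
        libc = ctypes.CDLL(libc_name)
        if hasattr(libc, "malloc_trim"):
            libc.malloc_trim(0)  # return freed heap pages to the OS (glibc only)
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached CUDA allocator blocks
    except ImportError:
        pass  # CUDA cleanup is a no-op without torch
```

Each step degrades gracefully: `malloc_trim` is skipped on non-glibc platforms and the CUDA step is skipped when `torch` is absent.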

### F3 Streaming Reliability and Performance (new section)

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress
watchdog, and stall detection/recovery. Includes recommended environment
variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a
dedicated `callback_thread_pool`, keeping stream workers free for
concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3
attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock`
released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside
`self.lock`; compare-and-store for cache write. Reduces model-download
latency under high client concurrency.
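
The streaming download retry described above (up to 3 attempts, delay capped at 60 s, abort-signal aware) can be sketched as follows. Function name, signature, and the initial delay are illustrative assumptions, not the F3 streaming implementation.

```python
# Hedged sketch of exponential-backoff retry on download timeout.
# download_with_retry() and its parameters are illustrative, not NVFlare APIs.
import time


def download_with_retry(download_fn, abort_signal=None, max_attempts=3,
                        initial_delay=5.0, cap_s=60.0):
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        if abort_signal is not None and abort_signal.is_set():
            raise RuntimeError("download aborted")  # abort-signal aware
        try:
            return download_fn()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout
            time.sleep(min(delay, cap_s))  # backoff capped at cap_s
            delay *= 2  # exponential growth between attempts
```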

### Hierarchical FL Startup Stability (new section — pending PR #4209)

- **Deployment Timeout as Failure**: `reply=None` correctly counted
against `min_sites`; timed-out clients excluded before
`start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — client
must be observed once before absence triggers dead-job notification.
Default changed to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded
rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced
with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json`
snippets for HPC (Frontier/ORNL) scale.

## Test plan

- [ ] Sphinx build (`make html`) passes without RST warnings on the
updated file
- [ ] All new cross-references (`.. code-block::`, `.. note::`) render
correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent
throughout the file
- [ ] Confirm PR #4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen added a commit that referenced this pull request Mar 12, 2026
…meouts, selective client exclusion, and dead-detection debounce (#4209) (#4288)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is:

1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job

PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort.
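
The classification logic can be sketched as below. This is a hedged sketch: `classify_deploy_replies()` and the reply dict shape are illustrative, not the actual `job_runner.py` code.

```python
# Hedged sketch of timeout-as-failure classification during deployment.
# classify_deploy_replies() is an illustrative name, not an NVFlare function.
from typing import Dict, List, Optional, Tuple


def classify_deploy_replies(
    replies: Dict[str, Optional[dict]], min_sites: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split replies into deployed clients and failed (client, reason) pairs."""
    deployed: List[str] = []
    failed: List[Tuple[str, str]] = []
    for client, reply in replies.items():
        if reply is None:
            # Previously logged as "unknown" and counted as deployed;
            # now a timeout is an explicit failure.
            failed.append((client, "deployment timeout"))
        elif reply.get("status") == "OK":
            deployed.append(client)
        else:
            failed.append((client, reply.get("status", "error")))
    if len(deployed) < min_sites:
        raise RuntimeError(
            f"only {len(deployed)} clients deployed, need {min_sites}: {failed}"
        )
    return deployed, failed
```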

### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`.

New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none).
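
The new contract can be sketched as follows; the reply shape and parameters are illustrative assumptions, not the actual `admin.py` signature beyond the documented `-> List[str]` return.

```python
# Hedged sketch: timeouts collected and returned, explicit errors still raise.
from typing import Dict, List, Optional


def check_client_replies(
    replies: Dict[str, Optional[dict]], command: str, strict: bool = True
) -> List[str]:
    """Return the names of clients that timed out (empty list = none)."""
    timed_out: List[str] = []
    for client, reply in replies.items():  # name-keyed lookup, not positional zip()
        if reply is None:
            timed_out.append(client)  # collected, not raised, even in strict mode
        elif strict and reply.get("status") != "OK":
            # An explicit error reply remains a hard failure.
            raise RuntimeError(
                f"client {client} failed '{command}': {reply.get('status')}"
            )
    return timed_out
```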

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance.
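
The re-evaluation step can be sketched as a small helper; the name `start_with_tolerance()` is illustrative, not the `_start_run()` code itself.

```python
# Hedged sketch: exclude start-job stragglers if enough clients remain.
from typing import List


def start_with_tolerance(
    all_clients: List[str], timed_out: List[str], min_sites: int
) -> List[str]:
    """Return the clients the run proceeds with, or abort if below min_sites."""
    remaining = [c for c in all_clients if c not in set(timed_out)]
    if len(remaining) >= min_sites:
        # warn-and-proceed path: stragglers are excluded from the run
        return remaining
    raise RuntimeError(
        f"only {len(remaining)} clients started, below min_sites={min_sites}"
    )
```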

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`.
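
The debounce behaviour of fixes 4 and 5 together can be sketched as below. The attribute name `_job_reported_clients` and the `True` default mirror the PR description; the class and method names are illustrative, not the `fed_server.py` source.

```python
# Hedged sketch of debounced dead-client detection: a client must be observed
# in at least one heartbeat before its absence can trigger a dead-job signal.
from typing import Dict, Set


class DeadClientTracker:
    def __init__(self, require_previous_report: bool = True):  # new default: True
        self.require_previous_report = require_previous_report
        self._job_reported_clients: Dict[str, Set[str]] = {}  # server attr, not job_info

    def on_heartbeat(self, job_id: str, reported: Set[str],
                     expected: Set[str]) -> Set[str]:
        """Return the clients to flag as dead for this job on this heartbeat."""
        seen = self._job_reported_clients.setdefault(job_id, set())
        seen |= reported
        missing = expected - reported
        if self.require_previous_report:
            # Debounce: only clients seen at least once can be declared dead.
            missing &= seen
        return missing

    def purge(self, running_jobs: Set[str]) -> None:
        # Drop tracking state for jobs no longer in run_processes.
        for job_id in list(self._job_reported_clients):
            if job_id not in running_jobs:
                del self._job_reported_clients[job_id]
```

With the default, a slow-starting client that has never reported is not flagged on the first heartbeat; setting `require_previous_report=False` restores the aggressive legacy detection.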

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type.
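
The validation can be sketched as below; the meta key name `"job_clients"` is an assumption standing in for the `JOB_CLIENTS` constant, and the function is illustrative, not the `client_run_manager.py` method.

```python
# Hedged sketch: validate job metadata before iterating, so an aborted job
# surfaces a descriptive RuntimeError instead of an opaque TypeError.
from typing import List, Optional


def get_job_clients(job_meta: Optional[dict]) -> List[str]:
    if job_meta is None:
        raise RuntimeError(
            "job meta is missing; the job may have been aborted before this "
            "client started"
        )
    clients = job_meta.get("job_clients")  # key name is illustrative
    if not isinstance(clients, list):
        raise RuntimeError(
            f"invalid job clients meta: expected list, got {type(clients).__name__}"
        )
    return clients
```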

---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections |
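
The `min_clients` recommendation in the table can be computed with a small helper. This is a hypothetical convenience function, not an NVFlare API; the 96-98% band comes from the table above.

```python
# Hedged helper (not an NVFlare API): pick min_clients at ~96-98% of num_clients
# so a few startup stragglers do not abort the round.
import math


def recommended_min_clients(num_clients: int, tolerance: float = 0.97) -> int:
    return max(1, math.floor(num_clients * tolerance))
```

For the 144-client Frontier run described above, this yields a `min_clients` of 139 at the default 97% tolerance.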

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout
as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()`
returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()`
default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta
validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error
still raises; reorder OK |
| `job_runner_test.py` | 4 | strict flag wiring; timeout within
tolerance → warn; timeout below tolerance → raise | |
`job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure;
OK reply not failed; mixed outcomes; detail label; min_sites with
timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires-prior-report; legacy
explicit-False still fires; tracking in server attr not job_info; stale
cleanup |

All 29 targeted unit tests pass.

## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout
classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true`
and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---------

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>