[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce (#4209)
Conversation
Greptile Summary

This PR addresses cascading startup failures in large-scale hierarchical FL by fixing timeout handling and client exclusion logic across the job lifecycle layer.

Key Changes:
All previous review concerns have been addressed.
Test Coverage: 29 new/updated unit tests with comprehensive positive and negative cases validate all fixes.

Confidence Score: 5/5
Important Files Changed
Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
Start[Job Deployment Start] --> Deploy[_deploy_job: Send to clients]
Deploy --> CheckDeploy{Check replies}
CheckDeploy -->|reply=None| Timeout1[Add to failed_clients with 'deployment timeout']
CheckDeploy -->|reply.rc!=OK| Fail1[Add to failed_clients with error detail]
CheckDeploy -->|reply.rc=OK| Success1[Mark as deployed OK]
Timeout1 --> Validate1{Validate min_sites/<br/>required_sites}
Fail1 --> Validate1
Success1 --> Validate1
Validate1 -->|Failed client in<br/>required_sites| Abort1[Abort job]
Validate1 -->|Active < min_sites| Abort1
Validate1 -->|Within tolerance| StartPhase[_start_run: Start job]
StartPhase --> SetMeta1[Set JOB_CLIENTS metadata<br/>before start_client_job]
SetMeta1 --> StartClients[Call start_client_job]
StartClients --> CheckStart{check_client_replies<br/>strict mode?}
CheckStart -->|strict=True| StrictPath[Returns timed_out list]
CheckStart -->|strict=False| NonStrictPath[Returns empty list]
StrictPath --> HasTimeout{timed_out<br/>non-empty?}
HasTimeout -->|Yes| CheckRequired2{Timed-out client<br/>in required_sites?}
CheckRequired2 -->|Yes| Abort2[Abort job]
CheckRequired2 -->|No| CheckMin2{Active >= min_sites?}
CheckMin2 -->|No| Abort2
CheckMin2 -->|Yes| Warn[Log warning,<br/>exclude timed-out clients]
HasTimeout -->|No| UpdateMeta
NonStrictPath --> RebuildFromReplies[Rebuild active_client_sites<br/>from actual replies]
RebuildFromReplies --> UpdateMeta[Update JOB_CLIENTS metadata<br/>with active clients only]
Warn --> UpdateMeta
UpdateMeta --> Running[Job Running]
Running --> Heartbeat[Client heartbeat]
Heartbeat --> SyncJobs{_sync_client_jobs}
SyncJobs --> CheckPrior{require_previous_report<br/>default=True}
CheckPrior -->|Job on server<br/>but not client| HasReported{Client reported<br/>this job before?}
HasReported -->|Yes| DeadJob[Fire dead-job notification]
HasReported -->|No| SkipNotify[Skip notification<br/>still starting up]
CheckPrior -->|Job on client<br/>but not server| AbortClient[Tell client to abort job]
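The debounce branch at the bottom of the flowchart can be sketched in Python. All names here (`sync_client_jobs`, `job_reported_clients`, the return shape) are illustrative stand-ins, not NVFlare's actual API:

```python
def sync_client_jobs(server_jobs, client_jobs, client_name,
                     job_reported_clients, require_previous_report=True):
    """Return (jobs to mark dead on server, jobs the client should abort).

    Hedged sketch of the heartbeat sync logic in the flowchart above;
    names and return shape are illustrative, not NVFlare's actual API.
    """
    dead_jobs, abort_on_client = [], []
    for job_id in server_jobs:
        if job_id in client_jobs:
            # Positive observation: remember this client has reported the job.
            job_reported_clients.setdefault(job_id, set()).add(client_name)
        else:
            reported = client_name in job_reported_clients.get(job_id, set())
            if not require_previous_report or reported:
                dead_jobs.append(job_id)  # client had it before, now gone
            # else: client may still be starting up -> skip notification
    for job_id in client_jobs:
        if job_id not in server_jobs:
            abort_on_client.append(job_id)  # server no longer runs this job
    return dead_jobs, abort_on_client
```

With `require_previous_report=True` (the new default), the very first heartbeat that lacks a job produces no dead-job notification; only a job that was seen once and then disappears does.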
Last reviewed commit: 71d200e

Force-pushed 1e8b489 to 47bb81a

/build
…ent, hierarchical startup stability

Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft:

Memory Management (restructured):
- Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210)
- Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211)
- Retain existing TensorDownloader and server-side cleanup content

F3 Streaming Reliability and Performance (new section):
- HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206)
- Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172)
- Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167)
- RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204)
- Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174)

Hierarchical FL Startup Stability (new section):
- Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209)
- Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209)
- Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209)
- Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209)
- Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale)

Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
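The client-side cleanup described above (malloc_trim / torch.cuda.empty_cache after flare.send()) might look roughly like the sketch below. The function name and call site are assumptions, and the glibc and CUDA calls are guarded so the sketch degrades gracefully on platforms without them:

```python
import ctypes
import ctypes.util
import gc

def release_client_memory():
    """Best-effort memory release after a training round.

    Sketch of the cleanup the commit message describes (gc + malloc_trim +
    CUDA cache purge); the function name and call site are assumptions.
    """
    gc.collect()  # drop unreachable Python objects first
    libc_path = ctypes.util.find_library("c")
    if libc_path:
        try:
            # glibc-only: return freed heap pages to the OS
            ctypes.CDLL(libc_path).malloc_trim(0)
        except (AttributeError, OSError):
            pass  # musl/macOS libc has no malloc_trim
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached CUDA allocator blocks
    except ImportError:
        pass  # torch not installed; nothing to do
```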
Addressed the metadata consistency feedback in commit e0af9ac.
…ease notes

Add a visible source comment above the Hierarchical FL Startup Stability section noting that its content depends on PR NVIDIA#4209 and should not be merged before that PR lands on 2.7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follow-up added in commit df7c06b.
…ctive exclusion, dead-detection debounce

- _deploy_job(): treat reply=None (timeout) as deployment failure so timed-out clients are correctly evaluated against min_sites / required_sites, rather than silently counted as successfully deployed
- check_client_replies(): strict mode now returns List[str] of timed-out clients instead of raising; explicit errors still raise; non-strict path uses dict-keyed lookup instead of fragile positional zip()
- _start_run(): use returned timed-out list to selectively exclude stragglers; re-evaluate active count against job.min_sites before aborting
- _sync_client_jobs(): change SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT default False->True so safe debounced behaviour is active without explicit config; move _reported_clients tracking to self._job_reported_clients on FederatedServer (out of job_info dict); purge stale entries when jobs leave run_processes
- Add license header to tests/unit_test/private/fed/client/__init__.py
- Tests: rewrite admin_test.py; extend job_runner_test.py and fed_server_test.py; add new job_runner_deploy_test.py (29 tests, all positive + negative cases)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
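The first bullet's reply classification reduces to a small pure function. This is a minimal sketch with simplified stand-in types; `classify_deploy_replies` and the reply shape are illustrative, not the actual `job_runner.py` code:

```python
OK = "OK"  # stand-in for the real return-code constant

def classify_deploy_replies(replies):
    """Split deployment replies into deployed-OK and failed clients.

    `replies` maps client name -> reply object with an `rc` attribute,
    or None when the client timed out.
    """
    deployed, failed = [], {}
    for client, reply in replies.items():
        if reply is None:
            # The fix: a timeout is a failure, not "unknown".
            failed[client] = "deployment timeout"
        elif reply.rc != OK:
            failed[client] = f"deployment failed: {reply.rc}"
        else:
            deployed.append(client)
    return deployed, failed
```

The existing min_sites / required_sites validation can then run unchanged over `failed`, which now includes timed-out clients.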
Update _start_run to rebuild JOB_CLIENTS from active participants when timed-out clients are excluded, and add unit coverage to verify metadata consistency and deploy-to-start filtering behavior.
Add a start-run integration-style unit test that exercises the real timeout reply-check path and verifies JOB_CLIENTS metadata is reduced to active clients after timeout exclusion.
Drop the temporary hierarchical FL BERT-144 analysis markdown from the PR scope.
Add user-facing documentation for strict_start_job_reply_check and sync_client_jobs_require_previous_report, including defaults, usage guidance, and cross-reference from troubleshooting to the full timeout reference.
- admin.py: use missing_clients list directly in error message (no manual join)
- admin.py: replace fragile zip() positional matching in non-strict path with dict-keyed lookup, consistent with strict mode
- admin.py: add isinstance(r.reply.body, str) guard and use startswith() instead of `in` for ERROR_MSG_PREFIX check in non-strict path
- fed_server.py: remove redundant `or []` from JOB_IDS header read; the isinstance check on the next line already handles None/invalid types

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
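The dict-keyed lookup change can be illustrated as follows. With positional `zip()`, a reordered or shorter reply list silently pairs the wrong client with the wrong reply; keying by name cannot mis-pair. `find_error_clients`, the reply dict shape, and the prefix value here are hypothetical stand-ins:

```python
ERROR_MSG_PREFIX = "Error:"  # illustrative prefix, not NVFlare's constant

def find_error_clients(client_names, replies_by_name):
    """Return clients whose reply body is an explicit error message."""
    errors = []
    for name in client_names:
        reply = replies_by_name.get(name)  # keyed by name: order-independent
        if reply is None:
            continue  # timed out: handled separately by the strict path
        body = reply.get("body")
        # isinstance guard + startswith(), matching the commit's changes
        if isinstance(body, str) and body.startswith(ERROR_MSG_PREFIX):
            errors.append(name)
    return errors
```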
… docs

- admin.py: align strict-mode ERROR_MSG_PREFIX check to use startswith()
- fed_server.py: add _job_reported_clients_lock for concurrent heartbeat safety
- job_runner.py: add comment on timed-out clients and require_previous_report
- admin_test.py: fix match pattern for list repr; add startswith semantics tests
- fed_server_test.py: remove unused _make_server() helper
- job_runner_deploy_test.py: remove no-op patch.object block
- docs/timeouts.rst: clarify when to enable strict_start_job_reply_check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _start_run() now validates required_sites on timeout, consistent with _deploy_job(): if a timed-out client is in job.required_sites the job aborts even when active_count >= min_sites.
- Remove redundant early metadata assignment (line 265); JOB_CLIENTS is now set once after timeout exclusion so it always reflects actual active participants.
- Add two unit tests: required-site timeout aborts; non-required-site timeout proceeds and metadata is correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restore JOB_CLIENTS metadata before start_client_job so client startup headers include participating clients. Add a unit test that asserts JOB_CLIENTS is present when start_client_job is invoked.
Force-pushed 9f18728 to 1135840

/build
Keep JOB_CLIENTS metadata aligned with actual client start replies when strict_start_reply_check is disabled by deriving active participants from non-empty replies. Add a unit test that reproduces non-strict timeout behavior and verifies timed-out clients are excluded from JOB_CLIENTS.
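Deriving active participants from non-empty replies, as described above, reduces to a filter. The function name and reply mapping are illustrative, not `job_runner.py`'s actual code:

```python
def rebuild_job_clients(requested_clients, replies_by_name):
    """Keep only clients that actually produced a (non-empty) start reply.

    Sketch of the non-strict metadata rebuild described above; the surviving
    list is what gets written back as JOB_CLIENTS metadata, so timed-out
    clients never appear as participants.
    """
    return [c for c in requested_clients if replies_by_name.get(c) is not None]
```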
…ctive client exclusion, and dead-detection debounce (NVIDIA#4209)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is:

1. F3 streaming HOL stall (PR NVIDIA#4206) delays deployment ACKs from relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job

PRs NVIDIA#4206, NVIDIA#4204, NVIDIA#4174, NVIDIA#4172, NVIDIA#4186, NVIDIA#4211, NVIDIA#4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort.

### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise.
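The new contract for `check_client_replies()` can be sketched as below; the reply shape, parameter order, and error message are simplified assumptions, only the timeouts-returned / errors-raised split comes from the PR text:

```python
from typing import List

def check_client_replies(replies, client_names, command, strict) -> List[str]:
    """Return names of timed-out clients; raise on explicit errors.

    `replies` maps client name -> reply dict, or None on timeout.
    """
    timed_out = []
    for name in client_names:
        reply = replies.get(name)
        if reply is None:
            if strict:
                timed_out.append(name)  # collected, not raised; caller decides
            continue
        if reply.get("rc", "OK") != "OK":
            # Explicit errors still raise, exactly as before.
            raise RuntimeError(f"client {name} failed '{command}': {reply['rc']}")
    return timed_out
```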
Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none).

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance.

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`.

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type.
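The min_sites / required_sites decision in fix 3 reduces to a small pure function; `decide_on_timeouts` is a hypothetical name for illustration, not the actual `_start_run()` code:

```python
def decide_on_timeouts(all_clients, timed_out, min_sites, required_sites=()):
    """Return the surviving client list, or None when the job must abort.

    Mirrors the tolerance logic described above: a required site that timed
    out aborts the job; otherwise proceed if enough clients remain.
    """
    if any(c in required_sites for c in timed_out):
        return None  # a required site is missing -> abort
    active = [c for c in all_clients if c not in timed_out]
    if len(active) < min_sites:
        return None  # below tolerance -> abort
    return active  # warn and proceed with stragglers excluded
```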
---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR NVIDIA#4206) | `true` after staging | Disconnects stalled relay connections |

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK |
| `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup |

All 29 targeted unit tests pass.
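For reference, the three server-side settings in the table might be grouped in a config fragment like the one below. The key names come from the table, but the exact file and nesting are assumptions rather than something this PR specifies:

```json
{
  "runner_sync_timeout": 120,
  "strict_start_job_reply_check": true,
  "sync_client_jobs_require_previous_report": true
}
```

The `min_clients` tolerance is passed to the FedAvg controller constructor instead (for example, `FedAvg(num_clients=144, min_clients=140)`; the 140 is illustrative of the 96-98% guidance above).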
## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ent, hierarchical startup stability [skip ci] (#4218)

## Merge Dependency

> ⚠️ **Depends on #4209** — The *Hierarchical FL Startup Stability* section documents changes introduced by PR #4209 (currently open). **Merge PR #4209 into `2.7` before merging this PR.** All other sections cover already-merged PRs.

---

## Summary

This PR updates `docs/release_notes/flare_272.rst` to reflect all major changes merged into the 2.7.x line after the initial 2.7.2 draft, covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor Copy at CJ (PR #4210) and client-side memory lifecycle management (PR #4211)
- **F3 Streaming Reliability and Performance** — new section covering HOL stall mitigation (PR #4206), stream pool starvation fix (PR #4171/#4172), streaming download retry (PR #4167), RxTask self-deadlock fix (PR #4204), and lock contention reduction (PR #4174)
- **Hierarchical FL Startup Stability** — new section covering deployment timeout classification, startup grace period, selective client exclusion, and metadata hardening (PR #4209 — pending merge), with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly. A source-level RST comment has been added above the Hierarchical FL section in the file to alert future maintainers to the merge dependency.
## Merged PRs Documented

| PR | Area | Status |
|---|---|---|
| #4171 / #4172 | Stream pool starvation fix | Merged |
| #4174 | Lock contention reduction | Merged |
| #4167 | Streaming download retry | Merged |
| #4204 | RxTask self-deadlock fix | Merged |
| #4206 | HOL stall mitigation | Merged |
| #4210 | Zero tensor copy at CJ | Merged |
| #4211 | Client-side memory management | Merged |
| #4209 | Hierarchical FL startup stability | **Open — merge before this PR** |

## Changes

### Memory Management (restructured)

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds `LazyDownloadRef` placeholders instead of materializing full tensors, eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` / jemalloc purge / `torch.cuda.empty_cache()` injected after every `flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.

### F3 Streaming Reliability and Performance (new section)

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress watchdog, and stall detection/recovery. Includes recommended environment variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a dedicated `callback_thread_pool`, keeping stream workers free for concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3 attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock` released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside `self.lock`; compare-and-store for cache write. Reduces model-download latency under high client concurrency.
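The Streaming Download Retry behavior can be sketched as below. `download_with_retry` and its parameters are illustrative; only the 3-attempt, 60 s-cap, abort-aware behavior comes from the text:

```python
import time

def download_with_retry(do_download, aborted, max_attempts=3,
                        initial_delay=5.0, cap=60.0):
    """Retry a download on TIMEOUT with capped exponential backoff.

    `do_download` returns (result, rc); `aborted` is a zero-arg callable
    checked before each attempt so an aborted job stops retrying.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        if aborted():
            return None  # honor the abort signal between attempts
        result, rc = do_download()
        if rc != "TIMEOUT":
            return result  # success, or a non-retryable failure
        if attempt < max_attempts:
            time.sleep(min(delay, cap))  # back off, never more than the cap
            delay *= 2
    return None  # all attempts timed out
```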
### Hierarchical FL Startup Stability (new section — pending PR #4209)

- **Deployment Timeout as Failure**: `reply=None` correctly counted against `min_sites`; timed-out clients excluded before `start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — client must be observed once before absence triggers dead-job notification. Default changed to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json` snippets for HPC (Frontier/ORNL) scale.

## Test plan

- [ ] Sphinx build (`make html`) passes without RST warnings on the updated file
- [ ] All new cross-references (`.. code-block::`, `.. note::`) render correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent throughout the file
- [ ] Confirm PR #4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…meouts, selective client exclusion, and dead-detection debounce (#4209) (#4288)

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>