
[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce #4209

Merged
chesterxgchen merged 12 commits into NVIDIA:2.7 from chesterxgchen:2.7.3_hierarchical_startup_fix
Feb 24, 2026
Conversation


chesterxgchen (Collaborator) commented Feb 20, 2026

Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in
Round 0 due to a cascading startup failure chain. The root sequence is:

  1. F3 streaming HOL stall (PR #4206, "[2.7] Mitigate F3 streaming Head-of-Line (HOL) stalls and add guardrails") delays deployment ACKs from relay-connected clients
  2. _deploy_job() treats reply=None (timeout) as "unknown" — not a failure — so
    timed-out clients silently appear to have been deployed
  3. _start_run() tries to start those clients; they again time out, and
    check_client_replies() ignores the None reply
  4. _sync_client_jobs() fires dead-job notification on the very first heartbeat with
    no startup grace period
  5. FedAvg requires 144/144 — one or two missing clients → abort
  6. A late-starting CJ crashes with TypeError: 'NoneType' object is not iterable when
    get_job_clients() receives None metadata from an already-aborted job

PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the
transport layer. This PR addresses the remaining job lifecycle layer.


Fixes Included

1 — _deploy_job(): Treat deployment timeout as failure (job_runner.py)

Root bug: reply=None was logged as "unknown" and excluded from failed_clients,
so timed-out clients counted as "successfully deployed" for the min_sites check.

Fix: Add timed-out clients to failed_clients with a "deployment timeout" label.
The existing min_sites / required_sites logic then correctly decides whether to abort.
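The fixed classification can be sketched as follows. This is a hypothetical, simplified shape — `classify_deploy_replies`, the dict-of-replies input, and the `"OK"` return code are illustrative assumptions, not NVFlare's actual `_deploy_job()` signature:

```python
# Hedged sketch of fix 1: reply=None (timeout) now counts as an explicit
# failure instead of an unlogged "unknown" that inflated the deployed count.
def classify_deploy_replies(replies: dict) -> tuple[list, list]:
    """Split {client_name: reply-dict-or-None} into (deployed_ok, failed_clients)."""
    deployed_ok = []
    failed_clients = []
    for name, reply in replies.items():
        if reply is None:
            # The fix: a timed-out deployment counts against min_sites.
            failed_clients.append((name, "deployment timeout"))
        elif reply.get("rc") != "OK":
            failed_clients.append((name, reply.get("rc", "error")))
        else:
            deployed_ok.append(name)
    return deployed_ok, failed_clients
```

With this split in place, the existing min_sites / required_sites logic can operate on `failed_clients` unchanged.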

2 — check_client_replies(): Return timed-out clients instead of raising (admin.py)

Root bug: In strict mode, any timeout raised immediately, aborting the whole job even
when the remaining active clients satisfied min_sites.

Fix: In strict mode, collect timed-out clients into a return list rather than raising.
Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict
mode to use name-keyed dict lookup instead of fragile positional zip().

New signature: check_client_replies(...) -> List[str] (timed-out client names; empty = none).
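A minimal sketch of the new contract — the reply shape, the `"Error:"` body-prefix convention, and the parameter names are assumptions for illustration, not the real admin.py API:

```python
# Hedged sketch of fix 2: strict mode collects timeouts into the return
# value; explicit errors still raise; lookup is name-keyed, not positional.
def check_client_replies(replies: dict, client_names: list, strict: bool) -> list:
    """Return the names of timed-out clients (empty list = none timed out)."""
    timed_out = []
    for name in client_names:
        reply = replies.get(name)          # dict lookup replaces fragile zip()
        if reply is None:
            if strict:
                timed_out.append(name)     # collected, not raised
            continue
        if reply.get("rc") != "OK" or str(reply.get("body", "")).startswith("Error:"):
            raise RuntimeError(f"client {name} returned an error: {reply}")
    return timed_out
```

The caller then decides what a non-empty return means, which is what enables the selective exclusion in fix 3.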

3 — _start_run(): Selective exclusion with min_sites re-evaluation (job_runner.py)

Root bug: A start-job timeout under strict mode aborted the entire job with no
tolerance for stragglers within min_sites bounds.

Fix: Use the returned timed-out list from check_client_replies(). If remaining
active clients >= min_sites, log a warning and proceed. Only abort when below tolerance.
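The tolerance decision can be sketched like this (the function name and argument shapes are hypothetical; the real logic lives inline in `_start_run()`):

```python
# Hedged sketch of fix 3: decide whether a start-job timeout is tolerable.
def evaluate_after_timeouts(client_names, timed_out, min_sites, required_sites=()):
    """Return the remaining active clients, or raise when the job must abort."""
    for name in timed_out:
        if name in required_sites:
            # consistent with the deployment phase: a required site may not drop
            raise RuntimeError(f"required site {name} timed out at start-job")
    active = [c for c in client_names if c not in timed_out]
    if min_sites and len(active) < min_sites:
        raise RuntimeError(
            f"{len(timed_out)} client(s) timed out; {len(active)} < min_sites {min_sites}"
        )
    return active   # within tolerance: log a warning and proceed with these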

4 — _sync_client_jobs(): Require-prior-report default changed to True (fed_server.py)

Root bug: SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT defaulted to False, meaning
the bug fix was opt-in and the unsafe behaviour remained the default.

Fix: Default changed to True. Operators who want the aggressive legacy detection
can set it to False explicitly.

5 — _sync_client_jobs(): Move _reported_clients out of job_info dict (fed_server.py)

Root bug: Positive-observation tracking was stored as job_info["_reported_clients"],
injecting algorithm state into a data dict with no corresponding RunProcessKey constant.

Fix: Tracking moved to self._job_reported_clients: Dict[str, set] on FederatedServer.
Stale entries are purged whenever a job is no longer in run_processes.
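Fixes 4 and 5 together amount to a debounced dead-job check with instance-level tracking. This toy sketch assumes a simplified heartbeat interface; the class and method names are illustrative, not FederatedServer's real ones:

```python
# Hedged sketch of fixes 4 + 5: a dead-job notification fires only after a
# prior positive report (default debounce), and tracking lives on the
# instance (like self._job_reported_clients), not inside job_info.
class DeadJobDetector:
    def __init__(self, require_previous_report: bool = True):  # new default: True
        self.require_previous_report = require_previous_report
        self._job_reported_clients: dict[str, set] = {}

    def on_heartbeat(self, client: str, job_id: str, job_running_on_client: bool) -> bool:
        """Return True when a dead-job notification should fire for this client."""
        reported = self._job_reported_clients.setdefault(job_id, set())
        if job_running_on_client:
            reported.add(client)           # positive observation
            return False
        if self.require_previous_report and client not in reported:
            return False                   # still starting up: skip notification
        return True                        # legacy (False) path fires immediately

    def purge(self, run_processes: set) -> None:
        """Drop tracking for jobs no longer in run_processes."""
        for job_id in list(self._job_reported_clients):
            if job_id not in run_processes:
                del self._job_reported_clients[job_id]
```

A client that has never reported the job gets an implicit startup grace period; once it has reported at least once, a missing job on a later heartbeat is treated as dead.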

6 — ClientRunManager.get_job_clients(): Explicit meta validation (client_run_manager.py)

Raises RuntimeError with a descriptive message instead of an opaque TypeError when
JOB_CLIENTS is absent or the wrong type.
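A sketch of the hardened read — the literal `"JOB_CLIENTS"` string and the standalone-function shape are simplifications of the real `ClientRunManager` method:

```python
# Hedged sketch of fix 6: validate job metadata explicitly instead of letting
# None propagate into an opaque "'NoneType' object is not iterable" TypeError.
def get_job_clients(meta) -> list:
    if not isinstance(meta, dict):
        raise RuntimeError(f"job meta must be a dict, got {type(meta).__name__}")
    job_clients = meta.get("JOB_CLIENTS")
    if not isinstance(job_clients, list):
        raise RuntimeError(
            f"JOB_CLIENTS missing or invalid in job meta: "
            f"expected list, got {type(job_clients).__name__}"
        )
    return job_clients
```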


Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| FedAvg(min_clients=...) | 96-98% of num_clients | Tolerates a few startup stragglers |
| runner_sync_timeout | 120 s | Allows Lustre-backed deployments time to complete |
| strict_start_job_reply_check | true | Start-job timeouts surfaced, straggler clients excluded |
| sync_client_jobs_require_previous_report | true (now the default) | Prevents premature dead-job from startup delay |
| SFM_CLOSE_STALLED_CONNECTION (PR #4206) | true after staging | Disconnects stalled relay connections |

Files Changed

  • nvflare/private/fed/server/job_runner.py — _deploy_job() timeout as failure; _start_run() selective exclusion
  • nvflare/private/fed/server/admin.py — check_client_replies() returns timed-out list; dict-keyed non-strict path
  • nvflare/private/fed/server/fed_server.py — _sync_client_jobs() default True; _job_reported_clients attr; stale cleanup
  • nvflare/private/fed/client/client_run_manager.py — explicit meta validation in get_job_clients()

Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| admin_test.py | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK |
| job_runner_test.py | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| job_runner_deploy_test.py | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| fed_server_test.py | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup |

All 29 targeted unit tests pass.

Test Plan

  • Unit tests for each changed function (positive + negative)
  • New job_runner_deploy_test.py covering deployment timeout classification end-to-end
  • All 29 targeted unit tests pass
  • Hierarchical staging run with all flags at default
  • Hierarchical staging run with strict_start_job_reply_check=true and reduced min_clients
  • Verify no regression on standard (non-hierarchical) FL jobs

greptile-apps bot commented Feb 20, 2026

Greptile Summary

This PR addresses cascading startup failures in large-scale hierarchical FL by fixing timeout handling and client exclusion logic across the job lifecycle layer.

Key Changes:

  • _deploy_job() now treats deployment timeouts (reply=None) as failures rather than unknown status, enabling proper min_sites/required_sites validation
  • check_client_replies() refactored to return timed-out client list in strict mode instead of raising, allowing selective exclusion within tolerance bounds; also switched from fragile zip() to dict-based lookup
  • _start_run() implements selective client exclusion with required_sites and min_sites re-evaluation after timeouts, and correctly handles metadata in both strict and non-strict modes
  • _sync_client_jobs() default changed to require prior positive heartbeat before firing dead-job notifications (prevents false positives during startup), and tracking moved from job_info dict to clean instance variable with stale entry cleanup
  • get_job_clients() now validates metadata explicitly to prevent opaque TypeError crashes

All previous review concerns addressed:

  • Non-strict mode now correctly excludes timed-out clients from JOB_CLIENTS metadata (lines 307-320 in job_runner.py)
  • required_sites validation added to start phase, consistent with deployment phase
  • Double metadata assignment is intentional: set before start_client_job for serialization, then updated after timeout resolution

Test Coverage: 29 new/updated unit tests with comprehensive positive and negative cases validate all fixes.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk - all fixes are well-architected, thoroughly tested, and address real production failures
  • Score reflects comprehensive fix of critical production issue with excellent test coverage (29 tests), all previous review concerns properly addressed, clean refactoring that improves maintainability, and no breaking changes to existing functionality
  • No files require special attention - all changes are well-tested and properly validated

Important Files Changed

| Filename | Overview |
|---|---|
| nvflare/private/fed/server/job_runner.py | Deployment timeout classification, selective client exclusion with min_sites/required_sites validation, and proper JOB_CLIENTS metadata management in both strict and non-strict modes |
| nvflare/private/fed/server/admin.py | Refactored check_client_replies to return timed-out clients list instead of raising, with dict-based lookup and proper type checking |
| nvflare/private/fed/server/fed_server.py | Changed default for require_previous_report to True, moved _job_reported_clients tracking to instance variable with proper cleanup, preventing premature dead-job detection |
| nvflare/private/fed/client/client_run_manager.py | Added explicit validation for JOB_CLIENTS metadata to prevent TypeError when metadata is None or wrong type |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Job Deployment Start] --> Deploy[_deploy_job: Send to clients]
    Deploy --> CheckDeploy{Check replies}
    CheckDeploy -->|reply=None| Timeout1[Add to failed_clients with 'deployment timeout']
    CheckDeploy -->|reply.rc!=OK| Fail1[Add to failed_clients with error detail]
    CheckDeploy -->|reply.rc=OK| Success1[Mark as deployed OK]
    
    Timeout1 --> Validate1{Validate min_sites/<br/>required_sites}
    Fail1 --> Validate1
    Success1 --> Validate1
    
    Validate1 -->|Failed client in<br/>required_sites| Abort1[Abort job]
    Validate1 -->|Active < min_sites| Abort1
    Validate1 -->|Within tolerance| StartPhase[_start_run: Start job]
    
    StartPhase --> SetMeta1[Set JOB_CLIENTS metadata<br/>before start_client_job]
    SetMeta1 --> StartClients[Call start_client_job]
    StartClients --> CheckStart{check_client_replies<br/>strict mode?}
    
    CheckStart -->|strict=True| StrictPath[Returns timed_out list]
    CheckStart -->|strict=False| NonStrictPath[Returns empty list]
    
    StrictPath --> HasTimeout{timed_out<br/>non-empty?}
    HasTimeout -->|Yes| CheckRequired2{Timed-out client<br/>in required_sites?}
    CheckRequired2 -->|Yes| Abort2[Abort job]
    CheckRequired2 -->|No| CheckMin2{Active >= min_sites?}
    CheckMin2 -->|No| Abort2
    CheckMin2 -->|Yes| Warn[Log warning,<br/>exclude timed-out clients]
    HasTimeout -->|No| UpdateMeta
    
    NonStrictPath --> RebuildFromReplies[Rebuild active_client_sites<br/>from actual replies]
    RebuildFromReplies --> UpdateMeta[Update JOB_CLIENTS metadata<br/>with active clients only]
    Warn --> UpdateMeta
    
    UpdateMeta --> Running[Job Running]
    Running --> Heartbeat[Client heartbeat]
    Heartbeat --> SyncJobs{_sync_client_jobs}
    
    SyncJobs --> CheckPrior{require_previous_report<br/>default=True}
    CheckPrior -->|Job on server<br/>but not client| HasReported{Client reported<br/>this job before?}
    HasReported -->|Yes| DeadJob[Fire dead-job notification]
    HasReported -->|No| SkipNotify[Skip notification<br/>still starting up]
    
    CheckPrior -->|Job on client<br/>but not server| AbortClient[Tell client to abort job]

Last reviewed commit: 71d200e

greptile-apps bot left a review: 11 files reviewed, 1 comment

chesterxgchen changed the title from "[2.7.3] Stabilize hierarchical startup checks for BERT-144" to "[2.7] Stabilize hierarchical startup checks for BERT-144" Feb 22, 2026
chesterxgchen changed the title from "[2.7] Stabilize hierarchical startup checks for BERT-144" to "[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce" Feb 22, 2026
@chesterxgchen chesterxgchen force-pushed the 2.7.3_hierarchical_startup_fix branch from 1e8b489 to 47bb81a Compare February 22, 2026 01:56
chesterxgchen: /build

greptile-apps bot left a review: 12 files reviewed, 1 comment

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 22, 2026

Additional Comments (1)

nvflare/private/fed/server/job_runner.py
job.meta[JobMetaKey.JOB_CLIENTS] is populated with all clients from client_sites (line 264-265), but after timeout exclusion (line 295), some clients may be removed from the active set. This creates a mismatch: timed-out clients are in JOB_CLIENTS metadata but won't actually start. Downstream code expecting JOB_CLIENTS to reflect active participants may break.

```python
# After timeout exclusion, rebuild job.meta[JobMetaKey.JOB_CLIENTS] with only active clients
if timed_out:
    active_count = len(client_sites_names) - len(timed_out)
    if job.min_sites and active_count < job.min_sites:
        raise RuntimeError(
            f"start job ({job_id}): {len(timed_out)} client(s) timed out and remaining "
            f"{active_count} < min_sites {job.min_sites}: {timed_out}"
        )
    self.log_warning(
        fl_ctx,
        f"start job ({job_id}): {len(timed_out)} client(s) timed out at start-job: {timed_out}; "
        f"{active_count} of {len(client_sites_names)} clients started successfully.",
    )
    client_sites_names = [c for c in client_sites_names if c not in timed_out]
    # Update JOB_CLIENTS to reflect only active participants
    active_clients = [c.to_dict() for token, c in job_clients.items() if c.name in client_sites_names]
    job.meta[JobMetaKey.JOB_CLIENTS] = active_clients
```

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 22, 2026
…ent, hierarchical startup stability

Add three major new sections to flare_272.rst covering work merged after the
initial 2.7.2 draft:

Memory Management (restructured):
- Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210)
- Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache
  injected after flare.send() without training script changes (PR NVIDIA#4211)
- Retain existing TensorDownloader and server-side cleanup content

F3 Streaming Reliability and Performance (new section):
- HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall
  detection/recovery with recommended env-var settings (PR NVIDIA#4206)
- Stream pool starvation fix: blob callbacks dispatched to dedicated thread
  pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172)
- Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167)
- RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204)
- Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174)

Hierarchical FL Startup Stability (new section):
- Deployment timeout correctly classified as failure; min_sites check applied
  at deployment phase (PR NVIDIA#4209)
- Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209)
- Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209)
- Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209)
- Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale)

Bug Fixes section updated with all streaming and hierarchical startup fixes.
Intro paragraph updated to reflect system hardening scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen (Collaborator, Author) commented Feb 22, 2026

Addressed the metadata consistency feedback in commit e0af9ac.

Changes made:

  • JobRunner._start_run() now rebuilds job.meta[JobMetaKey.JOB_CLIENTS] after timeout exclusion so metadata reflects active participants only.
  • Added unit tests to cover:
    • metadata update when clients time out (test_start_run_updates_job_clients_meta_after_timeout_exclusion)
    • metadata unchanged when no timeouts (test_start_run_keeps_job_clients_meta_when_no_timeouts)
    • deploy->start filtering path now actually executes _start_run() and verifies only deployable clients are started.

Validation:

  • pytest -q tests/unit_test/private/fed/server/job_runner_test.py
  • pytest -q tests/unit_test/private/fed/server/job_runner_deploy_test.py

greptile-apps bot left a review: 12 files reviewed, no comments

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 22, 2026
…ease notes

Add a visible source comment above the Hierarchical FL Startup Stability
section noting that its content depends on PR NVIDIA#4209 and should not be
merged before that PR lands on 2.7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen (Collaborator, Author) commented Feb 22, 2026

Follow-up added in commit df7c06b:

  • Added an integration-style _start_run test that uses the real check_client_replies timeout path (no mock for reply-check behavior) and verifies JOB_CLIENTS is reduced to active clients after timeout exclusion.

Validation on updated test scope:

  • flake8 tests/unit_test/private/fed/server/job_runner_test.py
  • pytest -q tests/unit_test/private/fed/server/job_runner_test.py tests/unit_test/private/fed/server/job_runner_deploy_test.py (16 passed)

greptile-apps bot left a review: 12 files reviewed, no comments

chesterxgchen: /build

chesterxgchen: /build

greptile-apps bot left a review: 11 files reviewed, no comments

chesterxgchen: /build

greptile-apps bot left a review: 13 files reviewed, no comments

chesterxgchen and others added 10 commits February 24, 2026 14:47
…ctive exclusion, dead-detection debounce

- _deploy_job(): treat reply=None (timeout) as deployment failure so timed-out
  clients are correctly evaluated against min_sites / required_sites, rather than
  silently counted as successfully deployed
- check_client_replies(): strict mode now returns List[str] of timed-out clients
  instead of raising; explicit errors still raise; non-strict path uses dict-keyed
  lookup instead of fragile positional zip()
- _start_run(): use returned timed-out list to selectively exclude stragglers;
  re-evaluate active count against job.min_sites before aborting
- _sync_client_jobs(): change SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT default
  False->True so safe debounced behaviour is active without explicit config;
  move _reported_clients tracking to self._job_reported_clients on FederatedServer
  (out of job_info dict); purge stale entries when jobs leave run_processes
- Add license header to tests/unit_test/private/fed/client/__init__.py
- Tests: rewrite admin_test.py; extend job_runner_test.py and fed_server_test.py;
  add new job_runner_deploy_test.py (29 tests, all positive + negative cases)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update _start_run to rebuild JOB_CLIENTS from active participants when timed-out clients are excluded, and add unit coverage to verify metadata consistency and deploy-to-start filtering behavior.
Add a start-run integration-style unit test that exercises the real timeout reply-check path and verifies JOB_CLIENTS metadata is reduced to active clients after timeout exclusion.
Drop the temporary hierarchical FL BERT-144 analysis markdown from the PR scope.
Add user-facing documentation for strict_start_job_reply_check and sync_client_jobs_require_previous_report, including defaults, usage guidance, and cross-reference from troubleshooting to the full timeout reference.
- admin.py: use missing_clients list directly in error message (no manual join)
- admin.py: replace fragile zip() positional matching in non-strict path with
  dict-keyed lookup, consistent with strict mode
- admin.py: add isinstance(r.reply.body, str) guard and use startswith() instead
  of `in` for ERROR_MSG_PREFIX check in non-strict path
- fed_server.py: remove redundant `or []` from JOB_IDS header read; the
  isinstance check on the next line already handles None/invalid types

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… docs

- admin.py: align strict-mode ERROR_MSG_PREFIX check to use startswith()
- fed_server.py: add _job_reported_clients_lock for concurrent heartbeat safety
- job_runner.py: add comment on timed-out clients and require_previous_report
- admin_test.py: fix match pattern for list repr; add startswith semantics tests
- fed_server_test.py: remove unused _make_server() helper
- job_runner_deploy_test.py: remove no-op patch.object block
- docs/timeouts.rst: clarify when to enable strict_start_job_reply_check

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _start_run() now validates required_sites on timeout, consistent with
  _deploy_job(): if a timed-out client is in job.required_sites the job
  aborts even when active_count >= min_sites.
- Remove redundant early metadata assignment (line 265); JOB_CLIENTS is
  now set once after timeout exclusion so it always reflects actual
  active participants.
- Add two unit tests: required-site timeout aborts; non-required-site
  timeout proceeds and metadata is correct.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Restore JOB_CLIENTS metadata before start_client_job so client startup headers include participating clients. Add a unit test that asserts JOB_CLIENTS is present when start_client_job is invoked.
@chesterxgchen chesterxgchen force-pushed the 2.7.3_hierarchical_startup_fix branch from 9f18728 to 1135840 Compare February 24, 2026 22:47
chesterxgchen: /build

pcnudde
pcnudde previously approved these changes Feb 24, 2026
chesterxgchen: /build

greptile-apps bot left a review: 13 files reviewed, 1 comment

Keep JOB_CLIENTS metadata aligned with actual client start replies when strict_start_reply_check is disabled by deriving active participants from non-empty replies. Add a unit test that reproduces non-strict timeout behavior and verifies timed-out clients are excluded from JOB_CLIENTS.
greptile-apps bot left a review: 13 files reviewed, no comments

chesterxgchen: /build

@chesterxgchen chesterxgchen merged commit 572990d into NVIDIA:2.7 Feb 24, 2026
18 of 19 checks passed
@chesterxgchen chesterxgchen deleted the 2.7.3_hierarchical_startup_fix branch February 24, 2026 23:27
chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ctive client exclusion, and dead-detection debounce (NVIDIA#4209)

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ent, hierarchical startup stability

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ease notes

Add a visible source comment above the Hierarchical FL Startup Stability
section noting that its content depends on PR NVIDIA#4209 and should not be
merged before that PR lands on 2.7.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen added a commit that referenced this pull request Feb 25, 2026
…ent, hierarchical startup stability [skip ci] (#4218)

## Merge Dependency

> ⚠️ **Depends on #4209** — The *Hierarchical FL Startup Stability*
section documents changes introduced by PR #4209 (currently open).
**Merge PR #4209 into `2.7` before merging this PR.** All other sections
cover already-merged PRs.

---

## Summary

This PR updates `docs/release_notes/flare_272.rst` to reflect all major
changes merged into the 2.7.x line after the initial 2.7.2 draft,
covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor
Copy at CJ (PR #4210) and client-side memory lifecycle management (PR
#4211)
- **F3 Streaming Reliability and Performance** — new section covering
HOL stall mitigation (PR #4206), stream pool starvation fix (PR
#4171/#4172), streaming download retry (PR #4167), RxTask self-deadlock
fix (PR #4204), and lock contention reduction (PR #4174)
- **Hierarchical FL Startup Stability** — new section covering
deployment timeout classification, startup grace period, selective
client exclusion, and metadata hardening (PR #4209 — pending merge),
with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly.

A source-level RST comment has been added above the Hierarchical FL
section in the file to alert future maintainers to the merge dependency.

## Merged PRs Documented

| PR | Area | Status |
|---|---|---|
| #4171 / #4172 | Stream pool starvation fix | Merged |
| #4174 | Lock contention reduction | Merged |
| #4167 | Streaming download retry | Merged |
| #4204 | RxTask self-deadlock fix | Merged |
| #4206 | HOL stall mitigation | Merged |
| #4210 | Zero tensor copy at CJ | Merged |
| #4211 | Client-side memory management | Merged |
| #4209 | Hierarchical FL startup stability | **Open — merge before this PR** |

## Changes

### Memory Management (restructured)

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds
`LazyDownloadRef` placeholders instead of materializing full tensors,
eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` /
jemalloc purge / `torch.cuda.empty_cache()` injected after every
`flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.
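
The post-send memory release described above can be sketched as a standalone function. This is an illustrative sketch, not the NVFlare implementation: the injection point (after `flare.send()`) and the `client_memory_gc_rounds` knob come from the PR text, but the function name and structure here are assumptions.

```python
# Hedged sketch of the client-side memory release the release notes describe.
# Illustrative only: release_memory_after_send() is not an NVFlare API.
import ctypes
import ctypes.util
import gc


def release_memory_after_send() -> None:
    gc.collect()  # drop Python-level references first
    libc_name = ctypes.util.find_library("c")
    if libc_name:
        libc = ctypes.CDLL(libc_name)
        if hasattr(libc, "malloc_trim"):
            libc.malloc_trim(0)  # return freed heap pages to the OS (glibc only)
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached CUDA allocator blocks
    except ImportError:
        pass  # CUDA cleanup is a no-op without torch
```

Each step degrades gracefully: `malloc_trim` is skipped on non-glibc platforms and the CUDA step is skipped when `torch` is absent.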

### F3 Streaming Reliability and Performance (new section)

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress
watchdog, and stall detection/recovery. Includes recommended environment
variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a
dedicated `callback_thread_pool`, keeping stream workers free for
concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3
attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock`
released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside
`self.lock`; compare-and-store for cache write. Reduces model-download
latency under high client concurrency.
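
The streaming download retry described above (up to 3 attempts, delay capped at 60 s, abort-signal aware) can be sketched as follows. Function name, signature, and the initial delay are illustrative assumptions, not the F3 streaming implementation.

```python
# Hedged sketch of exponential-backoff retry on download timeout.
# download_with_retry() and its parameters are illustrative, not NVFlare APIs.
import time


def download_with_retry(download_fn, abort_signal=None, max_attempts=3,
                        initial_delay=5.0, cap_s=60.0):
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        if abort_signal is not None and abort_signal.is_set():
            raise RuntimeError("download aborted")  # abort-signal aware
        try:
            return download_fn()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout
            time.sleep(min(delay, cap_s))  # backoff capped at cap_s
            delay *= 2  # exponential growth between attempts
```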

### Hierarchical FL Startup Stability (new section — pending PR #4209)

- **Deployment Timeout as Failure**: `reply=None` correctly counted
against `min_sites`; timed-out clients excluded before
`start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — client
must be observed once before absence triggers dead-job notification.
Default changed to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded
rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced
with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json`
snippets for HPC (Frontier/ORNL) scale.

## Test plan

- [ ] Sphinx build (`make html`) passes without RST warnings on the
updated file
- [ ] All new cross-references (`.. code-block::`, `.. note::`) render
correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent
throughout the file
- [ ] Confirm PR #4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen added a commit that referenced this pull request Mar 12, 2026
…meouts, selective client exclusion, and dead-detection debounce (#4209) (#4288)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is:

1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job

PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort.
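
The classification logic can be sketched as below. This is a hedged sketch: `classify_deploy_replies()` and the reply dict shape are illustrative, not the actual `job_runner.py` code.

```python
# Hedged sketch of timeout-as-failure classification during deployment.
# classify_deploy_replies() is an illustrative name, not an NVFlare function.
from typing import Dict, List, Optional, Tuple


def classify_deploy_replies(
    replies: Dict[str, Optional[dict]], min_sites: int
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split replies into deployed clients and failed (client, reason) pairs."""
    deployed: List[str] = []
    failed: List[Tuple[str, str]] = []
    for client, reply in replies.items():
        if reply is None:
            # Previously logged as "unknown" and counted as deployed;
            # now a timeout is an explicit failure.
            failed.append((client, "deployment timeout"))
        elif reply.get("status") == "OK":
            deployed.append(client)
        else:
            failed.append((client, reply.get("status", "error")))
    if len(deployed) < min_sites:
        raise RuntimeError(
            f"only {len(deployed)} clients deployed, need {min_sites}: {failed}"
        )
    return deployed, failed
```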

### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`.

New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none).
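
The new contract can be sketched as follows; the reply shape and parameters are illustrative assumptions, not the actual `admin.py` signature beyond the documented `-> List[str]` return.

```python
# Hedged sketch: timeouts collected and returned, explicit errors still raise.
from typing import Dict, List, Optional


def check_client_replies(
    replies: Dict[str, Optional[dict]], command: str, strict: bool = True
) -> List[str]:
    """Return the names of clients that timed out (empty list = none)."""
    timed_out: List[str] = []
    for client, reply in replies.items():  # name-keyed lookup, not positional zip()
        if reply is None:
            timed_out.append(client)  # collected, not raised, even in strict mode
        elif strict and reply.get("status") != "OK":
            # An explicit error reply remains a hard failure.
            raise RuntimeError(
                f"client {client} failed '{command}': {reply.get('status')}"
            )
    return timed_out
```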

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance.
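
The re-evaluation step can be sketched as a small helper; the name `start_with_tolerance()` is illustrative, not the `_start_run()` code itself.

```python
# Hedged sketch: exclude start-job stragglers if enough clients remain.
from typing import List


def start_with_tolerance(
    all_clients: List[str], timed_out: List[str], min_sites: int
) -> List[str]:
    """Return the clients the run proceeds with, or abort if below min_sites."""
    remaining = [c for c in all_clients if c not in set(timed_out)]
    if len(remaining) >= min_sites:
        # warn-and-proceed path: stragglers are excluded from the run
        return remaining
    raise RuntimeError(
        f"only {len(remaining)} clients started, below min_sites={min_sites}"
    )
```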

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`.
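
The debounce behaviour of fixes 4 and 5 together can be sketched as below. The attribute name `_job_reported_clients` and the `True` default mirror the PR description; the class and method names are illustrative, not the `fed_server.py` source.

```python
# Hedged sketch of debounced dead-client detection: a client must be observed
# in at least one heartbeat before its absence can trigger a dead-job signal.
from typing import Dict, Set


class DeadClientTracker:
    def __init__(self, require_previous_report: bool = True):  # new default: True
        self.require_previous_report = require_previous_report
        self._job_reported_clients: Dict[str, Set[str]] = {}  # server attr, not job_info

    def on_heartbeat(self, job_id: str, reported: Set[str],
                     expected: Set[str]) -> Set[str]:
        """Return the clients to flag as dead for this job on this heartbeat."""
        seen = self._job_reported_clients.setdefault(job_id, set())
        seen |= reported
        missing = expected - reported
        if self.require_previous_report:
            # Debounce: only clients seen at least once can be declared dead.
            missing &= seen
        return missing

    def purge(self, running_jobs: Set[str]) -> None:
        # Drop tracking state for jobs no longer in run_processes.
        for job_id in list(self._job_reported_clients):
            if job_id not in running_jobs:
                del self._job_reported_clients[job_id]
```

With the default, a slow-starting client that has never reported is not flagged on the first heartbeat; setting `require_previous_report=False` restores the aggressive legacy detection.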

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type.
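
The validation can be sketched as below; the meta key name `"job_clients"` is an assumption standing in for the `JOB_CLIENTS` constant, and the function is illustrative, not the `client_run_manager.py` method.

```python
# Hedged sketch: validate job metadata before iterating, so an aborted job
# surfaces a descriptive RuntimeError instead of an opaque TypeError.
from typing import List, Optional


def get_job_clients(job_meta: Optional[dict]) -> List[str]:
    if job_meta is None:
        raise RuntimeError(
            "job meta is missing; the job may have been aborted before this "
            "client started"
        )
    clients = job_meta.get("job_clients")  # key name is illustrative
    if not isinstance(clients, list):
        raise RuntimeError(
            f"invalid job clients meta: expected list, got {type(clients).__name__}"
        )
    return clients
```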

---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections |
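
The `min_clients` recommendation in the table can be computed with a small helper. This is a hypothetical convenience function, not an NVFlare API; the 96-98% band comes from the table above.

```python
# Hedged helper (not an NVFlare API): pick min_clients at ~96-98% of num_clients
# so a few startup stragglers do not abort the round.
import math


def recommended_min_clients(num_clients: int, tolerance: float = 0.97) -> int:
    return max(1, math.floor(num_clients * tolerance))
```

For the 144-client Frontier run described above, this yields a `min_clients` of 139 at the default 97% tolerance.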

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout
as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()`
returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()`
default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta
validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error
still raises; reorder OK |
| `job_runner_test.py` | 4 | strict flag wiring; timeout within
tolerance → warn; timeout below tolerance → raise | |
`job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure;
OK reply not failed; mixed outcomes; detail label; min_sites with
timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires-prior-report; legacy
explicit-False still fires; tracking in server attr not job_info; stale
cleanup |

All 29 targeted unit tests pass.

## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout
classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true`
and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---------

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>