
[2.7] Add end-to-end download starvation test for stream pool fix #4172

Merged
YuanTingHsieh merged 7 commits into NVIDIA:2.7 from chesterxgchen:threadpool_starving_test
Feb 11, 2026

Conversation

@chesterxgchen
Collaborator

Summary

Adds an end-to-end integration test that validates the P0 stream thread pool starvation fix introduced in #4171. Unlike the existing unit test (pool_starvation_test.py), which relies on mocks, this test exercises the full download_object() code path through real TCP-connected Cell objects and blob streaming.

What the test does

Creates two real Cell objects connected via TCP (server + client) and runs 8 concurrent download_object() calls, each producing ~200 chunks (256 bytes each, 50KB total) through the complete flow:

download_object()
  -> Cell.send_request()
    -> Cell.__getattr__() routes to _send_request() (stream channel)
      -> _send_one_request()
        -> send_blob() [request]
          -> BlobStreamer -> ByteStreamer -> TxTask._transmit_task (stream_thread_pool)
          -> ByteReceiver._callback_wrapper (stream_thread_pool)
            -> BlobHandler.handle_blob_cb
              -> _read_stream (stream_thread_pool)
              -> _run_blob_cb (callback_thread_pool) <-- THE FIX
                -> Adapter.call()
                  -> future.result() [waits for _read_stream]
                  -> DownloadService._handle_download()
                    -> Downloadable.produce()
                  -> send_blob() [reply]
        -> _process_reply()
          -> Consumer.consume()
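The separation marked "THE FIX" in the call path above can be sketched generically: the blob callback runs on a pool distinct from the stream workers, so a callback that blocks on future.result() cannot starve the _read_stream tasks it waits on. The names here (stream_pool, callback_pool, handle_blob) are illustrative, not the actual NVFlare identifiers.

```python
from concurrent.futures import ThreadPoolExecutor

stream_pool = ThreadPoolExecutor(max_workers=4)    # runs _read_stream-style tasks
callback_pool = ThreadPoolExecutor(max_workers=4)  # runs blob callbacks (the fix)

def handle_blob(read_stream, blob_cb):
    fut = stream_pool.submit(read_stream)      # stream worker produces the blob
    return callback_pool.submit(blob_cb, fut)  # callback blocks on a DIFFERENT pool

def blob_cb(fut):
    # waiting here occupies a callback worker, never a stream worker
    return len(fut.result())

result = handle_blob(lambda: b"chunk" * 3, blob_cb)
print(result.result(timeout=5))  # 15
```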

Two test variants

| Test class | Behavior | Expected result |
|---|---|---|
| TestDownloadWithFix | Uses real production code: blob_cb dispatched to a separate callback_thread_pool | All 8/8 downloads succeed |
| TestDownloadPreFixStarvation | Patches BlobHandler.handle_blob_cb to call blob_cb synchronously on a pool worker (pre-fix behavior), wraps _read_stream with a 2s delay (simulating a large blob transfer), and uses a 4-worker shared pool | 0/8 succeed -- deadlock confirmed |

How the pre-fix starvation is reproduced

The pre-fix test creates cells after patching so the registered callbacks capture the patched behavior:

  1. BlobHandler.handle_blob_cb is monkeypatched to call blob_cb synchronously (instead of submitting to callback_thread_pool)
  2. _read_stream is wrapped with a 2s time.sleep() to simulate the time a real multi-MB model tensor takes to transfer via many transport chunks
  3. All streaming modules (blob_streamer, byte_receiver, byte_streamer) use a tiny 4-worker shared pool
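Steps 1-2 amount to wrapping a target function with a cancellable delay before monkeypatching it in. A minimal sketch of that wrapper (names are illustrative; per the commit notes, the real test uses an Event so teardown can cancel pending sleeps):

```python
import threading

cancel_evt = threading.Event()  # set during teardown to cancel pending delays

def with_delay(fn, delay=2.0):
    """Wrap fn with an Event-based sleep, as in step 2 above."""
    def wrapper(*args, **kwargs):
        cancel_evt.wait(delay)  # returns early once cancel_evt is set
        return fn(*args, **kwargs)
    return wrapper

slow_read = with_delay(lambda: b"data", delay=0.2)
assert slow_read() == b"data"   # delayed by up to 0.2 s
cancel_evt.set()                # teardown: later calls return immediately
```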

Deadlock mechanism:

  • ByteReceiver._callback_wrapper runs on pool worker, calls handle_blob_cb
  • handle_blob_cb submits slow _read_stream to pool, then calls blob_cb synchronously
  • blob_cb (Adapter.call) blocks on future.result(), waiting for _read_stream to complete
  • With 4+ concurrent requests, all 4 workers are stuck in blob_cb; the _read_stream tasks remain queued and can never execute -- permanent deadlock

Test results

TestDownloadWithFix::test_parallel_downloads_with_fix
  [WITH FIX] 8 ok, 0 failed, 0 timed out / 8    PASSED

TestDownloadPreFixStarvation::test_parallel_downloads_simulating_pre_fix_starvation
  [PRE-FIX SIM] 0 ok, 8 failed, 0 timed out / 8
  [PRE-FIX SIM] Confirmed starvation: only 0/8 succeeded    PASSED

Test plan

  • pytest tests/unit_test/fuel/f3/streaming/download_starvation_test.py -v -s -- both tests pass
  • Test 1 confirms fix works: 8/8 parallel downloads succeed
  • Test 2 confirms pre-fix bug: 0/8 succeed due to deadlock
  • No changes to production code -- test-only PR
  • CI premerge passes

@chesterxgchen
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

Added comprehensive end-to-end integration test that validates the P0 stream thread pool starvation fix from #4171. The test uses real TCP-connected Cell objects and full blob streaming to exercise the complete download_object() code path, including two test variants:

  • TestDownloadWithFix: Validates production code where blob callbacks are dispatched to a separate callback_thread_pool (expects 8/8 downloads to succeed)
  • TestDownloadPreFixStarvation: Simulates pre-fix behavior by monkeypatching BlobHandler.handle_blob_cb to call blob_cb synchronously with a 0.2s delay on a 4-worker shared pool (expects 0/8 to succeed, confirming deadlock)

The test addresses several issues identified in prior review rounds, including:

  • Fixed test data generation to produce exactly TOTAL_SIZE bytes using slice operation (line 101)
  • Tightened pre-fix assertion to expect exactly 0 successes (line 329) rather than just < NUM_PARALLEL
  • Improved cleanup with proper tiny_pool.shutdown() call and restoration of globals in a finally block
  • Added try/except guard around CPython-specific _threads_queues cleanup (lines 289-296)
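The "exactly TOTAL_SIZE bytes via slice" fix in the first bullet follows a common pattern; this is a hedged sketch with illustrative internals (only the 256-byte chunk and 50KB total sizes come from the PR text, the pattern-repetition approach is an assumption):

```python
CHUNK_SIZE = 256
TOTAL_SIZE = 50 * 1024

def make_test_data(total=TOTAL_SIZE):
    pattern = bytes(range(256))
    reps = -(-total // len(pattern))  # ceil division: enough repetitions
    return (pattern * reps)[:total]   # slice guarantees exactly `total` bytes

assert len(make_test_data()) == TOTAL_SIZE
```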

The test verifies that the production fix successfully prevents deadlock while confirming the pre-fix behavior would have caused complete starvation.

Confidence Score: 4/5

  • Safe to merge with minor concerns around CI robustness and test execution time
  • Test-only PR that adds comprehensive validation for a critical bug fix. Multiple rounds of PR review have addressed most significant issues (data generation, cleanup, assertions). Remaining concerns are around CI execution characteristics (TCP networking, timing dependencies, long timeouts) rather than correctness issues. The test properly validates both the fix and the pre-fix starvation scenario through careful monkeypatching and cleanup.
  • No files require special attention - this is a single well-reviewed test file with previous feedback already incorporated

Important Files Changed

Filename Overview
tests/unit_test/fuel/f3/streaming/download_starvation_test.py New end-to-end integration test validating stream pool starvation fix via real Cell objects and blob streaming; addresses multiple previously identified issues including test data generation, assertion strictness, and cleanup robustness


@chesterxgchen
Collaborator Author

/build

@greptile-apps bot left a comment

1 file reviewed, 3 comments

@greptile-apps bot left a comment

1 file reviewed, 4 comments

@greptile-apps bot left a comment

1 file reviewed, 2 comments

@greptile-apps bot left a comment

5 files reviewed, 2 comments

…IDIA#4171)

Add an integration-level test that exercises the full download_object()
flow through real Cell objects and blob streaming to validate the P0
thread pool starvation fix introduced in NVIDIA#4171.

The test creates real TCP-connected Cell pairs and runs 8 concurrent
download_object() calls, each producing ~200 chunks through the full
path: Cell.send_request() -> send_blob() -> BlobHandler ->
stream_thread_pool / callback_thread_pool -> Adapter.call() ->
DownloadService._handle_download() -> Downloadable.produce() ->
Consumer.consume().

Two variants:
- TestDownloadWithFix: verifies all 8 parallel downloads succeed with
  the fix (callback_thread_pool separates blob_cb from stream workers).
- TestDownloadPreFixStarvation: patches BlobHandler.handle_blob_cb to
  reproduce the pre-fix synchronous blob_cb behavior with a slow
  _read_stream (simulating large blob transfer), confirms deadlock
  when all pool workers block on future.result().
- Remove "NOT intended for git" statement from docstring
- Add pytest.mark.timeout(60/120) to both test classes for CI safety
- Replace tiny_pool.stopped=True with tiny_pool.shutdown(wait=False)
  to properly terminate worker threads
- Wrap fixture teardown in try/finally to ensure globals are restored
  even if the test fails
- Remove "NOT intended for git" from docstring
- Reduce _read_stream delay from 2s to 0.2s to minimize test runtime
- Add pytest.mark.timeout for CI safety
- Use cancellable Event-based delay for clean teardown
- Wrap fixture teardowns in try/finally
- Properly shutdown tiny pool and remove from atexit tracking
- Total test time: ~10s with clean process exit
- Increase cell connection wait from 1s to 2s (CELL_CONNECT_TIMEOUT) for
  slow CI runner robustness
- Wrap _threads_queues atexit cleanup in try/except for portability
  across non-CPython implementations and future Python versions
- Remove redundant self.data in ChunkedDownloadable, use base_obj from parent
- Remove no-op 'if resume: pass' in _pre_fix_handle_blob_cb
- Add comment explaining daemon=True on download threads
@chesterxgchen force-pushed the threadpool_starving_test branch from 3df46fe to c54f036 on February 11, 2026 04:46
@chesterxgchen
Collaborator Author

/build

@greptile-apps bot left a comment

1 file reviewed, 2 comments

- Assert succeeded == 0 (deterministic deadlock) instead of < NUM_PARALLEL
- Generate exactly TOTAL_SIZE bytes in _make_test_data() via slice

@greptile-apps bot left a comment

1 file reviewed, no comments

@chesterxgchen
Collaborator Author

/build

@IsaacYangSLA (Collaborator) left a comment

Adding more tests is always good. Thanks.

@IsaacYangSLA
Collaborator

I clicked the 'resolve' on outdated comments.

@YuanTingHsieh merged commit 22af24d into NVIDIA:2.7 on Feb 11, 2026
19 checks passed
chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 22, 2026
…ent, hierarchical startup stability

Add three major new sections to flare_272.rst covering work merged after the
initial 2.7.2 draft:

Memory Management (restructured):
- Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210)
- Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache
  injected after flare.send() without training script changes (PR NVIDIA#4211)
- Retain existing TensorDownloader and server-side cleanup content

F3 Streaming Reliability and Performance (new section):
- HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall
  detection/recovery with recommended env-var settings (PR NVIDIA#4206)
- Stream pool starvation fix: blob callbacks dispatched to dedicated thread
  pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172)
- Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167)
- RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204)
- Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174)

Hierarchical FL Startup Stability (new section):
- Deployment timeout correctly classified as failure; min_sites check applied
  at deployment phase (PR NVIDIA#4209)
- Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209)
- Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209)
- Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209)
- Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale)

Bug Fixes section updated with all streaming and hierarchical startup fixes.
Intro paragraph updated to reflect system hardening scope.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen added a commit that referenced this pull request Feb 24, 2026
…ctive client exclusion, and dead-detection debounce (#4209)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is:

1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job

PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort.

### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`.

New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none).

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance.
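Fixes 2 and 3 together describe a simple control flow. A hedged sketch (the `check_client_replies` return type follows the PR text; everything else here is illustrative):

```python
from typing import List, Set

def start_run_sketch(all_clients: Set[str], timed_out: List[str],
                     min_sites: int) -> Set[str]:
    """Consume the List[str] of timed-out names from check_client_replies()."""
    active = all_clients - set(timed_out)
    if len(active) < min_sites:
        # below tolerance: abort the job
        raise RuntimeError(
            f"only {len(active)} active clients, min_sites={min_sites}"
        )
    # within tolerance: warn about the stragglers and proceed without them
    return active
```

For example, with 3 clients, 1 straggler, and min_sites=2, the job proceeds with the remaining 2 clients instead of aborting.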

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`.

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type.
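A minimal sketch of that validation (the key name, expected type, and function body are assumptions for illustration, not the actual NVFlare code):

```python
def get_job_clients_sketch(meta):
    # descriptive RuntimeError instead of an opaque TypeError on bad meta
    clients = None if meta is None else meta.get("JOB_CLIENTS")
    if not isinstance(clients, list):
        raise RuntimeError(
            f"invalid job meta: expected a list under JOB_CLIENTS, "
            f"got {type(clients).__name__}"
        )
    return clients
```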

---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections |

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK |
| `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup |

All 29 targeted unit tests pass.

## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ctive client exclusion, and dead-detection debounce (NVIDIA#4209)

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 24, 2026
…ent, hierarchical startup stability

chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Feb 25, 2026
…ent, hierarchical startup stability

chesterxgchen added a commit that referenced this pull request Feb 25, 2026
…ent, hierarchical startup stability [skip ci] (#4218)

## Merge Dependency

> ⚠️ **Depends on #4209** — The *Hierarchical FL Startup Stability* section documents changes introduced by PR #4209 (currently open). **Merge PR #4209 into `2.7` before merging this PR.** All other sections cover already-merged PRs.

---

## Summary

This PR updates `docs/release_notes/flare_272.rst` to reflect all major
changes merged into the 2.7.x line after the initial 2.7.2 draft,
covering three new areas:

- **Memory Management** — restructured and expanded with Zero Tensor Copy at CJ (PR #4210) and client-side memory lifecycle management (PR #4211)
- **F3 Streaming Reliability and Performance** — new section covering HOL stall mitigation (PR #4206), stream pool starvation fix (PR #4171/#4172), streaming download retry (PR #4167), RxTask self-deadlock fix (PR #4204), and lock contention reduction (PR #4174)
- **Hierarchical FL Startup Stability** — new section covering deployment timeout classification, startup grace period, selective client exclusion, and metadata hardening (PR #4209 — pending merge), with recommended config snippets for HPC/Lustre environments

The Bug Fixes section and intro paragraph are also updated accordingly.

A source-level RST comment has been added above the Hierarchical FL
section in the file to alert future maintainers to the merge dependency.

## Merged PRs Documented

| PR | Area | Status |
|---|---|---|
| #4171 / #4172 | Stream pool starvation fix | Merged |
| #4174 | Lock contention reduction | Merged |
| #4167 | Streaming download retry | Merged |
| #4204 | RxTask self-deadlock fix | Merged |
| #4206 | HOL stall mitigation | Merged |
| #4210 | Zero tensor copy at CJ | Merged |
| #4211 | Client-side memory management | Merged |
| #4209 | Hierarchical FL startup stability | **Open — merge before this PR** |

## Changes

### Memory Management (restructured)

- **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds `LazyDownloadRef` placeholders instead of materializing full tensors, eliminating the CJ as a memory bottleneck for LLM-scale models.
- **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` / jemalloc purge / `torch.cuda.empty_cache()` injected after every `flare.send()`, configurable via `client_memory_gc_rounds`.
- Existing TensorDownloader and server-side cleanup content retained.
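The malloc_trim step in the second bullet can be sketched with ctypes. A hedged, glibc-only illustration: the hook point after flare.send() and the function name release_memory are assumptions, not NVFlare's actual implementation.

```python
import ctypes
import ctypes.util
import gc

def release_memory():
    """Run GC, then ask glibc to return freed arena pages to the OS."""
    gc.collect()
    libc_name = ctypes.util.find_library("c")
    if libc_name:  # no-op where libc cannot be resolved
        libc = ctypes.CDLL(libc_name)
        if hasattr(libc, "malloc_trim"):  # glibc-specific symbol
            libc.malloc_trim(0)

release_memory()  # would be invoked after each flare.send()
```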

### F3 Streaming Reliability and Performance (new section)

- **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress watchdog, and stall detection/recovery. Includes recommended environment variable settings for large hierarchical deployments.
- **Stream Pool Starvation Fix**: Blob callbacks dispatched to a dedicated `callback_thread_pool`, keeping stream workers free for concurrent downloads.
- **Streaming Download Retry**: Exponential-backoff retry (up to 3 attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware.
- **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock` released, eliminating stream-error-triggered deadlock.
- **Lock Contention Reduction**: `produce_item()` runs outside `self.lock`; compare-and-store for cache write. Reduces model-download latency under high client concurrency.
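The retry bullet describes a standard exponential-backoff loop. A generic sketch follows: only "up to 3 attempts, capped at 60 s, on timeout" comes from the text; the abort check, parameter names, and defaults are illustrative.

```python
import time

def retry_download(download_once, abort_is_set=lambda: False,
                   attempts=3, base_delay=5.0, cap=60.0):
    delay = base_delay
    for i in range(attempts):
        try:
            return download_once()
        except TimeoutError:
            if i == attempts - 1 or abort_is_set():
                raise              # out of attempts, or the job was aborted
            time.sleep(min(delay, cap))
            delay *= 2             # exponential backoff, capped at `cap`

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("stream timed out")
    return "blob"

print(retry_download(flaky, base_delay=0.01))  # "blob" after two retries
```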

### Hierarchical FL Startup Stability (new section — pending PR #4209)

- **Deployment Timeout as Failure**: `reply=None` correctly counted
against `min_sites`; timed-out clients excluded before
`start_client_job`.
- **Startup Grace Period**: Dead-client detection debounced — client
must be observed once before absence triggers dead-job notification.
Default changed to `True`.
- **Selective Client Exclusion**: Stragglers at start-job excluded
rather than causing full abort, if remaining count ≥ `min_clients`.
- **Metadata Hardening**: `TypeError` on absent job metadata replaced
with descriptive `RuntimeError`.
- Recommended `config_fed_server.json` / `config_fed_client.json`
snippets for HPC (Frontier/ORNL) scale.
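The startup-grace debounce above reduces to a simple rule: a client's absence only counts as "dead" once that client has been observed at least once. A minimal sketch, with invented names:

```python
class DeadJobDetector:
    """Illustrative debounce: require one positive sighting before a
    missing heartbeat can trigger a dead-job notification."""

    def __init__(self, require_previous_report: bool = True):
        self.require_previous_report = require_previous_report
        self.reported = set()  # clients observed at least once

    def on_heartbeat(self, client: str):
        self.reported.add(client)

    def is_dead(self, client: str, currently_reporting: set) -> bool:
        if client in currently_reporting:
            return False
        if self.require_previous_report and client not in self.reported:
            return False  # never seen yet: still starting up, not dead
        return True  # seen before, now absent
```

Passing `require_previous_report=False` restores the aggressive legacy behavior, where the very first heartbeat sync can flag a slow-starting client as dead.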

## Test plan

- [ ] Sphinx build (`make html`) passes without RST warnings on the
updated file
- [ ] All new directives (`.. code-block::`, `.. note::`) render
correctly in the docs build
- [ ] Verify section hierarchy (underline characters) is consistent
throughout the file
- [ ] Confirm PR #4209 is merged before this PR is merged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
chesterxgchen added a commit that referenced this pull request Mar 12, 2026
…meouts, selective client exclusion, and dead-detection debounce (#4209) (#4288)

## Problem

Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays
on Frontier) abort in
Round 0 due to a cascading startup failure chain.  The root sequence is:

1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from
relay-connected clients
2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` —
not a failure — so
   timed-out clients silently appear to have been deployed
3. **`_start_run()`** tries to start those clients; they again time out,
and
   `check_client_replies()` ignores the `None` reply
4. **`_sync_client_jobs()`** fires dead-job notification on the very
first heartbeat with
   no startup grace period
5. FedAvg requires 144/144 — one or two missing clients → abort
6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not
iterable` when
`get_job_clients()` receives `None` metadata from an already-aborted job

PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in
2.7.2) address the
transport layer.  This PR addresses the remaining job lifecycle layer.

---

## Fixes Included

### 1 — `_deploy_job()`: Treat deployment timeout as failure
(`job_runner.py`)

**Root bug**: `reply=None` was logged as `"unknown"` and excluded from
`failed_clients`,
so timed-out clients counted as "successfully deployed" for the
`min_sites` check.

**Fix**: Add timed-out clients to `failed_clients` with a `"deployment
timeout"` label.
The existing `min_sites` / `required_sites` logic then correctly decides
whether to abort.
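The corrected accounting can be sketched as below. The reply shape and function name are assumptions; the point is that `None` lands in `failed_clients` instead of a silent "unknown" bucket.

```python
def classify_deploy_replies(replies: dict, min_sites: int):
    """Illustrative fixed accounting for _deploy_job()-style replies."""
    deployed = []
    failed = {}
    for client, reply in replies.items():
        if reply is None:
            failed[client] = "deployment timeout"  # the fix: was "unknown"
        elif reply.get("ok"):
            deployed.append(client)
        else:
            failed[client] = reply.get("error", "deployment error")
    abort = len(deployed) < min_sites  # min_sites check now sees timeouts
    return deployed, failed, abort
```

With the old behavior, the two timed-out clients in the second case below would have counted as deployed and the job would have proceeded only to fail later at start-run.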

### 2 — `check_client_replies()`: Return timed-out clients instead of
raising (`admin.py`)

**Root bug**: In strict mode, any timeout raised immediately, aborting
the whole job even
when the remaining active clients satisfied `min_sites`.

**Fix**: In strict mode, collect timed-out clients into a return list
rather than raising.
Explicit errors (non-OK return code or error body) still raise. Also
fixes the non-strict
mode to use name-keyed dict lookup instead of fragile positional
`zip()`.

New signature: `check_client_replies(...) -> List[str]` (timed-out
client names; empty = none).
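A sketch of the new contract, under assumed reply shapes (the real `check_client_replies` in `admin.py` has a different signature; this only illustrates the collect-don't-raise behavior and name-keyed lookup):

```python
from typing import Dict, List, Optional


def check_client_replies(replies: Dict[str, Optional[dict]],
                         strict: bool = True) -> List[str]:
    """Timeouts are collected and returned; explicit errors still raise."""
    timed_out = []
    for client, reply in replies.items():  # name-keyed, not positional zip()
        if reply is None:
            timed_out.append(client)       # collect instead of raising
        elif strict and not reply.get("ok"):
            raise RuntimeError(f"client {client} returned an error reply")
    return timed_out
```

The caller can then weigh the returned list against `min_sites` rather than losing the whole job to one straggler.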

### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation
(`job_runner.py`)

**Root bug**: A start-job timeout under strict mode aborted the entire
job with no
tolerance for stragglers within `min_sites` bounds.

**Fix**: Use the returned timed-out list from `check_client_replies()`.
If remaining
active clients >= `min_sites`, log a warning and proceed. Only abort
when below tolerance.

### 4 — `_sync_client_jobs()`: Require-prior-report default changed to
`True` (`fed_server.py`)

**Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to
`False`, meaning
the bug fix was opt-in and the unsafe behaviour remained the default.

**Fix**: Default changed to `True`. Operators who want the aggressive
legacy detection
can set it to `False` explicitly.

### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of
`job_info` dict (`fed_server.py`)

**Root bug**: Positive-observation tracking was stored as
`job_info["_reported_clients"]`,
injecting algorithm state into a data dict with no corresponding
`RunProcessKey` constant.

**Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]`
on `FederatedServer`.
Stale entries are purged whenever a job is no longer in `run_processes`.
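A sketch of the relocated tracking and its stale-entry purge (class and method names invented for illustration):

```python
class ReportTracker:
    """Illustrative server-side tracking: per-job reported-client sets
    live on the server object, not inside the job_info data dict."""

    def __init__(self):
        self._job_reported_clients = {}  # job_id -> set of client names

    def record(self, job_id: str, client: str):
        self._job_reported_clients.setdefault(job_id, set()).add(client)

    def purge_stale(self, run_processes: set):
        # drop tracking for jobs no longer in run_processes
        for job_id in list(self._job_reported_clients):
            if job_id not in run_processes:
                del self._job_reported_clients[job_id]
```

Keeping the sets on the server object avoids smuggling algorithm state into a data dict that has no `RunProcessKey` for it.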

### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation
(`client_run_manager.py`)

Raises `RuntimeError` with a descriptive message instead of an opaque
`TypeError` when
`JOB_CLIENTS` is absent or the wrong type.
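The hardened lookup amounts to validating before iterating; a minimal sketch (the meta key and message wording are illustrative):

```python
def get_job_clients(meta: dict):
    """Illustrative hardened lookup: a missing or mistyped JOB_CLIENTS
    entry raises a descriptive RuntimeError instead of a TypeError."""
    clients = (meta or {}).get("JOB_CLIENTS")
    if not isinstance(clients, list):
        raise RuntimeError(
            "job meta has no valid JOB_CLIENTS list; the job may have "
            "been aborted before metadata was populated")
    return clients
```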

---

## Configuration Recommendations (No Code Change Needed)

| Setting | Recommended value | Effect |
|---|---|---|
| `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers |
| `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete |
| `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded |
| `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay |
| `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections |

---

## Files Changed

- `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout
as failure; `_start_run()` selective exclusion
- `nvflare/private/fed/server/admin.py` — `check_client_replies()`
returns timed-out list; dict-keyed non-strict path
- `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()`
default `True`; `_job_reported_clients` attr; stale cleanup
- `nvflare/private/fed/client/client_run_manager.py` — explicit meta
validation in `get_job_clients()`

---

## Test Coverage

New and updated unit tests with both positive and negative cases:

| File | Tests | What they cover |
|---|---|---|
| `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK |
| `job_runner_test.py` | 4 | Strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise |
| `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence |
| `fed_server_test.py` | 5 | Default requires prior report; legacy explicit `False` still fires; tracking in server attr not job_info; stale cleanup |

All 29 targeted unit tests pass.

## Test Plan

- [x] Unit tests for each changed function (positive + negative)
- [x] New `job_runner_deploy_test.py` covering deployment timeout
classification end-to-end
- [x] All 29 targeted unit tests pass
- [ ] Hierarchical staging run with all flags at default
- [ ] Hierarchical staging run with `strict_start_job_reply_check=true`
and reduced `min_clients`
- [ ] Verify no regression on standard (non-hierarchical) FL jobs

---------

Fixes # .


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.

Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>