[2.7] Pass-Through: Zero Tensor Copy at CJ for Large-Model Federated Training by chesterxgchen · Pull Request #4210 · NVIDIA/NVFlare

chesterxgchen · 2026-02-20T06:31:55Z

Summary

This PR introduces the pass-through architecture for ClientAPILauncherExecutor, eliminating tensor materialisation at the CJ (Client Job) process when large models are exchanged between the FL server and a subprocess agent.

In large-model federated learning (e.g., 7B–70B LLM fine-tuning), the CJ process today acts as a blind relay that fully deserializes and re-serializes every tensor it receives from the FL server before forwarding to the subprocess. For a 70B float16 model, this consumes ~140 GB of CJ memory and requires two complete network transfers. B1 pass-through removes both costs.

Problem Description

NVFlare's multi-hop execution path for launch_external_process=True looks like:

FL Server ──serialize──▶ CJ process ──re-serialize──▶ Subprocess agent

Before this PR

Each tensor in the global model is handled as follows at CJ:

Server serializes the model and creates a download transaction (tensor data lives on the server).
CJ fully downloads every tensor from the server into its own heap, materialising the complete model in CJ memory.
CJ re-serializes the model for the subprocess, creating a new download transaction — the subprocess then downloads from CJ.

For large models, this means:

CJ peak memory = full model size (potentially 100s of GB).
Two full network transfers: server → CJ, then CJ → subprocess.
CJ becomes a throughput bottleneck and an OOM risk for any model that doesn't fit in the CJ process's memory.

This is the reason why workflows are infeasible for large models beyond what the CJ machine can hold.

Proposed Architecture and Solution

Pass-Through Architecture

With FOBSContextKey.PASS_THROUGH enabled on CJ's cell FOBS context, the data path becomes:

FL Server ──stream──▶ CJ (LazyDownloadRef only, no tensor data)
                          └──forward ref──▶ Subprocess
                                               └──download──▶ FL Server

CJ holds only lightweight placeholders (< 100 bytes per tensor). The subprocess downloads each tensor directly from the FL server — CJ is never involved in the tensor data path.

Key Components

1. `FOBSContextKey.PASS_THROUGH` (`nvflare/fuel/utils/fobs/init.py`)

A new context key that signals ViaDownloaderDecomposer to skip the download step and create lazy placeholders instead.

2. `LazyDownloadRef` (`via_downloader.py`)

A small sentinel object (four fields: fqcn, ref_id, item_id, dot) created by recompose() in PASS_THROUGH mode. It carries the original FL server's FQCN, batch ref_id, intra-batch item ID, and Datum Object Type — everything the subprocess needs to download the tensor directly.

3. `_LazyBatchInfo` (`via_downloader.py`)

A named sentinel stored in fobs_ctx[items_key] during PASS_THROUGH receive. Using a typed class (rather than a plain tuple) makes the PASS_THROUGH branch unambiguous and immune to accidental type collisions with real item dicts.

4. `LazyDownloadRefDecomposer` (`via_downloader.py`)

A new auto-registered FOBS decomposer for LazyDownloadRef. When CJ re-serializes a task containing LazyDownloadRef objects:

decompose() delegates to get_dot_handler(lazy.dot) — the original ViaDownloaderDecomposer subclass (e.g., TensorDecomposer, NumpyArrayDecomposer). That handler's _finalize_lazy_batch post-callback re-emits the original server datum (fqcn + ref_id + DOT) so the subprocess knows exactly where to download from. lazy_dot is appended to the encoding dict for routing on the receive side.
recompose() uses lazy_dot to look up the handler and delegates to handler.recompose(), which retrieves the real tensor from fobs_ctx[handler.items_key] (populated by process_datum() when the subprocess received the forwarded datum).

The dot (Datum Object Type) field on both LazyDownloadRef and _LazyBatchInfo ensures that numpy arrays stay with NumpyArrayDecomposer and PyTorch tensors stay with TensorDecomposer, preserving type safety through the full pass-through hop.

5. `ClientAPILauncherExecutor.initialize()` (`client_api_launcher_executor.py`)

On startup, the executor enables PASS_THROUGH on the engine cell's FOBS context:

cell.core_cell.update_fobs_context({FOBSContextKey.PASS_THROUGH: True})

This single line activates the full B1 architecture for every job that uses launch_external_process=True — including llm_hf and any recipe that calls ScriptRunner(launch_external_process=True).

Analysis

Why not intercept at the pipe level?

The pipe (CellPipe) operates on already-serialized bytes. Intercepting at the pipe level would require parsing FOBS binary format, re-writing datum references, and re-assembling the byte stream — fragile and tightly coupled to the wire format.

Intercepting at the FOBS decomposer level is the natural extension point: decomposers already control exactly when and how data is materialised. PASS_THROUGH simply adds a "don't materialise" branch to that existing mechanism.

Why `dot` (Datum Object Type) on `LazyDownloadRef`?

The subprocess must know which ViaDownloaderDecomposer subclass owns the downloaded data so it can store it in the correct fobs_ctx[items_key] and route recompose() correctly. The dot field, set when the server originally serialized the tensor, carries this type information through the pass-through hop without any type-switching logic.

Memory profile at CJ

Stage	Before (tensor materialised)	After (B1 pass-through)
CJ receive	Full model size (e.g., 140 GB)	~100 bytes per tensor
CJ forward	Creates new download tx	Re-emits original server datum
Subprocess receive	Downloads from CJ	Downloads directly from FL server

Benefits

Zero tensor copy at CJ — CJ memory footprint is independent of model size.
One network transfer instead of two — tensors travel server → subprocess directly.
No CJ OOM risk for large models regardless of CJ machine memory capacity.
Transparent to job authors — no changes to job configs, training scripts, or recipe APIs; launch_external_process=True automatically activates B1.
Type-safe — dot propagation preserves tensor type (numpy / pytorch) through the hop without any if/elif type switching.

Backward Compatibility

All existing jobs using launch_external_process=True automatically benefit. No config or script changes required.
Jobs using launch_external_process=False (in-process executor) are completely unaffected — ClientAPILauncherExecutor.initialize() is not called.
For models smaller than the ViaDownloaderDecomposer streaming threshold (2 MB per array), FOBS uses native (inline) serialization regardless of PASS_THROUGH — behaviour is identical to before.
LazyDownloadRefDecomposer is auto-registered via the existing register_folder mechanism; no explicit registration call is needed by any caller.

Changed Files

File	Change
`nvflare/fuel/utils/fobs/__init__.py`	Add `FOBSContextKey.PASS_THROUGH`
`nvflare/fuel/utils/fobs/decomposers/via_downloader.py`	Add `LazyDownloadRef`, `_LazyBatchInfo`, PASS_THROUGH branches in `process_datum()` / `recompose()`, `LazyDownloadRefDecomposer`, `_finalize_lazy_batch` post-callback
`nvflare/app_common/executors/client_api_launcher_executor.py`	`initialize()` enables PASS_THROUGH on engine cell

Test Strategy

Unit tests — `tests/unit_test/fuel/utils/fobs/test_pass_through.py` (22 tests)

Test class	What is verified
`TestLazyDownloadRef`	Construction, `__slots__`, per-item distinctness
`TestLazyBatchInfo`	Construction, `__slots__`, `isinstance` reliability vs plain tuple
`TestProcessDatumPassThrough`	PASS_THROUGH stores `_LazyBatchInfo`, never calls `_download_from_remote_cell`; normal mode calls download
`TestRecomposePassThrough`	Returns `LazyDownloadRef` with correct `fqcn`, `ref_id`, `item_id` from `_LazyBatchInfo`
`TestDecomposeWithLazyDownloadRef`	Returns REF encoding; `_finalize_lazy_batch` post-CB registered once per batch regardless of item count; emitted datum has correct fqcn/ref_id/DOT
`TestNoMemoryAccumulation`	`_CtxKey.OBJECTS` absent after PASS_THROUGH (no download transaction opened); `DownloadService._tx_table` unchanged; 50-cycle repeat produces no state bleed

End-to-end tests — `tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py` (5 tests, real TCP Cells)

Test	What is verified
`test_arrays_survive_pass_through_hop`	Full round-trip: server → CJ (PASS_THROUGH) → subprocess, arrays arrive bit-exact
`test_cj_holds_only_lazy_refs_not_tensor_data`	After PASS_THROUGH deserialization, CJ holds only `LazyDownloadRef`, never `np.ndarray`
`test_cj_creates_no_download_transaction`	`DownloadService._tx_table` is unchanged during PASS_THROUGH + re-serialization
`test_forwarded_payload_carries_original_server_ref`	Forwarded datum contains original server `fqcn` and `ref_id` — subprocess downloads from server, not CJ
`test_multiple_array_roundtrip`	8-array batch all survive with bit-exact values

Integration test — `tests/integration_test/data/jobs/pt_large_model_pass_through/`

Full-stack integration test using PTClientAPILauncherExecutor with launch_once=True (the pattern used by llm_hf):

Model: LargeNet — 3-layer MLP with ~8 MB of float32 parameters, well above the 2 MB ViaDownloaderDecomposer streaming threshold. This forces the real B1 code path (streaming + PASS_THROUGH) rather than the native inline path used by small models.
Client script: Mirrors llm_hf/client.py structure (while flare.is_running() loop, receive / train / send). Uses CPU-only synthetic data — no dataset download required in CI.
Added to client_api.yml as "run pt-large-model-pass-through".

🤖 Generated with Claude Code

greptile-apps · 2026-02-20T06:34:22Z

Greptile Summary

Introduces pass-through architecture that eliminates tensor materialization at the CJ (Client Job) process for large-model federated learning. Instead of downloading and re-serializing model tensors at the CJ (consuming 140+ GB for a 70B model), the CJ now creates lightweight LazyDownloadRef placeholders (~100 bytes each) that preserve the original FL server's FQCN and batch reference. The subprocess agent downloads tensors directly from the FL server, bypassing the CJ entirely.

Key Changes:

Added FOBSContextKey.PASS_THROUGH flag and LazyDownloadRef/_LazyBatchInfo sentinels to FOBS serialization
ViaDownloaderDecomposer.process_datum() skips download when PASS_THROUGH=True, storing batch metadata instead
LazyDownloadRefDecomposer re-emits original server datum during CJ re-serialization via _finalize_lazy_batch callback
ClientAPILauncherExecutor.initialize() enables PASS_THROUGH on engine cell, with proper restore in finalize()
Comprehensive test coverage: 22 unit tests, 5 E2E tests with real cells, full integration test with 8 MB model

Impact:

Zero tensor copy at CJ — memory footprint independent of model size
One network transfer instead of two (server → subprocess directly)
Transparent to job authors — automatic for launch_external_process=True
Type-safe via dot (Datum Object Type) propagation through the hop

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation is architecturally sound with excellent test coverage (27 tests total), proper exception handling, backward compatibility guarantees, and clear documentation. The pass-through logic is well-isolated to the FOBS decomposer layer with type-safe sentinel classes. Auto-registration via register_folder ensures the LazyDownloadRefDecomposer is available. The initialize/finalize restoration logic prevents state leakage.
No files require special attention

Important Files Changed

Filename	Overview
nvflare/fuel/utils/fobs/decomposers/via_downloader.py	Added LazyDownloadRef, _LazyBatchInfo sentinels, LazyDownloadRefDecomposer, and PASS_THROUGH logic to enable zero-copy tensor forwarding at CJ
nvflare/fuel/utils/fobs/init.py	Added FOBSContextKey.PASS_THROUGH constant to enable pass-through mode in FOBS context
nvflare/app_common/executors/client_api_launcher_executor.py	Enable PASS_THROUGH on engine cell in initialize(), restore in finalize(), with proper exception handling
tests/unit_test/fuel/utils/fobs/test_pass_through.py	Comprehensive unit tests (22 tests) covering LazyDownloadRef, _LazyBatchInfo, process_datum, recompose, decompose, and memory accumulation
tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py	E2E tests (5 tests) with real TCP cells validating round-trip correctness, CJ memory isolation, and download service transaction handling

Sequence Diagram

sequenceDiagram
    participant Server as FL Server
    participant CJ as CJ Process
    participant Sub as Subprocess Agent

    Note over Server,Sub: Before This PR (Tensor Materialization)
    Server->>CJ: serialize tensors
    Note over CJ: Downloads & materializes<br/>full model (e.g., 140 GB)
    CJ->>CJ: Re-serialize for subprocess
    CJ->>Sub: New download transaction
    Sub->>CJ: Download tensors
    Note over CJ: CJ memory = full model size

    Note over Server,Sub: After This PR (Pass-Through)
    Server->>CJ: serialize tensors with ref
    Note over CJ: PASS_THROUGH enabled<br/>Creates LazyDownloadRef<br/>(~100 bytes per tensor)
    CJ->>Sub: Forward LazyDownloadRef<br/>(original server FQCN + ref_id)
    Sub->>Server: Download tensors directly
    Note over CJ: CJ memory ≈ 0 bytes<br/>(no tensor data)

_{Last reviewed commit: 741b460}

greptile-apps

_{11 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

chesterxgchen · 2026-02-20T06:39:22Z

/build

greptile-apps

_{13 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

tests/memory_profile/fedavg/mprofile_20260124221746.dat

chesterxgchen · 2026-02-20T06:43:24Z

/build

greptile-apps

_{14 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

tests/memory_profile/fedavg/mprofile_20260124221746.dat

greptile-apps

_{12 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

pcnudde

Are there any possible incompatibilities with things like filters or other users code that expects real tensors instead of the the lazy references.

nvflare/app_common/executors/client_api_launcher_executor.py

Replace CellPipe with LocalCellPipe as the default pipe transport between the CJ (Client Job) process and the trainer subprocess when launch_external_process=True. ## What changes * **New `nvflare/fuel/utils/pipe/local_cell_pipe.py`**: a `Pipe` subclass that uses a direct local Cellnet Cell connection (tcp://127.0.0.1:<port>) instead of routing through the external FL network. - PASSIVE side (CJ): creates a server-child Cell with an internal listener on a dynamically assigned localhost port. - ACTIVE side (subprocess): creates a Cell that connects to the PASSIVE cell via `parent_url=<bound-port-url>` exported at job start. - Both task and metric pipes share one Cell per side via a class-level `_cells_info` cache, multiplexed over distinct channel names. * **Modified `nvflare/job_config/script_runner.py`**: adds `_create_local_cell_pipe()` and uses it (instead of `_create_cell_pipe()`) as the default pipe factory in `add_to_fed_job()`. * **New `tests/unit_test/fuel/utils/pipe/test_local_cell_pipe.py`**: 33 unit tests covering FQCN helpers, placeholder guard, cell sharing, listener binding, bidirectional messaging, two-channel multiplexing, lifecycle / error cases, and concurrent sends. ## Why The existing CellPipe requires both the CJ process and the subprocess to join the external FL cellnet (with auth tokens and TLS certs) just to exchange model updates on the same host. LocalCellPipe eliminates that overhead: no auth credentials, no relay round-trips, sub-millisecond local TCP latency. It also lays the architectural foundation for a future B1 pass-through optimisation (CJ forwards raw FOBS ref-ids directly to subprocess so tensor download happens once, at the subprocess, reducing peak memory usage across training rounds). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

## Problem With the previous CellPipe / LocalCellPipe transport, the CJ (Client Job) process was always a full intermediate stop for model tensors: 1. CJ receives task from server → ViaDownloaderDecomposer downloads every tensor → CJ holds the full model in memory (30–70 GB for large models). 2. CJ re-serialises the model for the subprocess pipe → new download refs pointing to *CJ's* cell → subprocess downloads from CJ. Peak CJ memory = model size. For 30–70 GB models this is a hard blocker. ## Solution — B1 pass-through architecture ### FOBS layer (`nvflare/fuel/utils/fobs/`) * **`FOBSContextKey.PASS_THROUGH`** — new flag. When set in a cell's fobs_ctx, `ViaDownloaderDecomposer` does **not** download tensors on receipt. Instead it creates `LazyDownloadRef` placeholders (tiny Python objects, just FQCN + ref_id + item_id) for every tensor in the received batch. * **`LazyDownloadRef`** — new class in `via_downloader.py`. Carries the *original source* FQCN and batch ref_id. * **`ViaDownloaderDecomposer.process_datum()`** — PASS_THROUGH branch: store `(fqcn, ref_id)` tuple instead of downloading. * **`ViaDownloaderDecomposer.recompose()`** — returns `LazyDownloadRef` objects when `items_key` holds a tuple (PASS_THROUGH mode). * **`ViaDownloaderDecomposer.decompose()`** — detects `LazyDownloadRef` targets and re-emits the *original* server datum verbatim via `_finalize_lazy_batch()` post-callback, instead of creating a new download transaction on the CJ cell. ### Executor (`nvflare/app_common/executors/client_api_launcher_executor.py`) `initialize()` now sets `PASS_THROUGH=True` on the engine's communication cell so all incoming FL tasks arrive at CJ without tensor downloads. ### Pipe defaults (`nvflare/job_config/script_runner.py`) * **Task pipe** → `CellPipe` (FL-network connected). The subprocess agent's pipe cell can reach any FQCN in the FL network, including the server cell that owns the original download transaction. Each tensor is downloaded individually by `ViaDownloaderDecomposer` at the subprocess — no single-message 2 GB cap applies. * **Metric pipe** → `LocalCellPipe` (direct local TCP). Metric payloads are scalars/strings; local TCP is faster and needs no FL auth. ## Data flow after this change ``` Server ──(ViaDownloader refs)──► CJ ──(same refs forwarded)──► subprocess no tensor download at CJ downloads from server directly ``` CJ peak memory: O(number of tensor refs) instead of O(model size). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… check - Remove LocalCellPipe and its tests (dead code: CellPipe covers the same use-case with acceptable overhead and no extra Cell infrastructure) - Revert script_runner.py metric pipe back to CellPipe (both pipes now use CellPipe, same as before the LocalCellPipe experiment) - Replace fragile isinstance(items, tuple) guard in ViaDownloaderDecomposer with a named _LazyBatchInfo sentinel class, making the PASS_THROUGH path unambiguous and robust against accidental type collisions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…accumulation) Tests cover: - LazyDownloadRef and _LazyBatchInfo construction and slot integrity - process_datum() in PASS_THROUGH mode: creates _LazyBatchInfo sentinel, does NOT invoke the network download path - recompose() returns LazyDownloadRef with correct fqcn/ref_id/item_id when _LazyBatchInfo is in fobs_ctx - decompose() on a LazyDownloadRef: re-emits the original server datum via _finalize_lazy_batch post-CB; CB registered only once per batch regardless of item count; emitted datum carries correct fqcn/ref_id/DOT - No memory accumulation: _CtxKey.OBJECTS absent from fobs_ctx after PASS_THROUGH round, no DownloadService transaction created at CJ, 50 repeated cycles do not accumulate state Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n + E2E tests Fix a real serialization bug in the PASS_THROUGH path and add end-to-end tests that validate the full B1 architecture round-trip. Problem: When CJ deserialized a task in PASS_THROUGH mode, numpy arrays became LazyDownloadRef objects. When CJ then re-serialized those objects for the subprocess via the task pipe, FOBS had no decomposer for LazyDownloadRef and raised: RuntimeError: <class 'LazyDownloadRef'> can't be serialized by FOBS without a decomposer This made the entire B1 pass-through architecture non-functional. Changes: via_downloader.py • Add `dot` field to LazyDownloadRef and _LazyBatchInfo so each placeholder carries the Datum Object Type of the originating ViaDownloaderDecomposer subclass (e.g. NUMPY_DOWNLOAD). • process_datum() PASS_THROUGH branch: pass datum.dot → _LazyBatchInfo.dot • recompose() PASS_THROUGH branch: pass items.dot → LazyDownloadRef.dot • New LazyDownloadRefDecomposer (auto-registered via register_folder): - decompose(): delegates to get_dot_handler(lazy.dot).decompose(), which re-emits the original server datum via _finalize_lazy_batch post-CB, then appends lazy_dot to the returned encoding dict. - recompose(): uses lazy_dot to look up the handler and delegates to handler.recompose(), which retrieves the real tensor from fobs_ctx (populated by process_datum() at the subprocess). New tests: tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py Five end-to-end tests using real TCP-connected Cell pairs: 1. test_arrays_survive_pass_through_hop — full round-trip correctness 2. test_cj_holds_only_lazy_refs_not_tensor_data — no tensors at CJ 3. test_cj_creates_no_download_transaction — DownloadService unchanged 4. test_forwarded_payload_carries_original_server_ref — subprocess downloads from server, not CJ 5. test_multiple_array_roundtrip — 8-array batch, bit-exact values All 27 B1 tests pass (22 unit + 5 E2E). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a new integration test job that exercises the full B1 PASS_THROUGH stack with a PyTorch model large enough (≈ 8 MB) to trigger the ViaDownloaderDecomposer streaming path (threshold: 2 MB per array). Why this test: Existing pt_client_api_launch_once tests use the CIFAR-10 CNN (~240 KB of parameters) which falls below the 2 MB streaming threshold and therefore serialises tensors natively, bypassing the download/PASS_THROUGH path entirely. The new job forces the real B1 code path: FL Server ──stream──▶ CJ (PASS_THROUGH, LazyDownloadRef only) └──forward──▶ Subprocess └──download──▶ FL Server New files: tests/integration_test/data/jobs/pt_large_model_pass_through/ ├── meta.conf └── app/ ├── config/ │ ├── config_fed_server.conf (PTFileModelPersistor + ScatterAndGather) │ └── config_fed_client.conf (PTClientAPILauncherExecutor, launch_once=true) └── custom/ ├── large_model_net.py (LargeNet: 3-layer MLP, ≈ 8 MB float32) └── large_model_train.py (llm_hf-style while-loop, synthetic data, CPU-only — no dataset download needed) Test suite: tests/integration_test/data/test_configs/standalone_job/client_api.yml + "run pt-large-model-pass-through" entry (no setup/teardown needed) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove three unused imports flagged by CI flake8: - test_pass_through_e2e.py: `from nvflare.fuel.utils import fobs` - test_pass_through_e2e.py: `from nvflare.fuel.utils.fobs import get_dot_handler` - test_pass_through.py: `_LAZY_BATCH_CTX_SUFFIX` All 27 B1 tests still pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Preserve and restore the prior PASS_THROUGH fobs context value so this executor pass-through mode does not leak process-wide beyond the run, including when initialize fails.

Update memory management documentation to reflect current behavior: jemalloc preload is enabled only when NVFLARE_ENABLE_JEMALLOC_PRELOAD is set true, rather than auto-enabled by default.

chesterxgchen · 2026-02-20T22:51:27Z

/build

greptile-apps

_{14 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

chesterxgchen · 2026-02-20T23:02:56Z

Are there any possible incompatibilities with things like filters or other users code that expects real tensors instead of the the lazy references.
There is no filter between Client Job Process and Client Agent (trainer). No other components.

chesterxgchen · 2026-02-20T23:06:13Z

/build

greptile-apps

_{14 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Extend the fake FLContext in client_api_launcher_executor tests to provide identity and job context accessors used by FLComponent logging helpers.

chesterxgchen · 2026-02-20T23:21:15Z

/build

greptile-apps

_{14 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

Use unique server and subprocess FQCNs per test fixture and clean up CoreCell registry entries on teardown to prevent duplicate cell name failures across test runs.

greptile-apps

_{14 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

chesterxgchen · 2026-02-20T23:34:50Z

/build

…ent, hierarchical startup stability Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft: Memory Management (restructured): - Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210) - Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211) - Retain existing TensorDownloader and server-side cleanup content F3 Streaming Reliability and Performance (new section): - HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206) - Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172) - Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167) - RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204) - Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174) Hierarchical FL Startup Stability (new section): - Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209) - Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209) - Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209) - Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209) - Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale) Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ctive client exclusion, and dead-detection debounce (#4209) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ctive client exclusion, and dead-detection debounce (NVIDIA#4209) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR NVIDIA#4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs NVIDIA#4206, NVIDIA#4204, NVIDIA#4174, NVIDIA#4172, NVIDIA#4186, NVIDIA#4211, NVIDIA#4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR NVIDIA#4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…ent, hierarchical startup stability Add three major new sections to flare_272.rst covering work merged after the initial 2.7.2 draft: Memory Management (restructured): - Zero Tensor Copy at CJ process via LazyDownloadRef pass-through (PR NVIDIA#4210) - Client-side memory management: malloc_trim, jemalloc, torch.cuda.empty_cache injected after flare.send() without training script changes (PR NVIDIA#4211) - Retain existing TensorDownloader and server-side cleanup content F3 Streaming Reliability and Performance (new section): - HOL stall mitigation: bounded send_frame() timeout, ACK watchdog, stall detection/recovery with recommended env-var settings (PR NVIDIA#4206) - Stream pool starvation fix: blob callbacks dispatched to dedicated thread pool, preventing stream worker exhaustion (PR NVIDIA#4171/NVIDIA#4172) - Streaming download retry with exponential backoff on timeout (PR NVIDIA#4167) - RxTask self-deadlock fix: stop() deferred until after map_lock released (PR NVIDIA#4204) - Lock contention reduction in produce_item() for concurrent model downloads (PR NVIDIA#4174) Hierarchical FL Startup Stability (new section): - Deployment timeout correctly classified as failure; min_sites check applied at deployment phase (PR NVIDIA#4209) - Startup grace period for dead-client detection (debounce default=true) (PR NVIDIA#4209) - Selective client exclusion on start-job timeout instead of full abort (PR NVIDIA#4209) - Hardened job metadata parsing: TypeError replaced with descriptive RuntimeError (PR NVIDIA#4209) - Recommended config snippets for HPC/Lustre environments (Frontier/ORNL scale) Bug Fixes section updated with all streaming and hierarchical startup fixes. Intro paragraph updated to reflect system hardening scope. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ent, hierarchical startup stability [skip ci] (#4218) ## Merge Dependency > ⚠️ **Depends on #4209** — The *Hierarchical FL Startup Stability* section documents changes introduced by PR #4209 (currently open). **Merge PR #4209 into `2.7` before merging this PR.** All other sections cover already-merged PRs. --- ## Summary This PR updates `docs/release_notes/flare_272.rst` to reflect all major changes merged into the 2.7.x line after the initial 2.7.2 draft, covering three new areas: - **Memory Management** — restructured and expanded with Zero Tensor Copy at CJ (PR #4210) and client-side memory lifecycle management (PR #4211) - **F3 Streaming Reliability and Performance** — new section covering HOL stall mitigation (PR #4206), stream pool starvation fix (PR #4171/#4172), streaming download retry (PR #4167), RxTask self-deadlock fix (PR #4204), and lock contention reduction (PR #4174) - **Hierarchical FL Startup Stability** — new section covering deployment timeout classification, startup grace period, selective client exclusion, and metadata hardening (PR #4209 — pending merge), with recommended config snippets for HPC/Lustre environments The Bug Fixes section and intro paragraph are also updated accordingly. A source-level RST comment has been added above the Hierarchical FL section in the file to alert future maintainers to the merge dependency. ## Merged PRs Documented | PR | Area | Status | |---|---|---| | #4171 / #4172 | Stream pool starvation fix | Merged | | #4174 | Lock contention reduction | Merged | | #4167 | Streaming download retry | Merged | | #4204 | RxTask self-deadlock fix | Merged | | #4206 | HOL stall mitigation | Merged | | #4210 | Zero tensor copy at CJ | Merged | | #4211 | Client-side memory management | Merged | | #4209 | Hierarchical FL startup stability | **Open — merge before this PR** | ## Changes ### Memory Management (restructured) - **Zero Tensor Copy at CJ** (`ClientAPILauncherExecutor`): CJ now holds `LazyDownloadRef` placeholders instead of materializing full tensors, eliminating the CJ as a memory bottleneck for LLM-scale models. - **Client-Side Memory Management**: `gc.collect()` + `malloc_trim(0)` / jemalloc purge / `torch.cuda.empty_cache()` injected after every `flare.send()`, configurable via `client_memory_gc_rounds`. - Existing TensorDownloader and server-side cleanup content retained. ### F3 Streaming Reliability and Performance (new section) - **HOL Stall Mitigation**: Bounded `send_frame()` timeout, ACK-progress watchdog, and stall detection/recovery. Includes recommended environment variable settings for large hierarchical deployments. - **Stream Pool Starvation Fix**: Blob callbacks dispatched to a dedicated `callback_thread_pool`, keeping stream workers free for concurrent downloads. - **Streaming Download Retry**: Exponential-backoff retry (up to 3 attempts, capped at 60 s) on `TIMEOUT` errors; abort-signal aware. - **RxTask Self-Deadlock Fix**: `stop()` deferred until after `map_lock` released, eliminating stream-error-triggered deadlock. - **Lock Contention Reduction**: `produce_item()` runs outside `self.lock`; compare-and-store for cache write. Reduces model-download latency under high client concurrency. ### Hierarchical FL Startup Stability (new section — pending PR #4209) - **Deployment Timeout as Failure**: `reply=None` correctly counted against `min_sites`; timed-out clients excluded before `start_client_job`. - **Startup Grace Period**: Dead-client detection debounced — client must be observed once before absence triggers dead-job notification. Default changed to `True`. - **Selective Client Exclusion**: Stragglers at start-job excluded rather than causing full abort, if remaining count ≥ `min_clients`. - **Metadata Hardening**: `TypeError` on absent job metadata replaced with descriptive `RuntimeError`. - Recommended `config_fed_server.json` / `config_fed_client.json` snippets for HPC (Frontier/ORNL) scale. ## Test plan - [ ] Sphinx build (`make html`) passes without RST warnings on the updated file - [ ] All new cross-references (`.. code-block::`, `.. note::`) render correctly in the docs build - [ ] Verify section hierarchy (underline characters) is consistent throughout the file - [ ] Confirm PR #4209 is merged before this PR is merged 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…meouts, selective client exclusion, and dead-detection debounce (#4209) (#4288) ## Problem Large-scale hierarchical FL jobs (e.g. BERT NER, 144 clients, 6 relays on Frontier) abort in Round 0 due to a cascading startup failure chain. The root sequence is: 1. F3 streaming HOL stall (PR #4206) delays deployment ACKs from relay-connected clients 2. **`_deploy_job()`** treats `reply=None` (timeout) as `"unknown"` — not a failure — so timed-out clients silently appear to have been deployed 3. **`_start_run()`** tries to start those clients; they again time out, and `check_client_replies()` ignores the `None` reply 4. **`_sync_client_jobs()`** fires dead-job notification on the very first heartbeat with no startup grace period 5. FedAvg requires 144/144 — one or two missing clients → abort 6. A late-starting CJ crashes with `TypeError: 'NoneType' object is not iterable` when `get_job_clients()` receives `None` metadata from an already-aborted job PRs #4206, #4204, #4174, #4172, #4186, #4211, #4210 (all merged in 2.7.2) address the transport layer. This PR addresses the remaining job lifecycle layer. --- ## Fixes Included ### 1 — `_deploy_job()`: Treat deployment timeout as failure (`job_runner.py`) **Root bug**: `reply=None` was logged as `"unknown"` and excluded from `failed_clients`, so timed-out clients counted as "successfully deployed" for the `min_sites` check. **Fix**: Add timed-out clients to `failed_clients` with a `"deployment timeout"` label. The existing `min_sites` / `required_sites` logic then correctly decides whether to abort. ### 2 — `check_client_replies()`: Return timed-out clients instead of raising (`admin.py`) **Root bug**: In strict mode, any timeout raised immediately, aborting the whole job even when the remaining active clients satisfied `min_sites`. **Fix**: In strict mode, collect timed-out clients into a return list rather than raising. Explicit errors (non-OK return code or error body) still raise. Also fixes the non-strict mode to use name-keyed dict lookup instead of fragile positional `zip()`. New signature: `check_client_replies(...) -> List[str]` (timed-out client names; empty = none). ### 3 — `_start_run()`: Selective exclusion with min_sites re-evaluation (`job_runner.py`) **Root bug**: A start-job timeout under strict mode aborted the entire job with no tolerance for stragglers within `min_sites` bounds. **Fix**: Use the returned timed-out list from `check_client_replies()`. If remaining active clients >= `min_sites`, log a warning and proceed. Only abort when below tolerance. ### 4 — `_sync_client_jobs()`: Require-prior-report default changed to `True` (`fed_server.py`) **Root bug**: `SYNC_CLIENT_JOBS_REQUIRE_PREVIOUS_REPORT` defaulted to `False`, meaning the bug fix was opt-in and the unsafe behaviour remained the default. **Fix**: Default changed to `True`. Operators who want the aggressive legacy detection can set it to `False` explicitly. ### 5 — `_sync_client_jobs()`: Move `_reported_clients` out of `job_info` dict (`fed_server.py`) **Root bug**: Positive-observation tracking was stored as `job_info["_reported_clients"]`, injecting algorithm state into a data dict with no corresponding `RunProcessKey` constant. **Fix**: Tracking moved to `self._job_reported_clients: Dict[str, set]` on `FederatedServer`. Stale entries are purged whenever a job is no longer in `run_processes`. ### 6 — `ClientRunManager.get_job_clients()`: Explicit meta validation (`client_run_manager.py`) Raises `RuntimeError` with a descriptive message instead of an opaque `TypeError` when `JOB_CLIENTS` is absent or the wrong type. --- ## Configuration Recommendations (No Code Change Needed) | Setting | Recommended value | Effect | |---|---|---| | `FedAvg(min_clients=...)` | 96-98% of `num_clients` | Tolerates a few startup stragglers | | `runner_sync_timeout` | `120` s | Allows Lustre-backed deployments time to complete | | `strict_start_job_reply_check` | `true` | Start-job timeouts surfaced, straggler clients excluded | | `sync_client_jobs_require_previous_report` | `true` (now the default) | Prevents premature dead-job from startup delay | | `SFM_CLOSE_STALLED_CONNECTION` (PR #4206) | `true` after staging | Disconnects stalled relay connections | --- ## Files Changed - `nvflare/private/fed/server/job_runner.py` — `_deploy_job()` timeout as failure; `_start_run()` selective exclusion - `nvflare/private/fed/server/admin.py` — `check_client_replies()` returns timed-out list; dict-keyed non-strict path - `nvflare/private/fed/server/fed_server.py` — `_sync_client_jobs()` default `True`; `_job_reported_clients` attr; stale cleanup - `nvflare/private/fed/client/client_run_manager.py` — explicit meta validation in `get_job_clients()` --- ## Test Coverage New and updated unit tests with both positive and negative cases: | File | Tests | What they cover | |---|---|---| | `admin_test.py` | 8 | Timeout returned not raised; dict lookup; error still raises; reorder OK | | `job_runner_test.py` | 4 | strict flag wiring; timeout within tolerance → warn; timeout below tolerance → raise | | `job_runner_deploy_test.py` | 9 (new file) | Timeout counted as failure; OK reply not failed; mixed outcomes; detail label; min_sites with timeouts; integration sequence | | `fed_server_test.py` | 5 | Default requires-prior-report; legacy explicit-False still fires; tracking in server attr not job_info; stale cleanup | All 29 targeted unit tests pass. ## Test Plan - [x] Unit tests for each changed function (positive + negative) - [x] New `job_runner_deploy_test.py` covering deployment timeout classification end-to-end - [x] All 29 targeted unit tests pass - [ ] Hierarchical staging run with all flags at default - [ ] Hierarchical staging run with `strict_start_job_reply_check=true` and reduced `min_clients` - [ ] Verify no regression on standard (non-hierarchical) FL jobs --------- Fixes # . ### Description A few sentences describing the changes proposed in this pull request. ### Types of changes  - [x] Non-breaking change (fix or new feature that would not break existing functionality). - [ ] Breaking change (fix or new feature that would cause existing functionality to change). - [ ] New tests added to cover the changes. - [ ] Quick tests passed locally by running `./runtest.sh`. - [ ] In-line docstrings updated. - [ ] Documentation updated. Co-authored-by: Chester Chen <512707+chesterxgchen@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: Peter Cnudde <pcnudde@nvidia.com>

chesterxgchen changed the title ~~[2.7] B1 Pass-Through: Zero Tensor Copy at CJ for Large-Model Federated Training~~ [2.7] Pass-Through: Zero Tensor Copy at CJ for Large-Model Federated Training Feb 20, 2026

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

chesterxgchen force-pushed the 2.7_client_api branch from d6c659a to 5f6958c Compare February 20, 2026 06:38

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

tests/memory_profile/fedavg/mprofile_20260124221746.dat Outdated Show resolved Hide resolved

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

tests/memory_profile/fedavg/mprofile_20260124221746.dat Outdated Show resolved Hide resolved

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

pcnudde reviewed Feb 20, 2026

View reviewed changes

nvflare/app_common/executors/client_api_launcher_executor.py Show resolved Hide resolved

chesterxgchen and others added 12 commits February 20, 2026 14:46

reformat files

7263503

[2.7] Add memory profile .dat files to .gitignore

e974093

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[2.7] Remove memory profile .dat output files from version control

c7c1bd1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

[2.7] Scope PASS_THROUGH to client API launcher lifecycle

b62bfd1

Preserve and restore the prior PASS_THROUGH fobs context value so this executor pass-through mode does not leak process-wide beyond the run, including when initialize fails.

[2.7] Clarify jemalloc preload requires explicit opt in

b80c037

Update memory management documentation to reflect current behavior: jemalloc preload is enabled only when NVFLARE_ENABLE_JEMALLOC_PRELOAD is set true, rather than auto-enabled by default.

chesterxgchen force-pushed the 2.7_client_api branch from 1f7674d to b80c037 Compare February 20, 2026 22:50

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

code style format

b925923

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

[2.7] Fix launcher pass-through tests for FLContext logging API

bdc5568

Extend the fake FLContext in client_api_launcher_executor tests to provide identity and job context accessors used by FLComponent logging helpers.

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

[2.7] Avoid CoreCell name collisions in pass-through E2E test

741b460

Use unique server and subprocess FQCNs per test fixture and clean up CoreCell registry entries on teardown to prevent duplicate cell name failures across test runs.

greptile-apps bot reviewed Feb 20, 2026

View reviewed changes

pcnudde approved these changes Feb 20, 2026

View reviewed changes

chesterxgchen merged commit ebcb3ec into NVIDIA:2.7 Feb 20, 2026
19 checks passed

chesterxgchen deleted the 2.7_client_api branch February 20, 2026 23:58

chesterxgchen mentioned this pull request Feb 22, 2026

[2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce #4209

Merged

6 tasks

chesterxgchen mentioned this pull request Feb 22, 2026

[2.7] Update 2.7.2 release notes: streaming hardening, memory management, hierarchical startup stability [skip ci] #4218

Merged

4 tasks

YuanTingHsieh mentioned this pull request Mar 11, 2026

Cherry-pick [2.7] Fix hierarchical FL startup failures: deployment timeouts, selective client exclusion, and dead-detection debounce (#4209) #4288

Merged

12 tasks

greptile-apps bot mentioned this pull request Mar 11, 2026

Cherry-pick [2.7] Pass-Through: Zero Tensor Copy at CJ for Large-Model Federated Training (#4210) #4289

Open

6 tasks

Conversation

chesterxgchen commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem Description

Before this PR

Proposed Architecture and Solution

Pass-Through Architecture

Key Components

1. FOBSContextKey.PASS_THROUGH (nvflare/fuel/utils/fobs/__init__.py)

2. LazyDownloadRef (via_downloader.py)

3. _LazyBatchInfo (via_downloader.py)

4. LazyDownloadRefDecomposer (via_downloader.py)

5. ClientAPILauncherExecutor.initialize() (client_api_launcher_executor.py)

Analysis

Why not intercept at the pipe level?

Why dot (Datum Object Type) on LazyDownloadRef?

Memory profile at CJ

Benefits

Backward Compatibility

Changed Files

Test Strategy

Unit tests — tests/unit_test/fuel/utils/fobs/test_pass_through.py (22 tests)

End-to-end tests — tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py (5 tests, real TCP Cells)

Integration test — tests/integration_test/data/jobs/pt_large_model_pass_through/

Uh oh!

greptile-apps bot commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

pcnudde left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

chesterxgchen commented Feb 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

chesterxgchen commented Feb 20, 2026 •

edited

Loading

1. `FOBSContextKey.PASS_THROUGH` (`nvflare/fuel/utils/fobs/init.py`)

2. `LazyDownloadRef` (`via_downloader.py`)

3. `_LazyBatchInfo` (`via_downloader.py`)

4. `LazyDownloadRefDecomposer` (`via_downloader.py`)

5. `ClientAPILauncherExecutor.initialize()` (`client_api_launcher_executor.py`)

Why `dot` (Datum Object Type) on `LazyDownloadRef`?

Unit tests — `tests/unit_test/fuel/utils/fobs/test_pass_through.py` (22 tests)

End-to-end tests — `tests/unit_test/fuel/f3/streaming/test_pass_through_e2e.py` (5 tests, real TCP Cells)

Integration test — `tests/integration_test/data/jobs/pt_large_model_pass_through/`

greptile-apps bot commented Feb 20, 2026 •

edited

Loading