[2.7] Fix Swarm deadlock: _executing guard prevents pipe handler replacement mid-transaction #4314
…acement mid-transaction
BEFORE_TASK_EXECUTION is a global broadcast to all FLComponents. In a CCWF Swarm aggregator node, receiving swarm_report_learn_result aux tasks from other sites while the local subprocess is training fires this event concurrently, causing TaskExchanger.handle_event() to stop and recreate the PipeHandler while execute() is blocked in its polling loop. The new handler's queue is empty, so get_next() returns None forever: a silent deadlock with no error or timeout.
Fix: add threading.Event _executing as a guard flag. handle_event skips pipe handler replacement when the flag is set. TaskExchanger.execute() uses an ownership pattern (acquired = not is_set()) so it only clears the flag if it set it, preserving the flag across the super().execute() call from LauncherExecutor. LauncherExecutor.execute() sets the flag unconditionally at the top, before _initialize_external_execution(), covering the up-to-60 s _wait_external_setup() window that the base-class guard would miss.
Affects all pipe types (FilePipe, CellPipe); independent of the FilePipe TOCTOU fix in NVIDIA#4296.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Fixes a Swarm/CCWF deadlock where concurrent BEFORE_TASK_EXECUTION events could reset TaskExchanger’s PipeHandler mid-transaction, orphaning the handler the subprocess is writing to and causing the executor to poll an empty handler forever.
Changes:
- Add a _executing (threading.Event) guard to TaskExchanger to skip pipe-handler reset while execute() is in progress, and refactor execute() to use an ownership pattern.
- Update LauncherExecutor.execute() to set _executing at the very start (covering external init/setup) and clear it in a finally.
- Add unit tests covering _executing lifecycle, handler-reset suppression, and initialization-time concurrency behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| nvflare/app_common/executors/task_exchanger.py | Introduces _executing guard + execute() ownership refactor to prevent handler reset during execution. |
| nvflare/app_common/executors/launcher_executor.py | Sets _executing at the start of execute() (before external init) and clears it at the end. |
| tests/unit_test/app_common/executors/client_api_launcher_executor_test.py | Adds regression/unit tests for _executing behavior and concurrent BEFORE_TASK_EXECUTION handling. |
Greptile Summary
This PR fixes a silent Swarm deadlock where …
Key changes: …
One style observation: In …
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant FW as FL Framework
participant LE as LauncherExecutor.execute()
participant TE as TaskExchanger.execute()
participant HE as handle_event(BEFORE_TASK_EXECUTION)
participant PH as PipeHandler
Note over FW: Local training task assigned
FW->>HE: fire BEFORE_TASK_EXECUTION (local task)
HE->>HE: _executing_lock: is_set()=False → skip=False
HE->>PH: stop(old_handler, close_pipe=False)
HE->>PH: _create_pipe_handler() → new_handler
HE->>PH: new_handler.start()
FW->>LE: execute(train_task)
LE->>LE: _executing.set() [no lock needed]
LE->>LE: _initialize_external_execution() [up to 60s]
Note over FW: Concurrent aux task from swarm peer
FW-->>HE: fire BEFORE_TASK_EXECUTION (aux task)
HE->>HE: _executing_lock: is_set()=True → skip=True
HE-->>HE: log_debug + return (handler NOT replaced ✓)
LE->>TE: super().execute()
TE->>TE: _executing_lock: acquired=False (already set)
TE->>PH: _do_execute() — polls new_handler
PH-->>TE: result from subprocess
TE-->>LE: return result
LE->>LE: _finalize_external_execution()
LE->>LE: finally: _executing.clear()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The bare read-then-write on threading.Event was not atomic: a concurrent BEFORE_TASK_EXECUTION on the CellNet thread could pass the is_set() check between steps (1) read and (2) set in TaskExchanger.execute(), allowing the handler to be replaced in exactly the window the guard was meant to prevent. Add threading.Lock _executing_lock and hold it for both the check-and-set in execute() and the guard check in handle_event(BEFORE_TASK_EXECUTION), making the two operations mutually exclusive. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
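The check-and-set described in this commit can be sketched roughly as follows. This is a minimal illustration, not the NVFlare code: `GuardedExchanger`, `try_acquire`, `release`, `on_before_task_execution`, and `handler_generation` are hypothetical names, and only `_executing` and `_executing_lock` come from the commit message. The sketch also performs the simulated handler replacement while holding the lock, which is a simplification that may differ from the PR.

```python
import threading


class GuardedExchanger:
    """Hypothetical stand-in for TaskExchanger; only the guard logic is modeled."""

    def __init__(self):
        self._executing = threading.Event()
        self._executing_lock = threading.Lock()
        self.handler_generation = 0  # bumped each time the handler is "replaced"

    def try_acquire(self) -> bool:
        # Read and write happen under one lock, so the event handler cannot
        # run between the is_set() check and the set() call.
        with self._executing_lock:
            acquired = not self._executing.is_set()
            if acquired:
                self._executing.set()
        return acquired

    def release(self, acquired: bool) -> None:
        # Ownership pattern: only the caller that set the flag clears it.
        if acquired:
            self._executing.clear()

    def on_before_task_execution(self) -> bool:
        # Returns True if the handler was replaced, False if the guard skipped it.
        with self._executing_lock:
            if self._executing.is_set():
                return False
            self.handler_generation += 1  # simplification: replace under the lock
            return True
```

Because both paths take the same lock, an event arriving during `try_acquire()` either sees the flag clear and finishes its check first, or sees the flag set and skips the replacement.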
/build
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
…tor_test.py Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Problem
BEFORE_TASK_EXECUTION is a global event broadcast to all FLComponents. In a CCWF Swarm aggregator node, receiving swarm_report_learn_result aux tasks from other sites while the local subprocess is training fires this event concurrently. This causes TaskExchanger.handle_event() to stop and recreate the PipeHandler while execute() is blocked in its polling loop. The old handler, which the subprocess is still writing to, is orphaned, and the polling loop reads from the new, empty handler forever. The result is a silent deadlock: no error, no PEER_GONE, no timeout.
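The failure mode can be reproduced in miniature with a toy model. The sketch below is an assumption-laden illustration rather than NVFlare code: `ToyHandler` and `ToyExchanger` are hypothetical stand-ins for PipeHandler and TaskExchanger, and the polling loop is capped at a fixed number of iterations so the demo terminates instead of hanging.

```python
import queue


class ToyHandler:
    """Hypothetical stand-in for PipeHandler: just a message queue."""

    def __init__(self):
        self._messages = queue.Queue()

    def put(self, msg):
        self._messages.put(msg)

    def get_next(self):
        # Non-blocking read; None means "nothing received yet".
        try:
            return self._messages.get_nowait()
        except queue.Empty:
            return None


class ToyExchanger:
    """Hypothetical stand-in for TaskExchanger WITHOUT the guard."""

    def __init__(self):
        self.pipe_handler = ToyHandler()

    def handle_before_task_execution(self):
        # Unguarded: always swaps in a fresh, empty handler.
        self.pipe_handler = ToyHandler()

    def poll(self, max_polls):
        # The loop described above is unbounded; it is capped here for the demo.
        for _ in range(max_polls):
            reply = self.pipe_handler.get_next()
            if reply is not None:
                return reply
        return None


exchanger = ToyExchanger()
subprocess_side = exchanger.pipe_handler   # handler the trainer writes to
exchanger.handle_before_task_execution()   # concurrent aux task fires the event
subprocess_side.put("train_result")        # result lands in the orphaned handler
stuck = exchanger.poll(max_polls=1000)     # the new handler never sees it
```

Here `stuck` ends up as `None` even though the result was delivered, because it was delivered to a handler that nothing reads from any more.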
Affects all pipe types (FilePipe, CellPipe). Independent of the FilePipe TOCTOU fix in #4296.
Fix
Add threading.Event _executing as a guard flag:
- handle_event(BEFORE_TASK_EXECUTION) skips handler replacement when _executing.is_set()
- TaskExchanger.execute() uses an ownership pattern so super().execute() from LauncherExecutor does not prematurely clear the flag
- LauncherExecutor.execute() sets the flag at the very top, before _initialize_external_execution(), covering the up-to-60 s _wait_external_setup() window
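The three points above can be sketched as a pair of toy classes. This is a hedged illustration under stated assumptions: `BaseExecutor` and `Launcher` are hypothetical stand-ins for TaskExchanger and LauncherExecutor, `replacements` and `flag_after_super` exist only to make the behavior observable, concurrent events are simulated by direct calls, and the lock added in a later commit is omitted for brevity.

```python
import threading


class BaseExecutor:
    """Hypothetical stand-in for TaskExchanger."""

    def __init__(self):
        self._executing = threading.Event()
        self.replacements = 0  # counts simulated handler replacements

    def handle_before_task_execution(self):
        if self._executing.is_set():
            return  # a task is mid-transaction: keep the current handler
        self.replacements += 1

    def execute(self):
        # Ownership pattern: remember whether this call set the flag.
        acquired = not self._executing.is_set()
        if acquired:
            self._executing.set()
        try:
            return self._do_execute()
        finally:
            if acquired:
                self._executing.clear()

    def _do_execute(self):
        self.handle_before_task_execution()  # simulated concurrent event
        return "result"


class Launcher(BaseExecutor):
    """Hypothetical stand-in for LauncherExecutor."""

    def __init__(self):
        super().__init__()
        self.flag_after_super = None

    def execute(self):
        self._executing.set()  # set before the slow external setup begins
        try:
            self.handle_before_task_execution()  # event during setup: skipped
            result = super().execute()
            # The base class did not own the flag, so it must still be set.
            self.flag_after_super = self._executing.is_set()
            return result
        finally:
            self._executing.clear()
```

The design choice worth noting is that the base class clears the flag only when it set it, so the subclass can widen the guarded window to include setup and finalization without the inner call cutting it short.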
Types of changes
./runtest.sh.