Fix server hanging on aborted TCP comms (flaky test_RetireWorker_stress) #9315
Unit Test Results: See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 40 files ±0, 40 suites ±0, 14h 33m 32s ⏱️ +18m 7s. For more details on these failures, see this check. Results for commit 93329ec. ± Comparison against base commit 2d43f7f. This pull request removes 2 and adds 1 tests. Note that renamed tests count towards both.
6d5dd30 to c68944e
If Listener.stop() runs after a connection has been accepted but before
_handle_stream gets to run, _handle_stream would crash with
ValueError('invalid operation on non-started TCPListener') when reading
self.contact_address, abandoning the accepted stream without closing it.
Since the comm handshake in connect() is deliberately not subject to the
connect timeout (dask#7698), the client would then hang forever waiting for the
server's handshake reply. abort_handshaking_comms() cannot help, as the comm
never reached on_connection().
This is what deadlocks test_RetireWorker_stress: a worker's gather_dep from a
closing worker gets stuck forever in the handshake, pinning the key in flight
state; the AMM RetireWorker policy then re-suggests the same transfer every
interval, but the recipient ignores it because the key is already in flight,
so retire_workers never completes. It also leaked one file descriptor (the
abandoned socket) every time the race hit without deadlocking.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Tasks that complete on a retiring worker between the moment the AMM RetireWorker policy observes that no unique keys are left on it and the moment the worker is removed are lost and recomputed elsewhere. This is a documented design decision (see RetireWorker.done), so the stress test must not treat these recomputes as failures. Count them through the remove-worker events and loosen the transition_log assertion accordingly.
Fixes the rare `assert 1641 == 1638` flavour of CI failures, distinct from the deadlock fixed in the previous commit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This reverts commit e4594f8.
    # would hang forever in the comm handshake, which is deliberately not
    # subject to timeouts (see distributed.comm.core.connect()).
    stream.close()
    return
Additional commentary by Claude:
If Listener.stop() runs after a connection has been accepted but before
_handle_stream gets to run, _handle_stream would crash with
ValueError('invalid operation on non-started TCPListener') when reading
self.contact_address, abandoning the accepted stream without closing it.
Since the comm handshake in connect() is deliberately not subject to the
connect timeout (#7698), the client would then hang forever waiting for the
server's handshake reply. abort_handshaking_comms() cannot help, as the comm
never reached on_connection().
This is what deadlocks test_RetireWorker_stress: a worker's gather_dep from a
closing worker gets stuck forever in the handshake, pinning the key in flight
state; the AMM RetireWorker policy then re-suggests the same transfer every
interval, but the recipient ignores it because the key is already in flight,
so retire_workers never completes. It also leaked one file descriptor (the
abandoned socket) every time the race hit without deadlocking.
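The race described above can be sketched in isolation. This is a minimal simulation, not the code from distributed: `FakeStream`, `SketchListener`, and `race` are hypothetical names, and the only behaviour taken from this PR is the guard that closes the accepted stream and returns when the listener has already been stopped.

```python
import asyncio


class FakeStream:
    """Hypothetical stand-in for an accepted socket stream."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class SketchListener:
    """Illustrative listener only; this is not distributed's TCPListener."""

    def __init__(self):
        self.started = False
        self.handshakes = 0

    def start(self):
        self.started = True

    def stop(self):
        self.started = False

    @property
    def contact_address(self):
        if not self.started:
            raise ValueError("invalid operation on non-started listener")
        return "tcp://127.0.0.1:1234"  # made-up address for the sketch

    async def handle_stream(self, stream):
        if not self.started:
            # stop() ran after the connection was accepted: close the stream
            # so the peer sees EOF instead of waiting for a handshake reply.
            stream.close()
            return
        _ = self.contact_address  # would raise if the guard above were missing
        self.handshakes += 1


async def race():
    listener = SketchListener()
    listener.start()
    stream = FakeStream()
    # The connection is accepted and its handler is scheduled, but not yet run.
    task = asyncio.create_task(listener.handle_stream(stream))
    # stop() wins the race, before the handler gets its first turn.
    listener.stop()
    await task
    return stream.closed, listener.handshakes


result = asyncio.run(race())
```

Here `result` is `(True, 0)`: the stream is closed and no handshake is attempted. Without the guard, reading `contact_address` in this sketch would raise and leave the stream open, which is the situation in which the peer would wait on the handshake indefinitely.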
        if msg["action"] == "remove-worker" and msg["expected"]
    )
    actual = sum(t.start == "memory" for t in s.transition_log)
    assert expected_tasks <= actual <= expected_tasks + lost
Additional commentary by Claude:
Tasks that complete on a retiring worker between the moment the AMM
RetireWorker policy observes that no unique keys are left on it and the moment
the worker is removed are lost and recomputed elsewhere. This is a documented
design decision (see RetireWorker.done), so the stress test must not treat
these recomputes as failures. Count them through the remove-worker events and
loosen the transition_log assertion accordingly.
This change fixes the rare `assert 1641 == 1638` flavour of CI failures, distinct from the
deadlock fixed in the previous commit.
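The loosened assertion can be illustrated with made-up data. In this sketch the `Transition` tuple, the `check_transitions` helper, and the `"lost"` field on each event are assumptions introduced for illustration; only the `action`/`expected` filter and the bounds `expected_tasks <= actual <= expected_tasks + lost` mirror the diff above. The split of the three lost tasks across events is invented to match the 1641 vs 1638 figures.

```python
from collections import namedtuple

# Simplified, hypothetical shape of a transition log entry.
Transition = namedtuple("Transition", ["key", "start", "finish"])


def check_transitions(transition_log, events, expected_tasks):
    # Sum the tasks reported as lost by expected worker removals.
    # The "lost" field is an assumption made for this sketch.
    lost = sum(
        msg["lost"]
        for msg in events
        if msg["action"] == "remove-worker" and msg["expected"]
    )
    # Count transitions whose start state is "memory", as in the diff.
    actual = sum(t.start == "memory" for t in transition_log)
    # Allow up to `lost` extra transitions caused by recomputation.
    assert expected_tasks <= actual <= expected_tasks + lost, (
        expected_tasks,
        actual,
        lost,
    )
    return actual, lost


expected_tasks = 1638
# 1641 entries: 1638 tasks plus 3 that were lost and recomputed.
log = [
    Transition(f"x-{i % expected_tasks}", "memory", "released")
    for i in range(1641)
]
events = [
    {"action": "remove-worker", "expected": True, "lost": 2},
    {"action": "remove-worker", "expected": False, "lost": 5},  # ignored
    {"action": "add-worker"},  # ignored
    {"action": "remove-worker", "expected": True, "lost": 1},
]

outcome = check_transitions(log, events, expected_tasks)
strict_equality_holds = outcome[0] == expected_tasks
```

With this data `outcome` is `(1641, 3)` and `strict_equality_holds` is `False`, so a strict `==` check would fail while the bounded check passes.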
Fix race condition that would cause a server to hang when a TCP connection is shut down while it is halfway through being opened (see below for details).