Skip to content

Fix server hanging on aborted TCP comms (flaky test_RetireWorker_stress)#9315

Merged
crusaderky merged 6 commits into
dask:mainfrom
crusaderky:test_RetireWorker_stress
Jul 2, 2026
Merged

Fix server hanging on aborted TCP comms (flaky test_RetireWorker_stress)#9315
crusaderky merged 6 commits into
dask:mainfrom
crusaderky:test_RetireWorker_stress

Conversation

@crusaderky

@crusaderky crusaderky commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator

Fix race condition that would cause a server to hang when a TCP connection is shut down while it is halfway through being opened (see below for details).

@crusaderky crusaderky requested a review from fjetter as a code owner July 1, 2026 21:45
@crusaderky crusaderky added flaky test Intermittent failures on CI. tests Unit tests and/or continuous integration labels Jul 1, 2026
@crusaderky crusaderky marked this pull request as draft July 1, 2026 21:46
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    40 files  ±    0      40 suites  ±0   14h 33m 32s ⏱️ + 18m 7s
 4 154 tests  -     1   3 973 ✅ +    3    178 💤  -  2  3 ❌  - 2 
80 841 runs  +1 421  76 597 ✅ +1 385  4 241 💤 +38  3 ❌  - 2 

For more details on these failures, see this check.

Results for commit 93329ec. ± Comparison against base commit 2d43f7f.

This pull request removes 2 and adds 1 tests. Note that renamed tests count towards both.
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---nanny]
distributed.cli.tests.test_dask_worker.test_listen_address_ipv6[tcp:..[ ‑ 1]:---no-nanny]
distributed.comm.tests.test_comms ‑ test_stop_listener_during_handle_stream[tornado]

♻️ This comment has been updated with latest results.

@crusaderky crusaderky force-pushed the test_RetireWorker_stress branch from 6d5dd30 to c68944e Compare July 2, 2026 11:19
crusaderky and others added 4 commits July 2, 2026 13:36
If Listener.stop() runs after a connection has been accepted but before
_handle_stream gets to run, _handle_stream would crash with
ValueError('invalid operation on non-started TCPListener') when reading
self.contact_address, abandoning the accepted stream without closing it.

Since the comm handshake in connect() is deliberately not subject to the
connect timeout (dask#7698), the client would then hang forever waiting for the
server's handshake reply. abort_handshaking_comms() cannot help, as the comm
never reached on_connection().

This is what deadlocks test_RetireWorker_stress: a worker's gather_dep from a
closing worker gets stuck forever in the handshake, pinning the key in flight
state; the AMM RetireWorker policy then re-suggests the same transfer every
interval, but the recipient ignores it because the key is already in flight,
so retire_workers never completes. It also leaked one file descriptor (the
abandoned socket) every time the race hit without deadlocking.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Tasks that complete on a retiring worker between the moment the AMM
RetireWorker policy observes that no unique keys are left on it and the moment
the worker is removed are lost and recomputed elsewhere. This is a documented
design decision (see RetireWorker.done), so the stress test must not treat
these recomputes as failures. Count them through the remove-worker events and
loosen the transition_log assertion accordingly.

Fixes the rare `assert 1641 == 1638` flavour of CI failures, distinct from the
deadlock fixed in the previous commit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This reverts commit e4594f8.
Comment thread distributed/comm/tcp.py
# would hang forever in the comm handshake, which is deliberately not
# subject to timeouts (see distributed.comm.core.connect()).
stream.close()
return

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional commentary by Claude:

If Listener.stop() runs after a connection has been accepted but before
_handle_stream gets to run, _handle_stream would crash with
ValueError('invalid operation on non-started TCPListener') when reading
self.contact_address, abandoning the accepted stream without closing it.

Since the comm handshake in connect() is deliberately not subject to the
connect timeout (#7698), the client would then hang forever waiting for the
server's handshake reply. abort_handshaking_comms() cannot help, as the comm
never reached on_connection().

This is what deadlocks test_RetireWorker_stress: a worker's gather_dep from a
closing worker gets stuck forever in the handshake, pinning the key in flight
state; the AMM RetireWorker policy then re-suggests the same transfer every
interval, but the recipient ignores it because the key is already in flight,
so retire_workers never completes. It also leaked one file descriptor (the
abandoned socket) every time the race hit without deadlocking.

if msg["action"] == "remove-worker" and msg["expected"]
)
actual = sum(t.start == "memory" for t in s.transition_log)
assert expected_tasks <= actual <= expected_tasks + lost

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional commentary by Claude:

Tasks that complete on a retiring worker between the moment the AMM
RetireWorker policy observes that no unique keys are left on it and the moment
the worker is removed are lost and recomputed elsewhere. This is a documented
design decision (see RetireWorker.done), so the stress test must not treat
these recomputes as failures. Count them through the remove-worker events and
loosen the transition_log assertion accordingly.

This change fixes the rare assert 1641 == 1638 flavour of CI failures, distinct from the
deadlock fixed in the previous commit.

@crusaderky crusaderky marked this pull request as ready for review July 2, 2026 13:46
@crusaderky crusaderky changed the title Fix flaky test_RetireWorker_stress Fix server hanging on aborted TCP comms (flaky test_RetireWorker_stress) Jul 2, 2026
@crusaderky crusaderky added stability Issue or feature related to cluster stability (e.g. deadlock) bug Something is broken p2 Affects more than a few users but doesn't prevent core functions labels Jul 2, 2026
@crusaderky crusaderky merged commit 4db4e05 into dask:main Jul 2, 2026
47 of 51 checks passed
@crusaderky crusaderky deleted the test_RetireWorker_stress branch July 2, 2026 14:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something is broken flaky test Intermittent failures on CI. networking p2 Affects more than a few users but doesn't prevent core functions stability Issue or feature related to cluster stability (e.g. deadlock) tests Unit tests and/or continuous integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant