Skip to content

Exclude comm handshake from connect timeout#7698

Merged
fjetter merged 16 commits into
dask:mainfrom
fjetter:remove_handshake_from_timeout
Jul 28, 2023
Merged

Exclude comm handshake from connect timeout#7698
fjetter merged 16 commits into
dask:mainfrom
fjetter:remove_handshake_from_timeout

Conversation

@fjetter

@fjetter fjetter commented Mar 22, 2023

Copy link
Copy Markdown
Member

There is not a very specific issue this is fixing but I've come across many connection timeouts over the time and some of them are actually caused by a timed out handshake.

There is no reason for the handshake to be included in the timeout. The timeout is primarily there to ensure that the remote actually exists and we're not attempting to connect to an otherwise dead remote.

Particularly in situations when the event loop is blocked, the handshake can actually take quite a while, particularly if read/write is done sequentially as on main. However, once the connection is established, the entire operation is mostly about sending/receiving some ordinary network messages and we're not timing out such sendrcv operations anywhere else. I believe it is very generally true that the handshake is typically not slow due to network but only due to server load.

@fjetter fjetter requested a review from graingert March 22, 2023 16:50
Comment thread distributed/comm/core.py
Comment thread distributed/comm/core.py
@github-actions

github-actions Bot commented Mar 22, 2023

Copy link
Copy Markdown
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       20 files  ±  0         20 suites  ±0   10h 47m 25s ⏱️ - 20m 21s
  3 703 tests  -   1    3 594 ✔️  -   4     106 💤 ±0  3 +3 
35 816 runs   - 10  34 061 ✔️  - 17  1 748 💤 ±0  7 +7 

For more details on these failures, see this check.

Results for commit bd2ba8f. ± Comparison against base commit 566fd1f.

This pull request removes 1 test.
distributed.tests.test_client ‑ test_client_active_bad_port

♻️ This comment has been updated with latest results.



@gen_test()
async def test_client_active_bad_port():

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http servers wait for a \r\n before parsing the request line and so we never get a handshake back from those servers so this test fails

graingert and others added 6 commits May 4, 2023 11:56
convert test_single_executable_deprecated into a sync test
wait_for_log_line was intermittently blocking the handshake from occuring
adding the extra asyncio.gather made this more likely
Co-authored-by: Thomas Grainger <tagrain@gmail.com>
`await comm.write()` doesn't actually yield to the event loop
(unless to_frames uses a thread) and so introducing `asyncio.gather`
actually makes the handshake slower, which seems to be introducing
more flaky test failures
@graingert graingert force-pushed the remove_handshake_from_timeout branch from a82f162 to d7214e2 Compare May 4, 2023 11:00
@fjetter

fjetter commented May 16, 2023

Copy link
Copy Markdown
Member Author

✅ From my end. @graingert feel free to merge

@hendrikmakait

Copy link
Copy Markdown
Member

@graingert: There's a lot of red on CI including failures like FAILED distributed/tests/test_stress.py::test_stress_creation_and_deletion - RuntimeError: ('Unclosed Comms', [<TCP (closed) local=tcp://127.0.0.1:46813 remote=tcp://127.0.0.1:36408>, <TCP (closed) local=tcp://127.0.0.1:40081 remote=tcp://127.0.0.1:47080>]). These might be related, could you have another look?

@graingert graingert force-pushed the remove_handshake_from_timeout branch from b49b28d to 08353e1 Compare June 15, 2023 11:57
@graingert graingert marked this pull request as draft June 15, 2023 11:58
@graingert graingert force-pushed the remove_handshake_from_timeout branch from 73bb326 to a1723de Compare July 6, 2023 11:58
changing when the worker calls self.stop() seems to cause this test to
fail unless the SlowKillNanny.kill() is allowed to proceed when called
during test teardown
@fjetter

fjetter commented Jul 18, 2023

Copy link
Copy Markdown
Member Author

I'm inclined to merge this even if it is increasing test flakiness. I do believe that this genuinely improves stability and test flakiness is mostly due to a bad test setup.
Next release should be happening this Friday

@fjetter fjetter marked this pull request as ready for review July 26, 2023 09:12
@fjetter fjetter merged commit 2751741 into dask:main Jul 28, 2023
@fjetter fjetter deleted the remove_handshake_from_timeout branch July 28, 2023 09:12
@fjetter fjetter mentioned this pull request Jul 28, 2023
Comment thread distributed/worker.py
for pc in self.periodic_callbacks.values():
pc.stop()

self.stop()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving this up here has broken P2P restarts by introducing a new race condition (or making it exponentially more likely to occur). Why do we need this change?

See #8011 (comment) for more on the P2P problem.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely remember having a conversation with @graingert about this but can't recall the details. I believe there were a couple of connections left hanging because they were initiated during worker close and were never properly aborted. Not allowing any more connections once we are closing felt like a sane thing to do.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any issues with reverting this change but I am a bit reluctant to revert the handshake timeout removal. I assume that this is required to keep many tests green.

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jul 2, 2026
If Listener.stop() runs after a connection has been accepted but before
_handle_stream gets to run, _handle_stream would crash with
ValueError('invalid operation on non-started TCPListener') when reading
self.contact_address, abandoning the accepted stream without closing it.

Since the comm handshake in connect() is deliberately not subject to the
connect timeout (dask#7698), the client would then hang forever waiting for the
server's handshake reply. abort_handshaking_comms() cannot help, as the comm
never reached on_connection().

This is what deadlocks test_RetireWorker_stress: a worker's gather_dep from a
closing worker gets stuck forever in the handshake, pinning the key in flight
state; the AMM RetireWorker policy then re-suggests the same transfer every
interval, but the recipient ignores it because the key is already in flight,
so retire_workers never completes. It also leaked one file descriptor (the
abandoned socket) every time the race hit without deadlocking.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants