Skip to content

WIP: Make workers gracefully handle sigint#2844

Open
jni wants to merge 6 commits into
dask:mainfrom
jni:worker-sigint
Open

WIP: Make workers gracefully handle sigint#2844
jni wants to merge 6 commits into
dask:mainfrom
jni:worker-sigint

Conversation

@jni

@jni jni commented Jul 15, 2019

Copy link
Copy Markdown

Was hacking with @mrocklin trying to fix dask/dask-jobqueue#122, ran into #2788. These are early attempts to fix both things. @TomAugspurger you might find my test code helpful, if overly verbose (I just copied the worker/scheduler creation from the test above it).

The basic idea behind the unregister_with_scheduler coroutine is that the cluster (apparently? @mrocklin told me this) sends SIGINT to processes before killing them for exceeding their time allocation. We can use this to close out the workers with safe=True so that the tasks running on them are not marked as suspicious.

@jni

jni commented Jul 15, 2019

Copy link
Copy Markdown
Author

(Note: I do not want to unduly associate @mrocklin with the quality of this PR (or lack thereof) — @mrocklin was just patiently sitting next to me answering my questions and pointing me to the right files and SO answers. =)

@TomAugspurger

Copy link
Copy Markdown
Member

Thanks @jni. I haven't worked any more on #2788 since a couple weeks ago when it completely stumped me.

Do you plan to continue working on this? If not, I'll try to pick it up again by the end of the week.

@jni

jni commented Jul 23, 2019

Copy link
Copy Markdown
Author

@TomAugspurger I'd very much appreciate you picking this up, as I don't think I can dedicate much time to it in the next few weeks...

@TomAugspurger

Copy link
Copy Markdown
Member

OK, I'll try to get #2788 sorted out today.

@TomAugspurger

TomAugspurger commented Jul 23, 2019

Copy link
Copy Markdown
Member

Well, that whole "step away from the problem for a bit" thing actually worked.

diff --git a/distributed/cli/dask_worker.py b/distributed/cli/dask_worker.py
index e23a8ab9..5a396686 100755
--- a/distributed/cli/dask_worker.py
+++ b/distributed/cli/dask_worker.py
@@ -394,6 +394,7 @@ def main(
         raise TimeoutError("Timed out starting worker.") from None
     finally:
         logger.info("End worker")
+        return 0
 
 
 def go():

is all I was missing for #2788 :) Now to write a test.

edit: never mind, that's not working :/

jni and others added 5 commits November 30, 2019 08:39
On distributed master, sending SigInt to a worker results in a
TimeoutError raised from Tornado, which is not at all what happened.
This test checks that this error is not raised.
@mrocklin

mrocklin commented Dec 1, 2019

Copy link
Copy Markdown
Member

OK, I've fiddled with this a bit. Things seem to behave well on linux, but windows CI is unhappy.

Comment on lines +412 to +415
for sig in [signal.SIGINT, signal.SIGTERM]:
asyncio.get_event_loop().add_signal_handler(
sig, functools.partial(on_signal, sig)
)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrocklin

Copy link
Copy Markdown
Member

cc @jacobtomlinson

Base automatically changed from master to main March 8, 2021 19:03
Comment on lines +399 to +401
if signum == signal.SIGINT:
logger.info("Gracefully closing worker because of SIGINT call")
await asyncio.gather(*[n.close_gracefully() for n in nannies])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be reasonable to give SIGTERM the same treatment as SIGINT as well?

@jni jni requested a review from fjetter as a code owner January 23, 2024 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling workers with expiring allocation requests

5 participants