WIP: Make workers gracefully handle sigint #2844
Open
jni wants to merge 6 commits into
Author
Member
Author
@TomAugspurger I'd very much appreciate you picking this up, as I don't think I can dedicate much time to it in the next few weeks...
Member
OK, I'll try to get #2788 sorted out today.
Member
Well, that whole "step away from the problem for a bit" thing actually worked.

diff --git a/distributed/cli/dask_worker.py b/distributed/cli/dask_worker.py
index e23a8ab9..5a396686 100755
--- a/distributed/cli/dask_worker.py
+++ b/distributed/cli/dask_worker.py
@@ -394,6 +394,7 @@ def main(
raise TimeoutError("Timed out starting worker.") from None
finally:
logger.info("End worker")
+ return 0
def go():

is all I was missing for #2788 :) Now to write a test.

edit: never mind, that's not working :/
On distributed master, sending SIGINT to a worker results in a TimeoutError raised from Tornado, which is not at all what happened. This test checks that this error is not raised.
e4064b6 to 8770d44
Member
OK, I've fiddled with this a bit. Things seem to behave well on Linux, but Windows CI is unhappy.
Comment on lines +412 to +415
for sig in [signal.SIGINT, signal.SIGTERM]:
    asyncio.get_event_loop().add_signal_handler(
        sig, functools.partial(on_signal, sig)
    )
Member
I expect Windows will not be happy with this. https://stackoverflow.com/questions/45987985/asyncio-loops-add-signal-handler-in-windows
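One possible workaround, sketched under assumptions: try `loop.add_signal_handler` first and fall back to the plain `signal.signal` API when the loop raises `NotImplementedError`, which is what asyncio event loops do on Windows. The `install_signal_handlers` name and its return value here are hypothetical and are not taken from this PR.

```python
import asyncio
import signal


def install_signal_handlers(loop, on_signal, signals=(signal.SIGINT, signal.SIGTERM)):
    """Register on_signal(signum) for each signal; return which mechanism was used.

    Hypothetical helper for illustration. Must be called from the main thread.
    """
    try:
        for sig in signals:
            # add_signal_handler(sig, callback, *args) is only implemented on Unix loops.
            loop.add_signal_handler(sig, on_signal, sig)
        return "loop"
    except NotImplementedError:
        # Fallback: a plain signal handler that hands the call over to the loop.
        for sig in signals:
            signal.signal(
                sig,
                lambda signum, frame: loop.call_soon_threadsafe(on_signal, signum),
            )
        return "signal"


if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    print(install_signal_handlers(loop, print))
    loop.close()
```

The fallback uses `call_soon_threadsafe` so that the callback still runs inside the event loop instead of directly in the signal handler. Whether Windows actually delivers SIGTERM to the process in a useful way is a separate question that this sketch does not address.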
Closed
Member
gerald732 suggested changes Mar 17, 2021
Comment on lines +399 to +401
if signum == signal.SIGINT:
    logger.info("Gracefully closing worker because of SIGINT call")
    await asyncio.gather(*[n.close_gracefully() for n in nannies])
Contributor
Would it be reasonable to give SIGTERM the same treatment as SIGINT as well?
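For illustration, a minimal sketch of what treating both signals alike could look like. `FakeNanny` and `GRACEFUL_SIGNALS` are hypothetical stand-ins invented here; only the `close_gracefully()` call and the `asyncio.gather` pattern come from the snippet under review, and the non-graceful `close()` branch is an assumption about what other signals would do.

```python
import asyncio
import signal


class FakeNanny:
    """Hypothetical stand-in for a nanny; it only records which close path ran."""

    def __init__(self):
        self.closed = None

    async def close_gracefully(self):
        self.closed = "graceful"

    async def close(self):
        self.closed = "hard"


# Assumption: both signals should trigger the graceful path.
GRACEFUL_SIGNALS = {signal.SIGINT, signal.SIGTERM}


async def on_signal(signum, nannies):
    if signum in GRACEFUL_SIGNALS:
        await asyncio.gather(*[n.close_gracefully() for n in nannies])
    else:
        await asyncio.gather(*[n.close() for n in nannies])


if __name__ == "__main__":
    nannies = [FakeNanny(), FakeNanny()]
    asyncio.run(on_signal(signal.SIGTERM, nannies))
    print([n.closed for n in nannies])
```

Keeping the set of graceful signals in one place means the handler body does not need to change if the list grows or shrinks later.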
Was hacking with @mrocklin trying to fix dask/dask-jobqueue#122, ran into #2788. These are early attempts to fix both things. @TomAugspurger you might find my test code helpful, if overly verbose (I just copied the worker/scheduler creation from the test above it).
The basic idea behind the unregister_with_scheduler coroutine is that the cluster (apparently? @mrocklin told me this) sends SIGINT to processes before killing them for exceeding their time allocation. We can use this to close out the workers with safe=True so that the tasks running on them are not marked as suspicious.
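To make the safe=True idea concrete, here is a toy model of the bookkeeping. `ToyScheduler` and all of its methods are hypothetical and are not the distributed scheduler API; the sketch only assumes that removing a worker unsafely bumps a per-task "suspicious" counter, while a safe removal leaves it alone.

```python
from collections import defaultdict


class ToyScheduler:
    """Toy model only; not the distributed.Scheduler API."""

    def __init__(self):
        self.processing = defaultdict(set)  # worker address -> task keys running there
        self.suspicious = defaultdict(int)  # task key -> times it was on a lost worker

    def assign(self, worker, key):
        self.processing[worker].add(key)

    def remove_worker(self, worker, safe=False):
        """Drop a worker and return the task keys that need rescheduling."""
        keys = self.processing.pop(worker, set())
        if not safe:
            # An unexpected loss counts against every task that was running there.
            for key in keys:
                self.suspicious[key] += 1
        return sorted(keys)


if __name__ == "__main__":
    s = ToyScheduler()
    s.assign("worker-1", "x")
    s.assign("worker-2", "y")
    s.remove_worker("worker-1", safe=True)  # e.g. closed after SIGINT
    s.remove_worker("worker-2")             # e.g. killed without warning
    print(dict(s.suspicious))
```

In this model, a worker that closes itself on SIGINT would take the safe=True path, so a job scheduler reaping workers at the end of their time allocation would not count against the tasks they were running.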