Skip to content

Doc on handling worker with walltime#481

Merged
guillaumeeb merged 3 commits into
dask:masterfrom
guillaumeeb:update_docs_handling_workers
Jan 24, 2021
Merged

Doc on handling worker with walltime#481
guillaumeeb merged 3 commits into
dask:masterfrom
guillaumeeb:update_docs_handling_workers

Conversation

@guillaumeeb

Copy link
Copy Markdown
Member

Finally, a little contribution from me, and a doc fix to a long standing issue.

Fixes #122.

@mivade mivade left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks like great additional documentation! I've pointed out some typos and some suggestions to clarify the language a bit, but otherwise this looks good!

- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.
- when you really don't know how long your workload will take: all your workers could be killed before reaching the end. In this case, you'll want to use adaptive clusters so that Dask ensures some workers are always up.

If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in the exception.


If you don't set the proper parameters, you'll run into KilleWorker exceptions in those two cases.

The solution to this problem is to tell Dask up front that the workers have a finit life time:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: finit -> finite. Similarly lifetime is usually spelled as a single word.


The solution to this problem is to tell Dask up front that the workers have a finit life time:

- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enables -> enable

How to handle job queueing system walltime killing workers
----------------------------------------------------------

In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be "every worker process runs..."

In dask-jobqueue, every worker processes run inside a job, and all jobs have a time limit in job queueing systems.
Reaching walltime can be troublesome in several cases:

- when you don't have a lot of room on you HPC platform and have only a few workers at a time (less than what you were hopping for when using scale or adapt). These workers will be killed (and others started) before you workload ends.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopping -> hoping and "before you workload" -> "before your workload"

The solution to this problem is to tell Dask up front that the workers have a finit life time:

- Use `--lifetime` worker option. This will enables infinite workloads using adaptive. Workers will be properly shut down before the scheduling system kills them, and all their states moved.
- Use `--lifetime-stagger` when dealing with many workers (say > 20): this will allow to avoid workers all terminating at the same time, and so to ease rebalancing tasks and scheduling burden.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"this will allow to avoid workers all" -> "this will prevent workers from"

"and so to ease" -> "and so ease" or (probably better) "thus"

cluster.adapt(minimum=0, maximum=200)


Here is an example of a workflow taking advantage of this, if you wan't to give it a try or adapt it to your use case:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wan't -> want

@guillaumeeb

Copy link
Copy Markdown
Member Author

Many thanks @mivade! I need to practice my english...

@andersy005 andersy005 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this together, @guillaumeeb!

@andersy005 andersy005 added the documentation Documentation-related label Jan 24, 2021
@guillaumeeb guillaumeeb merged commit 69f27ac into dask:master Jan 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Documentation-related

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling workers with expiring allocation requests

3 participants