Cluster.scale is not robust to multiple calls

As experienced in https://github.com/dask/dask-jobqueue/issues/112 and the related PR https://github.com/dask/dask-jobqueue/pull/97, `Cluster.scale` behaves unpredictably when it is called several times in a row.

I suspect part of this problem is due to how synchronous and asynchronous calls are mixed here:

- We retrieve the cluster's number of workers synchronously here https://github.com/dask/distributed/blob/master/distributed/deploy/cluster.py#L100, but we launch `scale_up` asynchronously, so something can happen (here: another call to `scale`) between the state retrieval and the effective `scale_up`.
- Similarly, we get the workers to close synchronously, but stop them asynchronously.
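
To make the suspected race concrete, here is a minimal sketch. It uses plain `asyncio` rather than the Tornado loop used by distributed, and `RacyCluster`, `_start_workers`, and `demo` are hypothetical names for illustration only, not the real `Cluster` API. The state is read when `scale` is called, but the workers are only added later, so two calls in a row both see zero workers:

```python
import asyncio


class RacyCluster:
    """Toy stand-in for a cluster (hypothetical, not the distributed API)."""

    def __init__(self):
        self.workers = []

    async def _start_workers(self, count):
        # Yield to the event loop once, standing in for the time a real
        # job submission would take.
        await asyncio.sleep(0)
        self.workers.extend(object() for _ in range(count))

    def scale(self, n):
        # State is read synchronously, at call time...
        missing = n - len(self.workers)
        # ...but the actual scale-up only runs later on the event loop.
        return asyncio.ensure_future(self._start_workers(missing))


async def demo():
    cluster = RacyCluster()
    # Two calls in a row: both compute `missing` from the same stale state.
    tasks = [cluster.scale(3), cluster.scale(3)]
    await asyncio.gather(*tasks)
    return len(cluster.workers)


racy_count = asyncio.run(demo())
print(racy_count)  # 6 workers although 3 were requested
```

In this toy model the second call cannot know that the first one is already in flight, which is the kind of interleaving described in the two points above.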

If we want `scale` to run asynchronously, I propose to just add a `_scale()` method here (a coroutine?) to be called asynchronously from `scale()`. In this `_scale`, we would get the state and perform the modifications in the same step:
````python
def _scale(self, n):
    with log_errors():
        # Read the current state and act on it in the same step
        if n >= len(self.scheduler.workers):
            self.scale_up(n)
        else:
            # Ask the scheduler which workers are the best candidates to close
            to_close = self.scheduler.workers_to_close(
                n=len(self.scheduler.workers) - n)
            logger.debug("Closing workers: %s", to_close)
            self.scheduler.retire_workers(workers=to_close)
            self.scale_down(to_close)
````
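
As a rough, self-contained illustration of this idea (not a patch against distributed), the sketch below reads the state and applies the change inside a single coroutine, and additionally serializes calls with an `asyncio.Lock`. The lock is my own assumption and goes beyond the snippet above; `SerializedCluster`, `_start_workers`, `_stop_workers`, and `demo` are hypothetical names, and plain `asyncio` stands in for the Tornado loop:

```python
import asyncio


class SerializedCluster:
    """Toy stand-in for a cluster (hypothetical, not the distributed API)."""

    def __init__(self):
        self.workers = []
        # Assumption: one lock so that only one _scale runs at a time.
        self._lock = asyncio.Lock()

    async def _start_workers(self, count):
        await asyncio.sleep(0)  # stands in for slow job submission
        self.workers.extend(object() for _ in range(count))

    async def _stop_workers(self, count):
        await asyncio.sleep(0)  # stands in for slow worker shutdown
        del self.workers[-count:]  # only called with count > 0

    async def _scale(self, n):
        async with self._lock:
            # State is read and acted upon while holding the lock,
            # so no other scale call can interleave.
            current = len(self.workers)
            if n >= current:
                await self._start_workers(n - current)
            else:
                await self._stop_workers(current - n)

    def scale(self, n):
        # Public entry point: schedule the coroutine on the running loop.
        return asyncio.ensure_future(self._scale(n))


async def demo():
    cluster = SerializedCluster()
    await asyncio.gather(cluster.scale(3), cluster.scale(3))
    after_up = len(cluster.workers)
    await asyncio.gather(cluster.scale(1), cluster.scale(1))
    return after_up, len(cluster.workers)


after_up, after_down = asyncio.run(demo())
print(after_up, after_down)  # 3 1
```

With this arrangement, repeated calls converge on the requested size in the toy model. Whether a lock is the right tool in the real `Cluster` class, or whether running everything on the scheduler's event loop is enough, is an open question here.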

@jhamman @mrocklin any opinion, advice?

Cluster.scale is not robust to multiple calls #2257

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

Cluster.scale is not robust to multiple calls #2257

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions