What happened:
I'm using a custom dask provider to run a distributed dask cluster with on-prem compute. These compute nodes can be preempted. I've switched on autoscaling with min workers 0 and max workers 500. Whenever I persist DF (200GB+) onto the cluster, and when workers are preempted, this is observed in the logs:
distributed.deploy.adaptive_core - INFO - Adaptive stop
Looking at pull 4422, I noticed this was made more verbose. I made that change locally, and was seeing this in the log:
distributed.deploy.adaptive_core - ERROR - Adaptive stopping due to error in <closed TCP>: Stream is closed: while trying to call remote method 'retire_workers'
When this happens, the cluster is unable to scale down to the minimum number of workers even after the persisted DF is deleted.
What you expected to happen:
Adaptive to be resilient to preemption.
Minimal Complete Verifiable Example:
You can repro this by simply generating 200+GB worth of random timeseries, and persisting this to a dask cluster. Kill a few workers - to simulate preemption - once the DF is about 90% persisted (from the Dask scheduler's dashboard status tab).
Anything else we need to know?:
Are there good reasons for calling stop on adaptive when this happens? I've tried removing the call to .stop() locally, and the cluster seem to run fine afterwards.
I am also aware of issue 4421. The solution was for the workers to poll a metaurl from public cloud providers for an early warning signal on preemption before shutting down gracefull. However, that is not an option in my case.
Environment:
- Dask version: 2.30.0
- Distributed version: 2.30.0
- Python version: 3.7
What happened:
I'm using a custom dask provider to run a distributed dask cluster with on-prem compute. These compute nodes can be preempted. I've switched on autoscaling with min workers 0 and max workers 500. Whenever I persist DF (200GB+) onto the cluster, and when workers are preempted, this is observed in the logs:
Looking at pull 4422, I noticed this was made more verbose. I made that change locally, and was seeing this in the log:
When this happens, the cluster is unable to scale down to the minimum number of workers even after the persisted DF is deleted.
What you expected to happen:
Adaptive to be resilient to preemption.
Minimal Complete Verifiable Example:
You can repro this by simply generating 200+GB worth of random timeseries, and persisting this to a dask cluster. Kill a few workers - to simulate preemption - once the DF is about 90% persisted (from the Dask scheduler's dashboard status tab).
Anything else we need to know?:
Are there good reasons for calling stop on adaptive when this happens? I've tried removing the call to
.stop()locally, and the cluster seem to run fine afterwards.I am also aware of issue 4421. The solution was for the workers to poll a metaurl from public cloud providers for an early warning signal on preemption before shutting down gracefull. However, that is not an option in my case.
Environment: