Skip to content

Adaptive stops when workers preempted #4591

Description

@gerald732

What happened:
I'm using a custom dask provider to run a distributed dask cluster with on-prem compute. These compute nodes can be preempted. I've switched on autoscaling with min workers 0 and max workers 500. Whenever I persist DF (200GB+) onto the cluster, and when workers are preempted, this is observed in the logs:

distributed.deploy.adaptive_core - INFO - Adaptive stop

Looking at pull 4422, I noticed this was made more verbose. I made that change locally, and was seeing this in the log:

distributed.deploy.adaptive_core - ERROR - Adaptive stopping due to error in <closed TCP>: Stream is closed: while trying to call remote method 'retire_workers'

When this happens, the cluster is unable to scale down to the minimum number of workers even after the persisted DF is deleted.

What you expected to happen:
Adaptive to be resilient to preemption.

Minimal Complete Verifiable Example:
You can repro this by simply generating 200+GB worth of random timeseries, and persisting this to a dask cluster. Kill a few workers - to simulate preemption - once the DF is about 90% persisted (from the Dask scheduler's dashboard status tab).

Anything else we need to know?:
Are there good reasons for calling stop on adaptive when this happens? I've tried removing the call to .stop() locally, and the cluster seem to run fine afterwards.

I am also aware of issue 4421. The solution was for the workers to poll a metaurl from public cloud providers for an early warning signal on preemption before shutting down gracefull. However, that is not an option in my case.

Environment:

  • Dask version: 2.30.0
  • Distributed version: 2.30.0
  • Python version: 3.7

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions