Handling workers with expiring allocation requests #122

Description

@wgustafson

I am trying to figure out how to handle dask workers getting bumped from a cluster when their requested allocation time expires. From the intro YouTube video at https://www.youtube.com/watch?v=FXsgmwpRExM, it sounds like dask-jobqueue should detect when a worker expires and automatically start a replacement, which is what I want. However, my testing on DOE's Edison computer at NERSC does not show that behavior. If it matters, Edison uses SLURM.

I have tried setting up my cluster two ways, and both behave the same. I launch a run that uses dask.delayed to do a bunch of embarrassingly parallel tasks. The server spawns one worker, that worker completes the first task or two and then expires, after which the server seems to hang and nothing else happens.
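For reference, a minimal sketch of this kind of workload is shown below. The function `process_one` and its inputs are hypothetical stand-ins, not the actual tasks from this report, and the synchronous scheduler is used only so the snippet runs without a SLURM allocation.

```python
import dask
from dask import delayed


def process_one(item):
    # Hypothetical stand-in for one independent, memory-heavy task.
    return item * item


# Build one delayed task per input; nothing runs until compute() is called.
tasks = [delayed(process_one)(i) for i in range(5)]

# The synchronous scheduler runs everything in this process, so the sketch
# works without SLURM. With a distributed Client active, the same tasks
# would be sent to the cluster's workers instead.
results = dask.compute(*tasks, scheduler="synchronous")
print(results)
```

Because the tasks are independent of one another, a replacement worker could in principle pick up the remaining ones after the first worker's allocation expires.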

The first approach I used to set up the cluster was with "scale":

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(cores=1, processes=1)  # need all the memory for one task
    cluster.scale(1)  # testing with as simple as I can get, cycling 1 worker
    client = Client(cluster, timeout='45s')

@josephhardinee suggested a second approach using "adapt" instead:

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(cores=1, processes=1)  # need all the memory for one task
    cluster.adapt(minimum=1, maximum=1)  # trying adapt instead of scale
    client = Client(cluster, timeout='45s')

The dask-worker.err log concludes with:

    slurmstepd: error: *** JOB 10234215 ON nid01242 CANCELLED AT 2018-08-10T13:25:30 DUE TO TIME LIMIT ***
    distributed.dask_worker - INFO - Exiting on signal 15
    distributed.nanny - INFO - Closing Nanny at 'tcp://10.128.4.227:35634'
    distributed.dask_worker - INFO - Exiting on signal 15
    distributed.dask_worker - INFO - End worker
    distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>

Am I expecting more from dask-jobqueue than I should? Or is this a bug in my implementation, in dask.distributed, or in dask-jobqueue?

Thanks,
Bill
