Dask client connects to PBS workers, then rapidly loses them #30

Description

@apatlpo

The title is intentionally analogous to #20 because I suspect the explanation for the observed behavior is similar.

I'm on a PBS cluster where each node has 2 CPUs with 14 cores each (28 cores per node).

I was initially calling:

cluster = PBSCluster(queue='mpi_1', local_directory=local_dir, interface='ib0', walltime='24:00:00',
                     threads=4, processes=7, memory='10GB', resource_spec='select=1:ncpus=28:mem=100g', 
                     death_timeout=100)

Workers were created, but they died shortly afterwards.

The following choice seems to fix the issue:

threads=14, processes=2, memory='50GB', 

The Dask worker documentation may be useful to readers with similar issues:
http://distributed.readthedocs.io/en/latest/worker.html

Note that the link between cluster architecture and options that can be passed to PBSCluster is still not entirely clear to me.
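
To help reason about how these options relate to the node layout, here is a minimal sketch that compares the two configurations above against the `resource_spec` request. It does not use dask-jobqueue at all; `parse_resource_spec` and `summarize` are hypothetical helper names. It assumes that `threads` and `memory` apply per worker process, which is my reading of the options and not something confirmed in this issue.

```python
import re


def parse_resource_spec(spec):
    """Extract ncpus and mem (in GB) from a PBS string like 'select=1:ncpus=28:mem=100g'."""
    fields = dict(part.split("=", 1) for part in spec.split(":"))
    ncpus = int(fields["ncpus"])
    # Only handles sizes written in gigabytes ('100g' or '100gb').
    mem_gb = int(re.fullmatch(r"(\d+)gb?", fields["mem"].lower()).group(1))
    return ncpus, mem_gb


def summarize(threads, processes, memory_gb, resource_spec):
    """Compare what the workers would use with what the PBS job requests."""
    ncpus, mem_gb = parse_resource_spec(resource_spec)
    return {
        # ASSUMPTION: `threads` is the thread count per worker process.
        "total_threads": threads * processes,
        "ncpus_requested": ncpus,
        # ASSUMPTION: `memory` is the limit per worker process.
        "total_memory_gb": processes * memory_gb,
        "mem_requested_gb": mem_gb,
    }


spec = "select=1:ncpus=28:mem=100g"
initial = summarize(threads=4, processes=7, memory_gb=10, resource_spec=spec)
working = summarize(threads=14, processes=2, memory_gb=50, resource_spec=spec)
print(initial)
print(working)
```

Under these assumptions both configurations use 28 threads, matching `ncpus=28`, and neither exceeds the requested memory, so this arithmetic alone does not explain why the first one failed. The working configuration does map one process onto each 14-core CPU, which may or may not be relevant.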

So my issue seems to be fixed, but I wanted to make this experience visible to people who may encounter similar issues.
