The title is intentionally analogous to #20 because I suspect the explanation for the observed behavior is similar.
I'm on a PBS cluster whose nodes each have 2 CPUs with 14 cores apiece (28 cores per node).
I was initially calling:

```python
from dask_jobqueue import PBSCluster

cluster = PBSCluster(queue='mpi_1', local_directory=local_dir, interface='ib0', walltime='24:00:00',
                     threads=4, processes=7, memory='10GB', resource_spec='select=1:ncpus=28:mem=100g',
                     death_timeout=100)
```
The workers were created, but they died shortly afterwards.
Changing `threads=4, processes=7, memory='10GB'` to `threads=14, processes=2, memory='50GB'` (all other arguments unchanged) seems to fix the issue.
Here is a link describing dask workers, which may be useful to readers with similar issues:
http://distributed.readthedocs.io/en/latest/worker.html
Note that how the cluster architecture maps onto the options that can be passed to PBSCluster is still not entirely clear to me.
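To make that mapping a little more concrete, here is a minimal sketch of the arithmetic behind the two layouts. It assumes that `threads` is the number of threads per worker process and that `memory` is the limit for each worker process; newer dask-jobqueue releases treat `memory` as the total per job instead, which the `memory_per_process=False` switch covers, so check the documentation of the version you use. `parse_gb` and `layout_summary` are hypothetical helpers, not part of dask-jobqueue.

```python
# Hypothetical helpers (not part of dask-jobqueue): compare a worker layout
# with the resources of one PBS node (defaults match ncpus=28:mem=100g).


def parse_gb(size: str) -> float:
    """Parse sizes like '10GB' or '100g' into gigabytes."""
    s = size.strip().lower()
    for suffix in ("gb", "g"):
        if s.endswith(suffix):
            return float(s[: -len(suffix)])
    raise ValueError(f"unsupported size: {size!r}")


def layout_summary(threads, processes, memory, node_cores=28,
                   node_memory="100g", memory_per_process=True):
    """Summarize threads and memory used by one job's worker processes.

    ASSUMPTION: by default `memory` is treated as a per-process limit;
    pass memory_per_process=False if it is the total for the whole job.
    """
    mem = parse_gb(memory)
    per_process = mem if memory_per_process else mem / processes
    total_memory = per_process * processes
    total_threads = threads * processes
    return {
        "threads_total": total_threads,
        "memory_per_process_gb": per_process,
        "memory_total_gb": total_memory,
        "fits_cores": total_threads <= node_cores,
        "fits_memory": total_memory <= parse_gb(node_memory),
    }


if __name__ == "__main__":
    print(layout_summary(4, 7, "10GB"))   # original layout
    print(layout_summary(14, 2, "50GB"))  # layout that worked
```

Under these assumptions both layouts use 28 threads and stay within 100 GB per node, so this arithmetic alone does not explain why the first layout failed.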
So my issue seems to be fixed, but I wanted to make this experience visible to people who may encounter similar issues.