No workers info available after scale() with SLURMCluster #376

Description

@AChatzigoulas

I have an issue very similar to #246. After I create the cluster and scale it, the jobs are running on their nodes, but the scheduler reports no workers and no CPUs.

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster()
DEBUG:Using selector: EpollSelector
DEBUG:Using selector: EpollSelector
cluster.scale(jobs = 2)
DEBUG:Starting worker: 1
DEBUG:writing job script: 
#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A pr008033
#SBATCH -n 1
#SBATCH --cpus-per-task=20
#SBATCH --mem=45G
#SBATCH -t 00:30:00
#SBATCH -o myjob.%j.out
#SBATCH -e myjob.%j.err

JOB_ID=${SLURM_JOB_ID%;*}

/users/pr008/...
tcp://195.251.23.79:33491 --nthreads 20 --memory-limit 48.00GB --name 1 --nanny --death-timeout 60 --interface ib0

DEBUG:Executing the following command to command line
sbatch /tmp/tmpoals04ok.sh
DEBUG:Starting job: 804606
DEBUG:Starting worker: 0
DEBUG:writing job script: 
#!/usr/bin/env bash

#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A pr008033
#SBATCH -n 1
#SBATCH --cpus-per-task=20
#SBATCH --mem=45G
#SBATCH -t 00:30:00
#SBATCH -o myjob.%j.out
#SBATCH -e myjob.%j.err

JOB_ID=${SLURM_JOB_ID%;*}

/users/pr008/...
tcp://195.251.23.79:33491 --nthreads 20 --memory-limit 48.00GB --name 0 --nanny --death-timeout 60 --interface ib0

DEBUG:Executing the following command to command line
sbatch /tmp/tmpsuq3ysbm.sh
DEBUG:Starting job: 804607
from dask.distributed import Client
client = Client(cluster)
client
Client Scheduler: tcp://195.251.23.79:33491 Dashboard: http://195.251.23.79:8787/status Cluster Workers: 0 Cores: 0 Memory: 0 B
client.scheduler_info()
{'type': 'Scheduler',
 'id': 'Scheduler-5198da07-3700-42aa-a068-036146073ef6',
 'address': 'tcp://195.251.23.79:33491',
 'services': {'dashboard': 8787},
 'workers': {}}
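
To make the symptom easy to check programmatically, here is a minimal sketch that summarizes a `scheduler_info()`-style dict. `summarize_workers` is a hypothetical helper written for this illustration, not part of dask or dask-jobqueue, and the `nthreads` and `memory_limit` keys are an assumption about how worker entries look once workers do register. The dict is copied from the output above, so no running cluster is needed.

```python
def summarize_workers(info):
    """Return (worker_count, total_threads, total_memory_bytes).

    `info` is a dict shaped like the result of client.scheduler_info().
    The per-worker keys "nthreads" and "memory_limit" are assumed here;
    missing keys count as 0.
    """
    workers = info.get("workers", {})
    total_threads = sum(w.get("nthreads", 0) for w in workers.values())
    total_memory = sum(w.get("memory_limit", 0) for w in workers.values())
    return len(workers), total_threads, total_memory


# The dict reported by client.scheduler_info() in this issue.
info = {
    "type": "Scheduler",
    "id": "Scheduler-5198da07-3700-42aa-a068-036146073ef6",
    "address": "tcp://195.251.23.79:33491",
    "services": {"dashboard": 8787},
    "workers": {},
}

print(summarize_workers(info))  # prints (0, 0, 0)
```

With two healthy 20-thread workers I would expect the first two values to be 2 and 40; here all three are zero even though both jobs are running.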

When I run squeue, both jobs are listed as running:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
804606   compute dask-wor  ...  R       6:04      1 node287
804607   compute dask-wor  ...  R       6:04      1 node003

The HPC cluster uses InfiniBand and each node has 20 CPUs. If you need any other information, please let me know.

Thank you

Labels: question (Further information is requested)