Hello!
I've been using dask-jobqueue for quite some time without touching the default configuration file located in ~/.config/dask. Recently, I decided to edit the config file jobqueue.yaml to avoid copy/pasting the same Jupyter cell across notebooks and just call
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster()
with this configuration
jobqueue:
  slurm:
    name: dask-worker
    cores: 12
    memory: 60GB
    processes: 1
    interface: 'ib0'
    local-directory: /home/grivera/scratch
    queue: mpi_short2
    walltime: '01:00:00'
    log-directory: /home/grivera/slurm_logs
instead of
from dask_jobqueue import SLURMCluster
cluster = SLURMCluster(
    queue="mpi_short2",
    cores=12,
    memory="60GB",
    processes=1,
    interface="ib0",
    dashboard_address=":6767",
)
When using the config file, my workers don't seem to connect at all, and eventually they get killed by the timeout. To debug this, I printed cluster.job_script() for both cases and found that the scheduler IP is different, even though I'm using the same values in both cases.
output of job_script when using kwargs
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -e /home/grivera/slurm_logs/dask-worker-%J.err
#SBATCH -o /home/grivera/slurm_logs/dask-worker-%J.out
#SBATCH -p mpi_short2
#SBATCH -n 1
#SBATCH --cpus-per-task=12
#SBATCH --mem=56G
#SBATCH -t 01:00:00
JOB_ID=${SLURM_JOB_ID%;*}
/home/grivera/miniconda3/envs/Work/bin/python -m distributed.cli.dask_worker tcp://192.168.0.15:33289 --nthreads 12 --memory-limit 60.00GB --name name --nanny --death-timeout 60 --local-directory /home/grivera/scratch --interface ib0
output of job_script when using the yaml file
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -e /home/grivera/slurm_logs/dask-worker-%J.err
#SBATCH -o /home/grivera/slurm_logs/dask-worker-%J.out
#SBATCH -p mpi_short2
#SBATCH -n 1
#SBATCH --cpus-per-task=12
#SBATCH --mem=56G
#SBATCH -t 01:00:00
JOB_ID=${SLURM_JOB_ID%;*}
/home/grivera/miniconda3/envs/Work/bin/python -m distributed.cli.dask_worker tcp://127.0.0.1:43852 --nthreads 12 --memory-limit 60.00GB --name name --nanny --death-timeout 60 --local-directory /home/grivera/scratch --interface ib0
If I omit the interface kwarg, the result is the same in both cases (workers can't connect and the IP is 127.0.0.1), so I think this parameter is not being loaded from the config file.
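For anyone who wants to repeat the comparison without eyeballing the scripts, here is a minimal sketch that pulls the scheduler address out of a job script and reports whether it is a loopback address. The helper names `scheduler_address` and `is_loopback` are hypothetical, not part of dask-jobqueue, and the two strings are shortened copies of the worker command lines shown above; in practice you would pass the output of cluster.job_script() instead.

```python
import ipaddress
import re


def scheduler_address(job_script):
    """Return (host, port) of the first tcp:// address in the script, or None."""
    match = re.search(r"tcp://([0-9.]+):(\d+)", job_script)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def is_loopback(host):
    """True if host is a loopback address such as 127.0.0.1."""
    return ipaddress.ip_address(host).is_loopback


# Shortened copies of the worker commands from the two job scripts above.
kwargs_line = (
    "python -m distributed.cli.dask_worker tcp://192.168.0.15:33289 "
    "--nthreads 12 --interface ib0"
)
yaml_line = (
    "python -m distributed.cli.dask_worker tcp://127.0.0.1:43852 "
    "--nthreads 12 --interface ib0"
)

for label, script in (("kwargs", kwargs_line), ("yaml", yaml_line)):
    host, port = scheduler_address(script)
    kind = "loopback" if is_loopback(host) else "non-loopback"
    print(label, host, port, kind)
```

With the scripts above, the kwargs case reports a non-loopback address and the yaml case reports loopback, which matches workers on other nodes being unable to reach the scheduler.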