I think there should be an exception if the submission command has a non-zero exit code.
A slight side-effect of #132 is that you now get an exception later, in the job id parsing, which is not ideal.
This is on dask-jobqueue master using a queue that does not exist:
from dask_jobqueue import SGECluster
from dask.distributed import Client
resource_spec = 'h_vmem=1000G,mem_req=16G'
cluster = SGECluster(queue='asdf',
                     cores=1,
                     memory='16GB',
                     resource_spec=resource_spec)
cluster.scale(2)
This is the output you get (on stderr):
Unable to run job: Job was rejected because job requests unknown queue "asdf".
Exiting.
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2aaac6552620>, 2)
Traceback (most recent call last):
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/ioloop.py", line 760, in _run_callback
    ret = callback()
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 375, in scale_up
    self.start_workers(n - active_and_pending)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 302, in start_workers
    job = self._job_id_from_submit_output(out.decode())
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 420, in _job_id_from_submit_output
    raise ValueError(msg)
ValueError: Could not parse job id from submission command output.
Job id regexp is '(?P<job_id>\\d+)'
Submission command output is:
So basically the stderr of the submission command does get logged (with logger.error), but the traceback you then see comes from the job id parsing, because the submission output on stdout is empty.
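To illustrate why the parsing step is where things blow up, here is a minimal sketch using the job id regexp from the traceback above. The "successful" output string is made up for illustration and is not actual scheduler output.

```python
import re

# Job id regexp as printed in the ValueError message above
job_id_regexp = r"(?P<job_id>\d+)"

# Made-up example of what a successful submission might print (assumption)
ok_output = "Your job 1234567 has been submitted"
# When the scheduler rejects the job, nothing is written to stdout
failed_output = ""

ok_match = re.search(job_id_regexp, ok_output)
failed_match = re.search(job_id_regexp, failed_output)

print(ok_match.group("job_id"))  # prints 1234567
print(failed_match)              # prints None, i.e. no job id to parse
```

With an empty stdout there is simply nothing for the regexp to match, so the error surfaces in the parsing code rather than where the command actually failed.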
I think we should just have a traceback from JobQueueCluster.__call__ saying that the submission command exited with a non-zero exit code, and include the stdout and stderr in the message to help the user figure out what happened.
If you have any comments let me know, otherwise I'll implement something like this.
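Roughly what I have in mind is sketched below. This is not the current dask-jobqueue code: `run_submit_command` is a hypothetical standalone helper, the error message format is only a suggestion, and a small Python one-liner stands in for a failing submission command so the snippet runs without a scheduler.

```python
import subprocess
import sys


def run_submit_command(cmd):
    """Hypothetical helper: run cmd and raise if its exit code is non-zero."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    out, err = out.decode(), err.decode()
    if proc.returncode != 0:
        # Put everything the user needs to debug into the exception message
        raise RuntimeError(
            "Command exited with non-zero exit code.\n"
            "Exit code: {}\n"
            "Command:\n{}\n"
            "stdout:\n{}\n"
            "stderr:\n{}\n".format(proc.returncode, " ".join(cmd), out, err)
        )
    return out


# Stand-in for a rejected submission: writes to stderr and exits with code 1
failing_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('unknown queue'); sys.exit(1)",
]

try:
    run_submit_command(failing_cmd)
except RuntimeError as exc:
    print(exc)
```

This way the failure is reported at the point where the command is run, and the job id parsing is never reached with an empty output.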