Should JobQueueCluster._call raise if the command has a non zero exit code? #142

Description

@lesteve

I think there should be an exception if the submission command has a non-zero exit code.

A slight side effect of #132 is that you now get an exception later, in the job id parsing, which is not ideal.

This is on dask-jobqueue master using a queue that does not exist:

from dask_jobqueue import SGECluster
from dask.distributed import Client
resource_spec = 'h_vmem=1000G,mem_req=16G'

cluster = SGECluster(queue='asdf',
                     cores=1,
                     memory='16GB',
                     resource_spec=resource_spec)
cluster.scale(2)

This is the output you get (on stderr):

Unable to run job: Job was rejected because job requests unknown queue "asdf".
Exiting.

tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x2aaac6552620>, 2)
Traceback (most recent call last):
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/ioloop.py", line 760, in _run_callback
    ret = callback()
  File "/sequoia/data1/lesteve/miniconda3/envs/dask-dev/lib/python3.6/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 375, in scale_up
    self.start_workers(n - active_and_pending)
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 302, in start_workers
    job = self._job_id_from_submit_output(out.decode())
  File "/home/lesteve/dev/dask-jobqueue/dask_jobqueue/core.py", line 420, in _job_id_from_submit_output
    raise ValueError(msg)
ValueError: Could not parse job id from submission command output.
Job id regexp is '(?P<job_id>\\d+)'
Submission command output is:

So you first get an error logged (with logger.error) because the submission command wrote to stderr, and then a traceback from the job id parsing, which hides the real cause.
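To illustrate why the parsing step is where things blow up, here is a minimal sketch of the job id extraction, assuming it boils down to a regexp search on the submission output. The function name `job_id_from_submit_output` and the sample success message are stand-ins for illustration, not the actual code in core.py; only the regexp is taken from the traceback above.

```python
import re

# Regexp as shown in the traceback above.
JOB_ID_REGEXP = r"(?P<job_id>\d+)"


def job_id_from_submit_output(out):
    """Hypothetical stand-in for the job id parsing step."""
    match = re.search(JOB_ID_REGEXP, out)
    if match is None:
        # With empty stdout there is nothing to match, so this is the
        # error the user ends up seeing instead of the real cause.
        raise ValueError(
            "Could not parse job id from submission command output.\n"
            "Job id regexp is {!r}\n"
            "Submission command output is:\n{}".format(JOB_ID_REGEXP, out)
        )
    return match.group("job_id")
```

When the submission is rejected, stdout is empty, so the search cannot match and the ValueError is raised even though the actual problem (the unknown queue) was only reported on stderr.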

I think we should instead raise from JobQueueCluster._call with a message saying that the submission command exited with a non-zero exit code, and include the stdout and stderr to help the user figure out what happened.
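Something along these lines is what I have in mind. This is only a rough sketch using subprocess directly: `call_and_check` is a hypothetical standalone helper, not the current `_call` implementation, and the exception type and message layout are assumptions that can be changed.

```python
import subprocess
import sys


def call_and_check(cmd):
    """Run cmd (a list of strings) and return its decoded stdout.

    Hypothetical helper for illustration only. Raises RuntimeError
    if the command exits with a non-zero exit code.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    out, err = out.decode(), err.decode()
    if proc.returncode != 0:
        # Put everything the user needs to debug into the message.
        raise RuntimeError(
            "Command exited with non-zero exit code.\n"
            "Exit code: {}\n"
            "Command:\n{}\n"
            "stdout:\n{}\n"
            "stderr:\n{}\n".format(proc.returncode, " ".join(cmd), out, err)
        )
    return out


if __name__ == "__main__":
    # Simulate a rejected submission: message on stderr, exit code 1.
    try:
        call_and_check([sys.executable, "-c", "import sys; sys.exit('unknown queue')"])
    except RuntimeError as exc:
        print(exc)
```

With this, the failure is reported at the point where the submission command runs, and the job id parsing is never reached with empty output.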

If you have any comments, let me know; otherwise I'll implement something like this.
