SGE implementation #27
Conversation
jhamman
left a comment
There was a problem hiding this comment.
Thanks for resurrecting this one. Just a few questions on the tests for now.
| @pytest.mark.skipif('SGE_ACCOUNT' in os.environ, reason='SGE_ACCOUNT defined') # noqa: F811 | ||
| def test_errors(loop): |
There was a problem hiding this comment.
I don't think this test is relevant anymore.
There was a problem hiding this comment.
Yeah I was wondering about that and I felt the same way.
There was a problem hiding this comment.
I think that this test came from a cluster where project= was necessary to run, and was often set implicitly with an environment variable (perhaps PBS_ACCOUNT on the cluster on which I was working at the time). If this is not generally true for SGE clusters then I agree that this test should be removed.
| start = time() | ||
| while cluster.jobs: | ||
| sleep(0.100) | ||
| assert time() < start + 10 |
There was a problem hiding this comment.
What is this testing exactly? @mrocklin - given the recent development of Adaptive, can you help us flesh out what kind of tests we should be looking at here?
There was a problem hiding this comment.
I've simplified the test for now.
There was a problem hiding this comment.
(I hope you don't mind my pushing to your fork @lesteve )
There was a problem hiding this comment.
I've simplified the test for now.
Thanks!
(I hope you don't mind my pushing to your fork @lesteve )
Not at all, pushing to a PR's fork is a feature I use quite often myself; it is very convenient and saves a lot of back and forth for minor things!
* use submit_command="qsub -terse" to make sure only the job id is returned.
Remove test_errors. PBS_ACCOUNT does not have an equivalent in SGE.
|
OK, I think this is in a mergeable state, comments more than welcome! It is a first implementation of SGE (probably by no means perfect) and allows testing some of the common code in
A few comments:
More details about the adaptive problems I saw (maybe related to #26, not sure):
dask-jobqueue/dask_jobqueue/core.py
Lines 220 to 224 in bdb2c26
scale_down only does something if isinstance(workers, dict). The problem is that workers is a list of tcp://... worker addresses, so scale_down does nothing, i.e. self.jobs stays the same and scale_up doesn't see the need to schedule more jobs:
dask-jobqueue/dask_jobqueue/core.py
Lines 217 to 218 in bdb2c26
In other words, your *Cluster Python object thinks it still has enough workers, but your scheduling system has no jobs in the running state that can process the work.
I guess this is all fixable by looking at the doc, the |
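To make the failure mode described above concrete, here is a minimal sketch. It is a hypothetical toy class, not the actual code in core.py: the names ToyCluster and submitted, and the worker addresses, are invented for illustration. It only mimics the described behaviour, where scale_down ignores anything that is not a dict.

```python
class ToyCluster:
    """Hypothetical stand-in for a *Cluster object (not the real JobQueueCluster)."""

    def __init__(self):
        self.jobs = {}      # bookkeeping of jobs the object believes are alive
        self.submitted = 0  # counts how many jobs were ever submitted

    def scale_up(self, n):
        # Only submit the jobs that appear to be missing.
        for _ in range(n - len(self.jobs)):
            self.submitted += 1
            self.jobs[self.submitted] = "job-%d" % self.submitted

    def scale_down(self, workers):
        # Mimics the described behaviour: only a dict is handled.
        if isinstance(workers, dict):
            for key in list(workers):
                self.jobs.pop(key, None)


cluster = ToyCluster()
cluster.scale_up(2)

# Adaptive passes a list of worker addresses, so nothing is removed.
cluster.scale_down(["tcp://10.0.0.1:4000", "tcp://10.0.0.2:4000"])
jobs_after_scale_down = len(cluster.jobs)

# The bookkeeping still claims two jobs, so no new job is submitted.
cluster.scale_up(2)
submitted_total = cluster.submitted
```

With a list argument the bookkeeping never shrinks, so a later scale_up call sees nothing to do even though no job may actually be running.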
guillaumeeb
left a comment
There was a problem hiding this comment.
Some minor fixes and one question, but this looks really fine to me. Thanks!
| project=None, | ||
| resource_spec=None, | ||
| walltime='0:30:00', | ||
| interface=None, |
There was a problem hiding this comment.
I believe interface=None should not be here. It is a keyword from the parent class, and you don't seem to use it.
| project : str | ||
| Accounting string associated with each worker job. Passed to | ||
| `#$ -A` option. | ||
| threads_per_worker : int |
There was a problem hiding this comment.
For all the parameters inherited from JobQueueCluster, you should use the same mechanism as in PBSCluster or SLURMCluster, with the docstrings module: %(JobQueueCluster.parameters)s
There was a problem hiding this comment.
I believe the sge.py file is still missing some imports and annotations, e.g.
dask-jobqueue/dask_jobqueue/pbs.py
Line 5 in 2e406ae
and
dask-jobqueue/dask_jobqueue/pbs.py
Line 10 in 2e406ae
You also still have the description of interface kw in the docstring!
There was a problem hiding this comment.
Ah right I fixed that, I am not so familiar with the docrep magic!
| with self.job_file() as fn: | ||
| out = self._call([self.submit_command, fn]) | ||
| job = out.decode().split('.')[0] | ||
| out = self._call(shlex.split(self.submit_command) + [fn]) |
There was a problem hiding this comment.
I don't know shlex, is this mandatory? What purpose does it serve?
There was a problem hiding this comment.
In SGE, by default qsub returns quite a verbose output, e.g. something like Your job 56 ("test.sh") has been submitted. In order for just the job id to be returned, you need to use qsub -terse. This is why submit_cmd = 'qsub -terse'.
That means that you need to split submit_cmd. I think shlex.split is the way to do it for sh commands. We could just do submit_cmd.split(' ') but it may break e.g. if one of the arguments is quoted with a space inside.
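As a small illustration of the difference (a sketch only: the job file path and the quoted -o argument below are made up for the example and are not taken from this PR):

```python
import shlex

submit_command = "qsub -terse"
job_file = "/tmp/job.sh"  # hypothetical job script path

# What the submission code builds: command tokens plus the script path.
argv = shlex.split(submit_command) + [job_file]

# A command with a quoted argument containing a space (illustrative only).
quoted_command = 'qsub -terse -o "/tmp/my logs/out.log"'

naive = quoted_command.split(' ')     # breaks the quoted argument in two
proper = shlex.split(quoted_command)  # honours sh-style quoting
```

Here naive ends up with '"/tmp/my' and 'logs/out.log"' as two separate tokens, while proper keeps '/tmp/my logs/out.log' as a single argument with the quotes removed.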
| 'resource_spec': resource_spec,} | ||
| self.job_header = self._header_template % self.config | ||
|
|
||
| logger.debug("Job script: \n %s" % self.job_script()) |
There was a problem hiding this comment.
We should probably debug log only the header here, but this is also true for PBS or SLURM, and it is really a minor detail...
|
And about your other remarks:
|
|
@guillaumeeb thanks for the review! I tackled your comments, let me know if you have some further comments! |
| option. | ||
| walltime : str | ||
| Walltime for each worker job. | ||
| interface : str |
There was a problem hiding this comment.
You still have the interface keyword in this docstring; once it is removed I think we're done :)
There was a problem hiding this comment.
Arrggh good point, I just pushed the fix.
| out = self._call([self.submit_command, fn]) | ||
| job = out.decode().split('.')[0] | ||
| out = self._call(shlex.split(self.submit_command) + [fn]) | ||
| job = out.decode().split('.')[0].strip() |
There was a problem hiding this comment.
And by the way the .strip() here is necessary because qsub -terse output finishes with a \n.
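For reference, a minimal sketch of that parsing step in isolation. The helper name parse_job_id is hypothetical, and the sample bytes are made up to resemble terse output rather than captured from a real cluster:

```python
def parse_job_id(out):
    """Extract the job id from the raw bytes returned by the submit command.

    Hypothetical helper mirroring the expression used in the diff:
    out.decode().split('.')[0].strip()
    """
    return out.decode().split('.')[0].strip()


sample = b"56\n"  # made-up output: a job id followed by a newline

without_strip = sample.decode().split('.')[0]  # keeps the trailing newline
with_strip = parse_job_id(sample)              # newline removed
```

Without the .strip() call the trailing newline stays attached to the id, so any later use of the id as a key or as a command-line argument would carry it along.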
|
Thanks @lesteve! |
|
Thanks @lesteve!
|
|
Nice to see this merged! I'll try to find some time to fix the adaptive problem. |
I revived #6 and added simple tests. I was able to get test_basic to pass but not test_adaptive yet.
I'll wait until #25 is merged before working on this further.
Fix #3. Closes #6.