Pbs docker ci by guillaumeeb · Pull Request #47 · dask/dask-jobqueue

guillaumeeb · 2018-04-27T22:08:27Z

Bringing this work to the attention of everyone. Should close #41.

This is not fully working yet, but I need to begin integration with Travis, and maybe get some feedback.
Locally on my computer, test_basic is passing, but its teardown fails, and I don't really know what's happening. We'll see in Travis.

jhamman · 2018-04-27T23:05:48Z

Nice to see that you're making progress here. It doesn't look like travis is picking up the pbs environment in the matrix though...

Also, can you remove the following files:

dask-worker-space/global.lock
dask-worker-space/purge.lock
dask-worker-space/worker-t461kxij.dirlock

guillaumeeb · 2018-04-28T06:30:46Z

Thanks for the comment, I've cleaned up the commits and removed the files.
Travis is triggered this time, we'll see how it goes.

mrocklin · 2018-04-28T06:39:12Z

For this failure:

    def close(self, all_fds=False):
        self.closing = True
        for fd in list(self.handlers):
            fileobj, handler_func = self.handlers[fd]
            self.remove_handler(fd)
            if all_fds:
                self.close_fd(fileobj)
        self.asyncio_loop.close()
>       del IOLoop._ioloop_for_asyncio[self.asyncio_loop]
E       KeyError: <_UnixSelectorEventLoop running=False closed=True debug=False>

I recommend using distributed master if possible

guillaumeeb · 2018-04-28T17:08:35Z

So test_adaptive is almost working on my computer: it only fails because the cluster.jobs object is not cleaned up. Otherwise, the cluster scales up and down automatically, as expected.
This is not the case on Travis, though; I don't know why yet.

If I manage to make it work on Travis as on my laptop, I propose to just comment out the part testing whether cluster.jobs is empty for now, so that the adaptive cluster fix is tackled in another pull request.

Any opinion on that?

mrocklin · 2018-04-29T09:29:35Z

Removing that part of the test seems fine to me.

lesteve · 2018-05-02T14:20:45Z

> If I manage to make it work on Travis as on my laptop, I propose to just comment out the part testing whether cluster.jobs is empty for now, so that the adaptive cluster fix is tackled in another pull request.

Great to see progress on having CI for PBS! I agree that the adaptive tests should be fixed in a separate PR. Last time I looked, the problems I found were summed up in #27 (comment).

guillaumeeb · 2018-05-02T14:53:45Z

Thanks for the feedback.
I think I've reproduced the Travis problem on my laptop and identified its source: PBS never fully completes the jobs because it cannot copy their stderr and stdout files back to the master. As a result, the two jobs from the previous test never release their reserved resources, so the job from the adaptive test cannot run.
I will try to work on it in my free time.

lesteve

Some comments from a quick glance

lesteve · 2018-05-02T16:13:03Z

+    # start pbs cluster
+    cd ./ci/pbs
+    docker-compose up -d
+    while [ `docker exec -it -u pbsuser pbs_master pbsnodes -a | grep "Mom = pbs_slave" | wc -l` -ne 2 ]


Can we have this in start-pbs.sh, similar to what is done in ci/sge/start-sge.sh? IIRC this is useful for running the tests locally.

Actually, locally I just use the commands

    cd ci/pbs
    docker-compose up -d

I can then see when the cluster is started. I feel there is no need for a script to do this.

I would prefer to avoid having too many scripts; I feel it is easier to understand this way. When somebody new looks at the Travis build config file, they don't have to go through two other files just for the before_install phase.

If we want to make things easier to test locally, maybe we should add a script that calls, in order, all the jobqueue_* functions from pbs.sh or, better, from any jobqueue.sh file.

> I would prefer to avoid having too many scripts; I feel it is easier to understand this way. When somebody new looks at the Travis build config file, they don't have to go through two other files just for the before_install phase.

I totally get your point that using scripts adds another level of indirection. I think we are coming at it from different points of view: I have looked at quite a few different .travis.yml files, and to me the dask-jobqueue one is nicely laid out and straightforward to follow. What I am a newbie about, though, is the docker side, and I feel this may be the case for other potential contributors to and users of dask-jobqueue. Being able to just run a bash script without having to know too much about docker is a great advantage.

On top of the docker-compose up -d, my understanding is that the while loop checks that the cluster is correctly set up. IIRC when I was working on the SGE CI, there could be a delay of more than 1 minute before exiting the while loop, and it's nice to have something at the end confirming that everything went fine.

Also, I think consistency between the CI setups of the different schedulers is very important.

Okay, you got me with the last consistency argument.
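The readiness check discussed in this thread (poll pbsnodes until both slaves are registered) could also be given an upper time bound, so that a broken cluster fails fast instead of hanging the build. What follows is only a rough sketch under stated assumptions, not the content of start-pbs.sh: the function names, the 300-second timeout and the 5-second polling interval are all hypothetical, and only the docker exec command and the "Mom = pbs_slave" pattern come from the diff above.

```python
import subprocess
import time


def count_registered_nodes():
    """Count the PBS slaves listed by pbsnodes inside the pbs_master container."""
    # Same command as in the diff, minus -it (no terminal is attached here).
    result = subprocess.run(
        ["docker", "exec", "-u", "pbsuser", "pbs_master", "pbsnodes", "-a"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return sum("Mom = pbs_slave" in line for line in result.stdout.splitlines())


def wait_for_nodes(count_nodes=count_registered_nodes, expected=2,
                   timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll count_nodes() until it returns `expected`.

    Raises TimeoutError once `timeout` seconds have elapsed without success.
    The timeout and interval defaults are arbitrary illustrative values.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = count_nodes()
        if found == expected:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "expected %d PBS nodes, found %d after %.0f s"
                % (expected, found, timeout)
            )
        sleep(interval)
```

Calling wait_for_nodes() after docker-compose up -d would then either return 2 or raise, which gives the confirmation at the end that everything went fine. The count_nodes and sleep parameters exist only so the loop can be exercised without a running cluster.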

lesteve · 2018-05-02T16:13:54Z


-function jobqueue_after_success {
-    echo "Hurrah"
+function jobqueue_after_script {


Can you remind me of the difference between after_success and after_script? Is it that after_script can still fail the build but after_success cannot?

The difference, as I understand it, is that after_success runs only if the script succeeded, whereas after_script runs whether the script succeeded or failed.
I don't believe either of them can fail the build.
I needed after_script to debug the failures.

lesteve · 2018-05-02T16:16:59Z

+qmgr -c "create node pbs_slave_2"
+
+# wait until the end of tests
+/bin/sleep 3600


Why the sleep? Could we not do the same as in SGE i.e. start a Python HTTP server? To be honest, I am not sure why we do that either ...

Both work. The idea is to have a blocking process so that the docker run command does not exit and the container stays up during the tests.
I just found it weird to start a Python HTTP server just to block the launch script on a process, but this is a perfectly working solution, which would last longer than mine 🙂.

Ideally, we should block on one of the PBS processes, or on one of the SGE processes in the other case.

> I just found it weird to start a Python HTTP server just to block the launch script on a process, but this is a perfectly working solution, which would last longer than mine

OK, this may show my sheer ignorance of docker-related things, but my worry is this: if I start the cluster locally on my laptop, forget about it, and come back the next day, will I still be able to do things like docker exec -it pbs_master '/bin/bash -c'?

> will I still be able to do things like docker exec -it pbs_master '/bin/bash -c'

Nope, you won't! The cluster will shut down after 1 hour, as you expected.

I get your point here too. I don't really like the idea of starting a fake Python HTTP server just to have a hanging process, but I don't really have time to look for other solutions, so let's do it this way 😁.
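For reference, the requirement in this thread is only "a process that never exits on its own", so neither a fixed sleep nor an HTTP server is strictly needed. A hypothetical third option, not what this PR or the SGE setup does, is a tiny script that blocks until the container is asked to stop. The function names here are made up for illustration; the sketch relies on docker stop sending SIGTERM to the container's main process.

```python
import signal
import threading


def install_stop_handlers(stop, signals=(signal.SIGTERM, signal.SIGINT)):
    """Arrange for the `stop` event to be set when one of `signals` arrives.

    Must be called from the main thread, as required by signal.signal().
    """
    def handler(signum, frame):
        stop.set()

    for sig in signals:
        signal.signal(sig, handler)


def block_until_stopped(stop, timeout=None):
    """Block until `stop` is set.

    With timeout=None this waits indefinitely. Returns True if the event
    was set, False if the timeout expired first.
    """
    return stop.wait(timeout)
```

A launch script could end with stop = threading.Event(), install_stop_handlers(stop) and block_until_stopped(stop): the container then stays up for as long as needed (no 1-hour limit) and still exits promptly on docker stop.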

lesteve · 2018-05-03T12:31:09Z

@@ -0,0 +1,42 @@
+version: "3"


I need to use version: "2" with my version of docker-compose (1.8.0) (I followed the docker installation doc on a Ubuntu 16.04 box). Is there a reason you are using version: "3" (newbie question again ...)?

I don't think I need version 3 here, which is more for using docker stack deploy or things like that IIRC. I will update that.

I find it weird that you installed such an old version of docker-compose, though; following the official doc should currently install 1.21: https://docs.docker.com/compose/install/.

Probably I only followed the install instructions for docker and assumed they would cover docker-compose too ... I am guessing this is the docker-compose from apt on Ubuntu 16.04 ...
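The incompatibility in this thread comes down to a minimum docker-compose release per Compose file format. As a small illustration (the helper names are hypothetical, and the minimum releases are quoted from the Compose file versioning documentation as best I recall it: 1.6.0 for format "2", 1.10.0 for format "3"), the check can be written as a tuple comparison:

```python
# Minimum docker-compose release per Compose file format major version.
# Values as recalled from the Compose file versioning documentation.
MIN_COMPOSE_RELEASE = {"2": (1, 6, 0), "3": (1, 10, 0)}


def parse_release(text):
    """Turn a release string such as '1.8.0' into a comparable tuple (1, 8, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def supports_file_format(compose_release, file_format):
    """Return True if this docker-compose release can read the given file format."""
    return parse_release(compose_release) >= MIN_COMPOSE_RELEASE[file_format]
```

Under these assumptions, supports_file_format("1.8.0", "3") is False while supports_file_format("1.8.0", "2") is True, which matches what was observed with docker-compose 1.8.0.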

guillaumeeb · 2018-05-04T13:13:07Z

I believe this PR is ready to be merged. Does anyone have any comments?

jhamman

Thanks for sticking with this one!

Squashing commit into one. Adding CI with dockerized PBS

ba2c15b

WIP: adding CI with a dockerized PBS cluster
almost there
Working pbs docker cluster, fix was to add user on slaves
Test are almost working, may need feedback
Adding new job in Travis.
removing unused files

guillaumeeb force-pushed the pbs_docker_ci branch from e97d0ac to ba2c15b April 28, 2018 06:25

guillaumeeb added 2 commits April 28, 2018 11:51

Use latest distributed versio from master

9230e5e

Fixing versions of OS and PBS for stability

a6a1ec0

guillaumeeb added 2 commits April 29, 2018 14:48

(Altered) tests workings with Docker on laptop. Modifying travis conf to add some debug

e6d5741

changing PBS scheduling time. Adding some trace at the end

8f138ee

lesteve reviewed May 2, 2018

guillaumeeb and others added 2 commits May 3, 2018 00:21

Disabling scp from stdout and stderr at the end of the jobs

ffd8702

Consistency with sge ci, improved debugging in travis

7af1a11

lesteve reviewed May 3, 2018

docker-compose version 2 should be enough

ca514d9

jhamman approved these changes May 4, 2018

guillaumeeb mentioned this pull request May 7, 2018

Use released version of distributed. #54

Merged

guillaumeeb changed the title from "[WIP] Pbs docker ci" to "Pbs docker ci" May 7, 2018

guillaumeeb mentioned this pull request May 14, 2018

Slurm CI implementation #57

Merged

jhamman merged commit 8d243b6 into dask:master May 14, 2018

guillaumeeb deleted the pbs_docker_ci branch August 27, 2018 11:12
