Pbs docker ci#47
Conversation
|
Nice to see that you're making progress here. It doesn't look like travis is picking up the pbs environment in the matrix though... Also, can you remove the following files: |
WIP: adding CI with a dockerized PBS cluster almost there Working pbs docker cluster, fix was to add user on slaves Test are almost working, may need feedback Adding new job in Travis. removing unused files
e97d0ac to
ba2c15b
Compare
|
Thanks for the comment, I've cleaned up commits and removed the file. |
|
For this failure: def close(self, all_fds=False):
self.closing = True
for fd in list(self.handlers):
fileobj, handler_func = self.handlers[fd]
self.remove_handler(fd)
if all_fds:
self.close_fd(fileobj)
self.asyncio_loop.close()
> del IOLoop._ioloop_for_asyncio[self.asyncio_loop]
E KeyError: <_UnixSelectorEventLoop running=False closed=True debug=False>I recommend using distributed master if possible |
|
So test_adaptive is almost working on my computer: it only fails because the cluster.jobs object is not cleaned up. Otherwise, cluster is automatically scaling up and down as expected. If I manage to make it work on Travis as on my laptop, I propose to just comment the part testing if cluster.jobs is empty for now, so that adaptive cluster fix is tackled in another pull request. Any opinion on that? |
|
Removing that part of the test seems fine to me. |
Great to see progress on having CIs for PBS! I agree that adaptive tests should be fixed in a separate PR. Last time I looked the problems I found were summed up in #27 (comment). |
|
Thanks for the feedback. |
lesteve
left a comment
There was a problem hiding this comment.
Some comments from a quick glance
| # start pbs cluster | ||
| cd ./ci/pbs | ||
| docker-compose up -d | ||
| while [ `docker exec -it -u pbsuser pbs_master pbsnodes -a | grep "Mom = pbs_slave" | wc -l` -ne 2 ] |
There was a problem hiding this comment.
Can we have this in start-pbs.sh similarly to what is done in ci/sge/start-sge.sh. IIRC this is useful to run the tests locally.
There was a problem hiding this comment.
Actually, locally I just use the commands
cd ci/pbs
docker-compose up -dI can then see when the cluster is started. I feel there is no need of a script to do this.
I would prefer to avoid having too many scripts, I feel that it is easier to understand this way. When somebody new looks at the Travis build config file, he does'nt have to go through 2 other ones just for the before_install phase.
If we want to make things easier to test locally, may be we should add a script which call in order all the jobqueue_ * from pbs.sh, or better, any jobquepue.sh file.
There was a problem hiding this comment.
I would prefer to avoid having too many scripts, I feel that it is easier to understand this way. When somebody new looks at the Travis build config file, he does'nt have to go through 2 other ones just for the before_install phase.
I totally get your point that using scripts adds another level of indirection. I think we are coming at it from a different point of view: I have looked at quite a few different .travis.yml files and to me the dask-jobqueue is nicely layed out and straightforward to follow. What I am a newbie about though is the docker side and I feel this may be the case for other potential contributers/users of dask-jobqueue. Being able to just run a bash script and not have to know too much about docker is a great advantage.
On top of the docker-compose up -d, my understanding is that the while loop is to check that the cluster is correctly set-up. IIRC when I was working on the SGE CI, there could be a delay of more than 1 minute before you exit the while loop and it's nice to have something at the end confirming that everything went fine.
Also I think having consistency between the different schedulers CI setup is very important.
There was a problem hiding this comment.
Okay, you got me with the last consistency argument.
|
|
||
| function jobqueue_after_success { | ||
| echo "Hurrah" | ||
| function jobqueue_after_script { |
There was a problem hiding this comment.
Can you remind me the difference between after_success and after_script, is that that after_script can still fail the build but not after_success?
There was a problem hiding this comment.
The difference as I understand it is that after_success is run only if the script succeeded, and after_script is run no matter if the script succeeded or failed.
I don't believe any of those can fail the build.
I needed after_script to debug the failures.
| qmgr -c "create node pbs_slave_2" | ||
|
|
||
| # wait until the end of tests | ||
| /bin/sleep 3600 |
There was a problem hiding this comment.
Why the sleep? Could we not do the same as in SGE i.e. start a Python HTTP server? To be honest, I am not sure why we do that either ...
There was a problem hiding this comment.
The two are working. The idea is to have a blocking process so as not to exit the docker run command, and have the container always up during the test.
I just found it weird to start a Python HTTP server just to lock the launching script in a process, but this is a perfectly working solution, which would last longer than mine 🙂 .
Idealy, we should block on one of the PBS process, or SGE in the other case.
There was a problem hiding this comment.
I just found it weird to start a Python HTTP server just to lock the launching script in a process, but this is a perfectly working solution, which would last longer than mine
OK this may show my sheer ignorance of docker-related things, but my worry is what happens if I start the cluster locally on my laptop, forget about it and come back the next day, will I still be able to do things like docker exec -it pbs_master '/bin/bash -c'
There was a problem hiding this comment.
will I still be able to do things like docker exec -it pbs_master '/bin/bash -c'
Nope, you won't! Cluster will shut down after 1 hour as you expected.
I get your point here too. I don't really like the idea of starting a fake python HTTP server just to have a hanging process, but I don't really have time to look for other solutions, so lets do it this way 😁 .
| @@ -0,0 +1,42 @@ | |||
| version: "3" | |||
There was a problem hiding this comment.
I need to use version: "2" with my version of docker-compose (1.8.0) (I followed the docker installation doc on a Ubuntu 16.04 box). Is there a reason you are using version: "3" (newbie question again ...)?
There was a problem hiding this comment.
I don't think I need version 3 here, which is more for using docker stack deploy or things like that IIRC. I will update that.
I find it weird that you installed such an old version of docker-compose though, following official doc should install 1.21 currently: https://docs.docker.com/compose/install/.
There was a problem hiding this comment.
Probably I only followed the install instructions for docker and thought that it would do docker-compose too ... I am guessing this is the docker-compose from apt on Ubuntu 16.04 ...
|
I believe this PR is ready to be merged, does anyone have some comment? |
jhamman
left a comment
There was a problem hiding this comment.
Thanks for sticking with this one!
Bringing this work to the attention of everyone. Should close #41.
This is not fully working yet, but I need to begin integration with Travis, and maybe have some feedback.
Locally on my computer, test_basic is passing, but its teardown fails, and I don't really know what's happening. We'll see in Travis.