added wait until n workers arguments by danpf · Pull Request #223 · dask/dask-jobqueue

danpf · 2019-01-16T17:03:34Z

Sometimes my scheduler is very full and it can take me a very long time to get a worker. This way, when I do something like:

cluster.adapt(minimum=10,wait_until_n=2)
...{setup client}...
client.scatter([mything])

I don't get a timeout error on the scatter if it takes me 1hr+ to get those workers.

not sure if you are actually interested in this feature, just posting in case you are.

related to
dask/distributed#2454

lesteve · 2019-01-17T08:27:08Z

There was another related discussion in distributed in dask/distributed#2138.

I would encourage you to open a PR in distributed, because it feels like something like this belongs there. I haven't thought about the interplay with adapt and scatter; I feel this is slightly orthogonal, but I may be wrong.

danpf · 2019-01-17T19:53:22Z

it is the same problem as there, but it isn't.
Since jobqueue ClusterManagers don't inherit from distributed Clusters, they both would have to implement this separately.

there is also, I think, a bug here since Adaptive gets the wait_until_n kwarg... but I haven't seen any problems yet.

lesteve · 2019-01-18T08:01:15Z

Since jobqueue ClusterManagers don't inherit from distributed Clusters, they both would have to implement this separately.

I see, good point. Still I think something like this should be done in distributed because it is applicable in other projects (LocalCluster in distributed, KubernetesCluster in dask-kubernetes, etc ...) and I think a PR in distributed would be very welcome.

We can then think about how to have the same functionality in dask-jobqueue with minimal code and deferring to the distributed implementation. Full disclosure: I haven't followed very closely the ClusterManager development so the situation may be a bit more complicated than this, not 100 % sure ...

You may do something like this already, but in the meantime, I would recommend using a workaround like dask/distributed#2138 (comment).

guillaumeeb · 2019-01-21T20:40:30Z

For the reason behind ClusterManager, see #170. The idea is to use dask-jobqueue as an experimental project for designing the common part of the distributed codebase.

The wait_for_n feature here is another one to add to the ClusterManager features. Maybe it is simple enough to be put in distributed right now; I'm not sure how we want to do things here. @mrocklin any opinion?

guillaumeeb

sleep_until_n_workers internals shall be modified, see comment.

guillaumeeb · 2019-01-21T20:46:13Z


+    def sleep_until_n_workers(self, n):
+        '''Block by sleeping until we have n active workers'''
+        while self._count_active_workers() < n:


In ClusterManager, we should not rely on functions from JobQueueCluster. You need to find something that can be called on the scheduler remotely at some point. Or we need the ClusterManager to be informed of any incoming worker.
What should be kept in mind is that the ClusterManager is potentially a different process from the Scheduler, and that it is built to be usable by Cluster implementations other than those of dask-jobqueue.

So you should use len(self.scheduler.workers). At some point, this will use a remote call to the scheduler process to actually get this information.
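
For illustration, here is a minimal sketch of the decoupling described above. It is not the code from this PR: the free function, the `count_workers` callable, and the `timeout` parameter are hypothetical names introduced only for this example. The point is that the waiting logic depends on nothing but "a way to ask how many workers exist", which could be `lambda: len(self.scheduler.workers)` today and a remote call to the scheduler process later.

```python
import time
from typing import Callable, Optional


def wait_until_n_workers(
    count_workers: Callable[[], int],
    n: int,
    poll_interval: float = 1.0,
    timeout: Optional[float] = None,
) -> None:
    """Block until count_workers() reports at least n workers.

    count_workers is a hypothetical hook: any zero-argument callable that
    returns the current worker count. The waiter never touches
    JobQueueCluster-specific functions. timeout is an assumption of this
    sketch and is not part of the PR.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    # n == 0 (or None) means "do not wait at all".
    while n and count_workers() < n:
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError("gave up waiting for %d workers" % n)
        time.sleep(poll_interval)
```

Because this version blocks with `time.sleep`, it is only suitable for synchronous callers that run outside the event loop.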

mrocklin · 2019-01-22T17:09:31Z

No strong opinion from me

guillaumeeb · 2019-01-30T16:07:02Z

Let's work on this in dask-jobqueue first, and we'll see how it can be made available to distributed if needed.

guillaumeeb · 2019-02-05T20:28:22Z

@danpf any interest in finishing this?

danpf · 2019-02-08T19:12:17Z

it's pretty much done. pending tests.

danpf · 2019-02-15T16:48:38Z

There is some test leakage somehow. I'm not sure why, but if I just move some tests around, as in the last commit, the tests pass.

guillaumeeb

I'm honestly not sure what caused the test failures...

Anyway, this looks good to me, thanks @danpf!

If possible, I'd like for @jhamman, @lesteve or @mrocklin to take a quick look at this one.

mrocklin

I've added some feedback. I have some concerns both with adding this as a keyword argument, and also with the current lack of async support.

For context, async support is necessary to make it easy to manage clusters in things like the current dask-jupyterlab extension, or for any future dask-hub project.

mrocklin · 2019-03-04T02:26:51Z

                kwargs['maximum'] = self._get_nb_workers_from_memory(maximum_memory)
        self._adaptive_options.update(kwargs)
        self._adaptive = Adaptive(self.scheduler, self, **self._adaptive_options)
+        self.wait_until_n_workers(wait_until_n)


Adapt can be called within the event loop, so we can't have a while ...: time.sleep() call in there, otherwise things will halt/block.

mrocklin · 2019-03-04T02:28:42Z

+    def wait_until_n_workers(self, n):
+        '''Block by sleeping until we have n active workers'''
+        while n and len(self.scheduler.workers) < n:
+            time.sleep(1)


If we keep this function then I would decrease this sleep significantly, probably to around 10ms. It'll still be way less than 1% cpu time.

It's common in Dask to build functions that are async-friendly first, and then wrap around them with self.cluster.sync. If we're supporting Python 2 then we do this with @gen.coroutine, if we're not then we use async def.
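
As a rough sketch of the "async-friendly first, then wrap" pattern mentioned above, using only the standard library: `FakeScheduler`, `_add_workers`, and `demo_sync` are hypothetical stand-ins written for this example, and `asyncio.run` is used here in place of Dask's own sync machinery, which is not reproduced.

```python
import asyncio


class FakeScheduler:
    """Stand-in for a scheduler; it only tracks a dict of workers."""

    def __init__(self):
        self.workers = {}


async def wait_until_n_workers(scheduler, n, poll_interval=0.01):
    """Yield to the event loop until at least n workers are registered."""
    while n and len(scheduler.workers) < n:
        # Awaiting (instead of time.sleep) lets other coroutines run.
        await asyncio.sleep(poll_interval)


async def _add_workers(scheduler, n, delay=0.02):
    """Simulate workers joining one at a time on the same event loop."""
    for i in range(n):
        await asyncio.sleep(delay)
        scheduler.workers["worker-%d" % i] = object()


async def _demo(n):
    scheduler = FakeScheduler()
    joiner = asyncio.ensure_future(_add_workers(scheduler, n))
    await wait_until_n_workers(scheduler, n)
    await joiner
    return len(scheduler.workers)


def demo_sync(n):
    """Synchronous wrapper: drive the coroutine to completion from blocking code."""
    return asyncio.run(_demo(n))
```

The coroutine is the primary implementation; the synchronous entry point is a thin wrapper, so async callers and blocking scripts share the same logic.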

mrocklin · 2019-03-04T02:29:46Z

        # TODO we should not rely on scheduler loop here, self should have its
        # own loop
        self.scheduler.loop.add_callback(self._scale, n, cores, memory)
+        self.wait_until_n_workers(wait_until_n)


I think that having a keyword like this in many methods is a sign that maybe we shouldn't include it, and instead we should expect users to make two calls.

-cluster.scale(10, wait_until_n=10)
+cluster.scale(10)
+cluster.wait_until_n_workers(10)

Otherwise keyword arguments like that tend to propagate to many new methods, resulting in an n x m maintenance problem.

mrocklin · 2019-03-04T02:32:23Z

Should have said first though, this is nice work. People have asked for things like this before, so thank you @danpf for pushing on it.

mrocklin · 2019-03-04T02:38:20Z

I've added a comment to the original issue: dask/distributed#2138 (comment)

guillaumeeb · 2019-03-04T20:59:36Z

@mrocklin from what I understand of your comment, expecting users to make an explicit call to the method also solves the problem related to the lack of async support, or am I missing something?

mrocklin · 2019-03-04T21:05:15Z

If you don't call wait within scale or adapt then having a synchronous-only method is fine, as long as people don't try to call it from async contexts. Most people don't use async programming so supporting async isn't a big deal. The implementation in the issue should work out OK though, and hopefully isn't too complex.

danpf · 2019-03-05T09:03:22Z

As one of the few people who use async -->
I don't have, nor foresee, any reason to asynchronously scale or adapt. Currently, 100% of the time, my scripts look like:

the distributed client is initialized with adapt or scale, then waits until workers are up,
then actually starts doing complex async workflows.

since we depend on distributed workers for actual computations, there's really nothing else we could be doing instead of synchronously sleeping, so it's not a big deal for us.
I understand the need to plan ahead, but figured I would just mention my actual usage to provide some context.

guillaumeeb · 2019-03-05T10:02:39Z

Even if we leave the wait_until_n_workers code in scale or adapt, there's no reason dask-labextension would call it with a wait kwarg > 0, so I don't think this would block the loop.

Anyway, I'm really new to async programming, so I leave this discussion to both of you. @danpf, is there a downside to handling async calls, other than a few more lines of code?

mrocklin · 2019-03-05T14:29:29Z

there's really nothing else we could be doing instead of synchronously sleeping

Well, there are some things you would be doing, like having the scheduler respond to the workers that are coming in and update its count of current workers. The current loop, if run within the event loop, will never terminate because the scheduler will never be able to acknowledge the new workers.

Calling blocking calls within the event loop effectively shuts down the entire system.
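
The effect described above can be reproduced with a small standard-library sketch. All names here are hypothetical, and the `max_polls` bound exists only so that the blocking variant terminates in a demo; the loop in the PR has no such bound.

```python
import asyncio
import time


async def register_workers(workers, n):
    """Simulate the scheduler acknowledging n incoming workers."""
    for i in range(n):
        await asyncio.sleep(0)  # give control back to the event loop
        workers.append(i)


async def blocking_wait(workers, n, max_polls=5):
    """Poll with time.sleep: never yields, so register_workers cannot run."""
    polls = 0
    while len(workers) < n and polls < max_polls:
        time.sleep(0.01)  # blocks the whole event loop
        polls += 1
    return len(workers)


async def cooperative_wait(workers, n):
    """Poll with asyncio.sleep: yields, so registration can make progress."""
    while len(workers) < n:
        await asyncio.sleep(0.01)
    return len(workers)


async def run(waiter):
    """Schedule registration, then report how many workers the waiter saw."""
    workers = []
    task = asyncio.ensure_future(register_workers(workers, 2))
    seen = await waiter(workers, 2)
    await task  # let registration finish before the loop closes
    return seen
```

With the blocking waiter the registration task is never given a chance to run while the wait is in progress, so the worker count stays at zero until the waiter gives up. With the cooperative waiter both coroutines interleave and the count reaches the target.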

guillaumeeb · 2019-03-06T20:38:27Z

Since moving to async as proposed by @mrocklin in dask/distributed#2138 (comment) doesn't seem to be a big deal, I propose that @danpf implement it, with the sleep time set to 0.01 s, or 0.1 s, which is probably reactive enough.

As for not including the call in scale and adapt, and the associated keyword, I'm still undecided...

guillaumeeb · 2019-03-19T15:14:22Z

@danpf, do you have time to finish this along the lines of what @mrocklin proposes?

danpf · 2019-03-19T16:07:06Z

Sorry, I'm really busy for a while; I can probably do this in ~2 weeks. Apologies if that's too late.

guillaumeeb · 2019-05-09T20:13:21Z

Edit: sorry, wrong GitHub handle...

@danpf, do you think you can follow up on this one in the near future?

guillaumeeb · 2019-05-16T19:48:32Z

So @danpf, I believe this is superseded by dask/distributed#2688, so we can close this one?

danpf · 2019-05-16T19:54:45Z

yup!

added wait until n workers arguments

bc25ba1

guillaumeeb requested changes Jan 21, 2019

guillaumeeb mentioned this pull request Feb 5, 2019

Release 0.4.2 or 0.5.0 #234

Closed

danpf added 2 commits February 8, 2019 11:02

consistent naming + better worker count

7a6a6dd

add test for wait until n

588202d

danpf added 4 commits February 9, 2019 20:52

merge with master

634e965

fix test failing due to loop close

def9bad

take into account 0 in wait_for_n

928db8e

change test ordering

9468835

Trigger CI

c5c59cb

guillaumeeb approved these changes Mar 2, 2019

mrocklin reviewed Mar 4, 2019

guillaumeeb mentioned this pull request Apr 13, 2019

Wait for workers to join before continuing dask/distributed#2138

Closed

danpf closed this May 16, 2019
