Wait for n workers before continuing by danpf · Pull Request #2688 · dask/distributed

danpf · 2019-05-13T08:31:00Z

This is the easiest and I think most well supported implementation of waiting for n workers.

Initially I thought it was wrong for this to live in Client, but considering the implementation of Cluster (the bulk of which is in LocalCluster), this is the solution that is most immediately compatible with jobqueue and kubernetes.

The alternative would be to move asynchronous and loop from LocalCluster to Cluster, and then make this a top-level member of Cluster. In addition, jobqueue would need to inherit from Cluster instead of ClusterManager. This is currently tested and working on SLURMCluster. Once we settle on the correct path, I will think of a way to test it without jobqueue.

I'm okay with both ways, but figured I would start with this since it's so low effort.
@mrocklin @guillaumeeb

fix #2138

guillaumeeb

This may very well be sufficient, and eventually supersedes dask/dask-jobqueue#223.

Maybe you will need to add some conditionals like in def _repr_html_(self) (distributed/distributed/client.py, line 780 in 5938763), because the cluster object could be different from what you expect.

I have one slight concern about the future: I'm not sure Cluster objects should have a Scheduler attribute if they are clearly separated as proposed in #2235.

mrocklin

One comment below that goes with @guillaumeeb 's comment about not assuming the presence of the scheduler.

Also, this could use a test. Maybe something like the following:

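from time import time
from tornado import gen
from distributed import Worker
from distributed.utils_test import gen_cluster

@gen_cluster(client=True)
def test_wait_for_workers(c, s, a, b):
    # gen_cluster starts two workers (a, b), so waiting for three must block
    future = c.wait_for_workers(n=3)
    yield gen.sleep(0.1)
    assert not future.done()

    # a third worker joins; the wait should now finish promptly
    w = yield Worker(s.address)
    start = time()
    yield future
    assert time() < start + 1
    yield w._close()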

Also, any thoughts on how to future-proof the method name? I can imagine people wanting other methods like this in the future, and I'd prefer not to have many of them, but to simply add more keyword arguments to this one. Anything we can do to make this a place for such functionality in the future would be welcome.

mrocklin · 2019-05-13T12:41:44Z

+    @gen.coroutine
+    def _wait_until_n_workers(self, n):
+        while n and len(self.cluster.scheduler.workers) < n:
+            yield gen.sleep(0.01)


I agree with @guillaumeeb 's point that we shouldn't assume the existence of either the scheduler or even the cluster attribute. The generic way to do this is to poll the scheduler's identity function.

while True:
    info = yield self.scheduler.identity()
    if len(info['workers']) >= n:
        break
    else:
        yield gen.sleep(0.200)

The self.scheduler attribute used above is a connection to the scheduler that will always be there, rather than an explicit reference to the scheduler itself.
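A minimal sketch of the polling pattern described above, written against a stand-in scheduler rather than the real one. FakeScheduler, its add_worker method, and the wait_for_n_workers helper are hypothetical names for illustration; only the idea of identity() returning a dict with a 'workers' entry comes from the comment. It also assumes asyncio in place of tornado's gen.coroutine.

```python
import asyncio


class FakeScheduler:
    """Hypothetical stand-in for the client's scheduler connection."""

    def __init__(self):
        self._workers = {}

    def add_worker(self, address):
        self._workers[address] = {}

    async def identity(self):
        # Mirrors only the part of the reply used here: a 'workers' mapping.
        return {"workers": dict(self._workers)}


async def wait_for_n_workers(scheduler, n, interval=0.01):
    """Poll identity() until at least n workers are reported."""
    while True:
        info = await scheduler.identity()
        if len(info["workers"]) >= n:
            return len(info["workers"])
        await asyncio.sleep(interval)


async def demo():
    scheduler = FakeScheduler()
    waiter = asyncio.ensure_future(wait_for_n_workers(scheduler, 2))
    await asyncio.sleep(0.03)
    still_waiting = not waiter.done()  # no workers yet, so the wait is pending
    scheduler.add_worker("tcp://127.0.0.1:1")
    scheduler.add_worker("tcp://127.0.0.1:2")
    count = await waiter
    return still_waiting, count
```

Because the loop only talks to the scheduler through identity(), it needs neither a local cluster object nor a direct reference to the scheduler.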

danpf · 2019-05-13T16:02:33Z

Maybe we could do something like wait_until_n('workers', 3), with an if-statement for each kind of thing, where that function passes _wait_until_n a number plus a function that is evaluated until the criterion is met?

    @gen.coroutine
    def _wait_until_n(self, func, n):
        while n and (yield func()) < n:
            yield gen.sleep(0.200)

    def wait_until_n(self, thing, n):
        if thing == 'workers':
            @gen.coroutine
            def f():
                info = yield self.scheduler.identity()
                raise gen.Return(len(info['workers']))
            return self.sync(self._wait_until_n, f, n)

mrocklin · 2019-05-13T16:06:44Z

yes, something like that might work. We could probably use standard keyword arguments for this as well

def wait_for_workers(n_workers=None):
    ...

Then as we add more things we can add more keywords

def wait_for_workers(n_workers=None, memory=None, cores=None):
    ...
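One way to read the keyword proposal is that each keyword is an optional threshold, and the wait ends once every supplied threshold is met. The sketch below shows only that check. The requirements_met helper and the per-worker keys 'memory_limit' and 'nthreads' are assumptions made for illustration, not the scheduler's documented reply format.

```python
def requirements_met(info, n_workers=None, memory=None, cores=None):
    """Return True once every threshold that was supplied is satisfied.

    `info` is assumed to look like {'workers': {address: {...}}}; the
    per-worker keys 'memory_limit' and 'nthreads' are hypothetical.
    """
    workers = info["workers"].values()
    if n_workers is not None and len(workers) < n_workers:
        return False
    if memory is not None and sum(w["memory_limit"] for w in workers) < memory:
        return False
    if cores is not None and sum(w["nthreads"] for w in workers) < cores:
        return False
    return True


# Example data with two workers (made up for the sketch)
example_info = {
    "workers": {
        "tcp://127.0.0.1:1": {"memory_limit": 4e9, "nthreads": 2},
        "tcp://127.0.0.1:2": {"memory_limit": 4e9, "nthreads": 2},
    }
}
```

Keywords left as None are ignored, so adding a new keyword later would not change the behaviour of existing calls.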

danpf · 2019-05-13T16:29:48Z

@guillaumeeb what should we do when the scheduler isn't there? Currently I have it sleep until the scheduler appears, but maybe we should raise an error, or raise an error after a counter triggers.
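The "raise an error after a counter triggers" option could look roughly like the following. This is a sketch of one possible behaviour, not something this PR implements; poll_until, its timeout parameter, and the use of asyncio in place of tornado are all assumptions.

```python
import asyncio
import time


async def poll_until(check, timeout=1.0, interval=0.01):
    """Await check() repeatedly; raise asyncio.TimeoutError if it never passes."""
    deadline = time.monotonic() + timeout
    while not await check():
        if time.monotonic() >= deadline:
            raise asyncio.TimeoutError("condition not met within %.2fs" % timeout)
        await asyncio.sleep(interval)


async def never_ready():
    return False


async def always_ready():
    return True


def timed_out(check, timeout):
    """Helper for the example: report whether poll_until gave up."""
    try:
        asyncio.run(poll_until(check, timeout=timeout))
    except asyncio.TimeoutError:
        return True
    return False
```

A deadline based on a monotonic clock is used instead of an iteration counter so that the limit does not depend on the polling interval.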

mrocklin · 2019-05-13T16:42:41Z

+        if workers:
+            @gen.coroutine
+            def f():
+                info, _ = self._get_current_info_and_scheduler()


Thoughts on using yield self.scheduler.identity() here instead?

_get_current_info_and_scheduler helps make sure that we have a scheduler and a cluster, and that they are valid. It's a wrapper around the first part of _repr_html_.

Your suggested way seems fine to me, and was the first way I did it, but I was just doing what @guillaumeeb suggested. I am definitely less aware of all of the possible scenarios than you guys.

"Maybe you will need to add some conditionals like in"

Is there no need to worry about that?

Please just use yield self.scheduler.identity(). It's simpler and cleaner. If you run into an issue with this then bring it up and we'll handle it. I think that you'll be fine though.

"helps make sure that we have a scheduler and a cluster, and that they are valid"

We may not ever have a cluster object locally. That shouldn't be required.

If you want to wait until things are set up cleanly then you could yield self, but that should already have run, and I don't think it should be necessary.

mrocklin · 2019-05-13T16:43:46Z

+        while n and (yield func()) < n:
+            yield gen.sleep(0.2)
+
+    def wait_until_n(self, workers=0):


I recommend the name wait_for_workers instead. I'm not sure that a novice user will understand what wait_until_n means as immediately.

mrocklin · 2019-05-14T01:30:00Z

+            def f():
+                info = yield self.scheduler.identity()
+                raise gen.Return(len(info['workers']))
+            return self.sync(self._wait_for_workers, f, n_workers)


Rather than have the nested coroutine here I recommend that you unpack the definition of f in _wait_for_workers. I think that this should do it.

def _wait_for_workers(self, n_workers=0):
    info = yield self.scheduler.identity()
    while len(info['workers']) < n_workers:
        yield gen.sleep(0.1)
        info = yield self.scheduler.identity()

def wait_for_workers(self, n_workers=0):
    return self.sync(self._wait_for_workers, n_workers=n_workers)
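The split suggested here, one coroutine that does the polling and a thin blocking method that runs it, can be sketched without dask as follows. MiniClient and StubScheduler are hypothetical, and asyncio.run is only a stand-in for the client's self.sync, which in the real client runs the coroutine on the client's own event loop.

```python
import asyncio


class StubScheduler:
    """Hypothetical scheduler connection that reports one more worker per call."""

    def __init__(self):
        self.calls = 0

    async def identity(self):
        self.calls += 1
        return {"workers": {i: {} for i in range(self.calls)}}


class MiniClient:
    def __init__(self, scheduler):
        self.scheduler = scheduler

    async def _wait_for_workers(self, n_workers=0):
        # Same loop as the suggestion above, with await instead of yield.
        info = await self.scheduler.identity()
        while len(info["workers"]) < n_workers:
            await asyncio.sleep(0.01)
            info = await self.scheduler.identity()

    def wait_for_workers(self, n_workers=0):
        # Blocking entry point; asyncio.run is a stand-in for self.sync.
        return asyncio.run(self._wait_for_workers(n_workers=n_workers))


client = MiniClient(StubScheduler())
client.wait_for_workers(n_workers=3)
```

With the default n_workers=0 the loop body never runs, so the call returns after a single identity() request.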

mrocklin · 2019-05-14T18:30:53Z

This looks good to me. There are a couple of unrelated intermittent failures. Also it looks like there are some linting issues. I recommend using black as described in https://github.com/dask/distributed/blob/master/CONTRIBUTING.md

See also https://travis-ci.org/dask/distributed/jobs/532348315

mrocklin · 2019-05-15T17:04:45Z

This looks good to me. Thanks @danpf . Merging

Easiest solution

5938763

guillaumeeb reviewed May 13, 2019

mrocklin reviewed May 13, 2019

dont assume scheduler existance

5db867d

danpf added 2 commits May 13, 2019 09:16

Give wait until n args

89dcd44

Check if scheduler info exists yet

6b10dd5

mrocklin reviewed May 13, 2019

danpf added 3 commits May 13, 2019 18:08

suggested changes

38d6d37

add simple test

b56720a

modify test and waiting time

ca70bd0

mrocklin reviewed May 14, 2019

danpf added 2 commits May 13, 2019 20:08

move coroutine in wait_for_workers

7d5e5fb

add docstring to wait_for_workers

797b2d8

black changes

16d4166

mrocklin merged commit 73362ea into dask:master May 15, 2019

guillaumeeb mentioned this pull request May 16, 2019

added wait until n workers arguments dask/dask-jobqueue#223

Closed
