Skip to content

Dask client connects to SLURM workers, then rapidly loses them #20

Description

@bw4sz

Summary

When adding workers to a SLURM dask client, workers are added as resources are provisioned by the scheduler, but then they quickly disappear. Presumably they are killed by the client because a lack of connection (--death_timeout flag). Its not clear whether this is intended behavior. My goal is to add workers to a dask client, connect to that client from my local laptop using jupyter lab. By the time I ssh tunnel in from my laptop, all the workers are killed.

(pangeo) [b.weinstein@c30b-s1 ~]$ python
Python 3.6.4 | packaged by conda-forge | (default, Dec 23 2017, 16:31:06) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from dask_jobqueue import SLURMCluster
>>> from datetime import datetime
>>> from time import sleep
>>> 
>>> cluster = SLURMCluster(project='ewhite',death_timeout=200)
>>> cluster.start_workers(5)
[3, 5, 7, 9, 11]
>>> 
>>> from dask.distributed import Client
>>> client = Client(cluster)
>>> 
>>> while True:
...     print(datetime.now().strftime("%a, %d %B %Y %I:%M:%S"))
...     print(client)
...     sleep(30)
... 
Wed, 21 March 2018 10:57:19
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>
Wed, 21 March 2018 10:57:49
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:58:19
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:58:49
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:59:19
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 10:59:49
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 11:00:20
<Client: scheduler='tcp://172.16.194.66:35459' processes=6 cores=24>
Wed, 21 March 2018 11:00:50
<Client: scheduler='tcp://172.16.194.66:35459' processes=5 cores=20>
Wed, 21 March 2018 11:01:20
<Client: scheduler='tcp://172.16.194.66:35459' processes=5 cores=20>
Wed, 21 March 2018 11:01:50
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>
Wed, 21 March 2018 11:02:20
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>
Wed, 21 March 2018 11:02:50
<Client: scheduler='tcp://172.16.194.66:35459' processes=0 cores=0>

Expected Behavior

Following this helpful screencast, I thought that once workers were added, they would remain available for computation. Either the client is very aggressive about pruning unused workers, or something else is wrong.

Comments

  • May be relevant, but unknown. Once I quit the above python session, I am greeted with several error messages saying the TCP stream is closed. Example
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 200, in read
    convert_stream_closed_error(self, e)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError("in %s: %s" % (obj, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed

I can confirm that the workers that were once there, are now gone.

(pangeo) [b.weinstein@c30b-s1 ~]$ squeue -u b.weinstein
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          18361726 hpg2-comp     bash b.weinst  R      22:14      1 c30b-s1

presumably killed by the client.

Edited desk.err file produced, with many hundreds of duplicate lines removed.

(pangeo) [b.weinstein@c30b-s1 ~]$ cat dask.err 
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:36916'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:38675'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:32914'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:44959'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:44970'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:36426'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:35157'
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.194.178:40539'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/b.weinstein/dask-worker-space/worker-r9gleghg', purging
distributed.worker - INFO -       Start worker at: tcp://172.16.194.178:39384
distributed.worker - INFO -          Listening to: tcp://172.16.194.178:39384
distributed.worker - INFO -              nanny at:       172.16.194.178:44959
distributed.worker - INFO -              bokeh at:        172.16.194.178:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-l7jp7he1
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - INFO - Failed to start worker process.  Restarting
...
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.nanny - INFO - Failed to start worker process.  Restarting
distributed.worker - INFO -       Start worker at: tcp://172.16.194.178:46551
distributed.worker - INFO -          Listening to: tcp://172.16.194.178:46551
distributed.worker - INFO -              nanny at:       172.16.194.178:38675
distributed.worker - INFO -              bokeh at:       172.16.194.178:41314
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-yjnhwx2e
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
...
distributed.nanny - INFO - Closing Nanny at 'tcp://172.16.192.163:44012'
...
:35125
distributed.worker - INFO -          Listening to: tcp://172.16.194.178:35125
distributed.worker - INFO -              nanny at:       172.16.194.178:32914
distributed.worker - INFO -              bokeh at:       172.16.194.178:45869
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-w9w338nb
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.178:35125
distributed.worker - INFO -       Start worker at: tcp://172.16.194.184:46207
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:46207
distributed.worker - INFO -              nanny at:       172.16.194.184:45711
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-g0jjm93z
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:46207
orker - INFO -              nanny at:       172.16.194.178:38675
distributed.worker - INFO -              bokeh at:       172.16.194.178:45033
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-u2_1pu5v
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.178:35568
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.194.178:37068', 'dir': '/home/b.weinstein/dask-worker-space/worker-5ipv17oi'}
distributed.dask_worker - INFO - End worker
distributed.nanny - INFO - Failed to start worker process.  Restarting
/envs/pangeo/bin/dask-worker", line 6, in <module>
    sys.exit(distributed.cli.dask_worker.go())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 252, in go
    main()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 243, in main
    loop.run_sync(run)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/ioloop.py", line 582, in run_sync
    return future_cell[0].result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/cli/dask_worker.py", line 236, in run
    yield [n._start(addr) for n in nannies]
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 849, in callback
    result_list.append(f.result())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 155, in _start
    response = yield self.instantiate()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 223, in instantiate
    self.process.start()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1107, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 363, in start
    self._wait_until_started())
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1099, in run
    value = future.result()
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1113, in run
    yielded = self.gen.send(value)
  File "/home/b.weinstein/miniconda3/envs/pangeo/lib/python3.6/site-packages/distributed/nanny.py", line 471, in _wait_until_started
    assert msg == 'started', msg
AssertionError: {'address': 'tcp://172.16.194.178:37661', 'dir': '/home/b.weinstein/dask-worker-space/worker-_b3fhcyu'}
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-25, started daemon)>
...
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:43288
distributed.worker - INFO -              nanny at:       172.16.194.184:43216
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-mo6fnfvo
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:43288
distributed.nanny - INFO - Failed to start worker process.  Restarting
...
:36157
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:36157
distributed.worker - INFO -              nanny at:       172.16.194.184:42948
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-h0jj6gi3
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:36157
distributed.nanny - INFO - Failed to start worker process.  Restarting
...
distributed.worker - INFO -          Listening to: tcp://172.16.194.184:46197
distributed.worker - INFO -              nanny at:       172.16.194.184:38837
distributed.worker - INFO -              bokeh at:        172.16.194.184:8789
distributed.worker - INFO - Waiting to connect to:  tcp://172.16.194.66:35459
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          4
distributed.worker - INFO -                Memory:                    7.00 GB
distributed.worker - INFO -       Local Directory: /home/b.weinstein/dask-worker-space/worker-lft5bgwk
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Stopping worker at tcp://172.16.194.184:46197
...
distributed.nanny - WARNING - Worker process 27618 was killed by unknown signal
...
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-392, 
...
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.nanny - WARNING - Worker process 13695 was killed by unknown signal
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
...
distributed.nanny - WARNING - Worker process 13712 was killed by unknown signal
...
distributed.nanny - WARNING - Worker process still alive after 159 seconds, killing
distributed.dask_worker - INFO - End worker
distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-303, started daemon)>
(pangeo) [b.weinstein@c30b-s1 ~]$ 

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions