Skip to content

For loop never ending #333

Description

@eXpensia

Hello,

I'm using dask to analyse some images, my code looks like this :

somefunc(n):
    try:
        long calculation
        return result
    except:
         return to be filtered result

def main():
    num_image = 268828
    X = np.empty((num_image,7), dtype=np.float32)
    cluster = SLURMCluster(cores=32,
                           processes=32,
                           interface ="ib0",
                           walltime="04:10:00",
                           memory = "80GB")
    cluster.scale(600)
    client = Client(cluster)

    with tqdm (total = num_image) as pbar:
        future = client.map(somefunc, range(num_image))
        for future,result in as_completed(future,with_results = True):
            X[result[0],:] = result
            pbar.update(1)
            future.release()

    return X


if __name__ == '__main__':

    x = main()
    io.savemat('/gpfs/home/savedata.mat', {'data': x})

The scripts very well excepts for some iterations and it stays stuck without doing some iterations

100%|█████████▉| 268711/268828 [3:40:11<00:31, 3.68it/s]

Looking at the log files I found that it first looped on a first error :

Traceback (most recent call last):
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/worker.py", line 875, in heartbeat
    metrics=await self.get_metrics(),
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/core.py", line 747, in send_recv_from_rpc
    comm = await self.pool.connect(self.addr)
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/core.py", line 874, in connect
    connection_args=self.connection_args,
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/comm/core.py", line 227, in connect
    _raise(error)
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/comm/core.py", line 204, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://192.168.11.3:16892' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x2b1a8a2bc630>: ConnectionRefusedError: [Errno 111] Connection refused
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2b1a32811c88>>, <Task finished coro=<Worker.heartbeat() done, defined at /gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/worker.py:866> exception=OSError("Timed out trying to connect to 'tcp://192.168.11.3:16892' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x2b1a8a346c50>: ConnectionRefusedError: [Errno 111] Connection refused",)>)
Traceback (most recent call last):
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/comm/core.py", line 215, in connect
    quiet_exceptions=EnvironmentError,
tornado.util.TimeoutError: Timeout

And then I think it's when the wall time as been reached it loops on another one :

Traceback (most recent call last):
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/worker.py", line 857, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}),
  File "/gpfs/home/maxlavaud/miniconda3/envs/env_dask/lib/python3.6/site-packages/distributed/batched.py", line 117, in send
    raise CommClosedError
distributed.comm.core.CommClosedError

Also I've been warned that there is some process emited by dask that are using a lot of memory (10-100MB) on the login node, I don't quit understand where do they come from.

Metadata

Metadata

Assignees

No one assigned

    Labels

    questionFurther information is requested

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions