Error when scaling a function over a wider range of parameters (LSFCluster) #139

@adamhaber

Hi,

I'm trying to compute some function over a wide range of parameters using LSFCluster.

The general outline is as follows:

import itertools
from dask.distributed import progress

def f(x, y):
   ...

futures = [f(x, y) for x, y in itertools.product(range(X), range(Y))]
x = progress(client.compute(futures))
x

When I try to compute with X=Y=20, everything goes smoothly.
However, when I increase the range of parameters over which I'm computing f(x,y) (for example, X=Y=100), I get an error message I don't understand:

distributed.scheduler - ERROR - '856313'
Traceback (most recent call last):
  File "/home/adamh/miniconda3/lib/python3.5/site-packages/distributed/scheduler.py", line 1267, in add_worker
    plugin.add_worker(scheduler=self, worker=address)
  File "/home/adamh/miniconda3/lib/python3.5/site-packages/dask_jobqueue/core.py", line 61, in add_worker
    self.running_jobs[job_id] = self.pending_jobs.pop(job_id)
KeyError: '856313'

Just to be sure, I ran bjobs -r, and there is indeed a running job with job ID 856313. I get similar error messages for many other workers.
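
The failing line in the traceback is a dictionary pop on a key that is not present. Below is a minimal sketch of that failure mode using plain dictionaries. The names mirror the traceback, but this is only an illustration, not the dask_jobqueue implementation, and the dictionary contents are hypothetical. It does not establish why the job ID is missing from pending_jobs in the real run.

```python
# Illustration only: plain dicts standing in for the plugin's bookkeeping.
pending_jobs = {"856312": "worker-a"}   # hypothetical contents
running_jobs = {}


def add_worker(job_id):
    """Move a job from pending to running, as the traceback line does."""
    running_jobs[job_id] = pending_jobs.pop(job_id)


add_worker("856312")          # known ID: moved into running_jobs

try:
    add_worker("856313")      # ID that was never registered as pending
except KeyError as err:
    missing = err.args[0]     # the key that pop() could not find
```

So the error means the scheduler plugin saw a worker announce a job ID that it had not recorded as pending, even though the job itself exists on the LSF side.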

Some more info which might be relevant:

  1. When I change f to be some really simple function (f(x,y)=x+y), the problem disappears.
  2. During runtime, f writes temporary files to a tmp directory (each f(x,y) call creates its own temp directory). Perhaps this involves too many disk operations?
  3. When I run everything locally, it takes forever (hence dask :-)) but doesn't crash and doesn't fill up the memory.
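
Regarding point 2, one way to keep per-call scratch space bounded is to let each call create and delete its own directory with the standard library's tempfile.TemporaryDirectory. This is a minimal sketch, assuming f only needs its files while it runs. The body of f here is a hypothetical stand-in, not the original function, and whether disk activity is related to the error above is not known.

```python
import os
import tempfile


def f(x, y):
    """Hypothetical stand-in for the real f: writes a scratch file, returns a sum."""
    # The directory and its contents are removed when the with-block exits,
    # even if the body raises an exception.
    with tempfile.TemporaryDirectory(prefix="f_{}_{}_".format(x, y)) as tmpdir:
        path = os.path.join(tmpdir, "scratch.txt")
        with open(path, "w") as fh:
            fh.write("{} {}".format(x, y))
        with open(path) as fh:
            a, b = (int(tok) for tok in fh.read().split())
        result = a + b
    # tmpdir is returned only so the cleanup can be checked afterwards.
    return result, tmpdir


value, used_dir = f(2, 3)
```

If node-local storage is available on the cluster, the dir argument of TemporaryDirectory can point there instead of a shared filesystem.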

Any help would be much appreciated!
