OverflowError when sending large sparse arrays #366

Description

@jcrist

I don't yet have a small reproducible example, but this happens every time I try to collect many large sparse arrays. I do have a notebook that reproduces it and can make that available. The traceback:

Traceback (most recent call last):
  File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/core.py", line 266, in write
    frames = protocol.dumps(msg)
  File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/protocol.py", line 81, in dumps
    frames = dumps_msgpack(small)
  File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/protocol.py", line 155, in dumps_msgpack
    fmt, payload = maybe_compress(payload)
  File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/protocol.py", line 137, in maybe_compress
    compressed = compress(payload)
OverflowError: size does not fit in an int

A few notes:

  • Each array is roughly 675000 x 745 and ~1% dense. The total size of indices + indptr + data is ~40MB per array.
  • I can get each array individually, so it's not a problem with a single chunk being too large.
  • The error appears only when I'm collecting enough arrays at once (for my size, 39 and lower works fine).
  • At 41 arrays I get the above error; at 40 arrays I get a different (but probably related) error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-7b87709b6c67> in <module>()
----> 1 res = t.compute()

/home/jcrist/dask/dask/base.pyc in compute(self, **kwargs)
     84             Extra keywords to forward to the scheduler ``get`` function.
     85         """
---> 86         return compute(self, **kwargs)[0]
     87 
     88     @classmethod

/home/jcrist/dask/dask/base.pyc in compute(*args, **kwargs)
    177         dsk = merge(var.dask for var in variables)
    178     keys = [var._keys() for var in variables]
--> 179     results = get(dsk, keys, **kwargs)
    180 
    181     results_iter = iter(results)

/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/executor.pyc in get(self, dsk, keys, **kwargs)
   1008 
   1009         if status == 'error':
-> 1010             raise result
   1011         else:
   1012             return result

ValueError: corrupt input at byte 2
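
Since 39 arrays or fewer come back fine and each array can be fetched individually, one possible stopgap is to collect the results in smaller batches rather than all at once. The following is only a minimal sketch of that idea, not a fix for the underlying error. The names `batches` and `collect_in_batches` are hypothetical helpers, `collect` is a placeholder for whatever call actually fetches a list of results from the cluster, and the default batch size of 39 is just the largest count that worked for my array sizes.

```python
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_in_batches(items, collect, size=39):
    """Fetch results for `items` a batch at a time and concatenate them.

    `collect` is a hypothetical callable that takes a list of items and
    returns a list of results in the same order. Keeping each batch small
    keeps every individual message below the size that triggered the errors.
    """
    results = []
    for batch in batches(items, size):
        results.extend(collect(batch))
    return results
```

With 41 items and the default size, `collect` would be called twice, once with 39 items and once with 2, and the results come back in the original order.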
