I don't yet have a small reproducible example, but I can trigger this every time I try to collect many large sparse arrays. I do have a notebook that reproduces it, and can make that available. The traceback:
Traceback (most recent call last):
File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/core.py", line 266, in write
frames = protocol.dumps(msg)
File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/protocol.py", line 81, in dumps
frames = dumps_msgpack(small)
File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/protocol.py", line 155, in dumps_msgpack
fmt, payload = maybe_compress(payload)
File "/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/protocol.py", line 137, in maybe_compress
compressed = compress(payload)
OverflowError: size does not fit in an int
A few notes:
- Each array is roughly 675000 x 745, and ~1% dense. The total bytes for indices + indptr + data is ~40MB each.
- I can get each array individually, so it's not a problem with a chunk being too large
- The error appears only when I'm collecting enough arrays at once (for arrays of this size, 39 and lower works fine).
- At 41 arrays I get the above error, 40 arrays gives me a different (but probably related) error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-7b87709b6c67> in <module>()
----> 1 res = t.compute()
/home/jcrist/dask/dask/base.pyc in compute(self, **kwargs)
84 Extra keywords to forward to the scheduler ``get`` function.
85 """
---> 86 return compute(self, **kwargs)[0]
87
88 @classmethod
/home/jcrist/dask/dask/base.pyc in compute(*args, **kwargs)
177 dsk = merge(var.dask for var in variables)
178 keys = [var._keys() for var in variables]
--> 179 results = get(dsk, keys, **kwargs)
180
181 results_iter = iter(results)
/home/jcrist/miniconda/envs/dask_learn/lib/python2.7/site-packages/distributed/executor.pyc in get(self, dsk, keys, **kwargs)
1008
1009 if status == 'error':
-> 1010 raise result
1011 else:
1012 return result
ValueError: corrupt input at byte 2
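As a possible stopgap, since collecting 39 or fewer arrays at once works, pulling the results back in smaller groups might sidestep both errors. This is only a sketch: it assumes the failure depends on how much data goes into a single message, which I haven't confirmed. `batches` is a hypothetical helper, not part of dask or distributed, and `futures` and `gather` in the usage comment are placeholders for however the arrays are actually referenced and fetched.

```python
def batches(items, size):
    """Yield consecutive slices of ``items`` holding at most ``size`` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical usage (placeholders, not real API calls):
# results = []
# for chunk in batches(futures, 39):   # 39 was the largest count that worked
#     results.extend(gather(chunk))
```

The batch size of 39 comes straight from the observation above; a different array size would presumably need a different limit.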