Skip to content

zero check in encoding less than ideal for lazy arrays #3627

@hmaarrfk

Description

@hmaarrfk

The zero check in the encoding path is less than ideal

if not chunk_spec.config.write_empty_chunks and chunk_array.all_equal(

For lazy arrays, it eventually calls:

  1. np.broadcast_arrays
  2. np.array_equal

where the latter can instantiate a full array in memory.

The problem is that some lazy array implementation will not "keep this instantiation of the interediate full memory array" in memory for the the operation 2 lines down, that will eventually write the array.

So if the instantiation of the lazy array is costly, we are:

  1. Instantiating it once for checking for zeros.
  2. Instantiating a second time to write it to the store.

I would love to avoid this 'double cost' in a "pythonic way".

I would have loved to suggest
My proposal would be to refactor the "zero check" to use np.array_equiv https://numpy.org/devdocs/reference/generated/numpy.array_equiv.html#numpy.array_equiv

>>> import numpy as np
>>> np.array_equiv(np.zeros((3, 4)), 0)
True

but looking at the source, it seems to recreate the same problem https://github.com/numpy/numpy/blob/main/numpy/_core/numeric.py#L2552-L2598

I'm not sure what the right solution is.

One could cache the local array.

I notice that this operation is actually considered inefficient by the original writers leaving a note like

        # use array_equal to obtain equal_nan=True functionality
        # Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
        # every single time we have to write data?

Where this is a problem for me:

The numpy broadcast operation is actually really difficult to implement if you want to ensure high performance based on the chunks of the array. My experience with dask from 7 years ago reminds me that it was also difficult for them to implement efficient rechunking.

So implementing the np.broadcast_arrays just feels like "alot of work"....

_data, other = np.broadcast_arrays(self._data, other)

The zero check feels like it makes sense.... but its like "alot of work to do" for some array backends.


Suggested patch:

diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index f0d01566..54a2c09f 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -540,6 +540,11 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         # Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
         # every single time we have to write data?
+
+        # The operation array_equal operation below effectively will force the array
+        # into memory.
+        # So we cache it here.
+        self._data = np.asarray(self._data)
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
             self._data,

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions