zero check in encoding less than ideal for lazy arrays

The zero check in the encoding path is less than ideal
https://github.com/zarr-developers/zarr-python/blob/b8736917c351418dce3fb425ee16f747136da96e/src/zarr/core/codec_pipeline.py#L416

For lazy arrays, it eventually calls:

1. np.broadcast_arrays
2. np.array_equal

where the latter can instantiate a full array in memory.

The problem is that some lazy array implementation will not "keep this instantiation of the interediate full memory array" in memory for the the operation 2 lines down, that will eventually write the array.

So if the instantiation of the lazy array is costly, we are:

1. Instantiating it once for checking for zeros.
2. Instantiating a second time to write it to the store.

I would love to avoid this 'double cost' in a "pythonic way".

I would have loved to suggest 
~My proposal would be~ to refactor the "zero check" to use `np.array_equiv` https://numpy.org/devdocs/reference/generated/numpy.array_equiv.html#numpy.array_equiv

```python
>>> import numpy as np
>>> np.array_equiv(np.zeros((3, 4)), 0)
True
```

but looking at the source, it seems to recreate the same problem https://github.com/numpy/numpy/blob/main/numpy/_core/numeric.py#L2552-L2598

I'm not sure what the right solution is.

One could cache the local array.

I notice that this operation is actually considered inefficient by the original writers leaving a note like
```
        # use array_equal to obtain equal_nan=True functionality
        # Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
        # every single time we have to write data?
```

---------------------------------------------------------------------------

Where this is a problem for me:

The numpy broadcast operation is actually really difficult to implement if you want to ensure high performance based on the chunks of the array. My experience with dask from 7 years ago reminds me that it was also difficult for them to implement efficient rechunking.

So implementing the `np.broadcast_arrays` just feels like "alot of work"....

https://github.com/zarr-developers/zarr-python/blob/b8736917c351418dce3fb425ee16f747136da96e/src/zarr/core/buffer/core.py#L543

The zero check feels like it makes sense.... but its like "alot of work to do" for some array backends.

----------------------------------------------------------------------------

Suggested patch:
```
diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index f0d01566..54a2c09f 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -540,6 +540,11 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         # Since fill-value is a scalar, isn't there a faster path than allocating a new array for fill value
         # every single time we have to write data?
+
+        # The operation array_equal operation below effectively will force the array
+        # into memory.
+        # So we cache it here.
+        self._data = np.asarray(self._data)
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
             self._data,

```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

zero check in encoding less than ideal for lazy arrays #3627

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

zero check in encoding less than ideal for lazy arrays #3627

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions