What happened?
I have found an example where open_zarr appears to ignore the use_zarr_fill_value_as_mask=True argument.
What did you expect to happen?
Background: Zarr Arrays have a fill_value property. This is used to determine the value of any "uninitialized" part of an array, e.g. chunks that have never been written. Zarr itself has no concept of missing data or masked data. This parameter was optional in Zarr Python 2 and required in Zarr Python 3.
When we implemented Zarr 3 support, we also changed how Xarray handles Zarr fill values.
Before Zarr 3, the Zarr array fill_value was optional, and Xarray used this field analogously to NetCDF's special _FillValue attribute. The value of the underlying Zarr array fill_value was used to apply a mask to the array when decoding it.
After Zarr 3, the Zarr array fill_value became mandatory, and we could no longer keep that behavior because it would trigger automatic masking of all data: an integer array with fill_value=0 (the default) would be coerced to float so that the mask could be applied, turning every 0 into NaN.
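To make the coercion concrete, here is a minimal NumPy-only sketch of what masking on fill_value=0 does to an integer array. The array contents are made up for illustration, and this mimics the effect of CF masking rather than calling Xarray's actual decoding code.

```python
import numpy as np

# Hypothetical stand-in for an integer Zarr array whose fill_value is 0.
data = np.array([0, 1, 2, 0], dtype="int64")
fill_value = 0

# Integers cannot represent NaN, so applying the mask forces a cast to float.
masked = np.where(data == fill_value, np.nan, data.astype("float64"))

print(masked.dtype)
print(masked)
```

Every genuine 0 in the data becomes indistinguishable from missing data, which is why this could not stay the default once fill_value was always present.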
Instead, we added an option, use_zarr_fill_value_as_mask, to toggle this behavior. When it is set to False, we just use a regular _FillValue attribute to store the sentinel value for the mask.
if self._use_zarr_fill_value_as_mask:
    # Setting this attribute triggers CF decoding for missing values
    # by interpreting Zarr's fill_value to mean the same as netCDF's _FillValue
    if zarr_array.fill_value is not None:
        attributes["_FillValue"] = zarr_array.fill_value
elif "_FillValue" in attributes:
    original_zarr_dtype = zarr_array.metadata.data_type
    attributes["_FillValue"] = FillValueCoder.decode(
        attributes["_FillValue"], original_zarr_dtype.value
    )
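Read on its own, that branch suggests the following selection rule, sketched here as a standalone function. The name resolve_fill_value_attr and its arguments are hypothetical, and the sketch leaves out the FillValueCoder.decode step. It shows what the excerpt implies should happen, not what I actually observe.

```python
def resolve_fill_value_attr(attributes, zarr_fill_value, use_zarr_fill_value_as_mask):
    """Return a copy of `attributes` with _FillValue chosen as the excerpt implies.

    Hypothetical helper for illustration only; not part of Xarray.
    """
    attributes = dict(attributes)
    if use_zarr_fill_value_as_mask:
        # The Zarr array's fill_value overrides any stored _FillValue attribute.
        if zarr_fill_value is not None:
            attributes["_FillValue"] = zarr_fill_value
    # Otherwise the stored _FillValue attribute is kept
    # (the real code also decodes it to the array's dtype).
    return attributes


attrs_on_disk = {"_FillValue": 1}
with_flag = resolve_fill_value_attr(attrs_on_disk, -99, True)
without_flag = resolve_fill_value_attr(attrs_on_disk, -99, False)
print(with_flag)     # {'_FillValue': -99}
print(without_flag)  # {'_FillValue': 1}
```

Under this reading, opening with use_zarr_fill_value_as_mask=True should mask on -99 rather than on 1.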
In the example below, I expected that setting use_zarr_fill_value_as_mask=True would produce the "old" behavior, but instead it appears to be ignored.
A separate but related issue is that the actual Zarr fill_value no longer appears in the encoding anywhere. That feels like a problem.
Minimal Complete Verifiable Example
import zarr
import xarray as xr
import numpy as np
# create a dataset with a different _FillValue attr and fill_value encoding
ds = xr.DataArray([1, 2, 3], attrs={"_FillValue": 1}, dims="x").to_dataset(name="foo")
ds.foo.encoding = {"fill_value": -99}
# write to zarr
store = zarr.storage.MemoryStore()
ds.to_zarr(store, zarr_format=3, consolidated=False)
# check that the fill_value encoding was propagated
za = zarr.open(store, path="foo")
assert za.fill_value == -99
# now resize the Zarr array to create true missing data
za.resize((4,))
# open this back up in xarray
# scenario 1: no mask_and_scale at all
# use_zarr_fill_value_as_mask=False should not mask the
ds1 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, mask_and_scale=False)
# ✅ no mask has been applied
np.testing.assert_equal(ds1.foo.values, [1, 2, 3, -99])
# _FillValue still in attrs
assert ds1.foo.attrs['_FillValue'] == 1
# note that fill_value doesn't appear in encoding anywhere
assert "fill_value" not in ds1.foo.encoding
# scenario 2: default for Zarr 3 - mask_and_scale=True, use_zarr_fill_value_as_mask=False
ds2 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, use_zarr_fill_value_as_mask=False)
# ✅ mask applied correctly
np.testing.assert_equal(ds2.foo.values, [np.nan, 2, 3, -99])
# _FillValue moved from attrs to encoding
assert ds2.foo.encoding['_FillValue'] == 1
assert "_FillValue" not in ds2.foo.attrs
# still no sign of -99
assert "fill_value" not in ds2.foo.encoding
assert "fill_value" not in ds2.foo.attrs
# scenario 3: use_zarr_fill_value_as_mask=True
ds3 = xr.open_zarr(store, consolidated=False, zarr_format=3, chunks=None, use_zarr_fill_value_as_mask=True)
# ❌ mask not applied correctly
try:
    np.testing.assert_equal(ds2.foo.values, [1, 2, 3, np.nan])
except AssertionError:
    # the mask was not applied correctly
    print("got", ds2.foo.values)
# _FillValue is in both encoding and attrs!
print("encoding", ds2.foo.encoding)
print("attrs", ds1.foo.attrs)
MVCE confirmation
Relevant log output
got [ nan 2. 3. -99.]
encoding {'chunks': (3,), 'preferred_chunks': {'x': 3}, 'compressors': (ZstdCodec(level=0, checksum=False),), 'filters': (), 'shards': None, 'serializer': BytesCodec(endian=<Endian.little: 'little'>), '_FillValue': 1, 'dtype': dtype('int64')}
attrs {'_FillValue': 1}
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
------------------
commit: None
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.220-209.869.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2025.3.1
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 3.0.7
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: 0.23
sparse: 0.15.4
flox: 0.9.9
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: None
The code excerpt quoted above is from xarray/xarray/backends/zarr.py, lines 874 to 883 in 729c4fa.