Conversation
Mask sets entries of an array to null. I like the analogy to light: the array is a sequence of lights (each value might be a different wavelength). Null is represented by the absence of light. Placing a mask (i.e. a piece of plastic with slits) over the array causes those values where the mask is present (i.e. "on", "true") to be dark. An example in pseudo-code: ```rust a = [1, 2, 3, 4, 5] a_mask = [t, f, f, t, f] mask(a, a_mask) == [null, 2, 3, null, 5] ``` Specializations --------------- I only fallback to Arrow for two of the core arrays: - Sparse. I was skeptical that I could do better than decompressing and applying it. - Constant. If the mask is sparse, SparseArray might be a good choice. I didn't investigate. For the non-core arrays, I'm missing the following. I'm not clear that I can beat decompression for run end. The others are easy enough but some amount of typing and testing. - fastlanes - fsst - roaring - runend - runend-bool - zigzag Naming ------ Pandas also calls this operation [`mask`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html) but accepts an optional second argument which is an array of values to use instead of null (which makes Pandas' mask more like an `if_else`). Arrow-rs calls this [`nullif`](https://arrow.apache.org/rust/arrow/compute/fn.nullif.html). Arrow-cpp has [`if_else(condition, consequent, alternate)`](https://arrow.apache.org/docs/cpp/compute.html#cpp-compute-scalar-selections) and [`replace_with_mask(array, mask, replacements)`](https://arrow.apache.org/docs/cpp/compute.html#replace-functions) both of which can implement our `mask` by passing a `NullArray` as the third argument.
|
Nick, dict array no longer uses |
gatesn
left a comment
There was a problem hiding this comment.
It would have been really helpful if this was done one PR per encoding.
| array | ||
| .patches() | ||
| .map(|patches| { | ||
| patches.map_values(|values| try_cast(&values, &values.dtype().as_nullable())) |
There was a problem hiding this comment.
This doesn't seem to apply the mask to the patches?
There was a problem hiding this comment.
This is gone. When #1951 merges, I'll revisit ALP mask.
| }; | ||
|
|
||
| Ok(DateTimePartsArray::try_new( | ||
| array.dtype().clone().as_nullable(), |
There was a problem hiding this comment.
This logic seems wrong? You're check for eq_ignore_nullability, but you don't check which direction the cast is, yet you always return a nullable DType.
There was a problem hiding this comment.
Maybe pull this into a PR and add tests
There was a problem hiding this comment.
And now removed from this PR.
| FilterIter::Slices(slices) => { | ||
| for slice in slices { | ||
| for index in slice.0..slice.1 { | ||
| codes[index] = T::zero(); |
There was a problem hiding this comment.
I think this works?
codes[slice].fill(T::zero())
There was a problem hiding this comment.
this works: codes[slice.0..slice.1].fill(T::zero()) (it needed to be a Range). Thanks, pushed!
|
|
||
| impl MaskFn<DictArray> for DictEncoding { | ||
| fn mask(&self, array: &DictArray, mask: FilterMask) -> VortexResult<ArrayData> { | ||
| let new_codes = match_each_integer_ptype!(array.codes_ptype(), |$T| { |
There was a problem hiding this comment.
Don't you also have to check that the array was nullable before? Otherwise zero may not be the null value.
There was a problem hiding this comment.
Good point. This is fixed in 373ae02 and a test was added.
| impl CastFn<BoolArray> for BoolEncoding { | ||
| fn cast(&self, array: &BoolArray, dtype: &DType) -> VortexResult<ArrayData> { | ||
| let DType::Bool(new_nullability) = dtype else { | ||
| vortex_bail!(MismatchedTypes: "bool type", dtype); |
There was a problem hiding this comment.
I think an error message saying "cannot cast from X to Y" is probably better.
| impl MaskFn<ExtensionArray> for ExtensionEncoding { | ||
| fn mask(&self, array: &ExtensionArray, filter_mask: FilterMask) -> VortexResult<ArrayData> { | ||
| let DType::Extension(ext_dtype) = array.dtype() else { | ||
| vortex_bail!("extension array must have extension dtype"); |
There was a problem hiding this comment.
Doesn't ExtensionArray have a ext_dtype() function? If not, it should. And it should panic, rather than return a result.
| array.names().clone(), | ||
| array | ||
| .children() | ||
| .zip_eq(sdtype.dtypes()) |
There was a problem hiding this comment.
This seems like wrong (or at the very least, undocumented) behavior. So we cast each field to the positionally equivalent DType in the destination type? But we preserve the names of the original array.
I would expect cast(array, dtype)?.dtype() == dtype. In fact, can you add this assertion in the cast entry-point function.=
There was a problem hiding this comment.
I now require the names to match and added a test that shuffling the name order causes the cast to fail.
| VarBinArray::try_new( | ||
| array.offsets(), | ||
| array.bytes(), | ||
| array.dtype().with_nullability(*nullability), |
There was a problem hiding this comment.
You match on the target DType, but then you just change the nullability of the source DType. So VarBin(Binary).cast(UTF8?) => VarBin(Binary?) ??
| VarBinViewArray::try_new( | ||
| array.views(), | ||
| array.buffers().collect(), | ||
| array.dtype().with_nullability(*nullability), |
| } | ||
|
|
||
| fn boolean_buffer(&self) -> VortexResult<&BooleanBuffer> { | ||
| pub fn boolean_buffer(&self) -> VortexResult<&BooleanBuffer> { |
There was a problem hiding this comment.
This is now already public in develop.
Deploying vortex-bench with
|
| Latest commit: |
041e913
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://72b23c2a.vortex-bench.pages.dev |
| Branch Preview URL: | https://dk-mask.vortex-bench.pages.dev |
|
Alright @gatesn , this is ready for another look. |
| (0.01, 0.1), | ||
| (0.01, 0.01), | ||
| ])] | ||
| fn bench_dict_mask(bencher: Bencher, (fraction_valid, fraction_masked): (f64, f64)) { |
There was a problem hiding this comment.
This is a useful microbenchmark but dict currently lacks a MaskFn so this just measures decompression time. I intend to follow up with a real MaskFn.
Mask sets entries of an array to null. I like the analogy to light: the array is a sequence of lights (each value might be a different wavelength). Null is represented by the absence oflight. Placing a mask (i.e. a piece of plastic with slits) over the array causes those values where the mask is present (i.e. "on", "true") to be dark.
An example in pseudo-code:
Specializations
I only fallback to Arrow for two of the core arrays:
For the non-core arrays, I'm missing the following. I'm not clear that I can beat decompression forrun end. The others are easy enough but some amount of typing and testing.
Naming
Pandas also calls this operation
maskbut accepts an optional second argument which is an array of values to use instead of null (which makes Pandas' mask more like anif_else).Arrow-rs calls this
nullif.Arrow-cpp has
if_else(condition, consequent, alternate)andreplace_with_mask(array, mask, replacements)both of which can implement ourmaskby passing aNullArrayas the third argument.Follow ups: