fix: teach column and getitem to respect validity #1927

Closed
danking wants to merge 8 commits into develop from dk/getitem-validity

@danking danking commented Jan 13, 2025

`mask` sets entries of an array to null. I like the analogy to light: the array is a sequence of
lights (each value might be a different wavelength), and null is represented by the absence of
light. Placing a mask (i.e. a piece of plastic with slits) over the array makes the values dark
wherever the mask is present (i.e. "on", "true").

An example in pseudo-code:

```rust
a = [1, 2, 3, 4, 5]
a_mask = [t, f, f, t, f]
mask(a, a_mask) == [null, 2, 3, null, 5]
```
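
The same example as a small runnable sketch. It uses a plain `Vec<Option<i32>>` to stand in for a nullable array, and the free function `mask` (with its `mask_bits` parameter) is a hypothetical helper for illustration, not the signature this PR adds.

```rust
/// Set every entry of `values` to null (`None`) where `mask_bits` is true.
/// Hypothetical helper for illustration; not the API added by this PR.
fn mask(values: &[Option<i32>], mask_bits: &[bool]) -> Vec<Option<i32>> {
    assert_eq!(values.len(), mask_bits.len(), "array and mask must have equal length");
    values
        .iter()
        .zip(mask_bits)
        // A true mask bit hides the value; a false bit lets it through unchanged.
        .map(|(v, &m)| if m { None } else { *v })
        .collect()
}

fn main() {
    let a = [Some(1), Some(2), Some(3), Some(4), Some(5)];
    let a_mask = [true, false, false, true, false];
    assert_eq!(mask(&a, &a_mask), vec![None, Some(2), Some(3), None, Some(5)]);
}
```

Entries that are already null stay null whatever the mask says, since a false mask bit simply passes the existing `Option` through.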

Specializations
---------------

I only fall back to Arrow for two of the core arrays:

- Sparse. I was skeptical that I could do better than decompressing and applying the mask.
- Constant. If the mask is sparse, SparseArray might be a good choice. I didn't investigate.

For the non-core arrays, I'm missing the following. It's not clear to me that I can beat
decompression for run end. The others are easy enough but need some amount of typing and testing.

- fastlanes
- fsst
- roaring
- runend
- runend-bool
- zigzag

Naming
------

Pandas also calls this operation
[`mask`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mask.html) but accepts an
optional second argument which is an array of values to use instead of null (which makes Pandas'
mask more like an `if_else`).

Arrow-rs calls this [`nullif`](https://arrow.apache.org/rust/arrow/compute/fn.nullif.html).

Arrow-cpp has [`if_else(condition, consequent,
alternate)`](https://arrow.apache.org/docs/cpp/compute.html#cpp-compute-scalar-selections) and
[`replace_with_mask(array, mask,
replacements)`](https://arrow.apache.org/docs/cpp/compute.html#replace-functions) both of which can
implement our `mask` by supplying nulls as the replacement values (for `if_else`, as the branch
selected where the mask is true).
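
To make the relationship between these names concrete, here is a minimal sketch of `mask` written in terms of an element-wise `if_else`. Both functions are hypothetical stand-ins over `Option` slices, not the Arrow or Pandas APIs; the point is only that `mask` is an `if_else` whose replacement values are all null.

```rust
/// Element-wise select: take `consequent[i]` where `condition[i]` is true,
/// otherwise `alternate[i]`. Hypothetical stand-in for an `if_else` kernel.
fn if_else<T: Copy>(
    condition: &[bool],
    consequent: &[Option<T>],
    alternate: &[Option<T>],
) -> Vec<Option<T>> {
    assert_eq!(condition.len(), consequent.len());
    assert_eq!(condition.len(), alternate.len());
    condition
        .iter()
        .zip(consequent.iter().zip(alternate))
        .map(|(&c, (&x, &y))| if c { x } else { y })
        .collect()
}

/// `mask` expressed through `if_else`: nulls are selected where the mask is true.
fn mask_via_if_else<T: Copy>(values: &[Option<T>], mask_bits: &[bool]) -> Vec<Option<T>> {
    let nulls: Vec<Option<T>> = vec![None; values.len()];
    if_else(mask_bits, &nulls, values)
}

fn main() {
    let a = [Some(1), Some(2), Some(3), Some(4), Some(5)];
    let m = [true, false, false, true, false];

    // Null replacements give the `mask` behaviour described in this PR.
    assert_eq!(mask_via_if_else(&a, &m), vec![None, Some(2), Some(3), None, Some(5)]);

    // Non-null replacements give the Pandas-style variant with a second argument.
    let zeros = [Some(0); 5];
    assert_eq!(if_else(&m, &zeros, &a), vec![Some(0), Some(2), Some(3), Some(0), Some(5)]);
}
```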

danking commented Jan 14, 2025

Will reopen after #1900 merges.

@danking danking closed this Jan 14, 2025
@robert3005 robert3005 deleted the dk/getitem-validity branch January 29, 2025 22:27