-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Description
Describe the bug
The parquet Statistics::null_count_opt method returns Some(0) when the underlying stats are missing.
This makes the nullcount stats we report actively dangerous -- not merely useless -- when attempting to prune row groups based on parquet stats:
- Consider the predicate
x IS NULL - Any footer with nullcount stat
Some(0)for x (no nulls) can be safely skipped, because no row can satisfy the predicate - Any footer with nullcount stat
Nonefor x (no nullcount stat) must not be skipped, because any number of rows could satisfy the predicate
Because we force None to Some(0) when decoding null count stats, it is impossible for a reader to safely use the resulting stat -- even if the writer correctly distinguished None vs. Some(0)!
To Reproduce
Read null count stats (or, just read the source code linked above).
Expected behavior
The reader should return whatever the writer produced. If a sub-optimal writer produced None when it could have produced Some(0), that just means the stat is unavailable -- the whole point of an _opt method. Callers who really want to conflate the two are welcome to unwrap_or(0). At least then the intent is stated clearly.
Additional context
The current behavior is actually documented... but gives no justification for the behavior nor any hint of warning about the ramifications:
Note this API returns Some(0) even if the null count was not present in the statistics. See https://github.com/apache/arrow-rs/pull/6216/files
Worse, ValueStatistics::null_count_opt does not include the same note, which tricked at least one arrow-rs user into thinking that the problem is somehow local to the Statistics::null_count_opt wrapper method rather than built into the stats decoding itself.
Meanwhile, on the only PR comment that mentions Some(0) seems to deal with the write path:
The previous API treated null_count = 0 as None.
The new API treats null_count = 0 as Some(0).
... which apparently is a problem because the parquet writer still encodes Some(0) as None?
... but the issue description there makes a false claim :
This will not cause issues for people using parquet-rs to read and write data as the reader also (incorrectly) reports
Some(0)when the thrift metadata hasNone
That incorrect reader behavior IS, in fact, a problem:
- In the days before
null_count_opt, the value0could mean "no nulls" but it could also mean "no nullcount stat". - So readers could not safely perform stats-based row group pruning for an IS [NOT] NULL predicate (because "no nulls" and "no nullcount stat" have opposite meanings).
- Now that
null_count_optexists, we should be able to distinguish the two scenarios asSome(0)vs.None. But the reader undoes that distinction with anunwrap_or(0)infrom_thrift_page_stats - Presumably that was a misguided attempt to preserve... something... but in practice it just means null count stats continue to be dangerous (not merely useless).
Note that writers and readers have different safety rules here:
- It's correct (albeit pessimal) for a writer to not produce a stat. It just means readers won't have as much information to work with later.
- It's dangerously incorrect for a reader to infer Some(0) when encountering None. See example above.
Do we know of any writer that emits Some(0) where it should have produced None? All the commentary I could find (see additional context below) suggests the opposite problem -- that writers emit None when they could/should have emitted Some(0). If that's true, then we should still honor the None vs. Some(0) distinction on the read path when it exists, and let writers catch up when they catch up.