perf: use roaring's range iter to speedup mask_to_offset_ranges #6871
Merged
Add a criterion benchmark suite targeting RowAddrMask / RowAddrTreeMap
that quantifies the cost of operations whose work is fundamentally
range-shaped but currently goes through per-row Partial(RoaringBitmap)
representation. Six groups:

  insert_range_single_run           - producer cost: insert one range
  into_addr_iter_single_run         - consumer cost: walk every row addr
  next_range_iter_single_run        - achievable cost via Iter::next_range
  intersect_two_runs                - set op on two range-shaped masks
  mask_to_offset_ranges_inner_loop  - end-to-end slow path observed in
                                      IS NULL trace (495 ms / 889 ms)
  insert_runs_constant_cardinality  - many small runs vs one big run

Each varies dataset size while holding number-of-ranges fixed at 1, so
linear scaling in N reveals where row count dominates the cost.

Headline finding (10M-row inputs):

  into_addr_iter:   19.4 ms  per-bit walk
  next_range iter:  1.72 us  per-run walk (~11000x faster)

The next_range/iter delta represents the speedup an alternate
range-aware iterator could surface to callers. The roaring crate
already represents the data as run-encoded containers; the
RowAddrMask public API does not expose them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds `RowAddrTreeMap::iter_runs()`, a range-shaped consumer that walks roaring's run-encoded containers via `Iter::next_range` instead of yielding individual bits. Rewrites the `U64Segment::Range` arm of `mask_to_offset_ranges` to use it, eliminating the per-bit walk that dominated the IS NULL hot path documented in 1b9d7c0.

Benchmark deltas at 10M rows (single contiguous run, vs the bench commit's `into_addr_iter` baseline):

  Consumer iteration    into_addr_iter   iter_runs   speedup
  N = 10K               19.4 µs          17.6 ns     1,100x
  N = 100K              191 µs           28.4 ns     6,800x
  N = 1M                1.92 ms          181 ns      10,400x
  N = 10M               19.5 ms          1.68 µs     11,600x

  mask_to_offset_ranges_inner_loop (end-to-end hot path):
  N = 10K               19.7 µs          132 ns      150x
  N = 100K              194 µs           262 ns      775x
  N = 1M                1.93 ms          1.92 µs     1,000x
  N = 10M               19.3 ms          20.1 µs     960x

Within ~3x of a dedicated `Vec<RangeInclusive>`-backed representation at 10M rows, but both are in the microseconds while the original was in the milliseconds, which is irrelevant in the context of a query that takes hundreds of ms.

The new method is ~70 lines (method + 2 tests + bench wiring) vs the ~700-line Runs-variant alternative, and adds no new public enum variant or representation switch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
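
To illustrate why a run-shaped consumer makes the `U64Segment::Range` arm cheap, here is a minimal sketch using only the standard library. It is not the code from this PR: `runs_to_offset_ranges` is a hypothetical helper, and the `runs` argument stands in for whatever a run iterator such as `iter_runs()` would yield (assumed here to be sorted, non-overlapping inclusive ranges of row ids). Each run is clamped to the segment and shifted to segment-relative offsets, so the work scales with the number of runs rather than the number of rows.

```rust
use std::ops::{Range, RangeInclusive};

/// Hypothetical helper (not part of the PR): maps runs of selected row ids
/// onto half-open offset ranges within a contiguous segment of row ids.
/// Offset 0 corresponds to `segment.start`.
///
/// Assumes `runs` is sorted and non-overlapping, which is what a
/// run-oriented iterator over a bitmap would be expected to produce.
fn runs_to_offset_ranges(
    segment: Range<u64>,
    runs: impl IntoIterator<Item = RangeInclusive<u64>>,
) -> Vec<Range<u64>> {
    let mut out = Vec::new();
    for run in runs {
        // Convert the inclusive run to half-open form and clamp it to the
        // segment. One iteration per run, regardless of the run's length.
        let lo = (*run.start()).max(segment.start);
        let hi = (*run.end()).saturating_add(1).min(segment.end);
        if lo < hi {
            out.push((lo - segment.start)..(hi - segment.start));
        }
    }
    out
}
```

With a single contiguous run covering the whole segment, the loop body executes once, which is consistent with the near-constant `iter_runs` timings in the table above.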
The function `mask_to_offset_ranges` is used at scan planning time to determine which rows to read from the file. This was a bottleneck when the mask was the result of a zonal index search because the old implementation materialized all of the offsets only to convert them back into ranges.

Luckily, roaring recently implemented a range-based iterator. Using this we can skip the materialization step. On my zonemap benchmark this doubles the speed of the search and, perhaps more importantly, removes a penalty I observed when the index is used even on queries that are not highly selective.
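
For context, the shape of the old cost can be sketched as follows. This is an illustration under stated assumptions, not the previous implementation: `coalesce_offsets` is a hypothetical function that takes already-materialized, sorted, deduplicated offsets and folds them back into half-open ranges. The point is that the loop runs once per selected row, even when the result is a single range.

```rust
use std::ops::Range;

/// Hypothetical illustration of the "materialize, then rebuild ranges"
/// pattern: folds sorted, deduplicated offsets into half-open ranges.
/// Work is proportional to the number of offsets, not the number of ranges.
fn coalesce_offsets(offsets: impl IntoIterator<Item = u64>) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for off in offsets {
        // Extend the current range when this offset is adjacent to it.
        if let Some(last) = out.last_mut() {
            if last.end == off {
                last.end = off + 1;
                continue;
            }
        }
        // Otherwise start a new single-element range.
        out.push(off..off + 1);
    }
    out
}
```

Feeding this function ten million consecutive offsets produces one range after ten million iterations; a range-based iterator can hand over that same range in a single step.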
Generated with the assistance of Claude code.