perf(index): query BTree lookup batch directly#7186
Merged
wjones127 merged 10 commits intoJun 11, 2026
Conversation
…)" This reverts commit 9698bfb.
`BTreeIndex` previously held both the `page_lookup.lance` batch and a parallel `BTreeLookup` (a `BTreeMap<OrderableScalarValue, Vec<PageRecord>>` plus null-page lists), duplicating every min/max value as owned `ScalarValue`s alongside the Arrow buffers. `BTreeLookup` now wraps the lookup batch as the single source of truth. Range searches binary-search the sorted `min` column with `arrow_ord::make_comparator` and scan forward filtering by `max`, instead of walking a `BTreeMap`. Only the small all-null / partial-null page index lists are precomputed. This removes the min/max duplication and is more cache-friendly (packed buffers vs scattered tree nodes). `make_comparator` uses `total_cmp` for floats, matching the previous `OrderableScalarValue` ordering (including NaN). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Querying the lookup batch directly built fresh arrow DynComparators per query value, regressing warm-cache high-cardinality point and IN lookups (up to +44% on a 30-value IN) since make_comparator does type dispatch + downcasting and the actual search is only microseconds. Route pages_eq and pages_in through a shared candidate_pages_for_values that builds the min/max comparators once against an array holding all query values and reuses them, so an N-value IN costs three comparator constructions instead of three per value. Restores parity with the previous BTreeMap approach on those cases while keeping the load-time and memory wins of querying the batch directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
candidate_pages_for_values pushed page numbers into a Vec::new() and pages_in collected non-null query values the same way, growing one element at a time. Profiling the warm-cache IN benchmarks showed this RawVec::grow_one churn was a measurable chunk of the hot path. Presize both: the candidate vec to query.len() (high-cardinality lookups hit ~one page per value) and the non-null vec via the iterator size hint. Removes the per-push reallocs; brings the int IN-cached cases back to parity with (or faster than) the previous BTreeMap path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r_values The candidate scan window splits at `p` (first row with `min >= value`): rows in `[p, end)` have `min == value` and therefore always match (`max >= min == value`) and never carry a null `max`, while only the peek-left/straddle region `[start, p)` needs the `max >= value` filter. Copy the `[p, end)` run with `extend_from_slice` instead of pushing each page number individually. The exact-match run is largest for low-cardinality data (many pages share one `min`), where this avoids the per-row branch and bound check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The equality/IN page lookup compared via the boxed `DynComparator` from `make_comparator`, which costs one vtable call per element comparison in the binary search and scan. For `Utf8`/`LargeUtf8` columns, downcast once and compare with a generic `ArrayAccessor`-based closure (`accessor_cmp`) that inlines the `&str` comparison, matching arrow's NULLs-first ascending order. Extracted the binary-search/scan body into `scan_equality_pages`, monomorphized over the comparator closures, so the native path and the `make_comparator` fallback share one implementation. Neutralizes the warm-cache regression on `equality/string_unique` (was +5.2% vs main, now no significant change) and improves low-cardinality string lookups (~-19% cached, ~-26% cold). Other types fall back to `make_comparator` unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extend the native equality comparator dispatch beyond strings: route all primitive columns (ints, floats, decimals, temporal) through `primitive_cmp` (comparing via `ArrowNativeTypeOp::compare`, total order so floats keep arrow's NaN-last `make_comparator` semantics), and `Binary`/`LargeBinary`/ `FixedSizeBinary` through the existing `accessor_cmp` byte path (UUID columns are commonly `FixedSizeBinary`). Dispatch uses arrow's `downcast_primitive!` macro; remaining types fall back to `make_comparator`. Both native paths inline the element comparison instead of dispatching through the boxed `DynComparator` once per comparison, so the benefit scales with the number of comparisons: `IN` on warm cache improves with list size (int_unique in_10/20/30 cached: -4% / -7% / -10% vs main) and low-cardinality lookups improve ~2-5% warm and ~25% cold. Single-value warm-cache equality is allocation-bound (one `to_array_of_size(1)`), not comparison-bound, so it is unaffected by the comparator change. Adds `test_btree_lookup_pages_eq_bytes` covering Binary and FixedSizeBinary equality/IN over a null-min straddle page and duplicate `min`s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native equality/IN comparator dispatched on the logical arrow type via `downcast_primitive!`, monomorphizing the scan ~37 times — once per logical type, including a separate path for every Date/Time/Timestamp/Duration/Decimal32-64 variant. Dispatch on the physical storage type instead: logical types backed by the same native (Date32/Time32/Decimal32/IntervalYearMonth → i32; Date64/Time64/Timestamp/ Duration/Decimal64 → i64) are reinterpreted to that integer `PrimitiveArray` via `reinterpret_primitive` (zero-copy when already that type, otherwise an O(1) `ArrayData` relabel — no value copy) and share one comparison path. This keeps the native, inlined comparator (no per-comparison vtable) for all those types while cutting the scan to ~19 monomorphizations and removing the per-temporal-type code. Intervals with struct natives and booleans keep the `make_comparator` fallback. Benchmarks (equality + IN) vs main: zero regressions; low-cardinality and IN lookups improved (e.g. in_30/int_unique/cached -9.7%). vs the prior logical-type version: held everywhere except ~2% on string_unique/cached IN, within the calibrated identical-code noise band (string path logic is unchanged). Adds `test_btree_lookup_pages_eq_temporal` covering the Date32→i32 and Timestamp→i64 reinterpret. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Self-review follow-ups on the BTreeLookup rewrite: - Add lookup-level tests for the previously-uncovered branches: pages_in with a NULL in the value list (and a NULL-only list), the pages_eq(NULL) short-circuit with Some/All classification, a 0-row page_lookup batch, every integer width/signedness plus Float16 and Decimal128/256, the LargeBinary/LargeUtf8 byte arms, and an all-null page that sorts behind a straddle page so it falls inside the equality and range scan windows. - Fix the BTreeLookup struct doc, which only described the range path and framed all dispatch as going through make_comparator; describe the native physical-type equality/IN dispatch with make_comparator as the fallback. - Revert BTreeIndexState from pub back to private (no external callers). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wjones127
commented
Jun 10, 2026
Derive `DeepSizeOf`/`PartialEq` for `BTreeLookup` instead of hand-rolling them, and account for the lookup batch via `deep_size_of_children` (which `RecordBatch` now implements) rather than `get_array_memory_size`. Same fix in `BTreeIndexState`. Also drop the doc reference to the prior `BTreeMap` implementation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
LuQQiu
approved these changes
Jun 11, 2026
LuQQiu
left a comment
Contributor
There was a problem hiding this comment.
Great improvement, performance validated in benchmarking
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #6802.
BTreeIndexheld both thepage_lookup.lancebatch and a parallelBTreeLookup(aBTreeMap<OrderableScalarValue, Vec<PageRecord>>plus null-page lists), duplicating every min/max value as ownedScalarValues alongside the Arrow buffers.This PR rewrites
BTreeLookupto wrap the lookup batch as the single source of truth:mincolumn viaarrow_ord::make_comparator, then scan forward filtering bymaxand classifyMatches::Some/All— same big-O as before, with packed Arrow buffers instead of scattered tree nodes. Only the small all-null / partial-null page index lists are precomputed.INlookups go through a sharedcandidate_pages_for_valuesthat compares query values against the page columns with a native, inlined comparator (no boxedDynComparatorvtable call per comparison). Dispatch is on the physical storage type: logical types backed by the same native are reinterpreted to one path — zero-copy when already that type, otherwise an O(1)ArrayDatarelabel with no value copy — so everyDate32/Time32/Decimal32/IntervalYearMonthreuses thei32path and everyDate64/Time64/Timestamp/Duration/Decimal64reuses thei64path, rather than each generating its own. Byte-like columns (Utf8/LargeUtf8/Binary/LargeBinary/FixedSizeBinary) compare lexicographically viaArrayAccessor; intervals with struct natives and booleans fall back tomake_comparator. Comparators are built once per query and reused across all values. This keeps the scan to ~19 monomorphizations instead of one per logical type.BTreeIndexno longer stores a separatelookup_batch;statistics()reads bounds from the batch and cache serialization clones the batch out ofpage_lookup.total_cmp(ArrowNativeTypeOp::compareon the native path,make_comparatoron the fallback), matching the previousOrderableScalarValueordering — so the NaN caveat in the issue is a non-issue.The first commit reverts #7161, which had taken the opposite approach (caching the parsed tree and regenerating the batch on serialize); that machinery is obsolete once the batch is the source of truth.
Testing
cargo test -p lance-index --lib(all pass, incl. NaN ordering, null handling, range/fragment consistency)test_btree_lookup_pages_betweencovers duplicatemins, a null-minstraddling page, Some/All classification, and empty/inverted ranges.test_btree_lookup_pages_eq_bytescovers the native byte path forBinaryandFixedSizeBinary(e.g. UUID columns).test_btree_lookup_pages_eq_temporalcovers the physical-type reinterpret (Date32→i32,Timestamp→i64).Benchmarks
cargo bench -p lance-index --bench btree(the existing suite: numeric + string, high/low cardinality, equality / range /IN, cached + uncached). Compared against currentmain(which includes #7161) using criterion's baseline significance test —--load-baseline <pr> --baseline main— so the numbers below are criterion's own bootstrapped change estimate[lower, upper]around the median, with its p-value. Config:sample_size = 10,measurement_time = 10s, default 2% noise threshold. Run on a single dev machine (macOS arm64), so absolute timings are machine-specific; the verdicts are what matter. Equality/INare from the current HEAD;range_*exercisespages_between, which the equality/IN dispatch work did not touch, so those numbers reflect the same code.Wins — low-cardinality and load/deserialize-bound cases
No longer rebuilding a
BTreeMapon load, plus the native comparator inlining:Hot-path point /
INlookups (native physical-type dispatch)An earlier iteration that queried the batch through the boxed
make_comparator(one vtable call per comparison) regressed warm-cache high-cardinality lookups by up to +44% on a 30-valueIN. The native comparators remove that. Vsmain, all 32 equality+IN benchmarks are improved or at parity — zero regressions. High-cardinality warm-cache equality is back to parity; warm-cacheINis improved:Noise calibration
To measure the harness noise floor, comparing two runs of identical code (same binary behavior) still produces "significant" verdicts:
The code did not change, so these p < 0.05 verdicts are run-to-run variance. Consistent with that,
in_10/int_unique/cachedhas read anywhere from −4.2% to +9.3% across runs. Atsample_size = 10on warm multi-µs benchmarks, swings of ±5% earn p < 0.05 from variance alone, so single-digit-percent deltas on the warm high-cardinality cases are not reliable signal.Residual cost
range_*/unique/cachedshows small regressions — e.g.range_few/string_unique/cached +2.7%,range_few/int_unique/cached +3.8%.pages_between(the range path) still usesmake_comparator; the native physical-type dispatch was added only to the equality/INpath. These sit at the edge of the calibrated noise band but are plausibly a small real cost — the deliberate tradeoff for not duplicating every min/max as an ownedScalarValuein memory, and a candidate for extending the physical-type dispatch to ranges in a follow-up. All other cases report "no change in performance detected."