perf(index): query BTree lookup batch directly by wjones127 · Pull Request #7186 · lance-format/lance

wjones127 · 2026-06-09T16:21:35Z

Closes #6802.

BTreeIndex held both the page_lookup.lance batch and a parallel BTreeLookup (a BTreeMap<OrderableScalarValue, Vec<PageRecord>> plus null-page lists), duplicating every min/max value as owned ScalarValues alongside the Arrow buffers.

This PR rewrites BTreeLookup to wrap the lookup batch as the single source of truth:

- Range searches binary-search the sorted min column via arrow_ord::make_comparator, then scan forward filtering by max and classify Matches::Some/All. This is the same big-O as before, with packed Arrow buffers instead of scattered tree nodes. Only the small all-null / partial-null page index lists are precomputed.
- Equality / IN lookups go through a shared candidate_pages_for_values that compares query values against the page columns with a native, inlined comparator (no boxed DynComparator vtable call per comparison). Dispatch is on the physical storage type: logical types backed by the same native are reinterpreted to one path (zero-copy when already that type, otherwise an O(1) ArrayData relabel with no value copy), so every Date32/Time32/Decimal32/IntervalYearMonth reuses the i32 path and every Date64/Time64/Timestamp/Duration/Decimal64 reuses the i64 path, rather than each generating its own. Byte-like columns (Utf8/LargeUtf8/Binary/LargeBinary/FixedSizeBinary) compare lexicographically via ArrayAccessor; intervals with struct natives and booleans fall back to make_comparator. Comparators are built once per query and reused across all values. This keeps the scan to ~19 monomorphizations instead of one per logical type.
- BTreeIndex no longer stores a separate lookup_batch; statistics() reads bounds from the batch and cache serialization clones the batch out of page_lookup.
- Float ordering uses total_cmp (ArrowNativeTypeOp::compare on the native path, make_comparator on the fallback), matching the previous OrderableScalarValue ordering, so the NaN caveat in the issue is a non-issue.
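As a rough illustration of the range path described above, here is a minimal std-only sketch. It is not the code in this PR: `Page`, this `Matches` enum, and `pages_between` over a plain slice are hypothetical stand-ins for the Arrow lookup batch, nulls are ignored, and the peek-left loop additionally assumes that `max` is non-decreasing across pages.

```rust
/// One row of a simplified page lookup table (hypothetical; nulls are ignored).
#[derive(Debug, Clone, Copy)]
pub struct Page {
    pub min: i64,
    pub max: i64,
    pub id: u32,
}

/// Whether a page only partially overlaps the range (`Some`) or lies fully inside it (`All`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Matches {
    Some(u32),
    All(u32),
}

/// Pages overlapping the inclusive range [lo, hi]. `pages` must be sorted by `min`.
pub fn pages_between(pages: &[Page], lo: i64, hi: i64) -> Vec<Matches> {
    if lo > hi {
        return Vec::new(); // inverted range matches nothing
    }
    // Binary search: first page whose min is >= lo.
    let p = pages.partition_point(|pg| pg.min < lo);
    // Peek left for pages that start before lo but still reach it.
    // Assumption of this sketch: max is non-decreasing, so we can stop early.
    let mut start = p;
    while start > 0 && pages[start - 1].max >= lo {
        start -= 1;
    }
    let mut out = Vec::new();
    // Scan forward until min passes hi, filtering by max.
    for pg in &pages[start..] {
        if pg.min > hi {
            break;
        }
        if pg.max < lo {
            continue;
        }
        if lo <= pg.min && pg.max <= hi {
            out.push(Matches::All(pg.id));
        } else {
            out.push(Matches::Some(pg.id));
        }
    }
    out
}
```

With pages covering 0..=9, 10..=19, 20..=29 and 30..=39, a query for [5, 25] would classify the first and third pages as `Some` and the second as `All`.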

The first commit reverts #7161, which had taken the opposite approach (caching the parsed tree and regenerating the batch on serialize); that machinery is obsolete once the batch is the source of truth.

Testing

- cargo test -p lance-index --lib (all pass, including NaN ordering, null handling, range/fragment consistency)
- test_btree_lookup_pages_between covers duplicate mins, a null-min straddling page, Some/All classification, and empty/inverted ranges.
- test_btree_lookup_pages_eq_bytes covers the native byte path for Binary and FixedSizeBinary (e.g. UUID columns).
- test_btree_lookup_pages_eq_temporal covers the physical-type reinterpret (Date32→i32, Timestamp→i64).

Benchmarks

cargo bench -p lance-index --bench btree (the existing suite: numeric + string, high/low cardinality, equality / range / IN, cached + uncached). Compared against current main (which includes #7161) using criterion's baseline significance test — --load-baseline <pr> --baseline main — so the numbers below are criterion's own bootstrapped change estimate [lower, upper] around the median, with its p-value. Config: sample_size = 10, measurement_time = 10s, default 2% noise threshold. Run on a single dev machine (macOS arm64), so absolute timings are machine-specific; the verdicts are what matter. Equality/IN are from the current HEAD; range_* exercises pages_between, which the equality/IN dispatch work did not touch, so those numbers reflect the same code.

Wins — low-cardinality and load/deserialize-bound cases

No longer rebuilding a BTreeMap on load, plus the native comparator inlining:

| case | change | verdict |
|---|---|---|
| equality/int_low_card/no_cache | −27.5% [−30.7, −24.3] | improved |
| equality/string_low_card/no_cache | −26.6% [−29.3, −24.3] | improved |
| range_few/int_low_card/no_cache | −23.7% [−26.5, −21.4] | improved |
| range_few/string_low_card/no_cache | −23.5% [−24.7, −21.9] | improved |
| equality/string_low_card/cached | −20.3% [−22.8, −18.5] | improved |
| range_few/string_low_card/cached | −18.0% [−19.4, −16.3] | improved |
| equality/int_low_card/cached | −6.64% [−7.90, −5.61] | improved |
| range_few/int_low_card/cached | −6.34% [−8.02, −4.49] | improved |

Hot-path point / `IN` lookups (native physical-type dispatch)

An earlier iteration that queried the batch through the boxed make_comparator (one vtable call per comparison) regressed warm-cache high-cardinality lookups by up to +44% on a 30-value IN. The native comparators remove that. Vs main, all 32 equality+IN benchmarks are improved or at parity — zero regressions. High-cardinality warm-cache equality is back to parity; warm-cache IN is improved:

| case | change vs main | p | verdict |
|---|---|---|---|
| equality/int_unique/cached | parity | — | no change |
| equality/string_unique/cached | parity | — | no change |
| in_30/int_unique/cached | −9.67% [−10.7, −9.0] | 0.00 | improved |
| in_20/int_unique/cached | −3.81% [−4.11, −3.57] | 0.00 | improved |
| in_10/int_unique/cached | −2.80% [−4.84, −1.27] | 0.00 | improved |
| in_20/string_low_card/cached | −5.40% [−8.39, −2.96] | 0.00 | improved |
| in_10/string_low_card/cached | −3.05% [−4.36, −1.94] | 0.00 | improved |
| in_30/string_low_card/cached | −2.06% [−3.15, −1.23] | 0.00 | improved |

Noise calibration

Comparing two runs of identical code (same binary behavior) measures the harness noise floor, and even that comparison still produces "significant" verdicts:

| case (identical code) | change | p | verdict |
|---|---|---|---|
| in_10/string_unique/cached | −4.29% [−7.41, −1.86] | 0.01 | "improved" |
| equality/int_unique/cached | −2.87% [−5.60, −0.68] | 0.02 | "improved" |
| equality/int_unique/no_cache | −3.14% [−4.42, −2.02] | 0.00 | "improved" |

The code did not change, so these p < 0.05 verdicts are run-to-run variance. Consistent with that, in_10/int_unique/cached has read anywhere from −4.2% to +9.3% across runs. At sample_size = 10 on warm multi-µs benchmarks, swings of ±5% earn p < 0.05 from variance alone, so single-digit-percent deltas on the warm high-cardinality cases are not reliable signal.

Residual cost

range_*/unique/cached shows small regressions — e.g. range_few/string_unique/cached +2.7%, range_few/int_unique/cached +3.8%. pages_between (the range path) still uses make_comparator; the native physical-type dispatch was added only to the equality/IN path. These sit at the edge of the calibrated noise band but are plausibly a small real cost — the deliberate tradeoff for not duplicating every min/max as an owned ScalarValue in memory, and a candidate for extending the physical-type dispatch to ranges in a follow-up. All other cases report "no change in performance detected."

…)" This reverts commit 9698bfb.

`BTreeIndex` previously held both the `page_lookup.lance` batch and a parallel `BTreeLookup` (a `BTreeMap<OrderableScalarValue, Vec<PageRecord>>` plus null-page lists), duplicating every min/max value as owned `ScalarValue`s alongside the Arrow buffers. `BTreeLookup` now wraps the lookup batch as the single source of truth. Range searches binary-search the sorted `min` column with `arrow_ord::make_comparator` and scan forward filtering by `max`, instead of walking a `BTreeMap`. Only the small all-null / partial-null page index lists are precomputed. This removes the min/max duplication and is more cache-friendly (packed buffers vs scattered tree nodes). `make_comparator` uses `total_cmp` for floats, matching the previous `OrderableScalarValue` ordering (including NaN). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-09T16:54:53Z

Codecov Report

❌ Patch coverage is 95.50296% with 38 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/btree.rs | 95.50% | 11 Missing and 27 partials ⚠️ |


Querying the lookup batch directly built fresh arrow DynComparators per query value, regressing warm-cache high-cardinality point and IN lookups (up to +44% on a 30-value IN) since make_comparator does type dispatch + downcasting and the actual search is only microseconds. Route pages_eq and pages_in through a shared candidate_pages_for_values that builds the min/max comparators once against an array holding all query values and reuses them, so an N-value IN costs three comparator constructions instead of three per value. Restores parity with the previous BTreeMap approach on those cases while keeping the load-time and memory wins of querying the batch directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
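The idea of hoisting comparator construction out of the per-value loop can be sketched without Arrow. Everything below is hypothetical: `make_cmp` only mimics the shape of an index-based boxed comparator, `pages_in` works on plain `i64` slices, and the backward walk assumes `max` is non-decreasing. This sketch needs two comparators rather than the three mentioned in the commit message.

```rust
use std::cmp::Ordering;

/// Boxed comparator over row indices: (row in left array, row in right array).
type DynCmp<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;

/// Hypothetical stand-in for an index-based comparator factory.
fn make_cmp<'a>(left: &'a [i64], right: &'a [i64]) -> DynCmp<'a> {
    Box::new(move |i: usize, j: usize| left[i].cmp(&right[j]))
}

/// Binary search over 0..len for the first index where `pred` is false.
fn partition_point_idx(len: usize, pred: impl Fn(usize) -> bool) -> usize {
    let (mut lo, mut hi) = (0, len);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if pred(mid) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    lo
}

/// Indices of pages whose [min, max] contains any query value.
/// `mins` must be sorted; this sketch also assumes `maxs` is non-decreasing.
pub fn pages_in(mins: &[i64], maxs: &[i64], values: &[i64]) -> Vec<usize> {
    // Built once per query against the whole value array, then reused for every value.
    let min_vs_val = make_cmp(mins, values);
    let max_vs_val = make_cmp(maxs, values);
    let mut out = Vec::with_capacity(values.len());
    for v in 0..values.len() {
        // First page whose min is greater than the value.
        let end = partition_point_idx(mins.len(), |i| min_vs_val(i, v) != Ordering::Greater);
        // Walk back over pages whose max still reaches the value.
        let mut i = end;
        while i > 0 && max_vs_val(i - 1, v) != Ordering::Less {
            i -= 1;
            out.push(i);
        }
    }
    out.sort_unstable();
    out.dedup();
    out
}
```

The point is only that the comparator count is independent of the number of values in the `IN` list; the per-value work is the binary search and the short scan.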

candidate_pages_for_values pushed page numbers into a Vec::new() and pages_in collected non-null query values the same way, growing one element at a time. Profiling the warm-cache IN benchmarks showed this RawVec::grow_one churn was a measurable chunk of the hot path. Presize both: the candidate vec to query.len() (high-cardinality lookups hit ~one page per value) and the non-null vec via the iterator size hint. Removes the per-push reallocs; brings the int IN-cached cases back to parity with (or faster than) the previous BTreeMap path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
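A minimal sketch of the presizing pattern, using only the standard library. The function names `non_null_values` and `candidate_buffer` are made up for illustration and do not correspond to functions in the PR; `Option<i64>` stands in for nullable query values.

```rust
/// Collect the non-null query values into a vec sized from the iterator's size hint.
pub fn non_null_values(values: &[Option<i64>]) -> Vec<i64> {
    let iter = values.iter().filter_map(|v| *v);
    // filter_map reports a lower bound of 0 and an upper bound of the input length,
    // so the upper bound is the useful number for presizing.
    let (lower, upper) = iter.size_hint();
    let mut out = Vec::with_capacity(upper.unwrap_or(lower));
    out.extend(iter);
    out
}

/// Candidate page buffer presized to one page per query value.
pub fn candidate_buffer(query_len: usize) -> Vec<u32> {
    Vec::with_capacity(query_len)
}
```

Pushing up to `query_len` elements into the presized buffer does not reallocate, which is the per-push growth the commit message describes removing.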

…r_values The candidate scan window splits at `p` (first row with `min >= value`): rows in `[p, end)` have `min == value` and therefore always match (`max >= min == value`) and never carry a null `max`, while only the peek-left/straddle region `[start, p)` needs the `max >= value` filter. Copy the `[p, end)` run with `extend_from_slice` instead of pushing each page number individually. The exact-match run is largest for low-cardinality data (many pages share one `min`), where this avoids the per-row branch and bound check. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
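The split at `p` can be sketched on plain slices as follows. This is an illustration under assumptions, not the PR's code: `candidate_pages` and its parameters are hypothetical, nulls are ignored, and because the excerpt does not say how `start` is computed, the sketch conservatively filters every row before `p`.

```rust
/// Page ids whose [min, max] contains `value`. `mins` must be sorted ascending;
/// `maxs` and `ids` are parallel to `mins`.
pub fn candidate_pages(mins: &[i64], maxs: &[i64], ids: &[u32], value: i64) -> Vec<u32> {
    // p: first row with min >= value; end: first row with min > value.
    let p = mins.partition_point(|&m| m < value);
    let end = mins.partition_point(|&m| m <= value);
    let mut out = Vec::with_capacity(end - p + 1);
    // Rows before p start below the value, so they need the max >= value filter.
    for i in 0..p {
        if maxs[i] >= value {
            out.push(ids[i]);
        }
    }
    // Rows in [p, end) have min == value and therefore always match:
    // copy the whole run at once instead of pushing row by row.
    out.extend_from_slice(&ids[p..end]);
    out
}
```

For low-cardinality data the `[p, end)` run covers most of the result, so the bulk copy replaces most of the per-row branches.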

The equality/IN page lookup compared via the boxed `DynComparator` from `make_comparator`, which costs one vtable call per element comparison in the binary search and scan. For `Utf8`/`LargeUtf8` columns, downcast once and compare with a generic `ArrayAccessor`-based closure (`accessor_cmp`) that inlines the `&str` comparison, matching arrow's NULLs-first ascending order. Extracted the binary-search/scan body into `scan_equality_pages`, monomorphized over the comparator closures, so the native path and the `make_comparator` fallback share one implementation. Neutralizes the warm-cache regression on `equality/string_unique` (was +5.2% vs main, now no significant change) and improves low-cardinality string lookups (~-19% cached, ~-26% cold). Other types fall back to `make_comparator` unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
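To illustrate how one generic scan body can serve both an inlined native comparator and a boxed fallback, here is a std-only sketch. The names `equal_run`, `native_str_run`, and `boxed_run` are hypothetical, and `&[&str]` stands in for a string array; the real `scan_equality_pages` also handles `max` and nulls, which this omits.

```rust
use std::cmp::Ordering;

/// Binary search over 0..len for the first index where `pred` is false.
fn bisect(len: usize, pred: impl Fn(usize) -> bool) -> usize {
    let (mut lo, mut hi) = (0, len);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if pred(mid) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    lo
}

/// Half-open run of rows equal to the target. Generic over the comparator,
/// so each concrete closure type gets its own monomorphized, inlinable copy.
pub fn equal_run<F>(len: usize, cmp_to_target: F) -> (usize, usize)
where
    F: Fn(usize) -> Ordering,
{
    let lower = bisect(len, |i| cmp_to_target(i) == Ordering::Less);
    let upper = bisect(len, |i| cmp_to_target(i) != Ordering::Greater);
    (lower, upper)
}

/// Native path: the string comparison is visible to the compiler at the call site.
pub fn native_str_run(mins: &[&str], target: &str) -> (usize, usize) {
    equal_run(mins.len(), |i| mins[i].cmp(target))
}

/// Fallback path: same scan body, but every comparison goes through a vtable call.
pub fn boxed_run(mins: &[&str], target: &str) -> (usize, usize) {
    let boxed: Box<dyn Fn(usize) -> Ordering + '_> =
        Box::new(move |i: usize| mins[i].cmp(target));
    equal_run(mins.len(), boxed)
}
```

Both entry points return the same result; the difference is only whether the element comparison can be inlined into the search loop.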

Extend the native equality comparator dispatch beyond strings: route all primitive columns (ints, floats, decimals, temporal) through `primitive_cmp` (comparing via `ArrowNativeTypeOp::compare`, total order so floats keep arrow's NaN-last `make_comparator` semantics), and `Binary`/`LargeBinary`/ `FixedSizeBinary` through the existing `accessor_cmp` byte path (UUID columns are commonly `FixedSizeBinary`). Dispatch uses arrow's `downcast_primitive!` macro; remaining types fall back to `make_comparator`. Both native paths inline the element comparison instead of dispatching through the boxed `DynComparator` once per comparison, so the benefit scales with the number of comparisons: `IN` on warm cache improves with list size (int_unique in_10/20/30 cached: -4% / -7% / -10% vs main) and low-cardinality lookups improve ~2-5% warm and ~25% cold. Single-value warm-cache equality is allocation-bound (one `to_array_of_size(1)`), not comparison-bound, so it is unaffected by the comparator change. Adds `test_btree_lookup_pages_eq_bytes` covering Binary and FixedSizeBinary equality/IN over a null-min straddle page and duplicate `min`s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
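The total-order point can be shown with the standard library's `f64::total_cmp`, which this sketch uses as a stand-in for `ArrowNativeTypeOp::compare`. The helper names are hypothetical. Note that only a positive NaN sorts last under this ordering; a NaN with the sign bit set would sort first.

```rust
/// Sort floats by the IEEE 754 total order (what `f64::total_cmp` implements).
pub fn sort_total(mut values: Vec<f64>) -> Vec<f64> {
    values.sort_by(|a, b| a.total_cmp(b));
    values
}

/// Binary-search a column sorted by total order. This works even for NaN,
/// because total_cmp never reports "unordered".
pub fn find_by_total_order(sorted: &[f64], value: f64) -> Option<usize> {
    sorted.binary_search_by(|probe| probe.total_cmp(&value)).ok()
}

/// A positive quiet NaN with a fixed bit pattern, so its position is well defined.
pub fn positive_nan() -> f64 {
    f64::from_bits(0x7ff8_0000_0000_0000)
}
```

With `partial_cmp`, any comparison involving NaN returns `None`, so a binary search keyed on it cannot locate a NaN bound; under the total order NaN is just another sortable value.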

The native equality/IN comparator dispatched on the logical arrow type via `downcast_primitive!`, monomorphizing the scan ~37 times — once per logical type, including a separate path for every Date/Time/Timestamp/Duration/Decimal32-64 variant. Dispatch on the physical storage type instead: logical types backed by the same native (Date32/Time32/Decimal32/IntervalYearMonth → i32; Date64/Time64/Timestamp/ Duration/Decimal64 → i64) are reinterpreted to that integer `PrimitiveArray` via `reinterpret_primitive` (zero-copy when already that type, otherwise an O(1) `ArrayData` relabel — no value copy) and share one comparison path. This keeps the native, inlined comparator (no per-comparison vtable) for all those types while cutting the scan to ~19 monomorphizations and removing the per-temporal-type code. Intervals with struct natives and booleans keep the `make_comparator` fallback. Benchmarks (equality + IN) vs main: zero regressions; low-cardinality and IN lookups improved (e.g. in_30/int_unique/cached -9.7%). vs the prior logical-type version: held everywhere except ~2% on string_unique/cached IN, within the calibrated identical-code noise band (string path logic is unchanged). Adds `test_btree_lookup_pages_eq_temporal` covering the Date32→i32 and Timestamp→i64 reinterpret. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
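The effect of dispatching on physical rather than logical type can be sketched with plain enums. This is an analogy only: `Column`, `Physical`, and `lower_bound_page` are hypothetical, the buffers are ordinary `Vec`s rather than Arrow arrays, and no `ArrayData` relabel is involved. What it shows is that the generic search is instantiated once per storage type, however many logical types share that storage.

```rust
/// A few logical column types, each stored as i32 or i64 values (hypothetical model).
pub enum Column {
    Int32(Vec<i32>),
    Date32(Vec<i32>),
    Time32(Vec<i32>),
    Int64(Vec<i64>),
    Timestamp(Vec<i64>),
    Duration(Vec<i64>),
}

/// The physical view: only the storage type matters for comparison.
enum Physical<'a> {
    I32(&'a [i32]),
    I64(&'a [i64]),
}

/// Borrow the underlying buffer without copying any values.
fn physical(col: &Column) -> Physical<'_> {
    match col {
        Column::Int32(v) | Column::Date32(v) | Column::Time32(v) => Physical::I32(v.as_slice()),
        Column::Int64(v) | Column::Timestamp(v) | Column::Duration(v) => {
            Physical::I64(v.as_slice())
        }
    }
}

/// Generic search, monomorphized once per physical type (twice here, not six times).
fn lower_bound<T: Ord>(sorted: &[T], value: &T) -> usize {
    sorted.partition_point(|x| x < value)
}

/// Index of the first sorted min that is >= value, or None if the value
/// does not fit the column's storage type.
pub fn lower_bound_page(col: &Column, value: i64) -> Option<usize> {
    match physical(col) {
        Physical::I32(mins) => {
            let v = i32::try_from(value).ok()?;
            Some(lower_bound(mins, &v))
        }
        Physical::I64(mins) => Some(lower_bound(mins, &value)),
    }
}
```

Adding another i32-backed logical type to this model only adds a match arm; it does not add another copy of the search code.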

Self-review follow-ups on the BTreeLookup rewrite:

- Add lookup-level tests for the previously-uncovered branches: pages_in with a NULL in the value list (and a NULL-only list), the pages_eq(NULL) short-circuit with Some/All classification, a 0-row page_lookup batch, every integer width/signedness plus Float16 and Decimal128/256, the LargeBinary/LargeUtf8 byte arms, and an all-null page that sorts behind a straddle page so it falls inside the equality and range scan windows.
- Fix the BTreeLookup struct doc, which only described the range path and framed all dispatch as going through make_comparator; describe the native physical-type equality/IN dispatch with make_comparator as the fallback.
- Revert BTreeIndexState from pub back to private (no external callers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Derive `DeepSizeOf`/`PartialEq` for `BTreeLookup` instead of hand-rolling them, and account for the lookup batch via `deep_size_of_children` (which `RecordBatch` now implements) rather than `get_array_memory_size`. Same fix in `BTreeIndexState`. Also drop the doc reference to the prior `BTreeMap` implementation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

LuQQiu

Great improvement, performance validated in benchmarking

wjones127 and others added 2 commits June 8, 2026 14:50

Revert "perf(index): cache parsed btree lookup state (lance-format#7161…

fbc122a

…)" This reverts commit 9698bfb.

github-actions Bot added the A-index (Vector index, linalg, tokenizer) and performance labels Jun 9, 2026

wjones127 mentioned this pull request Jun 9, 2026

Change SargableQuery::IsIn to hold an Arrow array instead of Vec<ScalarValue> #7192

Open

wjones127 and others added 6 commits June 9, 2026 12:10

wjones127 commented Jun 10, 2026

Comment thread rust/lance-index/src/scalar/btree.rs Outdated

Comment thread rust/lance-index/src/scalar/btree.rs Outdated

wjones127 marked this pull request as ready for review June 10, 2026 16:19

wjones127 mentioned this pull request Jun 10, 2026

feat: stabilize cache codec with a versioned envelope #7163

Merged

LuQQiu approved these changes Jun 11, 2026

wjones127 merged commit 89a6dae into lance-format:main Jun 11, 2026
30 checks passed

wjones127 deleted the btree-query-lookup-batch-directly branch June 11, 2026 15:02

wjones127 merged 10 commits into lance-format:main from wjones127:btree-query-lookup-batch-directly
