perf(index): query BTree lookup batch directly#7186

Merged
wjones127 merged 10 commits into
lance-format:mainfrom
wjones127:btree-query-lookup-batch-directly
Jun 11, 2026

@wjones127 wjones127 commented Jun 9, 2026

Contributor

Closes #6802.

BTreeIndex held both the page_lookup.lance batch and a parallel BTreeLookup (a BTreeMap<OrderableScalarValue, Vec<PageRecord>> plus null-page lists), duplicating every min/max value as owned ScalarValues alongside the Arrow buffers.

This PR rewrites BTreeLookup to wrap the lookup batch as the single source of truth:

  • Range searches binary-search the sorted min column via arrow_ord::make_comparator, then scan forward filtering by max and classify Matches::Some/All — same big-O as before, with packed Arrow buffers instead of scattered tree nodes. Only the small all-null / partial-null page index lists are precomputed.
  • Equality / IN lookups go through a shared candidate_pages_for_values that compares query values against the page columns with a native, inlined comparator (no boxed DynComparator vtable call per comparison). Dispatch is on the physical storage type: logical types backed by the same native are reinterpreted to one path — zero-copy when already that type, otherwise an O(1) ArrayData relabel with no value copy — so every Date32/Time32/Decimal32/IntervalYearMonth reuses the i32 path and every Date64/Time64/Timestamp/Duration/Decimal64 reuses the i64 path, rather than each generating its own. Byte-like columns (Utf8/LargeUtf8/Binary/LargeBinary/FixedSizeBinary) compare lexicographically via ArrayAccessor; intervals with struct natives and booleans fall back to make_comparator. Comparators are built once per query and reused across all values. This keeps the scan to ~19 monomorphizations instead of one per logical type.
  • BTreeIndex no longer stores a separate lookup_batch; statistics() reads bounds from the batch and cache serialization clones the batch out of page_lookup.
  • Float ordering uses total_cmp (ArrowNativeTypeOp::compare on the native path, make_comparator on the fallback), matching the previous OrderableScalarValue ordering — so the NaN caveat in the issue is a non-issue.
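The range path in the first bullet can be sketched with plain slices standing in for the Arrow `min`/`max` columns. This is a minimal illustration, not the PR's code: `PageMatch` and `pages_between` here are hypothetical stand-ins, nulls are ignored, and the sketch assumes `mins` is sorted ascending and `maxs` is non-decreasing (as for pages cut from sorted data).

```rust
/// Hypothetical stand-in for the `Matches::Some` / `Matches::All` classification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PageMatch {
    /// Every value in the page lies inside the query range.
    All(usize),
    /// The page overlaps the range and still has to be scanned.
    Some(usize),
}

/// Pages overlapping the inclusive range `[lo, hi]`.
/// Assumes `mins` is sorted ascending and `maxs` is non-decreasing.
pub fn pages_between(mins: &[i64], maxs: &[i64], lo: i64, hi: i64) -> Vec<PageMatch> {
    assert_eq!(mins.len(), maxs.len());
    if lo > hi {
        return Vec::new(); // an inverted range matches nothing
    }
    // Binary search: first page whose min is >= lo.
    let p = mins.partition_point(|&m| m < lo);
    // Walk left over pages that start before `lo` but still reach it.
    let mut start = p;
    while start > 0 && maxs[start - 1] >= lo {
        start -= 1;
    }
    let mut out = Vec::new();
    for i in start..mins.len() {
        if mins[i] > hi {
            break; // mins are sorted, so no later page can overlap
        }
        if maxs[i] < lo {
            continue; // defensive filter on max
        }
        if mins[i] >= lo && maxs[i] <= hi {
            out.push(PageMatch::All(i));
        } else {
            out.push(PageMatch::Some(i));
        }
    }
    out
}
```

The work is one binary search plus a scan over the overlapping pages only, which is the "same big-O, packed buffers" shape the bullet describes.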

The first commit reverts #7161, which had taken the opposite approach (caching the parsed tree and regenerating the batch on serialize); that machinery is obsolete once the batch is the source of truth.

Testing

  • cargo test -p lance-index --lib (all pass, incl. NaN ordering, null handling, range/fragment consistency)
  • test_btree_lookup_pages_between covers duplicate mins, a null-min straddling page, Some/All classification, and empty/inverted ranges.
  • test_btree_lookup_pages_eq_bytes covers the native byte path for Binary and FixedSizeBinary (e.g. UUID columns).
  • test_btree_lookup_pages_eq_temporal covers the physical-type reinterpret (Date32→i32, Timestamp→i64).

Benchmarks

cargo bench -p lance-index --bench btree (the existing suite: numeric + string, high/low cardinality, equality / range / IN, cached + uncached). Compared against current main (which includes #7161) using criterion's baseline significance test — --load-baseline <pr> --baseline main — so the numbers below are criterion's own bootstrapped change estimate [lower, upper] around the median, with its p-value. Config: sample_size = 10, measurement_time = 10s, default 2% noise threshold. Run on a single dev machine (macOS arm64), so absolute timings are machine-specific; the verdicts are what matter. Equality/IN are from the current HEAD; range_* exercises pages_between, which the equality/IN dispatch work did not touch, so those numbers reflect the same code.

Wins — low-cardinality and load/deserialize-bound cases

No longer rebuilding a BTreeMap on load, plus the native comparator inlining:

| case | change [lower, upper] | p | verdict |
| --- | --- | --- | --- |
| equality/int_low_card/no_cache | −27.5% [−30.7, −24.3] | 0.00 | improved |
| equality/string_low_card/no_cache | −26.6% [−29.3, −24.3] | 0.00 | improved |
| range_few/int_low_card/no_cache | −23.7% [−26.5, −21.4] | 0.00 | improved |
| range_few/string_low_card/no_cache | −23.5% [−24.7, −21.9] | 0.00 | improved |
| equality/string_low_card/cached | −20.3% [−22.8, −18.5] | 0.00 | improved |
| range_few/string_low_card/cached | −18.0% [−19.4, −16.3] | 0.00 | improved |
| equality/int_low_card/cached | −6.64% [−7.90, −5.61] | 0.00 | improved |
| range_few/int_low_card/cached | −6.34% [−8.02, −4.49] | 0.00 | improved |

Hot-path point / IN lookups (native physical-type dispatch)

An earlier iteration that queried the batch through the boxed make_comparator (one vtable call per comparison) regressed warm-cache high-cardinality lookups by up to +44% on a 30-value IN. The native comparators remove that. Vs main, all 32 equality+IN benchmarks are improved or at parity — zero regressions. High-cardinality warm-cache equality is back to parity; warm-cache IN is improved:

| case | change vs main [lower, upper] | p | verdict |
| --- | --- | --- | --- |
| equality/int_unique/cached | parity | | no change |
| equality/string_unique/cached | parity | | no change |
| in_30/int_unique/cached | −9.67% [−10.7, −9.0] | 0.00 | improved |
| in_20/int_unique/cached | −3.81% [−4.11, −3.57] | 0.00 | improved |
| in_10/int_unique/cached | −2.80% [−4.84, −1.27] | 0.00 | improved |
| in_20/string_low_card/cached | −5.40% [−8.39, −2.96] | 0.00 | improved |
| in_10/string_low_card/cached | −3.05% [−4.36, −1.94] | 0.00 | improved |
| in_30/string_low_card/cached | −2.06% [−3.15, −1.23] | 0.00 | improved |

Noise calibration

To measure the harness noise floor, I compared two runs of identical code (same binary behavior). Even these produce "significant" verdicts:

| case (identical code) | change [lower, upper] | p | verdict |
| --- | --- | --- | --- |
| in_10/string_unique/cached | −4.29% [−7.41, −1.86] | 0.01 | "improved" |
| equality/int_unique/cached | −2.87% [−5.60, −0.68] | 0.02 | "improved" |
| equality/int_unique/no_cache | −3.14% [−4.42, −2.02] | 0.00 | "improved" |

The code did not change, so these p < 0.05 verdicts are run-to-run variance. Consistent with that, in_10/int_unique/cached has read anywhere from −4.2% to +9.3% across runs. At sample_size = 10 on warm multi-µs benchmarks, swings of ±5% earn p < 0.05 from variance alone, so single-digit-percent deltas on the warm high-cardinality cases are not reliable signal.

Residual cost

range_*/unique/cached shows small regressions — e.g. range_few/string_unique/cached +2.7%, range_few/int_unique/cached +3.8%. pages_between (the range path) still uses make_comparator; the native physical-type dispatch was added only to the equality/IN path. These sit at the edge of the calibrated noise band but are plausibly a small real cost — the deliberate tradeoff for not duplicating every min/max as an owned ScalarValue in memory, and a candidate for extending the physical-type dispatch to ranges in a follow-up. All other cases report "no change in performance detected."

wjones127 and others added 2 commits June 8, 2026 14:50
`BTreeIndex` previously held both the `page_lookup.lance` batch and a parallel
`BTreeLookup` (a `BTreeMap<OrderableScalarValue, Vec<PageRecord>>` plus null-page
lists), duplicating every min/max value as owned `ScalarValue`s alongside the
Arrow buffers.

`BTreeLookup` now wraps the lookup batch as the single source of truth. Range
searches binary-search the sorted `min` column with `arrow_ord::make_comparator`
and scan forward filtering by `max`, instead of walking a `BTreeMap`. Only the
small all-null / partial-null page index lists are precomputed. This removes the
min/max duplication and is more cache-friendly (packed buffers vs scattered
tree nodes). `make_comparator` uses `total_cmp` for floats, matching the
previous `OrderableScalarValue` ordering (including NaN).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
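For readers unfamiliar with the float ordering mentioned in this commit, here is a small standard-library sketch of what a total order buys. It uses `f64::total_cmp` directly rather than Arrow's comparator, and the helper names are illustrative only.

```rust
use std::cmp::Ordering;

/// Sort floats with the IEEE 754 total order (`f64::total_cmp`).
/// A positive NaN sorts after +infinity; a negative NaN sorts before -infinity.
pub fn sort_total(values: &mut [f64]) {
    values.sort_by(|a, b| a.total_cmp(b));
}

/// `partial_cmp` has no answer when a NaN is involved; `total_cmp` always does.
pub fn compare_both(a: f64, b: f64) -> (Option<Ordering>, Ordering) {
    (a.partial_cmp(&b), a.total_cmp(&b))
}
```

Because every pair of values is ordered, a binary search over a sorted `min` column stays well-defined even when NaN appears as a page bound.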
@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer performance labels Jun 9, 2026
@codecov

codecov Bot commented Jun 9, 2026

Codecov Report

❌ Patch coverage is 95.50296% with 38 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/btree.rs | 95.50% | 11 Missing and 27 partials ⚠️ |

Querying the lookup batch directly built fresh arrow DynComparators per
query value, regressing warm-cache high-cardinality point and IN lookups
(up to +44% on a 30-value IN) since make_comparator does type dispatch +
downcasting and the actual search is only microseconds.

Route pages_eq and pages_in through a shared candidate_pages_for_values
that builds the min/max comparators once against an array holding all
query values and reuses them, so an N-value IN costs three comparator
constructions instead of three per value. Restores parity with the
previous BTreeMap approach on those cases while keeping the load-time
and memory wins of querying the batch directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
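The "build once, reuse across values" idea in this commit can be illustrated without Arrow. In the sketch below, `DynCmp` and `build_cmp` are hypothetical stand-ins for a boxed index comparator and its constructor; the page test is a plain linear scan, chosen only to keep the focus on how many comparators get constructed.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a boxed comparator over two columns,
/// called with (row in left column, row in right column).
type DynCmp<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;

/// Construction is the comparatively expensive step being amortized.
fn build_cmp<'a>(left: &'a [i64], right: &'a [i64]) -> DynCmp<'a> {
    Box::new(move |i, j| left[i].cmp(&right[j]))
}

/// For each query value, count the pages whose [min, max] contains it.
pub fn count_candidates(mins: &[i64], maxs: &[i64], query: &[i64]) -> Vec<usize> {
    // Built once against the array holding all query values,
    // then reused for every value by passing its index.
    let min_vs_query = build_cmp(mins, query);
    let max_vs_query = build_cmp(maxs, query);
    (0..query.len())
        .map(|q| {
            (0..mins.len())
                .filter(|&p| {
                    min_vs_query(p, q) != Ordering::Greater
                        && max_vs_query(p, q) != Ordering::Less
                })
                .count()
        })
        .collect()
}
```

With this shape the number of comparator constructions is constant in the length of the `IN` list, rather than growing with it.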
wjones127 and others added 6 commits June 9, 2026 12:10
candidate_pages_for_values pushed page numbers into a Vec::new() and
pages_in collected non-null query values the same way, growing one
element at a time. Profiling the warm-cache IN benchmarks showed this
RawVec::grow_one churn was a measurable chunk of the hot path.

Presize both: the candidate vec to query.len() (high-cardinality
lookups hit ~one page per value) and the non-null vec via the iterator
size hint. Removes the per-push reallocs; brings the int IN-cached cases
back to parity with (or faster than) the previous BTreeMap path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
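A minimal sketch of the presizing described in this commit, using only the standard library. The function names are illustrative and not taken from the PR.

```rust
/// Collect the non-null query values, reserving space up front from the
/// iterator's size hint instead of growing one element at a time.
pub fn non_null_values<I>(values: I) -> Vec<i64>
where
    I: IntoIterator<Item = Option<i64>>,
{
    let iter = values.into_iter();
    let (lower, _) = iter.size_hint();
    let mut out = Vec::with_capacity(lower);
    for v in iter.flatten() {
        out.push(v);
    }
    out
}

/// Candidate page buffer sized for roughly one page per query value,
/// the expected hit count for high-cardinality lookups.
pub fn candidate_buffer(query_len: usize) -> Vec<u32> {
    Vec::with_capacity(query_len)
}
```

The capacity is only a hint: if more pages match than values were queried, the vector still grows as usual.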
…r_values

The candidate scan window splits at `p` (first row with `min >= value`):
rows in `[p, end)` have `min == value` and therefore always match
(`max >= min == value`) and never carry a null `max`, while only the
peek-left/straddle region `[start, p)` needs the `max >= value` filter.

Copy the `[p, end)` run with `extend_from_slice` instead of pushing each
page number individually. The exact-match run is largest for
low-cardinality data (many pages share one `min`), where this avoids the
per-row branch and bound check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
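The window split described in this commit can be sketched over plain slices. This is a simplified illustration under stated assumptions, not the PR's implementation: nulls are not modeled, and the choice of `start` (the run of pages sharing the `min` just before `p`) is a hypothetical rule that is only valid when each page's `max` is at most the next page's `min`.

```rust
/// Append the page numbers that may contain `value` to `out`.
/// Assumes `mins` is sorted ascending and pages come from sorted data,
/// so `maxs[i] <= mins[i + 1]`.
pub fn candidate_pages_eq(
    mins: &[i64],
    maxs: &[i64],
    page_numbers: &[u32],
    value: i64,
    out: &mut Vec<u32>,
) {
    // p: first row with min >= value.
    let p = mins.partition_point(|&m| m < value);
    // end: one past the last row with min == value.
    let end = p + mins[p..].partition_point(|&m| m == value);
    // start: first row sharing the min of row p - 1 (peek-left region).
    let start = if p == 0 {
        0
    } else {
        let prev = mins[p - 1];
        mins[..p].partition_point(|&m| m < prev)
    };
    // Straddle region [start, p): min < value, so max must be checked.
    for i in start..p {
        if maxs[i] >= value {
            out.push(page_numbers[i]);
        }
    }
    // Exact-match run [p, end): max >= min == value, so every page matches
    // and the whole run is copied in one call.
    out.extend_from_slice(&page_numbers[p..end]);
}
```

The bulk copy matters most when many pages share one `min`, since that is when `[p, end)` is long.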
The equality/IN page lookup compared via the boxed `DynComparator` from
`make_comparator`, which costs one vtable call per element comparison in the
binary search and scan. For `Utf8`/`LargeUtf8` columns, downcast once and
compare with a generic `ArrayAccessor`-based closure (`accessor_cmp`) that
inlines the `&str` comparison, matching arrow's NULLs-first ascending order.

Extracted the binary-search/scan body into `scan_equality_pages`, monomorphized
over the comparator closures, so the native path and the `make_comparator`
fallback share one implementation.

Neutralizes the warm-cache regression on `equality/string_unique` (was +5.2%
vs main, now no significant change) and improves low-cardinality string lookups
(~-19% cached, ~-26% cold). Other types fall back to `make_comparator` unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
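The idea of one scan body shared by an inlined native comparator and a boxed fallback can be sketched as follows. The names are illustrative (this is not `scan_equality_pages` itself), and plain slices stand in for Arrow arrays.

```rust
use std::cmp::Ordering;

/// Lower-bound binary search, generic over the comparator closure.
/// Each distinct closure type gets its own monomorphized copy, so a
/// typed closure can be inlined into the loop.
pub fn first_not_less<F>(len: usize, cmp_to_value: F) -> usize
where
    F: Fn(usize) -> Ordering,
{
    let (mut lo, mut hi) = (0, len);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if cmp_to_value(mid) == Ordering::Less {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    lo
}

/// Native path: the `&str` comparison is visible to the compiler.
pub fn lower_bound_str(mins: &[&str], value: &str) -> usize {
    first_not_less(mins.len(), |i| mins[i].cmp(value))
}

/// Fallback path: same search body, but each comparison goes through
/// a dynamically dispatched call.
pub fn lower_bound_dyn(len: usize, cmp: &dyn Fn(usize) -> Ordering) -> usize {
    first_not_less(len, cmp)
}
```

Both entry points reuse `first_not_less`, so the search logic is written once while the cost per comparison depends on which comparator is passed in.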
Extend the native equality comparator dispatch beyond strings: route all
primitive columns (ints, floats, decimals, temporal) through `primitive_cmp`
(comparing via `ArrowNativeTypeOp::compare`, total order so floats keep arrow's
NaN-last `make_comparator` semantics), and `Binary`/`LargeBinary`/
`FixedSizeBinary` through the existing `accessor_cmp` byte path (UUID columns
are commonly `FixedSizeBinary`). Dispatch uses arrow's `downcast_primitive!`
macro; remaining types fall back to `make_comparator`.

Both native paths inline the element comparison instead of dispatching through
the boxed `DynComparator` once per comparison, so the benefit scales with the
number of comparisons: `IN` on warm cache improves with list size
(int_unique in_10/20/30 cached: -4% / -7% / -10% vs main) and low-cardinality
lookups improve ~2-5% warm and ~25% cold. Single-value warm-cache equality is
allocation-bound (one `to_array_of_size(1)`), not comparison-bound, so it is
unaffected by the comparator change.

Adds `test_btree_lookup_pages_eq_bytes` covering Binary and FixedSizeBinary
equality/IN over a null-min straddle page and duplicate `min`s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The native equality/IN comparator dispatched on the logical arrow type via
`downcast_primitive!`, monomorphizing the scan ~37 times — once per logical type,
including a separate path for every Date/Time/Timestamp/Duration/Decimal32-64
variant.

Dispatch on the physical storage type instead: logical types backed by the same
native (Date32/Time32/Decimal32/IntervalYearMonth → i32; Date64/Time64/Timestamp/
Duration/Decimal64 → i64) are reinterpreted to that integer `PrimitiveArray` via
`reinterpret_primitive` (zero-copy when already that type, otherwise an O(1)
`ArrayData` relabel — no value copy) and share one comparison path. This keeps the
native, inlined comparator (no per-comparison vtable) for all those types while
cutting the scan to ~19 monomorphizations and removing the per-temporal-type code.
Intervals with struct natives and booleans keep the `make_comparator` fallback.

Benchmarks (equality + IN) vs main: zero regressions; low-cardinality and IN
lookups improved (e.g. in_30/int_unique/cached -9.7%). vs the prior logical-type
version: held everywhere except ~2% on string_unique/cached IN, within the
calibrated identical-code noise band (string path logic is unchanged).

Adds `test_btree_lookup_pages_eq_temporal` covering the Date32→i32 and
Timestamp→i64 reinterpret.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
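The grouping of logical types by physical storage can be pictured as a small mapping. The enums below are hypothetical and cover only a subset of types named in this commit; they are not Arrow's `DataType` and do not perform the actual `ArrayData` relabel.

```rust
/// Hypothetical subset of logical column types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Logical {
    Int32,
    Date32,
    Time32,
    Decimal32,
    IntervalYearMonth,
    Int64,
    Date64,
    Time64,
    Timestamp,
    Duration,
    Decimal64,
    Utf8,
    FixedSizeBinary,
    Boolean,
    IntervalMonthDayNano,
}

/// Comparison path chosen for a column (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Physical {
    I32,
    I64,
    Bytes,
    Fallback,
}

/// Many logical types collapse onto one physical comparison path.
pub fn physical_path(t: Logical) -> Physical {
    use Logical::*;
    match t {
        Int32 | Date32 | Time32 | Decimal32 | IntervalYearMonth => Physical::I32,
        Int64 | Date64 | Time64 | Timestamp | Duration | Decimal64 => Physical::I64,
        Utf8 | FixedSizeBinary => Physical::Bytes,
        // Booleans and struct-native intervals keep the generic comparator.
        Boolean | IntervalMonthDayNano => Physical::Fallback,
    }
}
```

Dispatching on the right-hand side of this mapping is what lets the scan be compiled once per physical type instead of once per logical type.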
Self-review follow-ups on the BTreeLookup rewrite:

- Add lookup-level tests for the previously-uncovered branches: pages_in
  with a NULL in the value list (and a NULL-only list), the pages_eq(NULL)
  short-circuit with Some/All classification, a 0-row page_lookup batch,
  every integer width/signedness plus Float16 and Decimal128/256, the
  LargeBinary/LargeUtf8 byte arms, and an all-null page that sorts behind a
  straddle page so it falls inside the equality and range scan windows.
- Fix the BTreeLookup struct doc, which only described the range path and
  framed all dispatch as going through make_comparator; describe the native
  physical-type equality/IN dispatch with make_comparator as the fallback.
- Revert BTreeIndexState from pub back to private (no external callers).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Comment thread rust/lance-index/src/scalar/btree.rs Outdated
Derive `DeepSizeOf`/`PartialEq` for `BTreeLookup` instead of hand-rolling
them, and account for the lookup batch via `deep_size_of_children` (which
`RecordBatch` now implements) rather than `get_array_memory_size`. Same fix
in `BTreeIndexState`. Also drop the doc reference to the prior `BTreeMap`
implementation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wjones127 wjones127 marked this pull request as ready for review June 10, 2026 16:19

@LuQQiu LuQQiu left a comment

Contributor

Great improvement, performance validated in benchmarking

@wjones127 wjones127 merged commit 89a6dae into lance-format:main Jun 11, 2026
30 checks passed
@wjones127 wjones127 deleted the btree-query-lookup-batch-directly branch June 11, 2026 15:02

Labels

A-index Vector index, linalg, tokenizer performance

Development

Successfully merging this pull request may close these issues.

Query the BTree lookup batch directly instead of building a separate BTreeLookup

2 participants