perf(index): cache parsed btree lookup state#7161
Merged
Merged
Conversation
# Conflicts: # rust/lance-index/src/scalar/btree.rs
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Contributor
Author
That would be wonderful, if we can directly query from arrow lookup batch while preserving the perf |
This was referenced Jun 9, 2026
wjones127
added a commit
that referenced
this pull request
Jun 11, 2026
Closes #6802. `BTreeIndex` held both the `page_lookup.lance` batch and a parallel `BTreeLookup` (a `BTreeMap<OrderableScalarValue, Vec<PageRecord>>` plus null-page lists), duplicating every min/max value as owned `ScalarValue`s alongside the Arrow buffers. This PR rewrites `BTreeLookup` to wrap the lookup batch as the single source of truth: - Range searches binary-search the sorted `min` column via `arrow_ord::make_comparator`, then scan forward filtering by `max` and classify `Matches::Some`/`All` — same big-O as before, with packed Arrow buffers instead of scattered tree nodes. Only the small all-null / partial-null page index lists are precomputed. - Equality / `IN` lookups go through a shared `candidate_pages_for_values` that compares query values against the page columns with a native, inlined comparator (no boxed `DynComparator` vtable call per comparison). Dispatch is on the **physical storage type**: logical types backed by the same native are reinterpreted to one path — zero-copy when already that type, otherwise an O(1) `ArrayData` relabel with no value copy — so every `Date32`/`Time32`/`Decimal32`/`IntervalYearMonth` reuses the `i32` path and every `Date64`/`Time64`/`Timestamp`/`Duration`/`Decimal64` reuses the `i64` path, rather than each generating its own. Byte-like columns (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`/`FixedSizeBinary`) compare lexicographically via `ArrayAccessor`; intervals with struct natives and booleans fall back to `make_comparator`. Comparators are built once per query and reused across all values. This keeps the scan to ~19 monomorphizations instead of one per logical type. - `BTreeIndex` no longer stores a separate `lookup_batch`; `statistics()` reads bounds from the batch and cache serialization clones the batch out of `page_lookup`. - Float ordering uses `total_cmp` (`ArrowNativeTypeOp::compare` on the native path, `make_comparator` on the fallback), matching the previous `OrderableScalarValue` ordering — so the NaN caveat in the issue is a non-issue. The first commit reverts #7161, which had taken the opposite approach (caching the parsed tree and regenerating the batch on serialize); that machinery is obsolete once the batch is the source of truth. ## Testing - `cargo test -p lance-index --lib` (all pass, incl. NaN ordering, null handling, range/fragment consistency) - `test_btree_lookup_pages_between` covers duplicate `min`s, a null-`min` straddling page, Some/All classification, and empty/inverted ranges. - `test_btree_lookup_pages_eq_bytes` covers the native byte path for `Binary` and `FixedSizeBinary` (e.g. UUID columns). - `test_btree_lookup_pages_eq_temporal` covers the physical-type reinterpret (`Date32`→`i32`, `Timestamp`→`i64`). ## Benchmarks `cargo bench -p lance-index --bench btree` (the existing suite: numeric + string, high/low cardinality, equality / range / `IN`, cached + uncached). Compared against current `main` (which includes #7161) using criterion's baseline significance test — `--load-baseline <pr> --baseline main` — so the numbers below are criterion's own bootstrapped change estimate `[lower, upper]` around the median, with its p-value. Config: `sample_size = 10`, `measurement_time = 10s`, default 2% noise threshold. Run on a single dev machine (macOS arm64), so absolute timings are machine-specific; the verdicts are what matter. Equality/`IN` are from the current HEAD; `range_*` exercises `pages_between`, which the equality/IN dispatch work did not touch, so those numbers reflect the same code. ### Wins — low-cardinality and load/deserialize-bound cases No longer rebuilding a `BTreeMap` on load, plus the native comparator inlining: | case | change | p | verdict | |---|---|---|---| | equality/int_low_card/no_cache | −27.5% [−30.7, −24.3] | 0.00 | improved | | equality/string_low_card/no_cache | −26.6% [−29.3, −24.3] | 0.00 | improved | | range_few/int_low_card/no_cache | −23.7% [−26.5, −21.4] | 0.00 | improved | | range_few/string_low_card/no_cache | −23.5% [−24.7, −21.9] | 0.00 | improved | | equality/string_low_card/cached | −20.3% [−22.8, −18.5] | 0.00 | improved | | range_few/string_low_card/cached | −18.0% [−19.4, −16.3] | 0.00 | improved | | equality/int_low_card/cached | −6.64% [−7.90, −5.61] | 0.00 | improved | | range_few/int_low_card/cached | −6.34% [−8.02, −4.49] | 0.00 | improved | ### Hot-path point / `IN` lookups (native physical-type dispatch) An earlier iteration that queried the batch through the boxed `make_comparator` (one vtable call per comparison) regressed warm-cache high-cardinality lookups by up to +44% on a 30-value `IN`. The native comparators remove that. **Vs `main`, all 32 equality+IN benchmarks are improved or at parity — zero regressions.** High-cardinality warm-cache equality is back to parity; warm-cache `IN` is improved: | case | change vs main | p | verdict | |---|---|---|---| | equality/int_unique/cached | parity | — | no change | | equality/string_unique/cached | parity | — | no change | | in_30/int_unique/cached | −9.67% [−10.7, −9.0] | 0.00 | improved | | in_20/int_unique/cached | −3.81% [−4.11, −3.57] | 0.00 | improved | | in_10/int_unique/cached | −2.80% [−4.84, −1.27] | 0.00 | improved | | in_20/string_low_card/cached | −5.40% [−8.39, −2.96] | 0.00 | improved | | in_10/string_low_card/cached | −3.05% [−4.36, −1.94] | 0.00 | improved | | in_30/string_low_card/cached | −2.06% [−3.15, −1.23] | 0.00 | improved | ### Noise calibration To measure the harness noise floor, comparing two runs of **identical code** (same binary behavior) still produces "significant" verdicts: | case (identical code) | change | p | verdict | |---|---|---|---| | in_10/string_unique/cached | −4.29% [−7.41, −1.86] | 0.01 | "improved" | | equality/int_unique/cached | −2.87% [−5.60, −0.68] | 0.02 | "improved" | | equality/int_unique/no_cache | −3.14% [−4.42, −2.02] | 0.00 | "improved" | The code did not change, so these p < 0.05 verdicts are run-to-run variance. Consistent with that, `in_10/int_unique/cached` has read anywhere from −4.2% to +9.3% across runs. At `sample_size = 10` on warm multi-µs benchmarks, swings of ±5% earn p < 0.05 from variance alone, so single-digit-percent deltas on the warm high-cardinality cases are not reliable signal. ### Residual cost `range_*/unique/cached` shows small regressions — e.g. `range_few/string_unique/cached +2.7%`, `range_few/int_unique/cached +3.8%`. `pages_between` (the range path) still uses `make_comparator`; the native physical-type dispatch was added only to the equality/`IN` path. These sit at the edge of the calibrated noise band but are plausibly a small real cost — the deliberate tradeoff for not duplicating every min/max as an owned `ScalarValue` in memory, and a candidate for extending the physical-type dispatch to ranges in a follow-up. All other cases report "no change in performance detected." --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Arc<BTreeLookup>instead of reparsingpage_lookup.lanceon cache hitsTesting
cargo test -p lance-index btree_index_statecargo test -p lance-index btree_plugin_cache -- btree_range_partitioned_plugin_cache_roundtripcargo check -p lance-indexcargo clippy -p lance-index --all-targets