perf(index): cache parsed btree lookup state by LuQQiu · Pull Request #7161 · lance-format/lance

LuQQiu · 2026-06-08T18:56:02Z

Summary

- Cache BTree lookup state as a parsed `Arc<BTreeLookup>` instead of reparsing `page_lookup.lance` on cache hits.
- Add a cache codec that regenerates the lookup `RecordBatch` only when serializing cache state.
- Keep cache size accounting tied to the resident parsed lookup tree, and add coverage for codec round-trip, cache sizing, plugin cache reuse, range-partitioned cache, and frag-reuse reconstruction.
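
The caching idea in this summary can be illustrated with a minimal standard-library sketch. It is not the code from this PR: `RawRow`, `ParsedLookup`, and `LookupCache` are hypothetical stand-ins for the lookup `RecordBatch`, `BTreeLookup`, and the index cache, and the row layout is assumed. It shows a cache hit returning the already-parsed `Arc` without reparsing, with the raw form regenerated only on demand.

```rust
use std::collections::{BTreeMap, HashMap};
use std::sync::Arc;

/// Hypothetical stand-in for one row of the serialized lookup: (min, max, page id).
type RawRow = (i64, i64, u32);

/// Hypothetical parsed lookup: pages grouped by each page's minimum value.
#[derive(Debug, PartialEq)]
struct ParsedLookup {
    pages_by_min: BTreeMap<i64, Vec<(i64, u32)>>, // min -> [(max, page_id)]
}

impl ParsedLookup {
    /// Parse raw rows once; this is the work a cache hit should skip.
    fn parse(rows: &[RawRow]) -> Self {
        let mut pages_by_min: BTreeMap<i64, Vec<(i64, u32)>> = BTreeMap::new();
        for &(min, max, page_id) in rows {
            pages_by_min.entry(min).or_default().push((max, page_id));
        }
        Self { pages_by_min }
    }

    /// Regenerate the raw rows; only needed when the cache entry is serialized.
    fn to_rows(&self) -> Vec<RawRow> {
        self.pages_by_min
            .iter()
            .flat_map(|(&min, pages)| pages.iter().map(move |&(max, id)| (min, max, id)))
            .collect()
    }
}

/// Hypothetical in-memory cache that keeps the parsed form behind an `Arc`.
#[derive(Default)]
struct LookupCache {
    entries: HashMap<String, Arc<ParsedLookup>>,
    parse_count: usize,
}

impl LookupCache {
    /// Return the cached parsed lookup, loading and parsing only on a miss.
    fn get_or_parse(
        &mut self,
        key: &str,
        load_rows: impl FnOnce() -> Vec<RawRow>,
    ) -> Arc<ParsedLookup> {
        if let Some(hit) = self.entries.get(key) {
            return Arc::clone(hit); // cache hit: no reload, no reparse
        }
        self.parse_count += 1;
        let parsed = Arc::new(ParsedLookup::parse(&load_rows()));
        self.entries.insert(key.to_string(), Arc::clone(&parsed));
        parsed
    }
}
```

Calling `get_or_parse` twice with the same key hands back the same `Arc`, so the loader closure runs once. `to_rows` plays the role of the codec's serialize step in this sketch.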

Testing

cargo test -p lance-index btree_index_state
cargo test -p lance-index btree_plugin_cache -- btree_range_partitioned_plugin_cache_roundtrip
cargo check -p lance-index
cargo clippy -p lance-index --all-targets


wjones127

FWIW my plan was to implement #6802

This can work as an interim solution, though.

codecov · 2026-06-08T19:33:43Z

Codecov Report

❌ Patch coverage is 94.16342% with 15 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/btree.rs	94.16%	3 Missing and 12 partials ⚠️


LuQQiu · 2026-06-08T20:00:22Z

> FWIW my plan was to implement #6802
>
> This can work as an interim solution, though.

That would be wonderful if we can query the Arrow lookup batch directly while preserving the performance.

Closes #6802.

`BTreeIndex` held both the `page_lookup.lance` batch and a parallel `BTreeLookup` (a `BTreeMap<OrderableScalarValue, Vec<PageRecord>>` plus null-page lists), duplicating every min/max value as owned `ScalarValue`s alongside the Arrow buffers. This PR rewrites `BTreeLookup` to wrap the lookup batch as the single source of truth:

- Range searches binary-search the sorted `min` column via `arrow_ord::make_comparator`, then scan forward filtering by `max` and classify `Matches::Some`/`All`. This is the same big-O as before, with packed Arrow buffers instead of scattered tree nodes. Only the small all-null / partial-null page index lists are precomputed.
- Equality / `IN` lookups go through a shared `candidate_pages_for_values` that compares query values against the page columns with a native, inlined comparator (no boxed `DynComparator` vtable call per comparison). Dispatch is on the **physical storage type**: logical types backed by the same native are reinterpreted to one path (zero-copy when already that type, otherwise an O(1) `ArrayData` relabel with no value copy), so every `Date32`/`Time32`/`Decimal32`/`IntervalYearMonth` reuses the `i32` path and every `Date64`/`Time64`/`Timestamp`/`Duration`/`Decimal64` reuses the `i64` path, rather than each generating its own. Byte-like columns (`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`/`FixedSizeBinary`) compare lexicographically via `ArrayAccessor`; intervals with struct natives and booleans fall back to `make_comparator`. Comparators are built once per query and reused across all values. This keeps the scan to ~19 monomorphizations instead of one per logical type.
- `BTreeIndex` no longer stores a separate `lookup_batch`; `statistics()` reads bounds from the batch and cache serialization clones the batch out of `page_lookup`.
- Float ordering uses `total_cmp` (`ArrowNativeTypeOp::compare` on the native path, `make_comparator` on the fallback), matching the previous `OrderableScalarValue` ordering, so the NaN caveat in the issue is a non-issue.

The first commit reverts #7161, which had taken the opposite approach (caching the parsed tree and regenerating the batch on serialize); that machinery is obsolete once the batch is the source of truth.

## Testing

- `cargo test -p lance-index --lib` (all pass, incl. NaN ordering, null handling, range/fragment consistency)
- `test_btree_lookup_pages_between` covers duplicate `min`s, a null-`min` straddling page, Some/All classification, and empty/inverted ranges.
- `test_btree_lookup_pages_eq_bytes` covers the native byte path for `Binary` and `FixedSizeBinary` (e.g. UUID columns).
- `test_btree_lookup_pages_eq_temporal` covers the physical-type reinterpret (`Date32`→`i32`, `Timestamp`→`i64`).

## Benchmarks

`cargo bench -p lance-index --bench btree` (the existing suite: numeric + string, high/low cardinality, equality / range / `IN`, cached + uncached). Compared against current `main` (which includes #7161) using criterion's baseline significance test (`--load-baseline <pr> --baseline main`), so the numbers below are criterion's own bootstrapped change estimate `[lower, upper]` around the median, with its p-value. Config: `sample_size = 10`, `measurement_time = 10s`, default 2% noise threshold. Run on a single dev machine (macOS arm64), so absolute timings are machine-specific; the verdicts are what matter. Equality/`IN` are from the current HEAD; `range_*` exercises `pages_between`, which the equality/`IN` dispatch work did not touch, so those numbers reflect the same code.
### Wins: low-cardinality and load/deserialize-bound cases

No longer rebuilding a `BTreeMap` on load, plus the native comparator inlining:

| case | change | p | verdict |
|---|---|---|---|
| equality/int_low_card/no_cache | −27.5% [−30.7, −24.3] | 0.00 | improved |
| equality/string_low_card/no_cache | −26.6% [−29.3, −24.3] | 0.00 | improved |
| range_few/int_low_card/no_cache | −23.7% [−26.5, −21.4] | 0.00 | improved |
| range_few/string_low_card/no_cache | −23.5% [−24.7, −21.9] | 0.00 | improved |
| equality/string_low_card/cached | −20.3% [−22.8, −18.5] | 0.00 | improved |
| range_few/string_low_card/cached | −18.0% [−19.4, −16.3] | 0.00 | improved |
| equality/int_low_card/cached | −6.64% [−7.90, −5.61] | 0.00 | improved |
| range_few/int_low_card/cached | −6.34% [−8.02, −4.49] | 0.00 | improved |

### Hot-path point / `IN` lookups (native physical-type dispatch)

An earlier iteration that queried the batch through the boxed `make_comparator` (one vtable call per comparison) regressed warm-cache high-cardinality lookups by up to +44% on a 30-value `IN`. The native comparators remove that.
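
The contrast between a boxed comparator and a native, monomorphized one can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: `boxed_cmp`, `candidates_boxed`, and `candidates_native` are hypothetical names, plain slices stand in for Arrow arrays, and nulls are ignored.

```rust
use std::cmp::Ordering;

/// Boxed comparator over two slices: every call goes through dynamic dispatch,
/// loosely analogous in shape to a boxed comparator closure.
type BoxedCmp<'a> = Box<dyn Fn(usize, usize) -> Ordering + 'a>;

fn boxed_cmp<'a>(left: &'a [i32], right: &'a [i32]) -> BoxedCmp<'a> {
    Box::new(move |i, j| left[i].cmp(&right[j]))
}

/// Pages whose [min, max] range may contain any query value (boxed path).
fn candidates_boxed(mins: &[i32], maxs: &[i32], values: &[i32]) -> Vec<usize> {
    // Build the comparators once and reuse them for every value.
    let min_vs_value = boxed_cmp(mins, values);
    let max_vs_value = boxed_cmp(maxs, values);
    (0..mins.len())
        .filter(|&page| {
            (0..values.len()).any(|v| {
                min_vs_value(page, v) != Ordering::Greater
                    && max_vs_value(page, v) != Ordering::Less
            })
        })
        .collect()
}

/// Same logic, generic over the physical value type, so each instantiation
/// compares values directly and the compiler is free to inline the comparison.
fn candidates_native<T: Ord + Copy>(mins: &[T], maxs: &[T], values: &[T]) -> Vec<usize> {
    (0..mins.len())
        .filter(|&page| values.iter().any(|v| mins[page] <= *v && *v <= maxs[page]))
        .collect()
}
```

In this sketch, any logical type stored as `i32` (for example a day count) could reuse the `candidates_native::<i32>` instantiation, which is the idea behind dispatching on the physical storage type.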
**Vs `main`, all 32 equality+IN benchmarks are improved or at parity, with zero regressions.** High-cardinality warm-cache equality is back to parity; warm-cache `IN` is improved:

| case | change vs main | p | verdict |
|---|---|---|---|
| equality/int_unique/cached | parity | n/a | no change |
| equality/string_unique/cached | parity | n/a | no change |
| in_30/int_unique/cached | −9.67% [−10.7, −9.0] | 0.00 | improved |
| in_20/int_unique/cached | −3.81% [−4.11, −3.57] | 0.00 | improved |
| in_10/int_unique/cached | −2.80% [−4.84, −1.27] | 0.00 | improved |
| in_20/string_low_card/cached | −5.40% [−8.39, −2.96] | 0.00 | improved |
| in_10/string_low_card/cached | −3.05% [−4.36, −1.94] | 0.00 | improved |
| in_30/string_low_card/cached | −2.06% [−3.15, −1.23] | 0.00 | improved |

### Noise calibration

To measure the harness noise floor, comparing two runs of **identical code** (same binary behavior) still produces "significant" verdicts:

| case (identical code) | change | p | verdict |
|---|---|---|---|
| in_10/string_unique/cached | −4.29% [−7.41, −1.86] | 0.01 | "improved" |
| equality/int_unique/cached | −2.87% [−5.60, −0.68] | 0.02 | "improved" |
| equality/int_unique/no_cache | −3.14% [−4.42, −2.02] | 0.00 | "improved" |

The code did not change, so these p < 0.05 verdicts are run-to-run variance. Consistent with that, `in_10/int_unique/cached` has read anywhere from −4.2% to +9.3% across runs. At `sample_size = 10` on warm multi-µs benchmarks, swings of ±5% earn p < 0.05 from variance alone, so single-digit-percent deltas on the warm high-cardinality cases are not reliable signal.

### Residual cost

`range_*/unique/cached` shows small regressions (e.g. `range_few/string_unique/cached +2.7%`, `range_few/int_unique/cached +3.8%`). `pages_between` (the range path) still uses `make_comparator`; the native physical-type dispatch was added only to the equality/`IN` path.
These sit at the edge of the calibrated noise band but are plausibly a small real cost: the deliberate tradeoff for not duplicating every min/max as an owned `ScalarValue` in memory, and a candidate for extending the physical-type dispatch to ranges in a follow-up. All other cases report "no change in performance detected."

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
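
The range search and float ordering described in the #7186 text above can be sketched on plain `f64` slices. This is a simplified illustration under stated assumptions rather than the actual `pages_between`: the function signature and the `PageMatch` enum (mirroring the `All` / `Some` classification) are hypothetical, nulls are not modeled, and the binary search here only bounds the end of the candidate window, which may differ from how the real implementation chooses it.

```rust
/// How a page relates to the query range (hypothetical mirror of `All` / `Some`).
#[derive(Debug, PartialEq)]
enum PageMatch {
    /// Every value in the page lies inside the range.
    Full,
    /// The page overlaps the range but may also hold values outside it.
    Partial,
}

/// Find pages overlapping the inclusive range [lo, hi].
/// Assumes `mins` is sorted ascending under `f64::total_cmp`,
/// and that `maxs[i]` is the maximum value of page `i`.
fn pages_between(mins: &[f64], maxs: &[f64], lo: f64, hi: f64) -> Vec<(usize, PageMatch)> {
    use std::cmp::Ordering::{Greater, Less};
    if lo.total_cmp(&hi) == Greater {
        return Vec::new(); // an inverted range matches nothing
    }
    // Binary search for the first page whose min exceeds `hi`;
    // no page at or after that index can overlap the range.
    let end = mins.partition_point(|m| m.total_cmp(&hi) != Greater);
    // Scan the candidate window, keeping pages whose max reaches `lo`.
    (0..end)
        .filter(|&i| maxs[i].total_cmp(&lo) != Less)
        .map(|i| {
            let inside = mins[i].total_cmp(&lo) != Less && maxs[i].total_cmp(&hi) != Greater;
            (i, if inside { PageMatch::Full } else { PageMatch::Partial })
        })
        .collect()
}
```

Because `total_cmp` is a total order, a positive NaN has a fixed position above positive infinity, so a NaN page stays at the end of the sorted `min` column and the binary search remains well defined.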

LuQQiu added 3 commits June 7, 2026 16:17

perf(index): cache deserialized btree indexes

cd53b5e

perf(index): cache parsed btree lookup state

fac36ab

Merge branch 'main' of github.com:lancedb/lance into btree_cache

7a1751e

# Conflicts: # rust/lance-index/src/scalar/btree.rs

github-actions Bot added the A-index (Vector index, linalg, tokenizer) and performance labels Jun 8, 2026

wjones127 approved these changes Jun 8, 2026


fix(index): preserve btree null counts in cache state

ea35511

wjones127 merged commit 9698bfb into lance-format:main Jun 8, 2026
30 checks passed

This was referenced Jun 9, 2026

perf(index): query BTree lookup batch directly #7186

Merged

feat: stabilize cache codec with a versioned envelope #7163

Merged

wjones127 merged 4 commits into lance-format:main from LuQQiu:btree_cache
