Skip to content

perf(index): cache parsed btree lookup state#7161

Merged
wjones127 merged 4 commits into
lance-format:mainfrom
LuQQiu:btree_cache
Jun 8, 2026
Merged

perf(index): cache parsed btree lookup state#7161
wjones127 merged 4 commits into
lance-format:mainfrom
LuQQiu:btree_cache

Conversation

@LuQQiu

@LuQQiu LuQQiu commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

  • cache BTree lookup state as parsed Arc<BTreeLookup> instead of reparsing page_lookup.lance on cache hits
  • add a cache codec that regenerates the lookup RecordBatch only when serializing cache state
  • keep cache size accounting tied to the resident parsed lookup tree and add coverage for codec round-trip, cache sizing, plugin cache reuse, range-partitioned cache, and frag-reuse reconstruction

Testing

  • cargo test -p lance-index btree_index_state
  • cargo test -p lance-index btree_plugin_cache -- btree_range_partitioned_plugin_cache_roundtrip
  • cargo check -p lance-index
  • cargo clippy -p lance-index --all-targets

@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer performance labels Jun 8, 2026

@wjones127 wjones127 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW my plan was to implement #6802

This can work as an interim solution, though.

@codecov

codecov Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.16342% with 15 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/btree.rs 94.16% 3 Missing and 12 partials ⚠️

📢 Thoughts on this report? Let us know!

@LuQQiu

LuQQiu commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

FWIW my plan was to implement #6802

This can work as an interim solution, though.

That would be wonderful, if we can directly query from arrow lookup batch while preserving the perf

@wjones127 wjones127 merged commit 9698bfb into lance-format:main Jun 8, 2026
30 checks passed
wjones127 added a commit that referenced this pull request Jun 11, 2026
Closes #6802.

`BTreeIndex` held both the `page_lookup.lance` batch and a parallel
`BTreeLookup` (a `BTreeMap<OrderableScalarValue, Vec<PageRecord>>` plus
null-page lists), duplicating every min/max value as owned
`ScalarValue`s alongside the Arrow buffers.

This PR rewrites `BTreeLookup` to wrap the lookup batch as the single
source of truth:

- Range searches binary-search the sorted `min` column via
`arrow_ord::make_comparator`, then scan forward filtering by `max` and
classify `Matches::Some`/`All` — same big-O as before, with packed Arrow
buffers instead of scattered tree nodes. Only the small all-null /
partial-null page index lists are precomputed.
- Equality / `IN` lookups go through a shared
`candidate_pages_for_values` that compares query values against the page
columns with a native, inlined comparator (no boxed `DynComparator`
vtable call per comparison). Dispatch is on the **physical storage
type**: logical types backed by the same native are reinterpreted to one
path — zero-copy when already that type, otherwise an O(1) `ArrayData`
relabel with no value copy — so every
`Date32`/`Time32`/`Decimal32`/`IntervalYearMonth` reuses the `i32` path
and every `Date64`/`Time64`/`Timestamp`/`Duration`/`Decimal64` reuses
the `i64` path, rather than each generating its own. Byte-like columns
(`Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`/`FixedSizeBinary`) compare
lexicographically via `ArrayAccessor`; intervals with struct natives and
booleans fall back to `make_comparator`. Comparators are built once per
query and reused across all values. This keeps the scan to ~19
monomorphizations instead of one per logical type.
- `BTreeIndex` no longer stores a separate `lookup_batch`;
`statistics()` reads bounds from the batch and cache serialization
clones the batch out of `page_lookup`.
- Float ordering uses `total_cmp` (`ArrowNativeTypeOp::compare` on the
native path, `make_comparator` on the fallback), matching the previous
`OrderableScalarValue` ordering — so the NaN caveat in the issue is a
non-issue.

The first commit reverts #7161, which had taken the opposite approach
(caching the parsed tree and regenerating the batch on serialize); that
machinery is obsolete once the batch is the source of truth.

## Testing
- `cargo test -p lance-index --lib` (all pass, incl. NaN ordering, null
handling, range/fragment consistency)
- `test_btree_lookup_pages_between` covers duplicate `min`s, a
null-`min` straddling page, Some/All classification, and empty/inverted
ranges.
- `test_btree_lookup_pages_eq_bytes` covers the native byte path for
`Binary` and `FixedSizeBinary` (e.g. UUID columns).
- `test_btree_lookup_pages_eq_temporal` covers the physical-type
reinterpret (`Date32`→`i32`, `Timestamp`→`i64`).

## Benchmarks

`cargo bench -p lance-index --bench btree` (the existing suite: numeric
+ string, high/low cardinality, equality / range / `IN`, cached +
uncached). Compared against current `main` (which includes #7161) using
criterion's baseline significance test — `--load-baseline <pr>
--baseline main` — so the numbers below are criterion's own bootstrapped
change estimate `[lower, upper]` around the median, with its p-value.
Config: `sample_size = 10`, `measurement_time = 10s`, default 2% noise
threshold. Run on a single dev machine (macOS arm64), so absolute
timings are machine-specific; the verdicts are what matter.
Equality/`IN` are from the current HEAD; `range_*` exercises
`pages_between`, which the equality/IN dispatch work did not touch, so
those numbers reflect the same code.

### Wins — low-cardinality and load/deserialize-bound cases

No longer rebuilding a `BTreeMap` on load, plus the native comparator
inlining:

| case | change | p | verdict |
|---|---|---|---|
| equality/int_low_card/no_cache | −27.5% [−30.7, −24.3] | 0.00 |
improved |
| equality/string_low_card/no_cache | −26.6% [−29.3, −24.3] | 0.00 |
improved |
| range_few/int_low_card/no_cache | −23.7% [−26.5, −21.4] | 0.00 |
improved |
| range_few/string_low_card/no_cache | −23.5% [−24.7, −21.9] | 0.00 |
improved |
| equality/string_low_card/cached | −20.3% [−22.8, −18.5] | 0.00 |
improved |
| range_few/string_low_card/cached | −18.0% [−19.4, −16.3] | 0.00 |
improved |
| equality/int_low_card/cached | −6.64% [−7.90, −5.61] | 0.00 | improved
|
| range_few/int_low_card/cached | −6.34% [−8.02, −4.49] | 0.00 |
improved |

### Hot-path point / `IN` lookups (native physical-type dispatch)

An earlier iteration that queried the batch through the boxed
`make_comparator` (one vtable call per comparison) regressed warm-cache
high-cardinality lookups by up to +44% on a 30-value `IN`. The native
comparators remove that. **Vs `main`, all 32 equality+IN benchmarks are
improved or at parity — zero regressions.** High-cardinality warm-cache
equality is back to parity; warm-cache `IN` is improved:

| case | change vs main | p | verdict |
|---|---|---|---|
| equality/int_unique/cached | parity | — | no change |
| equality/string_unique/cached | parity | — | no change |
| in_30/int_unique/cached | −9.67% [−10.7, −9.0] | 0.00 | improved |
| in_20/int_unique/cached | −3.81% [−4.11, −3.57] | 0.00 | improved |
| in_10/int_unique/cached | −2.80% [−4.84, −1.27] | 0.00 | improved |
| in_20/string_low_card/cached | −5.40% [−8.39, −2.96] | 0.00 | improved
|
| in_10/string_low_card/cached | −3.05% [−4.36, −1.94] | 0.00 | improved
|
| in_30/string_low_card/cached | −2.06% [−3.15, −1.23] | 0.00 | improved
|

### Noise calibration

To measure the harness noise floor, comparing two runs of **identical
code** (same binary behavior) still produces "significant" verdicts:

| case (identical code) | change | p | verdict |
|---|---|---|---|
| in_10/string_unique/cached | −4.29% [−7.41, −1.86] | 0.01 | "improved"
|
| equality/int_unique/cached | −2.87% [−5.60, −0.68] | 0.02 | "improved"
|
| equality/int_unique/no_cache | −3.14% [−4.42, −2.02] | 0.00 |
"improved" |

The code did not change, so these p < 0.05 verdicts are run-to-run
variance. Consistent with that, `in_10/int_unique/cached` has read
anywhere from −4.2% to +9.3% across runs. At `sample_size = 10` on warm
multi-µs benchmarks, swings of ±5% earn p < 0.05 from variance alone, so
single-digit-percent deltas on the warm high-cardinality cases are not
reliable signal.

### Residual cost

`range_*/unique/cached` shows small regressions — e.g.
`range_few/string_unique/cached +2.7%`, `range_few/int_unique/cached
+3.8%`. `pages_between` (the range path) still uses `make_comparator`;
the native physical-type dispatch was added only to the equality/`IN`
path. These sit at the edge of the calibrated noise band but are
plausibly a small real cost — the deliberate tradeoff for not
duplicating every min/max as an owned `ScalarValue` in memory, and a
candidate for extending the physical-type dispatch to ranges in a
follow-up. All other cases report "no change in performance detected."

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-index Vector index, linalg, tokenizer performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants