Background
BTreeIndex reads its page_lookup.lance file into a RecordBatch with columns min | max | null_count | page_idx, then walks the batch once in try_from_serialized to build a separate BTreeLookup:
BTreeMap<OrderableScalarValue, Vec<PageRecord{max, page_number}>>
null_pages: Vec<u32>, all_null_pages: Vec<u32>
After #6793 the lookup batch is also retained on BTreeIndex so the index can serialize itself into a cache (BTreeIndexState) without re-reading from disk. This means min/max values are now stored twice — once as Arrow buffers in the batch, once as owned ScalarValues inside BTreeLookup. For wide value types or indices with many pages this is real memory.
Proposal
Rewrite BTreeLookup to query the batch directly:
- Sorted iteration: the batch rows are already ordered by ascending
min (training sorts the data and emits pages in order).
pages_between(lo, hi) becomes a binary search on the min column to find the starting row, then a linear scan filtering by max. Same big-O as today, better constants (packed arrow buffers vs scattered BTreeMap nodes).
- Store a small precomputed
Vec<u32> of all-null page indices (rows where max IS NULL) and a non-null row range for the binary-search bounds; the rest is derived from the batch.
Type dispatch: arrow_ord::ord::make_comparator
pub fn make_comparator(left: &dyn Array, right: &dyn Array, opts: SortOptions)
-> Result<DynComparator, ArrowError>;
pub type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;
This is exactly the primitive we need. It handles every Arrow type out of the box: all primitives via downcast_primitive!, plus Boolean, Utf8/LargeUtf8/Utf8View, Binary/LargeBinary/BinaryView, FixedSizeBinary, and nested types (lists, structs, FSL) via a recursive fallback. Null ordering is configurable via SortOptions.
With it, the binary search is ~10 lines:
fn partition_point(mins: &dyn Array, query: &dyn Array /* 1-row */) -> Result<usize> {
let cmp = make_comparator(mins, query, SortOptions::default())?;
let mut lo = 0;
let mut hi = mins.len();
while lo < hi {
let mid = lo + (hi - lo) / 2;
if cmp(mid, 0).is_lt() { lo = mid + 1 } else { hi = mid }
}
Ok(lo)
}
Callers build the query as a 1-row Arrow array of the same type as the min column (ScalarValue::to_array_of_size(1) does this directly).
The TODO sits on the lookup_batch field on BTreeIndex in rust/lance-index/src/scalar/btree.rs.
Benefits
- Eliminates the min/max duplication; the batch becomes the single source of truth.
put_in_cache is trivial (clone the batch + the small all_null_pages Vec).
- Likely faster for numeric / string types due to cache-friendly packed buffers.
- Gives us per-page
null_count (currently thrown away on load) for free, which could enable better Matches::All vs Matches::Some classification.
Caveats
- NaN ordering:
OrderableScalarValue currently uses total_cmp for floats; make_comparator uses IEEE semantics (NaN != NaN). If we need total_cmp, route Float32/Float64 through a small manual path; everything else can go through make_comparator.
- Refactor is bounded but non-trivial — replacing
BTreeLookup's search/iteration paths and any callers depending on its public surface.
Acceptance
BTreeIndex no longer holds both the lookup batch and a parallel BTreeLookup structure.
- All existing
scalar::btree tests pass; add a benchmark covering numeric and string lookup-heavy workloads.
Background
BTreeIndexreads itspage_lookup.lancefile into aRecordBatchwith columnsmin | max | null_count | page_idx, then walks the batch once intry_from_serializedto build a separateBTreeLookup:BTreeMap<OrderableScalarValue, Vec<PageRecord{max, page_number}>>null_pages: Vec<u32>,all_null_pages: Vec<u32>After #6793 the lookup batch is also retained on
BTreeIndexso the index can serialize itself into a cache (BTreeIndexState) without re-reading from disk. This means min/max values are now stored twice — once as Arrow buffers in the batch, once as ownedScalarValues insideBTreeLookup. For wide value types or indices with many pages this is real memory.Proposal
Rewrite
BTreeLookupto query the batch directly:min(training sorts the data and emits pages in order).pages_between(lo, hi)becomes a binary search on themincolumn to find the starting row, then a linear scan filtering bymax. Same big-O as today, better constants (packed arrow buffers vs scatteredBTreeMapnodes).Vec<u32>of all-null page indices (rows wheremax IS NULL) and a non-null row range for the binary-search bounds; the rest is derived from the batch.Type dispatch:
arrow_ord::ord::make_comparatorThis is exactly the primitive we need. It handles every Arrow type out of the box: all primitives via
downcast_primitive!, plusBoolean,Utf8/LargeUtf8/Utf8View,Binary/LargeBinary/BinaryView,FixedSizeBinary, and nested types (lists, structs, FSL) via a recursive fallback. Null ordering is configurable viaSortOptions.With it, the binary search is ~10 lines:
Callers build the query as a 1-row Arrow array of the same type as the
mincolumn (ScalarValue::to_array_of_size(1)does this directly).The TODO sits on the
lookup_batchfield onBTreeIndexinrust/lance-index/src/scalar/btree.rs.Benefits
put_in_cacheis trivial (clone the batch + the smallall_null_pagesVec).null_count(currently thrown away on load) for free, which could enable betterMatches::AllvsMatches::Someclassification.Caveats
OrderableScalarValuecurrently usestotal_cmpfor floats;make_comparatoruses IEEE semantics (NaN != NaN). If we needtotal_cmp, route Float32/Float64 through a small manual path; everything else can go throughmake_comparator.BTreeLookup's search/iteration paths and any callers depending on its public surface.Acceptance
BTreeIndexno longer holds both the lookup batch and a parallelBTreeLookupstructure.scalar::btreetests pass; add a benchmark covering numeric and string lookup-heavy workloads.