Skip to content

Query the BTree lookup batch directly instead of building a separate BTreeLookup #6802

Description

@wjones127

Background

BTreeIndex reads its page_lookup.lance file into a RecordBatch with columns min | max | null_count | page_idx, then walks the batch once in try_from_serialized to build a separate BTreeLookup:

  • BTreeMap<OrderableScalarValue, Vec<PageRecord{max, page_number}>>
  • null_pages: Vec<u32>, all_null_pages: Vec<u32>

After #6793 the lookup batch is also retained on BTreeIndex so the index can serialize itself into a cache (BTreeIndexState) without re-reading from disk. This means min/max values are now stored twice — once as Arrow buffers in the batch, once as owned ScalarValues inside BTreeLookup. For wide value types or indices with many pages this is real memory.

Proposal

Rewrite BTreeLookup to query the batch directly:

  • Sorted iteration: the batch rows are already ordered by ascending min (training sorts the data and emits pages in order).
  • pages_between(lo, hi) becomes a binary search on the min column to find the starting row, then a linear scan filtering by max. Same big-O as today, better constants (packed arrow buffers vs scattered BTreeMap nodes).
  • Store a small precomputed Vec<u32> of all-null page indices (rows where max IS NULL) and a non-null row range for the binary-search bounds; the rest is derived from the batch.

Type dispatch: arrow_ord::ord::make_comparator

pub fn make_comparator(left: &dyn Array, right: &dyn Array, opts: SortOptions)
    -> Result<DynComparator, ArrowError>;
pub type DynComparator = Box<dyn Fn(usize, usize) -> Ordering + Send + Sync>;

This is exactly the primitive we need. It handles every Arrow type out of the box: all primitives via downcast_primitive!, plus Boolean, Utf8/LargeUtf8/Utf8View, Binary/LargeBinary/BinaryView, FixedSizeBinary, and nested types (lists, structs, FSL) via a recursive fallback. Null ordering is configurable via SortOptions.

With it, the binary search is ~10 lines:

fn partition_point(mins: &dyn Array, query: &dyn Array /* 1-row */) -> Result<usize> {
    let cmp = make_comparator(mins, query, SortOptions::default())?;
    let mut lo = 0;
    let mut hi = mins.len();
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if cmp(mid, 0).is_lt() { lo = mid + 1 } else { hi = mid }
    }
    Ok(lo)
}

Callers build the query as a 1-row Arrow array of the same type as the min column (ScalarValue::to_array_of_size(1) does this directly).

The TODO sits on the lookup_batch field on BTreeIndex in rust/lance-index/src/scalar/btree.rs.

Benefits

  • Eliminates the min/max duplication; the batch becomes the single source of truth.
  • put_in_cache is trivial (clone the batch + the small all_null_pages Vec).
  • Likely faster for numeric / string types due to cache-friendly packed buffers.
  • Gives us per-page null_count (currently thrown away on load) for free, which could enable better Matches::All vs Matches::Some classification.

Caveats

  • NaN ordering: OrderableScalarValue currently uses total_cmp for floats; make_comparator uses IEEE semantics (NaN != NaN). If we need total_cmp, route Float32/Float64 through a small manual path; everything else can go through make_comparator.
  • Refactor is bounded but non-trivial — replacing BTreeLookup's search/iteration paths and any callers depending on its public surface.

Acceptance

  • BTreeIndex no longer holds both the lookup batch and a parallel BTreeLookup structure.
  • All existing scalar::btree tests pass; add a benchmark covering numeric and string lookup-heavy workloads.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions