feat(index): serializable cache for Bitmap and LabelList scalar indices #6874

Merged: wjones127 merged 1 commit into lance-format:main from wjones127:cache-codec-bitmap-label-list on May 20, 2026
@wjones127

Contributor

Adds `CacheCodec` impls so Bitmap and LabelList index cache entries survive
through a persistent cache backend, mirroring the BTree work in lance-format#6793.

- `CacheCodecImpl for RowAddrTreeMap` (delegates to existing
  `serialize_into`/`deserialize_from`), so per-value bitmap entries cached
  under `BitmapKey` are codec-backed.
- `BitmapIndexState` captures the value→offset map (Arrow IPC), the null
  bitmap, and the value type. `BitmapIndexPlugin` overrides
  `get_from_cache`/`put_in_cache` to store this sized state.
- `LabelListIndexState` wraps an inner `BitmapIndexState` plus `list_nulls`
  and gets the same plugin-level codec treatment.
- `open_scalar_index` skips the LabelList compatibility check on cache
  hits, so a fully-cached LabelList query no longer pays an extra
  `bitmap_page_lookup.lance` open per call.
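
The codec idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not Lance's API: the `Codec` trait, the `ToyBitmapState` struct, and the byte layout are hypothetical stand-ins (the real `BitmapIndexState` stores the value→offset map as Arrow IPC). It only shows that a state made of a value→offset map plus null rows can be turned into bytes and back, which is what a persistent cache backend needs.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a cache codec: bytes out, bytes in.
trait Codec: Sized {
    fn serialize(&self) -> Vec<u8>;
    fn deserialize(bytes: &[u8]) -> Option<Self>;
}

/// Toy analogue of a bitmap index state: value -> offset map plus null row ids.
#[derive(Debug, Clone, PartialEq)]
struct ToyBitmapState {
    value_to_offset: BTreeMap<i64, u64>,
    null_rows: Vec<u64>,
}

/// Read a little-endian u64 at `*pos`, advancing the cursor on success.
fn read_u64(bytes: &[u8], pos: &mut usize) -> Option<u64> {
    let end = pos.checked_add(8)?;
    let chunk: [u8; 8] = bytes.get(*pos..end)?.try_into().ok()?;
    *pos = end;
    Some(u64::from_le_bytes(chunk))
}

impl Codec for ToyBitmapState {
    fn serialize(&self) -> Vec<u8> {
        let mut out = Vec::new();
        // Length-prefixed list of (value, offset) pairs.
        out.extend_from_slice(&(self.value_to_offset.len() as u64).to_le_bytes());
        for (value, offset) in &self.value_to_offset {
            out.extend_from_slice(&value.to_le_bytes());
            out.extend_from_slice(&offset.to_le_bytes());
        }
        // Length-prefixed list of null row ids.
        out.extend_from_slice(&(self.null_rows.len() as u64).to_le_bytes());
        for row in &self.null_rows {
            out.extend_from_slice(&row.to_le_bytes());
        }
        out
    }

    fn deserialize(bytes: &[u8]) -> Option<Self> {
        let mut pos = 0;
        let n_values = read_u64(bytes, &mut pos)?;
        let mut value_to_offset = BTreeMap::new();
        for _ in 0..n_values {
            let value = read_u64(bytes, &mut pos)? as i64;
            let offset = read_u64(bytes, &mut pos)?;
            value_to_offset.insert(value, offset);
        }
        let n_nulls = read_u64(bytes, &mut pos)?;
        let mut null_rows = Vec::new();
        for _ in 0..n_nulls {
            null_rows.push(read_u64(bytes, &mut pos)?);
        }
        // Truncated or over-long input is rejected rather than guessed at.
        if pos == bytes.len() {
            Some(Self { value_to_offset, null_rows })
        } else {
            None
        }
    }
}
```

A round-trip check on an empty and a populated state mirrors the shape of the unit tests listed below, though the real tests exercise the actual `BitmapIndexState`.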

Tests:
- Unit codec round-trip for `BitmapIndexState` (empty + populated).
- Integration tests `test_{bitmap,label_list}_prewarm_with_serializing_backend_serves_query_with_no_io`
  asserting zero IOPS after prewarm through a serializing cache backend.

Closes lance-format#6744
@github-actions github-actions bot added the `enhancement` label on May 20, 2026
@wjones127 wjones127 marked this pull request as ready for review May 20, 2026 17:00

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@codecov

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 83.97790% with 29 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/bitmap.rs | 85.83% | 5 Missing and 12 partials ⚠️ |
| rust/lance-index/src/scalar/label_list.rs | 78.84% | 3 Missing and 8 partials ⚠️ |
| rust/lance/src/index/scalar.rs | 66.66% | 0 Missing and 1 partial ⚠️ |

@wjones127 wjones127 merged commit 4de5ce6 into lance-format:main May 20, 2026
28 checks passed
@wjones127 wjones127 deleted the cache-codec-bitmap-label-list branch May 20, 2026 19:23
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request May 21, 2026
Mainline added serializable scalar-index caching (lance-format#6793, lance-format#6874) and moved the TRACE_IO_EVENTS / record_index_load instrumentation from the outer call site into `scalar::open_scalar_index`. The relocated trace references a `uuid_str` local that no longer exists after the branch dropped the `&str` form, and the inner `index` binding is shadowed by the loaded plugin index. Capture `index.uuid` (a `Uuid`) before the shadowing and format it via Display.

Also re-add the `UnsizedCacheKey` import in `rust/lance/src/index.rs`; the new `ScalarIndexCacheKey` introduced by this branch implements it, but the import was lost when the auto-merge pruned the outer scalar-cache code that mainline migrated into the plugin layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
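
The shadowing fix described in this commit message can be illustrated with a small sketch. Every name here (`ToyUuid`, `IndexMeta`, `LoadedIndex`, `load`, `open_and_trace`) is hypothetical and stands in for the real types; the point is only the ordering: copy the id out of the metadata binding before a later `let index = ...` shadows it, then format it through `Display`.

```rust
use std::fmt;

/// Hypothetical stand-in for a UUID type that implements `Display`.
#[derive(Clone, Copy)]
struct ToyUuid(u32);

impl fmt::Display for ToyUuid {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{:08x}", self.0)
    }
}

/// Hypothetical index metadata carrying the id.
struct IndexMeta {
    uuid: ToyUuid,
}

/// Hypothetical loaded index; note that it has no `uuid` field.
struct LoadedIndex;

fn load(_meta: &IndexMeta) -> LoadedIndex {
    LoadedIndex
}

fn open_and_trace(index: IndexMeta) -> String {
    // Capture the id while `index` still names the metadata.
    let uuid = index.uuid;
    // From here on `index` names the loaded index, not the metadata.
    let index = load(&index);
    let _ = index;
    // Format through `Display` instead of relying on a pre-built string.
    format!("index_load uuid={}", uuid)
}
```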
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request May 24, 2026
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request May 25, 2026
wjones127 pushed a commit that referenced this pull request Jun 3, 2026
## Problem

Commit 4de5ce6 ("feat(index): serializable cache for Bitmap and
LabelList scalar indices #6874") introduced a performance regression in
`BitmapIndexPlugin::get_from_cache`. Every warm-cache hit against a
bitmap scalar index now pays O(N log N) cost where N is the number of
unique values in the column, instead of O(1).

The regression: the new implementation stored only the serializable
`BitmapIndexState` (an Arrow `RecordBatch`) in the cache and
reconstructed the full `BTreeMap<OrderableScalarValue, usize>` on every
cache hit by calling `parse_lookup_batch`. For a column with 10M unique
values this rebuilds the map on every query — including `IS NULL`, whose
actual bitmap lookup is `(*self.null_map).clone()` and is otherwise
O(1).

`parse_lookup_batch` is expensive because:
1. It calls `ScalarValue::try_from_array` for every row — one heap
allocation per unique value.
2. It inserts into a `BTreeMap` — O(log N) comparisons per insert, O(N
log N) total.
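
As a rough sketch of why this adds up, the snippet below counts the inserts paid when every cache hit rebuilds the map. `rebuild_map` is a hypothetical stand-in for the reconstruction step, not the real `parse_lookup_batch`, and it uses plain `u64` keys instead of `OrderableScalarValue`.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the per-hit reconstruction: builds a
/// value -> offset map from `n` rows and reports how many inserts it did.
fn rebuild_map(n: u64) -> (BTreeMap<u64, usize>, u64) {
    let mut map = BTreeMap::new();
    let mut inserts = 0;
    for value in 0..n {
        // Each insert costs O(log N) comparisons.
        map.insert(value, value as usize);
        inserts += 1;
    }
    (map, inserts)
}

/// Total inserts paid when each of `hits` cache hits rebuilds the map.
fn inserts_when_rebuilding_per_hit(n: u64, hits: u64) -> u64 {
    (0..hits).map(|_| rebuild_map(n).1).sum()
}
```

With `n` unique values the work grows linearly in the number of hits, so even a query that never consults the map, such as `IS NULL`, pays for a full rebuild.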

## Fix

**`BitmapIndex.index_map`**: Changed from
`BTreeMap<OrderableScalarValue, usize>` to
`Arc<BTreeMap<OrderableScalarValue, usize>>`. The map is immutable after
construction, so sharing it behind an `Arc` is safe, and cloning is
O(1).

**`BitmapIndexState`**: Added an `index_map: Arc<BTreeMap<...>>` field
that is **not serialized** — the wire format is unchanged. It is
populated eagerly:
- `from_index` (called by `put_in_cache`): `Arc::clone`s the map from
the live `BitmapIndex` — O(1).
- `deserialize` (disk-backed cache backends): calls `parse_lookup_batch`
once at deserialization time, which is already paying disk I/O cost.

**`into_bitmap_index`**: Now takes `&self` and simply `Arc::clone`s
`self.index_map` — always O(1), no reconstruction.

**`get_from_cache`**: The intermediate `(*state).clone()` is removed
since `into_bitmap_index` no longer consumes `self`.

`LabelListIndex` had the same dual-entry patch applied in a prior
iteration; that is also reverted to the original single-entry approach
(its `BitmapIndexState` path is unchanged by this PR).
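
The shape of this fix can be sketched with toy types. `ToyState` and `ToyIndex` are hypothetical and much simpler than `BitmapIndexState` and `BitmapIndex`; the sketch assumes only what the description above states: the map lives behind an `Arc`, it is not part of the serialized payload, it is built once at deserialization, and a cache hit only clones the `Arc`.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

/// Toy cached state: `rows` plays the role of the serialized payload,
/// `index_map` is derived data that is never written to the wire.
struct ToyState {
    rows: Vec<(u64, usize)>,
    index_map: Arc<BTreeMap<u64, usize>>,
}

/// Toy live index that shares the map with the cached state.
struct ToyIndex {
    index_map: Arc<BTreeMap<u64, usize>>,
}

impl ToyState {
    /// Deserialization path: pay the O(N log N) map build exactly once.
    fn deserialize(rows: Vec<(u64, usize)>) -> Self {
        let map: BTreeMap<u64, usize> = rows.iter().copied().collect();
        Self { rows, index_map: Arc::new(map) }
    }

    /// Only `rows` is emitted, so the wire format does not depend on the map.
    fn serialize(&self) -> Vec<(u64, usize)> {
        self.rows.clone()
    }

    /// Cache-hit path: borrows the state and bumps a reference count, O(1).
    fn to_index(&self) -> ToyIndex {
        ToyIndex { index_map: Arc::clone(&self.index_map) }
    }
}
```

Because `to_index` takes `&self`, the caller no longer needs to clone the whole state before converting it, which corresponds to dropping the intermediate `(*state).clone()` in `get_from_cache`.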

## Test

Added `test_bitmap_cache_fast_path` to `bitmap.rs`:
- Creates a high-cardinality bitmap index (1 000 unique integers + 5
null rows)
- Calls `put_in_cache`, then `get_from_cache`
- Asserts `get_from_cache` returns `Some`
- Runs `IS NULL` and asserts the correct 5 null rows are returned

To measure the end-to-end impact, run the `bitmap / is_null / warm` case
in `python/python/ci_benchmarks/benchmarks/test_count_rows.py` — latency
should be close to `btree / is_null / warm`.

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 4, 2026
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 4, 2026
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 4, 2026
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 5, 2026
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 5, 2026
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 7, 2026
wombatu-kun pushed a commit to wombatu-kun/lance that referenced this pull request Jun 9, 2026
Linked issue: Implement CacheCodec for BITMAP and LABEL_LIST indices