fix(mem_wal): eliminate fresh-tier stale reads (dedup-on-scan + deletion-vector-on-flush) by hamersaw · Pull Request #6929 · lance-format/lance

hamersaw · 2026-05-24T14:14:59Z

Problem

The mem_wal LSM fresh tier could serve stale rows whenever a primary key was updated across a query's selection boundary — e.g. value goes non-null → null under WHERE value IS NOT NULL, or a vector moves so the newest version drops out of the KNN top-k. The root cause is structural: every read path applied the per-source selection (scalar filter / KNN top-k) before any dedup, so when a PK's newest version failed the selection, an older version that still passed leaked through.

Fix

Make every fresh-tier source present a newest-per-PK view so the selection only ever sees one row per PK, then resolve cross-generation duplicates with a single block-list — which makes the cross-source dedup machinery unnecessary. Delivered in 8 focused commits:

Active filtered-read fused dedup-before-predicate — reverse-scan the active memtable, collapse to newest-per-PK with a seen-set, then apply the predicate (the seen-set is updated independent of the predicate, so a newest row that fails the filter still suppresses its older versions). Also hardens compute_pk_hash (realistic type coverage + validate_pk_types at the scanner boundary, instead of silently collapsing unsupported PK types to one hash).
Deletion vector at flush — build a within-generation DV during flush so each flushed generation exposes newest-per-PK on every Lance scan path (filter/index pushdown, vector, FTS) with no index rebuild.
Forward-write flushed generations with a 1:1 index transfer (drops the reversed_pos remap), now that the DV makes physical row order irrelevant.
Unconditional cross-generation block-list on every read path (added to the filtered-read path; un-gated on the vector path).
Active vector self-containment — over-fetch + WithinSourceDedupExec + distance re-sort, so the active HNSW arm is independently newest-per-PK.
Teardown — collapse the filtered-read plan to Union + block-list, the vector plan to Union + distance merge, and delete the now-dead LsmGlobalPkDedupExec / LsmSourceTagExec / DeduplicateExec.

Tradeoffs (intentional)

Uniform 64-bit hash keying. The fresh tier (block-list, active dedup, flush DV) keys PKs on compute_pk_hash. The filtered-read path gives up its old collision-free exact-value dedup; a 64-bit collision over-blocks (a missing row, never a wrong value) and is astronomically unlikely at memtable scale.
Active vector search is probabilistically correct. The append-only HNSW keeps a live node per PK version; over-fetch + within-source dedup collapse them, but a fresh version evicted from the over-fetched top-k can still leak. Exact correctness needs in-graph within-generation eviction, deliberately deferred.

Testing

Each phase ships with tests. New coverage includes the within-gen phantom (exec + end-to-end), the flush deletion vector, forward-write index alignment, the filtered-read block-list, and active-vector self-dedup. cargo test -p lance --lib dataset::mem_wal → 326 passed, 0 failed (1 ignored: S3). cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.

Follow-ups

In-graph HNSW within-generation eviction for exact active-vector correctness.
Trivial dead-code: the unused KeepMinRowAddr DedupDirection variant and the reverse-write BatchStore helpers.

🤖 Generated with Claude Code

…hantoms The active-memtable filtered-read arm pushed the predicate into the scan, so a primary key whose newest version fails the predicate had that version removed before any dedup could pick it — leaking an older version that still passes (a stale "phantom" row). Example: id inserted with value=100 then updated to NULL, queried with `value IS NOT NULL`, returned the stale 100. Fix the active tier by deduplicating before the selection: - Add MemTableDedupScanExec, which reverse-iterates the memtable newest-first, collapses to newest-per-PK with a cross-batch seen-set, then applies the predicate. The seen-set is updated from the newest occurrence independent of the predicate, so a newest row that fails the filter still suppresses its older versions. - Wire it into the active arm of plan_scan via MemTableScanner::create_dedup_plan (filter moves out of the raw scan into the fused operator). The active filtered arm forgoes the in-memory BTree skip since dedup must see every version. - Harden compute_pk_hash: cover Int8/16/32/64, UInt8/16/32/64, Boolean, Utf8/LargeUtf8, Binary/LargeBinary; add validate_pk_types to reject unsupported PK types at the scanner boundary instead of silently collapsing them to one hash; add a value-distinguishing fallback. Keep compute_pk_hash_from_scalars in lockstep so the bloom/dedup hashes stay consistent. Tests: 3 exec unit tests (within-batch phantom, cross-batch newest-per-PK with filter, IS NULL mirror), 1 end-to-end LsmScanner regression, plus updated plan-shape/btree-filter/without-base-table assertions for the new active node. Phase 1 of the WAL fresh-tier dedup plan (active filtered tier). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Flushed L0 generations could contain multiple versions of a primary key (insert + update before flush). Reads relied entirely on the LSM scanner's dedup to hide the stale copies; the flushed dataset itself was not clean. Build a within-generation deletion vector during flush and attach it to the generation's single fragment, so the flushed dataset exposes newest-per-PK on every Lance scan path (filter/index pushdown, vector, FTS) and on compaction reads — no index rebuild required, since the incrementally-built index is still reused as-is. - compute_dedup_deletions: one hash pass over the reverse-written rows; under reverse-write the newest version of each PK is at the smallest offset, so the first occurrence is kept and every later occurrence is marked deleted. write_data_file returns this bitmap alongside the row count. - finalize_generation: writes the deletion file via write_deletion_file and records it on the fragment in a single manifest rewrite, shared by both the data-only flush and flush_with_indexes paths (folding in the existing index manifest write). A no-op when there are no duplicates and no indexes. - PK columns are derived from the memtable's unenforced primary key. Hashing reuses compute_pk_hash, so the flush DV is consistent with the read path (collisions accepted uniformly). Test: flushing a generation with within-gen duplicate PKs writes a DV, and scanning the flushed dataset returns only the newest version of each PK. Phase 2 of the WAL fresh-tier dedup plan (deletion vector at flush). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ransfer With the deletion vector (phase 2) making flushed generations within-gen clean regardless of physical row order, the reverse-write trick is no longer needed. Write the flush data in insert (forward) order so the data file, the incrementally-built BTree/FTS/HNSW indexes, and the deletion-vector offsets all share one position space (newest = largest offset) with no remap. - write_data_file scans batches forward (scan_batches) instead of reversed. - compute_dedup_deletions keeps the LAST occurrence of each PK (largest offset = newest under forward-write) and deletes earlier ones. - Index transfer drops the `reversed_pos = total_rows - pos - 1` remap: BTree uses the existing forward to_training_batches; FTS to_index_builder_reversed becomes to_index_builder (insert-order doc positions); HNSW flush passes to_lance_hnsw(None). The now-unused total_rows params are removed. - Vector-search source tagging uses InsertOrder polarity for every source (flushed generations are forward-written; the DV yields one row per PK so freshness only matters across generations, resolved by generation). The reverse-only index-transfer variants are removed; the generic reverse BatchStore helpers remain (unused) for the phase-6 cleanup. Tests: existing BTree/FTS/HNSW flush integration tests confirm indexes line up 1:1 with the forward data file; the deletion-vector flush test confirms keep-last dedup; vector-search dedup tests confirm the polarity change. Phase 3 of the WAL fresh-tier dedup plan (forward-write). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The cross-generation block-list (drop a row whose PK lives in a newer generation) ran only on the vector path and only when stale-filtering was toggled on. Make it the unconditional, sole cross-generation mechanism so the sorted/global dedup can be removed next. - Filtered read (planner.rs): compute the per-source block-list and wrap each older generation's scan in a PkHashFilterExec before the merge (k=0: no top-k, so the under-fetch warning never fires). The newest (active) arm is never blocked. The sorted DeduplicateExec still runs alongside it for now. - Vector search (vector_search.rs): drop the `filter_stale = overfetch_factor >= 1.0` gate — the block-list is always built and applied. `overfetch_factor` now only tunes the over-fetch multiple and is clamped to >= 1.0 so a blocked source still yields k live candidates. Base refinement keys on whether any newer generation exists (`!block_lists.is_empty()`). Tests: a new filtered-read test asserts PkHashFilterExec is wired in and results stay newest-per-PK; the vector toggle test now asserts a sub-1.0 overfetch_factor no longer disables filtering (the stale row stays suppressed). Phase 4 of the WAL fresh-tier dedup plan (unconditional block-list). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s removable The active memtable's HNSW is append-only: an updated PK keeps a live node per version, so a KNN can return several rows for one PK. Until now the cross-source LsmGlobalPkDedupExec was the only thing collapsing them. Make the active arm independently newest-per-PK so the global dedup can be removed next. - Over-fetch the active arm (fetch_k = ceil(k * overfetch_factor)), like a blocked source, so the within-source dedup still leaves k live candidates. - Wrap the active KNN in WithinSourceDedupExec(KeepMaxRowAddr) keyed on `_rowid` (the BatchStore offset; larger = newer) to keep the newest insert per PK. - Re-sort the arm by `_distance` (capped at k) via the new sort_by_distance helper, because the dedup emits rows unordered and the cross-source distance merge needs sorted inputs. This stays probabilistic: a fresh version evicted from the over-fetched top-k can still leak (exact correctness needs in-graph eviction, deferred). Flushed and base arms are unchanged (deletion vector + block-list already make them clean). Test: the within-active dedup test now also asserts WithinSourceDedupExec is present in the plan, so the arm no longer leans on the cross-source dedup. Phase 5 of the WAL fresh-tier dedup plan (active vector self-containment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Every source is now independently newest-per-PK (active fused dedup / flushed deletion vector) and the unconditional cross-generation block-list drops any PK superseded by a newer generation, so each PK survives in exactly one source. The cross-source sorted dedup is therefore dead weight on the filtered-read path. Replace `SortExec(pk, _rowaddr DESC)` per source + `SortPreservingMergeExec(pk)` + `DeduplicateExec` with a plain `UnionExec` + `CoalescePartitionsExec` (the union yields one partition per arm; downstream reads only partition 0) and a final `project_to_canonical` (which also performs the internal-`_rowaddr` strip the dedup used to do). `_memtable_gen` / `_rowaddr` are still produced only behind the projection flags. Removed the now-dead `build_local_sort_exprs` / `build_merge_sort_exprs`. This gives up the filtered-read path's last collision-free (exact-PK-value) dedup in favor of the 64-bit-hash block-list — an accepted tradeoff. Tests: all filtered-read plan-shape assertions rewritten to the collapsed plan; result tests confirm newest-per-PK is preserved. A coalesce omission first surfaced as only the active arm's rows reaching the output, and the missing canonical projection as a leaked `_rowaddr` column — both fixed and covered. Phase 6 (part 1) of the WAL fresh-tier dedup plan: filtered-read teardown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tor plan After phase 5 the active vector arm self-dedups (over-fetch + within-source dedup) and flushed/base arms are clean (deletion vector + block-list), so each PK reaches the union from exactly one source. The cross-source LsmGlobalPkDedupExec and the LsmSourceTagExec `_memtable_gen`/`_freshness` tagging are therefore dead weight. Collapse the vector merge from `union -> CoalescePartitions -> LsmGlobalPkDedupExec -> project -> SortExec -> SortPreservingMerge` to `union -> SortExec(_distance, preserve) -> SortPreservingMerge`. Each arm now projects straight to the canonical output schema (no internal `_memtable_gen`/`_freshness`); the base arm keeps its real `_rowid` for the post-rerank take, non-base arms NULL it as before. Removed the now-unused `canonical_internal_schema`. Tests: the strip-internal-columns test now asserts the dedup/tag nodes are gone and the per-arm dedup + distance merge are present; all stale-read / cross-gen / within-active dedup vector tests still pass. Phase 6 (part 2) of the WAL fresh-tier dedup plan: vector-path teardown. The now-unused exec structs (LsmGlobalPkDedupExec, LsmSourceTagExec, DeduplicateExec) are deleted in the follow-up cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…nodes The fresh-tier read paths no longer reference these nodes (filtered read uses a plain union + block-list; vector uses per-arm dedup + distance merge), so delete them: - LsmGlobalPkDedupExec (global_pk_dedup.rs) — cross-source PK dedup. - LsmSourceTagExec + FreshnessPolarity + FRESHNESS_COLUMN (source_tag.rs) — the `_memtable_gen` / `_freshness` tagging that fed it. - DeduplicateExec + pk_equals (deduplicate.rs) — the filtered-read sorted dedup; its `ROW_ADDRESS_COLUMN` constant moves to pk.rs. Update the exec module doc and the one vector test that referenced the removed `FRESHNESS_COLUMN` constant (now the `_freshness` literal). Remaining dead code left as a trivial follow-up (pub, unused, no warnings): the `KeepMinRowAddr` DedupDirection variant (removing it would leave a single-variant enum) and the reverse-write BatchStore helpers (`scan_batches_reversed` / `to_vec_reversed` / `iter_reversed` chain). Phase 6 (part 3) of the WAL fresh-tier dedup plan: dead-node removal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

codecov · 2026-05-24T14:49:32Z

Codecov Report

❌ Patch coverage is 80.63624% with 140 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ataset/mem_wal/memtable/scanner/exec/dedup_scan.rs	77.85%	52 Missing and 10 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/exec/pk.rs	25.64%	20 Missing and 9 partials ⚠️
...ce/src/dataset/mem_wal/scanner/exec/bloom_guard.rs	0.00%	25 Missing ⚠️
rust/lance/src/dataset/mem_wal/memtable/flush.rs	92.85%	3 Missing and 7 partials ⚠️
...ce/src/dataset/mem_wal/memtable/scanner/builder.rs	75.00%	2 Missing and 6 partials ⚠️
...lance/src/dataset/mem_wal/scanner/vector_search.rs	84.37%	2 Missing and 3 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/planner.rs	99.39%	0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

hamersaw and others added 8 commits May 24, 2026 04:32

claude Bot reviewed May 24, 2026

View reviewed changes

github-actions Bot added the bug Something isn't working label May 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mem_wal): eliminate fresh-tier stale reads (dedup-on-scan + deletion-vector-on-flush)#6929

fix(mem_wal): eliminate fresh-tier stale reads (dedup-on-scan + deletion-vector-on-flush)#6929
hamersaw wants to merge 8 commits into
lance-format:mainfrom
hamersaw:bug/wal-dedup-final

hamersaw commented May 24, 2026

Uh oh!

claude Bot left a comment

Uh oh!

codecov Bot commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hamersaw commented May 24, 2026

Problem

Fix

Tradeoffs (intentional)

Testing

Follow-ups

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

codecov Bot commented May 24, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant