Skip to content

fix(mem_wal): eliminate fresh-tier stale reads (dedup-on-scan + deletion-vector-on-flush)#6929

Open
hamersaw wants to merge 8 commits into
lance-format:mainfrom
hamersaw:bug/wal-dedup-final
Open

fix(mem_wal): eliminate fresh-tier stale reads (dedup-on-scan + deletion-vector-on-flush)#6929
hamersaw wants to merge 8 commits into
lance-format:mainfrom
hamersaw:bug/wal-dedup-final

Conversation

@hamersaw
Copy link
Copy Markdown
Contributor

Problem

The mem_wal LSM fresh tier could serve stale rows whenever a primary key was updated across a query's selection boundary — e.g. value goes non-null → null under WHERE value IS NOT NULL, or a vector moves so the newest version drops out of the KNN top-k. The root cause is structural: every read path applied the per-source selection (scalar filter / KNN top-k) before any dedup, so when a PK's newest version failed the selection, an older version that still passed leaked through.

Fix

Make every fresh-tier source present a newest-per-PK view so the selection only ever sees one row per PK, then resolve cross-generation duplicates with a single block-list — which makes the cross-source dedup machinery unnecessary. Delivered in 8 focused commits:

  1. Active filtered-read fused dedup-before-predicate — reverse-scan the active memtable, collapse to newest-per-PK with a seen-set, then apply the predicate (the seen-set is updated independent of the predicate, so a newest row that fails the filter still suppresses its older versions). Also hardens compute_pk_hash (realistic type coverage + validate_pk_types at the scanner boundary, instead of silently collapsing unsupported PK types to one hash).
  2. Deletion vector at flush — build a within-generation DV during flush so each flushed generation exposes newest-per-PK on every Lance scan path (filter/index pushdown, vector, FTS) with no index rebuild.
  3. Forward-write flushed generations with a 1:1 index transfer (drops the reversed_pos remap), now that the DV makes physical row order irrelevant.
  4. Unconditional cross-generation block-list on every read path (added to the filtered-read path; un-gated on the vector path).
  5. Active vector self-containment — over-fetch + WithinSourceDedupExec + distance re-sort, so the active HNSW arm is independently newest-per-PK.
  6. Teardown — collapse the filtered-read plan to Union + block-list, the vector plan to Union + distance merge, and delete the now-dead LsmGlobalPkDedupExec / LsmSourceTagExec / DeduplicateExec.

Tradeoffs (intentional)

  • Uniform 64-bit hash keying. The fresh tier (block-list, active dedup, flush DV) keys PKs on compute_pk_hash. The filtered-read path gives up its old collision-free exact-value dedup; a 64-bit collision over-blocks (a missing row, never a wrong value) and is astronomically unlikely at memtable scale.
  • Active vector search is probabilistically correct. The append-only HNSW keeps a live node per PK version; over-fetch + within-source dedup collapse them, but a fresh version evicted from the over-fetched top-k can still leak. Exact correctness needs in-graph within-generation eviction, deliberately deferred.

Testing

Each phase ships with tests. New coverage includes the within-gen phantom (exec + end-to-end), the flush deletion vector, forward-write index alignment, the filtered-read block-list, and active-vector self-dedup. cargo test -p lance --lib dataset::mem_wal326 passed, 0 failed (1 ignored: S3). cargo fmt --all and cargo clippy --all --tests --benches -- -D warnings are clean.

Follow-ups

  • In-graph HNSW within-generation eviction for exact active-vector correctness.
  • Trivial dead-code: the unused KeepMinRowAddr DedupDirection variant and the reverse-write BatchStore helpers.

🤖 Generated with Claude Code

hamersaw and others added 8 commits May 24, 2026 04:32
…hantoms

The active-memtable filtered-read arm pushed the predicate into the scan, so a
primary key whose newest version fails the predicate had that version removed
before any dedup could pick it — leaking an older version that still passes (a
stale "phantom" row). Example: id inserted with value=100 then updated to NULL,
queried with `value IS NOT NULL`, returned the stale 100.

Fix the active tier by deduplicating before the selection:

- Add MemTableDedupScanExec, which reverse-iterates the memtable newest-first,
  collapses to newest-per-PK with a cross-batch seen-set, then applies the
  predicate. The seen-set is updated from the newest occurrence independent of
  the predicate, so a newest row that fails the filter still suppresses its
  older versions.
- Wire it into the active arm of plan_scan via MemTableScanner::create_dedup_plan
  (filter moves out of the raw scan into the fused operator). The active filtered
  arm forgoes the in-memory BTree skip since dedup must see every version.
- Harden compute_pk_hash: cover Int8/16/32/64, UInt8/16/32/64, Boolean,
  Utf8/LargeUtf8, Binary/LargeBinary; add validate_pk_types to reject unsupported
  PK types at the scanner boundary instead of silently collapsing them to one
  hash; add a value-distinguishing fallback. Keep compute_pk_hash_from_scalars
  in lockstep so the bloom/dedup hashes stay consistent.

Tests: 3 exec unit tests (within-batch phantom, cross-batch newest-per-PK with
filter, IS NULL mirror), 1 end-to-end LsmScanner regression, plus updated
plan-shape/btree-filter/without-base-table assertions for the new active node.

Phase 1 of the WAL fresh-tier dedup plan (active filtered tier).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flushed L0 generations could contain multiple versions of a primary key
(insert + update before flush). Reads relied entirely on the LSM scanner's
dedup to hide the stale copies; the flushed dataset itself was not clean.

Build a within-generation deletion vector during flush and attach it to the
generation's single fragment, so the flushed dataset exposes newest-per-PK on
every Lance scan path (filter/index pushdown, vector, FTS) and on compaction
reads — no index rebuild required, since the incrementally-built index is still
reused as-is.

- compute_dedup_deletions: one hash pass over the reverse-written rows; under
  reverse-write the newest version of each PK is at the smallest offset, so the
  first occurrence is kept and every later occurrence is marked deleted.
  write_data_file returns this bitmap alongside the row count.
- finalize_generation: writes the deletion file via write_deletion_file and
  records it on the fragment in a single manifest rewrite, shared by both the
  data-only flush and flush_with_indexes paths (folding in the existing index
  manifest write). A no-op when there are no duplicates and no indexes.
- PK columns are derived from the memtable's unenforced primary key. Hashing
  reuses compute_pk_hash, so the flush DV is consistent with the read path
  (collisions accepted uniformly).

Test: flushing a generation with within-gen duplicate PKs writes a DV, and
scanning the flushed dataset returns only the newest version of each PK.

Phase 2 of the WAL fresh-tier dedup plan (deletion vector at flush).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ransfer

With the deletion vector (phase 2) making flushed generations within-gen clean
regardless of physical row order, the reverse-write trick is no longer needed.
Write the flush data in insert (forward) order so the data file, the
incrementally-built BTree/FTS/HNSW indexes, and the deletion-vector offsets all
share one position space (newest = largest offset) with no remap.

- write_data_file scans batches forward (scan_batches) instead of reversed.
- compute_dedup_deletions keeps the LAST occurrence of each PK (largest offset =
  newest under forward-write) and deletes earlier ones.
- Index transfer drops the `reversed_pos = total_rows - pos - 1` remap: BTree
  uses the existing forward to_training_batches; FTS to_index_builder_reversed
  becomes to_index_builder (insert-order doc positions); HNSW flush passes
  to_lance_hnsw(None). The now-unused total_rows params are removed.
- Vector-search source tagging uses InsertOrder polarity for every source
  (flushed generations are forward-written; the DV yields one row per PK so
  freshness only matters across generations, resolved by generation).

The reverse-only index-transfer variants are removed; the generic reverse
BatchStore helpers remain (unused) for the phase-6 cleanup.

Tests: existing BTree/FTS/HNSW flush integration tests confirm indexes line up
1:1 with the forward data file; the deletion-vector flush test confirms
keep-last dedup; vector-search dedup tests confirm the polarity change.

Phase 3 of the WAL fresh-tier dedup plan (forward-write).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-generation block-list (drop a row whose PK lives in a newer
generation) ran only on the vector path and only when stale-filtering was
toggled on. Make it the unconditional, sole cross-generation mechanism so the
sorted/global dedup can be removed next.

- Filtered read (planner.rs): compute the per-source block-list and wrap each
  older generation's scan in a PkHashFilterExec before the merge (k=0: no top-k,
  so the under-fetch warning never fires). The newest (active) arm is never
  blocked. The sorted DeduplicateExec still runs alongside it for now.
- Vector search (vector_search.rs): drop the `filter_stale = overfetch_factor >=
  1.0` gate — the block-list is always built and applied. `overfetch_factor` now
  only tunes the over-fetch multiple and is clamped to >= 1.0 so a blocked source
  still yields k live candidates. Base refinement keys on whether any newer
  generation exists (`!block_lists.is_empty()`).

Tests: a new filtered-read test asserts PkHashFilterExec is wired in and results
stay newest-per-PK; the vector toggle test now asserts a sub-1.0 overfetch_factor
no longer disables filtering (the stale row stays suppressed).

Phase 4 of the WAL fresh-tier dedup plan (unconditional block-list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s removable

The active memtable's HNSW is append-only: an updated PK keeps a live node per
version, so a KNN can return several rows for one PK. Until now the cross-source
LsmGlobalPkDedupExec was the only thing collapsing them. Make the active arm
independently newest-per-PK so the global dedup can be removed next.

- Over-fetch the active arm (fetch_k = ceil(k * overfetch_factor)), like a
  blocked source, so the within-source dedup still leaves k live candidates.
- Wrap the active KNN in WithinSourceDedupExec(KeepMaxRowAddr) keyed on `_rowid`
  (the BatchStore offset; larger = newer) to keep the newest insert per PK.
- Re-sort the arm by `_distance` (capped at k) via the new sort_by_distance
  helper, because the dedup emits rows unordered and the cross-source distance
  merge needs sorted inputs.

This stays probabilistic: a fresh version evicted from the over-fetched top-k
can still leak (exact correctness needs in-graph eviction, deferred). Flushed and
base arms are unchanged (deletion vector + block-list already make them clean).

Test: the within-active dedup test now also asserts WithinSourceDedupExec is
present in the plan, so the arm no longer leans on the cross-source dedup.

Phase 5 of the WAL fresh-tier dedup plan (active vector self-containment).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every source is now independently newest-per-PK (active fused dedup / flushed
deletion vector) and the unconditional cross-generation block-list drops any PK
superseded by a newer generation, so each PK survives in exactly one source. The
cross-source sorted dedup is therefore dead weight on the filtered-read path.

Replace `SortExec(pk, _rowaddr DESC)` per source + `SortPreservingMergeExec(pk)`
+ `DeduplicateExec` with a plain `UnionExec` + `CoalescePartitionsExec` (the
union yields one partition per arm; downstream reads only partition 0) and a
final `project_to_canonical` (which also performs the internal-`_rowaddr` strip
the dedup used to do). `_memtable_gen` / `_rowaddr` are still produced only
behind the projection flags. Removed the now-dead `build_local_sort_exprs` /
`build_merge_sort_exprs`.

This gives up the filtered-read path's last collision-free (exact-PK-value)
dedup in favor of the 64-bit-hash block-list — an accepted tradeoff.

Tests: all filtered-read plan-shape assertions rewritten to the collapsed plan;
result tests confirm newest-per-PK is preserved. A coalesce omission first
surfaced as only the active arm's rows reaching the output, and the missing
canonical projection as a leaked `_rowaddr` column — both fixed and covered.

Phase 6 (part 1) of the WAL fresh-tier dedup plan: filtered-read teardown.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tor plan

After phase 5 the active vector arm self-dedups (over-fetch + within-source
dedup) and flushed/base arms are clean (deletion vector + block-list), so each
PK reaches the union from exactly one source. The cross-source
LsmGlobalPkDedupExec and the LsmSourceTagExec `_memtable_gen`/`_freshness`
tagging are therefore dead weight.

Collapse the vector merge from
`union -> CoalescePartitions -> LsmGlobalPkDedupExec -> project -> SortExec ->
SortPreservingMerge` to `union -> SortExec(_distance, preserve) ->
SortPreservingMerge`. Each arm now projects straight to the canonical output
schema (no internal `_memtable_gen`/`_freshness`); the base arm keeps its real
`_rowid` for the post-rerank take, non-base arms NULL it as before. Removed the
now-unused `canonical_internal_schema`.

Tests: the strip-internal-columns test now asserts the dedup/tag nodes are gone
and the per-arm dedup + distance merge are present; all stale-read / cross-gen /
within-active dedup vector tests still pass.

Phase 6 (part 2) of the WAL fresh-tier dedup plan: vector-path teardown. The
now-unused exec structs (LsmGlobalPkDedupExec, LsmSourceTagExec, DeduplicateExec)
are deleted in the follow-up cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nodes

The fresh-tier read paths no longer reference these nodes (filtered read uses a
plain union + block-list; vector uses per-arm dedup + distance merge), so delete
them:

- LsmGlobalPkDedupExec (global_pk_dedup.rs) — cross-source PK dedup.
- LsmSourceTagExec + FreshnessPolarity + FRESHNESS_COLUMN (source_tag.rs) — the
  `_memtable_gen` / `_freshness` tagging that fed it.
- DeduplicateExec + pk_equals (deduplicate.rs) — the filtered-read sorted dedup;
  its `ROW_ADDRESS_COLUMN` constant moves to pk.rs.

Update the exec module doc and the one vector test that referenced the removed
`FRESHNESS_COLUMN` constant (now the `_freshness` literal).

Remaining dead code left as a trivial follow-up (pub, unused, no warnings): the
`KeepMinRowAddr` DedupDirection variant (removing it would leave a single-variant
enum) and the reverse-write BatchStore helpers (`scan_batches_reversed` /
`to_vec_reversed` / `iter_reversed` chain).

Phase 6 (part 3) of the WAL fresh-tier dedup plan: dead-node removal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant