fix(mem_wal): eliminate fresh-tier stale reads (dedup-on-scan + deletion-vector-on-flush)#6929
Open
hamersaw wants to merge 8 commits into
Open
fix(mem_wal): eliminate fresh-tier stale reads (dedup-on-scan + deletion-vector-on-flush)#6929hamersaw wants to merge 8 commits into
hamersaw wants to merge 8 commits into
Conversation
…hantoms The active-memtable filtered-read arm pushed the predicate into the scan, so a primary key whose newest version fails the predicate had that version removed before any dedup could pick it — leaking an older version that still passes (a stale "phantom" row). Example: id inserted with value=100 then updated to NULL, queried with `value IS NOT NULL`, returned the stale 100. Fix the active tier by deduplicating before the selection: - Add MemTableDedupScanExec, which reverse-iterates the memtable newest-first, collapses to newest-per-PK with a cross-batch seen-set, then applies the predicate. The seen-set is updated from the newest occurrence independent of the predicate, so a newest row that fails the filter still suppresses its older versions. - Wire it into the active arm of plan_scan via MemTableScanner::create_dedup_plan (filter moves out of the raw scan into the fused operator). The active filtered arm forgoes the in-memory BTree skip since dedup must see every version. - Harden compute_pk_hash: cover Int8/16/32/64, UInt8/16/32/64, Boolean, Utf8/LargeUtf8, Binary/LargeBinary; add validate_pk_types to reject unsupported PK types at the scanner boundary instead of silently collapsing them to one hash; add a value-distinguishing fallback. Keep compute_pk_hash_from_scalars in lockstep so the bloom/dedup hashes stay consistent. Tests: 3 exec unit tests (within-batch phantom, cross-batch newest-per-PK with filter, IS NULL mirror), 1 end-to-end LsmScanner regression, plus updated plan-shape/btree-filter/without-base-table assertions for the new active node. Phase 1 of the WAL fresh-tier dedup plan (active filtered tier). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flushed L0 generations could contain multiple versions of a primary key (insert + update before flush). Reads relied entirely on the LSM scanner's dedup to hide the stale copies; the flushed dataset itself was not clean. Build a within-generation deletion vector during flush and attach it to the generation's single fragment, so the flushed dataset exposes newest-per-PK on every Lance scan path (filter/index pushdown, vector, FTS) and on compaction reads — no index rebuild required, since the incrementally-built index is still reused as-is. - compute_dedup_deletions: one hash pass over the reverse-written rows; under reverse-write the newest version of each PK is at the smallest offset, so the first occurrence is kept and every later occurrence is marked deleted. write_data_file returns this bitmap alongside the row count. - finalize_generation: writes the deletion file via write_deletion_file and records it on the fragment in a single manifest rewrite, shared by both the data-only flush and flush_with_indexes paths (folding in the existing index manifest write). A no-op when there are no duplicates and no indexes. - PK columns are derived from the memtable's unenforced primary key. Hashing reuses compute_pk_hash, so the flush DV is consistent with the read path (collisions accepted uniformly). Test: flushing a generation with within-gen duplicate PKs writes a DV, and scanning the flushed dataset returns only the newest version of each PK. Phase 2 of the WAL fresh-tier dedup plan (deletion vector at flush). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ransfer With the deletion vector (phase 2) making flushed generations within-gen clean regardless of physical row order, the reverse-write trick is no longer needed. Write the flush data in insert (forward) order so the data file, the incrementally-built BTree/FTS/HNSW indexes, and the deletion-vector offsets all share one position space (newest = largest offset) with no remap. - write_data_file scans batches forward (scan_batches) instead of reversed. - compute_dedup_deletions keeps the LAST occurrence of each PK (largest offset = newest under forward-write) and deletes earlier ones. - Index transfer drops the `reversed_pos = total_rows - pos - 1` remap: BTree uses the existing forward to_training_batches; FTS to_index_builder_reversed becomes to_index_builder (insert-order doc positions); HNSW flush passes to_lance_hnsw(None). The now-unused total_rows params are removed. - Vector-search source tagging uses InsertOrder polarity for every source (flushed generations are forward-written; the DV yields one row per PK so freshness only matters across generations, resolved by generation). The reverse-only index-transfer variants are removed; the generic reverse BatchStore helpers remain (unused) for the phase-6 cleanup. Tests: existing BTree/FTS/HNSW flush integration tests confirm indexes line up 1:1 with the forward data file; the deletion-vector flush test confirms keep-last dedup; vector-search dedup tests confirm the polarity change. Phase 3 of the WAL fresh-tier dedup plan (forward-write). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cross-generation block-list (drop a row whose PK lives in a newer generation) ran only on the vector path and only when stale-filtering was toggled on. Make it the unconditional, sole cross-generation mechanism so the sorted/global dedup can be removed next. - Filtered read (planner.rs): compute the per-source block-list and wrap each older generation's scan in a PkHashFilterExec before the merge (k=0: no top-k, so the under-fetch warning never fires). The newest (active) arm is never blocked. The sorted DeduplicateExec still runs alongside it for now. - Vector search (vector_search.rs): drop the `filter_stale = overfetch_factor >= 1.0` gate — the block-list is always built and applied. `overfetch_factor` now only tunes the over-fetch multiple and is clamped to >= 1.0 so a blocked source still yields k live candidates. Base refinement keys on whether any newer generation exists (`!block_lists.is_empty()`). Tests: a new filtered-read test asserts PkHashFilterExec is wired in and results stay newest-per-PK; the vector toggle test now asserts a sub-1.0 overfetch_factor no longer disables filtering (the stale row stays suppressed). Phase 4 of the WAL fresh-tier dedup plan (unconditional block-list). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s removable The active memtable's HNSW is append-only: an updated PK keeps a live node per version, so a KNN can return several rows for one PK. Until now the cross-source LsmGlobalPkDedupExec was the only thing collapsing them. Make the active arm independently newest-per-PK so the global dedup can be removed next. - Over-fetch the active arm (fetch_k = ceil(k * overfetch_factor)), like a blocked source, so the within-source dedup still leaves k live candidates. - Wrap the active KNN in WithinSourceDedupExec(KeepMaxRowAddr) keyed on `_rowid` (the BatchStore offset; larger = newer) to keep the newest insert per PK. - Re-sort the arm by `_distance` (capped at k) via the new sort_by_distance helper, because the dedup emits rows unordered and the cross-source distance merge needs sorted inputs. This stays probabilistic: a fresh version evicted from the over-fetched top-k can still leak (exact correctness needs in-graph eviction, deferred). Flushed and base arms are unchanged (deletion vector + block-list already make them clean). Test: the within-active dedup test now also asserts WithinSourceDedupExec is present in the plan, so the arm no longer leans on the cross-source dedup. Phase 5 of the WAL fresh-tier dedup plan (active vector self-containment). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every source is now independently newest-per-PK (active fused dedup / flushed deletion vector) and the unconditional cross-generation block-list drops any PK superseded by a newer generation, so each PK survives in exactly one source. The cross-source sorted dedup is therefore dead weight on the filtered-read path. Replace `SortExec(pk, _rowaddr DESC)` per source + `SortPreservingMergeExec(pk)` + `DeduplicateExec` with a plain `UnionExec` + `CoalescePartitionsExec` (the union yields one partition per arm; downstream reads only partition 0) and a final `project_to_canonical` (which also performs the internal-`_rowaddr` strip the dedup used to do). `_memtable_gen` / `_rowaddr` are still produced only behind the projection flags. Removed the now-dead `build_local_sort_exprs` / `build_merge_sort_exprs`. This gives up the filtered-read path's last collision-free (exact-PK-value) dedup in favor of the 64-bit-hash block-list — an accepted tradeoff. Tests: all filtered-read plan-shape assertions rewritten to the collapsed plan; result tests confirm newest-per-PK is preserved. A coalesce omission first surfaced as only the active arm's rows reaching the output, and the missing canonical projection as a leaked `_rowaddr` column — both fixed and covered. Phase 6 (part 1) of the WAL fresh-tier dedup plan: filtered-read teardown. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tor plan After phase 5 the active vector arm self-dedups (over-fetch + within-source dedup) and flushed/base arms are clean (deletion vector + block-list), so each PK reaches the union from exactly one source. The cross-source LsmGlobalPkDedupExec and the LsmSourceTagExec `_memtable_gen`/`_freshness` tagging are therefore dead weight. Collapse the vector merge from `union -> CoalescePartitions -> LsmGlobalPkDedupExec -> project -> SortExec -> SortPreservingMerge` to `union -> SortExec(_distance, preserve) -> SortPreservingMerge`. Each arm now projects straight to the canonical output schema (no internal `_memtable_gen`/`_freshness`); the base arm keeps its real `_rowid` for the post-rerank take, non-base arms NULL it as before. Removed the now-unused `canonical_internal_schema`. Tests: the strip-internal-columns test now asserts the dedup/tag nodes are gone and the per-arm dedup + distance merge are present; all stale-read / cross-gen / within-active dedup vector tests still pass. Phase 6 (part 2) of the WAL fresh-tier dedup plan: vector-path teardown. The now-unused exec structs (LsmGlobalPkDedupExec, LsmSourceTagExec, DeduplicateExec) are deleted in the follow-up cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nodes The fresh-tier read paths no longer reference these nodes (filtered read uses a plain union + block-list; vector uses per-arm dedup + distance merge), so delete them: - LsmGlobalPkDedupExec (global_pk_dedup.rs) — cross-source PK dedup. - LsmSourceTagExec + FreshnessPolarity + FRESHNESS_COLUMN (source_tag.rs) — the `_memtable_gen` / `_freshness` tagging that fed it. - DeduplicateExec + pk_equals (deduplicate.rs) — the filtered-read sorted dedup; its `ROW_ADDRESS_COLUMN` constant moves to pk.rs. Update the exec module doc and the one vector test that referenced the removed `FRESHNESS_COLUMN` constant (now the `_freshness` literal). Remaining dead code left as a trivial follow-up (pub, unused, no warnings): the `KeepMinRowAddr` DedupDirection variant (removing it would leave a single-variant enum) and the reverse-write BatchStore helpers (`scan_batches_reversed` / `to_vec_reversed` / `iter_reversed` chain). Phase 6 (part 3) of the WAL fresh-tier dedup plan: dead-node removal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The mem_wal LSM fresh tier could serve stale rows whenever a primary key was updated across a query's selection boundary — e.g.
valuegoes non-null → null underWHERE value IS NOT NULL, or a vector moves so the newest version drops out of the KNN top-k. The root cause is structural: every read path applied the per-source selection (scalar filter / KNN top-k) before any dedup, so when a PK's newest version failed the selection, an older version that still passed leaked through.Fix
Make every fresh-tier source present a newest-per-PK view so the selection only ever sees one row per PK, then resolve cross-generation duplicates with a single block-list — which makes the cross-source dedup machinery unnecessary. Delivered in 8 focused commits:
compute_pk_hash(realistic type coverage +validate_pk_typesat the scanner boundary, instead of silently collapsing unsupported PK types to one hash).reversed_posremap), now that the DV makes physical row order irrelevant.WithinSourceDedupExec+ distance re-sort, so the active HNSW arm is independently newest-per-PK.Union + block-list, the vector plan toUnion + distance merge, and delete the now-deadLsmGlobalPkDedupExec/LsmSourceTagExec/DeduplicateExec.Tradeoffs (intentional)
compute_pk_hash. The filtered-read path gives up its old collision-free exact-value dedup; a 64-bit collision over-blocks (a missing row, never a wrong value) and is astronomically unlikely at memtable scale.Testing
Each phase ships with tests. New coverage includes the within-gen phantom (exec + end-to-end), the flush deletion vector, forward-write index alignment, the filtered-read block-list, and active-vector self-dedup.
cargo test -p lance --lib dataset::mem_wal→ 326 passed, 0 failed (1 ignored: S3).cargo fmt --allandcargo clippy --all --tests --benches -- -D warningsare clean.Follow-ups
KeepMinRowAddrDedupDirectionvariant and the reverse-writeBatchStorehelpers.🤖 Generated with Claude Code