fix(mem_wal): dedupe duplicate primary keys in LSM point lookup #6880
298681d to 71348ac
//! the same memtable. The active memtable stores rows in insert order (larger
//! `_rowaddr` = newer), while flushed memtables are reverse-written so that
//! within a flushed file the smallest `_rowid` is the newest insert (see
//! `memtable/flush.rs:152` and `hnsw/storage.rs:307`). Vector search and
this is no longer used by vector search
Done — rewrote the module doc to describe only the point-lookup usage; the vector-search / LsmGlobalPkDedupExec references are gone (c8c5f17).
pub enum DedupDirection {
    /// Keep the row with the largest freshness value (active memtable: larger
    /// `_rowaddr` = inserted later).
    KeepMaxFreshness,
freshness is not the right word, because we always want to keep the freshest row. The only thing that varies is whether the row address to keep is the max or the min.
Renamed throughout: DedupDirection::KeepMax/MinFreshness → KeepMax/MinRowAddr, and freshness_column → row_addr_column. The enum doc now states the kept row is always the freshest — only the row address (_rowaddr/_rowid) extreme used to find it differs by source (c8c5f17).
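For readers following the rename, here is a minimal sketch of what the direction enum expresses after the change. The variant names come from the reply above; the `prefers` helper is hypothetical and is not taken from the PR. The point it illustrates is that the kept row is always the freshest one, and only the comparison on the row address flips by source.

```rust
/// Which row-address extreme identifies the freshest row in a source.
/// Variant names follow the rename described in the thread; everything
/// else in this block is an illustrative assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DedupDirection {
    /// Active memtable: a larger `_rowaddr` means the row was inserted later.
    KeepMaxRowAddr,
    /// Reverse-written source: the smallest row address is the newest insert.
    KeepMinRowAddr,
}

impl DedupDirection {
    /// Hypothetical helper: returns true when `candidate` should replace
    /// `current` as the row kept for a given primary key.
    pub fn prefers(self, candidate: u64, current: u64) -> bool {
        match self {
            DedupDirection::KeepMaxRowAddr => candidate > current,
            DedupDirection::KeepMinRowAddr => candidate < current,
        }
    }
}
```

Ties return `false` in both directions, so the row seen first is kept when two row addresses are equal.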
/// Hash a row's primary key. Kept in sync with the variants supported by
/// [`super::LsmGlobalPkDedupExec`] and [`super::BloomFilterGuardExec`] so
/// a single PK produces the same hash regardless of which exec consumes it.
fn compute_pk_hash(batch: &RecordBatch, pk_indices: &[usize], row_idx: usize) -> u64 {
should probably refactor to use the same function across all invocations
Extracted resolve_pk_indices + compute_pk_hash into a shared exec::pk module. Both WithinSourceDedupExec and LsmGlobalPkDedupExec now call it; the duplicated copies are removed (c8c5f17).
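The motivation for a shared module can be sketched with the standard library alone. The function names below match the ones mentioned in the reply, but the signatures, the `Value` type, and the use of `DefaultHasher` are assumptions made for illustration; the real code operates on Arrow `RecordBatch` columns and may hash differently.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a column value; the real code reads Arrow arrays.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Value {
    Int(i64),
    Str(String),
}

/// Map primary-key column names to their positions in `schema`.
/// Returns an error naming the first column that is missing.
pub fn resolve_pk_indices(schema: &[&str], pk_columns: &[&str]) -> Result<Vec<usize>, String> {
    pk_columns
        .iter()
        .map(|name| {
            schema
                .iter()
                .position(|col| col == name)
                .ok_or_else(|| format!("primary key column not found: {}", name))
        })
        .collect()
}

/// Hash only the primary-key columns of a row, in `pk_indices` order.
/// Every consumer that calls this one function gets the same hash for
/// the same key, which is the property the shared module is meant to give.
pub fn compute_pk_hash(row: &[Value], pk_indices: &[usize]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for &idx in pk_indices {
        row[idx].hash(&mut hasher);
    }
    hasher.finish()
}
```

Because non-key columns never reach the hasher, two versions of the same row with different payloads or row addresses hash identically.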
71348ac to c8c5f17
A primary key written multiple times into one active memtable used to leak through to the user as distinct rows: FilterExec + LIMIT 1 over an insert-ordered scan returned the oldest match among duplicates.

The active arm now runs a WithinSourceDedupExec(KeepMaxRowAddr) that collapses by PK, keeping the row with the newest row address. Flushed and base arms still rely on LIMIT 1 under the reverse-write / forward-write conventions.

PK resolution and row hashing are shared via a new exec::pk module so WithinSourceDedupExec and LsmGlobalPkDedupExec resolve and hash a key identically.

Split from lance-format#6856.

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
c8c5f17 to 6612429
Split from #6856 — point-lookup portion.
A primary key written multiple times into one active memtable used to leak through to the user as distinct rows: FilterExec + LIMIT 1 over an insert-ordered scan returned the oldest match among duplicates.

The active arm now runs WithinSourceDedupExec(KeepMaxFreshness), which collapses by PK and keeps the freshest row. Flushed and base arms still rely on LIMIT 1 under the reverse-write / forward-write conventions.

Part of splitting #6856 into focused PRs. Co-authored with @jackye1995.
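The behavior described above can be reproduced in a few lines of plain Rust. This is a simplified model, not the PR's implementation: the `Row` struct and both functions are hypothetical, and the real operators stream Arrow batches through DataFusion execution plans rather than slices.

```rust
use std::collections::HashMap;

/// Hypothetical row shape: a primary key, a row address, and a payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Row {
    pub pk: i64,
    pub row_addr: u64,
    pub value: &'static str,
}

/// Models a filter followed by LIMIT 1 over an insert-ordered scan:
/// the first match wins, which is the oldest write for that key.
pub fn filter_limit_one(rows: &[Row], pk: i64) -> Option<&Row> {
    rows.iter().find(|r| r.pk == pk)
}

/// Models within-source dedup in the keep-max direction: collapse rows by
/// primary key and keep the one with the largest row address.
pub fn dedup_keep_max_row_addr(rows: &[Row]) -> Vec<Row> {
    let mut newest: HashMap<i64, &Row> = HashMap::new();
    for row in rows {
        newest
            .entry(row.pk)
            .and_modify(|kept| {
                if row.row_addr > kept.row_addr {
                    *kept = row;
                }
            })
            .or_insert(row);
    }
    // Sort by key so the output order does not depend on hash iteration order.
    let mut out: Vec<Row> = newest.into_values().cloned().collect();
    out.sort_by_key(|r| r.pk);
    out
}
```

With two writes to the same key, `filter_limit_one` on the raw scan returns the first write, while running it on the output of `dedup_keep_max_row_addr` returns the second.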