Skip to content

fix(mem_wal): dedupe duplicate primary keys in LSM point lookup#6880

Merged
jackye1995 merged 1 commit into
lance-format:mainfrom
touch-of-grey:PointLookupFix
May 21, 2026
Merged

fix(mem_wal): dedupe duplicate primary keys in LSM point lookup#6880
jackye1995 merged 1 commit into
lance-format:mainfrom
touch-of-grey:PointLookupFix

Conversation

@touch-of-grey

Copy link
Copy Markdown
Contributor

Split from #6856 — point-lookup portion.

A primary key written multiple times into one active memtable used to leak through to the user as distinct rows: FilterExec + LIMIT 1 over an insert-ordered scan returned the oldest match among duplicates.

The active arm now runs WithinSourceDedupExec(KeepMaxFreshness), which collapses by PK and keeps the freshest row. Flushed and base arms still rely on LIMIT 1 under the reverse-write / forward-write conventions.

Part of splitting #6856 into focused PRs. Co-authored with @jackye1995.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@codecov

codecov Bot commented May 21, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.54569% with 86 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ataset/mem_wal/scanner/exec/within_source_dedup.rs 70.83% 75 Missing and 9 partials ⚠️
.../lance/src/dataset/mem_wal/scanner/point_lookup.rs 97.89% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

//! the same memtable. The active memtable stores rows in insert order (larger
//! `_rowaddr` = newer), while flushed memtables are reverse-written so that
//! within a flushed file the smallest `_rowid` is the newest insert (see
//! `memtable/flush.rs:152` and `hnsw/storage.rs:307`). Vector search and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is no longer used by vector search

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — rewrote the module doc to describe only the point-lookup usage; the vector-search / LsmGlobalPkDedupExec references are gone (c8c5f17).

pub enum DedupDirection {
/// Keep the row with the largest freshness value (active memtable: larger
/// `_rowaddr` = inserted later).
KeepMaxFreshness,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

freshness is not the right word, because we always want to keep the max freshness. it's just the row address to keep is max or min.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed throughout: DedupDirection::KeepMax/MinFreshnessKeepMax/MinRowAddr, and freshness_columnrow_addr_column. The enum doc now states the kept row is always the freshest — only the row address (_rowaddr/_rowid) extreme used to find it differs by source (c8c5f17).

/// Hash a row's primary key. Kept in sync with the variants supported by
/// [`super::LsmGlobalPkDedupExec`] and [`super::BloomFilterGuardExec`] so
/// a single PK produces the same hash regardless of which exec consumes it.
fn compute_pk_hash(batch: &RecordBatch, pk_indices: &[usize], row_idx: usize) -> u64 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should probably refactor to use the same function across all invocations

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted resolve_pk_indices + compute_pk_hash into a shared exec::pk module. Both WithinSourceDedupExec and LsmGlobalPkDedupExec now call it; the duplicated copies are removed (c8c5f17).

A primary key written multiple times into one active memtable used to
leak through to the user as distinct rows: FilterExec + LIMIT 1 over an
insert-ordered scan returned the oldest match among duplicates.

The active arm now runs a WithinSourceDedupExec(KeepMaxRowAddr) that
collapses by PK, keeping the row with the newest row address. Flushed and
base arms still rely on LIMIT 1 under the reverse-write / forward-write
conventions.

PK resolution and row hashing are shared via a new exec::pk module so
WithinSourceDedupExec and LsmGlobalPkDedupExec resolve and hash a key
identically.

Split from lance-format#6856 (lance-format#6856).

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
@jackye1995 jackye1995 merged commit bd58ad0 into lance-format:main May 21, 2026
27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants