Skip to content

fix: set _row_created_at_version to new version for MERGE INTO INSERT rows#6774

Merged
wjones127 merged 2 commits into
lance-format:mainfrom
jerryjch:merge-into-insert-rows
May 19, 2026
Merged

fix: set _row_created_at_version to new version for MERGE INTO INSERT rows#6774
wjones127 merged 2 commits into
lance-format:mainfrom
jerryjch:merge-into-insert-rows

Conversation

@jerryjch

@jerryjch jerryjch commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fixes Newly inserted rows in a fragment-rewrite commit (e.g. MERGE INTO) get _row_created_at_version from a prior version instead of the commit version #6735.
  • transaction.rsresolve_update_version_metadata: in the per-row created_at
    mapping, check row_id_to_source.contains_key(&rid) before calling
    resolve_created_at_version. Rows not in the map are INSERT branch rows (no source
    in existing fragments); they now receive new_version as created_at instead of the
    previous fallback of UNKNOWN_CREATED_AT_VERSION (1).
  • resolve_created_at_version doc comment updated to clarify it is only called for
    UPDATE branch rows (source confirmed present). The unmapped-row-ID branch inside the
    function is unused when called from resolve_update_version_metadata; UNKNOWN (1)
    still applies for UPDATE rows whose source fragment has missing or bad
    created_at_version_meta (cache miss, decode failure, or out-of-range offset).
  • Two existing tests updated to assert the corrected behavior; one new test added.

Background

MERGE INTO commits through Operation::Update and produces both UPDATE branch rows
(rewritten into new_fragments with a source row in the previous manifest, stable row
ID present in row_id_to_source) and INSERT branch rows (new rows also in
new_fragments, stable row ID assigned fresh, not present in any existing fragment).

Before this change, resolve_update_version_metadata built the per-row
created_at_versions vector by calling resolve_created_at_version for every row ID.
For UPDATE branch rows that function correctly copies created_at from the source
fragment. For INSERT branch rows the map lookup fails and the function returns
UNKNOWN_CREATED_AT_VERSION = 1, producing a wrong historical version for every newly
inserted row. CDF consumers cannot distinguish merge-inserted rows from updated rows via
_row_created_at_version, and the value 1 is meaningless for rows that first appeared
in a recent commit.

The fix is a single guard at the call site: only call resolve_created_at_version for
rows confirmed to have a source (UPDATE branch); for all other rows use new_version
directly.

Implementation notes

  • The guard uses row_id_to_source.contains_key(&rid), which is an O(1) hash lookup
    on the same map already built for the UPDATE branch path — no additional data
    structures or iteration.
  • No lance-spark changes are needed. The Spark commit path (SparkPositionDeltaWrite)
    already attaches RowIdMeta to new fragment rows. This change activates the correct
    behavior automatically for all callers of Operation::Update, including lance-spark
    MERGE INTO.
  • lane-spark test update test: update MergeIntoTest lance-spark#530

Test plan

  • test_update_version_tracking_insert_branch_gets_new_version (renamed from
    test_update_version_tracking_unknown_row_id_defaults_to_1): new fragment with one
    UPDATE branch row (ID 10, source created_at = 5) and one INSERT branch row (ID 999);
    asserts created_at = [5, 5] — UPDATE branch copies from source, INSERT branch gets
    new_version (5).
  • test_update_version_tracking_merge_into_distinguishes_insert_and_update_branch (new):
    new fragment interleaves UPDATE branch rows (IDs 10, 11, source created_at = 3) and
    INSERT branch rows (IDs 500, 501); asserts created_at = [3, 5, 3, 5] to verify
    per-row correctness across both branches in the same fragment.
  • test_update_version_tracking_no_row_id_meta_fallback: assertion updated from
    [1, 1, 1] to [5, 5, 5] — a fragment with no row_id_meta gets fresh stable IDs
    assigned by assign_row_ids; those IDs have no source and are INSERT branch rows,
    so created_at equals new_version.
  • test_update_version_tracking_source_fragment_no_created_at_defaults_to_1 (unchanged):
    confirms that UPDATE branch rows whose source fragment has no created_at_version_meta
    still fall back to UNKNOWN (1) — the remaining reachable path through
    resolve_created_at_version.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 13, 2026
@jerryjch jerryjch force-pushed the merge-into-insert-rows branch from ec7d5a6 to f9fe827 Compare May 13, 2026 23:33
@jerryjch

Copy link
Copy Markdown
Contributor Author

cc @hamersaw

@codecov

codecov Bot commented May 14, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@hamersaw hamersaw left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix, looks great!

@wjones127 wjones127 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working this. These changes make sense 👍

@wjones127 wjones127 merged commit 29a8f92 into lance-format:main May 19, 2026
30 checks passed
hamersaw pushed a commit to lance-format/lance-spark that referenced this pull request May 26, 2026
## Depends on

* Lance PR **lance-format/lance#6774**
* Requires a lance release that includes #6774.


## Summary

* `BaseMergeIntoTest#testMergeIntoTracksVersionColumnsPerBranch`: flips
the INSERT branch
"known gap" assertions and updates the UPDATE branch assertion to match
the corrected
  behavior from lance#6774.


## What changed in the test

INSERT branch (ids 5, 6) — assertion flipped from known-gap pin to
correct behavior:

Before: `assertTrue(createdAt <= initialInsertVersion)` (pinned the bug)
After: `assertEquals(mergeCommitLastUpdated, createdAt)` — both
`_row_created_at_version`
and `_row_last_updated_at_version` must equal the merge commit version.

UPDATE branch (id 1) — assertion updated to reflect actual behavior:

Before: `assertTrue(createdAt <= beforeLastUpdated.get(1))` — expected
created_at not to
          jump forward (was passing only because UNKNOWN=1 ≤ 2).
After: `assertEquals(lastUpdated, createdAt)` — created_at equals the
merge commit
version (same as last_updated), because SparkPositionDeltaWrite assigns
new stable
row IDs to rewritten rows rather than preserving the originals; Lance
therefore
          treats them as new rows with no prior source.

Untouched rows (ids 3, 4) and the row-count assertion are unchanged.


## Background

`testMergeIntoTracksVersionColumnsPerBranch` was added in commit
`8f72de8` to pin down
current CDF version-column behavior across all MERGE INTO branches, with
explicit
"known gap" comments and assertions for the INSERT branch pointing at
lance#6735 and
lance#6774.

Before lance#6774, INSERT branch rows (NOT MATCHED) went through the
same
`resolve_update_version_metadata` path as UPDATE branch rows but had no
source in existing
fragments, causing `_row_created_at_version` to fall back to
`UNKNOWN_CREATED_AT_VERSION`
(1). lance#6774 fixes this by detecting rows with no source and setting
`created_at` to
the new commit version instead.

The UPDATE branch assertion change is a side effect of the same fix:
because
SparkPositionDeltaWrite assigns new stable row IDs to rewritten rows in
MERGE INTO (rather
than preserving the originals), UPDATE branch rows also have no
traceable source in
`row_id_to_source`. They therefore receive `created_at = new_version` as
well. The old
assertion `createdAt <= beforeLastUpdated.get(1)` was only accidentally
passing because
`UNKNOWN(1) ≤ 2`.


## Test plan

* `MergeIntoTest#testMergeIntoTracksVersionColumnsPerBranch`

Co-authored-by: Jing chen He <jingh@adobe.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Newly inserted rows in a fragment-rewrite commit (e.g. MERGE INTO) get _row_created_at_version from a prior version instead of the commit version

3 participants