fix: set _row_created_at_version to new version for MERGE INTO INSERT rows #6774
Merged
ec7d5a6 to f9fe827
cc @hamersaw
Codecov Report: ✅ All modified and coverable lines are covered by tests.
hamersaw
approved these changes
May 16, 2026
hamersaw
left a comment
Contributor
Thanks for the fix, looks great!
wjones127
approved these changes
May 19, 2026
wjones127
left a comment
Contributor
Thanks for working this. These changes make sense 👍
hamersaw pushed a commit to lance-format/lance-spark that referenced this pull request on May 26, 2026
## Depends on

* Lance PR **lance-format/lance#6774**
* Requires a lance release that includes #6774.

## Summary

* `BaseMergeIntoTest#testMergeIntoTracksVersionColumnsPerBranch`: flips the INSERT branch "known gap" assertions and updates the UPDATE branch assertion to match the corrected behavior from lance#6774.

## What changed in the test

INSERT branch (ids 5, 6): assertion flipped from known-gap pin to correct behavior.

* Before: `assertTrue(createdAt <= initialInsertVersion)` (pinned the bug)
* After: `assertEquals(mergeCommitLastUpdated, createdAt)`. Both `_row_created_at_version` and `_row_last_updated_at_version` must equal the merge commit version.

UPDATE branch (id 1): assertion updated to reflect actual behavior.

* Before: `assertTrue(createdAt <= beforeLastUpdated.get(1))`. This expected created_at not to jump forward (it was passing only because UNKNOWN=1 ≤ 2).
* After: `assertEquals(lastUpdated, createdAt)`. created_at equals the merge commit version (same as last_updated), because SparkPositionDeltaWrite assigns new stable row IDs to rewritten rows rather than preserving the originals; Lance therefore treats them as new rows with no prior source.

Untouched rows (ids 3, 4) and the row-count assertion are unchanged.

## Background

`testMergeIntoTracksVersionColumnsPerBranch` was added in commit `8f72de8` to pin down current CDF version-column behavior across all MERGE INTO branches, with explicit "known gap" comments and assertions for the INSERT branch pointing at lance#6735 and lance#6774.

Before lance#6774, INSERT branch rows (NOT MATCHED) went through the same `resolve_update_version_metadata` path as UPDATE branch rows but had no source in existing fragments, causing `_row_created_at_version` to fall back to `UNKNOWN_CREATED_AT_VERSION` (1). lance#6774 fixes this by detecting rows with no source and setting `created_at` to the new commit version instead.

The UPDATE branch assertion change is a side effect of the same fix: because SparkPositionDeltaWrite assigns new stable row IDs to rewritten rows in MERGE INTO (rather than preserving the originals), UPDATE branch rows also have no traceable source in `row_id_to_source`. They therefore receive `created_at = new_version` as well. The old assertion `createdAt <= beforeLastUpdated.get(1)` was only accidentally passing because `UNKNOWN(1) ≤ 2`.

## Test plan

* `MergeIntoTest#testMergeIntoTracksVersionColumnsPerBranch`

Co-authored-by: Jing chen He <jingh@adobe.com>
Summary

* `transaction.rs`, `resolve_update_version_metadata`: in the per-row `created_at` mapping, check `row_id_to_source.contains_key(&rid)` before calling `resolve_created_at_version`. Rows not in the map are INSERT branch rows (no source in existing fragments); they now receive `new_version` as `created_at` instead of the previous fallback of `UNKNOWN_CREATED_AT_VERSION` (1).
* `resolve_created_at_version` doc comment updated to clarify it is only called for UPDATE branch rows (source confirmed present). The unmapped-row-ID branch inside the function is unused when called from `resolve_update_version_metadata`; `UNKNOWN` (1) still applies for UPDATE rows whose source fragment has missing or bad `created_at_version_meta` (cache miss, decode failure, or out-of-range offset).

Background
MERGE INTO commits through `Operation::Update` and produces both UPDATE branch rows (rewritten into `new_fragments` with a source row in the previous manifest, stable row ID present in `row_id_to_source`) and INSERT branch rows (new rows also in `new_fragments`, stable row ID assigned fresh, not present in any existing fragment).

Before this change, `resolve_update_version_metadata` built the per-row `created_at_versions` vector by calling `resolve_created_at_version` for every row ID. For UPDATE branch rows that function correctly copies `created_at` from the source fragment. For INSERT branch rows the map lookup fails and the function returns `UNKNOWN_CREATED_AT_VERSION = 1`, producing a wrong historical version for every newly inserted row. CDF consumers cannot distinguish merge-inserted rows from updated rows via `_row_created_at_version`, and the value 1 is meaningless for rows that first appeared in a recent commit.

The fix is a single guard at the call site: only call `resolve_created_at_version` for rows confirmed to have a source (UPDATE branch); for all other rows use `new_version` directly.
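The guard described above can be sketched in isolation. This is a minimal illustration and not the code in `transaction.rs`: the function name `created_at_versions` and the simplified map type (stable row ID mapped straight to the source row's created-at version) are assumptions made so the example is self-contained.

```rust
use std::collections::HashMap;

/// Builds the per-row created-at vector for the rows of a new fragment.
///
/// Assumption for illustration: `row_id_to_source` maps a stable row ID
/// directly to the created-at version of its source row. The real map
/// points at source row locations instead.
fn created_at_versions(
    row_ids: &[u64],
    row_id_to_source: &HashMap<u64, u64>,
    new_version: u64,
) -> Vec<u64> {
    row_ids
        .iter()
        .map(|rid| {
            if row_id_to_source.contains_key(rid) {
                // UPDATE branch: the row has a source, so copy its created-at.
                row_id_to_source[rid]
            } else {
                // INSERT branch: no source exists, so the row was created
                // by this commit.
                new_version
            }
        })
        .collect()
}
```

With sources recorded for IDs 10 and 11 at version 3 and a commit at version 5, calling `created_at_versions(&[10, 500, 11, 501], &map, 5)` yields `[3, 5, 3, 5]`, which mirrors the interleaved case in the test plan.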
Implementation notes
* The guard is `row_id_to_source.contains_key(&rid)`, which is an O(1) hash lookup on the same map already built for the UPDATE branch path, so it needs no additional data structures or iteration.
* … already attaches `RowIdMeta` to new fragment rows. This change activates the correct behavior automatically for all callers of `Operation::Update`, including lance-spark MERGE INTO.
Test plan
* `test_update_version_tracking_insert_branch_gets_new_version` (renamed from `test_update_version_tracking_unknown_row_id_defaults_to_1`): new fragment with one UPDATE branch row (ID 10, source `created_at = 5`) and one INSERT branch row (ID 999); asserts `created_at = [5, 5]`, since the UPDATE branch copies from source and the INSERT branch gets `new_version` (5).
* `test_update_version_tracking_merge_into_distinguishes_insert_and_update_branch` (new): new fragment interleaves UPDATE branch rows (IDs 10, 11, source `created_at = 3`) and INSERT branch rows (IDs 500, 501); asserts `created_at = [3, 5, 3, 5]` to verify per-row correctness across both branches in the same fragment.
* `test_update_version_tracking_no_row_id_meta_fallback`: assertion updated from `[1, 1, 1]` to `[5, 5, 5]`. A fragment with no `row_id_meta` gets fresh stable IDs assigned by `assign_row_ids`; those IDs have no source and are INSERT branch rows, so `created_at` equals `new_version`.
* `test_update_version_tracking_source_fragment_no_created_at_defaults_to_1` (unchanged): confirms that UPDATE branch rows whose source fragment has no `created_at_version_meta` still fall back to `UNKNOWN` (1), the remaining reachable path through `resolve_created_at_version`.
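The remaining fallback path covered by the last test can be sketched as follows. This is a simplified model, not Lance's implementation: the `SourceFragment` struct, the `SourceLocation` tuple, and the function signature are hypothetical, and a missing or undecodable `created_at_version_meta` is represented here as `None`.

```rust
/// Fallback for rows that have a source whose created-at metadata is unusable.
const UNKNOWN_CREATED_AT_VERSION: u64 = 1;

/// Hypothetical source fragment: per-row created-at versions, or `None` when
/// the metadata is missing or could not be decoded.
struct SourceFragment {
    created_at_versions: Option<Vec<u64>>,
}

/// Hypothetical location of a source row: (fragment index, row offset).
type SourceLocation = (usize, usize);

/// Resolves created-at for an UPDATE branch row whose source location is known.
/// Any gap in the metadata (absent sequence, out-of-range offset) falls back
/// to `UNKNOWN_CREATED_AT_VERSION`.
fn resolve_created_at_version(fragments: &[SourceFragment], location: SourceLocation) -> u64 {
    let (frag_idx, offset) = location;
    fragments
        .get(frag_idx)
        .and_then(|frag| frag.created_at_versions.as_ref())
        .and_then(|versions| versions.get(offset))
        .copied()
        .unwrap_or(UNKNOWN_CREATED_AT_VERSION)
}
```

In this model the fallback is reached only for rows that do have a source, which matches the split described in the Summary: rows without a source never reach this function and take `new_version` at the call site instead.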