Core: Don't bump DV snapshot ID for deleted_positions and replaced_positions#16823

Closed
gaborkaszab wants to merge 1 commit into apache:main from gaborkaszab:main_position_not_set_dv_snapshot

Conversation

@gaborkaszab

Contributor

Since deleted_positions and replaced_positions in Tracking are relevant only for the current snapshot, it is redundant to also bump the DV snapshot ID when setting these fields.

@gaborkaszab

Contributor Author

This came up during the V4 AMT sync today. Seems reasonable not to bump dv snapshot ID when setting deleted and replaced positions, because they are fields that are relevant for the current snapshot only and we don't carry them forward anyway.

cc @stevenzwu @amogh-jahagirdar @anoopj @rdblue

@github-actions github-actions Bot added the core label Jun 15, 2026
@anoopj

anoopj commented Jun 17, 2026

Member

Sorry, I missed the AMT sync this week.

Seems reasonable not to bump dv snapshot ID when setting deleted and replaced positions, because they are fields that are relevant for the current snapshot only and we don't carry them forward anyway.

dv_snapshot_id and deleted/replaced_positions have different lifetimes. The latter gets dropped in the next snapshot, while dv_snapshot_id tells us which snapshot changed the manifest DV (this is useful information). So it doesn't look redundant?

Setting it is also consistent with how dv_snapshot_id works on data DVs. The corresponding REPLACED entry with the old DV entry is transient, but dv_snapshot_id on the MODIFIED entry sticks.

cc @stevenzwu who originally suggested changing dv_snapshot_id for manifest DV changes (not sure Steven was at the sync)

@stevenzwu

Contributor

yeah. I was at the Monday sync when Amogh brought up this question. Amogh's point was that deletedPositions and replacedPositions should only be set for the snapshot when the change was made. In the next snapshot, they should be reset to null if there is no new change. With that strict reset behavior, dvSnapshotId doesn't necessarily need to advance, as change detection won't rely on the check of dvSnapshotId vs currentSnapshotId.
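
The presence-based check described above can be sketched as follows. This is only an illustration under stated assumptions: `TrackingSketch`, its fields, and `carryForwardUnchanged` are hypothetical stand-ins, not the Iceberg `Tracking` class. The one behavior taken from the thread is that both diff bitmaps are reset to null in any snapshot that makes no new change.

```java
/** Hypothetical sketch of presence-based change detection; not the Iceberg API. */
final class DiffPresenceSketch {

  /** Stand-in for the tracking fields discussed in this thread. */
  static final class TrackingSketch {
    final Long dvSnapshotId; // nullable; never consulted by the check below
    final byte[] deletedPositions; // nullable; serialized diff bitmap for this snapshot
    final byte[] replacedPositions; // nullable; serialized diff bitmap for this snapshot

    TrackingSketch(Long dvSnapshotId, byte[] deletedPositions, byte[] replacedPositions) {
      this.dvSnapshotId = dvSnapshotId;
      this.deletedPositions = deletedPositions;
      this.replacedPositions = replacedPositions;
    }

    /** Strict reset: carrying the entry into a snapshot with no new change drops both diffs. */
    TrackingSketch carryForwardUnchanged() {
      return new TrackingSketch(dvSnapshotId, null, null);
    }
  }

  /** A change is detected purely from the presence of either diff bitmap. */
  static boolean changedInCurrentSnapshot(TrackingSketch tracking) {
    return tracking.deletedPositions != null || tracking.replacedPositions != null;
  }

  private DiffPresenceSketch() {}
}
```

Under this scheme the check never compares dvSnapshotId against the current snapshot ID, which is why advancing it would carry no extra information for change detection.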

@anoopj

anoopj commented Jun 17, 2026

Member

@stevenzwu Does this mean that dv_snapshot_id will effectively always be null for manifest entries? (i.e. after this PR gets merged)

@gaborkaszab

Contributor Author

@stevenzwu Does this mean that dv_snapshot_id will effectively always be null for manifest entries? (i.e. after this PR gets merged)

Practically yes. This is in line with the wording in the spec PR: "Snapshot ID where the deletion vector was added. Inherited when null. Must be null when deletion_vector is null."
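
As a rough illustration of the quoted spec wording, the rule could be checked along these lines. This is a sketch, not Iceberg code: `effectiveDvSnapshotId` is a hypothetical helper, and `inheritedSnapshotId` is an assumption standing in for whichever snapshot ID the spec says to inherit, since the quote does not say where it comes from.

```java
/** Hypothetical sketch of the quoted dv_snapshot_id rule; not the Iceberg API. */
final class DvSnapshotIdRuleSketch {

  /**
   * Resolves the effective DV snapshot ID.
   *
   * @param deletionVector serialized DV, or null when the entry has no DV
   * @param dvSnapshotId explicitly stored ID, or null
   * @param inheritedSnapshotId assumption: the ID to inherit when none is stored
   * @return the effective ID, or null when there is no deletion vector
   */
  static Long effectiveDvSnapshotId(
      byte[] deletionVector, Long dvSnapshotId, long inheritedSnapshotId) {
    if (deletionVector == null) {
      // "Must be null when deletion_vector is null."
      if (dvSnapshotId != null) {
        throw new IllegalStateException(
            "dv_snapshot_id must be null when deletion_vector is null");
      }
      return null;
    }
    // "Inherited when null."
    return dvSnapshotId != null ? dvSnapshotId : Long.valueOf(inheritedSnapshotId);
  }

  private DvSnapshotIdRuleSketch() {}
}
```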

@amogh-jahagirdar amogh-jahagirdar left a comment

Contributor

Yeah the way I've reasoned about this is we have the following options:

1.) We could say that implementations must update dv_snapshot_id if replaced or deleted positions (the "diff" DVs) are set. But if it's a diff representing the changes in that snapshot and nulled out when there are no changes, then what's the point of setting the dv_snapshot_id? The diff itself represents the changes in that snapshot. I guess one argument is consistency with how data DVs work, but it feels like an extra spec rule that doesn't help anything on the read or change detection side, and makes the write side more complex.

Let me know if I'm missing a case where the dv_snapshot_id information for leaf manifest DVs helps.

2.) We could not be strict in the spec about this and allow implementations to set it. Then it's one of those weird "optional" areas in the spec that I think we should avoid, and it probably causes more confusion as more implementations appear ("do we set it or not" is a debate that'd be nice to avoid a year from now).

3.) We could just strictly disallow setting dvSnapshotId for leaf manifests. I feel like this is the best option. It's less burden on implementations, less to get wrong on the change detection path. For leaf manifests, we currently track the total DV + the diff via replaced and deleted positions, so figuring out changes is just inherent in what's persisted in metadata at least in the current spec design.

Let me know if I'm missing something @anoop @stevenzwu @rdblue

Tracking addedSource = manifestSourceTracking();
Tracking modified =
    TrackingBuilder.from(addedSource, 999L).deletedPositions(deletedBytes).build();

Contributor

Nit: Can we remove the newline? That way we can avoid conflicts.

Contributor Author

sure, done


  assertThat(modified.status()).isEqualTo(EntryStatus.MODIFIED);
- // the entry snapshot id is preserved; only the DV snapshot id advances to the commit snapshot
+ // the entry snapshot id is preserved; dv snapshot id is not relevant for manifest entries

Contributor

DV snapshot ID must not be set?

Contributor

+1. DV snapshot ID must not be set for leaf manifest entries.

I would also split this comment line and write the 2nd half of the comment to be just above line 463.

Contributor Author

Thanks for the suggestions! done



…sitions

Since deleted_positions and replaced_positions in Tracking are relevant
only for the current snapshot, it is redundant to also bump the
DV snapshot ID when setting these fields.
@gaborkaszab gaborkaszab force-pushed the main_position_not_set_dv_snapshot branch from 5574cf6 to 5fede4e Compare June 18, 2026 10:22

@gaborkaszab gaborkaszab left a comment

Contributor Author

Thanks for taking a look! Addressed the comments



@rdblue

rdblue commented Jun 24, 2026

Contributor

I agree that setting dv_snapshot_id is duplicative for these updates. We can still detect changes by checking if either deleted_positions or replaced_positions is set. I don't think this by itself is a strong justification.

There is still an argument for setting dv_snapshot_id: it would explicitly state the snapshot in which the deleted_positions and replaced_positions changes occurred. That has value because we could decide to carry forward what changed at that time and not have a hard requirement on dropping the bitmaps immediately, which are likely to be small. That also means that the format is insulated from implementations that accidentally keep the bitmaps around, although that is unlikely. In this case, the spec would state that the bitmaps reflect the changes in the snapshot identified by dv_snapshot_id for manifest entries.

The argument against using dv_snapshot_id is that it is simpler: do not set this unless the (data) DV changed. But is this more clear? Another way of looking at this is that dv_snapshot_id is the snapshot when the tracked file's DV changed. If that tracked file is a manifest and the DV is in ManifestInfo, why is that different?

After thinking through it, I'm not convinced that this is a good idea. I think I'd prefer keeping the update to dv_snapshot_id and making it required in the spec.

@gaborkaszab

Contributor Author

closing this because we agreed on bumping dv_snapshot_id for replaced/deleted positions too

@github-project-automation github-project-automation Bot moved this from In review to Done in V4: metadata tree Jun 26, 2026

@amogh-jahagirdar amogh-jahagirdar left a comment

Contributor

I went back and forth on this, but I think I agree with @rdblue. It is indeed clearer to keep setting dv_snapshot_id for leaf manifests, just as we do for leaf data file DV changes. I also think it's not that much more difficult to have it set, especially since the alternative is a requirement to drop the diff DV when there are no changes; setting dv_snapshot_id is an equal requirement (or, in spec terms, the DVs represent the changes in the snapshot dv_snapshot_id).

I also thought about this statement

In this case, the spec would state that the bitmaps reflect the changes in the snapshot identified by dv_snapshot_id for manifest entries and not have a hard requirement on dropping the bitmaps immediately, which are likely to be small.

So in this case, we wouldn't be using the diff DV presence to determine changes. Instead, we would check whether dv_snapshot_id is set to the current snapshot; if it is, we know one of replaced_positions or deleted_positions has changed. And to be fair, this is similar in principle to the check we'd do for regular data DVs.
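
For illustration, the snapshot-ID-based check described above might look like the sketch below. The names are hypothetical (`TrackingSketch` and `changedIn` are not Iceberg APIs), and the sketch assumes dv_snapshot_id is stored explicitly on the entry rather than inherited.

```java
/** Hypothetical sketch of dv_snapshot_id-based change detection; not the Iceberg API. */
final class DvSnapshotIdDetectionSketch {

  /** Stand-in for the tracking fields discussed in this thread. */
  static final class TrackingSketch {
    final Long dvSnapshotId; // nullable; snapshot in which the diff bitmaps were written
    final byte[] deletedPositions; // nullable; may be carried forward past that snapshot
    final byte[] replacedPositions; // nullable; may be carried forward past that snapshot

    TrackingSketch(Long dvSnapshotId, byte[] deletedPositions, byte[] replacedPositions) {
      this.dvSnapshotId = dvSnapshotId;
      this.deletedPositions = deletedPositions;
      this.replacedPositions = replacedPositions;
    }
  }

  /** The bitmaps describe a change in the current snapshot only if the IDs match. */
  static boolean changedIn(TrackingSketch tracking, long currentSnapshotId) {
    return tracking.dvSnapshotId != null
        && tracking.dvSnapshotId.longValue() == currentSnapshotId;
  }

  private DvSnapshotIdDetectionSketch() {}
}
```

With this check, bitmaps left over from an earlier snapshot are not mistaken for a new change, so there is no hard requirement to drop them immediately.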
