Core: Add EntryStatus.MODIFIED and TrackingBuilder status derivation#16689
Conversation
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for the addition, @anoopj!
I went through this, seems goon in general, just left some nits.
| return EntryStatus.ADDED; | ||
| } | ||
|
|
||
| boolean sameSnapshot = source.snapshotId() != null && source.snapshotId() == newSnapshotId; |
There was a problem hiding this comment.
Reading this I have the impression that this PR contains 2 different functionalities:
- the "sameSnapshot" case for ADDED status
- Introduction of MODIFIED and all the transitions to/from
No strong opinion, but I usually prefer keeping a single purpose for my PRs. LMK WDYT
There was a problem hiding this comment.
Just for my benefit: What is the use-case for the "sameSnapshot" case? I figure that Tracking is created whenever we create a new TrackedFile. Not sure I see, within the same snapshot where we want to change Tracking again (or recreate the TrackedFile with different Tracking). I probably miss, how this integrates into the big picture.
There was a problem hiding this comment.
The added/same snapshot is not a separate scope. This is for the use case we discussed in #16408 where a a writer constructs an ADDED tracking and pipelines a DV attach in the same commit.
There was a problem hiding this comment.
Do you mean this conversation? If yes, it was about whether to return Tracking or TrackingBuilder from added(long). The argument was to return the latter, because we might want to add some DVs on top using the same builder.
However, here it seems slightly different: We first build a Tracking object with ADDED status, then use it as a source to add DVs on top using another builder. For this, we could use Tracking as a return value from added(long)
(I'll think this through tomorrow again)
| REPLACED(3); | ||
| /** | ||
| * Non-live entry recording that a prior file version was superseded by another live entry. Added | ||
| * in v4. |
There was a problem hiding this comment.
We need a term that is better than "non-live". It's probably implied by "superceded by another live entry", although I think we can improve on that still.
What about this?
The starting (replaced) state of an entry that is modified.
There was a problem hiding this comment.
Is old better than starting in this conext?
The old (replaced) state of en entry that is modified. Paired with MODIFIED. Added in v4.
There was a problem hiding this comment.
Done. Went with the old/new pairing Steven proposed above.
| "snapshot_id", | ||
| Types.LongType.get(), | ||
| "Snapshot ID where the file was added or deleted"); | ||
| "Snapshot ID where the file was added, deleted, replaced, or modified"); |
There was a problem hiding this comment.
Have we agreed to modify the snapshot ID for a replaced entry? I thought that we were not going to change replaced entries.
We change the snapshot ID for deleted entries, but not for existing entries so there's precedent both ways. If you're scanning for changes, the snapshot ID is useful for filtering out changes that are left-over from older snapshots. For instance, I may rewrite a manifest and delete a file in it. If I'm later scanning that file for changes, I would be able to check whether the delete entry is for the snapshot ID I'm getting changes for.
The counter-argument is that the manifest would probably only be scanned for changes if you're looking for changes that would match. In order to scan that manifest, you'd first check its snapshot ID (when it was added) and not scan otherwise.
Overall, I think the right thing is to update the snapshot ID as you have here. That way if any implementation reads files it doesn't need to, it has enough information to filter out the entries.
Good to note in the spec @stevenzwu and @amogh-jahagirdar.
There was a problem hiding this comment.
TrackingBuilder.terminal() writes newSnapshotId for both DELETED and REPLACED today, so the doc matches the current code.
To me, deleted and replaced entries should behave the same. replaced is just a special flavor of deleted. Since v1-v3 already modify the snapshot ID for a deleted entry, it seems reasonable to maintain the same behavior.
Also for change detection, it is probably better to update the snapshot id in tracking for deleted or replaced entries. If the tracking snapshot_id doesn't match the current snapshot id, the entries should be ignored for change detection. This is important because the current spec is silent on lifetime of deleted entries.
| "snapshot_id", | ||
| Types.LongType.get(), | ||
| "Snapshot ID where the file was added or deleted"); | ||
| "Snapshot ID where the file was added, deleted, replaced, or modified"); |
There was a problem hiding this comment.
TrackingBuilder.terminal() writes newSnapshotId for both DELETED and REPLACED today, so the doc matches the current code.
To me, deleted and replaced entries should behave the same. replaced is just a special flavor of deleted. Since v1-v3 already modify the snapshot ID for a deleted entry, it seems reasonable to maintain the same behavior.
Also for change detection, it is probably better to update the snapshot id in tracking for deleted or replaced entries. If the tracking snapshot_id doesn't match the current snapshot id, the entries should be ignored for change detection. This is important because the current spec is silent on lifetime of deleted entries.
| REPLACED(3); | ||
| /** | ||
| * Non-live entry recording that a prior file version was superseded by another live entry. Added | ||
| * in v4. |
There was a problem hiding this comment.
Is old better than starting in this conext?
The old (replaced) state of en entry that is modified. Paired with MODIFIED. Added in v4.
| class TrackingBuilder { | ||
| private final EntryStatus status; | ||
| private final Long snapshotId; | ||
| // null for the fresh-added path; non-null for the source-based path. |
There was a problem hiding this comment.
Let's remove this. I'm not sure what it means.
There was a problem hiding this comment.
Done. Removed with the revert of the refactor.
| TrackingBuilder deletedPositions(ByteBuffer positions) { | ||
| Preconditions.checkState( | ||
| status == EntryStatus.EXISTING, "Cannot set deleted positions on %s entry", status); | ||
| Preconditions.checkState(source != null, "Cannot set deleted positions on ADDED entry"); |
There was a problem hiding this comment.
The error message should agree with the check. In this case, I think the check is better if you use status == ADDED.
There was a problem hiding this comment.
Done. TrackingBuilder now holds status and snapshotId as fields instead of source.
| } | ||
|
|
||
| private TrackingBuilder(long newSnapshotId) { | ||
| this.status = EntryStatus.ADDED; |
There was a problem hiding this comment.
I think this should still use status rather than mutated and source. There's no need to keep the source around instead of storing its snapshot ID individually. And we don't want to make inferences like state is added when source is null.
| } | ||
|
|
||
| /** Derives the output status from the source, the snapshot, and any mutations. */ | ||
| private EntryStatus deriveStatus() { |
There was a problem hiding this comment.
I think that this should go back to updating status in the builder config methods rather than this. The impulse to co-locate logic for entry status is good, but I think it is more readable not to do this refactor. When modifications are made, the status can be validated and updated inline. That's simpler and more clear, in my opinion, rather than trying to detect what happened in the configuration phase and produce the correct status.
There was a problem hiding this comment.
Done. Reverted the refactor.
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
stevenzwu
left a comment
There was a problem hiding this comment.
LGTM. Just a minor comment for Javadoc
| /** Indicates an entry that has been replaced by a column update or DV change. Added in v4. */ | ||
| REPLACED(3); | ||
| /** | ||
| * The old (replaced) state of an entry that has been modified. Paired with MODIFIED. Added in v4. |
There was a problem hiding this comment.
I recall on the last V4 AMT sync there was an idea to make the leaf manifests MODIFIED when adding a new DV, without having a corresponding REPLACED entry. Not sure how far we got with that idea, but it seems to go against the comments here, where the comments seem to be relevant for data entries only.
There was a problem hiding this comment.
Manifest entries will not have a paired REPLACED entry to compare against. This statement is still correct, but the one below (that MODIFIED is paired with REPLACED) is misleading.
There was a problem hiding this comment.
The current comment state seems right to me and implicitly covers the case of leaf manifests that @gaborkaszab was talking about. REPLACED is paired with MODIFIED, but MODIFIED is not neccessarily paired with REPLACED (don't need to explicitly say that last part).
| assertThat(tracking.isLive()).isFalse(); | ||
| assertThat(TrackingBuilder.added(42L).build().isLive()).isTrue(); | ||
| assertThat(TrackingBuilder.from(source, 999L).build().isLive()).isTrue(); | ||
| assertThat(TrackingBuilder.from(source, 999L).dvUpdated().build().isLive()).isTrue(); |
There was a problem hiding this comment.
nit: maybe add an assert on the status of the entry too to make it easier for the readers to understand which status is live and which one isn't?
There was a problem hiding this comment.
Removed the test after I moved isLive to enum.
| * The old (replaced) state of an entry that has been modified. Paired with MODIFIED. Added in v4. | ||
| */ | ||
| REPLACED(3), | ||
| /** The new (live) state of an entry that has been modified. Paired with REPLACED. Added in v4. */ |
There was a problem hiding this comment.
Instead of trying to use documentation to handle live status, why not just have a isLive() method on this enum?
| return status() == EntryStatus.ADDED || status() == EntryStatus.EXISTING; | ||
| return status() == EntryStatus.ADDED | ||
| || status() == EntryStatus.EXISTING | ||
| || status() == EntryStatus.MODIFIED; |
There was a problem hiding this comment.
As I said above, we should probably just add isLive() to EntryStatus. Then we can delegate here.
There was a problem hiding this comment.
Added it to EntryStatus and removed the method since it's no longer needed here.
There was a problem hiding this comment.
Maybe for convenience we can still have Tracked.isLive() that simply delegates to EntryStatus as Ryan suggested:
default boolean isLive() {
return status().isLive();
}
There was a problem hiding this comment.
I was not sure if we want another method while we can just call tracking.status().isLive(), so I removed it. Added it back.
|
|
||
| /** | ||
| * Promotes an EXISTING entry to MODIFIED on mutation. Fresh-add builders (status = ADDED) are | ||
| * preserved — covers the same-commit append + DV attach case without a special branch. |
There was a problem hiding this comment.
AI artifacts like — are a red flag for me that indicates the comment was not edited for correctness and clarity. Could you revise this?
There was a problem hiding this comment.
This was added as a review suggestion. I applied it through Github without editing it. We have removed the method, so we don't need it anymore.
| * preserved — covers the same-commit append + DV attach case without a special branch. | ||
| */ | ||
| private void promoteToModified() { | ||
| if (status == EntryStatus.EXISTING) { |
There was a problem hiding this comment.
I don't understand this check. If the status is ADDED, would we not update to MODIFIED? I know that ADDED will never get here, but that's half the point: this assumes status checks elsewhere. Why split status handling in multiple places? This also implies that it's perfectly fine to call this on a DELETED entry, which I also do not expect.
The code here is not worth the confusion over why this logic is here. This is two lines (and snapshotId should not be modified).
There was a problem hiding this comment.
if the status is ADDED (the same-commit append + DV attach case), we want to keep the status as ADDED.
Flink streaming writer can hit this scenario of adding a new data file with DV attached.
There was a problem hiding this comment.
if the status is ADDED (the same-commit append + DV attach case), we want to keep the status as ADDED.
Yes, my point is really that this is dividing the logic of how status evolves across multiple places and the only benefit is not copying one line of code? I don't think this method is a good idea.
There was a problem hiding this comment.
Got your point now. You are suggesting just removing this method and directly code the status change in the mutation method like dvUpdated, deletedPositions, and replacedPositions. That makes sense.
There was a problem hiding this comment.
Thanks for the feedback/discussion! Incorporated.
| private void promoteToModified() { | ||
| if (status == EntryStatus.EXISTING) { | ||
| this.status = EntryStatus.MODIFIED; | ||
| this.snapshotId = newSnapshotId; |
There was a problem hiding this comment.
Why is this setting the snapshot ID of a MODIFIED entry? The modification should set a specific snapshot ID, like the dvSnapshotId. We don't set the entry's snapshot ID when it is changed because we still want to know when the base file itself was added.
There was a problem hiding this comment.
good catch.
We also discussed offline that the deletePositions and replacedPositions should also advance the dvSnapshotId to the newSnapshotId. both are only used for leaf manifest entries in the root manifest file.
We are unlikely to compute and set them for data file entries in the root or leaf manifest files. Since data DVs are stored in external Puffin files, it is too expensive to compute and store the delta deletedPositions.
In the future, when column files metadata are added, we will have new transition/mutation APIs and new latestColumnFileSnapshotId.
There was a problem hiding this comment.
Thank you. Incorporated. I removed the method and added tests to validate dvSnapshotId
cd0e558 to
446f24e
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Thank you @anoopj !
LGTM, just left some minor comments.
|
Thanks for the quick turnaround, @anoopj ! |
| Preconditions.checkState( | ||
| status == EntryStatus.EXISTING, "Cannot set deleted positions on %s entry", status); | ||
| // DV applies to data files; deleted positions apply to manifest files | ||
| Preconditions.checkState( |
There was a problem hiding this comment.
Why was this state check removed? Is it no longer valid?
There was a problem hiding this comment.
the dvSnapshotId == null check is removed.
the other check is updated from status == EntryStatus.EXISTING to status != EntryStatus.ADDED. The change seems correct to me.
There was a problem hiding this comment.
I don't think we can checkdvSnapshotId == null anymore, because both deletedPositions() and replacedPositions() set it. We wouldn't be able to set both of them through the same builder.
When building a TrackedFile (I'm putting together a PR for that) we can enforce that deleted and replaced positions are only valid for manifests.
Anyway, setting dvSnapshotId for these is a new behavior, not entirely related to introducing the MODIFIED status. Was this discussed somewhere that I missed? I checked both the spec PR and the proposal doc, and they say that dvSnapshotId is when the deletion_vector is added, they don't mention deletedPositions o replacedPositions.
Update: I see the comment for this, so it's intentional. Would be nice to document it, cc @amogh-jahagirdar
There was a problem hiding this comment.
My question: These deleted and replaced positions aren't carried over when creating a new Tracking, and by definition are relevant only for the current snapshot (Bitmap of positions deleted in this snapshot). Is there any benefit setting dvSnapshotId for them?
Additionaly, we carry over dvSnapshotId when creating from source, but we don't the positions.
There was a problem hiding this comment.
I don't think we can check
dvSnapshotId == nullanymore, because bothdeletedPositions()andreplacedPositions()set it.
Thanks, I think this is the answer I was looking for.
Capture rdblue's instruction from PR #16689: Javadoc should describe a class or method's purpose as if it were defined by an interface, not restate or leak implementation details that create churn when the implementation changes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Overall looks good to me after all the recent updates, just noticed some typos on the docs
| /** Indicates an entry that has been replaced by a column update or DV change. Added in v4. */ | ||
| REPLACED(3); | ||
| /** | ||
| * The old (replaced) state of an entry that has been modified. Paired with MODIFIED. Added in v4. |
There was a problem hiding this comment.
The current comment state seems right to me and implicitly covers the case of leaf manifests that @gaborkaszab was talking about. REPLACED is paired with MODIFIED, but MODIFIED is not neccessarily paired with REPLACED (don't need to explicitly say that last part).
| TrackingStruct tracking = new TrackingStruct(Tracking.schema()); | ||
| tracking.set(STATUS_ORDINAL, EntryStatus.MODIFIED.id()); | ||
| tracking.set(DATA_SEQUENCE_NUMBER_ORDINAL, 5L); | ||
| tracking.set(FILE_SEQUENCE_NUMBER_ORDINAL, 6L); |
There was a problem hiding this comment.
Because these are set, this wouldn't be inherited even if MODIFIED inherited the values.
|
|
||
| @Test | ||
| void testAddedBuilder() { | ||
| void testAddedWithSameCommitDvStaysAdded() { |
There was a problem hiding this comment.
Style: Abbreviations should not be converted to CamelCase in the project. DV should stay DV in names.
rdblue
left a comment
There was a problem hiding this comment.
This seems about ready to me.
a column update or DV change.