Core: Refactor v4 struct builders to improve validation#16408
Conversation
443982e to
2a42fb0
Compare
8570d98 to
fbbdd10
Compare
506f8a5 to
b6da983
Compare
261a302 to
e6bc278
Compare
|
|
||
| static Builder builder() { | ||
| return new Builder(); | ||
| /** Creates a builder for a newly added file in the given snapshot. */ |
There was a problem hiding this comment.
Should we have a replaced() builder also?
There was a problem hiding this comment.
Good question. I think that we probably do since the caller is going to create one corresponding to every MODIFIED entry. We should probably also have a modified builder.
There was a problem hiding this comment.
Added a replaced() builder. I'll add MODIFIED, corresponding builder etc as followup
| private Builder(EntryStatus status, Tracking source, Long snapshotId) { | ||
| Preconditions.checkArgument( | ||
| source.dataSequenceNumber() != null, | ||
| "Source tracking has no data sequence number; cannot mark as %s", |
There was a problem hiding this comment.
What about Invalid tracking source: data sequence number is null? I don't think that we need to know what was intended for the new status. The main issue is that the source tracking was not read from a manifest (or some other issue leading to missing sequence numbers).
| source.fileSequenceNumber() != null, | ||
| "Source tracking has no file sequence number; cannot mark as %s", | ||
| status); | ||
| this.status = status; |
There was a problem hiding this comment.
Can we verify status with a private checkStatus(EntryStatus from, EntryStatus to)?
Allowed transitions are:
ADDEDtoEXISTING,MODIFIED,DELETED, andREPLACEDEXISTINGtoMODIFIED,DELETED, orREPLACEDMODIFIEDtoEXISTING,DELETED, orREPLACED
I don't think we would allow transitions from DELETED and REPLACED, or to ADDED.
There was a problem hiding this comment.
Done. Added a TODO to handle MODIFIED. Another question is whether we should allow EXISTING to EXISTING. The current code blocks it, but we might need it for manifest compaction?
There was a problem hiding this comment.
It depends on whether we want to build a new tracking for every written file or not. There's a good argument that we do want to so that we can remove things that don't need to be carried forward like the deleted and replaced bitmaps. But EXISTING entries probably don't need to be updated (just MODIFIED entries). It may still be cleaner to build a new one every time though.
This seems fine for now.
| Builder fileSequenceNumber(Long sequenceNumber) { | ||
| this.fileSequenceNumber = sequenceNumber; | ||
| return this; | ||
| private Builder(EntryStatus status, Tracking source, Long snapshotId) { |
There was a problem hiding this comment.
Isn't snapshotId always known?
There was a problem hiding this comment.
snapshotId is not allowed to be passed for EXISTING or MODIFIED because those inherit the snapshot ID from source. That means we should add another constructor for this case. I know that the current existing(...) method passes source.snapshotId(), but that moves logic from the builder into the builder factory methods. I would prefer to keep it all in the builder.
I don't have a preference for whether the new constructor calls this after verifying status or if it is a different constructor.
There was a problem hiding this comment.
Another option is to always pass the current operation's snapshot. Then we could use it to automatically set dvSnapshotId (and later, columnUpdateSnapshotId) when this is configured to have changed the DV or column set.
I kind of like this idea because it puts more of the correctness work on this builder.
There was a problem hiding this comment.
For now, I added a new constructor. I like the idea of passing the current operation's snapshot, but probably will tackle when we add MODIFIED.
| this.snapshotId = snapshotId; | ||
| this.dataSequenceNumber = source.dataSequenceNumber(); | ||
| this.fileSequenceNumber = source.fileSequenceNumber(); | ||
| this.firstRowId = source.firstRowId(); |
There was a problem hiding this comment.
This is also always non-null for data and manifest files. We should consider adding something to signal that intent, but I think it is correct for now not to verify it is non-null.
| this.fileSequenceNumber = source.fileSequenceNumber(); | ||
| this.firstRowId = source.firstRowId(); | ||
| this.dvSnapshotId = source.dvSnapshotId(); | ||
| ByteBuffer sourceDeleted = source.deletedPositions(); |
There was a problem hiding this comment.
These DVs (both deleted and replaced) should never be carried forward into the next Tracking object. They represent the positions of a manifest that were deleted (or, similarly, replaced) in the current commit. Subsequent commits clear these so they can be set for those commits.
There was a problem hiding this comment.
That makes sense. Fixed, and added a test testSourceDvPositionsAreNotCarriedForward()
|
|
||
| Builder deletedPositions(ByteBuffer positions) { | ||
| Preconditions.checkState( | ||
| status != EntryStatus.DELETED, "Cannot set deleted positions on a DELETED entry"); |
There was a problem hiding this comment.
Deleted or replaced positions are only valid for manifest files that have been changed. That means that ADDED is also not allowed here. When the status is DELETED or REPLACED, the tracking does not need to include these bitmaps (even if the copied entry did) so I would also not allow them for those statuses.
If the status is EXISTING then it must not have changed, so it also doesn't allow these bitmaps. This one is slightly different though, because we could choose to automatically detect and set EXISTING or MODIFIED based on changes to the status. If a DV is updated by calling dvSnapshotId then we could change the status to MODIFIED, or if these MDV methods are called we could also update the status to MODIFIED.
Lastly, this is used only for manifests and dvSnapshotId is used only for data files, so we should ensure that they are not set on the same builder.
There was a problem hiding this comment.
Done. I added some TODOs to handle MODIFIED.
| Builder dv(byte[] buffer) { | ||
| this.dv = buffer; | ||
| Preconditions.checkArgument(buffer != null, "Invalid DV: null"); | ||
| this.dv = ByteBuffers.toByteArray(buffer); |
There was a problem hiding this comment.
This is not guaranteed to be a deep copy, but consistent with BaseFile, and it would be unusual for DV arrays to be mutated
| this.fileSequenceNumber = sequenceNumber; | ||
| return this; | ||
| // TODO: extend allowed transitions once MODIFIED status is added. | ||
| private static void checkStatus(EntryStatus from, EntryStatus to) { |
There was a problem hiding this comment.
I think the most readable way to write this method is to have a Precondition.checkState for each from status. That will allow us to update it easily and it also makes more sense when you're trying to figure out what is valid.
|
@stevenzwu Thanks for the review! I've updated the PR based on your feedback. |
c7b4a59 to
012d74d
Compare
| Builder dvCardinality(long cardinality) { | ||
| Preconditions.checkArgument( | ||
| cardinality > 0, "Invalid DV cardinality: %s (must be positive)", cardinality); | ||
| cardinality >= 0, "Invalid DV cardinality: %s (must be >= 0)", cardinality); |
There was a problem hiding this comment.
I am actually wondering if it needs to be strictly positive. Does it make much sense to add an inline manifest DV with zero bits set? just leave the dv and dvCardinality as null in this case.
There was a problem hiding this comment.
We shouldn't worry about this. The field is going to be removed.
| @@ -210,38 +427,85 @@ void testProjectedStructLike() { | |||
| } | |||
There was a problem hiding this comment.
While we're on the projection / set path: TrackingStruct.internalSet silently ignores unknown ordinals to support reading newer formats (the default branch with the "must be from a newer version" comment). That forward-compat behavior has no test in any of the three structs (TrackingStruct, ManifestInfoStruct, DeletionVectorStruct). One regression test that calls set with an out-of-range ordinal and asserts no throw and no observable state change would protect the contract.
There was a problem hiding this comment.
Added test covering the forward-compat in all 3 structs.
| validateStatusTransition(source.status(), to); | ||
| return new TrackingStruct( | ||
| to, | ||
| newSnapshotId, |
There was a problem hiding this comment.
nit: this is against the description of Tracking.SNAPSHOT_ID that says Snapshot ID where the file was added or deleted. In fact, I think we should include REPLACED and MODIFIED too in the description based on the design here.
|
|
||
| @Test | ||
| void testSourceBuildersRejectSourceWithoutSequenceNumbers() { | ||
| Tracking missingBoth = TrackingBuilder.added(42L).build(); |
There was a problem hiding this comment.
Let me know what I miss here: Isn't it valid to have Tracking with ADDED status and null sequence numbers? That case the design doc says we'll inherit the sequence numbers. This test says that we can't use such a Tracking as a source in the builder.
What am I missing here?
There was a problem hiding this comment.
After giving this some extra thought: The 'source' is a result of reading the manifest through a ManifestReader, that I assume will read the sequence number from disk (null if ADDED) and if null it inherits the sequence number from the manifest before setting it to Tracking. Is this right? Then the 'source' won't have null as sequence numbers.
|
|
||
| @Test | ||
| void testInternalSetIgnoresUnknownOrdinal() { | ||
| TrackingStruct tracking = new TrackingStruct(Tracking.schema()); |
There was a problem hiding this comment.
nit: use sourceTracking + set the extra fields?
There was a problem hiding this comment.
I feel current code is fine since it tests internalSet anyway. I am going to merge this PR. we can follow up on this later if needed
|
Thanks @anoopj for the contribution, and @rdblue and @gaborkaszab for the review |
|
Thanks for reviewing and merging, @stevenzwu! |
This is a follow-up to last round of feedback on #16092 Improves the validation in the v4 struct builders to do early errors guide the state transition.