Core: Adjustments for TrackedFileBuilder#16964
Conversation
gaborkaszab
left a comment
There was a problem hiding this comment.
Carried over the unclosed comments from the original PR + some random extra explanation
|
|
||
| // optional fields | ||
| private Integer specId = null; | ||
| private PartitionData partition = null; |
There was a problem hiding this comment.
carry-over comment: "PartitionData should be StructLike. We don't want to rely on concrete classes here."
TrackedFileStruct constructor seem to accept PartitionData so we have to do the conversion at some point. Seemed straightforward to keep the same type in the builder as expected by the constructed class.
| TrackedFileBuilder partition(PartitionData newPartitionData) { | ||
| Preconditions.checkArgument(newPartitionData != null, "Invalid partition: null"); | ||
| this.partitionData = newPartitionData; | ||
| TrackedFileBuilder partition(PartitionSpec spec, StructLike newPartition) { |
There was a problem hiding this comment.
carry-over comment: "I think that this should be combined with specId: partition(int newSpecId, StructLike partitionTuple). If the tuple is set, then spec ID is required so that we know how to interpret it. I suspect that this should also be responsible for wrapping the partition tuple with a projection to the union type."
carry-over answer: "I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?"
| this.recordCount = source.recordCount(); | ||
| this.fileSizeInBytes = source.fileSizeInBytes(); | ||
| this.partitionData = (PartitionData) source.partition(); | ||
| this.partition = (PartitionData) source.partition(); |
There was a problem hiding this comment.
carry-over comment: "We should not need unchecked casts like this."
carry-over answer: "
The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe"
There was a problem hiding this comment.
Casting partition() to PartitionData where the concrete type is actually needed is already the established pattern in core. For example, ManifestFilterManager:
PartitionData partition = (PartitionData) file.partition();We can't narrow TrackedFile.partition() to return PartitionData to avoid the cast either. Every file and scan type in the codebase exposes partition() as StructLike — ContentFile, BaseFile, V1Metadata/V2Metadata, PartitionScanTask, ContentScanTask, etc.
| private static final PartitionData PARTITION = partition("books"); | ||
|
|
||
| // Tracking field ordinals, looked up from the schema so the tests do not hard-code offsets. | ||
| private static final int STATUS_ORDINAL = ordinalOf(Tracking.schema(), "status"); |
There was a problem hiding this comment.
I reverted the part of the test that constructs Tracking to the state before the builder, because that seemed independent of this change. Hence these show as appeared, but in fact restored
| this.recordCount = source.recordCount(); | ||
| this.fileSizeInBytes = source.fileSizeInBytes(); | ||
| this.partitionData = (PartitionData) source.partition(); | ||
| this.partition = (PartitionData) source.partition(); |
There was a problem hiding this comment.
Casting partition() to PartitionData where the concrete type is actually needed is already the established pattern in core. For example, ManifestFilterManager:
PartitionData partition = (PartitionData) file.partition();We can't narrow TrackedFile.partition() to return PartitionData to avoid the cast either. Every file and scan type in the codebase exposes partition() as StructLike — ContentFile, BaseFile, V1Metadata/V2Metadata, PartitionScanTask, ContentScanTask, etc.
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), | ||
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), |
There was a problem hiding this comment.
I don't believe the spec requiring increasing cardinality. A writer could reserialize the DVs into a new puffin blob. Should we allow equal cardinality?
There was a problem hiding this comment.
this is a very interesting scenario, like Puffin file compaction. It may not be useful to compact DV Puffin files due to the access pattern to DV blobs.
There was a problem hiding this comment.
Additional context: This was the original comment.
The intention was to prevent adding the same DV through the builder, because that bumps the dv_snapshot_id while not changing the DV itself. I first introduced an override for equals in DeletionVectorStruct but we decided to avoid such overrides.
As a workaround @rdblue suggested that we can "mimic" equality checks here by requiring 2 things:
- if DV cardinality doesn't increase than we probably add the same DV again
- if cardinality increased, that's great, but expectation is that we don't write the new DV in-place.
I haven't thought about DV compaction, seems something we might want to discuss in the future. If we decide to support it, we can still remove/restructure the conditions here. WDYT?
There was a problem hiding this comment.
+1 to keep the stricter validation for now and relax in the future if needed.
| deletionVector == null | ||
| || !deletionVector.location().equals(newDeletionVector.location()) | ||
| || deletionVector.offset() != newDeletionVector.offset(), | ||
| "Invalid DV update: same location and offset has changed cardinality"); |
There was a problem hiding this comment.
This is moot if we allow equal cardinality, but the error message part about cardinality is dependent on prior precondition check. It can be a bit confusing.
There was a problem hiding this comment.
I agree. Simplified the message here, and don't mention cardinality
| if (spec.isUnpartitioned()) { | ||
| Preconditions.checkArgument( | ||
| newPartition == null, "Invalid partition: must be null for unpartitioned spec"); | ||
| this.specId = spec.specId(); |
There was a problem hiding this comment.
Minor: this is in both branches of the if. Hoist it out?
There was a problem hiding this comment.
thx, was going to, but then I forgot :)
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), | ||
| "Invalid DV update, cardinality must increase: existing=%s, new=%s", | ||
| deletionVector == null ? null : deletionVector.cardinality(), |
There was a problem hiding this comment.
| deletionVector == null ? null : deletionVector.cardinality(), | |
| deletionVector.cardinality(), |
There was a problem hiding this comment.
ternary is not needed because your precondition check already validates null
There was a problem hiding this comment.
That's what I thought too, but apparently these Preconditions parameters aren't lazily evaluated. Without the ternary every test that wanted to set a DV gave NPE on this.
| class TrackedFileBuilder { | ||
| private final long snapshotId; | ||
| private final FileContent contentType; | ||
| private final TrackingBuilder trackingBuilder; |
There was a problem hiding this comment.
I can align my PR with this new shape to avoid the individual setters that you commented.
#16936
There was a problem hiding this comment.
technically, nothing changes from the user's perspective, it's just how we create Tracking internally.
0514210 to
390dc8c
Compare
gaborkaszab
left a comment
There was a problem hiding this comment.
Thanks for taking a look, @stevenzwu @anoopj !
One open question remained from this set of comments: Do we want to allow came dv cardinality for DV compaction?
| class TrackedFileBuilder { | ||
| private final long snapshotId; | ||
| private final FileContent contentType; | ||
| private final TrackingBuilder trackingBuilder; |
There was a problem hiding this comment.
technically, nothing changes from the user's perspective, it's just how we create Tracking internally.
| if (spec.isUnpartitioned()) { | ||
| Preconditions.checkArgument( | ||
| newPartition == null, "Invalid partition: must be null for unpartitioned spec"); | ||
| this.specId = spec.specId(); |
There was a problem hiding this comment.
thx, was going to, but then I forgot :)
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), | ||
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), |
There was a problem hiding this comment.
Additional context: This was the original comment.
The intention was to prevent adding the same DV through the builder, because that bumps the dv_snapshot_id while not changing the DV itself. I first introduced an override for equals in DeletionVectorStruct but we decided to avoid such overrides.
As a workaround @rdblue suggested that we can "mimic" equality checks here by requiring 2 things:
- if DV cardinality doesn't increase than we probably add the same DV again
- if cardinality increased, that's great, but expectation is that we don't write the new DV in-place.
I haven't thought about DV compaction, seems something we might want to discuss in the future. If we decide to support it, we can still remove/restructure the conditions here. WDYT?
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), | ||
| "Invalid DV update, cardinality must increase: existing=%s, new=%s", | ||
| deletionVector == null ? null : deletionVector.cardinality(), |
There was a problem hiding this comment.
That's what I thought too, but apparently these Preconditions parameters aren't lazily evaluated. Without the ternary every test that wanted to set a DV gave NPE on this.
| deletionVector == null | ||
| || !deletionVector.location().equals(newDeletionVector.location()) | ||
| || deletionVector.offset() != newDeletionVector.offset(), | ||
| "Invalid DV update: same location and offset has changed cardinality"); |
There was a problem hiding this comment.
I agree. Simplified the message here, and don't mention cardinality
390dc8c to
116b838
Compare
| Preconditions.checkArgument( | ||
| this.deletionVector == null || !this.deletionVector.equals(newDeletionVector), | ||
| "The same deletion vector already added"); | ||
| deletionVector == null || deletionVector.cardinality() < newDeletionVector.cardinality(), |
There was a problem hiding this comment.
+1 to keep the stricter validation for now and relax in the future if needed.
116b838 to
a0f84d3
Compare
|
Thanks for taking a look @stevenzwu and @anoopj ! I believe all your comments so far have been addressed. @rdblue Would you mind taking a look? The |
a0f84d3 to
36cf104
Compare
|
Note, I added an extra check that setting partition is not allowed for manifest entries. Let me know if I got this requirement wrong. Once the "optional partition" PR is merged, some of the expected values in the builder tests will be null instead of empty partition. |
| Preconditions.checkArgument( | ||
| deletionVector.cardinality() < newDeletionVector.cardinality(), | ||
| "Invalid DV update, cardinality must increase: existing=%s, new=%s", | ||
| deletionVector == null ? null : deletionVector.cardinality(), |
There was a problem hiding this comment.
is deletionVector nullable here? I think line 236 ensure only the nonnull will lands here? So we can use deletionVector.cardinality() directly
There was a problem hiding this comment.
also noticed there's similar flag before in #16964 (comment), but drop the redundant null-ternary
| deletionVector == null ? null : deletionVector.cardinality(), | |
| deletionVector.cardinality(), |
in the errorMessages seem to pass all UTs
There was a problem hiding this comment.
you're right. Now that I moved the Precondition into the "is not null" if, the ternary is not needed anymore.
| Preconditions.checkArgument(recordCount != null, "Missing required field: record count"); | ||
| Preconditions.checkArgument( | ||
| fileSizeInBytes != null, "Missing required field: file size in bytes"); | ||
| Preconditions.checkArgument(partitionData != null, "Missing required field: partition data"); |
There was a problem hiding this comment.
curious, do we need anything to help check if forgot to set the partition spec.
Something like
Preconditions.checkArgument(
isLeafManifest(contentType) || specId != null,
"Missing required field: spec ID");There was a problem hiding this comment.
I think it's valid not to set partition if the data file is unpartitioned. See my other PR that sets partition optional in the schema for exactly this.
| .fileFormat(FileFormat.PARQUET) | ||
| .recordCount(2000L) | ||
| .fileSizeInBytes(12345L) | ||
| .partition(unpartitioned, null) |
There was a problem hiding this comment.
I think if we omit this line .partition(unpartitioned, null), build will succeed as well without spec id
There was a problem hiding this comment.
it will, I verify here that specId could be null if the PartitionSpec is unpartitioned. Alternatively, we could omit calling partition on the builder, but this test is not for that purpose.
36cf104 to
7ef4aeb
Compare
Original PR: #16769
This one is a follow-up with adjustments addressing further comments.