Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 21 additions & 0 deletions core/src/main/java/org/apache/iceberg/DeletionVectorStruct.java
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
package org.apache.iceberg;

import java.io.Serializable;
import java.util.Objects;
import org.apache.iceberg.avro.SupportsIndexProjection;
import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
Expand Down Expand Up @@ -128,6 +129,26 @@ static Builder builder() {
return new Builder();
}

@Override
public boolean equals(Object other) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StructLike classes in Iceberg often don't implement the equals method. but in this case, I am ok with it for usage in the TrackedFileBuilder

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should remove this. There are multiple ways of considering something "equal" and we try to avoid equals methods when it is unclear which one should be used. That's the case here, where it would be reasonable to assume that two DeletionVector instances are the same if they represent the same DV (location and offset are equal) but also reasonable to assume two instances are equal if they have the same values (mostly implemented here).

In addition to the problem of what this means, there's another problem that this definition only works for other DeletionVectorStruct instances. That means the definition of equality is more murky: two DeletionVector instances can be equal by either definition above but still may not be equal here. The purpose of using interfaces for the public API is to make it so that implementations are interchangeable and this breaks that principle.

I think this was introduced to be able to detect changes applied to the builder, but we should find a better way to solve that problem rather than introducing this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue the comparison was introduced due to my comment here. Do you think it is sth we should guard against? if not, we can maybe just add Javadoc to explain the usage. if yes, should we just do a reference comparison like this.deletionVector != newDeletionVector or we want a util method like `DelectionVectorUtil.sameContents(DeletionVector, DeletionVector)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped these overrides here, and went for the verification suggested by Ryan within the builder class.

Comment thread
rdblue marked this conversation as resolved.
if (this == other) {
return true;
} else if (!(other instanceof DeletionVectorStruct)) {
return false;
}

DeletionVectorStruct that = (DeletionVectorStruct) other;
return Objects.equals(location, that.location)
&& offset == that.offset
&& sizeInBytes == that.sizeInBytes
&& cardinality == that.cardinality;
}

@Override
public int hashCode() {
return Objects.hash(location, offset, sizeInBytes, cardinality);
}

@Override
public String toString() {
return MoreObjects.toStringHelper(this)
Expand Down
364 changes: 364 additions & 0 deletions core/src/main/java/org/apache/iceberg/TrackedFileBuilder.java
Original file line number Diff line number Diff line change
@@ -0,0 +1,364 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg;

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class TrackedFileBuilder {
private final long snapshotId;
private final FileContent contentType;

// Required fields
private Integer writerFormatVersion = null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this ever be null? I'm starting from the assumption that in v4 we will only write v4 metadata files. If that's true, then I think there are two cases when building a TrackedFile instance: either we are writing a v4 file or we are adapting an existing v1-3 file to be tracked in v4. The first case is easy: always set this to 4.

Does the second case go through this builder? I would expect to use a ManifestFile to TrackedFile adapter, like the adapters @anoopj introduced for TrackedFile to DataFile, DeleteFile, etc. That is a good pattern to follow because we never want a caller that is moving tracked files around to have the logic for migrating from v3 to v4.

@stevenzwu stevenzwu Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I have a follow-up PR (draft status now) to adapt v3 shapes (like ManifestFile and DataFile) to v4 TrackedFile. It would be good to use builder for those adaptions too, unless you are thinking about using the constructor directly.

Regarding the null or not, current proposal/spec PR said it is not null (0 for pre-v4 and format versions for v4+)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, would we use adapters to wrap DataFile and DeleteFile anyway? Those are what our writers produce already and I doubt that we will want to go through and update all the plumbing to produce TrackedFile directly. The utility of this class may be very limited.

I could see us producing instances of TrackedFile by reading and wrapping v3 DataFile instances, or wrapping a new DataFile instance for v4. We should verify there are cases where we would use this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, as Steven noted, we could use the TrackedFileBuilder for the wrappers around V3 metadata structs, and there we could use this writer_format_version field through the builder. See the draft PR linked by Steven and let me know if it makes sense.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we concluded on (writer_)format_version related question on the other dedicated PR. Let me know if there is still something outstanding here!
Checking again Steven's PR on the wrappers, I think it makes sense to set this through the builder.

private String location = null;
private FileFormat fileFormat = null;
private Long recordCount = null;
private Long fileSizeInBytes = null;
private PartitionData partitionData = null;

@rdblue rdblue Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PartitionData should be StructLike. We don't want to rely on concrete classes here.

Also, this is in the "required fields" section, but I think that we want to allow this to be null for unpartitioned cases. The partition tuple will be null when specId is null or an unpartitioned spec, because the actual partition type that we store is the union schema of all valid partition specs in the table. And support for empty struct types and instances is unreliable.


// optional fields
private Integer specId = null;
private ContentStats contentStats = null;
private Integer sortOrderId = null;
private DeletionVector deletionVector = null;
private ManifestInfo manifestInfo = null;
private ByteBuffer keyMetadata = null;
private List<Long> splitOffsets = null;
private List<Integer> equalityIds = null;

// tracking-related fields
private Tracking sourceTracking = null;
private boolean dvUpdated = false;
private ByteBuffer deletedPositions = null;
private ByteBuffer replacedPositions = null;

/**
* Creates a builder for a newly added data file entry.
*
* @param newSnapshotId the snapshot ID in which the new tracked file will be committed
*/
static TrackedFileBuilder data(long newSnapshotId) {
return new TrackedFileBuilder(FileContent.DATA, newSnapshotId);
}

/**
* Creates a builder for a newly added equality delete file entry.
*
* @param newSnapshotId the snapshot ID in which the new tracked file will be committed
*/
static TrackedFileBuilder equalityDelete(long newSnapshotId) {
return new TrackedFileBuilder(FileContent.EQUALITY_DELETES, newSnapshotId);
}

/**
* Creates a builder for a newly added data manifest entry.
*
* @param newSnapshotId the snapshot ID in which the new tracked file will be committed
*/
static TrackedFileBuilder dataManifest(long newSnapshotId) {
return new TrackedFileBuilder(FileContent.DATA_MANIFEST, newSnapshotId);
}

/**
* Creates a builder for a newly added delete manifest entry.
*
* @param newSnapshotId the snapshot ID in which the new tracked file will be committed
*/
static TrackedFileBuilder deleteManifest(long newSnapshotId) {
return new TrackedFileBuilder(FileContent.DELETE_MANIFEST, newSnapshotId);
}

/**
* Creates a builder for a tracked file derived from {@code source}.
*
* @param source source tracked file to copy fields from
* @param newSnapshotId the snapshot ID in which the new tracked file will be committed
*/
static TrackedFileBuilder from(TrackedFile source, long newSnapshotId) {
Preconditions.checkArgument(source != null, "Invalid source: null");
return new TrackedFileBuilder(source, newSnapshotId);
}

/**
* Returns a DELETED tracked file derived from {@code source}.
*
* @param source source tracked file
* @param newSnapshotId the snapshot ID in which the new tracked file will be committed
*/
static TrackedFile deleted(TrackedFile source, long newSnapshotId) {
Preconditions.checkArgument(source != null, "Invalid source: null");
return terminal(source, TrackingBuilder.deleted(source.tracking(), newSnapshotId));
}

/**
* Returns a REPLACED tracked file derived from {@code source}.
*
* <p>Manifest entries cannot transition to REPLACED.
*
* @param source source tracked file
* @param newSnapshotId the snapshot ID in which the new tracked file will be committed
*/
static TrackedFile replaced(TrackedFile source, long newSnapshotId) {
Preconditions.checkArgument(source != null, "Invalid source: null");
Preconditions.checkArgument(
!isLeafManifest(source.contentType()),
"Manifest entries cannot transition to REPLACED, but entry type is: %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is less direct than we typically prefer, but is not bad. This states a general rule and then says that it applies. We would normally say that at once: "Cannot transition %s to REPLACED".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx! I think we can even remove the content type parameter of the message:
Cannot transition manifest files to REPLACED

source.contentType());
return terminal(source, TrackingBuilder.replaced(source.tracking(), newSnapshotId));
}

private static TrackedFile terminal(TrackedFile source, Tracking tracking) {
return new TrackedFileStruct(
tracking,
source.contentType(),
source.writerFormatVersion(),
source.location(),
source.fileFormat(),
(PartitionData) source.partition(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not need unchecked casts like this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difficulty here is the TrackedFile.partition() return StructLike while the TrackedFileStruct constructor accepts PartitionData. Since internally in TrackedFile we also have PartitionData this seemed safe

source.recordCount(),
source.fileSizeInBytes(),
source.specId(),
source.contentStats(),
source.sortOrderId(),
source.deletionVector(),
source.manifestInfo(),
source.keyMetadata(),
source.splitOffsets(),
source.equalityIds());
}

private TrackedFileBuilder(FileContent contentType, long snapshotId) {
this.contentType = contentType;
this.snapshotId = snapshotId;
}

private TrackedFileBuilder(TrackedFile source, long snapshotId) {
this.contentType = source.contentType();
this.snapshotId = snapshotId;
this.writerFormatVersion = source.writerFormatVersion();
this.location = source.location();
this.fileFormat = source.fileFormat();
this.recordCount = source.recordCount();
this.fileSizeInBytes = source.fileSizeInBytes();
this.partitionData = (PartitionData) source.partition();
this.specId = source.specId();
this.contentStats = source.contentStats();
this.sortOrderId = source.sortOrderId();
this.deletionVector = source.deletionVector();
this.manifestInfo = source.manifestInfo();
this.keyMetadata = source.keyMetadata();
this.splitOffsets = source.splitOffsets();
this.equalityIds = source.equalityIds();
this.sourceTracking = source.tracking();
}

TrackedFileBuilder writerFormatVersion(int newWriterFormatVersion) {
Preconditions.checkArgument(
newWriterFormatVersion >= 0,
"Invalid writer format version: %s (must be >= 0)",
newWriterFormatVersion);
this.writerFormatVersion = newWriterFormatVersion;
return this;
}

TrackedFileBuilder location(String newLocation) {
Preconditions.checkArgument(newLocation != null, "Invalid location: null");
this.location = newLocation;
return this;
}

TrackedFileBuilder fileFormat(FileFormat newFileFormat) {
Preconditions.checkArgument(newFileFormat != null, "Invalid file format: null");
this.fileFormat = newFileFormat;
return this;
}

TrackedFileBuilder recordCount(long newRecordCount) {
Preconditions.checkArgument(
newRecordCount >= 0, "Invalid record count: %s (must be >= 0)", newRecordCount);
this.recordCount = newRecordCount;
return this;
}

TrackedFileBuilder fileSizeInBytes(long newFileSizeInBytes) {
Preconditions.checkArgument(
newFileSizeInBytes >= 0,
"Invalid file size in bytes: %s (must be >= 0)",
newFileSizeInBytes);
this.fileSizeInBytes = newFileSizeInBytes;
return this;
}

TrackedFileBuilder specId(int newSpecId) {
Preconditions.checkArgument(newSpecId >= 0, "Invalid spec ID: %s (must be >= 0)", newSpecId);
this.specId = newSpecId;
return this;
}

TrackedFileBuilder partition(PartitionData newPartitionData) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should be combined with specId: partition(int newSpecId, StructLike partitionTuple). If the tuple is set, then spec ID is required so that we know how to interpret it. I suspect that this should also be responsible for wrapping the partition tuple with a projection to the union type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably miss something here, but I see that TrackedFile has a PartitionData member. If we want to produce that, is specId and partitionTuple enough? Don't we need a PartitionSpec to produce PartitionData?

Preconditions.checkArgument(newPartitionData != null, "Invalid partition: null");
this.partitionData = newPartitionData;
return this;
}

TrackedFileBuilder contentStats(ContentStats newContentStats) {
Preconditions.checkArgument(newContentStats != null, "Invalid content stats: null");
Comment thread
gaborkaszab marked this conversation as resolved.
this.contentStats = newContentStats;
return this;
}

TrackedFileBuilder sortOrderId(int newSortOrderId) {
Preconditions.checkArgument(
!isLeafManifest(contentType),
"Sort order ID cannot be added to manifest entries, but entry type is: %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cannot set sort order for manifest files"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

contentType);
Preconditions.checkArgument(
newSortOrderId >= 0, "Invalid sort order ID: %s (must be >= 0)", newSortOrderId);
this.sortOrderId = newSortOrderId;
return this;
}

TrackedFileBuilder deletionVector(DeletionVector newDeletionVector) {
Preconditions.checkArgument(newDeletionVector != null, "Invalid deletion vector: null");
Preconditions.checkArgument(
contentType == FileContent.DATA,
"Deletion vector can only be added to DATA entries, but entry type is: %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Cannot add deletion vector for file with content: %s"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, and found couple other similar messages, fixed them too.

contentType);
Preconditions.checkArgument(
this.deletionVector == null || !this.deletionVector.equals(newDeletionVector),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: this. is used when assigning to a field, but not when accessing a field.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

"The same deletion vector already added");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check that I would expect is deletionVector == null || newDeletionVector.cardinality() > deletionVector.cardinality(). That ensures that the deletion vector was updated, without relying on an equality check. Otherwise, bad data could still make it through just by using a different concrete class (which is not equal).

The cardinality check is closer to what we want although not foolproof. The underlying location and offset could be the same, although it is unlikely that would be the case. If you want to make sure that this is actually a different deletion vector, then I think the right thing to do is to do that check here:

Preconditions.checkArgument(
    deletionVector == null ||
        deletionVector.cardinality() < newDeletionVector.cardinality(),
    "Invalid DV update, cardinality must increase: existing=%s, new=%s", ...);
Preconditions.checkArgument(
    deletionVector == null ||
        !deletionVector.location().equals(newDeletionVector.location()) ||
        !deletionVector.offset().equals(newDeletionVector.offset()),
    "Invalid DV update: same offset and location has changed cardinality");

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, borrowed these checks!

this.deletionVector = newDeletionVector;
this.dvUpdated = true;
return this;
Comment thread
stevenzwu marked this conversation as resolved.
}

TrackedFileBuilder manifestInfo(ManifestInfo newManifestInfo) {
Preconditions.checkArgument(newManifestInfo != null, "Invalid manifest info: null");
Preconditions.checkArgument(
isLeafManifest(contentType),
"Manifest info can only be added to manifests, but entry type is: %s",
contentType);
this.manifestInfo = newManifestInfo;
return this;
}

TrackedFileBuilder keyMetadata(ByteBuffer newKeyMetadata) {
Preconditions.checkArgument(newKeyMetadata != null, "Invalid key metadata: null");
this.keyMetadata = newKeyMetadata;
return this;
}

TrackedFileBuilder splitOffsets(List<Long> newSplitOffsets) {
Preconditions.checkArgument(newSplitOffsets != null, "Invalid split offsets: null");
Preconditions.checkArgument(
!isLeafManifest(contentType),
"Split offsets cannot be added to manifest entries, but entry type is: %s",
contentType);
this.splitOffsets = newSplitOffsets;
return this;
}

TrackedFileBuilder equalityIds(List<Integer> newEqualityIds) {
Preconditions.checkArgument(newEqualityIds != null, "Invalid equality IDs: null");
Preconditions.checkArgument(
contentType == FileContent.EQUALITY_DELETES,
"Equality IDs can only be added to EQUALITY_DELETES entries, but entry type is: %s",
contentType);
this.equalityIds = newEqualityIds;
return this;
}

TrackedFileBuilder deletedPositions(ByteBuffer newDeletedPositions) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against this since I haven't seen how it is used yet. But I find these methods that capture data that will be passed through to Tracking suspicious. Why not pass in a new Tracking and not mirror its builder's API here? Also, if this is the pattern we want, then why not wrap a TrackingBuilder and pass this directly?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Felt a bit odd for me when I implemented this but went this way for two reasons:

  1. I tried to avoid allow passing Tracking in the TrackedFileBuilder for the users because similar reasons we don't allow passing status for TrackingBuilder: seems to represent an internal state and users might get it wrong to construct. We could guard against malformed Tracking objects with checks, but seemed that we already have everything to construct it internally and not push one more responsibility to the users of the builder.
  2. TrackingBuilder itself can't check for all of the constraints. For instance it doesn't know if it belongs to a data file or a delete manifest, hence can't verify that deleted/replaced positions are only allowed for manifests. We can perform such a check here, in the TrackedFileBuilder.

Introduced a TrackingBuilder member to the TrackedFileBuilder and we can delegate all these calls right away without keeping track the required information.

Preconditions.checkArgument(newDeletedPositions != null, "Invalid deleted positions: null");
Preconditions.checkArgument(
isLeafManifest(contentType),
"Deleted positions can only be added to manifest entries, but entry type is: %s",
contentType);
this.deletedPositions = newDeletedPositions;
return this;
}

TrackedFileBuilder replacedPositions(ByteBuffer newReplacedPositions) {
Preconditions.checkArgument(newReplacedPositions != null, "Invalid replaced positions: null");
Preconditions.checkArgument(
isLeafManifest(contentType),
"Replaced positions can only be added to manifest entries, but entry type is: %s",
contentType);
this.replacedPositions = newReplacedPositions;
return this;
}

private static boolean isLeafManifest(FileContent contentType) {
return contentType == FileContent.DATA_MANIFEST || contentType == FileContent.DELETE_MANIFEST;
}

TrackedFile build() {
Preconditions.checkArgument(
writerFormatVersion != null, "Missing required field: writer format version");
Preconditions.checkArgument(location != null, "Missing required field: location");
Preconditions.checkArgument(fileFormat != null, "Missing required field: file format");
Preconditions.checkArgument(recordCount != null, "Missing required field: record count");
Preconditions.checkArgument(
fileSizeInBytes != null, "Missing required field: file size in bytes");
Preconditions.checkArgument(partitionData != null, "Missing required field: partition data");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx!
I change partition not to be required through the builder

Preconditions.checkArgument(
!isLeafManifest(contentType) || manifestInfo != null,
"Missing required field: manifest info");
Preconditions.checkArgument(
contentType != FileContent.EQUALITY_DELETES || equalityIds != null,
"Missing required field: equality IDs");

TrackingBuilder trackingBuilder =
sourceTracking == null
? TrackingBuilder.added(snapshotId)
: TrackingBuilder.from(sourceTracking, snapshotId);

if (dvUpdated) {
trackingBuilder.dvUpdated();
}

if (deletedPositions != null) {
trackingBuilder.deletedPositions(deletedPositions);
}

if (replacedPositions != null) {
trackingBuilder.replacedPositions(replacedPositions);
}

return new TrackedFileStruct(
trackingBuilder.build(),
contentType,
writerFormatVersion,
location,
fileFormat,
partitionData,
recordCount,
fileSizeInBytes,
specId,
contentStats,
sortOrderId,
deletionVector,
manifestInfo,
keyMetadata,
splitOffsets,
equalityIds);
}
}
Loading