Core: Add v4 TrackedFileAdapters to bridge Data/Delete Files#16100

Merged
rdblue merged 21 commits into apache:main from anoopj:v4-tracked-file-struct-adapters
Jun 11, 2026

Conversation

@anoopj anoopj commented Apr 24, 2026

Member

The adapter bridges TrackedFile to the existing DataFile/DeleteFile APIs, which minimizes the v4-related code changes needed during scan planning and commits.

Closes #16222

@github-actions github-actions Bot added the core label Apr 24, 2026
@anoopj anoopj moved this to In review in V4: metadata tree Apr 24, 2026
return new TrackedDeleteFile(file, spec);
}

// TODO: TrackedFile will likely get an explicit partition tuple field (using a union partition

Member Author

This will change after the approach to store partition tuple is settled.

Contributor

Steven and Russell replied on the dev list that they're good with a tracked partition tuple. I also confirmed with Dan and Amogh, so I think we are probably good to go. Maybe we should start getting the changes to the interfaces and implementations done? This would then only need to run the projection.

@anoopj anoopj changed the title [core] v4: Add TrackedFileAdapters to bridge Data/Delete Files Core: Add v4 TrackedFileAdapters to bridge Data/Delete Files Apr 24, 2026
return result.isEmpty() ? null : result;
}

static Map<Integer, Long> nullValueCounts(ContentStats stats) {

Member Author

An open question is whether it's worth caching the stats (lazily or eagerly). I don't see a lot of repeated reads, so it may not be worth it.

Contributor

It's probably fine. I have a comment about this below.

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileAdapters.java
@anoopj anoopj requested a review from stevenzwu April 30, 2026 15:10

private TrackedFileAdapters() {}

static DataFile asDataFile(TrackedFile file, PartitionSpec spec) {

Contributor

What was your reason not to make this a method of TrackedFile?

Also, PartitionSpec is tracked by the file itself, so it is very strange to pass it in here. At a minimum, I would expect this to validate that the spec's ID matches the file's spec ID. But it would be better to have some way to look up the spec by ID instead of forcing the caller to do it and then validating that the caller did it correctly.

Member Author

I intentionally kept the adapters outside of TrackedFile to avoid coupling it with Data/DeleteFile. It seemed like an adapter concern.

The tracked file keeps track of the spec ID, but not the spec itself. I saw some v3 code paths that followed a similar pattern of passing around specsById. Pretty much open to changing it if you have suggestions.

Member Author

I changed the methods to take in specsById map instead.
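
The lookup discussed in this thread can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ExampleSpec`, `ExampleTrackedFile`, and `SpecLookup` are hypothetical stand-ins for Iceberg's `PartitionSpec`, `TrackedFile`, and the adapter's private helper, and rejecting a missing spec ID is an assumption made only to keep the sketch small.

```java
import java.util.Map;

// Hypothetical stand-in for PartitionSpec; only the spec ID is modeled.
final class ExampleSpec {
  private final int specId;

  ExampleSpec(int specId) {
    this.specId = specId;
  }

  int specId() {
    return specId;
  }
}

// Hypothetical stand-in for TrackedFile; the spec ID is nullable, as in the thread.
final class ExampleTrackedFile {
  private final String location;
  private final Integer specId;

  ExampleTrackedFile(String location, Integer specId) {
    this.location = location;
    this.specId = specId;
  }

  String location() {
    return location;
  }

  Integer specId() {
    return specId;
  }
}

final class SpecLookup {
  private SpecLookup() {}

  // Resolves the spec named by the file itself, so a caller cannot pass a mismatched spec.
  static ExampleSpec resolveSpec(ExampleTrackedFile file, Map<Integer, ExampleSpec> specsById) {
    if (file.specId() == null) {
      // Assumption for this sketch only: a missing spec ID is treated as an error.
      throw new IllegalArgumentException("Missing spec ID for file: " + file.location());
    }

    ExampleSpec spec = specsById.get(file.specId());
    if (spec == null) {
      throw new IllegalArgumentException(
          "Unknown spec ID " + file.specId() + " for file: " + file.location());
    }

    return spec;
  }
}
```

With this shape the caller hands over the whole map once, and the helper owns both the lookup and the failure message.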

// each reported the full file size).
@Override
public long fileSizeInBytes() {
return dv.sizeInBytes();

Contributor

I appreciate the comment. The decision here seems reasonable to me since we don't know the total Puffin file size.

We should also consider whether we want to have a field in the tracked_file struct for this. We originally wanted to use file_size_in_bytes for the Puffin file size so that we could determine when DVs should be compacted. But variance in the footer was a problem that prevented it from being used as intended.

@aokolnychyi should we revisit this?

Contributor

Even if we know the total Puffin file size, dv.sizeInBytes() still seems to be the correct value here. A Puffin file may contain thousands of DVs, while a logical DeleteFile should contain only one DV. The DeleteFile size should therefore be the DV size.

@Override
public Map<Integer, ByteBuffer> lowerBounds() {
return TrackedFileAdapters.lowerBounds(file.contentStats());

Contributor

This is okay, but a little concerning because the helper method here creates a new map every time it is called. Creating the map is an expensive operation because it allocates buffers to hold each column bound and serializes into that buffer. Then the map is thrown away rather than reused.

The evaluators themselves (for example, InclusiveMetricsEvaluator) call these methods only once per data file, but if multiple evaluators are used then we should expect a performance degradation.

On the other hand, we want to have evaluators that work directly with ContentStats instead of going through these methods. I think I'm fine with leaving this as-is for now, but we need to make sure that we use the right evaluators everywhere.

Member Author

Agree. Will build a new evaluator that operates on content stats, as a followup. cc @nastra

Member Author

Filed #16218 to track this.

Contributor

Let's hold off on doing it right now. I think the ContentStats API is going to change.
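
For context on the allocation concern raised in this thread, caching the derived map would look roughly like the sketch below. It is not part of the PR, and the reviewers chose to leave the adapter as-is; `MemoizedBounds` is a hypothetical name, and the sketch assumes single-threaded use.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper, not from the PR: builds an expensive map once and reuses it.
final class MemoizedBounds {
  private final Supplier<Map<Integer, ByteBuffer>> loader;
  private Map<Integer, ByteBuffer> cached;
  // A separate flag is needed because the loader may legitimately return null
  // (the helper in the diff returns null for an empty result).
  private boolean loaded = false;

  MemoizedBounds(Supplier<Map<Integer, ByteBuffer>> loader) {
    this.loader = loader;
  }

  // Not thread-safe: concurrent callers could each invoke the loader.
  Map<Integer, ByteBuffer> get() {
    if (!loaded) {
      cached = loader.get();
      loaded = true;
    }
    return cached;
  }
}
```

The trade-off is one extra reference held per file for the life of the adapter, which is why caching only pays off when several evaluators read the same bounds.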

return new TrackedDataFile(file, spec);
}

static DeleteFile asDVDeleteFile(TrackedFile file, PartitionSpec spec) {

Contributor

We will also need a way to wrap TrackedFile as a v2 DeleteFile for position deletes.

Member Author

Ack. I will create a followup PR for this so that the size is manageable.

Member Author

Added TrackedPositionDeleteFile

private final Tracking tracking;
private final PartitionSpec spec;

private AbstractTrackedContentFile(TrackedFile file, PartitionSpec spec) {

Contributor

Rather than Abstract, we typically use Base for the prefix. For example, BaseAction or BaseContentStats.

Contributor

I think we can even drop the Base here with just TrackedContentFile, because ContentFile is a base class. This is also consistent with other classes like TrackedDataFile, TrackedDVDeleteFile.

Member Author

I dropped the Base. @rdblue let me know if that's OK.

@Override
public int specId() {
return spec.specId();

Contributor

If the spec ID is set on file then it should be returned.

The last version was this:

    return file.specId() != null ? file.specId() : 0;

The problem wasn't that it was returning file.specId(). When that spec ID is set, it is canonical and is the ID that was used to look up spec. The problem was that it was guessing ID 0 for the unpartitioned spec, which is not correct. The updated version should be this:

    return file.specId() != null ? file.specId() : spec.specId();

Member Author

Done.

Contributor

Is it? It looks the same to me.

@anoopj anoopj Jun 11, 2026

Member Author

I had incorporated it in May, but reverted it yesterday because of conflicting review feedback. My bad: I forgot our original discussion about why we added it. With the new validation we added, though, just using spec.specId() is provably correct. Let me know if that is reasonable.

Contributor

Thanks for the reference. I agree with @stevenzwu that it should be the same. And now that we are checking it in the constructor, this should be fine.
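
The reasoning in this thread can be illustrated with a small sketch: once the constructor rejects a mismatch, returning `spec.specId()` gives the same answer as the ternary suggested earlier. The classes below (`SketchSpec`, `SketchTrackedFile`, `SketchContentFile`) are hypothetical stand-ins, and the exact check in the PR may differ.

```java
// Hypothetical stand-in for PartitionSpec; only the spec ID is modeled.
final class SketchSpec {
  private final int specId;

  SketchSpec(int specId) {
    this.specId = specId;
  }

  int specId() {
    return specId;
  }
}

// Hypothetical stand-in for TrackedFile with a nullable spec ID.
final class SketchTrackedFile {
  private final Integer specId;

  SketchTrackedFile(Integer specId) {
    this.specId = specId;
  }

  Integer specId() {
    return specId;
  }
}

// Hypothetical adapter base class showing only the spec ID handling.
final class SketchContentFile {
  private final SketchSpec spec;

  SketchContentFile(SketchTrackedFile file, SketchSpec spec) {
    // If the file records a spec ID, it must agree with the spec that was resolved for it.
    if (file.specId() != null && file.specId().intValue() != spec.specId()) {
      throw new IllegalArgumentException(
          "Spec ID mismatch: file has " + file.specId() + ", spec has " + spec.specId());
    }
    this.spec = spec;
  }

  // Equivalent to `file.specId() != null ? file.specId() : spec.specId()`
  // because the constructor has already ruled out a mismatch.
  int specId() {
    return spec.specId();
  }
}
```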

* <p>Subclasses provide {@code content()}, {@code firstRowId()}, {@code equalityFieldIds()}, and
* the copy methods.
*/
private abstract static class AbstractTrackedContentFile<F extends ContentFile<F>>

Contributor

It would be helpful to implement the version for v2 position deletes here as well. That would make it easier to evaluate the implementations here, although I suspect it will be fine.

Member Author

Added it. It's actually fairly light.

tracking.set(3, 11L);
tracking.set(5, 1000L);
tracking.setManifestLocation("s3://bucket/manifest.avro");
tracking.set(8, 7L);

Contributor

Switch to the builder that was added recently?

Contributor

+1, since setting every field by position is quite brittle.

Member Author

Done.

@anoopj anoopj force-pushed the v4-tracked-file-struct-adapters branch from 5ed9aeb to 9fa6e79 Compare June 8, 2026 10:34
@anoopj anoopj requested a review from rdblue June 8, 2026 10:35
}
}

private static PartitionSpec resolveSpec(

Contributor

A note for the future: if this utility method is useful in other settings, we can consider moving it to a util class like TrackedFileUtil. But not now.

Member Author

Agree

Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileAdapters.java
@anoopj anoopj requested a review from stevenzwu June 10, 2026 00:06
@stevenzwu

Contributor

The CVE scan failures are noise. They are not caused by this PR.

The fix has been merged. #16749

@anoopj anoopj requested a review from rdblue June 11, 2026 05:03
@rdblue rdblue merged commit 9c57bb5 into apache:main Jun 11, 2026
53 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in V4: metadata tree Jun 11, 2026
@anoopj anoopj deleted the v4-tracked-file-struct-adapters branch June 13, 2026 02:27
@nssalian nssalian added this to the Iceberg 1.12.0 milestone Jun 23, 2026

Development

Successfully merging this pull request may close these issues.

v4: Replace transform-based partition derivation in TrackedFileAdapters with explicit partition tuple

6 participants