Skip to content

Core: Add writer_format_version field to TrackedFile#16688

Merged
amogh-jahagirdar merged 1 commit into
apache:mainfrom
gaborkaszab:main_writer_format_version
Jun 18, 2026
Merged

Core: Add writer_format_version field to TrackedFile#16688
amogh-jahagirdar merged 1 commit into
apache:mainfrom
gaborkaszab:main_writer_format_version

Conversation

@gaborkaszab

@gaborkaszab gaborkaszab commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Introduce writer_format_version field into TrackedFile.

One unrelated on-liner adjustment: in TrackedFile.internalSet() incoming record_count is converted into object Long, however the member itself is primitive long. Change the conversion to primitive.

@github-actions github-actions Bot added the core label Jun 5, 2026
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

This field was in the spec PR and also in the proposal docs, but missing from the Java implementation. cc @amogh-jahagirdar @rdblue @anoopj

Types.NestedField.required(
104, "file_size_in_bytes", Types.LongType.get(), "Total file size in bytes");
Types.NestedField WRITER_FORMAT_VERSION =
Types.NestedField.required(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is optional per the proposal doc. An alternative is to it required - ie require the table upgrades to populate it. Probably worth a discussion

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in the doc it's optional, but in the spec PR it's required. Maybe @amogh-jahagirdar could give us more context.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu One of your other comments said we should follow the spec PR instead of the doc. Would it also apply here? Then this field should be required as in this PR already

@stevenzwu stevenzwu Jun 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should discuss these two things together.

  1. if we define the writer_format_version as 0 (for pre-v4) and 1 (for v4) that are not aligned with table versions, this can be a required field
  2. if we define the writer_format_version as 4 and null for pre-v4, then this has to be an optional field because we don't know the table version when the old manifest files were written.

personally, I favor option 2 a little because we don't need to introduce another versioning system.

I will resolve my other comment to consolidate on this thread.

@amogh-jahagirdar amogh-jahagirdar Jun 8, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me make sure the doc is up to date, but I think it's best for this to be a required field. We always know if a manifest is pre-v4 or if it's v4. On upgrade, a writer would just need to write the existing leaf manifests with pre-v4 in the root. I feel like there's no reason for it to not be required, it's known by writers and easy to determine and at the storage level it's going to have very little footprint . If we made it optional, it basically makes it impossible for readers to really know about the manifest version without poking at the schema / checking k/v metadata.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be 0: pre-V4, 4: V4

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, thanks for calling out! I made the adjustment, and added and extra test suite to test the enum values

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concerns about the WriterFormatVersion enum approach.

Existing precedent for table format version is plain int — TableMetadata uses int formatVersion with MIN_FORMAT_VERSION_* constants and direct compares (formatVersion >= 3).

The enum hurts here:

  1. Likely call shape is >= 4 checks (gate v4 metadata behavior); enum forces .id() >= 4 at every site.
  2. WriterFormatVersion.fromId throws on unknown ids — a v4 reader will crash deserializing a v5-written leaf. Plain int + >= 4 gate is forward-compatible.
  3. PRE_V4 as a constant ages awkwardly when v5 ships.

Suggest int writerFormatVersion() paralleling int TableMetadata.formatVersion(), with a named constant for any v4 boundary check.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback, @stevenzwu !

I personally am fine with points 1. and 3. , but forward-compatibility is no doubt an issue. Thanks for pointing that out! I'll get back with a non-enum implementation soon.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the representation to primitive int

Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileStruct.java Outdated
Comment thread core/src/test/java/org/apache/iceberg/TestTrackedFileStruct.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFile.java Outdated
@pvary pvary moved this to In review in V4: metadata tree Jun 8, 2026
@gaborkaszab gaborkaszab force-pushed the main_writer_format_version branch from 0767a6e to f38ebb5 Compare June 8, 2026 12:07
@gaborkaszab gaborkaszab force-pushed the main_writer_format_version branch 2 times, most recently from 5364261 to fb2bc13 Compare June 9, 2026 14:56
Comment thread core/src/test/java/org/apache/iceberg/TestWriterFormatVersion.java Outdated
@gaborkaszab gaborkaszab force-pushed the main_writer_format_version branch from fb2bc13 to f464242 Compare June 11, 2026 06:57
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

@stevenzwu @rdblue @amogh-jahagirdar
Would you mind taking a look?

@gaborkaszab gaborkaszab force-pushed the main_writer_format_version branch from f464242 to 9668f23 Compare June 11, 2026 16:56
TrackedFileStruct(
Tracking tracking,
FileContent contentType,
int writerFormatVersion,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this location in the constructor intentional? Feels a bit weird to have to pass this in before passing in a location and FileFormat etc?

@gaborkaszab gaborkaszab Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially it was at the end, @stevenzwu asked to move it before location in the schema. Made sense to follow the order of the schema with the constructor.
Initial comment

Comment on lines +399 to +403
file.set(SPEC_ID_ORDINAL, 0);
file.set(SORT_ORDER_ID_ORDINAL, 1);
file.set(DELETION_VECTOR_ORDINAL, dv);
file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {1, 2, 3}));
file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(50L));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning all this up!

Comment on lines -247 to -248
// projected position 0 maps to internal position 2 (location)
// projected position 1 maps to internal position 5 (file_size_in_bytes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to change this comment? I know we changed the rest of the test class but those seemed like a legitimate cleanup. Here, I think we can avoid some needless conflicts.

@gaborkaszab gaborkaszab Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I made the ORDINAL variable introduction via AI, I overlooked the changes in these comments.

@gaborkaszab gaborkaszab force-pushed the main_writer_format_version branch from 9668f23 to b0db631 Compare June 16, 2026 11:42
@gaborkaszab gaborkaszab force-pushed the main_writer_format_version branch from b0db631 to e004037 Compare June 16, 2026 12:33

@amogh-jahagirdar amogh-jahagirdar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gaborkaszab I think we can just go ahead and merge this

@amogh-jahagirdar amogh-jahagirdar merged commit 8924046 into apache:main Jun 18, 2026
53 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in V4: metadata tree Jun 18, 2026
@gaborkaszab

Copy link
Copy Markdown
Contributor Author

Thanks for the reviews @anoopj @stevenzwu @amogh-jahagirdar !

@nssalian nssalian added this to the Iceberg 1.12.0 milestone Jun 23, 2026
"Type of content: 0=DATA, 2=EQUALITY_DELETES, 3=DATA_MANIFEST, 4=DELETE_MANIFEST");
Types.NestedField WRITER_FORMAT_VERSION =
Types.NestedField.required(
157, "writer_format_version", Types.IntegerType.get(), "Writer format version");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this field name makes sense. Writers don't have a format version. The Java implementation can write any format version.

I think this is trying to introduce the format version used to produce a metadata file, in which case it would be the file's format version. We need to be specific in these cases so that it is not unclear to anyone implementing the spec or working with the metadata.

I think that we should rename this to format_version. It is the format version that defines the file's compatibility. Any v4 reader (or newer) can read the file.

Next, should this be required or should we use null to signal that the value is unknown? Why write a non-null value like 0 if we don't really know the format version? Using null seems to fit "unknown" better, at least to me.

Last, why should this be a part of TrackedFile rather than ManifestInfo? Is this required for data or delete files? If it is required, what is the value of writing it? The simplest solution is to make this only apply to manifests, in which case it should be located in ManifestInfo.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I can get behind the rename to format_version but for the other points,

1.) At the time of upgrade we always know that any existing manifests are pre-v4. Yes we don't know which exact format version but it's not truly unknown. Any subsequent writes will produce v4 metadata. It feels a bit odd to persist null in metadata for the pre-v4 cases but in the end either approach can be hidden behind the API. From a metadata footprint perspective I think 0 and null are effectively the same here.

2.) Yes the intent behind putting in TrackedFile and having it required for data files is so that we know exactly which versions produced that. This is largely for consistency of the metadata with leaf manifests (without sacrificing metadata footprint) and also as a minor point being able to be more flexible with cases like a small V3 table with a single manifest (or very few entries) and just moving all those entries to a root manifest without losing the fidelity that they were produced by V3 to begin with. Or on upgrade just knowing which files were from the previous version of the table. This would help with debuggability in case of issues on upgrade where we need to figure out "hey what are possibly impacted files". The latter point is minor though. If we're confident that we only need it for manifests I think that's OK but then we're in an awkward spot for future evolutions of the format where we need higher fidelity for leaf entries on the format versdion (because we'd have 2 places, and more writer rules etc).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to @amogh-jahagirdar for pointing out the thread where null/0 were discussed. I think it came down to this comment:

The enum [WriterFormatVersion] hurts here:

  1. Likely call shape is >= 4 checks (gate v4 metadata behavior); enum forces .id() >= 4 at every site.
  2. WriterFormatVersion.fromId throws on unknown ids — a v4 reader will crash deserializing a v5-written leaf. Plain int + >= 4 gate is forward-compatible.
  3. PRE_V4 as a constant ages awkwardly when v5 ships.

My replies:

  1. Can't we define and use greaterThan(int) to handle this? We don't have to get the ID and compare directly.
  2. Similar to 1, fromId can return PRE_V4 or UNKNOWN for null
  3. I liked the earlier discussion that used "unknown" to solve this issue. For that we can use either ID 0 or null.

I think either 0 or null would be reasonable. I don't think that these points lead me to conclude that 0 is better. I think it's okay either way and that this is probably the smallest concern. I do agree that we should use 4 to represent v4, 5 for v5, etc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a metadata footprint perspective I think 0 and null are effectively the same here.

I agree. We can go either way and it's probably fine to do that.

Or on upgrade just knowing which files were from the previous version of the table.

This is a good point. +1 for including this in TrackedFile.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback @rdblue and @amogh-jahagirdar !

The points discussed so far:

  1. I'll open a follow-up PR to rename the field to format_version

  2. required + 0 or optional + null for pre-v4 written files
    I don't have a strong opinion, either seems fine. Maybe required + 0 forces the users/writers to think through more what value to write?

  3. Will keep the field in TrackedFile instead of moving to ManifestInfo

  4. enum instead of int
    I think forward-compatibility is still an open question for the enum representation. For instance in a V4 reader the enum could look like this:

enum FormatVersion {
  PRE_V4(0),
  V4(4);
...
  public static FormatVersion fromId(int id) { ... }
}

What would be the expectation of reading metadata written by V5 writers, like FormatVersion.fromId(5);?
Should we keep an enum value for a later unknown version? It'd be "lossy" as it won't necessary keep the original value. We could have a class instead of an enum that can also keep the original value and also provide fromId(int) and greaterThan(int) functions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use required and 0=unknown. That seems clean and gives us a nice invariant for implementations.

jakelong95 pushed a commit to jakelong95/iceberg that referenced this pull request Jun 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants