Core: Add writer_format_version field to TrackedFile#16688
Conversation
|
This field was in the spec PR and also in the proposal docs, but missing from the Java implementation. cc @amogh-jahagirdar @rdblue @anoopj |
| Types.NestedField.required( | ||
| 104, "file_size_in_bytes", Types.LongType.get(), "Total file size in bytes"); | ||
| Types.NestedField WRITER_FORMAT_VERSION = | ||
| Types.NestedField.required( |
There was a problem hiding this comment.
I believe this is optional per the proposal doc. An alternative is to it required - ie require the table upgrades to populate it. Probably worth a discussion
There was a problem hiding this comment.
Yes, in the doc it's optional, but in the spec PR it's required. Maybe @amogh-jahagirdar could give us more context.
There was a problem hiding this comment.
@stevenzwu One of your other comments said we should follow the spec PR instead of the doc. Would it also apply here? Then this field should be required as in this PR already
There was a problem hiding this comment.
I think we should discuss these two things together.
- if we define the writer_format_version as
0(for pre-v4) and1(for v4) that are not aligned with table versions, this can be a required field - if we define the writer_format_version as
4andnullfor pre-v4, then this has to be an optional field because we don't know the table version when the old manifest files were written.
personally, I favor option 2 a little because we don't need to introduce another versioning system.
I will resolve my other comment to consolidate on this thread.
There was a problem hiding this comment.
Let me make sure the doc is up to date, but I think it's best for this to be a required field. We always know if a manifest is pre-v4 or if it's v4. On upgrade, a writer would just need to write the existing leaf manifests with pre-v4 in the root. I feel like there's no reason for it to not be required, it's known by writers and easy to determine and at the storage level it's going to have very little footprint . If we made it optional, it basically makes it impossible for readers to really know about the manifest version without poking at the schema / checking k/v metadata.
There was a problem hiding this comment.
should be 0: pre-V4, 4: V4
There was a problem hiding this comment.
Indeed, thanks for calling out! I made the adjustment, and added and extra test suite to test the enum values
There was a problem hiding this comment.
I have concerns about the WriterFormatVersion enum approach.
Existing precedent for table format version is plain int — TableMetadata uses int formatVersion with MIN_FORMAT_VERSION_* constants and direct compares (formatVersion >= 3).
The enum hurts here:
- Likely call shape is
>= 4checks (gate v4 metadata behavior); enum forces.id() >= 4at every site. WriterFormatVersion.fromIdthrows on unknown ids — a v4 reader will crash deserializing a v5-written leaf. Plain int +>= 4gate is forward-compatible.PRE_V4as a constant ages awkwardly when v5 ships.
Suggest int writerFormatVersion() paralleling int TableMetadata.formatVersion(), with a named constant for any v4 boundary check.
There was a problem hiding this comment.
Thanks for the feedback, @stevenzwu !
I personally am fine with points 1. and 3. , but forward-compatibility is no doubt an issue. Thanks for pointing that out! I'll get back with a non-enum implementation soon.
There was a problem hiding this comment.
I changed the representation to primitive int
0767a6e to
f38ebb5
Compare
5364261 to
fb2bc13
Compare
fb2bc13 to
f464242
Compare
|
@stevenzwu @rdblue @amogh-jahagirdar |
f464242 to
9668f23
Compare
| TrackedFileStruct( | ||
| Tracking tracking, | ||
| FileContent contentType, | ||
| int writerFormatVersion, |
There was a problem hiding this comment.
Is this location in the constructor intentional? Feels a bit weird to have to pass this in before passing in a location and FileFormat etc?
There was a problem hiding this comment.
Initially it was at the end, @stevenzwu asked to move it before location in the schema. Made sense to follow the order of the schema with the constructor.
Initial comment
| file.set(SPEC_ID_ORDINAL, 0); | ||
| file.set(SORT_ORDER_ID_ORDINAL, 1); | ||
| file.set(DELETION_VECTOR_ORDINAL, dv); | ||
| file.set(KEY_METADATA_ORDINAL, ByteBuffer.wrap(new byte[] {1, 2, 3})); | ||
| file.set(SPLIT_OFFSETS_ORDINAL, ImmutableList.of(50L)); |
There was a problem hiding this comment.
Thanks for cleaning all this up!
| // projected position 0 maps to internal position 2 (location) | ||
| // projected position 1 maps to internal position 5 (file_size_in_bytes) |
There was a problem hiding this comment.
Do we really need to change this comment? I know we changed the rest of the test class but those seemed like a legitimate cleanup. Here, I think we can avoid some needless conflicts.
There was a problem hiding this comment.
You're right. I made the ORDINAL variable introduction via AI, I overlooked the changes in these comments.
9668f23 to
b0db631
Compare
b0db631 to
e004037
Compare
amogh-jahagirdar
left a comment
There was a problem hiding this comment.
Thanks @gaborkaszab I think we can just go ahead and merge this
|
Thanks for the reviews @anoopj @stevenzwu @amogh-jahagirdar ! |
| "Type of content: 0=DATA, 2=EQUALITY_DELETES, 3=DATA_MANIFEST, 4=DELETE_MANIFEST"); | ||
| Types.NestedField WRITER_FORMAT_VERSION = | ||
| Types.NestedField.required( | ||
| 157, "writer_format_version", Types.IntegerType.get(), "Writer format version"); |
There was a problem hiding this comment.
I don't think that this field name makes sense. Writers don't have a format version. The Java implementation can write any format version.
I think this is trying to introduce the format version used to produce a metadata file, in which case it would be the file's format version. We need to be specific in these cases so that it is not unclear to anyone implementing the spec or working with the metadata.
I think that we should rename this to format_version. It is the format version that defines the file's compatibility. Any v4 reader (or newer) can read the file.
Next, should this be required or should we use null to signal that the value is unknown? Why write a non-null value like 0 if we don't really know the format version? Using null seems to fit "unknown" better, at least to me.
Last, why should this be a part of TrackedFile rather than ManifestInfo? Is this required for data or delete files? If it is required, what is the value of writing it? The simplest solution is to make this only apply to manifests, in which case it should be located in ManifestInfo.
There was a problem hiding this comment.
I think I can get behind the rename to format_version but for the other points,
1.) At the time of upgrade we always know that any existing manifests are pre-v4. Yes we don't know which exact format version but it's not truly unknown. Any subsequent writes will produce v4 metadata. It feels a bit odd to persist null in metadata for the pre-v4 cases but in the end either approach can be hidden behind the API. From a metadata footprint perspective I think 0 and null are effectively the same here.
2.) Yes the intent behind putting in TrackedFile and having it required for data files is so that we know exactly which versions produced that. This is largely for consistency of the metadata with leaf manifests (without sacrificing metadata footprint) and also as a minor point being able to be more flexible with cases like a small V3 table with a single manifest (or very few entries) and just moving all those entries to a root manifest without losing the fidelity that they were produced by V3 to begin with. Or on upgrade just knowing which files were from the previous version of the table. This would help with debuggability in case of issues on upgrade where we need to figure out "hey what are possibly impacted files". The latter point is minor though. If we're confident that we only need it for manifests I think that's OK but then we're in an awkward spot for future evolutions of the format where we need higher fidelity for leaf entries on the format versdion (because we'd have 2 places, and more writer rules etc).
There was a problem hiding this comment.
Thanks to @amogh-jahagirdar for pointing out the thread where null/0 were discussed. I think it came down to this comment:
The enum [
WriterFormatVersion] hurts here:
- Likely call shape is >= 4 checks (gate v4 metadata behavior); enum forces .id() >= 4 at every site.
- WriterFormatVersion.fromId throws on unknown ids — a v4 reader will crash deserializing a v5-written leaf. Plain int + >= 4 gate is forward-compatible.
- PRE_V4 as a constant ages awkwardly when v5 ships.
My replies:
- Can't we define and use
greaterThan(int)to handle this? We don't have to get the ID and compare directly. - Similar to 1,
fromIdcan returnPRE_V4orUNKNOWNfor null - I liked the earlier discussion that used "unknown" to solve this issue. For that we can use either ID 0 or null.
I think either 0 or null would be reasonable. I don't think that these points lead me to conclude that 0 is better. I think it's okay either way and that this is probably the smallest concern. I do agree that we should use 4 to represent v4, 5 for v5, etc.
There was a problem hiding this comment.
From a metadata footprint perspective I think 0 and null are effectively the same here.
I agree. We can go either way and it's probably fine to do that.
Or on upgrade just knowing which files were from the previous version of the table.
This is a good point. +1 for including this in TrackedFile.
There was a problem hiding this comment.
Thanks for the feedback @rdblue and @amogh-jahagirdar !
The points discussed so far:
-
I'll open a follow-up PR to rename the field to
format_version -
required + 0 or optional + null for pre-v4 written files
I don't have a strong opinion, either seems fine. Maybe required + 0 forces the users/writers to think through more what value to write? -
Will keep the field in
TrackedFileinstead of moving toManifestInfo -
enum instead of int
I think forward-compatibility is still an open question for the enum representation. For instance in a V4 reader the enum could look like this:
enum FormatVersion {
PRE_V4(0),
V4(4);
...
public static FormatVersion fromId(int id) { ... }
}
What would be the expectation of reading metadata written by V5 writers, like FormatVersion.fromId(5);?
Should we keep an enum value for a later unknown version? It'd be "lossy" as it won't necessary keep the original value. We could have a class instead of an enum that can also keep the original value and also provide fromId(int) and greaterThan(int) functions.
There was a problem hiding this comment.
I think we should use required and 0=unknown. That seems clean and gives us a nice invariant for implementations.
Introduce writer_format_version field into TrackedFile.
One unrelated on-liner adjustment: in
TrackedFile.internalSet()incoming record_count is converted into object Long, however the member itself is primitive long. Change the conversion to primitive.