API: Decouple default test version range from MAX_FORMAT_VERSION#16677
API: Decouple default test version range from MAX_FORMAT_VERSION#16677stevenzwu wants to merge 1 commit into
Conversation
| // existing parameterized test suite is not pinned to an incomplete code path. Bump this to | ||
| // MAX_FORMAT_VERSION when the incubating version's read/write code paths cover every feature | ||
| // exercised by the parameterized suite. | ||
| public static final int MAX_PRODUCTION_VERSION = 3; |
There was a problem hiding this comment.
open to name suggestion. another one is MAX_STABLE_FORMAT_VERSION
There was a problem hiding this comment.
or MAX_PARAMETERIZED_TEST_FORMAT_VERSION or DEFAULT_MAX_TEST_FORMAT_VERSION
There was a problem hiding this comment.
My vote is DEFAULT_MAX_TEST_FORMAT_VERSION. It stays parallel with the sibling MAX_FORMAT_VERSION (keeps the _FORMAT_VERSION suffix), and DEFAULT reads naturally as "the default upper bound for these ranges, which explicit v4 tests can still opt past."
I'd lean away from MAX_PRODUCTION_VERSION / MAX_STABLE_FORMAT_VERSION: sitting next to core's TableMetadata.SUPPORTED_TABLE_FORMAT_VERSION = 4, a "production/stable = 3" constant reads like a product-level claim about what Iceberg ships rather than a test-suite knob.
MAX_PARAMETERIZED_TEST_FORMAT_VERSION is also fine and very explicit, but a few consumers iterate the ranges in plain for loops / assumeThat(...) guards (e.g. TestSparkMetadataColumns, TestFlinkUpsert, TestScansAndSchemaEvolution) rather than via @FieldSource, so "parameterized" slightly undersells the scope. DEFAULT_MAX_TEST_FORMAT_VERSION avoids that.
There was a problem hiding this comment.
+1 on DEFAULT_MAX_TEST_FORMAT_VERSION
| // existing parameterized test suite is not pinned to an incomplete code path. Bump this to | ||
| // MAX_FORMAT_VERSION when the incubating version's read/write code paths cover every feature | ||
| // exercised by the parameterized suite. | ||
| public static final int MAX_PRODUCTION_VERSION = 3; |
There was a problem hiding this comment.
One nuance on narrowing all three ranges uniformly: the incubation rationale (incomplete v4 read/write paths cause spurious failures) applies to tests that read/write data, but a few @FieldSource(ALL_VERSIONS) consumers are pure metadata/JSON round-trips that never touch those paths. For example TestLoadTableResponseParser.roundTripSerdeV3andHigher is parameterized over ALL_VERSIONS with assumeThat(formatVersion).isGreaterThanOrEqualTo(3), so today it already runs (and passes in CI) for both v3 and v4; after this change it silently drops to v3-only.
Is dropping v4 from these serde-only tests intended, or worth keeping an explicit v4 case for them (the way TestManifestWriterVersions keeps dedicated V4_FORMATS methods alongside its ALL_VERSIONS ones)? Not blocking - just flagging that the blanket narrowing also removes v4 coverage that isn't on the incomplete read/write path.
|
|
||
| public static final List<Integer> ALL_VERSIONS = | ||
| IntStream.rangeClosed(1, MAX_FORMAT_VERSION).boxed().collect(Collectors.toUnmodifiableList()); | ||
| IntStream.rangeClosed(1, MAX_PRODUCTION_VERSION) |
There was a problem hiding this comment.
instead of change all_versions, I think we can one for UP_TO_V3 as well, so that parametrized caller is aware of what's the coverage. We can move some of previous tests to use UP_TO_V3 while allow v4 targeted tests to continue to use ALL_VERSIONS (like TestFormatVersions)
There was a problem hiding this comment.
Workable but the migration cost is significant. The three constants are referenced by 68 test files directly and another 89 files inherit from one of the four core bases that use them (TestBase, DeleteFileIndexTestBase, PartitionStatsHandlerTestBase, ScanPlanningAndReportingTestBase) — 157 distinct test classes total, spanning core, spark, flink, data, kafka-connect, parquet, orc, and api.
Opt-in switching means flipping each of those (or at minimum the four base classes) to UP_TO_V3 to preserve today's "no v4 in default coverage" state. That's a wide diff up front, and a similarly wide diff later when v4 is ready and the goal is to get them back on ALL_VERSIONS.
The default-lag approach in this PR keeps both endpoints cheap — one-line gate today, one-line revert when v4 read/write covers everything the parameterized suite exercises. The comment on the constant flags it as a transient scaffold so it doesn't outlive its purpose.
UP_TO_V3 could still be added alongside as an opt-in for new tests that want to pin coverage explicitly; that's purely additive and works with either approach.
There was a problem hiding this comment.
Fair, my mental model splits the parameterized callers into two buckets:
-
Version-aware tests (e.g. TestFormatVersions, TestMergeAppend): graduating their coverage from v3 to v4 feels like it should be a deliberate, reviewable decision. Ideally it shall happen in the same PR that signals "v4 integration concludes for the given test coverage. This also help highlight the intent to the reviewer.
-
Version-agnostic tests: these read to me as "exercise this against every format version we ship." If the original contract was "all available versions," I'd lean toward keeping that contract intact and letting v4 join the set when it's ready, rather than redefining ALL_VERSIONS and silently dropping coverage we already have today. The serde round-trip case
roundTripSerdeV3andHigherflagged above is a good example.
While v4 is being incubated, the default version-parameterized test ranges (ALL_VERSIONS, V2_AND_ABOVE, V3_AND_ABOVE) should not yet include v4: they drive ~25 core tests plus several Flink tests, and exposing those tests to incomplete v4 read/write paths during incubation creates spurious failures. Add DEFAULT_MAX_TEST_FORMAT_VERSION = 3 and rebase the three derived ranges on it. MAX_FORMAT_VERSION stays at 4 so bound tests (TestTableMetadata, TestSchema, TestFormatVersions upgrade tests) continue to validate against v4. Bump DEFAULT_MAX_TEST_FORMAT_VERSION to 4 once v4 read/write covers all features the parameterized suite exercises (equality deletes, etc.). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
640a96e to
917c81d
Compare
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
||
| // The highest format version covered by the default version-parameterized test ranges below. | ||
| // While a new format version is being incubated, this lags MAX_FORMAT_VERSION so that the | ||
| // existing parameterized test suite is not pinned to an incomplete code path. Bump this to |
There was a problem hiding this comment.
Bump it to MAX_FORMAT_VERSION or simply remove? Is there any point keeping it further?
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add matching reader V4Writer and V4DeleteWriter now emit content_entry Parquet rows via TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work unchanged. Read-path dispatch in ManifestFiles is layered: 1. Avro manifests are always legacy (no file inspection). 2. Snapshot-tree callers thread an Integer writerFormatVersion hint through the new package-private read overloads: 1 routes to ContentEntryReader, 0 routes to legacy. 3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall back to inspecting the Parquet footer schema for field id 134 (content_type) or 147 (tracking). The footer read is delegated to InternalParquet via DynMethods so core has no compile-time dependency on iceberg-parquet. Key design choices: - TrackedFile.schemaWithContentStats omits partition and content_stats when their struct types are empty (Parquet rejects empty groups). - TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions dynamically when either optional group is absent. - V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry firstRowId is stored in the tracking struct rather than at manifest level. - ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED entries so firstRowId read from the tracking struct is not re-suppressed. - ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted manifests can round-trip per-entry row IDs. - ContentEntryManifestReaderAdapter applies the same committed/uncommitted firstRowId nullification logic as ManifestReader.idAssigner. - ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior. - Parquet.readSchema(InputFile) is a new public helper that returns just the Iceberg-converted file schema; InternalParquet.readSchema delegates to it for the DynMethods entry point. - v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three TestManifestReader tests that write standalone position-delete files / DV delete files are guarded with assumeThat isLessThan(4) and will be removed once PR apache#16677 (or its successor) gates v4 out of the broad parameterized test suite during incubation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
ALL_VERSIONS,V2_AND_ABOVE,V3_AND_ABOVE) so they no longer cover v4 by default. Those ranges drive ~25 core tests plus several Flink tests, and exposing them to incomplete v4 read/write paths during incubation produces spurious failures.DEFAULT_MAX_TEST_FORMAT_VERSION = 3and rebase the three derived ranges on it.MAX_FORMAT_VERSIONstays at 4 so bound tests inTestTableMetadata,TestSchema, andTestFormatVersionsupgrade tests continue to validate against v4.DEFAULT_MAX_TEST_FORMAT_VERSIONto 4 once v4 read/write covers all features the parameterized suite exercises (equality deletes, etc.).Test plan
:iceberg-api:spotlessCheckpasses:iceberg-api:compileTestJavaand:iceberg-core:compileTestJavapassTestFormatVersions,TestRowLineageMetadata,TestManifestWriterVersionspassTestRowLineageMetadatacorrectly drops from 22 to 11 cases (v4 no longer parameterized)TestFormatVersionsunchanged (upgrade-target tests still run against v4)TestManifestWriterVersionsexplicit V4 cases still run🤖 Generated with Claude Code