Skip to content

API: Decouple default test version range from MAX_FORMAT_VERSION#16677

Open
stevenzwu wants to merge 1 commit into
apache:mainfrom
stevenzwu:core/test-helpers-incubation-versions
Open

API: Decouple default test version range from MAX_FORMAT_VERSION#16677
stevenzwu wants to merge 1 commit into
apache:mainfrom
stevenzwu:core/test-helpers-incubation-versions

Conversation

@stevenzwu

@stevenzwu stevenzwu commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary

  • During v4 incubation, narrow the default version-parameterized test ranges (ALL_VERSIONS, V2_AND_ABOVE, V3_AND_ABOVE) so they no longer cover v4 by default. Those ranges drive ~25 core tests plus several Flink tests, and exposing them to incomplete v4 read/write paths during incubation produces spurious failures.
  • Add DEFAULT_MAX_TEST_FORMAT_VERSION = 3 and rebase the three derived ranges on it. MAX_FORMAT_VERSION stays at 4 so bound tests in TestTableMetadata, TestSchema, and TestFormatVersions upgrade tests continue to validate against v4.
  • Bump DEFAULT_MAX_TEST_FORMAT_VERSION to 4 once v4 read/write covers all features the parameterized suite exercises (equality deletes, etc.).

Test plan

  • :iceberg-api:spotlessCheck passes
  • :iceberg-api:compileTestJava and :iceberg-core:compileTestJava pass
  • TestFormatVersions, TestRowLineageMetadata, TestManifestWriterVersions pass
    • TestRowLineageMetadata correctly drops from 22 to 11 cases (v4 no longer parameterized)
    • TestFormatVersions unchanged (upgrade-target tests still run against v4)
    • TestManifestWriterVersions explicit V4 cases still run

🤖 Generated with Claude Code

@github-actions github-actions Bot added the API label Jun 3, 2026
// existing parameterized test suite is not pinned to an incomplete code path. Bump this to
// MAX_FORMAT_VERSION when the incubating version's read/write code paths cover every feature
// exercised by the parameterized suite.
public static final int MAX_PRODUCTION_VERSION = 3;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open to name suggestion. another one is MAX_STABLE_FORMAT_VERSION

@stevenzwu stevenzwu Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or MAX_PARAMETERIZED_TEST_FORMAT_VERSION or DEFAULT_MAX_TEST_FORMAT_VERSION

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My vote is DEFAULT_MAX_TEST_FORMAT_VERSION. It stays parallel with the sibling MAX_FORMAT_VERSION (keeps the _FORMAT_VERSION suffix), and DEFAULT reads naturally as "the default upper bound for these ranges, which explicit v4 tests can still opt past."

I'd lean away from MAX_PRODUCTION_VERSION / MAX_STABLE_FORMAT_VERSION: sitting next to core's TableMetadata.SUPPORTED_TABLE_FORMAT_VERSION = 4, a "production/stable = 3" constant reads like a product-level claim about what Iceberg ships rather than a test-suite knob.

MAX_PARAMETERIZED_TEST_FORMAT_VERSION is also fine and very explicit, but a few consumers iterate the ranges in plain for loops / assumeThat(...) guards (e.g. TestSparkMetadataColumns, TestFlinkUpsert, TestScansAndSchemaEvolution) rather than via @FieldSource, so "parameterized" slightly undersells the scope. DEFAULT_MAX_TEST_FORMAT_VERSION avoids that.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on DEFAULT_MAX_TEST_FORMAT_VERSION

// existing parameterized test suite is not pinned to an incomplete code path. Bump this to
// MAX_FORMAT_VERSION when the incubating version's read/write code paths cover every feature
// exercised by the parameterized suite.
public static final int MAX_PRODUCTION_VERSION = 3;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nuance on narrowing all three ranges uniformly: the incubation rationale (incomplete v4 read/write paths cause spurious failures) applies to tests that read/write data, but a few @FieldSource(ALL_VERSIONS) consumers are pure metadata/JSON round-trips that never touch those paths. For example TestLoadTableResponseParser.roundTripSerdeV3andHigher is parameterized over ALL_VERSIONS with assumeThat(formatVersion).isGreaterThanOrEqualTo(3), so today it already runs (and passes in CI) for both v3 and v4; after this change it silently drops to v3-only.

Is dropping v4 from these serde-only tests intended, or worth keeping an explicit v4 case for them (the way TestManifestWriterVersions keeps dedicated V4_FORMATS methods alongside its ALL_VERSIONS ones)? Not blocking - just flagging that the blanket narrowing also removes v4 coverage that isn't on the incomplete read/write path.


public static final List<Integer> ALL_VERSIONS =
IntStream.rangeClosed(1, MAX_FORMAT_VERSION).boxed().collect(Collectors.toUnmodifiableList());
IntStream.rangeClosed(1, MAX_PRODUCTION_VERSION)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of change all_versions, I think we can one for UP_TO_V3 as well, so that parametrized caller is aware of what's the coverage. We can move some of previous tests to use UP_TO_V3 while allow v4 targeted tests to continue to use ALL_VERSIONS (like TestFormatVersions)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workable but the migration cost is significant. The three constants are referenced by 68 test files directly and another 89 files inherit from one of the four core bases that use them (TestBase, DeleteFileIndexTestBase, PartitionStatsHandlerTestBase, ScanPlanningAndReportingTestBase) — 157 distinct test classes total, spanning core, spark, flink, data, kafka-connect, parquet, orc, and api.

Opt-in switching means flipping each of those (or at minimum the four base classes) to UP_TO_V3 to preserve today's "no v4 in default coverage" state. That's a wide diff up front, and a similarly wide diff later when v4 is ready and the goal is to get them back on ALL_VERSIONS.

The default-lag approach in this PR keeps both endpoints cheap — one-line gate today, one-line revert when v4 read/write covers everything the parameterized suite exercises. The comment on the constant flags it as a transient scaffold so it doesn't outlive its purpose.

UP_TO_V3 could still be added alongside as an opt-in for new tests that want to pin coverage explicitly; that's purely additive and works with either approach.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, my mental model splits the parameterized callers into two buckets:

  1. Version-aware tests (e.g. TestFormatVersions, TestMergeAppend): graduating their coverage from v3 to v4 feels like it should be a deliberate, reviewable decision. Ideally it shall happen in the same PR that signals "v4 integration concludes for the given test coverage. This also help highlight the intent to the reviewer.

  2. Version-agnostic tests: these read to me as "exercise this against every format version we ship." If the original contract was "all available versions," I'd lean toward keeping that contract intact and letting v4 join the set when it's ready, rather than redefining ALL_VERSIONS and silently dropping coverage we already have today. The serde round-trip case roundTripSerdeV3andHigher flagged above is a good example.

While v4 is being incubated, the default version-parameterized test
ranges (ALL_VERSIONS, V2_AND_ABOVE, V3_AND_ABOVE) should not yet include
v4: they drive ~25 core tests plus several Flink tests, and exposing
those tests to incomplete v4 read/write paths during incubation creates
spurious failures.

Add DEFAULT_MAX_TEST_FORMAT_VERSION = 3 and rebase the three derived
ranges on it. MAX_FORMAT_VERSION stays at 4 so bound tests
(TestTableMetadata, TestSchema, TestFormatVersions upgrade tests)
continue to validate against v4.

Bump DEFAULT_MAX_TEST_FORMAT_VERSION to 4 once v4 read/write covers all
features the parameterized suite exercises (equality deletes, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stevenzwu stevenzwu force-pushed the core/test-helpers-incubation-versions branch from 640a96e to 917c81d Compare June 4, 2026 19:46
@stevenzwu stevenzwu mentioned this pull request Jun 6, 2026
3 tasks
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 7, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 11, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 11, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

// The highest format version covered by the default version-parameterized test ranges below.
// While a new format version is being incubated, this lags MAX_FORMAT_VERSION so that the
// existing parameterized test suite is not pinned to an incomplete code path. Bump this to

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bump it to MAX_FORMAT_VERSION or simply remove? Is there any point keeping it further?

stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 13, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 14, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 14, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 16, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 21, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 22, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 22, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 22, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 22, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 22, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 22, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 26, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 26, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 27, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
stevenzwu added a commit to stevenzwu/iceberg that referenced this pull request Jun 28, 2026
…add matching reader

V4Writer and V4DeleteWriter now emit content_entry Parquet rows via
TrackedFileWrapper/ContentEntryAdapter rather than the legacy manifest_entry
Avro shape. ContentEntryReader and ContentEntryManifestReaderAdapter project
content_entry rows back to ManifestEntry<DataFile/DeleteFile> so all downstream
consumers (ManifestGroup, MergingSnapshotProducer rewrite paths) work
unchanged.

Read-path dispatch in ManifestFiles is layered:
1. Avro manifests are always legacy (no file inspection).
2. Snapshot-tree callers thread an Integer writerFormatVersion hint through
   the new package-private read overloads: 1 routes to ContentEntryReader,
   0 routes to legacy.
3. Callers without a hint (tests writing-then-reading, ad-hoc tooling) fall
   back to inspecting the Parquet footer schema for field id 134 (content_type)
   or 147 (tracking). The footer read is delegated to InternalParquet via
   DynMethods so core has no compile-time dependency on iceberg-parquet.

Key design choices:
- TrackedFile.schemaWithContentStats omits partition and content_stats when
  their struct types are empty (Parquet rejects empty groups).
- TrackedFileWrapper uses hasPartition/hasContentStats flags to map positions
  dynamically when either optional group is absent.
- V4Writer.add(DataFile) bypasses Delegates.suppressFirstRowId so per-entry
  firstRowId is stored in the tracking struct rather than at manifest level.
- ContentEntryReader.setEntry uses wrapAppendPreservingFirstRowId for ADDED
  entries so firstRowId read from the tracking struct is not re-suppressed.
- ContentEntryAdapter preserves firstRowId for EXISTING entries so uncommitted
  manifests can round-trip per-entry row IDs.
- ContentEntryManifestReaderAdapter applies the same committed/uncommitted
  firstRowId nullification logic as ManifestReader.idAssigner.
- ContentEntryManifestReaderAdapter.iterator tracks ordinal position and sets
  fileOrdinal and manifestLocation on each BaseFile to match Avro reader behavior.
- Parquet.readSchema(InputFile) is a new public helper that returns just the
  Iceberg-converted file schema; InternalParquet.readSchema delegates to it
  for the DynMethods entry point.
- v4 spec forbids content_type=POSITION_DELETES (PR apache#16025); three
  TestManifestReader tests that write standalone position-delete files / DV
  delete files are guarded with assumeThat isLessThan(4) and will be removed
  once PR apache#16677 (or its successor) gates v4 out of the broad parameterized
  test suite during incubation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

5 participants