Skip to content

feat(avro): apply column default values when reading missing fields (3/4)#800

Draft
huan233usc wants to merge 1 commit into
apache:mainfrom
huan233usc:feat/default-values-read-avro
Draft

feat(avro): apply column default values when reading missing fields (3/4)#800
huan233usc wants to merge 1 commit into
apache:mainfrom
huan233usc:feat/default-values-read-avro

Conversation

@huan233usc

Copy link
Copy Markdown
Contributor

What

Part 3 of 4 of Iceberg v3 column default-value support (POC #731), built on the
schema layer (#746) and the Parquet read path (#792).

When a column is present in the read (table) schema but absent from an Avro data
file — because the column was added after those rows were written — fill it with
the column's v3 initial-default instead of null.

Changes

  • arrow/literal_util: add AppendDefaultToBuilder, which appends a
    Literal once to an Arrow ArrayBuilder (for the row-by-row Avro decode
    paths). Like MakeDefaultArray, it targets the storage type for an extension
    builder (arrow.uuid), since Scalar::CastTo has no kernel for extension
    types.
  • Avro projection (avro_schema_util.cc): when a field is missing from the
    file and carries an initial-default, project it as
    FieldProjection::Kind::kDefault, mirroring the generic / Parquet paths.
  • Avro decode (avro_data_util.cc, avro_direct_decoder.cc): materialize
    the kDefault branch via AppendDefaultToBuilder.

Tests

  • literal_util_test: AppendDefaultToBuilder appends a value and casts to the
    builder type.
  • avro_data_test: AppendDatumToBuilder fills missing required and optional
    default fields.
  • avro_test: end-to-end — write an Avro file with an old schema, then read it
    through ReaderFactoryRegistry with an evolved schema carrying defaults
    (ReadMissingFieldsWithDefaults).

Stack

  1. feat(schema): represent, serialize and validate v3 column default values (1/4) #746 — schema: represent / serialize / validate (merged)
  2. feat(parquet): apply column default values when reading missing fields (2/4) #792 — read path: Parquet (merged)
  3. this PR — read path: Avro
  4. schema evolution: addColumn / updateColumnDefault (feat: support v3 column default values in UpdateSchema (3/4) #793)

Draft while the earlier PRs in the stack settle.

…3/4)

When a column is present in the read schema but missing from an Avro data file
(written before the column existed), fill it with the column's v3
initial-default instead of null. Reuses the shared arrow/literal_util
materializer (merged in apache#792) and adds AppendDefaultToBuilder for the row-by-row
Avro decode paths, plus a kDefault projection branch in the Avro schema/data
projection.

Part 3 of the v3 column-default-values work (POC apache#731), built on the schema
support in apache#746 and the Parquet read path in apache#792.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant