chore: Setup project layout#1

Merged
Fokko merged 7 commits into apache:main from Xuanwo:setup
Jul 21, 2023

Conversation

@Xuanwo
Member

@Xuanwo Xuanwo commented Jul 21, 2023

This PR sets up the basic project layout.

Please use squash merge.

Xuanwo added 3 commits July 21, 2023 14:40
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo
Member Author

Xuanwo commented Jul 21, 2023

cc @JanKaul, @liurenjie1024 for review.

Xuanwo added 2 commits July 21, 2023 14:50
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Contributor

@liurenjie1024 liurenjie1024 left a comment

Others LGTM

Comment thread .asf.yaml
Signed-off-by: Xuanwo <github@xuanwo.io>
Contributor

@liurenjie1024 liurenjie1024 left a comment

Thanks!

Comment thread .asf.yaml
Comment thread .github/dependabot.yml Outdated
Signed-off-by: Xuanwo <github@xuanwo.io>
@Fokko Fokko merged commit bd435b2 into apache:main Jul 21, 2023
@Xuanwo Xuanwo deleted the setup branch July 21, 2023 08:47
himadripal pushed a commit to himadripal/iceberg-rust that referenced this pull request Apr 17, 2024
fix connection pool issue for sql catalog
hareshkh pushed a commit to hareshkh/iceberg-rust that referenced this pull request Feb 17, 2026
* try publish

* try

* fix

* use make

* try docker up

* machine exec

* shit
greedAuguria added a commit to auguria-io/iceberg-rust that referenced this pull request May 25, 2026
… not REE

The RecordBatchTransformer back-fills identity-partition columns missing
from a data file (Iceberg column-projection rule apache#1, commit 663d7a3) and
materialized those constants as RunEndEncoded "for efficiency". But files
that physically carry the column emit the flat type via PassThrough, so the
per-file output schema diverged: `Utf8` for present files vs
`RunEndEncoded(Utf8)` for back-filled ones.

When a DataFusion compaction scans a mix of both (a table with schema
evolution / partial column presence), it concatenates batches across files
and panics:

    Arrow error: It is not possible to concatenate arrays of different data
    types (Utf8, RunEndEncoded("run_ends": Int32, "values": Utf8))

Observed live on golden `dev_golden_ingest_e1_search_events_cisco_asa`
(plan-46 Task 7): the first real `dryRun=false` compaction failed before
commit (no data harm). The existing REE->flat decodes (ad707b3, 7134847)
only run at the WRITE boundary, downstream of this read/merge concat, so
they never see it.

Fix: an identity-partition constant exists in the table schema, so emit it
in the canonical schema type (e.g. `Utf8`) for both the output schema field
and the `ColumnSource::Add` op. Back-filled and read-through files then share
an identical output schema and concat cleanly. Virtual/metadata constants
(`_file`, never in the schema, always back-filled -> already self-consistent)
keep RunEndEncoded, which is a genuine memory win there.

Tests: new `test_partition_constant_schema_matches_passthrough` asserts a
back-filled file and a read-through file produce identical output schemas
(the precondition `concat_batches` enforces); updated
`test_virtual_partition_column_uses_manifest_value` to assert flat `Utf8`.
All 19 record_batch_transformer tests pass.
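
The schema mismatch described in this commit message can be illustrated with a minimal, stdlib-only sketch. It does not use the real arrow-rs or iceberg-rust types: `DataType`, `ColumnKind`, `backfill_type`, and `concat_types` are hypothetical stand-ins that only model the rule "a back-filled identity-partition constant takes the schema's flat type, while a virtual/metadata constant keeps RunEndEncoded".

```rust
// Simplified stand-in for Arrow's data types (assumption: not the arrow-rs enum).
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    RunEndEncoded(Box<DataType>),
}

// Hypothetical classification of a constant column added by the transformer.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnKind {
    // Identity-partition column that exists in the table schema.
    PartitionConstant,
    // Virtual/metadata column such as `_file`, never in the table schema.
    VirtualMetadata,
}

// Type emitted when a constant is back-filled for a file lacking the column.
fn backfill_type(kind: ColumnKind, schema_type: &DataType) -> DataType {
    match kind {
        // Match the type a read-through file would emit for the same column.
        ColumnKind::PartitionConstant => schema_type.clone(),
        // Always back-filled, so the encoded form is already self-consistent.
        ColumnKind::VirtualMetadata => {
            DataType::RunEndEncoded(Box::new(schema_type.clone()))
        }
    }
}

// Models the precondition of batch concatenation: every input has one type.
fn concat_types(types: &[DataType]) -> Result<DataType, String> {
    let first = types.first().ok_or_else(|| "no inputs".to_string())?;
    for t in &types[1..] {
        if t != first {
            return Err(format!("cannot concatenate {:?} with {:?}", first, t));
        }
    }
    Ok(first.clone())
}
```

In this model, mixing a read-through `Utf8` column with a `RunEndEncoded(Utf8)` back-fill makes `concat_types` return an error, whereas `backfill_type(ColumnKind::PartitionConstant, &DataType::Utf8)` yields `Utf8` and the two files concatenate.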
greedAuguria added a commit to auguria-io/iceberg-rust that referenced this pull request May 25, 2026
Data files lacking embedded Iceberg field ids were assigned field ids
positionally (physical column N -> field-id N+1) and projected positionally.
That is only correct when a file's physical column order matches field-id
order. A schema-evolving writer that omits a column mid-schema and appends
later ones violates it: on golden cisco_asa, files omit product_name
(field-id 22) and append auguria_event_timestamp, so physical slot 21 (where
field-id 22 falls positionally) holds auguria_event_timestamp -- epoch-millis
were served as the product_name identity-partition value. Lossless but
silently corrupt partitioning (compaction split one table into ~9,787 bogus
product_name=<epoch-millis> partitions). Plan-46 Task 14.

Fix (crates/iceberg/src/arrow/reader.rs):
- assign_field_ids_by_name: for files without embedded ids and without an
  explicit schema.name-mapping.default, assign ids by NAME from the task
  schema, recursing through struct/list/map. Replaces the positional
  add_fallback_field_ids_to_arrow_schema (removed).
- Projection is now always field-id-based: ids are present either embedded
  (Branch 1) or name-assigned (Branch 2/3), so the position-based projection
  mask is no longer used in the read path.
- with_partition back-fill present-set is derived from the name-mapped arrow
  schema when ids aren't embedded, so omitted identity-partition columns stay
  eligible for manifest back-fill (rule apache#1) instead of being falsely reported
  present at an appended column's slot.
- New regression test
  test_read_parquet_without_field_ids_omitted_identity_partition_backfills_from_manifest.

Name-based resolution degrades to the same result as positional when order
does match field-id order, so nothing is lost. All 81 arrow:: unit tests pass;
clippy clean. Validated end-to-end on golden cisco_asa (465->1, lossless
1,717,975, product_name=cisco_asa) and cisco_asa_firewall (464->1, lossless
1,710,524, product_name=cisco_asa_firewall).
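
The difference between positional and name-based id assignment described in this commit message can be sketched as follows. This is a stdlib-only illustration, not the code in `crates/iceberg/src/arrow/reader.rs`: the helpers `assign_ids_by_position` and `assign_ids_by_name` are hypothetical, schemas are flat (no struct/list/map recursion), and the three-column table used in the usage note is made up for brevity.

```rust
use std::collections::HashMap;

// Positional rule as described above: physical column N gets field id N + 1.
fn assign_ids_by_position(file_columns: &[&str]) -> Vec<(String, i32)> {
    file_columns
        .iter()
        .enumerate()
        .map(|(i, name)| (name.to_string(), i as i32 + 1))
        .collect()
}

// Name-based rule: look each physical column up in the table schema by name.
// Columns the schema does not know get `None` instead of a guessed id.
fn assign_ids_by_name(
    file_columns: &[&str],
    table_schema: &[(&str, i32)],
) -> Vec<(String, Option<i32>)> {
    let by_name: HashMap<&str, i32> = table_schema.iter().copied().collect();
    file_columns
        .iter()
        .map(|name| (name.to_string(), by_name.get(name).copied()))
        .collect()
}
```

With a table schema of `[("ts", 1), ("product_name", 2), ("event_ts", 3)]` and a file that physically holds only `["ts", "event_ts"]`, the positional rule gives `event_ts` id 2, which is `product_name`'s id, so its values would be read as the partition column. The name-based rule gives `event_ts` id 3 and leaves id 2 absent from the file, which keeps `product_name` eligible for back-fill from the manifest. When the file's column order matches field-id order, both rules assign the same ids.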
3 participants