chore: Setup project layout #1
Merged
cc @JanKaul, @liurenjie1024 for review.
Signed-off-by: Xuanwo <github@xuanwo.io>
Fokko
approved these changes
Jul 21, 2023
Signed-off-by: Xuanwo <github@xuanwo.io>
himadripal
pushed a commit
to himadripal/iceberg-rust
that referenced
this pull request
Apr 17, 2024
fix connection pool issue for sql catalog
hareshkh
pushed a commit
to hareshkh/iceberg-rust
that referenced
this pull request
Feb 17, 2026
* try publish * try * fix * use make * try docker up * machine exec * shit
15 tasks
greedAuguria
added a commit
to auguria-io/iceberg-rust
that referenced
this pull request
May 25, 2026
… not REE

The RecordBatchTransformer back-fills identity-partition columns missing from a data file (Iceberg column-projection rule apache#1, commit 663d7a3) and materialized those constants as RunEndEncoded "for efficiency". But files that physically carry the column emit the flat type via PassThrough, so the per-file output schema diverged: `Utf8` for present files vs `RunEndEncoded(Utf8)` for back-filled ones. When a DataFusion compaction scans a mix of both (a table with schema evolution / partial column presence), it concatenates batches across files and panics:

    Arrow error: It is not possible to concatenate arrays of different data types (Utf8, RunEndEncoded("run_ends": Int32, "values": Utf8))

Observed live on golden `dev_golden_ingest_e1_search_events_cisco_asa` (plan-46 Task 7): the first real `dryRun=false` compaction failed before commit (no data harm). The existing REE->flat decodes (ad707b3, 7134847) only run at the WRITE boundary, downstream of this read/merge concat, so they never see it.

Fix: an identity-partition constant exists in the table schema, so emit it in the canonical schema type (e.g. `Utf8`) for both the output schema field and the `ColumnSource::Add` op. Back-filled and read-through files then share an identical output schema and concat cleanly. Virtual/metadata constants (`_file`, never in the schema, always back-filled -> already self-consistent) keep RunEndEncoded, which is a genuine memory win there.

Tests: new `test_partition_constant_schema_matches_passthrough` asserts a back-filled file and a read-through file produce identical output schemas (the precondition `concat_batches` enforces); updated `test_virtual_partition_column_uses_manifest_value` to assert flat `Utf8`. All 19 record_batch_transformer tests pass.
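The schema divergence described in this commit message can be illustrated with a small std-only model. This is a hedged sketch, not code from iceberg-rust or Arrow: `ColumnType`, `Source`, `output_type_before_fix`, `output_type_after_fix`, and `check_concat` are hypothetical names, and `check_concat` only stands in for the "all batches share one type" precondition that the message attributes to `concat_batches`.

```rust
/// Toy stand-in for an Arrow data type (hypothetical, not the arrow crate).
#[derive(Debug, Clone, PartialEq)]
enum ColumnType {
    Utf8,
    RunEndEncoded(Box<ColumnType>),
}

/// Where a column's values came from for a given data file.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source {
    /// The file physically carries the column.
    ReadFromFile,
    /// The column is missing and is back-filled as a constant.
    BackFilled,
}

/// Behaviour before the fix, as described: back-filled constants were
/// run-end encoded while read-through columns stayed flat.
fn output_type_before_fix(source: Source) -> ColumnType {
    match source {
        Source::ReadFromFile => ColumnType::Utf8,
        Source::BackFilled => ColumnType::RunEndEncoded(Box::new(ColumnType::Utf8)),
    }
}

/// Behaviour after the fix, as described: both paths emit the canonical
/// schema type.
fn output_type_after_fix(_source: Source) -> ColumnType {
    ColumnType::Utf8
}

/// Assumed precondition for concatenation: every batch has the same type.
fn check_concat(types: &[ColumnType]) -> Result<(), String> {
    match types.split_first() {
        None => Ok(()),
        Some((first, rest)) => match rest.iter().find(|t| *t != first) {
            Some(other) => Err(format!("cannot concatenate {:?} with {:?}", first, other)),
            None => Ok(()),
        },
    }
}
```

Mapping a mix of `Source::ReadFromFile` and `Source::BackFilled` through `output_type_before_fix` yields two different types, so `check_concat` returns an error; with `output_type_after_fix` the types agree and the check passes.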
greedAuguria
added a commit
to auguria-io/iceberg-rust
that referenced
this pull request
May 25, 2026
Data files lacking embedded Iceberg field ids were assigned field ids positionally (physical column N -> field-id N+1) and projected positionally. That is only correct when a file's physical column order matches field-id order. A schema-evolving writer that omits a column mid-schema and appends later ones violates it: on golden cisco_asa, files omit product_name (field-id 22) and append auguria_event_timestamp, so physical slot 21 (where field-id 22 falls positionally) holds auguria_event_timestamp -- epoch-millis were served as the product_name identity-partition value. Lossless but silently corrupt partitioning (compaction split one table into ~9,787 bogus product_name=<epoch-millis> partitions). Plan-46 Task 14.

Fix (crates/iceberg/src/arrow/reader.rs):

- assign_field_ids_by_name: for files without embedded ids and without an explicit schema.name-mapping.default, assign ids by NAME from the task schema, recursing through struct/list/map. Replaces the positional add_fallback_field_ids_to_arrow_schema (removed).
- Projection is now always field-id-based: ids are present either embedded (Branch 1) or name-assigned (Branch 2/3), so the position-based projection mask is no longer used in the read path.
- with_partition back-fill present-set is derived from the name-mapped arrow schema when ids aren't embedded, so omitted identity-partition columns stay eligible for manifest back-fill (rule apache#1) instead of being falsely reported present at an appended column's slot.
- New regression test test_read_parquet_without_field_ids_omitted_identity_partition_backfills_from_manifest.

Name-based resolution degrades to the same result as positional when order does match field-id order, so nothing is lost. All 81 arrow:: unit tests pass; clippy clean. Validated end-to-end on golden cisco_asa (465->1, lossless 1,717,975, product_name=cisco_asa) and cisco_asa_firewall (464->1, lossless 1,710,524, product_name=cisco_asa_firewall).
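The difference between positional and name-based id assignment described above can be sketched with the standard library alone. This is an illustrative model under stated assumptions, not the code in crates/iceberg/src/arrow/reader.rs: `assign_ids_by_position`, `assign_ids_by_name`, and `example_schema` are hypothetical helpers, the three-column schema is invented for brevity, and nested struct/list/map recursion is left out.

```rust
use std::collections::HashMap;

/// Positional fallback: physical column N (0-based) gets field id N + 1.
fn assign_ids_by_position(file_columns: &[&str]) -> Vec<(String, i32)> {
    file_columns
        .iter()
        .enumerate()
        .map(|(idx, name)| ((*name).to_string(), idx as i32 + 1))
        .collect()
}

/// Name-based assignment: look each physical column up in the table schema.
/// Columns the schema does not know get no id (`None`).
fn assign_ids_by_name(
    file_columns: &[&str],
    table_schema: &HashMap<&str, i32>,
) -> Vec<(String, Option<i32>)> {
    file_columns
        .iter()
        .map(|name| ((*name).to_string(), table_schema.get(*name).copied()))
        .collect()
}

/// Hypothetical toy table schema: column name -> field id.
fn example_schema() -> HashMap<&'static str, i32> {
    HashMap::from([("id", 1), ("product_name", 2), ("event_ts", 3)])
}
```

For a file whose physical columns are `["id", "event_ts"]` (it omits `product_name` and carries the later `event_ts`), the positional helper assigns id 2 to `event_ts`, which is `product_name`'s id in this toy schema. The name-based helper assigns id 3, leaving id 2 absent so the omitted column can still be treated as missing.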
This PR will set up the basic project layout.
Please use squash merge.