Spark: Reconcile derived partitioning from source table with target table specs in AddFilesProcedure by amogh-jahagirdar · Pull Request #10133 · apache/iceberg

amogh-jahagirdar · 2024-04-13T00:25:00Z

Currently, the Spark add files procedure derives a partition spec from the Hive-style source table and uses that spec when writing manifests as part of the import. However, unless the partitioning on the target Iceberg table is defined correctly upfront, adding the files leads to unexpected behavior. Currently, what happens is:

1.) User creates the target table with some partitioning.
2.) User updates the partitioning on the target Iceberg table to "align" with what's in Hive, for example by adding an identity partition field.
3.) User runs the procedure.
4.) Internally, the procedure derives a partition spec from the Hive table, and this ends up with a spec ID of 0.
5.) The manifests get written with a spec ID of 0. However, in the target table that ID refers to the original partitioning, not the evolved partition spec that we would expect to be used. This leads to unexpected results when the target table is queried, since the new partition field will be missing.
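
The spec ID collision in steps 4 and 5 can be illustrated with a small, self-contained sketch. It does not use the Iceberg API: a partition spec is modeled as a plain list of identity partition column names, and the class and method names (`SpecIdCollision`, `specUsedWhenReading`) are hypothetical names chosen for this example only.

```java
import java.util.List;
import java.util.Map;

public class SpecIdCollision {

  // Hypothetical simplified model: the target table maps each spec ID to the
  // identity partition columns of that spec. A manifest only records a spec ID,
  // so the ID decides which spec is used to interpret its files.
  static List<String> specUsedWhenReading(
      Map<Integer, List<String>> targetSpecs, int manifestSpecId) {
    return targetSpecs.get(manifestSpecId);
  }

  public static void main(String[] args) {
    // Steps 1 and 2: original spec (ID 0), then an evolved spec (ID 1).
    Map<Integer, List<String>> targetSpecs =
        Map.of(
            0, List.of("dept"),
            1, List.of("dept", "subdept"));

    // Step 4: the spec derived from the Hive-style table ends up with ID 0.
    List<String> derivedFields = List.of("dept", "subdept");
    int derivedSpecId = 0;

    // Step 5: resolving ID 0 against the target table yields the original spec.
    List<String> resolved = specUsedWhenReading(targetSpecs, derivedSpecId);
    System.out.println("derived spec fields: " + derivedFields);
    System.out.println("resolved by spec ID: " + resolved);
    System.out.println("same partitioning:   " + resolved.equals(derivedFields));
  }
}
```

In this model the derived spec carries `subdept`, but the spec resolved through ID 0 does not, which mirrors the missing partition field described in step 5.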

This change fixes the issue by reconciling the derived spec from step 4 with a spec that already exists in the target table, since there's some sane inference we can do here.

If a compatible spec is found in the target table, the procedure uses that spec when writing the manifests as part of the import. If no compatible spec is found, the procedure uses the derived spec as before.
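
The "use a compatible target spec, otherwise fall back" rule can be sketched as follows. This is a minimal illustration, not the code in this PR: `Spec`, `compatible`, and `reconcile` are hypothetical names, and the compatibility check (same partition columns in the same order) is an assumption made for the example.

```java
import java.util.List;

public class SpecReconciliation {

  // Hypothetical stand-in for a partition spec: an ID plus ordered partition columns.
  static final class Spec {
    final int id;
    final List<String> columns;

    Spec(int id, List<String> columns) {
      this.id = id;
      this.columns = columns;
    }
  }

  // Assumption for this sketch: two specs are compatible when they partition
  // by the same columns in the same order. The real check may differ.
  static boolean compatible(Spec a, Spec b) {
    return a.columns.equals(b.columns);
  }

  // Prefer a compatible spec that already exists in the target table;
  // otherwise fall back to the derived spec, as before the change.
  static Spec reconcile(Spec derived, List<Spec> targetSpecs) {
    return targetSpecs.stream()
        .filter(spec -> compatible(spec, derived))
        .findFirst()
        .orElse(derived);
  }

  public static void main(String[] args) {
    List<Spec> targetSpecs =
        List.of(new Spec(0, List.of("dept")), new Spec(1, List.of("dept", "subdept")));
    Spec derived = new Spec(0, List.of("dept", "subdept"));

    Spec chosen = reconcile(derived, targetSpecs);
    System.out.println("spec ID used for manifests: " + chosen.id);
  }
}
```

With the evolved spec present in the target table, the sketch picks spec ID 1 instead of the derived ID 0. When nothing matches, the derived spec object is returned unchanged.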

amogh-jahagirdar · 2024-04-13T00:29:01Z

  }

+  @TestTemplate
+  public void addFilesTargetTableEvolvedPartitioning() {


Let me fix some bad naming of columns in this test, and also add a test for dropping a partition field

I think there are some fundamental changes around the schema derived from the Spark table that we'll need to make to get this to work as expected. While the fix addresses the particular case in the issue, here's a case that will still not behave as expected.

createIcebergTable("dept String, subdept String, id int, name String", "PARTITIONED BY (dept)");
sql("ALTER TABLE %s ADD PARTITION FIELD subdept", tableName);

String createParquet =
    "CREATE TABLE %s (dept String, subdept String, id int, name String) USING %s"
        + " PARTITIONED BY (dept, subdept) LOCATION '%s'";
sql(createParquet, sourceTableName, "parquet", fileTableDir.getAbsolutePath());

sql("INSERT INTO %s PARTITION (dept='hr', subdept='communications') VALUES (1, 'John Doe')", sourceTableName);
sql("INSERT INTO %s PARTITION (dept='hr', subdept='salary') VALUES (2, 'Jane Doe')", sourceTableName);
sql("INSERT INTO %s PARTITION (dept='hr', subdept='communications') VALUES (3, 'Matt Doe')", sourceTableName);
sql("INSERT INTO %s PARTITION (dept='facilities', subdept='all') VALUES (4, 'Will Doe')", sourceTableName);

sql("CALL %s.system.add_files('%s', '%s')", catalogName, tableName, sourceTableName);

assertEquals(
    "Iceberg table contains correct data",
    sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", sourceTableName),
    sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));

This case will still fail with the change because we fall back to the derived spec. We fall back because the field IDs in the derived spec are different than what's in the target table. The field IDs generated when deriving the schema from the Spark table are assigned starting from 0 and are in a different field ID order than what's on the derived spec.
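
A small sketch of why a comparison based on source field IDs falls back in this case. The concrete numbers are assumptions for illustration only: the target table's columns are numbered from 1 in declared order, and the derived schema is numbered from 0 with the partition columns listed last. `FieldIdMismatch`, `assignIds`, and `sourceIds` are hypothetical names, not Iceberg APIs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FieldIdMismatch {

  // Assign consecutive field IDs to columns in schema order, starting at firstId.
  static Map<String, Integer> assignIds(List<String> columns, int firstId) {
    Map<String, Integer> ids = new LinkedHashMap<>();
    for (int i = 0; i < columns.size(); i++) {
      ids.put(columns.get(i), firstId + i);
    }
    return ids;
  }

  // Source field IDs that a spec partitioned by the given columns would reference.
  static List<Integer> sourceIds(List<String> partitionColumns, Map<String, Integer> schemaIds) {
    return partitionColumns.stream().map(schemaIds::get).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Assumed ID assignment for the target table (declared column order, from 1).
    Map<String, Integer> targetIds = assignIds(List.of("dept", "subdept", "id", "name"), 1);
    // Assumed ID assignment for the derived schema (partition columns last, from 0).
    Map<String, Integer> derivedIds = assignIds(List.of("id", "name", "dept", "subdept"), 0);

    List<String> partitionColumns = List.of("dept", "subdept");
    List<Integer> targetSpecIds = sourceIds(partitionColumns, targetIds);
    List<Integer> derivedSpecIds = sourceIds(partitionColumns, derivedIds);

    // Same partition columns by name, but different source field IDs.
    System.out.println("target source IDs:  " + targetSpecIds);
    System.out.println("derived source IDs: " + derivedSpecIds);
    System.out.println("match by field ID:  " + targetSpecIds.equals(derivedSpecIds));
  }
}
```

Under these assumptions both specs partition by `dept` and `subdept`, yet an ID-based comparison reports a mismatch, so no target spec is considered compatible and the derived spec is used.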

amogh-jahagirdar · 2024-04-13T22:49:15Z

-    StructType sparkType = spark.table(name).schema();
-    Type converted = SparkTypeVisitor.visit(sparkType, new SparkTypeToType(sparkType));
-    return new Schema(converted.asNestedType().asStructType().fields());
+    return convert(spark.table(name).schema());


I'll just raise this separately since it's a small refactoring not directly related to this change

nastra

LGTM once CI passes

aokolnychyi

The change makes sense to me but the style has to be fixed.

github-actions · 2024-10-30T00:15:50Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

github-actions · 2024-11-07T00:15:00Z

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions Bot added the spark label Apr 13, 2024

amogh-jahagirdar changed the title ~~Spark: Reconcile derived partitioning from source table with target table specs~~ Spark: Reconcile derived partitioning from source table with target table specs in AddFilesProcedure Apr 13, 2024

amogh-jahagirdar requested review from aokolnychyi and nastra April 13, 2024 00:34

Spark: Reconcile derived partitioning from source table with target t…

4ed95bb

…able specs

amogh-jahagirdar force-pushed the use-correct-target-table-spec branch from 1c474eb to 4ed95bb Compare April 13, 2024 22:48

amogh-jahagirdar mentioned this pull request Apr 13, 2024

Spark: Simplify SparkSchemaUtil#schemaForTable #10137

Merged

nastra approved these changes Apr 16, 2024

aokolnychyi approved these changes Apr 16, 2024

github-actions Bot added the stale label Oct 30, 2024

github-actions Bot closed this Nov 7, 2024

amogh-jahagirdar wants to merge 1 commit into apache:main from amogh-jahagirdar:use-correct-target-table-spec