Skip to content

Spark: Reconcile derived partitioning from source table with target table specs in AddFilesProcedure#10133

Closed
amogh-jahagirdar wants to merge 1 commit into
apache:mainfrom
amogh-jahagirdar:use-correct-target-table-spec
Closed

Spark: Reconcile derived partitioning from source table with target table specs in AddFilesProcedure#10133
amogh-jahagirdar wants to merge 1 commit into
apache:mainfrom
amogh-jahagirdar:use-correct-target-table-spec

Conversation

@amogh-jahagirdar
Copy link
Copy Markdown
Contributor

@amogh-jahagirdar amogh-jahagirdar commented Apr 13, 2024

Fixes #10008

Currently the Spark add files procedure will derive a partition spec from the Hive style table and then use that as a spec when writing manifests as part of the import. However, unless the partitioning on the target Iceberg table is correctly defined upfront, then there is unexpected behavior after adding the file. Currently, what happens is:

1.) User creates target table with some partitioning
2.) User updates the partitioning on the target Iceberg table to "align" with what's in Hive. Say for example adds an identity column
3.) User runs the procedure.
4.) Internally the procedure derives a partition spec from the Hive table, and this ends up with a spec ID of 0.
5.) The manifests get written with spec ID of 0 , however this is the original partitioning of the table, and not the evolved partition spec that we would expect to get used. This leads to unexpected results when the target table is queried since the new partition field will be missing.

This change fixes this issue by reconciling the derived spec from 4 with what's a spec that's already in the target table since there's some sane inference we can do here.

If a compatible spec in the target table is found the procedure will use that spec as the spec to use when writing the manifests as part of the import. If a compatible spec is not found, the procedure will use the derived spec as before.

@github-actions github-actions Bot added the spark label Apr 13, 2024
@amogh-jahagirdar amogh-jahagirdar changed the title Spark: Reconcile derived partitioning from source table with target table specs Spark: Reconcile derived partitioning from source table with target table specs in AddFilesProcedure Apr 13, 2024
}

@TestTemplate
public void addFilesTargetTableEvolvedPartitioning() {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me fix some bad naming of columns in this test, and also add a test for dropping a partition field

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's some fundamental changes around the derived schema from the Spark table, we'll need to do to get this to work as expected. While the fix addresses the particular case in the issue, here's a case which still will continue to not behave as expected.

    createIcebergTable("dept String, subdept String, id int, name String", "PARTITIONED BY (dept)");

    sql("ALTER TABLE %s ADD PARTITION FIELD subdept", tableName);

    String createParquet =
        "CREATE TABLE %s (dept String, subdept String, id int, name String) USING %s"
            + " PARTITIONED BY (dept, subdept) LOCATION '%s'";

    sql(createParquet, sourceTableName, "parquet", fileTableDir.getAbsolutePath());
    sql("INSERT INTO %s PARTITION (dept='hr', subdept='communications') VALUES (1, 'John Doe')", sourceTableName);
    sql("INSERT INTO %s PARTITION (dept='hr', subdept='salary') VALUES (2, 'Jane Doe')", sourceTableName);
    sql("INSERT INTO %s PARTITION (dept='hr', subdept='communications') VALUES (3, 'Matt Doe')", sourceTableName);
    sql("INSERT INTO %s PARTITION (dept='facilities', subdept='all') VALUES (4, 'Will Doe')", sourceTableName);

    sql("CALL %s.system.add_files('%s', '%s')", catalogName, tableName, sourceTableName);

    assertEquals(
        "Iceberg table contains correct data",
        sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", sourceTableName),
        sql("SELECT id, name, dept, subdept FROM %s ORDER BY id", tableName));

This case still will fail with the change because we fall back to the derived spec; we fall back to the derived spec because the field IDs in the derived spec are different then what's in the target table. The field IDs generated when deriving the schema from the Spark table are assigned starting from 0 and are in different field ID order then what's on the derived spec.

@amogh-jahagirdar amogh-jahagirdar force-pushed the use-correct-target-table-spec branch from 1c474eb to 4ed95bb Compare April 13, 2024 22:48
StructType sparkType = spark.table(name).schema();
Type converted = SparkTypeVisitor.visit(sparkType, new SparkTypeToType(sparkType));
return new Schema(converted.asNestedType().asStructType().fields());
return convert(spark.table(name).schema());
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just raise this separately since it's a small refactoring not directly related to this change

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once CI passes

Copy link
Copy Markdown
Contributor

@aokolnychyi aokolnychyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes sense to me but the style has to be fixed.

@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Oct 30, 2024
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Nov 7, 2024

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

system.add_files utility does not support updated Partition Spec

3 participants