To persist sort order in new manifest files after it's been updated #14531

Description

@jennywang67

Feature Request / Improvement

Hi!

After we update a table's sort order, data files written afterwards are not recorded with the new sort order ID in the new manifests.

This can be reproduced with a unit test:

  @TestTemplate
  public void testSortOrder() {
    sql(
        "CREATE TABLE %s (col string, userid int, dt string) USING iceberg partitioned by (dt)",
        tableName);
    sql("ALTER TABLE %s WRITE ORDERED BY userid", tableName);
    sql("INSERT OVERWRITE %s PARTITION (dt='dt1') VALUES ('str1', 1)", tableName);
    Assert.assertEquals(
        "Should have 1 row with sort order id is 1, and column stats is not null",
        1L,
        scalarSql(
            "SELECT count(*) FROM %s.files where sort_order_id = 1 and column_sizes is not null",
            tableName));
  }

And we see the following failure:

Should have 1 row with sort order id is 1, and column stats is not null
Expected :1
Actual   :0

We confirmed that the files metadata table contains exactly 1 row, so the file is written but does not match the sort order filter:

    Assert.assertEquals(
        "Should have 1 row",
        1L,
        scalarSql("SELECT count(*) FROM %s.files", tableName));

Internally, we found that the issue is resolved by changing SparkWrite.createWriter() so that the writerFactory is built with .dataSortOrder(table.sortOrder()):

      SparkFileWriterFactory writerFactory =
          SparkFileWriterFactory.builderFor(table)
              .dataFileFormat(format)
              .dataSchema(writeSchema)
              .dataSparkType(dsSchema)
              .dataSortOrder(table.sortOrder())
              .writeProperties(writeProperties)
              .build();
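
To illustrate why the extra builder call might matter, here is a toy model of a builder whose sort order is optional. This is not Iceberg code: SortOrderDefaultSketch, WriterFactory, and every method in it are hypothetical stand-ins, and the fallback to id 0 (unsorted) when nothing is set is our assumption about the behavior, not something we have confirmed in the Iceberg source.

```java
/** Toy model (not Iceberg code) of a writer-factory builder with an optional sort order. */
public class SortOrderDefaultSketch {

  // Hypothetical stand-in for a table sort order; id 0 is treated as "unsorted".
  static final class SortOrder {
    static final SortOrder UNSORTED = new SortOrder(0);
    final int orderId;

    SortOrder(int orderId) {
      this.orderId = orderId;
    }
  }

  // Hypothetical stand-in for the writer factory that stamps each new file with a sort order id.
  static final class WriterFactory {
    private final SortOrder dataSortOrder;

    private WriterFactory(SortOrder dataSortOrder) {
      this.dataSortOrder = dataSortOrder;
    }

    int sortOrderIdForNewFile() {
      // Assumption: when no sort order was passed in, files fall back to the unsorted id.
      return dataSortOrder == null ? SortOrder.UNSORTED.orderId : dataSortOrder.orderId;
    }

    static Builder builder() {
      return new Builder();
    }

    static final class Builder {
      private SortOrder dataSortOrder; // stays null unless the caller sets it

      Builder dataSortOrder(SortOrder order) {
        this.dataSortOrder = order;
        return this;
      }

      WriterFactory build() {
        return new WriterFactory(dataSortOrder);
      }
    }
  }

  // Builder used without dataSortOrder(...), as in the code path that shows the problem.
  static int withoutSortOrder() {
    return WriterFactory.builder().build().sortOrderIdForNewFile();
  }

  // Builder used with the table's current sort order, as in the proposed change.
  static int withSortOrder(int tableSortOrderId) {
    return WriterFactory.builder()
        .dataSortOrder(new SortOrder(tableSortOrderId))
        .build()
        .sortOrderIdForNewFile();
  }

  public static void main(String[] args) {
    System.out.println(withoutSortOrder()); // prints 0
    System.out.println(withSortOrder(1));   // prints 1
  }
}
```

If the real writer factory behaves like this model, a file written after ALTER TABLE ... WRITE ORDERED BY would still carry sort order id 0, which would match the failing assertion above.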

Thanks!

Query engine

None

Willingness to contribute

  • I can contribute this improvement/feature independently
  • I would be willing to contribute this improvement/feature with guidance from the Iceberg community
  • I cannot contribute this improvement/feature at this time

Metadata

Labels

improvement (PR that improves existing functionality)
