
partition_ttl_number is not enforced on incremental refresh for partitioned MV over Iceberg #71972

@HangyuanLiu

Steps to reproduce the behavior (Required)

  1. Create an Iceberg catalog and base table with date partitions:
    CREATE EXTERNAL CATALOG bug08_ice
    PROPERTIES(
      "type" = "iceberg",
      "iceberg.catalog.type" = "hadoop",
      "iceberg.catalog.warehouse" = "file:///tmp/starrocks-sql-test/iceberg/bug08_confirm"
    );
    
    SET catalog bug08_ice;
    CREATE DATABASE bug08_ice_db;
    USE bug08_ice_db;
    CREATE TABLE t1 (id int, dt date, val int) PARTITION BY (dt);
    INSERT INTO t1 VALUES
      (1, '2024-01-01', 10),
      (2, '2024-01-02', 20),
      (3, '2024-01-03', 30),
      (4, '2024-01-04', 40),
      (5, '2024-01-05', 50);
  2. Create a partitioned MV with partition_ttl_number = 2:
    SET catalog default_catalog;
    CREATE DATABASE bug08_mv_db;
    USE bug08_mv_db;
    
    CREATE MATERIALIZED VIEW test_mv1
    PARTITION BY dt
    REFRESH DEFERRED MANUAL
    PROPERTIES (
      "replication_num" = "1",
      "partition_ttl_number" = "2"
    )
    AS SELECT dt, sum(val) AS sv
       FROM bug08_ice.bug08_ice_db.t1
       GROUP BY dt;
  3. Run the initial refresh:
    REFRESH MATERIALIZED VIEW test_mv1 WITH SYNC MODE;
    SELECT dt FROM test_mv1 ORDER BY dt;

    Result: the MV correctly keeps only the latest 2 partitions (2024-01-04 and 2024-01-05).
  4. Insert one new base partition and refresh again:
    INSERT INTO bug08_ice.bug08_ice_db.t1 VALUES (6, '2024-01-06', 60);
    REFRESH MATERIALIZED VIEW test_mv1 WITH SYNC MODE;
    SELECT dt FROM test_mv1 ORDER BY dt;
    SELECT count(*) FROM test_mv1;
    SHOW PARTITIONS FROM test_mv1;

Expected behavior (Required)

After every refresh, including incremental refresh, the MV should keep at most partition_ttl_number partitions.

With partition_ttl_number = 2, after inserting 2024-01-06 and refreshing, the MV should contain only:

  • 2024-01-05
  • 2024-01-06

So SELECT count(*) FROM test_mv1 should return 2.
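
The expected retention rule can be stated as a small model. This is a sketch in Python, not StarRocks code: `expected_mv_partitions` is a hypothetical helper, it assumes "latest" means the greatest `dt` value, and it assumes a non-positive `ttl_number` means no limit.

```python
from datetime import date


def expected_mv_partitions(base_partitions, ttl_number):
    """Return the partitions the MV should hold after any refresh.

    Hypothetical model of the expected behavior: keep only the
    ttl_number greatest partition values, whether the refresh is
    the initial one or an incremental one.
    """
    ordered = sorted(set(base_partitions))
    if ttl_number <= 0:
        # Assumption for this sketch: non-positive means "no TTL limit".
        return ordered
    return ordered[-ttl_number:]


initial = [date(2024, 1, d) for d in range(1, 6)]  # 2024-01-01 .. 2024-01-05
after_insert = initial + [date(2024, 1, 6)]

print(expected_mv_partitions(initial, 2))       # 2024-01-04, 2024-01-05
print(expected_mv_partitions(after_insert, 2))  # 2024-01-05, 2024-01-06
```

Under this rule the result depends only on the current base partitions and the TTL, not on what the MV held before the refresh.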

Real behavior (Required)

The initial refresh behaves correctly, but the incremental refresh does not trim the old MV partition.

After inserting 2024-01-06 and refreshing again, the MV contains:

  • 2024-01-04
  • 2024-01-05
  • 2024-01-06

So SELECT count(*) FROM test_mv1 returns 3.

This means partition_ttl_number is honored when the MV is first populated, but is not enforced after later incremental refreshes.

Additional observations

I reproduced this locally and the FE logs show:

  • initial refresh partition diff: adds=p20240104,p20240105
  • incremental refresh partition diff: adds=p20240106, deletes=
  • the incremental refresh only refreshes p20240106

The current code path also looks suspicious:

  • RangePartitionDiffer trims the candidate add set by partition_ttl_number, which explains why the initial refresh only creates the latest N partitions.
  • MVPCTRefreshRangePartitioner.syncAddOrDropPartitions() only calls filterPartitionsByTTL(adds, true) on the newly added partitions, and does not trim already existing stale MV partitions after incremental refresh.
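
To make the suspected gap concrete, the following Python sketch models the two behaviors. It is not the FE implementation: `refresh_adds_only` and `refresh_with_trim` are hypothetical names, the way the add set is derived is an assumption chosen to match the logged diffs, and partition names follow the `pYYYYMMDD` pattern from the logs so that lexicographic order matches date order. It also assumes `ttl_number >= 1`.

```python
def latest(names, ttl_number):
    """Keep the ttl_number greatest names (pYYYYMMDD sorts by date)."""
    return set(sorted(names)[-ttl_number:])


def refresh_adds_only(mv_partitions, base_partitions, ttl_number):
    """Model of the reported behavior: TTL limits the add set only."""
    adds = latest(base_partitions, ttl_number) - set(mv_partitions)
    # Existing MV partitions are never revisited, so stale ones survive.
    return sorted(set(mv_partitions) | adds)


def refresh_with_trim(mv_partitions, base_partitions, ttl_number):
    """Model of the expected behavior: trim the merged set as well."""
    adds = latest(base_partitions, ttl_number) - set(mv_partitions)
    merged = set(mv_partitions) | adds
    return sorted(latest(merged, ttl_number))


base = [f"p2024010{d}" for d in range(1, 6)]  # p20240101 .. p20240105
mv = refresh_adds_only([], base, 2)           # initial refresh
base.append("p20240106")                      # new base partition

print(refresh_adds_only(mv, base, 2))  # three partitions remain
print(refresh_with_trim(mv, base, 2))  # only the latest two remain
```

In this model the initial refresh looks correct in both variants because the MV starts empty; the difference only shows up once the MV already holds partitions that have fallen out of the TTL window.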

Relevant code paths:

  • fe/fe-core/src/main/java/com/starrocks/sql/common/RangePartitionDiffer.java
  • fe/fe-core/src/main/java/com/starrocks/scheduler/mv/pct/MVPCTRefreshRangePartitioner.java

StarRocks version (Required)

Reproduced on a local FE runtime with:

  • show variables like 'version_comment' returns fix/bug-28-iceberg-row-dml-reject-e9c501d
  • source checkout HEAD = 189283f334c (upstream/main)

I cannot provide select current_version() output because it is not supported in this local runtime environment.

Labels: type/bug
