TPC-DS q39 regression after adding primary key constraints: aggregate GROUP BY includes many unreferenced FD columns #23027

@hhhizzz

Describe the bug

Problem

After 73e3c2a617 / #22646 (chore: Add primary key constraints for TPC-H, TPC-DS), TPC-DS q39 shows a large performance regression.

The regression appears related to SQL aggregate planning with functional dependencies from primary key constraints.

In q39, the query groups by key columns such as:

  • item.i_item_sk
  • warehouse.w_warehouse_sk

After primary key constraints are present, the optimized aggregate plan expands the GROUP BY keys with many functionally dependent columns from those tables, even though the query does not need those columns after aggregation.

Examples observed in the plan include:

  • item.i_item_id
  • item.i_product_name
  • warehouse.w_gmt_offset

This makes the aggregate keys much wider and also causes extra columns to be projected from scans and carried through joins/aggregation.
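A minimal sketch of why the extra keys are redundant, using Python's built-in `sqlite3` rather than the engine this issue is about. The tiny `item` and `inventory` tables and their rows are made up for illustration. Because `i_item_sk` is the primary key, adding `i_item_id` and `i_product_name` to the GROUP BY cannot change the groups; it only makes each key wider.

```python
import sqlite3

# Toy data, not TPC-DS: two items and three inventory rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (
        i_item_sk INTEGER PRIMARY KEY,
        i_item_id TEXT,
        i_product_name TEXT
    );
    CREATE TABLE inventory (inv_item_sk INTEGER, inv_quantity_on_hand INTEGER);
    INSERT INTO item VALUES (1, 'A', 'alpha'), (2, 'B', 'beta');
    INSERT INTO inventory VALUES (1, 10), (1, 30), (2, 5);
""")

# Group by the primary key only, as the query is written.
narrow = conn.execute("""
    SELECT i_item_sk, AVG(inv_quantity_on_hand)
    FROM inventory JOIN item ON inv_item_sk = i_item_sk
    GROUP BY i_item_sk
    ORDER BY i_item_sk
""").fetchall()

# Group by the primary key plus functionally dependent columns.
wide = conn.execute("""
    SELECT i_item_sk, AVG(inv_quantity_on_hand)
    FROM inventory JOIN item ON inv_item_sk = i_item_sk
    GROUP BY i_item_sk, i_item_id, i_product_name
    ORDER BY i_item_sk
""").fetchall()

# Same groups and same aggregates; only the key width differs.
print(narrow == wide)
```

The two queries are equivalent in result, so any cost difference comes purely from hashing, comparing, and carrying the wider keys.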

Regression Shape

The regression pattern is:

  1. TPC-DS table schemas include primary key constraints.
  2. SQL planning recognizes functional dependencies from those constraints.
  3. Aggregate planning expands grouped primary key columns into dependent columns.
  4. The expansion includes columns that are not referenced by the query output.
  5. The plan carries much wider group keys than needed.
  6. q39 runtime increases substantially.
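Steps 2 to 5 above can be sketched with a toy model. This is not the planner's code: `TABLE_COLUMNS`, `PRIMARY_KEYS`, and `expand_group_keys` are hypothetical names, and the column lists contain only the columns mentioned in this report.

```python
# Hypothetical toy schema: only the columns named in this issue.
TABLE_COLUMNS = {
    "item": ["i_item_sk", "i_item_id", "i_product_name"],
    "warehouse": ["w_warehouse_sk", "w_gmt_offset"],
}
PRIMARY_KEYS = {"item": "i_item_sk", "warehouse": "w_warehouse_sk"}


def expand_group_keys(group_by):
    """Append every column functionally determined by a grouped primary key."""
    expanded = list(group_by)
    for key in group_by:
        table, column = key.split(".")
        if PRIMARY_KEYS.get(table) != column:
            continue  # not a primary key, so it determines nothing else
        for dependent in TABLE_COLUMNS[table]:
            qualified = f"{table}.{dependent}"
            if qualified not in expanded:
                expanded.append(qualified)
    return expanded


query_keys = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded_keys = expand_group_keys(query_keys)
print(len(query_keys), "->", len(expanded_keys), "group keys")
```

In this toy case two keys become five. With the full TPC-DS column lists the widening would be larger, since every non-key column of `item` and `warehouse` is functionally dependent on its primary key.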

This looks like a planner-level issue rather than a Parquet reader issue: disabling the TPC-DS primary key constraints makes q39 return to the previous timing range.

Benchmark Results

Environment:

TPC-DS SF10
CPU: 24 Cores
Rounds: 10
Iterations: 1
Parquet pushdown filters: true
Parquet reorder filters: true
Parquet pruning: true

With TPC-DS primary key constraints enabled:

q39 current mean: ~8301 ms

With TPC-DS primary key constraints disabled for diagnosis:

q39 current total: 14288.69 ms over 10 rounds
q39 current mean:  ~1428.87 ms
geomean current/main: 0.983399
failures: 0

So q39 is roughly:

~8301 ms -> ~1429 ms

when primary key constraints are removed from the TPC-DS schema setup.
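For reference, the arithmetic behind these figures, using only the numbers reported above (the ~8301 ms mean is itself approximate):

```python
rounds = 10
total_without_pk_ms = 14288.69  # total over 10 rounds, constraints disabled
mean_with_pk_ms = 8301.0        # approximate mean, constraints enabled

mean_without_pk_ms = total_without_pk_ms / rounds
slowdown = mean_with_pk_ms / mean_without_pk_ms

print(f"mean without PK constraints: {mean_without_pk_ms:.2f} ms")
print(f"slowdown with PK constraints: {slowdown:.1f}x")
```

That works out to q39 being about 5.8x slower with the constraints enabled.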

Expected Behavior

Functional dependency support should allow queries to select columns determined by grouped keys, but aggregate planning should not add unreferenced functionally dependent columns to the physical/logical group keys.

Only columns actually required after aggregation should appear in the aggregate's output and group keys.
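One way to picture the expected behavior is a pruning step that keeps the explicit GROUP BY keys and only those dependent columns that something above the aggregate still references. This is a hypothetical sketch, not a proposed patch: `prune_group_keys` and its parameters are illustrative names.

```python
def prune_group_keys(explicit_keys, expanded_keys, referenced_after_agg):
    """Keep explicit GROUP BY keys plus only the dependent columns still referenced."""
    return [
        key
        for key in expanded_keys
        if key in explicit_keys or key in referenced_after_agg
    ]


explicit = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded = explicit + [
    "item.i_item_id",
    "item.i_product_name",
    "warehouse.w_gmt_offset",
]

# Nothing after the aggregate uses the dependent columns: keep only the explicit keys.
unused = prune_group_keys(explicit, expanded, referenced_after_agg=set())

# A query that selects i_item_id after aggregation keeps that one column.
one_used = prune_group_keys(explicit, expanded, referenced_after_agg={"item.i_item_id"})

print(unused)
print(one_used)
```

Under this model, a query that never references the dependent columns plans with the same group keys it would have had without the primary key constraints.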

To Reproduce

Run TPC-DS q39 before and after #22646.

