Describe the bug
Problem
After 73e3c2a617 / #22646 (chore: Add primary key constraints for TPC-H, TPC-DS), TPC-DS q39 shows a large performance regression.
The regression appears related to SQL aggregate planning with functional dependencies from primary key constraints.
In q39, the query groups by key columns such as:
- item.i_item_sk
- warehouse.w_warehouse_sk
Once primary key constraints are present, the optimized aggregate plan expands the GROUP BY keys with many functionally dependent columns from those tables, even though the query does not need those columns after aggregation.
Examples observed in the plan include:
- item.i_item_id
- item.i_product_name
- warehouse.w_gmt_offset
This makes the aggregate keys much wider and also causes extra columns to be projected from scans and carried through joins/aggregation.
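To make the widening concrete, here is a minimal Python sketch of the expansion. It is a toy model, not the planner's actual code: the `FUNCTIONAL_DEPS` mapping and the `expand_group_keys` helper are hypothetical names, and the dependent columns listed are only the examples observed in the plan.

```python
# Toy model of group-key expansion via functional dependencies.
# All names here are hypothetical; this is not the planner's real API.

# Primary key column -> columns it functionally determines.
# Only the examples observed in the q39 plan are listed.
FUNCTIONAL_DEPS = {
    "item.i_item_sk": ["item.i_item_id", "item.i_product_name"],
    "warehouse.w_warehouse_sk": ["warehouse.w_gmt_offset"],
}


def expand_group_keys(group_keys):
    """Append every column determined by a grouped primary key."""
    expanded = list(group_keys)
    for key in group_keys:
        for dep in FUNCTIONAL_DEPS.get(key, []):
            if dep not in expanded:
                expanded.append(dep)
    return expanded


query_keys = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded_keys = expand_group_keys(query_keys)
print(expanded_keys)
```

In this model two group keys become five. With the full table schemas the growth would be larger, since every non-key column of a table is determined by its primary key.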
Regression Shape
The regression pattern is:
- TPC-DS table schemas include primary key constraints.
- SQL planning recognizes functional dependencies from those constraints.
- Aggregate planning expands grouped primary key columns into dependent columns.
- The expansion includes columns that are not referenced by the query output.
- The plan carries much wider group keys than needed.
- q39 runtime increases substantially.
This looks like a planner-level issue rather than a Parquet reader issue: disabling the TPC-DS primary key constraints makes q39 return to the previous timing range.
Benchmark Results
Environment:
- TPC-DS SF10
- CPU: 24 cores
- Rounds: 10
- Iterations: 1
- Parquet pushdown filters: true
- Parquet reorder filters: true
- Parquet pruning: true
With TPC-DS primary key constraints enabled:
- q39 current mean: ~8301 ms
With TPC-DS primary key constraints disabled for diagnosis:
- q39 current total: 14288.69 ms over 10 rounds
- q39 current mean: ~1428.87 ms
- geomean current/main: 0.983399
- failures: 0
So q39 is roughly 5.8x faster (~1429 ms vs. ~8301 ms mean) when primary key constraints are removed from the TPC-DS schema setup.
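The ratio can be checked from the figures reported above. This is plain arithmetic on those numbers; the variable names are only for illustration.

```python
# Figures reported in the benchmark results above.
mean_with_pk_ms = 8301.0        # approximate q39 mean, constraints enabled
total_without_pk_ms = 14288.69  # q39 total over all rounds, constraints disabled
rounds = 10

# Mean per round with constraints disabled, and the resulting slowdown factor.
mean_without_pk_ms = total_without_pk_ms / rounds
slowdown = mean_with_pk_ms / mean_without_pk_ms

print(f"mean without PK constraints: {mean_without_pk_ms:.2f} ms")
print(f"slowdown with PK constraints: {slowdown:.2f}x")
```

This gives a mean of about 1428.87 ms without constraints and a slowdown of about 5.81x with them.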
Expected Behavior
Functional dependency support should allow queries to select columns determined by grouped keys, but aggregate planning should not add unreferenced functionally dependent columns to the physical/logical group keys.
Only columns that are actually required after aggregation should appear in the aggregate output or group keys.
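One way to picture the expected behavior is as a pruning step over the expanded keys. The sketch below is hypothetical: `prune_group_keys` and its parameters are illustrative names, not an existing planner function, and it assumes the set of columns referenced after the aggregate is already known.

```python
# Hypothetical sketch: keep a functionally dependent column in the group keys
# only if something after the aggregate actually references it.


def prune_group_keys(query_keys, expanded_keys, required_after_agg):
    """Keep the query's own GROUP BY keys plus only the referenced dependents."""
    kept = []
    for col in expanded_keys:
        if col in query_keys or col in required_after_agg:
            kept.append(col)
    return kept


query_keys = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded_keys = query_keys + [
    "item.i_item_id",
    "item.i_product_name",
    "warehouse.w_gmt_offset",
]

# Case 1: no dependent column is referenced after aggregation,
# so the group keys collapse back to what the query wrote.
pruned = prune_group_keys(query_keys, expanded_keys, set())

# Case 2: the query selects one dependent column, so only that one is kept.
pruned_with_id = prune_group_keys(query_keys, expanded_keys, {"item.i_item_id"})

print(pruned)
print(pruned_with_id)
```

The second case is the one functional dependency support exists for: a dependent column can still be selected without being written in GROUP BY, but unreferenced ones never widen the keys.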
To Reproduce
Run TPC-DS q39 before and after #22646 and compare the runtimes.