TPC-DS q39 regression after adding primary key constraints: aggregate GROUP BY includes many unreferenced FD columns #23027

### Describe the bug

## Problem

After `73e3c2a617` / #22646 (`chore: Add primary key constraints for TPC-H, TPC-DS`), TPC-DS q39 shows a large performance regression.

The regression appears related to SQL aggregate planning with functional dependencies from primary key constraints.

In q39, the query groups by key columns such as:

- `item.i_item_sk`
- `warehouse.w_warehouse_sk`

After primary key constraints are present, the optimized aggregate plan expands the `GROUP BY` keys with many functionally dependent columns from those tables, even though the query does not need those columns after aggregation.

Examples observed in the plan include:

- `item.i_item_id`
- `item.i_product_name`
- `warehouse.w_gmt_offset`

This makes the aggregate keys much wider and also causes extra columns to be projected from scans and carried through joins/aggregation.

## Regression Shape

The regression pattern is:

1. TPC-DS table schemas include primary key constraints.
2. SQL planning recognizes functional dependencies from those constraints.
3. Aggregate planning expands grouped primary key columns into dependent columns.
4. The expansion includes columns that are not referenced by the query output.
5. The plan carries much wider group keys than needed.
6. q39 runtime increases substantially.
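Steps 2 to 5 can be illustrated with a toy model. This is a minimal sketch, not DataFusion code: the column names are taken from the plan excerpts above, while the `FUNCTIONAL_DEPENDENCIES` mapping and the `expand_group_keys` helper are hypothetical and list only a few of the dependent columns.

```python
from typing import Dict, List

# Hypothetical, partial mapping: primary key -> columns it determines.
# Only the dependent columns quoted in this issue are listed.
FUNCTIONAL_DEPENDENCIES: Dict[str, List[str]] = {
    "item.i_item_sk": ["item.i_item_id", "item.i_product_name"],
    "warehouse.w_warehouse_sk": ["warehouse.w_gmt_offset"],
}


def expand_group_keys(group_by: List[str]) -> List[str]:
    """Append every column determined by a grouped primary key.

    Models the problematic behavior: dependents are added whether or not
    anything after the aggregate references them.
    """
    expanded = list(group_by)
    for key in group_by:
        for dependent in FUNCTIONAL_DEPENDENCIES.get(key, []):
            if dependent not in expanded:
                expanded.append(dependent)
    return expanded


group_by = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded = expand_group_keys(group_by)
print(f"{len(group_by)} group keys -> {len(expanded)} group keys")
```

With real TPC-DS tables the widening is much larger than in this toy, because every non-key column of `item` and `warehouse` is functionally dependent on its primary key.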

This looks like a planner-level issue rather than a Parquet reader issue: disabling the TPC-DS primary key constraints makes q39 return to the previous timing range.

## Benchmark Results

Environment:

```text
TPC-DS SF10
CPU: 24 Cores
Rounds: 10
Iterations: 1
Parquet pushdown filters: true
Parquet reorder filters: true
Parquet pruning: true
```

With TPC-DS primary key constraints enabled:

```text
q39 current mean: ~8301 ms
```

With TPC-DS primary key constraints disabled for diagnosis:

```text
q39 current total: 14288.69 ms over 10 rounds
q39 current mean:  ~1428.87 ms
geomean current/main: 0.983399
failures: 0
```

So q39 is roughly:

```text
~8301 ms -> ~1429 ms
```

when primary key constraints are removed from the TPC-DS schema setup, a difference of roughly 5.8x.
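For reference, the ratio between the two runs can be recomputed from the figures reported above. The variable names below are only illustrative.

```python
# Figures copied from the benchmark output above.
mean_with_pk_ms = 8301.0          # q39 mean with primary key constraints
total_without_pk_ms = 14288.69    # q39 total over all rounds, constraints disabled
rounds = 10

mean_without_pk_ms = total_without_pk_ms / rounds
slowdown = mean_with_pk_ms / mean_without_pk_ms

print(f"mean without PK constraints: {mean_without_pk_ms:.2f} ms")
print(f"slowdown with PK constraints: {slowdown:.1f}x")  # prints 5.8x
```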

## Expected Behavior

Functional dependency support should allow queries to select columns determined by grouped keys, but aggregate planning should not add unreferenced functionally dependent columns to the physical/logical group keys.

Only columns that are actually referenced after aggregation should appear in the aggregate's grouping keys and output.
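One way to picture the expected behavior is a pruning step over the expanded keys. This is a sketch under assumptions, not a proposed patch: `prune_group_keys` and its arguments are hypothetical, and the real fix may belong in a different planner or optimizer stage.

```python
from typing import List, Set


def prune_group_keys(
    original_keys: List[str],
    expanded_keys: List[str],
    referenced_after_agg: Set[str],
) -> List[str]:
    """Keep the user's GROUP BY keys plus only the dependents used downstream."""
    original = set(original_keys)
    return [
        key
        for key in expanded_keys
        if key in original or key in referenced_after_agg
    ]


original_keys = ["item.i_item_sk", "warehouse.w_warehouse_sk"]
expanded_keys = original_keys + [
    "item.i_item_id",
    "item.i_product_name",
    "warehouse.w_gmt_offset",
]

# Case 1: nothing after the aggregate references a dependent column.
pruned_unreferenced = prune_group_keys(original_keys, expanded_keys, set())

# Case 2: the query selects item.i_item_id, so that dependent must stay.
pruned_with_item_id = prune_group_keys(
    original_keys, expanded_keys, {"item.i_item_id"}
)

print(pruned_unreferenced)
print(pruned_with_item_id)
```

In the first case the aggregate keeps only the two original keys; in the second, functional dependency support still lets the query select `item.i_item_id` without listing it in `GROUP BY`.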

### To Reproduce

Run TPC-DS q39 before and after https://github.com/apache/datafusion/pull/22646 and compare the timings.

### Expected behavior

_No response_

### Additional context

_No response_
