Skip to content

Add minimal APIs / hooks for granular statistics collection in TableProvider implementations#22300

Merged
adriangb merged 11 commits into
apache:mainfrom
pydantic:statistics-request-hooks
May 26, 2026
Merged

Add minimal APIs / hooks for granular statistics collection in TableProvider implementations#22300
adriangb merged 11 commits into
apache:mainfrom
pydantic:statistics-request-hooks

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented May 17, 2026

Which issue does this PR close?

This is not a replacement for #21996 — it is a minimal subset of it, carved out so the feature can be discussed/merged in smaller pieces.

Rationale for this change

#21996 ("Query-aware statistics requests via ScanArgs / ScanResult") is a full vertical slice: new statistics types, request threading optimizer → planner → provider, a built-in RequestStatistics optimizer rule, and a consumer integration (FilePruner / ListingTable).

This PR extracts only the framework hooks — just enough that the rest can be implemented entirely outside of DataFusion. A third party can write their own optimizer rule to derive statistics requests, and their own TableProvider to consume them, without DataFusion shipping any rule or consumer of its own.

In stock DataFusion nothing observable changes: no rule populates the new field, and the built-in providers ignore it.

What changes are included in this PR?

Five small, independently-reviewable commits:

  1. refactor: add TableScanBuilder, deprecate TableScan::try_newTableScan::try_new takes five positional args and bare TableScan { .. } literals are fragile to field additions. Introduce TableScanBuilder (with From<TableScan>), move schema derivation into build(), deprecate try_new (delegates to the builder), migrate all in-tree callers. Pure refactor.
  2. feat: add StatisticsRequest type — a new public StatisticsRequest enum in datafusion-expr-common::statistics (Min/Max/NullCount/DistinctCount/Sum/ByteSize per column, plus RowCount/TotalByteSize). Nothing consumes it yet.
  3. feat: add TableScan::statistics_requests field — an advisory BTreeSet<StatisticsRequest> on TableScan. A set so request-deriving optimizer rules stay idempotent under fixpoint iteration (a rule inserts its requests; re-running is a no-op and composes with other rules — no per-rule dedup). Empty by default; DataFusion's own rules never populate it.
  4. feat: thread statistics requests into ScanArgsScanArgs gains statistics_requests; the physical planner threads TableScan::statistics_requests into it so the request reaches TableProvider::scan_with_args.
  5. test: e2e statistics-request flow via a custom optimizer rule — an integration test playing both external roles.

Deliberately left out vs #21996, since this PR is request-side only: the built-in RequestStatistics optimizer rule, the FilePruner / ListingTable consumer integration, the PartitionedFile::satisfied_stats per-file response field, and the response-side types (StatisticsValue / SatisfiedStatistics). Those belong with whatever actually wires the response side.

Are these changes tested?

Yes:

  • datafusion/core/tests/user_defined/statistics_requests.rs: an end-to-end integration test where a custom OptimizerRule annotates TableScan and a custom TableProvider asserts the requests reach scan_with_args — plus a test that without such a rule the provider sees an empty request list.
  • All existing datafusion-expr / datafusion-optimizer / datafusion-proto tests pass against the TableScanBuilder refactor.

Are there any user-facing changes?

Yes — this needs the api change label:

  • New public type StatisticsRequest (re-exported via datafusion_expr::statistics).
  • New TableScanBuilder; TableScan::try_new is deprecated (still works, delegates to the builder).
  • TableScan gains a new public field statistics_requests — this breaks exhaustive TableScan { .. } struct literals downstream (the recommended fix is TableScanBuilder).
  • ScanArgs gains with_statistics_requests / statistics_requests.

🤖 Generated with Claude Code

@github-actions github-actions Bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate catalog Related to the catalog crate proto Related to proto crate labels May 17, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 17, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  98.843s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.034s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  95.325s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.039s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.615s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 196.839s] datafusion
    Building datafusion-catalog v53.1.0 (current)
       Built [  36.460s] (current)
     Parsing datafusion-catalog v53.1.0 (current)
      Parsed [   0.027s] (current)
    Building datafusion-catalog v53.1.0 (baseline)
       Built [  37.035s] (baseline)
     Parsing datafusion-catalog v53.1.0 (baseline)
      Parsed [   0.026s] (baseline)
    Checking datafusion-catalog v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.141s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  75.534s] datafusion-catalog
    Building datafusion-expr v53.1.0 (current)
       Built [  26.463s] (current)
     Parsing datafusion-expr v53.1.0 (current)
      Parsed [   0.072s] (current)
    Building datafusion-expr v53.1.0 (baseline)
       Built [  26.425s] (baseline)
     Parsing datafusion-expr v53.1.0 (baseline)
      Parsed [   0.072s] (baseline)
    Checking datafusion-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.222s] 222 checks: 220 pass, 2 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2796
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2796

--- failure type_method_marked_deprecated: type method #[deprecated] added ---

Description:
A type method is now #[deprecated]. Downstream crates will get a compiler warning when using this method.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-deprecated-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/type_method_marked_deprecated.ron

Failed in:
  method datafusion_expr::logical_plan::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2872
  method datafusion_expr::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2872

     Summary semver requires new major version: 1 major and 1 minor checks failed
    Finished [  55.647s] datafusion-expr
    Building datafusion-expr-common v53.1.0 (current)
       Built [  18.826s] (current)
     Parsing datafusion-expr-common v53.1.0 (current)
      Parsed [   0.017s] (current)
    Building datafusion-expr-common v53.1.0 (baseline)
       Built [  18.595s] (baseline)
     Parsing datafusion-expr-common v53.1.0 (baseline)
      Parsed [   0.018s] (baseline)
    Checking datafusion-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.209s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  38.825s] datafusion-expr-common
    Building datafusion-optimizer v53.1.0 (current)
       Built [  26.075s] (current)
     Parsing datafusion-optimizer v53.1.0 (current)
      Parsed [   0.026s] (current)
    Building datafusion-optimizer v53.1.0 (baseline)
       Built [  26.581s] (baseline)
     Parsing datafusion-optimizer v53.1.0 (baseline)
      Parsed [   0.029s] (baseline)
    Checking datafusion-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.172s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  53.904s] datafusion-optimizer
    Building datafusion-proto v53.1.0 (current)
       Built [  53.768s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.132s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  53.670s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.136s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.667s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 111.506s] datafusion-proto

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 17, 2026
@adriangb adriangb added the api change Changes the API exposed to users of the crate label May 17, 2026
@adriangb adriangb force-pushed the statistics-request-hooks branch from 0ef815a to e2838e8 Compare May 17, 2026 15:40
@adriangb adriangb requested a review from alamb May 17, 2026 15:48
@adriangb
Copy link
Copy Markdown
Contributor Author

cc @asolimando

@adriangb adriangb force-pushed the statistics-request-hooks branch 2 times, most recently from 070700d to ae549d8 Compare May 17, 2026 16:26
@adriangb adriangb force-pushed the statistics-request-hooks branch 2 times, most recently from 9c6bbc9 to d20c5aa Compare May 17, 2026 16:41
@adriangb adriangb marked this pull request as ready for review May 17, 2026 17:11
@adriangb adriangb changed the title Minimal query-aware statistics request hooks (extracted from #21996) Add minimal APIs / hooks for granular statistics collection in TableProvider implementations May 17, 2026
adriangb and others added 5 commits May 25, 2026 10:15
`TableScan::try_new` takes five positional arguments and bare
`TableScan { .. }` struct literals are scattered across the codebase,
making both fragile to field additions.

Introduce `TableScanBuilder` (with `From<TableScan>`, so an existing
scan can be decomposed, tweaked, and rebuilt) and move the
schema-derivation logic into `build()`. `TableScan::try_new` is now
deprecated and delegates to the builder; all in-tree callers are
migrated to the builder. Pure refactor, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add `StatisticsRequest` to `datafusion-expr-common::statistics` — a small
vocabulary for query-aware statistics: a caller can ask a provider for a
specific statistic (Min/Max/NullCount/DistinctCount/Sum/ByteSize per
column, plus RowCount and TotalByteSize) instead of for a dense
`Statistics` covering every column.

It derives `Ord` so requests can be held in a deduplicating,
deterministically-ordered `BTreeSet` (see the next commit).

It is intentionally just a vocabulary; nothing in DataFusion populates or
consumes it yet. Re-exported via `datafusion_expr::statistics`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an advisory `statistics_requests: BTreeSet<StatisticsRequest>` field
to `TableScan`. A custom optimizer rule can record the statistics the
surrounding plan shape would benefit from (e.g. Min/Max for sort keys);
the physical planner threads them into the table provider (next commit).

A `BTreeSet` rather than a `Vec`: the optimizer runs rules to a fixpoint,
so the request collection must be idempotent under re-derivation. With a
set a rule simply `insert`s its requests — re-running is a no-op and
composes with other rules — instead of every rule having to dedupe a
`Vec` itself. `BTree` ordering also keeps plans deterministic.

The field is empty by default and DataFusion's own rules never populate
it. It can be set when building a scan via
`TableScanBuilder::with_statistics_requests`, or mutated directly (it is
`pub`, like `TableScan`'s other fields). `Debug`/`PartialEq`/`Eq`/`Hash`/
`PartialOrd` for `TableScan` are left unchanged — it is advisory
metadata, not part of plan identity.

`map_expressions` in `tree_node.rs` is rewritten to rebuild `TableScan`
via `..scan` instead of an exhaustive destructure, so it carries this
(and any future) field through untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a `statistics_requests` field to `ScanArgs` (with
`with_statistics_requests` / `statistics_requests` accessors) and have
the physical planner thread `TableScan::statistics_requests` into it.

This completes the request-side path: a custom optimizer rule annotates
`TableScan`, and the request reaches a custom `TableProvider` in
`scan_with_args`. DataFusion's own providers ignore the field; the
default `ScanArgs` value is an empty slice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a self-contained integration test that plays both external roles:
a custom `OptimizerRule` annotates each `TableScan` with
`StatisticsRequest`s, and a custom `TableProvider` records the
`ScanArgs::statistics_requests` it receives in `scan_with_args`.

This demonstrates the request-side hooks are sufficient to build the
feature entirely outside of DataFusion. A second test confirms that
without such a rule the provider sees an empty request list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the statistics-request-hooks branch from d20c5aa to cd2202e Compare May 25, 2026 15:35
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense to me -- thank you @adriangb

I left some comments about the docstrings and maybe bettr APIs, but I don't think any of them are necessary

Comment thread datafusion/catalog/src/table.rs Outdated
Comment thread datafusion/catalog/src/table.rs Outdated
/// Set the statistics the caller would like the provider to answer for
/// this scan, if it can do so cheaply.
///
/// Providers read these via [`Self::statistics_requests()`]; anything a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What i the Provider here? Do you mean "Table Provider"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread datafusion/catalog/src/table.rs
}
}

// ---------------------------------------------------------------------------
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment should be made into a real /// comment and folded into the doc strings of StatisticsRequests

By being here it is likely invisible / harder to find I think

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub enum StatisticsRequest {
/// Smallest non-null value of `column`.
Min(Column),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A column has a bunch of owned strings: https://github.com/alamb/datafusion/blob/0181c7906b47f9b17fbde92bf30b2d939a80230c/datafusion/common/src/column.rs#L30-L37

What do you think about making this cheaper by making it take &Column or Arc<Column> rather than Column?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

) -> Result<Self> {
let table_scan =
TableScan::try_new(table_name, table_source, projection, filters, fetch)?;
let table_scan = TableScanBuilder::new(table_name, table_source)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice new API

Comment thread datafusion/expr/src/logical_plan/plan.rs Outdated
}
}

/// Builder for [`TableScan`].
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very nice

projected_schema,
filters,
fetch,
LogicalPlan::TableScan(mut scan) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the older code was easier to verify that it didn't carry along any other fields with expressions -- why change it here?

Specificially, I worry if we ever add another field to Filter then we would be more likely to miss this callsite

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay will revert it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//! `TableScan` with `StatisticsRequest`s and have them reach a *custom*
//! `TableProvider`'s `scan_with_args`.
//!
//! DataFusion ships no rule that populates `TableScan::statistics_requests`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

adriangb and others added 5 commits May 26, 2026 13:18
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Restores the explicit field destructuring for the LogicalPlan::TableScan
arm of map_expressions instead of the `mut scan` + `..scan` shorthand.
Explicit destructuring forces a compile error if a new field carrying
expressions is added to TableScan, so this callsite cannot be silently
missed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves an ambiguity raised in review: the doc referred to "the
provider"/"Providers" without saying which provider. Use `TableProvider`
consistently, and fix a typo ("for to allow") and trailing whitespace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Query-aware statistics requests" explanation lived in a `//` block
comment above the enum, where it is invisible in rustdoc. Move it into
the enum's `///` doc comment so it is discoverable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A Column owns its relation/name strings, so cloning a StatisticsRequest
cloned those strings. The requests live in a BTreeSet<StatisticsRequest>
on TableScan that is cloned with the plan throughout optimization, so
wrap the per-column payload in Arc<Column> to make those clones cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Leftover trailing whitespace from a previously applied review suggestion;
removed by cargo fmt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb enabled auto-merge May 26, 2026 18:36
@adriangb
Copy link
Copy Markdown
Contributor Author

Thanks for the review @alamb!

@adriangb adriangb added this pull request to the merge queue May 26, 2026
Merged via the queue into apache:main with commit 633595d May 26, 2026
38 checks passed
@adriangb adriangb deleted the statistics-request-hooks branch May 26, 2026 19:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate auto detected api change Auto detected API change catalog Related to the catalog crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants