Merged
30 commits
5b6a862
Make stats rewrite rules public
gatesn Jun 10, 2026
e8dd011
Centralize stat expression binding
gatesn Jun 11, 2026
c1d6d94
Fix file stats binding for computed expressions
gatesn Jun 11, 2026
468ebfd
Fuse checked pruning stats rewrites
gatesn Jun 11, 2026
cd48674
Simplify stats binding null handling
gatesn Jun 11, 2026
52fd443
Install Java toolchain in CI
gatesn Jun 11, 2026
4293838
Remove legacy stat falsification hooks
gatesn Jun 11, 2026
ffbccee
Remove legacy stat expression hooks
gatesn Jun 11, 2026
1c81fb4
Merge remote-tracking branch 'origin/develop' into ngates/public-stat…
gatesn Jun 16, 2026
fe12e71
Address stats binding review comments
gatesn Jun 16, 2026
0de2c09
Bind abstract aggregate stats to legacy counts
gatesn Jun 17, 2026
118c995
Document aggregate stat binding coverage
gatesn Jun 17, 2026
32241d2
Make legacy stat aggregate binding explicit
gatesn Jun 17, 2026
360aa58
Make stats binders immutable
gatesn Jun 17, 2026
ec94e69
Remove required stats pruning binder
gatesn Jun 17, 2026
fff66b6
Restore session falsification tests
gatesn Jun 17, 2026
25e9400
Bind only direct stat aggregates
gatesn Jun 17, 2026
b222454
Split NaN stat rewrite proofs
gatesn Jun 17, 2026
e82764f
Use concrete stat terminology
gatesn Jun 17, 2026
619706f
Inline stats rewrite proof variants
gatesn Jun 17, 2026
9a7c766
Split stats rewrite rule structs
gatesn Jun 17, 2026
91fc94f
Clean up stats rewrite proof helpers
gatesn Jun 17, 2026
22cd9de
Preserve DuckDB filter order
gatesn Jun 17, 2026
d81033e
Restore integer Delta compression scheme
gatesn Jun 17, 2026
baae2dd
Address stats pruning review comments
gatesn Jun 18, 2026
db78444
Merge remote-tracking branch 'origin/develop' into ngates/public-stat…
gatesn Jun 18, 2026
9a737c9
Address follow-up stats rewrite review comments
gatesn Jun 18, 2026
83b1332
Localize legacy stat aggregate binding
gatesn Jun 18, 2026
4dac595
Fix DuckDB projection test compile
gatesn Jun 18, 2026
3e7c2d6
Keep stats binding free of reduction
gatesn Jun 18, 2026
3 changes: 2 additions & 1 deletion docs/developer-guide/index.md
@@ -24,6 +24,7 @@ internals/session
internals/async-runtime
internals/vtables
internals/execution
internals/stats-pruning
internals/io
internals/serialization
internals/cuda
@@ -38,4 +39,4 @@ caption: Integrations
integrations/datafusion
integrations/duckdb
integrations/spark
```
```
39 changes: 39 additions & 0 deletions docs/developer-guide/internals/stats-pruning.md
@@ -0,0 +1,39 @@
# Stats Pruning

Vortex uses statistics to prove when a filter cannot match a row group, zone, or
file. The proof expression returns `true` when the input can be skipped. It
returns `false` or `null` when pruning is not proven.

Contributor


why both?

Contributor


I think it's just easier to map nulls to nulls


Both `false` and `null` are non-pruning outcomes, but they mean different
things. `false` means the available stats disproved the skip proof. `null` means
the proof was unknown, usually because a required stat was missing or inexact.
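
The three outcomes can be illustrated with a minimal sketch. It assumes the proof result is modelled as an `Option<bool>`, with `None` standing in for `null`. `PruneOutcome`, `classify`, and `can_skip` are hypothetical names for illustration and are not part of Vortex.

```rust
/// Hypothetical classification of a proof result; not a Vortex type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PruneOutcome {
    /// The proof evaluated to `true`: the scope can be skipped.
    Skip,
    /// The proof evaluated to `false`: the stats disproved the skip proof.
    Disproved,
    /// The proof evaluated to `null`: a required stat was missing or inexact.
    Unknown,
}

/// Classify a nullable boolean proof result, modelled here as `Option<bool>`.
pub fn classify(proof: Option<bool>) -> PruneOutcome {
    match proof {
        Some(true) => PruneOutcome::Skip,
        Some(false) => PruneOutcome::Disproved,
        None => PruneOutcome::Unknown,
    }
}

/// Only a non-null `true` permits skipping; the other two outcomes must scan.
pub fn can_skip(proof: Option<bool>) -> bool {
    classify(proof) == PruneOutcome::Skip
}
```

`Disproved` and `Unknown` lead to the same action (scan the data), but keeping them distinct makes it possible to tell a failed proof apart from a missing stat.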

The pruning pipeline has two phases:

1. `Expression::falsify(scope, session)` asks the session's
`StatsRewriteRule`s to rewrite a filter into an abstract proof expression.
Rules describe semantics in terms of `vortex.stat(input, aggregate_fn)`
placeholders. These placeholders name the statistic needed by the proof, but
not where that statistic is stored.
2. `bind_stats` lowers those abstract stat placeholders with a `StatBinder`.
The binder maps stats to the representation used by the caller, such as
zone-map table fields, file-level stat literals, or typed null literals for
missing stats.
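
The two phases can be sketched with a toy model. This is not the Vortex API: `AbstractProof`, `BoundProof`, `ToyBinder`, `LiteralBinder`, `falsify_gt`, `bind`, and `evaluate` are all hypothetical names, and the only rule shown is the assumed rewrite of `column > bound` into `max(column) <= bound`.

```rust
use std::collections::HashMap;

/// Hypothetical aggregate kinds a proof may request.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Agg {
    Min,
    Max,
}

/// Abstract proof `stat(column, agg) <= bound`: names a stat, not its storage.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AbstractProof {
    pub column: String,
    pub agg: Agg,
    pub bound: i64,
}

/// Bound proof: the placeholder has been replaced by a nullable literal.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BoundProof {
    pub stat_value: Option<i64>,
    pub bound: i64,
}

/// Phase 1 (toy): `column > bound` cannot match if `max(column) <= bound`.
pub fn falsify_gt(column: &str, bound: i64) -> AbstractProof {
    AbstractProof { column: column.to_string(), agg: Agg::Max, bound }
}

/// Hypothetical binder interface: resolve a stat, or `None` if it is missing.
pub trait ToyBinder {
    fn lookup(&self, column: &str, agg: Agg) -> Option<i64>;
}

/// A binder backed by literal values, loosely analogous to file-level stats.
pub struct LiteralBinder(pub HashMap<(String, Agg), i64>);

impl ToyBinder for LiteralBinder {
    fn lookup(&self, column: &str, agg: Agg) -> Option<i64> {
        self.0.get(&(column.to_string(), agg)).copied()
    }
}

/// Phase 2 (toy): lower the placeholder; a missing stat becomes a null literal.
pub fn bind(proof: &AbstractProof, binder: &dyn ToyBinder) -> BoundProof {
    BoundProof {
        stat_value: binder.lookup(&proof.column, proof.agg),
        bound: proof.bound,
    }
}

/// Evaluate the bound proof; `None` models a null (unknown) result.
pub fn evaluate(proof: &BoundProof) -> Option<bool> {
    proof.stat_value.map(|v| v <= proof.bound)
}
```

Because `falsify_gt` never consults the binder, the same abstract proof can be lowered against different stat sources by swapping the `ToyBinder` implementation.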

Missing stats lower to typed null literals. This preserves the three-valued
logic used by pruning: only a non-null `true` value proves that the scope can be
skipped. A missing stat therefore cannot accidentally prune data.
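
As a rough illustration of why a null stat cannot prune by accident, the sketch below assumes Kleene-style three-valued semantics for `AND` and `OR`, with `None` modelling `null`. The text above only says "three-valued logic", so the exact operator semantics are an assumption, and `kleene_and`, `kleene_or`, `max_le`, and `min_ge` are hypothetical helpers rather than Vortex functions.

```rust
/// Kleene AND over nullable booleans, with `None` standing in for null.
pub fn kleene_and(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(false), _) | (_, Some(false)) => Some(false),
        (Some(true), Some(true)) => Some(true),
        _ => None,
    }
}

/// Kleene OR over nullable booleans, with `None` standing in for null.
pub fn kleene_or(a: Option<bool>, b: Option<bool>) -> Option<bool> {
    match (a, b) {
        (Some(true), _) | (_, Some(true)) => Some(true),
        (Some(false), Some(false)) => Some(false),
        _ => None,
    }
}

/// Toy proof term `max <= bound`; a missing max yields null.
pub fn max_le(max: Option<i64>, bound: i64) -> Option<bool> {
    max.map(|m| m <= bound)
}

/// Toy proof term `min >= bound`; a missing min yields null.
pub fn min_ge(min: Option<i64>, bound: i64) -> Option<bool> {
    min.map(|m| m >= bound)
}
```

Under these assumptions, a filter such as `x > 10 AND y < 0` would have the proof `max(x) <= 10 OR min(y) >= 0`. If `max(x)` is missing, `kleene_or(max_le(None, 10), min_ge(Some(5), 0))` is still `Some(true)` because `y` alone proves the skip, whereas `kleene_or(max_le(None, 10), min_ge(Some(-1), 0))` is `None` and nothing is pruned.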

## Binding Targets

Zone maps bind stats to fields in their per-zone stats table. The lowered
expression is evaluated against that table and produces a mask where `true`
means the zone can be skipped.
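
A minimal sketch of the zone-map case, assuming a per-zone stats table for a single integer column and the toy proof `max(x) <= bound` for the filter `x > bound`. `ZoneStats` and `skip_mask_gt` are hypothetical and do not reflect the actual Vortex table layout.

```rust
/// One row of a hypothetical per-zone stats table for a single column `x`.
#[derive(Debug, Clone, Copy)]
pub struct ZoneStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Evaluate the proof for `x > bound` against every zone.
/// `true` in the result means the zone can be skipped; a missing `max`
/// makes the proof null, which is treated as "do not skip".
pub fn skip_mask_gt(zones: &[ZoneStats], bound: i64) -> Vec<bool> {
    zones
        .iter()
        .map(|zone| matches!(zone.max, Some(max) if max <= bound))
        .collect()
}
```

The mask has one entry per zone, so a single evaluation over the stats table decides all zones at once.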

File-level stats bind stats to literal values from the file footer. The lowered
expression is reduced and evaluated once for the full file. If it evaluates to
`true`, the file stats reader can return an all-false pruning mask without
reading child layouts.
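
The file-level short circuit might look like the following sketch. It assumes the proof has already been reduced to a nullable boolean and that the pruning mask is a plain `Vec<bool>` with one entry per row; `file_pruning_mask` is a hypothetical helper, not the file stats reader's real interface.

```rust
/// Decide the file-level outcome from a proof that was evaluated once.
///
/// Returns `Some(all-false mask)` when the proof is a non-null `true`, so the
/// caller can answer without reading child layouts. Returns `None` when the
/// proof is `false` or null, meaning the caller must delegate to the children.
pub fn file_pruning_mask(proof: Option<bool>, row_count: usize) -> Option<Vec<bool>> {
    if proof == Some(true) {
        Some(vec![false; row_count])
    } else {
        None
    }
}
```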

For the layout model around these pruning points, see
[Layouts](../../concepts/layouts.md) and [Scanning](../../concepts/scanning.md).
46 changes: 0 additions & 46 deletions vortex-array/src/expr/expression.rs
@@ -15,9 +15,7 @@ use vortex_error::vortex_ensure;
use vortex_session::VortexSession;

use crate::dtype::DType;
use crate::expr::StatsCatalog;
use crate::expr::display::DisplayTreeExpr;
use crate::expr::stats::Stat;
use crate::scalar_fn::ScalarFnRef;
use crate::scalar_fn::fns::root::Root;

@@ -114,28 +112,6 @@ impl Expression {
self.scalar_fn.validity(self)
}

/// An expression over zone-statistics which implies all records in the zone evaluate to false.
///
/// Given an expression, `e`, if `e.stat_falsification(..)` evaluates to true, it is guaranteed
/// that `e` evaluates to false on all records in the zone. However, the inverse is not
/// necessarily true: even if the falsification evaluates to false, `e` need not evaluate to
/// true on all records.
///
/// The [`StatsCatalog`] can be used to constrain or rename stats used in the final expr.
///
/// # Examples
///
/// - An expression over one variable: `x > 0` is false for all records in a zone if the maximum
/// value of the column `x` in that zone is less than or equal to zero: `max(x) <= 0`.
/// - An expression over two variables: `x > y` becomes `max(x) <= min(y)`.
/// - A conjunctive expression: `x > y AND z < x` becomes `max(x) <= min(y) OR min(z) >= max(x).
///
/// Some expressions, in theory, have falsifications but this function does not support them
/// such as `x < (y < z)` or `x LIKE "needle%"`.
pub fn stat_falsification(&self, catalog: &dyn StatsCatalog) -> Option<Expression> {
self.scalar_fn().stat_falsification(self, catalog)
}

/// Returns an expression that proves this predicate is definitely false from stats.
///
/// `scope` is the dtype of the row this expression evaluates over.
@@ -164,28 +140,6 @@ impl Expression {
crate::stats::rewrite::StatsRewriteCtx::new(session, scope).satisfy(self)
}

/// Returns an expression representing the zoned statistic for the given stat, if available.
///
/// The [`StatsCatalog`] returns expressions that can be evaluated using the zone map as a
/// scope. Expressions can implement this function to propagate such statistics through the
/// expression tree. For example, the `a + 10` expression could propagate `min: min(a) + 10`.
///
/// NOTE(gatesn): we currently cannot represent statistics over nested fields. Please file an
/// issue to discuss a solution to this.
pub fn stat_expression(&self, stat: Stat, catalog: &dyn StatsCatalog) -> Option<Expression> {
self.scalar_fn().stat_expression(self, stat, catalog)
}

/// Returns an expression representing the zoned maximum statistic, if available.
pub fn stat_min(&self, catalog: &dyn StatsCatalog) -> Option<Expression> {
self.stat_expression(Stat::Min, catalog)
}

/// Returns an expression representing the zoned maximum statistic, if available.
pub fn stat_max(&self, catalog: &dyn StatsCatalog) -> Option<Expression> {
self.stat_expression(Stat::Max, catalog)
}

/// Format the expression as a compact string.
///
/// Since this is a recursive formatter, it is exposed on the public Expression type.
2 changes: 0 additions & 2 deletions vortex-array/src/expr/mod.rs
@@ -34,15 +34,13 @@ pub(crate) mod field;
pub mod forms;
mod optimize;
pub mod proto;
pub mod pruning;
pub mod stats;
pub mod transform;
pub mod traversal;

pub use analysis::*;
pub use expression::*;
pub use exprs::*;
pub use pruning::StatsCatalog;

pub trait VortexExprExt {
/// Accumulate all field references from this expression and its children in a set
27 changes: 0 additions & 27 deletions vortex-array/src/expr/pruning/mod.rs

This file was deleted.
