Migrate to new stats falsification rules#8345
Conversation
## Summary Adds docs for `StatsRewriteRule` and its functions. Can be considered a follow-up for #8345, but can be individually merged. Signed-off-by: Adam Gutglick <adam@spiraldb.com>
033c73f to
5ed375f
Compare
Port file pruning to session stats rewrites Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
5ed375f to
e8dd011
Compare
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | take_10k_random |
197.9 µs | 255.5 µs | -22.53% |
| ❌ | Simulation | take_10k_contiguous |
218.5 µs | 275.6 µs | -20.72% |
| ❌ | Simulation | patched_take_10k_contiguous_patches |
232.2 µs | 290.9 µs | -20.17% |
| ❌ | Simulation | patched_take_10k_random |
244.2 µs | 303.4 µs | -19.51% |
| ❌ | Simulation | chunked_varbinview_opt_canonical_into[(1000, 10)] |
178 µs | 213.2 µs | -16.53% |
| ❌ | Simulation | chunked_varbinview_opt_into_canonical[(1000, 10)] |
193.4 µs | 229.3 µs | -15.67% |
| ❌ | Simulation | decompress_rd[f64, (100000, 0.01)] |
890.6 µs | 1,024.5 µs | -13.07% |
| ❌ | Simulation | decompress_rd[f64, (100000, 0.1)] |
890.6 µs | 1,024.5 µs | -13.07% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] |
198.5 µs | 162.7 µs | +21.99% |
| ⚡ | Simulation | varbinview_large |
130.4 µs | 112.6 µs | +15.81% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(100, 100)] |
308.7 µs | 273.6 µs | +12.81% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] |
367.7 µs | 332.5 µs | +10.57% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ngates/public-stats-rewrite-rules (3e7c2d6) with develop (98d4a6a)
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
| use crate::scalar::Scalar; | ||
| use crate::scalar_fn::fns::stat::StatFn; | ||
|
|
||
| /// A target that can bind abstract statistics to concrete expressions. |
There was a problem hiding this comment.
What does it mean to bind a statistic to an expression? What makes statistics abstract?
There was a problem hiding this comment.
What does it mean to bind a statistic to an expression? What makes statistics abstract?
Yeah stats stuff I am intermingling with an expressions change, so I think the "stat" expression becomes explicitly an "Expression::Placeholder", meaning it must be replaced prior to execution.
It's the same logic as our StatCatalog. Except the current StatCatalog means you have to re-run falsification over the entire expression any time your stats come from a different place, e.g. FileStats vs ZoneMap vs ArrayStats.
Here you take the falsified expression, then "bind" the stats from wherever you get them from.
| @@ -115,14 +119,52 @@ impl FileStatsLayoutReader { | |||
| Ok(result.as_bool().value() == Some(true)) | |||
| refs | ||
| } | ||
|
|
||
| fn collect_referenced_stat_field_names(expr: &Expression, refs: &mut HashSet<FieldName>) { |
There was a problem hiding this comment.
we have Node implemented for Expression, you should be able to visit the tree using that instead of hand-rolling the recursion
| } | ||
| } | ||
|
|
||
| fn bool_literal(expr: &Expression) -> Option<Option<bool>> { |
| return Ok(None); | ||
| }; | ||
| let required_stats = filter_required_stats(&lowered, binder.required_stats); | ||
| if required_stats.map().is_empty() && !matches!(bool_literal(&lowered), Some(Some(true))) { |
There was a problem hiding this comment.
maybe add a comment explaining this if statement?
| available_stats, | ||
| required_stats: Relation::new(), | ||
| }; | ||
| let Some(lowered) = bind_stats(predicate, &mut binder)? else { |
There was a problem hiding this comment.
Maybe I'm just stupid - but I think both "lowering" and "bindings" are worth defining and explaining somewhere.
|
I think I mostly have a lot of questions 😅 |
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Polar Signals Profiling ResultsLatest Run
Previous Runs (2)
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 1.009x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (1.009x ➖, 0↑ 2↓)
No file size changes detected. |
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.943x ➖, 2↑ 0↓)
datafusion / vortex-compact (0.938x ➖, 1↑ 0↓)
datafusion / parquet (0.945x ➖, 2↑ 0↓)
datafusion / arrow (0.938x ➖, 3↑ 0↓)
duckdb / vortex-file-compressed (0.954x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.954x ➖, 1↑ 1↓)
duckdb / parquet (0.945x ➖, 4↑ 0↓)
duckdb / duckdb (0.963x ➖, 0↑ 0↓)
File Size Changes (10 files changed, +2.2% overall, 9↑ 1↓)
Totals:
|
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.000x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.000x ➖, 1↑ 0↓)
datafusion / parquet (1.013x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.946x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.042x ➖, 0↑ 2↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
File Size Changes (2 files changed, -0.1% overall, 0↑ 2↓)
Totals:
|
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
| @@ -0,0 +1,199 @@ | |||
| // SPDX-License-Identifier: Apache-2.0 | |||
|
I think a few commits from other branches made it here by mistake |
joseph-isaacs
left a comment
There was a problem hiding this comment.
Almost there. I trust the other matches can we fuzz before merging.
Also needs to comments
|
|
||
| Vortex uses statistics to prove when a filter cannot match a row group, zone, or | ||
| file. The proof expression returns `true` when the input can be skipped. It | ||
| returns `false` or `null` when pruning is not proven. |
There was a problem hiding this comment.
I think its just easier to map nulls to nulls
| #[derive(Clone)] | ||
| pub struct Binary; | ||
|
|
||
| fn simplify_and(lhs: &Expression, rhs: &Expression) -> Option<Expression> { |
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
| @@ -0,0 +1,199 @@ | |||
| // SPDX-License-Identifier: Apache-2.0 | |||
There was a problem hiding this comment.
The rebase seems wrong?
| `true`, the file stats reader can return an all-false pruning mask without | ||
| reading child layouts. | ||
|
|
||
| Scan planning uses `checked_pruning_expr` to lower a falsified expression against |
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Switch over to using the new rewrite rule registry from the session, instead of using the scalar fn vtable.