feat(cbo): expose ColumnStatistics with conservative NDV from zonemap by LuciferYang · Pull Request #511 · lance-format/lance-spark

LuciferYang · 2026-05-08T02:51:49Z

Closes #510.

Summary

New ColumnStatsAggregator reduces per-zone ZoneStats into DSv2 ColumnStatistics with min / max / nullCount + a conservative NDV when every non-null zone has min == max.
LanceStatistics.columnStats() exposes the aggregated map to Spark's CBO via the standard Statistics.columnStats() API.
LanceScanBuilder loads zonemap stats for projected zonemap-indexed columns up to lance.cbo.column.stats.max.columns (default 8). Filter and partition columns are always loaded — they already drive fragment pruning.
Master kill-switch (spark.lance.cbo.column.stats.enabled, default true) + per-scan option for safe rollback. Both flow through parseTypedFlags so catalog-level configs propagate the same way as fix(spark): gss initiate failed on hms executors; spark.sql.catalog read options not applied #476.

Configuration

SparkConf	Per-scan option	Default	Purpose
`spark.lance.cbo.column.stats.enabled`	`lance.cbo.column.stats.enabled`	`true`	Master kill-switch. Disables `columnStats()` reporting entirely.
`spark.lance.cbo.column.stats.max.columns`	`lance.cbo.column.stats.max.columns`	`8`	Cap on zonemap-indexed columns whose stats are loaded at plan time. Filter and partition columns are always loaded — the cap only bounds extra projection-driven I/O.

Per-scan options win over SparkConf. Master switch off restores prior behavior (no columnStats() reported).

feat(sql): support USING zonemap with distributed and consolidated build paths #513 — adds USING zonemap as a recognized CREATE INDEX method. Not a strict prerequisite (this PR's tests don't depend on it), but pairing them lets users opt into a zonemap-only index for CBO column stats without paying btree's index-build cost.

What's NOT in this PR (follow-ups)

avgLen / maxLen for variable-length types — separate issue.
Sketch-based NDV on unsorted columns — needs a Lance-core RFC.
ANALYZE TABLE and persistent column statistics — separate follow-up PR; depends on this one.

Test plan

make test SPARK_VERSION=3.5 SCALA_VERSION=2.12 — ColumnStatsAggregatorTest (17), LanceStatisticsTest (19), LanceSparkReadOptionsSerializationTest (8) all pass.
make test SPARK_VERSION=3.4 SCALA_VERSION=2.12 — same suite passes.
make test SPARK_VERSION=4.0 SCALA_VERSION=2.13 — same suite passes.
make test SPARK_VERSION=4.1 SCALA_VERSION=2.13 — same suite passes.
make lint — Checkstyle + Spotless clean.
Manual: query a Lance table with EXPLAIN COST and confirm columnStats lines appear under the Statistics block when zonemap indexes exist on filter columns.
Manual: set spark.lance.cbo.column.stats.enabled=false and confirm columnStats disappears (kill-switch verification).

Risks

Adds a plan-time per-column zonemap-stats fetch on top of the filter-column loading that already happens. Bounded by max.columns cap (default 8). On wide tables with no filter columns, this is the only added I/O.
NDV path is precondition-gated and returns null on failure — never poisons CBO with a guessed value.

github-actions · 2026-05-08T02:52:07Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

LuciferYang · 2026-05-08T03:20:40Z

Still a draft.

ColumnStatsAggregator reduces per-zone ZoneStats to DSv2 ColumnStatistics with min/max/nullCount + a conservative NDV when every non-null zone has min == max. LanceStatistics.columnStats() exposes the aggregated map. LanceScanBuilder loads zonemap stats for projected zonemap-indexed columns up to a configurable cap (default 8). The conservative NDV path produces an exact NDV when applicable (every zone single-valued); returns null otherwise so Spark CBO falls back to row-count heuristics rather than guess. Most useful for sorted/clustered low-cardinality columns (e.g. d_year, ca_state) — no extra I/O on top of the zonemap load. Configuration (per-scan option / SparkConf key): - lance.cbo.column.stats.enabled / spark.lance.cbo.column.stats.enabled (default true) - lance.cbo.column.stats.max.columns / spark.lance.cbo.column.stats.max.columns (default 8) The cap default (8) is a conservative starting point that bounds plan-time zonemap I/O on wide-projection queries. Users can raise it for deeper CBO coverage or set the master switch to false for a no-op rollback.

github-actions Bot added the enhancement New feature or request label May 8, 2026

LuciferYang force-pushed the cbo-phase1-3a branch from 10bf1ae to 201bb7e Compare May 8, 2026 02:56

LuciferYang marked this pull request as draft May 8, 2026 02:57

LuciferYang mentioned this pull request May 8, 2026

Enable Spark CBO with column statistics from zonemap #510

Open

LuciferYang changed the title ~~feat(cbo): Phase 1 ColumnStatistics + Phase 3a NDV from zonemap stats~~ feat(cbo): expose ColumnStatistics with conservative NDV from zonemap May 8, 2026

LuciferYang force-pushed the cbo-phase1-3a branch from 201bb7e to cada2a5 Compare May 8, 2026 03:06

LuciferYang force-pushed the cbo-phase1-3a branch 2 times, most recently from fbcfd0a to d8eeb13 Compare May 8, 2026 03:26

LuciferYang force-pushed the cbo-phase1-3a branch from d8eeb13 to edf600a Compare May 8, 2026 03:29

LuciferYang mentioned this pull request May 8, 2026

Distributed scalar-index build races on per-fragment zonemap.lance #514

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cbo): expose ColumnStatistics with conservative NDV from zonemap#511

feat(cbo): expose ColumnStatistics with conservative NDV from zonemap#511
LuciferYang wants to merge 1 commit into
lance-format:mainfrom
LuciferYang:cbo-phase1-3a

LuciferYang commented May 8, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 8, 2026

Uh oh!

LuciferYang commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

LuciferYang commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Configuration

Related

What's NOT in this PR (follow-ups)

Test plan

Risks

Uh oh!

github-actions Bot commented May 8, 2026

Uh oh!

LuciferYang commented May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LuciferYang commented May 8, 2026 •

edited

Loading