feat(cbo): expose ColumnStatistics with conservative NDV from zonemap#511

Draft
LuciferYang wants to merge 1 commit into lance-format:main from LuciferYang:cbo-phase1-3a

Conversation


@LuciferYang LuciferYang commented May 8, 2026

Closes #510.

Summary

  • New ColumnStatsAggregator reduces per-zone ZoneStats into DSv2 ColumnStatistics with min / max / nullCount + a conservative NDV when every non-null zone has min == max.
  • LanceStatistics.columnStats() exposes the aggregated map to Spark's CBO via the standard Statistics.columnStats() API.
  • LanceScanBuilder loads zonemap stats for projected zonemap-indexed columns up to lance.cbo.column.stats.max.columns (default 8). Filter and partition columns are always loaded — they already drive fragment pruning.
  • Master kill-switch (spark.lance.cbo.column.stats.enabled, default true) + per-scan option for safe rollback. Both flow through parseTypedFlags, so catalog-level configs propagate the same way as in #476 (fix(spark): gss initiate failed on hms executors; spark.sql.catalog read options not applied).
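
The conservative NDV rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's ColumnStatsAggregator: ZoneStatsSketch, ColumnStatsSketch, and ConservativeNdvSketch are hypothetical names, the column is assumed to be long-typed, and OptionalLong.empty() stands in for the "no NDV reported" case.

```java
import java.util.HashSet;
import java.util.List;
import java.util.OptionalLong;
import java.util.Set;

/** Hypothetical per-zone stats for a long-typed column; not the PR's ZoneStats class. */
final class ZoneStatsSketch {
  final Long min; // null when the zone contains only nulls
  final Long max; // null when the zone contains only nulls
  final long nullCount;

  ZoneStatsSketch(Long min, Long max, long nullCount) {
    this.min = min;
    this.max = max;
    this.nullCount = nullCount;
  }
}

/** Hypothetical aggregated result; the PR reports DSv2 ColumnStatistics instead. */
final class ColumnStatsSketch {
  final Long min;
  final Long max;
  final long nullCount;
  final OptionalLong ndv; // empty means "do not report an NDV"

  ColumnStatsSketch(Long min, Long max, long nullCount, OptionalLong ndv) {
    this.min = min;
    this.max = max;
    this.nullCount = nullCount;
    this.ndv = ndv;
  }
}

final class ConservativeNdvSketch {
  /** Reduces per-zone stats to column-level min / max / nullCount plus a gated NDV. */
  static ColumnStatsSketch aggregate(List<ZoneStatsSketch> zones) {
    Long min = null;
    Long max = null;
    long nullCount = 0L;
    boolean everyZoneSingleValued = true;
    Set<Long> singleValues = new HashSet<>();

    for (ZoneStatsSketch zone : zones) {
      nullCount += zone.nullCount;
      if (zone.min == null || zone.max == null) {
        continue; // all-null zone: contributes to nullCount only
      }
      if (min == null || zone.min < min) {
        min = zone.min;
      }
      if (max == null || zone.max > max) {
        max = zone.max;
      }
      if (zone.min.equals(zone.max)) {
        singleValues.add(zone.min); // the zone holds exactly one distinct value
      } else {
        everyZoneSingleValued = false; // precondition fails, so no NDV is reported
      }
    }

    // The NDV is exact only when every non-null zone is single-valued.
    OptionalLong ndv =
        (everyZoneSingleValued && !singleValues.isEmpty())
            ? OptionalLong.of(singleValues.size())
            : OptionalLong.empty();
    return new ColumnStatsSketch(min, max, nullCount, ndv);
  }
}
```

Under these assumptions, zones with (min, max) of (1998, 1998), (1999, 1999), and (1998, 1998) give an NDV of 2, while a single zone with (1, 5) leaves the NDV unreported.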

Configuration

| SparkConf | Per-scan option | Default | Purpose |
|---|---|---|---|
| spark.lance.cbo.column.stats.enabled | lance.cbo.column.stats.enabled | true | Master kill-switch. Disables columnStats() reporting entirely. |
| spark.lance.cbo.column.stats.max.columns | lance.cbo.column.stats.max.columns | 8 | Cap on zonemap-indexed columns whose stats are loaded at plan time. Filter and partition columns are always loaded; the cap only bounds extra projection-driven I/O. |

Per-scan options win over SparkConf. Master switch off restores prior behavior (no columnStats() reported).
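
The precedence rule above (per-scan option, then SparkConf, then default) could look roughly like the sketch below. It is illustrative only: CboFlagsSketch and its methods are hypothetical, plain maps stand in for the scan options and the SparkConf, and this is not the PR's parseTypedFlags.

```java
import java.util.Map;

/** Hypothetical resolver for the two CBO flags; plain maps stand in for real config objects. */
final class CboFlagsSketch {
  static final String ENABLED = "lance.cbo.column.stats.enabled";
  static final String MAX_COLUMNS = "lance.cbo.column.stats.max.columns";

  /** Per-scan option wins, then the "spark."-prefixed SparkConf key, then the default. */
  static String resolve(
      String key, Map<String, String> scanOptions, Map<String, String> sparkConf, String def) {
    String value = scanOptions.get(key);
    if (value != null) {
      return value;
    }
    value = sparkConf.get("spark." + key);
    return value != null ? value : def;
  }

  static boolean statsEnabled(Map<String, String> scanOptions, Map<String, String> sparkConf) {
    return Boolean.parseBoolean(resolve(ENABLED, scanOptions, sparkConf, "true"));
  }

  static int maxColumns(Map<String, String> scanOptions, Map<String, String> sparkConf) {
    return Integer.parseInt(resolve(MAX_COLUMNS, scanOptions, sparkConf, "8"));
  }
}
```

With empty maps this resolves to enabled = true and a cap of 8; setting the SparkConf cap to 16 and the per-scan cap to 4 resolves to 4.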

Related

What's NOT in this PR (follow-ups)

  • avgLen / maxLen for variable-length types — separate issue.
  • Sketch-based NDV on unsorted columns — needs a Lance-core RFC.
  • ANALYZE TABLE and persistent column statistics — separate follow-up PR; depends on this one.

Test plan

  • make test SPARK_VERSION=3.5 SCALA_VERSION=2.12: ColumnStatsAggregatorTest (17), LanceStatisticsTest (19), LanceSparkReadOptionsSerializationTest (8) all pass.
  • make test SPARK_VERSION=3.4 SCALA_VERSION=2.12: same suite passes.
  • make test SPARK_VERSION=4.0 SCALA_VERSION=2.13: same suite passes.
  • make test SPARK_VERSION=4.1 SCALA_VERSION=2.13: same suite passes.
  • make lint — Checkstyle + Spotless clean.
  • Manual: query a Lance table with EXPLAIN COST and confirm columnStats lines appear under the Statistics block when zonemap indexes exist on filter columns.
  • Manual: set spark.lance.cbo.column.stats.enabled=false and confirm columnStats disappears (kill-switch verification).

Risks

  • Adds a plan-time per-column zonemap-stats fetch on top of the filter-column loading that already happens. Bounded by max.columns cap (default 8). On wide tables with no filter columns, this is the only added I/O.
  • NDV path is precondition-gated and returns null on failure — never poisons CBO with a guessed value.
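
To make the I/O bound in the first risk concrete, here is a rough sketch of how the cap might limit which columns get a plan-time stats fetch. It assumes the cap counts only the extra projected columns (one reading of the configuration table); StatsColumnSelectionSketch and its parameters are hypothetical and do not mirror LanceScanBuilder.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical column selection; assumes the cap applies only to extra projected columns. */
final class StatsColumnSelectionSketch {
  static Set<String> select(
      List<String> alwaysLoaded, // filter and partition columns, loaded regardless of the cap
      List<String> projected, // projected columns, in projection order
      Set<String> zonemapIndexed, // columns that have a zonemap index
      int maxColumns) {
    Set<String> selected = new LinkedHashSet<>(alwaysLoaded);
    int extra = 0;
    for (String column : projected) {
      if (extra >= maxColumns) {
        break; // cap reached: no further projection-driven fetches
      }
      if (!zonemapIndexed.contains(column) || selected.contains(column)) {
        continue; // no zonemap stats to load, or already covered
      }
      selected.add(column);
      extra++;
    }
    return selected;
  }
}
```

In this sketch a cap of 0 leaves only the filter and partition columns, which matches the "no extra I/O" end of the range.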

github-actions Bot added the enhancement label May 8, 2026

github-actions Bot commented May 8, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

LuciferYang marked this pull request as draft May 8, 2026 02:57
LuciferYang changed the title from "feat(cbo): Phase 1 ColumnStatistics + Phase 3a NDV from zonemap stats" to "feat(cbo): expose ColumnStatistics with conservative NDV from zonemap" May 8, 2026
@LuciferYang

Still a draft.

LuciferYang force-pushed the cbo-phase1-3a branch 2 times, most recently from fbcfd0a to d8eeb13 May 8, 2026 03:26
ColumnStatsAggregator reduces per-zone ZoneStats to DSv2 ColumnStatistics
with min/max/nullCount + a conservative NDV when every non-null zone has
min == max. LanceStatistics.columnStats() exposes the aggregated map.
LanceScanBuilder loads zonemap stats for projected zonemap-indexed columns
up to a configurable cap (default 8).

The conservative NDV path produces an exact NDV when applicable (every zone
single-valued); returns null otherwise so Spark CBO falls back to row-count
heuristics rather than guess. Most useful for sorted/clustered low-cardinality
columns (e.g. d_year, ca_state) — no extra I/O on top of the zonemap load.

Configuration (per-scan option / SparkConf key):
- lance.cbo.column.stats.enabled / spark.lance.cbo.column.stats.enabled (default true)
- lance.cbo.column.stats.max.columns / spark.lance.cbo.column.stats.max.columns (default 8)

The cap default (8) is a conservative starting point that bounds plan-time
zonemap I/O on wide-projection queries. Users can raise it for deeper CBO
coverage or set the master switch to false for a no-op rollback.
Development

Successfully merging this pull request may close these issues.

Enable Spark CBO with column statistics from zonemap
