Skip to content

Spec: Add content stats to spec#14234

Merged
nastra merged 17 commits into
apache:mainfrom
nastra:content-stats-spec
May 20, 2026
Merged

Spec: Add content stats to spec#14234
nastra merged 17 commits into
apache:mainfrom
nastra:content-stats-spec

Conversation

@nastra

@nastra nastra commented Oct 2, 2025

Copy link
Copy Markdown
Contributor

These spec changes are related to the content stats proposal. Note that the root content_stats field will use ID 146 in the manifest.

@github-actions github-actions Bot added the Specification Issues that may introduce spec changes. label Oct 2, 2025
@github-actions

github-actions Bot commented Nov 2, 2025

Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Nov 2, 2025
@nastra nastra added not-stale and removed stale labels Nov 3, 2025
@nastra nastra force-pushed the content-stats-spec branch from e86f035 to a1f6ad4 Compare November 24, 2025 14:07
@nastra nastra requested a review from danielcweeks November 24, 2025 14:08
@nastra nastra force-pushed the content-stats-spec branch 2 times, most recently from ae11c0a to 919c682 Compare November 24, 2025 14:32
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment on lines +723 to +724
| avg_value_count | `int` | 4 | false | The avg value count for variable-length types (string/binary) |
| max_value_count | `long` | 5 | false | The max value count for variable-length types (string/binary) |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be avg_value_length and max_value_length. I don't think this represents a count.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?

@rdblue rdblue Apr 1, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dan is correct. These should be avg_value_size_in_bytes.

We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra, did you confirm whether max is used by Spark?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just because it is passed doesn't mean it is used. Can we find out if this is aspirational or whether it is really needed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some additional investigation and maxLen in Spark is carried forward/aggregated/visualized in a few places (like JoinEstimation / DataSourceV2Relation / EstimationUtils / DescribeColumnExec) but not really used in cardinality or size calculations like avgLen is used today, meaning that we wouldn't have an immediate use for having max_value_size_in_bytes today. So I think the right call for now is to leave this out of the spec and add it later when we have an immediate use in a query engine

Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
Comment thread format/spec.md Outdated
@gaborkaszab

Copy link
Copy Markdown
Contributor

Hi @nastra, could you please clarify where we would store NDVs? Apologies if this is already covered in a design doc I may have missed. Also, would it be possible to extend the design to optionally support partition-level statistics structures such as bitmap-based sketches for NDV estimation and histograms (e.g., KLL sketches)? Thank you!

@deniskuzZ those type of stats are not handled by this design as those are stored separately in Puffin files (you might want to take a look at NDVSketchUtil). Those are then e.g. later used by Spark in SparkScan#estimateStatistics.

Thanks for confirming that! In that case, I’m not entirely sure why Gabor’s proposal on standardizing and extending column stats was rejected. We were under the impression that the new design would also cover that aspect: https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM/edit?tab=t.0 Should we give it another try and start a new thread? cc @gaborkaszab, @pvary

I haven't seen this proposal yet, so let me take a look first before responding to it. I also meant to reply to your NDV question earlier and updated my comment to say "NDVs are not handled..." instead of "those type of stats are not handled...". In any case, let me first read that proposal that you linked.

Hey @nastra and @deniskuzZ ,
Thanks for bringing up that proposal doc I had earlier!
As I see this: ContentStats ATM are meant to be used on a per file basis (and aggregated to each level of the metadata tree structure, if I'm not mistaken). I think this ContentStats data structure is something we can re-use for extending PartitionStatistics to introduce column-level stats on a per-partition level too. Just for the record this has been asked by Trino and Hive people for a while, I got a number of requests around this since the above mentioned proposal doc went public.

Now about sketches (of any kind, e.g. NDV or histograms):

  • I don't think these should live within ContentStats, but we can examine other opportunities to have them on a per-partition basis, as currently per-table basis is what we have for sketches (well NDV, but histograms could be a meaningful addition IMO). One option is to have puffin files wired in into partition stats too.
  • I'm somewhat hesitant to add sketches on a per-partition level TBH, but would be nice to hear other opinions here. Just a back of the envelope calculation: a Theta sketch with default size/precision is 50KB, with 100k partitions it's 5GB for a Theta sketch per column. We could reduce this by sacrificing precision, e.g. k=512 => sketch size is 6KB *100k partitions => 600MB per column. These are in-memory sizes, we for sure load only a small subset of them, however, the disk size even though is smaller but could be half of the in-memory size. Still looks a lot, and I recall complaints for this when working on Impala, would be nice to hear other experiences.

@nastra

nastra commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

Also, would it be possible to extend the design to optionally support partition-level statistics structures such as bitmap-based sketches for NDV estimation and histograms (e.g., KLL sketches)

@deniskuzZ coming back to this question. Can you help me understand why we would want to store such sketches at the file level when we already have them in NDVSketchUtil? This PR is intentionally scoped to field-level statistics and replaces what we previously stored in a few Maps, so I'm not sure we want to store sketches of any kind in content stats.

EDIT: looks like @gaborkaszab already provided a good summary while I was writing this up

@github-actions github-actions Bot added the core label May 19, 2026
@deniskuzZ

deniskuzZ commented May 19, 2026

Copy link
Copy Markdown
Member

@deniskuzZ coming back to this question. Can you help me understand why we would want to store such sketches at the file level when we already have them in NDVSketchUtil? This PR is intentionally scoped to field-level statistics and replaces what we previously stored in a few Maps, so I'm not sure we want to store sketches of any kind in content stats.

We do not want to store these sketches at the file level. The more natural fit would be partition-level aggregates, potentially persisted separately (e.g. in Parquet or another compact structure). In Hive, several optimizer features already rely on partition-level statistics such as NDVs and histograms, so there is practical value in exposing richer partition stats.

That said, I am not convinced Puffin is the best fit for this use case. As far as I understand, it lacks a secondary index, and someone mentioned scalability concerns when the number of blobs becomes very large (e.g. 100k partitions × 100 columns). Note, there was also an idea to store column statistics in Puffin on a per-partition basis, with references from the existing partition statistics files.

Also, well-designed queries should never need to load statistics for all partitions and all columns at once. In practice, queries typically touch only a relatively small subset of partitions and referenced columns, so metadata access should remain proportional to the actual query scope rather than the total table size.

@nastra nastra force-pushed the content-stats-spec branch from 3ef5760 to 5f01387 Compare May 19, 2026 17:23
@nastra

nastra commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

thanks everyone for reviewing and participating in the discussion. I'd like to get this merged, so that we can make some progress on the implementation. The spec isn't finalized yet and we can still change/update things before v4 is ratified

@nastra nastra merged commit f825d09 into apache:main May 20, 2026
7 checks passed
@nastra nastra deleted the content-stats-spec branch May 20, 2026 06:30
@nssalian nssalian added this to the Iceberg 1.12.0 milestone Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core not-stale Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.