Spec: Add content stats to spec by nastra · Pull Request #14234 · apache/iceberg

nastra · 2025-10-02T09:43:34Z

These spec changes are related to the content stats proposal. Note that the root content_stats field will use ID 146 in the manifest.

github-actions · 2025-11-02T00:19:02Z

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

danielcweeks · 2026-02-04T16:06:18Z

+| avg_value_count  | `int`               | 4                                   | false    | The avg value count for variable-length types (string/binary)                                                                                                                  |
+| max_value_count  | `long`              | 5                                   | false    | The max value count for variable-length types (string/binary)                                                                                                                  |


Should this be avg_value_length and max_value_length. I don't think this represents a count.

@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?

Dan is correct. These should be avg_value_size_in_bytes.

We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.

At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO

@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?

Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.

@nastra, did you confirm whether max is used by Spark?

@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API

Just because it is passed doesn't mean it is used. Can we find out if this is aspirational or whether it is really needed?

I did some additional investigation and maxLen in Spark is carried forward/aggregated/visualized in a few places (like JoinEstimation / DataSourceV2Relation / EstimationUtils / DescribeColumnExec) but not really used in cardinality or size calculations like avgLen is used today, meaning that we wouldn't have an immediate use for having max_value_size_in_bytes today. So I think the right call for now is to leave this out of the spec and add it later when we have an immediate use in a query engine

gaborkaszab · 2026-05-19T14:57:10Z

Hi @nastra, could you please clarify where we would store NDVs? Apologies if this is already covered in a design doc I may have missed. Also, would it be possible to extend the design to optionally support partition-level statistics structures such as bitmap-based sketches for NDV estimation and histograms (e.g., KLL sketches)? Thank you!

@deniskuzZ those type of stats are not handled by this design as those are stored separately in Puffin files (you might want to take a look at NDVSketchUtil). Those are then e.g. later used by Spark in SparkScan#estimateStatistics.

Thanks for confirming that! In that case, I’m not entirely sure why Gabor’s proposal on standardizing and extending column stats was rejected. We were under the impression that the new design would also cover that aspect: https://docs.google.com/document/d/1H9uYt53Q1_CcOXOfLcr0hXRxvqflg_k_xeVorMLrWbM/edit?tab=t.0 Should we give it another try and start a new thread? cc @gaborkaszab, @pvary

I haven't seen this proposal yet, so let me take a look first before responding to it. I also meant to reply to your NDV question earlier and updated my comment to say "NDVs are not handled..." instead of "those type of stats are not handled...". In any case, let me first read that proposal that you linked.

Hey @nastra and @deniskuzZ ,
Thanks for bringing up that proposal doc I had earlier!
As I see this: ContentStats ATM are meant to be used on a per file basis (and aggregated to each level of the metadata tree structure, if I'm not mistaken). I think this ContentStats data structure is something we can re-use for extending PartitionStatistics to introduce column-level stats on a per-partition level too. Just for the record this has been asked by Trino and Hive people for a while, I got a number of requests around this since the above mentioned proposal doc went public.

Now about sketches (of any kind, e.g. NDV or histograms):

I don't think these should live within ContentStats, but we can examine other opportunities to have them on a per-partition basis, as currently per-table basis is what we have for sketches (well NDV, but histograms could be a meaningful addition IMO). One option is to have puffin files wired in into partition stats too.
I'm somewhat hesitant to add sketches on a per-partition level TBH, but would be nice to hear other opinions here. Just a back of the envelope calculation: a Theta sketch with default size/precision is 50KB, with 100k partitions it's 5GB for a Theta sketch per column. We could reduce this by sacrificing precision, e.g. k=512 => sketch size is 6KB *100k partitions => 600MB per column. These are in-memory sizes, we for sure load only a small subset of them, however, the disk size even though is smaller but could be half of the in-memory size. Still looks a lot, and I recall complaints for this when working on Impala, would be nice to hear other experiences.

nastra · 2026-05-19T14:59:47Z

Also, would it be possible to extend the design to optionally support partition-level statistics structures such as bitmap-based sketches for NDV estimation and histograms (e.g., KLL sketches)

@deniskuzZ coming back to this question. Can you help me understand why we would want to store such sketches at the file level when we already have them in NDVSketchUtil? This PR is intentionally scoped to field-level statistics and replaces what we previously stored in a few Maps, so I'm not sure we want to store sketches of any kind in content stats.

EDIT: looks like @gaborkaszab already provided a good summary while I was writing this up

deniskuzZ · 2026-05-19T17:18:43Z

@deniskuzZ coming back to this question. Can you help me understand why we would want to store such sketches at the file level when we already have them in NDVSketchUtil? This PR is intentionally scoped to field-level statistics and replaces what we previously stored in a few Maps, so I'm not sure we want to store sketches of any kind in content stats.

We do not want to store these sketches at the file level. The more natural fit would be partition-level aggregates, potentially persisted separately (e.g. in Parquet or another compact structure). In Hive, several optimizer features already rely on partition-level statistics such as NDVs and histograms, so there is practical value in exposing richer partition stats.

That said, I am not convinced Puffin is the best fit for this use case. As far as I understand, it lacks a secondary index, and someone mentioned scalability concerns when the number of blobs becomes very large (e.g. 100k partitions × 100 columns). Note, there was also an idea to store column statistics in Puffin on a per-partition basis, with references from the existing partition statistics files.

Also, well-designed queries should never need to load statistics for all partitions and all columns at once. In practice, queries typically touch only a relatively small subset of partitions and referenced columns, so metadata access should remain proportional to the actual query scope rather than the total table size.

Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>

Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>

nastra · 2026-05-20T06:29:51Z

thanks everyone for reviewing and participating in the discussion. I'd like to get this merged, so that we can make some progress on the implementation. The spec isn't finalized yet and we can still change/update things before v4 is ratified

github-actions Bot added the Specification Issues that may introduce spec changes. label Oct 2, 2025

github-actions Bot added the stale label Nov 2, 2025

nastra added not-stale and removed stale labels Nov 3, 2025

nastra force-pushed the content-stats-spec branch from e86f035 to a1f6ad4 Compare November 24, 2025 14:07

nastra requested a review from danielcweeks November 24, 2025 14:08

nastra force-pushed the content-stats-spec branch 2 times, most recently from ae11c0a to 919c682 Compare November 24, 2025 14:32

nastra requested review from amogh-jahagirdar and singhpk234 January 9, 2026 15:21

nastra commented Jan 22, 2026

View reviewed changes