Spec: Add content stats to spec#14234
Conversation
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
e86f035 to
a1f6ad4
Compare
ae11c0a to
919c682
Compare
| | avg_value_count | `int` | 4 | false | The avg value count for variable-length types (string/binary) | | ||
| | max_value_count | `long` | 5 | false | The max value count for variable-length types (string/binary) | |
There was a problem hiding this comment.
Should this be avg_value_length and max_value_length. I don't think this represents a count.
There was a problem hiding this comment.
@rdblue since you brought up those two in the design doc, any thoughts about the exact naming of these two fields?
There was a problem hiding this comment.
Dan is correct. These should be avg_value_size_in_bytes.
We should check with engine people to see if we want to include the max. If it isn't going to be valuable in the short term, we should add it later. Also, I think that we need to be consistent about the types. This is long but the average is int.
There was a problem hiding this comment.
At least for Spark I think avg/max would be useful as we could report those through SupportsReportStatistics to Spark for CBO
There was a problem hiding this comment.
@ebyhr could the avg/max value size be used in Trino to have a better estimate for the data_size statistic?
There was a problem hiding this comment.
Is this size on disk or in memory? Trino CBO (TableScanStatsRule) uses the average size. The max value size is unused as far as I know.
There was a problem hiding this comment.
@nastra, did you confirm whether max is used by Spark?
There was a problem hiding this comment.
@rdblue yes and I think the only place where the max would be used by Spark is in the DSv2 Column Stats: https://github.com/apache/spark/blob/dc57455589a9f160638ab3526ca359eb84f76d67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/Statistics.scala#L101, which we could pass via the SupportsReportStatistics API
There was a problem hiding this comment.
Just because it is passed doesn't mean it is used. Can we find out if this is aspirational or whether it is really needed?
There was a problem hiding this comment.
I did some additional investigation and maxLen in Spark is carried forward/aggregated/visualized in a few places (like JoinEstimation / DataSourceV2Relation / EstimationUtils / DescribeColumnExec) but not really used in cardinality or size calculations like avgLen is used today, meaning that we wouldn't have an immediate use for having max_value_size_in_bytes today. So I think the right call for now is to leave this out of the spec and add it later when we have an immediate use in a query engine
1513125 to
d25883c
Compare
Hey @nastra and @deniskuzZ , Now about sketches (of any kind, e.g. NDV or histograms):
|
@deniskuzZ coming back to this question. Can you help me understand why we would want to store such sketches at the file level when we already have them in EDIT: looks like @gaborkaszab already provided a good summary while I was writing this up |
We do not want to store these sketches at the file level. The more natural fit would be partition-level aggregates, potentially persisted separately (e.g. in Parquet or another compact structure). In Hive, several optimizer features already rely on partition-level statistics such as NDVs and histograms, so there is practical value in exposing richer partition stats. That said, I am not convinced Puffin is the best fit for this use case. As far as I understand, it lacks a secondary index, and someone mentioned scalability concerns when the number of blobs becomes very large (e.g. 100k partitions × 100 columns). Note, there was also an idea to store column statistics in Puffin on a per-partition basis, with references from the existing partition statistics files. Also, well-designed queries should never need to load statistics for all partitions and all columns at once. In practice, queries typically touch only a relatively small subset of partitions and referenced columns, so metadata access should remain proportional to the actual query scope rather than the total table size. |
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
Co-authored-by: Steven Zhen Wu <stevenz3wu@gmail.com>
Co-authored-by: Daniel Weeks <daniel.weeks@databricks.com>
3ef5760 to
5f01387
Compare
|
thanks everyone for reviewing and participating in the discussion. I'd like to get this merged, so that we can make some progress on the implementation. The spec isn't finalized yet and we can still change/update things before v4 is ratified |
These spec changes are related to the content stats proposal. Note that the root
content_statsfield will use ID 146 in the manifest.