GH-3574: parquet-hadoop: Statistics.toParquetStatistics: always set null_count#3575

Open
mdibaiee wants to merge 1 commit into apache:master from mdibaiee:metadata-null-counts-truncated

@mdibaiee

Rationale for this change

Missing null_count statistics for columns in Parquet files can cause issues for downstream consumers of these files. There is no need to omit this statistic for columns whose statistics exceed the configured truncation limit: truncation only concerns the min/max values, so the null count can still be reported with confidence. It remains reasonable to omit the min/max statistics in that case, for the reason explained in the code comment.

What changes are included in this PR?

Always write the null_count statistic for columns in Parquet files, regardless of their size.

Are these changes tested?

Yes, TestParquetMetadataConverter.java has been updated to reflect these changes.

Are there any user-facing changes?

I think the appearance of the additional null_count statistic can be considered a user-facing change.

Closes #3574

public static Statistics toParquetStatistics(
    org.apache.parquet.column.statistics.Statistics stats, int truncateLength) {
  Statistics formatStats = new Statistics();
  formatStats.setNull_count(stats.getNumNulls());
Member

Do we need to preserve the check stats.isEmpty() for this?

Author

Good point, addressed 👍🏽
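
For readers following the thread, here is a minimal sketch of the behaviour discussed above: keep the stats.isEmpty() guard, write null_count whenever statistics exist, and write min/max only when they fit the size limit. It is not the code in this PR. ColumnStats, FormatStats, convert, withinLimit and the maxStatsSize parameter are hypothetical stand-ins that only loosely model the Parquet classes, and the size check is a simplifying assumption.

```java
import java.util.OptionalLong;

/** Hypothetical stand-ins for illustration only; not the parquet-hadoop API. */
final class NullCountSketch {

    /** Stand-in for the writer-side column statistics. */
    static final class ColumnStats {
        final byte[] min;     // null when no non-null value was recorded
        final byte[] max;     // null when no non-null value was recorded
        final long numNulls;  // a negative value means "null count never set"

        ColumnStats(byte[] min, byte[] max, long numNulls) {
            this.min = min;
            this.max = max;
            this.numNulls = numNulls;
        }

        boolean hasNonNullValue() {
            return min != null && max != null;
        }

        boolean isNumNullsSet() {
            return numNulls >= 0;
        }

        /** Nothing was recorded at all: no values and no null count. */
        boolean isEmpty() {
            return !hasNonNullValue() && !isNumNullsSet();
        }
    }

    /** Stand-in for the statistics struct written to the file footer. */
    static final class FormatStats {
        byte[] min;
        byte[] max;
        OptionalLong nullCount = OptionalLong.empty();
    }

    /** Assumed size check: combined min/max length must fit the limit. */
    static boolean withinLimit(ColumnStats stats, int maxStatsSize) {
        return stats.min.length + stats.max.length <= maxStatsSize;
    }

    static FormatStats convert(ColumnStats stats, int maxStatsSize) {
        FormatStats out = new FormatStats();
        if (stats.isEmpty()) {
            return out; // preserve the isEmpty() guard: write nothing
        }
        if (stats.isNumNullsSet()) {
            // The null count does not depend on the size of min/max.
            out.nullCount = OptionalLong.of(stats.numNulls);
        }
        if (stats.hasNonNullValue() && withinLimit(stats, maxStatsSize)) {
            // min/max are still dropped when they exceed the limit.
            out.min = stats.min;
            out.max = stats.max;
        }
        return out;
    }

    public static void main(String[] args) {
        ColumnStats large = new ColumnStats(new byte[10], new byte[10], 3L);
        FormatStats converted = convert(large, 8);
        System.out.println("null_count written: " + converted.nullCount.isPresent());
        System.out.println("min/max written: " + (converted.min != null));
    }
}
```

In this sketch, a column whose min/max exceed the limit still gets its null count, while a column with no recorded statistics gets neither field.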

@mdibaiee mdibaiee force-pushed the metadata-null-counts-truncated branch from 83b6461 to b6d0e5b Compare May 23, 2026 12:15

Development

Successfully merging this pull request may close these issues.

null_count is omitted for large columns in parquet files