Skip to content

ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]#8860

Closed
XiaokunDing wants to merge 4 commits into
apache:masterfrom
XiaokunDing:parquet_statistics
Closed

ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP]#8860
XiaokunDing wants to merge 4 commits into
apache:masterfrom
XiaokunDing:parquet_statistics

Conversation

@XiaokunDing

Copy link
Copy Markdown

No description provided.

@XiaokunDing XiaokunDing changed the title Implement row count statistics for Parquet TableProviderParquet statistics ARROW-10783 Implement row count statistics for Parquet TableProviderParquet statistics Dec 7, 2020
@XiaokunDing XiaokunDing changed the title ARROW-10783 Implement row count statistics for Parquet TableProviderParquet statistics #ARROW-10783 Implement row count statistics for Parquet TableProviderParquet statistics Dec 7, 2020
@XiaokunDing XiaokunDing changed the title #ARROW-10783 Implement row count statistics for Parquet TableProviderParquet statistics ARROW-10783 [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics Dec 7, 2020
@nevi-me nevi-me changed the title ARROW-10783 [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics Dec 7, 2020
@github-actions

github-actions Bot commented Dec 7, 2020

Copy link
Copy Markdown

@XiaokunDing XiaokunDing changed the title ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [ WIP] Dec 7, 2020
@XiaokunDing XiaokunDing changed the title ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [ WIP] ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics Dec 7, 2020
@XiaokunDing XiaokunDing changed the title ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics ARROW-10783: [Rust] [DataFusion] Implement row count statistics for Parquet TableProviderParquet statistics [WIP] Dec 7, 2020
@apache apache deleted a comment from github-actions Bot Dec 7, 2020

@seddonm1 seddonm1 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI I have independently prepared a PR but was waiting for #8831

}

/// Get the statistics from parquet file format
pub fn statistics(&self) -> Option<Statistics> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By doing this you are parsing the parquet file headers a second time when they have already been parsed.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, for clarity, the ParquetReader already performs this code. You could expose the metadata produced in the first pass and then you don't need to reimplement this logic (which has to read all the files again).


/// The table statistics
#[derive(Clone, Debug)]
pub struct Statistics {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is repeating what you have done in #8831

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Mike, this is just a draft code for WIP, when #8831 have been merged I'll recommit it.

@XiaokunDing XiaokunDing closed this Dec 8, 2020
@XiaokunDing XiaokunDing reopened this Dec 8, 2020
@seddonm1

seddonm1 commented Dec 8, 2020

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot added the needs-rebase A PR that needs to be rebased by the author label Dec 9, 2020
}
}

if num_rows >= i64::MAX {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you could use std::cmp::min and std::cmp::max instead of these if statements to find min/max values

@Dandandan

Copy link
Copy Markdown
Contributor

@XiaokunDing what about changing the statistics to be optional per statistic, but not the statistics itself?
I was looking at adding it e.g. for a MemTable where it is easy to add a num_rows statistic, but (currently) not so trivial to add a total_byte_size implementation. Having the num_rows available could e.g. allow more efficient joins.
I guess this would be the same for other statistics that could be added later on.

@Dandandan

Copy link
Copy Markdown
Contributor

So the proposal would be to change the statistics struct to

pub struct Statistics {
    /// The number of table rows
    pub num_rows: Option<usize>,
    /// total byte of the table rows
    pub total_byte_size: Option<usize>,
}

and the Statistics doesn't need to be wrapped as Option<Statistics> anymore.

@Dandandan

Copy link
Copy Markdown
Contributor

Added a PR for that here: #8889

@Dandandan

Copy link
Copy Markdown
Contributor

@XiaokunDing do you plan updating this PR?

alamb pushed a commit that referenced this pull request Dec 17, 2020
…eProvider

This is an implementation of Statistics for the Parquet datasource as this PR seems to have gone quiet: #8860

This PR exposes the ParquetMetaData produced by the file reader as even though initially we only consume `row_count` and `total_byte_size` there is column level metadata that may be very useful in future for potentially large performance gains.

Closes #8944 from seddonm1/parquet-statistics

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
@alamb

alamb commented Dec 17, 2020

Copy link
Copy Markdown
Contributor

FYI this PR was partially superceded by #8944 -- unless @XiaokunDing objects, I think we should close this PR.

@jorgecarleitao

Copy link
Copy Markdown
Member

Closing as duplicate. Thanks @XiaokunDing for this initial implementation and for your interest in the Arrow project!

Let us know if there was anything on our side that led you to not continue this work, so that we can also improve our process. If not, I will assume that you did not have the time, which is perfectly understandable :)

alamb pushed a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…eProvider

This is an implementation of Statistics for the Parquet datasource as this PR seems to have gone quiet: apache/arrow#8860

This PR exposes the ParquetMetaData produced by the file reader as even though initially we only consume `row_count` and `total_byte_size` there is column level metadata that may be very useful in future for potentially large performance gains.

Closes #8944 from seddonm1/parquet-statistics

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: Rust - DataFusion Component: Rust needs-rebase A PR that needs to be rebased by the author

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants