Extract parquet statistics for StructArray#11289
Extract parquet statistics for StructArray#11289Lordworms wants to merge 2 commits intoapache:mainfrom
StructArray#11289Conversation
|
@alamb I'll try to add support for struct in DataPage in next PR. |
StructArray
| // For example a ListArray could correspond to anything from 1 to 3 levels | ||
| // in the parquet schema | ||
| return None; | ||
| match field.data_type() { |
There was a problem hiding this comment.
Perhaps we should extract the repeated code into a helper function? Could also reduce nesting with an early return for the case where it's not nested. So something like:
| match field.data_type() { | |
| pub(crate) fn parquet_column<'a>( | |
| parquet_schema: &SchemaDescriptor, | |
| arrow_schema: &'a Schema, | |
| name: &str, | |
| ) -> Option<(usize, &'a FieldRef)> { | |
| let (root_idx, field) = arrow_schema.fields.find(name)?; | |
| if !field.data_type().is_nested() { | |
| let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?; | |
| return Some((parquet_idx, field)); | |
| } | |
| // Nested field | |
| match field.data_type() { | |
| DataType::Struct(_) => { | |
| let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?; | |
| Some((parquet_idx, field)) | |
| } | |
| _ => { | |
| if field.data_type().is_nested() { | |
| // Nested fields are not supported and require non-trivial logic | |
| // to correctly walk the parquet schema accounting for the | |
| // logical type rules - <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md> | |
| // | |
| // For example a ListArray could correspond to anything from 1 to 3 levels | |
| // in the parquet schema | |
| None | |
| } else { | |
| let parquet_idx = find_parquet_idx(parquet_schema, root_idx)?; | |
| Some((parquet_idx, field)) | |
| } | |
| } | |
| } | |
| } | |
| // Helper function to find parquet index - TBD: this could be more efficient | |
| fn find_parquet_idx(parquet_schema: &SchemaDescriptor, root_idx: usize) -> Option<usize> { | |
| (0..parquet_schema.columns().len()) | |
| .find(|x| parquet_schema.get_column_root_idx(*x) == root_idx) | |
| } |
There was a problem hiding this comment.
Sure, I can do it
| Arc::new(BooleanArray::from(vec![Some(true)])) as ArrayRef, | ||
| ), | ||
| ]); | ||
|
|
There was a problem hiding this comment.
I wondered if we needed a struct array with top-level nulls. But for the stats, it shouldn't matter whether the null is coming from the top level or the nested arrays. That would be testing the how the stats are calculated. (This assumes I'm interpreting the way nulls are handled correctly.)
But it might be worth having some null counts to see that they get calculated correctly?
There was a problem hiding this comment.
That's what I concern about. Since it is hard and seems meaningless to have a top-level nullcount. The reason I leave it like this is just to follow the test pattern.
There was a problem hiding this comment.
This blog post provides good examples on when you'd represent nulls at different levels: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
There was a problem hiding this comment.
I was going to read the valid mask but it actually contains in the data page, not sure whether we need to read the actual data at this time since it is a Metadata level analysis.
There was a problem hiding this comment.
I don't think you need to do anything differently. I was trying to find some tests in the arrow crate for when the statistics are written but I haven't been able to find any for writing nested structs.
There was a problem hiding this comment.
Here's some tests in the arrow crate that are writing a 2 level struct: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/arrow_writer/mod.rs#L1516-L1618 Not sure if that's helpful or not.
There was a problem hiding this comment.
Sure, I'll take a look at it.
c1a3e62 to
15cd7c5
Compare
| metadata: &[&RowGroupMetaData], | ||
| index: usize, | ||
| data_type: &DataType, | ||
| ) -> Vec<u64> { |
There was a problem hiding this comment.
If you preferred, this could also be expressed as a fold:
let num_row_groups = metadata.len();
fields.iter().fold(vec![0; num_row_groups], |mut acc, field| {
let field_null_counts = Self::get_null_counts_recursive(
metadata,
index + 1,
field.data_type(),
);
acc.iter_mut().zip(field_null_counts.iter()).for_each(|(a, b)| *a += b);
acc
})There was a problem hiding this comment.
Sure, sorry for the late response
|
Since we have merged apache/arrow-rs#6046 now upstream in arrow-rs, @Lordworms would you be willing to port this PR to the other repo? |
Sure, I'll open a PR later today. |
|
filed apache/arrow-rs#6090 |
|
Thank you @Lordworms - sorry for the delay / runaround. I just haven't had a chance to focus on this PR. I was hoping someone with more structured type experience would be able to help review it. 🎣 |
|
superceded by apache/arrow-rs#6090 (which I know is waiting for a review -- I just need to find time to study the nested types or find someone else who does) |
Which issue does this PR close?
Closes #10609
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?