-
Notifications
You must be signed in to change notification settings - Fork 1.9k
parquet: Add support for row group pruning on FixedSizeBinary #9646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
acac9ec to
588c5bc
Compare
588c5bc to
4740dac
Compare
4740dac to
496ba9a
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @progval -- as always this is looking quite good. However, I have a suggestion about data validation / error checking that i think is worth discussing prior to merge
| )) | ||
| } | ||
| Some(DataType::FixedSizeBinary(size)) => { | ||
| Some(ScalarValue::FixedSizeBinary( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we verify that the resulting bytes have the correct size? As in verify that *size is the same as (s.$bytes_func().to_vec().len()?
If not, I suggest we ignore the error (but don't use the statistics) the same way as is done for non UTF8 values i string statistics. Perhaps we can at least debug when this happens with debug!
Since this data can come from any arbitrary parquet writer, I think it is good practice to validate what is in there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh sure, good catch.
The code for strings is:
let s = std::str::from_utf8(s.$bytes_func())
.map(|s| s.to_string())
.ok();
Some(ScalarValue::Utf8(s))
which looks like it's treated as a NULL, rather than ignored as a statistics value. Should I do the same here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think converting to NULL is the correct behavior. In this case that results in ignoring the value in terms of range analysis. More details are here if you are curious:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, done. I also added a log statement for strings, for consistency.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- thank you @progval
Which issue does this PR close?
Closes #9645.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
It should be invisible other than a performance improvement.