Fix NullArrayReader (#1245)#1246
Conversation
| self.rep_levels_buffer = self.record_reader.consume_rep_levels()?; | ||
|
|
||
| // Must consume bitmap buffer | ||
| self.record_reader.consume_bitmap_buffer()?; |
There was a problem hiding this comment.
Why this is even getting computed at all is definitely a valid question, but at the same time a workload where the speed of reading arrays of nulls is the bottleneck feels decidedly... niche 😆
There was a problem hiding this comment.
One place it comes up is where you have multiple Parquet files representing a "table", with one or more columns being relatively sparse. If you have a file dropped every hour, then it may be that some of the files have a null column while others have a few values.
It doesn't seem likely it would be the bottleneck (the other columns in the file probably would be), but that's at least how it's come up for us.
There was a problem hiding this comment.
Yeah, I didn't mean to suggest NullArrayReader itself is niche, but rather there is probably a limited benefit to optimising it to not compute the bitmask unnecessarily 😄
alamb
left a comment
There was a problem hiding this comment.
Thanks @tustvold -- I'll get this out in arrow 9.0.0 next week; I can try to release a patchset sooner if you need @bjchambers
|
No need to rush. Next week should be fine. I can just wait to upgrade until
this is out. Thanks for the speedy fix!
…On Sat, Jan 29, 2022 at 3:19 AM Andrew Lamb ***@***.***> wrote:
***@***.**** approved this pull request.
Thanks @tustvold <https://github.com/tustvold> -- I'll get this out in
arrow 9.0.0 next week; I can try to release a patchset sooner if you need
@bjchambers <https://github.com/bjchambers>
—
Reply to this email directly, view it on GitHub
<#1246 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAAIY6HXTGYKE5TH6CONQVDUYPEKTANCNFSM5NCAZKAQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Which issue does this PR close?
Closes #1245.
Rationale for this change
Originally reported by @bjchambers, a sanity check added in #1054 fires when reading NullArrays.
This highlighted two problems:
NullArrayReaderhas always been somewhat broken as it would never flush the accumulated null buffer, instead just growing it indefinitelyWhat changes are included in this PR?
Fixes the panic, and adds a basic test of the
NullArrayReaderAre there any user-facing changes?
NullArrayReader should work again