Fix NullArrayReader (#1245) #1246

Merged: alamb merged 1 commit into apache:master from tustvold:null-array-reader on Jan 29, 2022

@tustvold

Contributor

Which issue does this PR close?

Closes #1245.

Rationale for this change

As originally reported by @bjchambers, a sanity check added in #1054 fires when reading NullArrays.

This highlighted two problems:

  • NullArrayReader has always been somewhat broken as it would never flush the accumulated null buffer, instead just growing it indefinitely
  • There were literally no tests for NullArrayReader 😱
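The first problem can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyRecordReader` and `leftover_after` are hypothetical names invented for illustration, not types from the parquet crate, and the bitmap is modelled as a plain `Vec<bool>`. It shows how a buffer that is filled on every read but never consumed keeps growing, while consuming it after each batch keeps it empty.

```rust
/// Hypothetical stand-in for a record reader; NOT the parquet crate's type.
#[derive(Default)]
struct ToyRecordReader {
    /// One validity flag per value read since the last flush.
    null_bitmap: Vec<bool>,
}

impl ToyRecordReader {
    /// Pretend to read a batch of `n` null values, recording them in the bitmap.
    fn read_nulls(&mut self, n: usize) {
        self.null_bitmap.extend(std::iter::repeat(false).take(n));
    }

    /// Hand the accumulated bitmap to the caller and leave the buffer empty.
    fn consume_bitmap_buffer(&mut self) -> Vec<bool> {
        std::mem::take(&mut self.null_bitmap)
    }
}

/// Reads `batches` batches of `batch_size` nulls and returns how many
/// bitmap entries are still held by the reader afterwards.
fn leftover_after(batches: usize, batch_size: usize, flush: bool) -> usize {
    let mut reader = ToyRecordReader::default();
    for _ in 0..batches {
        reader.read_nulls(batch_size);
        if flush {
            // Consuming the buffer after each batch resets it.
            let bitmap = reader.consume_bitmap_buffer();
            debug_assert_eq!(bitmap.len(), batch_size);
        }
    }
    reader.null_bitmap.len()
}
```

In this model, `leftover_after(4, 1024, false)` leaves 4096 entries behind, while `leftover_after(4, 1024, true)` leaves none.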

What changes are included in this PR?

Fixes the panic and adds a basic test of the NullArrayReader.

Are there any user-facing changes?

NullArrayReader should work again

self.rep_levels_buffer = self.record_reader.consume_rep_levels()?;

// Must consume bitmap buffer
self.record_reader.consume_bitmap_buffer()?;

Contributor Author

Why this is even getting computed at all is definitely a valid question, but at the same time a workload where the speed of reading arrays of nulls is the bottleneck feels decidedly... niche 😆

Contributor

One place it comes up is where you have multiple Parquet files representing a "table", with one or more columns being relatively sparse. If you have a file dropped every hour, then it may be that some of the files have a null column while others have a few values.

It doesn't seem likely it would be the bottleneck (the other columns in the file probably would be), but that's at least how it's come up for us.

Contributor Author

Yeah, I didn't mean to suggest NullArrayReader itself is niche, but rather there is probably a limited benefit to optimising it to not compute the bitmask unnecessarily 😄

@alamb left a comment

Contributor

Thanks @tustvold -- I'll get this out in arrow 9.0.0 next week; I can try to release a patchset sooner if you need it, @bjchambers

@bjchambers

bjchambers commented Jan 29, 2022 via email

Contributor

Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

Parquet v8.0.0 panics when reading all null column to NullArray