Skip to content

parquet: Speed up BitReader/DeltaBitPackDecoder#325

Merged
sunchao merged 2 commits into
apache:masterfrom
kornholi:pq-bitreader-allocs
May 24, 2021
Merged

parquet: Speed up BitReader/DeltaBitPackDecoder#325
sunchao merged 2 commits into
apache:masterfrom
kornholi:pq-bitreader-allocs

Conversation

@kornholi

Copy link
Copy Markdown
Contributor

This PR removes some reference counting in BitReader and a few allocations in DeltaBitPackDecoder.

At least for the datasets I tested with, delta-encoded integer columns decode around 50% faster. A SELECT AVG(foo) through datafusion was about 30% faster as well.

kornholi added 2 commits May 18, 2021 19:01
From a quick test, this speeds up reading delta-packed int columns by
over 30%.
From a quick test, it seems to decode around 10% faster overall.
@kornholi kornholi force-pushed the pq-bitreader-allocs branch from 753019c to 10d8390 Compare May 19, 2021 01:01
@Dandandan

Dandandan commented May 20, 2021

Copy link
Copy Markdown
Contributor

@yordan-pavlov

yordan-pavlov commented May 20, 2021

Copy link
Copy Markdown
Contributor

@kornholi thank you for looking into BitReader and DeltaBitPackDecoder; these performance improvements will combine very well with the change I have been working on here #200; similar to you, I have also found the current implementation of the parquet reader to be fairly inefficient in some places where for example it's unnecessarily creating clones of ByteBufferPtr, or allocating Vecs, etc. I can't wait to see what the combined performance improvements will be.

@alamb

alamb commented May 21, 2021

Copy link
Copy Markdown
Contributor

I restarted the CI checks on this PR as the failure on the windows tests seemed unrelated to your changes

@alamb alamb added parquet Changes to the parquet crate enhancement Any new improvement worthy of a entry in the changelog labels May 21, 2021

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good to me, but I would feel better if @sunchao took a look at it before we merged it in

);
assert!(loaded == self.values_current_mini_block);
} else {
self.deltas_in_mini_block.clear();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the need for this change -- was calling clear() a major bottleneck? Or was it having to reinitialize the entire deltas_in_mini_block to default() in the self.use_batch branch?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the resize is expensive even though it optimizes down to mostly a memset (only 4 elems in the array in my tests). Around a 5% throughput difference.

@sunchao sunchao left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @kornholi !

@sunchao sunchao merged commit b2de544 into apache:master May 24, 2021
@kornholi kornholi deleted the pq-bitreader-allocs branch May 24, 2021 03:05
alamb pushed a commit that referenced this pull request Jun 4, 2021
* parquet: Avoid temporary `BufferPtr`s in `BitReader`

From a quick test, this speeds up reading delta-packed int columns by
over 30%.

* parquet: Avoid some allocations in `DeltaBitPackDecoder`

From a quick test, it seems to decode around 10% faster overall.
alamb added a commit that referenced this pull request Jun 5, 2021
* parquet: Avoid temporary `BufferPtr`s in `BitReader`

From a quick test, this speeds up reading delta-packed int columns by
over 30%.

* parquet: Avoid some allocations in `DeltaBitPackDecoder`

From a quick test, it seems to decode around 10% faster overall.

Co-authored-by: Kornelijus Survila <kornholijo@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants