Faster parquet DictEncoder (~20%)#2123
Conversation
|
Running benchmarks with just the change to ahash show no significant performance change. This is not entirely surprising as the current implementation uses crc32 which is very cheap to compute (although not DOS resistant). The change to hashbrown nets a non-trivial return where value encoding is the major bottleneck, this diminishes as additional overheads from nulls, lists, etc... take effect. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2123 +/- ##
==========================================
- Coverage 83.71% 82.51% -1.20%
==========================================
Files 225 240 +15
Lines 59567 62234 +2667
==========================================
+ Hits 49865 51355 +1490
- Misses 9702 10879 +1177 ☔ View full report in Codecov by Sentry. |
| rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] } | ||
| futures = { version = "0.3", default-features = false, features = ["std"], optional = true } | ||
| tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "fs", "rt", "io-util"] } | ||
| hashbrown = { version = "0.12", default-features = false } |
There was a problem hiding this comment.
There is a feature 'inline-more" which is enabled by default in hashbrown which gives sometimes a bit better performance.
There was a problem hiding this comment.
By disabling this here, we can delegate that decision downstream
|
|
||
| impl<T: DataType> Encoder<T> for DictEncoder<T> { | ||
| fn put(&mut self, values: &[T::T]) -> Result<()> { | ||
| for i in values { |
There was a problem hiding this comment.
Not sure if it's a bottleneck, it might be faster to compute the hashes for values in one go (i.e. vectorized)?
alamb
left a comment
There was a problem hiding this comment.
The code looks good to me but I am concerned about the new dependencies as I believe some people use parquet after compiling to WASM or on embedded devices.
I am curious what other maintainers think too
| rust-version = "1.57" | ||
|
|
||
| [dependencies] | ||
| ahash = "0.7" |
There was a problem hiding this comment.
These seem to be new dependencies (if optional features are not enabled)
|
|
||
| state: ahash::RandomState, | ||
|
|
||
| /// Used to provide a lookup from value to unique value |
There was a problem hiding this comment.
Given the replication of this pattern (maybe now in three places?) perhaps we can factor it into its own structure, mostly for readability as the use of HashMap to implement a HashSet takes some thought to totally grok
There was a problem hiding this comment.
I did consider this, but I was unsure where to put it. It can't live in arrow, as parquet needs to compile without arrow, but aside from creating a new crate I wasn't really sure where to put it...
There was a problem hiding this comment.
|
I'm going to get this in as I need it for #1764, we have time until the next release to address any issues. |
|
Benchmark runs are scheduled for baseline = 985760f and contender = 6ce4c4e. 6ce4c4e is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
| use std::hash::Hash; | ||
|
|
||
| /// Storage trait for [`Interner`] | ||
| pub trait Storage { |
Which issue does this PR close?
Part of #1764
Rationale for this change
The existing implementation is complex, and slower
What changes are included in this PR?
Gives the encoder the same treatment as #1861, switching to using ahash and hashbrown.
Are there any user-facing changes?
No