Faster parquet DictEncoder (~20%) by tustvold · Pull Request #2123 · apache/arrow-rs

tustvold · 2022-07-21T21:29:52Z

Which issue does this PR close?

Part of #1764

Rationale for this change

The existing implementation is complex and slower than it could be.

What changes are included in this PR?

Gives the encoder the same treatment as #1861, switching to using ahash and hashbrown.

Are there any user-facing changes?

No

tustvold · 2022-07-21T21:32:05Z

Running benchmarks with just the change to ahash shows no significant performance change. This is not entirely surprising, as the current implementation uses crc32, which is very cheap to compute (although not DoS-resistant).

The change to hashbrown nets a non-trivial improvement where value encoding is the major bottleneck; this diminishes as additional overheads from nulls, lists, etc. take effect.

write_batch primitive/4096 values primitive                                                                             
                        time:   [1.5325 ms 1.5331 ms 1.5338 ms]
                        thrpt:  [115.02 MiB/s 115.07 MiB/s 115.12 MiB/s]
                 change:
                        time:   [-20.677% -20.632% -20.590%] (p = 0.00 < 0.05)
                        thrpt:  [+25.929% +25.995% +26.068%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
write_batch primitive/4096 values primitive non-null                                                                             
                        time:   [1.4838 ms 1.4847 ms 1.4857 ms]
                        thrpt:  [116.44 MiB/s 116.52 MiB/s 116.59 MiB/s]
                 change:
                        time:   [-12.080% -12.017% -11.954%] (p = 0.00 < 0.05)
                        thrpt:  [+13.577% +13.659% +13.739%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
write_batch primitive/4096 values bool                                                                            
                        time:   [111.01 us 111.09 us 111.19 us]
                        thrpt:  [10.224 MiB/s 10.233 MiB/s 10.240 MiB/s]
                 change:
                        time:   [-0.8794% -0.6831% -0.4488%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4508% +0.6878% +0.8872%]
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
write_batch primitive/4096 values bool non-null                                                                            
                        time:   [52.931 us 53.012 us 53.094 us]
                        thrpt:  [21.411 MiB/s 21.444 MiB/s 21.477 MiB/s]
                 change:
                        time:   [-2.2177% -2.1085% -1.9913%] (p = 0.00 < 0.05)
                        thrpt:  [+2.0318% +2.1539% +2.2680%]
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) high mild
  10 (10.00%) high severe
write_batch primitive/4096 values string                                                                            
                        time:   [891.20 us 891.52 us 891.88 us]
                        thrpt:  [89.239 MiB/s 89.275 MiB/s 89.306 MiB/s]
                 change:
                        time:   [-8.4838% -8.4391% -8.3955%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1650% +9.2170% +9.2703%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking write_batch primitive/4096 values string non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
write_batch primitive/4096 values string non-null                                                                             
                        time:   [1.0208 ms 1.0213 ms 1.0218 ms]
                        thrpt:  [77.889 MiB/s 77.931 MiB/s 77.970 MiB/s]
                 change:
                        time:   [+0.0730% +0.1746% +0.2545%] (p = 0.00 < 0.05)
                        thrpt:  [-0.2538% -0.1743% -0.0730%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.8s, enable flat sampling, or reduce sample count to 50.
write_batch nested/4096 values primitive list                                                                             
                        time:   [1.9798 ms 2.0064 ms 2.0368 ms]
                        thrpt:  [80.409 MiB/s 81.627 MiB/s 82.725 MiB/s]
                 change:
                        time:   [+0.9435% +1.8832% +3.0013%] (p = 0.00 < 0.05)
                        thrpt:  [-2.9139% -1.8484% -0.9347%]
                        Change within noise threshold.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) high mild
  18 (18.00%) high severe
write_batch nested/4096 values primitive list non-null                                                                             
                        time:   [2.4385 ms 2.4696 ms 2.5038 ms]
                        thrpt:  [76.896 MiB/s 77.959 MiB/s 78.952 MiB/s]
                 change:
                        time:   [-0.1096% +1.1302% +2.5102%] (p = 0.10 > 0.05)
                        thrpt:  [-2.4488% -1.1176% +0.1097%]
                        No change in performance detected.
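
For readers comparing the `time` and `thrpt` rows above: for a fixed amount of work, throughput is inversely proportional to run time, so the two percentages are not symmetric. A minimal sketch of the conversion follows; the function name `throughput_change_pct` is an illustrative assumption and is not part of the benchmark harness.

```rust
/// Converts a relative change in run time (in percent) into the implied
/// relative change in throughput (in percent), assuming the amount of
/// work per iteration is unchanged.
fn throughput_change_pct(time_change_pct: f64) -> f64 {
    // Ratio new_time / old_time, e.g. 0.79368 for a -20.632% change.
    let time_ratio = 1.0 + time_change_pct / 100.0;
    // Throughput scales with 1 / time.
    (1.0 / time_ratio - 1.0) * 100.0
}
```

Applied to the first benchmark, a time change of -20.632% corresponds to a throughput change of roughly +26%, which agrees with the reported `thrpt` row and with the "~20%" in the PR title, which refers to run time.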

codecov-commenter · 2022-07-21T21:54:49Z

Codecov Report

Attention: Patch coverage is 90.24390% with 8 lines in your changes missing coverage. Please review.

Project coverage is 82.51%. Comparing base (5e3facf) to head (a07d513).
Report is 2352 commits behind head on master.

Files with missing lines	Patch %	Lines
parquet/src/encodings/encoding/dict_encoder.rs	91.37%	5 Missing ⚠️
parquet/src/util/interner.rs	90.90%	2 Missing ⚠️
parquet/src/encodings/encoding/mod.rs	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2123      +/-   ##
==========================================
- Coverage   83.71%   82.51%   -1.20%     
==========================================
  Files         225      240      +15     
  Lines       59567    62234    +2667     
==========================================
+ Hits        49865    51355    +1490     
- Misses       9702    10879    +1177


Dandandan · 2022-07-22T05:34:07Z

 rand = { version = "0.8", default-features = false, features = ["std", "std_rng"] }
 futures = { version = "0.3", default-features = false, features = ["std"], optional = true }
 tokio = { version = "1.0", optional = true, default-features = false, features = ["macros", "fs", "rt", "io-util"] }
+hashbrown = { version = "0.12", default-features = false }


There is a feature, `inline-more`, which is enabled by default in hashbrown and sometimes gives slightly better performance.

By disabling default features here, we can delegate that decision downstream.

Dandandan · 2022-07-22T09:27:24Z

+
+impl<T: DataType> Encoder<T> for DictEncoder<T> {
+    fn put(&mut self, values: &[T::T]) -> Result<()> {
+        for i in values {


I'm not sure if it's a bottleneck, but it might be faster to compute the hashes for all values in one go (i.e. vectorized)?
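
One way to read this suggestion is to split `put` into two passes: hash every value first, then do the dictionary lookups with the precomputed hashes. The sketch below shows only the first pass, under several assumptions: it uses the standard library's `RandomState` rather than `ahash::RandomState`, the helper name `hash_batch` is hypothetical, and separating the loops does not by itself guarantee that the compiler vectorizes anything.

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hash};

/// Hashes every value up front, so that the hashing loop is separate from
/// the dictionary lookups that would later consume the hashes.
/// (Hypothetical helper; not taken from the PR.)
fn hash_batch<T: Hash>(state: &RandomState, values: &[T]) -> Vec<u64> {
    values.iter().map(|v| state.hash_one(v)).collect()
}
```

Equal values hash to the same `u64` for a given `state`, so the second pass could use `hashes[i]` to probe the table for `values[i]`. Whether this is actually faster would need to be measured with the benchmarks above.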

alamb

The code looks good to me, but I am concerned about the new dependencies, as I believe some people use parquet after compiling to WASM or on embedded devices.

I am curious what other maintainers think too.

cc @sunchao @nevi-me @viirya @HaoYang670

alamb · 2022-07-22T10:32:34Z

 rust-version = "1.57"

 [dependencies]
+ahash = "0.7"


These seem to be new dependencies (if optional features are not enabled)

alamb · 2022-07-22T10:43:30Z

+
+    state: ahash::RandomState,
+
+    /// Used to provide a lookup from value to unique value


Given the replication of this pattern (maybe now in three places?), perhaps we can factor it into its own structure. This would be mostly for readability, as the use of HashMap to implement a HashSet takes some thought to totally grok.

I did consider this, but I was unsure where to put it. It can't live in arrow, as parquet needs to compile without arrow, but aside from creating a new crate I wasn't really sure where to put it...

https://www.youtube.com/watch?v=PAAkCSZUG1c&t=9m28s 🤷
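
To illustrate what a factored-out structure might look like, here is a deliberately simplified interner that assigns each distinct value a dense index in first-seen order. This is a sketch under stated assumptions, not the code from this PR: the name `SimpleInterner` is hypothetical, it uses the standard library's `HashMap` instead of hashbrown and ahash, and it stores each value twice (once in the `Vec`, once as a map key) rather than using a HashMap as a HashSet as described above.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical, simplified interner: maps each distinct value to a dense
/// index assigned in first-seen order.
struct SimpleInterner<T> {
    /// Unique values, in the order they were first seen.
    values: Vec<T>,
    /// Lookup from a value to its index in `values`.
    lookup: HashMap<T, usize>,
}

impl<T: Clone + Eq + Hash> SimpleInterner<T> {
    fn new() -> Self {
        Self {
            values: Vec::new(),
            lookup: HashMap::new(),
        }
    }

    /// Returns the index for `value`, inserting it if it is not yet present.
    fn intern(&mut self, value: &T) -> usize {
        if let Some(&idx) = self.lookup.get(value) {
            return idx;
        }
        let idx = self.values.len();
        self.values.push(value.clone());
        self.lookup.insert(value.clone(), idx);
        idx
    }
}
```

A dictionary encoder built on such a structure would write `values` as the dictionary page and the returned indices as the encoded data. The extra clone per unique value is the kind of overhead that the set-like pattern in the diff would presumably avoid.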

tustvold · 2022-07-29T07:30:33Z

I'm going to get this in, as I need it for #1764. We have time until the next release to address any issues.

ursabot · 2022-07-29T08:12:08Z

Benchmark runs are scheduled for baseline = 985760f and contender = 6ce4c4e. 6ce4c4e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb · 2022-07-29T11:32:54Z

+use std::hash::Hash;
+
+/// Storage trait for [`Interner`]
+pub trait Storage {


Faster parquet DictEncoder

5d8d756

github-actions Bot added the parquet Changes to the parquet crate label Jul 21, 2022

tustvold changed the title ~~Faster parquet DictEncoder~~ Faster parquet DictEncoder (~20%) Jul 21, 2022

tustvold mentioned this pull request Jul 21, 2022

Push gather down to Parquet Encoder #2109

Closed

Merge remote-tracking branch 'upstream/master' into faster-parquet-di…

4c38a63

…ct-encoder

Dandandan reviewed Jul 22, 2022

alamb approved these changes Jul 22, 2022

tustvold added 3 commits July 23, 2022 09:16

Reserve dictionary capacity

e9b527a

Split out interner

8be38af

Fix RAT

a07d513

tustvold merged commit 6ce4c4e into apache:master Jul 29, 2022

alamb reviewed Jul 29, 2022

5 participants
