ARROW-5269: [C++][Archery] Mark relevant benchmarks as regression by fsaintjacques · Pull Request #4285 · apache/arrow

fsaintjacques · 2019-05-09T17:12:59Z

The goal of this change is to mark benchmarks candidate for automated regression checks. Some benchmarks were refactored for various reasons:

Code deduplication where needed
Uniform input size for benchmarks in the same suite, usually to minimize the effect of cache hierarchy.
Favor external repetitions over manual repetitions and mintime when possible.
Add cmake ARROW_BUILD_BENCHMARKS_REFERENCE to toggle reference benchmarks.
Remove default benchmark filter of ^Regression.
Remove BM_ prefix from benchmark names
Fixes and addons to archery to support the --pdb option for debugging and add support for benchmarks reported as items_per_seconds.
Adds archery benchmark list sub-command to list suites and benchmarks

wesm · 2019-05-09T17:56:26Z

I'm not sure why the "Regression" designation is necessary. It seems like we would want to monitor the performance of all of our benchmarks for significant changes.

fsaintjacques · 2019-05-09T18:19:34Z

Multiple reasons:

some benchmarks are flaky.
some benchmarks exists only for point of reference (anything with Naive/Dummy in the name) for other benchmarks or hardware discover-ability, e.g. the memory latency and bandwidth benchmarks.
some benchmarks are heavyweight

I think it's preferable to white list than include everything and possibly increase false positive regression alerts. It also lower the time to run benchmarks since archery is explicitly passing (by default) the --benchmark_filter="^Regression" options.

fsaintjacques · 2019-05-09T18:21:06Z

By the way, I have no attachment to Regression it can be something else, but the function name (and hence the benchmark name) paired with a regex are the easiest way to label and filter.

wesm · 2019-05-09T18:22:23Z

If it's an issue with reporting on flaky benchmarks, we could create a "blacklist file" to instruct the reporter not to report regressions in known flakes.

nealrichardson · 2019-05-10T14:52:07Z

Some thoughts, for what they're worth:

A benefit of blacklisting vs. whitelisting (whatever the mechanism for indicating it) is that the default is to include benchmarks in regression testing, and I think that's the right default--if you're adding a benchmark, most likely the purpose is to make sure performance doesn't go down in the future. The bias should be in favor of this, and make developers have to opt out by blacklisting.
I'm generally of the opinion that flaky tests have negative value--you don't notice them when they pass, and when they fail, you don't believe them--and should either be fixed immediately or deleted (or skipped, if you're into that). My instinct is that the same is true for flaky benchmarks.
Maybe heavy benchmarks belong in a different place altogether since we aren't going to run them routinely.

fsaintjacques · 2019-05-10T17:37:36Z

Ok, I'll just remove the filtering by default. Should we strip the BM_ out of every benchmark names? I don't see the value in keeping this.

wesm · 2019-05-15T19:20:50Z

It's been several days and I'm still having a hard time swallowing the "Regression*" naming convention

fsaintjacques · 2019-05-21T15:49:18Z

@ursabot benchmark

ursabot · 2019-05-21T15:49:20Z

I've successfully started builds for this PR

ursabot · 2019-05-21T16:18:18Z

AMD64 Ubuntu 18.04 C++ Benchmark

Build failed.

codecov-io · 2019-05-21T16:38:35Z

Codecov Report

Merging #4285 into master will increase coverage by 0.98%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4285      +/-   ##
==========================================
+ Coverage    88.3%   89.28%   +0.98%     
==========================================
  Files         780      636     -144     
  Lines       98400    87138   -11262     
  Branches     1251        0    -1251     
==========================================
- Hits        86891    77801    -9090     
+ Misses      11273     9337    -1936     
+ Partials      236        0     -236

Impacted Files	Coverage Δ
cpp/src/plasma/thirdparty/ae/ae.c	`70.75% <0%> (-0.95%)`	⬇️
cpp/src/plasma/test/external_store_tests.cc	`100% <0%> (ø)`	⬆️
cpp/src/plasma/test/client_tests.cc	`100% <0%> (ø)`	⬆️
cpp/src/plasma/test/serialization_tests.cc	`100% <0%> (ø)`	⬆️
go/arrow/ipc/writer.go
go/arrow/math/uint64_amd64.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/ipc/file_reader.go
js/src/enum.ts
go/arrow/array/builder.go
... and 144 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68329e7...27633c9. Read the comment docs.

fsaintjacques · 2019-05-22T17:53:21Z

@ursabot benchmark

fsaintjacques · 2019-05-22T17:58:26Z

@pitrou note that I added the cmake configuration flag ARROW_BUILD_BENCHMARKS_REFERENCE which toggles a C/C++ define ARROW_WITH_BENCHMARKS_REFERENCE. The goal is to avoid compiling (and running) benchmarks which are not used for monitoring purposes but only point of reference, e.g. memory latency/bandwidth, naive implementations of algorithms...

Thus if you add a new benchmark and you know that it's only used for a point of reference of the true benchmark, you should probably wrap it with said #ifdef.

ursabot · 2019-05-22T18:52:51Z

I've successfully started builds for this PR

ursabot · 2019-05-22T19:03:27Z

AMD64 Ubuntu 18.04 C++ Benchmark

Build failed with an exception.

kszucs · 2019-05-24T16:18:36Z

Really just a nit, but ARROW_WITH_REFERENCE_BENCHMARKS might be a better choice.

Well, there's ARROW_BUILD_BENCHMARKS already. The convention seems to be ARROW_BUILD_xxx to enable the xxx target, and ARROW_WITH_yyy to enable support for the yyy external library.

Are you suggesting changing the s/BUILD/WITH/ or the swap of BENCHMARKS/REFERENCE?

kszucs

LGTM, just a couple nits

kszucs · 2019-05-24T16:21:33Z

@ursabot benchmark

ursabot · 2019-05-24T16:21:35Z

I've successfully started builds for this PR

kszucs · 2019-05-24T16:23:27Z

Why did We get empty output in https://ci.ursalabs.org/#builders/73/builds/20 ?

fsaintjacques · 2019-05-24T16:26:24Z

@kszucs because benchmarks were renamed and thus the intersection between 2 run is empty. See https://github.com/apache/arrow/blob/master/dev/archery/archery/benchmark/compare.py#L115-L116

ursabot · 2019-05-24T16:29:49Z

AMD64 Ubuntu 18.04 C++ Benchmark

Build failed with an exception.

pitrou

Thank you very much for doing this. Here are some comments and questions.

- Ensure that Builder benchmarks are working on inputs of the same (approximately) size in bytes. This allows relative comparison between builders. - Renamed benchmarks by prefixing `Regression`. - Fixed extra string copy in BuildStringDictionary.

- Use the same buffer size in all bitmap benchmarks. - Fix some reporting numbers

- Add '\n' do diff json output adhering to jsonlines - Add support for items_per_second metrics - Add `--pdb` option to drop a pdb shell on uncaught exception

- Use a single data generator - Fix multithread input size - Exclude multithread from regressions

- Favor external repetitions over manual repetitions and mintime when possible. - Add cmake ARROW_BUILD_BENCHMARKS_REFERENCE to toggle reference benchmarks. - Remove default benchmark filter of `^Regression`. - Remove Regression prefix from benchmark

fsaintjacques · 2019-05-28T15:28:10Z

@pitrou @kszucs I addressed the comments.

pitrou

Thanks @fsaintjacques . just a couple more things and it'll be good IMO :-)

pitrou

+1 :-)

fsaintjacques force-pushed the ARROW-5269-regression-candidates branch 2 times, most recently from fe050bb to 0592ecb Compare May 21, 2019 15:49