Skip to content

Add Benchmark::ThreadRange() version with increment instead of multiply#283

Merged
EricWF merged 2 commits into
google:masterfrom
hydroo:master
Sep 3, 2016
Merged

Add Benchmark::ThreadRange() version with increment instead of multiply#283
EricWF merged 2 commits into
google:masterfrom
hydroo:master

Conversation

@hydroo

@hydroo hydroo commented Aug 31, 2016

Copy link
Copy Markdown
Contributor

Hi,

referring to #282.

for my work I want to use a range of threads.
So far the incrementing strategy seems to be limited to just multiplying with a factor.
I.e. you can only use thread counts like 1, 2, 4, 8, 16 or 1, 3, 9, 27.

So I made this small patch.

Should I create a pull request?

If you have an idea and a patch you should always make a PR. Issues are for people without patches :-P

for my work I want to use a range of threads

You must have a lot of threads... I'm initially skeptical of the need for this feature. Typically you don't want to exceed the number of threads supported by the hardware (at least by much). Please try and add a motivating example when you open the PR.

I'm closing this issue to make way for the PR.

Yes we do have up to 24 or 48 threads per board at the moment.

I use it for benchmarking synchronization algorithms. (Not a very common thing to do,)
I will probably need a few more things in the future like thread pinning and more accurate version of PauseTiming so I can make sure they leave at an exact time.
I'm not even sure google/benchmark is the right tool for the job. But it seems nice. The output, functionality and code is pretty simple and clear.

Motivating example:

template<typename barrier>
static void BM_barrier(benchmark::State& state) {

    while (state.KeepRunning()) {

        static barrier b;
        if (state.thread_index == 0) { b.init(state.threads); }

        state.PauseTiming();

        auto start = std::chrono::high_resolution_clock::now();
        b.wait(state.thread_index);
        auto end   = std::chrono::high_resolution_clock::now();

        auto elapsed_seconds = std::chrono::duration_cast<std::chrono::duration<double>>(end - start);

        if (elapsed_seconds < std::chrono::nanoseconds(0)) {
            state.SkipWithError("elapsed seconds < 0");
            break;
        }

        state.SetIterationTime(elapsed_seconds.count());       

        if (state.thread_index == 0) { b.destroy(); }
    }                                                                                                                                                                       
}

BENCHMARK_TEMPLATE(BM_barrier, baseline_barrier)->ThreadRange2(1, 4)->UseManualTime();
BENCHMARK_TEMPLATE(BM_barrier, central_counter_barrier)->ThreadRange2(1, 4)->UseManualTime();
BENCHMARK_TEMPLATE(BM_barrier, sense_reversing_centralized_barrier)->ThreadRange2(1, 4)->UseManualTime();

BENCHMARK_MAIN();

@googlebot

Copy link
Copy Markdown

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If you signed the CLA as a corporation, please let us know the company's name.

@coveralls

Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.8%) to 86.906% when pulling f9e71dd on hydroo:master into 61f570e on google:master.

@AppVeyorBot

Copy link
Copy Markdown

Build benchmark 401 completed (commit b457d0c0eb by @hydroo)

@hydroo

hydroo commented Aug 31, 2016

Copy link
Copy Markdown
Contributor Author

I signed it!

@googlebot

Copy link
Copy Markdown

CLAs look good, thanks!

@EricWF

EricWF commented Sep 2, 2016

Copy link
Copy Markdown
Contributor

I will probably need a few more things in the future like thread pinning and more accurate version of PauseTiming so I can make sure they leave at an exact time.

FYI PR #286 makes PauseTiming() per thread instead of per-process so it doesn't perform synchronization and block the threads.

@hydroo

hydroo commented Sep 2, 2016

Copy link
Copy Markdown
Contributor Author

Thanks for the info.

Comment thread include/benchmark/benchmark_api.h Outdated
// Foo in 16 threads
Benchmark* ThreadRange(int min_threads, int max_threads);

Benchmark* ThreadRange2(int min_threads, int max_threads);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest DenseThreadRange(int min_threads, int max_threads, int stride=1) since it matches the DenseRange function we already have.

@EricWF

EricWF commented Sep 3, 2016

Copy link
Copy Markdown
Contributor

Alright so there are a couple things left to do.

  1. The DenseThreadRange(...) declaration in benchmark_api.h needs to be documented like the other functions around it. Take a look at DenseRange for an example.
  2. This needs a test. Somethink like this should work:
static void BM_ThreadRanges(benchmark::State& st) {
  switch (st.range(0)) {
  case 1:
    assert(st.threads == 1 || st.threads == 2 || st.threads == 3);
    break;
  case 2:
    assert(st.threads == 1 || st.threads == 3);
    break;
  case 3:
    assert(st.threads == 5 || st.threads == 8 || st.threads == 11 || st.threads == 14);
    break;
  default:
    assert(false && "Invalid test case number");
  }
  while (st.KeepRunning()) {}
}
BENCHMARK(BM_ThreadRanges)->Arg(1)->DenseThreadRange(1, 3);
BENCHMARK(BM_ThreadRanges)->Arg(2)->DenseThreadRange(1, 3, 2);
BENCHMARK(BM_ThreadRanges)->Arg(3)->DenseThreadRange(5, 14, 3);

@hydroo

hydroo commented Sep 3, 2016

Copy link
Copy Markdown
Contributor Author

Is there a clang-format file for this project? I just tried clang-format -style=Google on benchmark_api.h and it changed a lot lot. So that's probably not what you do.

@EricWF

EricWF commented Sep 3, 2016

Copy link
Copy Markdown
Contributor

There is a config file now. It was just checked in today.
To format patches use clang-format-diff

@hydroo

hydroo commented Sep 3, 2016

Copy link
Copy Markdown
Contributor Author

When I make test, benchark_test.cc is not executed. Is there a non-manual way to run it? Or how is this intended?

I add the test to the end of benchark_test.cc, ok?

@hydroo

hydroo commented Sep 3, 2016

Copy link
Copy Markdown
Contributor Author

This clang format file seems to be for a clang version newer than 3.8 (Newest Ubuntu).
Error: error: unknown key 'IncludeIsMainRegex'

@EricWF

EricWF commented Sep 3, 2016

Copy link
Copy Markdown
Contributor

Stick the test wherever makes sense and use --style=Google if the config doesn't work.

@hydroo

hydroo commented Sep 3, 2016

Copy link
Copy Markdown
Contributor Author

I force pushed my master, since I had troubles with git. I used this opportunity to clean up my commits as well. It's now only two.

@coveralls

Copy link
Copy Markdown

Coverage Status

Coverage increased (+0.09%) to 87.219% when pulling 86ebccf on hydroo:master into 60e88c2 on google:master.

@AppVeyorBot

Copy link
Copy Markdown

Build benchmark 442 failed (commit b9ba8c6b88 by @hydroo)

@EricWF

EricWF commented Sep 3, 2016

Copy link
Copy Markdown
Contributor

Looks good to me! Thanks for the patch.

@EricWF EricWF merged commit 72be952 into google:master Sep 3, 2016
@EricWF

EricWF commented Sep 3, 2016

Copy link
Copy Markdown
Contributor

To discuss some of your other comments:

more accurate version of PauseTiming so I can make sure they leave at an exact time.

PauseTiming now act's a lot differently. In particular it's thread-local and doesn't act as a barrier so hopefully this solves your problem. Also SetIterationTime is also per-thread now. The final value is the average of each of the threads values.

I'm looking into "thread pinning" as a feature, but I'm not sure what the correct way to pin the threads is.
Could you clarify what you need here?

@hydroo

hydroo commented Sep 5, 2016

Copy link
Copy Markdown
Contributor Author

Thanks!

Last Saturday I had big troubles getting my code to work in the benchmark harness, again.
I liked having the barrier there.
I am not sure what was going on.
It would help, if I could acquire shared resources before spawning any threads and pass them.
If all threads start the benchmark at the same time, I have to make sure everyone has the correct data.
Starting the benchmarks via my own main seems to not help with that, since I can only start all benchmarks at the same time, and not individually (I can only acquire all resources before running any benchmark, and release them afterwards).

Can threads overtake each other in KeepRunning() or is there a barrier between iterations?

As for SetIterationTime, I would also like to have the maximum of all threads and the time from the earliest before to the latest after (kind of time maximum time collectively spent in the code).

Since I benchmark synchronization primitives I want to make sure (1) threads are not migrated during benchmark execution, (2) threads are placed in a sensible way (not 1, 3 on 2 cpus with 2 threads each, e.g.), perhaps I want to even place them individually myself because distances between CPUs affect performance.

Last time I used pthread_getaffinity_np, pthread_setaffinity_np. Haven't looked into it yet. Maybe there is a newer, nicer way. Also sched_setaffinity exists on linux.

@EricWF

EricWF commented Sep 6, 2016

Copy link
Copy Markdown
Contributor

Last Saturday I had big troubles getting my code to work in the benchmark harness, again.
I liked having the barrier there.

Sorry but the barrier in PauseTiming() essentially made multithreaded benchmarks synchronous , which is obviously not what we want. Removing the barrier drastically improved the timing accuracy of multithreaded benchmarks.

I plan to add barrier functionality to State so that users can block when need be, but I don't think PauseTiming() should be it.

I am not sure what was going on.
It would help, if I could acquire shared resources before spawning any threads and pass them.
If all threads start the benchmark at the same time, I have to make sure everyone has the correct data.
Starting the benchmarks via my own main seems to not help with that, since I can only start all benchmarks at the same time, and not individually (I can only acquire all resources before running any benchmark, and release them afterwards).

That shouldn't be a problem with the existing API. See the example below.

Can threads overtake each other in KeepRunning() or is there a barrier between iterations?

KeepRunning() only acts as a barrier on the first and last iteration. This ensures that the initialization code has been run before any thread enters the loop and that the teardown code only runs after all threads are done. You should be able to setup/teardown shared state between multiple threads like this:

void SetupGlobalState();
void TeardownGlobalState();
std::vector<int> global_state;
static BM_Foo(benchmark::State& st) {
  if (st.thread_index == 0) { // A single thread executes this setup.
    SetupGlobalState();
  }
  PinCurrentThread(); // All thread execute this
  while (st.KeepRunning()) { // No thread starts the loop until all threads are here.
     /* ... */
  } // No thread passes here until all threads are complete.
  if (st.thread_index == 0) {
    TeardownGlobalState();
  }
}

As for SetIterationTime, I would also like to have the maximum of all threads and the time from the earliest before to the latest after (kind of time maximum time collectively spent in the code).

Yeah I think SetIterationTime() could offer a bunch more functionality. I'm looking into better support for custom timers.

Since I benchmark synchronization primitives I want to make sure (1) threads are not migrated during benchmark execution, (2) threads are placed in a sensible way (not 1, 3 on 2 cpus with 2 threads each, e.g.), perhaps I want to even place them individually myself because distances between CPUs affect performance.

I think users will generally want to place the threads themselves. That's why IDK if this can be a library feature. Also it's super easy for users to pin threads today (see above).

Last time I used pthread_getaffinity_np, pthread_setaffinity_np. Haven't looked into it yet. Maybe there is a newer, nicer way. Also sched_setaffinity exists on linux.

No that's the way to do it. But as mentioned above it's not clear what to set the affinity to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants