Use TBB to parallelise unary functions of large containers

## Description
Vectorised unary functions (e.g., ```exp```, ```log```) currently use either looping or Eigen's vectorised functions (for arithmetic types only) to evaluate containers. We could instead use the TBB's ```parallel_for``` functionality to evaluate the container in parallel.

A minimal example of this would be applying ```exp``` to every element of a large Eigen vector:
```
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  tbb::task_scheduler_init init;
  tbb::parallel_for(
    tbb::blocked_range<size_t>(0, m_d.size()), 
    [&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i < r.end(); ++i)
        m_res[i] = std::exp(m_d[i]);
    });

```

Benchmarking shows a performance of ~2x that of Eigen's vectorised ```exp```, and ~5x that of a simple loop (benchmarking code at end of post):
```
Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 1.53, 0.87, 0.79
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ParExp       10259973 ns     10225812 ns           71
EigExp       23918425 ns     23917527 ns           28
LoopExp      53930647 ns     53929832 ns           10

```

This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.

I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising

#### Current Version:
v3.2.0

#### Code used for benchmarking:
```
#include <stan/math/mix.hpp>
#include <benchmark/benchmark.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

static void ParExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);


  for (auto _ : state) {
    tbb::task_scheduler_init init;
    tbb::parallel_for(
      tbb::blocked_range<size_t>(0, m_d.size()), 
      [&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i < r.end(); ++i)
          m_res[i] = std::exp(m_d[i]);
      });
  }
}
BENCHMARK(ParExp);

static void EigExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  for (auto _ : state) {
    m_res.array() = m_d.array().exp();
  }
}
BENCHMARK(EigExp);

static void LoopExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  for (auto _ : state) {
    for(size_t i = 0; i < m_d.size(); ++i) {
      m_res[i] = std::exp(m_d[i]);
    }
  }
}
BENCHMARK(LoopExp);

BENCHMARK_MAIN();

```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use TBB to parallelise unary functions of large containers #1918

Description

Current Version:

Code used for benchmarking:

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Use TBB to parallelise unary functions of large containers #1918

Description

Description

Current Version:

Code used for benchmarking:

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions