Description
Vectorised unary functions (e.g., exp, log) currently use either looping or Eigen's vectorised functions (for arithmetic types only) to evaluate containers. We could instead use the TBB's parallel_for functionality to evaluate the container in parallel.
A minimal example of this would be applying exp to every element of a large Eigen vector:
using Eigen::VectorXd;
VectorXd m_d = VectorXd::Random(10000000);
VectorXd m_res(10000000);
tbb::task_scheduler_init init;
tbb::parallel_for(
tbb::blocked_range<size_t>(0, m_d.size()),
[&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
for (size_t i = r.begin(); i < r.end(); ++i)
m_res[i] = std::exp(m_d[i]);
});
Benchmarking shows a performance of ~2x that of Eigen's vectorised exp, and ~5x that of a simple loop (benchmarking code at end of post):
Run on (16 X 3700 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 64 KiB (x8)
L2 Unified 512 KiB (x8)
L3 Unified 8192 KiB (x2)
Load Average: 1.53, 0.87, 0.79
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
ParExp 10259973 ns 10225812 ns 71
EigExp 23918425 ns 23917527 ns 28
LoopExp 53930647 ns 53929832 ns 10
This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.
I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising
Current Version:
v3.2.0
Code used for benchmarking:
#include <stan/math/mix.hpp>
#include <benchmark/benchmark.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
static void ParExp(benchmark::State& state) {
using Eigen::VectorXd;
VectorXd m_d = VectorXd::Random(10000000);
VectorXd m_res(10000000);
for (auto _ : state) {
tbb::task_scheduler_init init;
tbb::parallel_for(
tbb::blocked_range<size_t>(0, m_d.size()),
[&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
for (size_t i = r.begin(); i < r.end(); ++i)
m_res[i] = std::exp(m_d[i]);
});
}
}
BENCHMARK(ParExp);
static void EigExp(benchmark::State& state) {
using Eigen::VectorXd;
VectorXd m_d = VectorXd::Random(10000000);
VectorXd m_res(10000000);
for (auto _ : state) {
m_res.array() = m_d.array().exp();
}
}
BENCHMARK(EigExp);
static void LoopExp(benchmark::State& state) {
using Eigen::VectorXd;
VectorXd m_d = VectorXd::Random(10000000);
VectorXd m_res(10000000);
for (auto _ : state) {
for(size_t i = 0; i < m_d.size(); ++i) {
m_res[i] = std::exp(m_d[i]);
}
}
}
BENCHMARK(LoopExp);
BENCHMARK_MAIN();
Description
Vectorised unary functions (e.g.,
exp,log) currently use either looping or Eigen's vectorised functions (for arithmetic types only) to evaluate containers. We could instead use the TBB'sparallel_forfunctionality to evaluate the container in parallel.A minimal example of this would be applying
expto every element of a large Eigen vector:Benchmarking shows a performance of ~2x that of Eigen's vectorised
exp, and ~5x that of a simple loop (benchmarking code at end of post):This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.
I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising
Current Version:
v3.2.0
Code used for benchmarking: