Skip to content

Use TBB to parallelise unary functions of large containers #1918

@andrjohns

Description

@andrjohns

Description

Vectorised unary functions (e.g., exp, log) currently use either looping or Eigen's vectorised functions (for arithmetic types only) to evaluate containers. We could instead use the TBB's parallel_for functionality to evaluate the container in parallel.

A minimal example of this would be applying exp to every element of a large Eigen vector:

  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  tbb::task_scheduler_init init;
  tbb::parallel_for(
    tbb::blocked_range<size_t>(0, m_d.size()), 
    [&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
      for (size_t i = r.begin(); i < r.end(); ++i)
        m_res[i] = std::exp(m_d[i]);
    });

Benchmarking shows a performance of ~2x that of Eigen's vectorised exp, and ~5x that of a simple loop (benchmarking code at end of post):

Run on (16 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 64 KiB (x8)
  L2 Unified 512 KiB (x8)
  L3 Unified 8192 KiB (x2)
Load Average: 1.53, 0.87, 0.79
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
ParExp       10259973 ns     10225812 ns           71
EigExp       23918425 ns     23917527 ns           28
LoopExp      53930647 ns     53929832 ns           10

This could provide large performance improvements for containers of autodiff types (where Eigen's vectorised functions can't be used), or for containers of arithmetic types where a vectorised function isn't available in Eigen.

I still need to do more testing to identify the size at which parallelisation becomes beneficial, but things look pretty promising

Current Version:

v3.2.0

Code used for benchmarking:

#include <stan/math/mix.hpp>
#include <benchmark/benchmark.h>
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

static void ParExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);


  for (auto _ : state) {
    tbb::task_scheduler_init init;
    tbb::parallel_for(
      tbb::blocked_range<size_t>(0, m_d.size()), 
      [&m_d,&m_res](const tbb::blocked_range<size_t>& r) {
        for (size_t i = r.begin(); i < r.end(); ++i)
          m_res[i] = std::exp(m_d[i]);
      });
  }
}
BENCHMARK(ParExp);

static void EigExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  for (auto _ : state) {
    m_res.array() = m_d.array().exp();
  }
}
BENCHMARK(EigExp);

static void LoopExp(benchmark::State& state) {
  using Eigen::VectorXd;
  VectorXd m_d = VectorXd::Random(10000000);
  VectorXd m_res(10000000);

  for (auto _ : state) {
    for(size_t i = 0; i < m_d.size(); ++i) {
      m_res[i] = std::exp(m_d[i]);
    }
  }
}
BENCHMARK(LoopExp);

BENCHMARK_MAIN();

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions