Skip to content

Latest commit

 

History

History
executable file
·
390 lines (309 loc) · 14.3 KB

File metadata and controls

executable file
·
390 lines (309 loc) · 14.3 KB

sycl_ext_oneapi_cuda_async_barrier

Notice

Copyright © 2022-2022 Intel Corporation. All rights reserved.

Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.

Contact

To report problems with this extension, please open a new issue at:

Dependencies

This extension is written against the SYCL 2020 revision 4 specification. All references below to the "core SYCL specification" or to section numbers in the SYCL specification refer to that revision.

Status

This is an experimental extension specification, intended to provide early access to features and gather community feedback. Interfaces defined in this specification are implemented in DPC++, but they are not finalized and may change incompatibly in future versions of DPC++ without prior notice. Shipping software products should not rely on APIs defined in this specification.

Note

This extension is currently implemented in DPC++ only for NVIDIA GPU devices with Compute Capability of 8.0 or above and only when using the CUDA backend.

Introduction

This document describes an extension that adds barrier, which acts similarly to std::barrier.

Specification

Feature Test Macro

This extension provides a feature-test macro as described in the core SYCL specification section 6.3.3 "Feature test macros". Therefore, an implementation supporting this extension must predefine the macro SYCL_EXT_ONEAPI_CUDA_ASYNC_BARRIER to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro’s value to determine which of the extension’s APIs the implementation supports.

Value Description

1

The APIs of this experimental extension are not versioned, so the feature-test macro always has this value.

Overview

This extension introduces asynchronous barrier for CUDA devices. This extends group_barrier by splitting it into two calls - arrive and wait. The wait call blocks until the predetermined number of work items in the same work group call arrive. This is also very similar to std::barrier introduced in c++20.

Implementing this requires some space in local memory, where the state of the barrier is kept (this is true even for the CUDA compute capability 8 that support this functionality in hardware).

This extension introduces a barrier class, encapsulating the state of an asynchronous barrier. It must be used only in local memory, and reserves space in local memory even on hardware with native support (e.g. devices with CUDA Compute Capability 8.0). Once a barrier is initialized, it can be reused for multiple cycles of arrivals and waits.

The barrier class is an optional kernel feature as defined by the core SYCL specification section 5.7 "Optional kernel features". This class may only be used on devices that have the aspect::ext_oneapi_cuda_async_barrier aspect. Attempting to submit a kernel using this class to another device causes the kernel invocation command (e.g. parallel_for) to throw a synchronous exception with the errc::kernel_not_supported error code.

Note

Currently the design for optional device features is not fully implemented, so error handling explained in the paragraph above has certain limitations:

Applications that target heterogeneous devices will need to be careful using this extension if one of the devices does not support this CUDA feature. Kernels using this extension may need to be split into a separate translation unit or be linked with -fsycl-device-code-split=per_kernel in order to avoid false compilation errors.

If an application mistakenly submits a kernel using this extension to a device that does not support it, they may not get a synchronous exception as stated above (instead the error will be asynchronous).

Cycle of arrivals and waits

After a predetermined number of arrivals, the barrier moves into the next cycle. Any wait that was previously called with the arrival token from this cycle is unblocked, and future calls to wait with the arrival token from this cycle will not block. The pending count is reset and any future arrivals happen in the next cycle. However, at least one wait (or test_wait returning true) must happen with the arrival token from the current cycle before any arrivals can happen in the next cycle. Another arrival happening before the wait is undefined behavior.

wait and test_wait can only be called with an arrival token from the current cycle (in which case wait will block until the barrier moves into next cycle) or the previous cycle (in which case wait will not block). Calling wait or test_wait with an arrival token from any other cycle results in undefined behavior.

Interface

barrier class has the following member functions:

void initialize(uint32_t expected_count) : Initializes the barrier with an expected number of arrivals, representing the number of arrivals required to unblock calls to wait. This function only needs to be called by one work-item in work group. After the initialization a barrier operation (such as group_barrier()) needs to be executed by all work-items using the barrier object before they can use the newly initialized barrier object. If expected_count is greater than the value returned by max(), behavior is undefined.

void invalidate() : Invalidates a previously initialized barrier, enabling its associated memory to be repurposed. This function must be called by only one work-item in the work group. Before the invalidation a barrier operation (such as group_barrier()) needs to be executed by all work-items using the barrier object, after the last call on the barrier. After the invalidation a barrier operation (such as group_barrier()) needs to be executed by any work-items reusing the memory before the memory is reused. Calling any member function except initialize() on an invalidated barrier results in undefined behavior.

arrival_token arrive() : Executes arrival operation and returns a token that is needed for the wait call corresponding to this arrival.

arrival_token arrive_and_drop() : Reduces expected arrival count for future cycles by one, executes arrival operation and returns a token that is needed for the wait call corresponding to this arrival.

arrival_token arrive_no_complete(int32_t count) : Executes arrival operation that counts as count arrivals and returns a token that is needed for the wait call corresponding to this arrival. If this is the last arrival that causes the cycle to complete, behavior is undefined. That means count must be strictly lower than the remaining number of arrivals required to complete this cycle.

This can be used to signal many arrivals by one function call. However, it should not be used on its own, as it can not be the last arrival in a cycle. So it should either be followed with a call to arrive or by a barrier operation, such as (such as group_barrier()) after which a different work-item is guaranteed to call arrive within the same cycle.

arrival_token arrive_and_drop_no_complete(int32_t count) : Reduces expected arrival count for future cycles by count, executes arrival operation that counts as count arrivals and returns a token that is needed for the wait call corresponding to this arrival. This must not be the last arrival thet causes the cycle to complete - it would be undefined behavior. That means count must be strictly lower than the remaining number of arrivals required to complete this cycle.

This can be used to signal many arrivals by one function call. However, it should not be used on its own, as it can not be the last arrival in a cycle. So it should either be followed with a call to arrive or by a barrier operation, such as (such as group_barrier()) after which a different work-item is guaranteed to call arrive within the same cycle.

void arrive_copy_async() : Schedules arrive operation to be triggered asynchronously when all previous asynchronous memory copies initiated by the calling work item complete. Before the arrive operation is triggered, the pending count on the barrier is increased by 1, so after the arrival there is no change to the pending count. Pending count with the increase by this call must not exceed the value returned by max. If it does, it causes undefined behavior.

void arrive_copy_async_no_inc() : Schedules arrive operation to be triggered asynchronously when all previous asynchronous memory copies initiated by the calling work item complete.

void wait(arrival_token arrival) : Executes wait operation, blocking until the predetermined number of work items have called arrive.

bool test_wait(arrival_token arrival) : Checks whether all the arrivals have already happened for the current cycle, returning true if they did and false if wait(arrival) would block.

void arrive_and_wait() : Equivalent to calling wait(arrive()).

static constexpr uint64_t max() : Returns the maximum value of the expected and pending counts supported by the implementation.

Sample Header

namespace sycl::ext::oneapi::experimental::cuda {

class barrier {
  [implementation defined internal state]

public:
  using arrival_token = [implementation defined];

  // barriers cannot be moved or copied
  barrier(const barrier &other) = delete;
  barrier(barrier &&other) noexcept = delete;
  barrier &operator=(const barrier &other) = delete;
  barrier &operator=(barrier &&other) noexcept = delete;

  void initialize(uint32_t expected_count);
  void invalidate();
  arrival_token arrive();
  arrival_token arrive_and_drop();
  arrival_token arrive_no_complete(int32_t count);
  arrival_token arrive_and_drop_no_complete(int32_t count);
  void arrive_copy_async();
  void arrive_copy_async_no_inc();
  void wait(arrival_token arrival);
  bool test_wait(arrival_token arrival);
  void arrive_and_wait();
  static constexpr uint64_t max();
};

} // namespace sycl::ext::oneapi::experimental::cuda

Examples

using namespace sycl;
using namespace sycl::ext::oneapi::experimental::cuda;

[...]

q.submit([&](handler &cgh) {
  auto acc = buf.get_access<access::mode::read_write>(cgh);
  accessor<int, 1, access::mode::read_write, access::target::local> loc(
      N, cgh);
  accessor<barrier, 1, access::mode::read_write, access::target::local>
      loc_barrier(2, cgh);
  cgh.parallel_for(nd_range<1>(N, N), [=](nd_item<1> item) {
    size_t idx = item.get_local_linear_id();
    loc[idx] = acc[idx];
    if (idx <= 2) {
      loc_barrier[idx].initialize(N);
    }
    item.barrier(access::fence_space::local_space);
    for (int i = 0; i < N; i++) {
      int val = loc[idx];
      barrier::arrival_token arr = loc_barrier[0].arrive();
      val += 1;
      int dst_idx = (idx + 1) % N;
      loc_barrier[0].wait(arr);
      loc[dst_idx] = val;
      loc_barrier[1].wait(loc_barrier[1].arrive());
    }
    acc[idx] = loc[idx];
  });
});
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::cuda;

[...]

q.submit([&](handler &cgh) {
  auto acc = buf.get_access<access::mode::read_write>(cgh);
  accessor<int, 1, access::mode::read_write, access::target::local> loc(
      N, cgh);
  accessor<barrier, 1, access::mode::read_write, access::target::local>
      loc_barrier(2, cgh);
  cgh.parallel_for(nd_range<1>(N, N), [=](nd_item<1> item) {
    size_t idx = item.get_local_linear_id();
    if (idx == 0) {
      loc_barrier[0].initialize(2*N);
    }
    item.barrier(access::fence_space::local_space);
    item.async_work_group_copy(loc.get_pointer(), data_acc.get_pointer(),
                                N);
    loc_barrier->arrive_copy_async_no_inc();

    [... do some other work ...]

    loc_barrier->arrive_and_wait();

    [... use copied data ...]

  });
});
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::cuda;

[...]

q.submit([&](handler &cgh) {
  auto acc = buf.get_access<access::mode::read_write>(cgh);
  accessor<int, 1, access::mode::read_write, access::target::local> loc(
      N, cgh);
  accessor<barrier, 1, access::mode::read_write, access::target::local>
      loc_barrier(2, cgh);
  cgh.parallel_for(nd_range<1>(N, N), [=](nd_item<1> item) {
    size_t idx = item.get_local_linear_id();
    loc[idx] = acc[idx];
    if (idx <= 2) {
      loc_barrier[idx].initialize(N/2);
    }
    item.barrier(access::fence_space::local_space);
    for(int i=0; i<N; i++){
      if(idx > i){

        [...]

        barrier::arrival_token arr = loc_barrier->arrive();

        [...]

        loc_barrier->wait(arr);

        [...]

      }
    } else if(idx == i){

        [...]

        loc_barrier->arrive_and_drop();
        break;
    }
  });
});

Issues

  1. Is barrier the best name? Reasons for that name are that it is mostly in line with c++20 std::barrier and CUDA has the same name for this functionality. However it might be confusing with group_barrier, which is not present in c++20 and has a different name in CUDA - __syncthreads. Earlier version of CUDA docs called this awbarrier. Now that name is deprecated and they call it asynchronous barrier in text and barrier in code. Related PTX instructions use mbarrier. Other ideas for the name: "non-blocking barrier" and "split barrier".

RESOLUTION: We will use the name barrier.