Skip to content

Latest commit

 

History

History
361 lines (256 loc) · 19.7 KB

File metadata and controls

361 lines (256 loc) · 19.7 KB

SYCL_INTEL_sub_group

Introduction

Important
The functionality introduced by this extension is deprecated in favor of the standard sub-group functionality outlined in Section 4.9.1.8 "sub_group_class", Section 4.17.3 "Group functions" and Section 4.17.4 "Group algorithms library" of the SYCL 2020 Specification, Revision 3. All functionality related to a device’s "primary" sub-group size will be redefined in a future extension written against SYCL 2020.
Note
Khronos® is a registered trademark and SYCL™ and SPIR™ are trademarks of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used by permission by Khronos.
Note
This document is better viewed when rendered as html with asciidoctor. GitHub does not render image icons.

This document describes an extension which introduces a new sub_group class representing an implementation-defined grouping of work-items in a work-group.

Name Strings

SYCL_INTEL_sub_group

Notice

Copyright (c) 2019-2020 Intel Corporation. All rights reserved.

Status

Working Draft

This is a preview extension specification, intended to provide early access to a feature for review and community feedback. When the feature matures, this specification may be released as a formal extension.

Because the interfaces defined by this specification are not final and are subject to change they are not intended to be used by shipping software products.

Version

Built On: 2026-03-23
Revision: 1

Contact

John Pennycook, Intel (john 'dot' pennycook 'at' intel 'dot' com)

Dependencies

This extension is written against the SYCL 1.2.1 specification, Revision 6 and the SYCL_INTEL_device_specific_kernel_queries extension.

Overview

A sub-group represents an implementation-defined grouping of work-items in a work-group. The work-items within a sub-group can communicate and synchronize independently of work-items in other sub-groups, and sub-groups are therefore commonly mapped to SIMD hardware where it exists.

The first version of this document is focused on exposing sub-group functionality to the NDRange form of SYCL parallel_for, and does not address hierarchical parallelism.

Towards a Generic Group Abstraction

Providing a generic group abstraction encapsulating the shared functionality of all synchronizable SYCL groups (i.e. work-groups and sub-groups) in a single interface would enable users to write more general code and simplify the introduction of additional SYCL groups in the future (e.g. device-wide synchronization groups). Some names in this proposal are chosen to demonstrate how this may look:

  • The common interface members of sub_group do not reference sub-groups by name, opting instead for generic names like get_group_range().

  • sub_group defines a number of types and static members to simplify writing generic code.

Attributes

The [[intel::sub_group_size(S)]] attribute indicates that the kernel must be compiled and executed with a specific sub-group size. The value of S must be a compile-time integral constant expression. The kernel should only be submitted to a device that supports that sub-group size (as reported by info::device::sub_group_sizes). If the kernel is submitted to a device that does not support the requested sub-group size, or a device on which the requested sub-group size is incompatible with any language features used by the kernel, the implementation must throw a synchronous exception with the errc::feature_not_supported error code from the kernel invocation command.

The [[intel::named_sub_group_size(NAME)]] attribute indicates that the kernel must be compiled and executed with a named sub-group size. NAME must be one of the following special tokens: automatic, primary. If NAME is automatic, the implementation is free to select any of the valid sub-group sizes associated with the device to which the kernel is submitted; the manner in which the sub-group size is selected is implementation-defined. If NAME is primary, the implementation will select the device’s primary sub-group size (as reported by the info::device::primary_sub_group_size query) for all kernels with this attribute.

There are special requirements whenever a device function defined in one translation unit makes a call to a device function that is defined in a second translation unit. In such a case, the second device function is always declared using SYCL_EXTERNAL. If the kernel calling these device functions is defined using a sub-group size attribute, the functions declared using SYCL_EXTERNAL must be similarly decorated to ensure that the same sub-group size is used. This decoration must exist in both the translation unit making the call and also in the translation unit that defines the function. If the sub-group size attribute is missing in the translation unit that makes the call, or if the sub-group size of the called function does not match the sub-group size of the calling function, the program is ill-formed and the compiler must raise a diagnostic.

If no sub-group size attribute appears on a kernel or SYCL_EXTERNAL function, the default behavior is as-if [[intel::named_sub_group_size(primary)]] was specified. This behavior may be overridden by an implementation (e.g. via compiler flags). Only one sub-group size attribute may appear on a kernel or SYCL_EXTERNAL function.

Note that a compiler may choose a different sub-group size for each kernel and SYCL_EXTERNAL function using an automatic sub-group size. If kernels with an automatic sub-group size call SYCL_EXTERNAL functions using an automatic sub-group size, the program may be ill-formed. The behavior when SYCL_EXTERNAL is used in conjunction with an automatic sub-group size is implementation-defined, and code relying on specific behavior should not be expected to be portable across implementations. If a kernel calls a SYCL_EXTERNAL function with an incompatible sub-group size, the compiler must raise a diagnostic — it is expected that this diagnostic will be raised during link-time, since this is the first time the compiler will see both translation units together.

Compiler Flags

The -fsycl-default-sub-group-size flag controls the default sub-group size used within a translation unit, which applies to all kernels and SYCL_EXTERNAL functions without an explicitly specified sub-group size. If the argument passed to -fsycl-default-sub-group-size is an integer S, all kernels and functions without an explicitly specified sub-group size are compiled as-if [[intel::sub_group_size(S)]] was specified. If the argument passed to -fsycl-default-sub-group-size is a string NAME, all kernels and functions without an explicitly specified sub-group size are compiled as-if [[intel::named_sub_group_size(NAME)]] was specified.

Sub-group Queries

Several aspects of sub-group functionality are implementation-defined: the size and number of sub-groups for certain work-group sizes is implementation-defined (and may differ for each kernel); and different devices may make different guarantees with respect to how sub-groups within a work-group are scheduled. Developers can query these behaviors at a device level and for individual kernels. The sub-group size for a given combination of kernel, device and work-group size is fixed.

Each sub-group in a work-group is one-dimensional. If the number of work-items in the highest-numbered dimension of a work-group is evenly divisible by the sub-group size, all sub-groups in the work-group will contain the same number of work-items. Additionally, the numbering of work-items in a sub-group reflects the linear numbering of the work-items in the work-group. Specifically, if a work-item has linear ID is in the sub-group and linear ID iw in the work-group, the work-item with linear ID is+1 in the sub-group has linear ID iw+1 in the work-group.

To maximize portability across devices, developers should not assume that work-items within a sub-group execute in lockstep, that two sub-groups within a work-group will make independent forward progress with respect to one another, nor that remainders arising from work-group division will be handled in a specific way.

The device descriptors below are added to the info::device enumeration class:

Device Descriptors Return Type Description

info::device::max_num_sub_groups

uint32_t

Returns the maximum number of sub-groups in a work-group for any kernel executed on the device. The minimum value is 1.

info::device::sub_group_independent_forward_progress

bool

Returns true if the device supports independent forward progress of sub-groups with respect to other sub-groups in the same work-group.

info::device::primary_sub_group_size

size_t

Return a sub-group size supported by this device that is guaranteed to support all core language features for the device.

info::device::sub_group_sizes

std::vector<size_t>

Returns a std::vector of size_t containing the set of sub-group sizes supported by the device. Each sub-group size is a power of 2 in the range [1, 231]. Not all sub-group sizes are guaranteed to be compatible with all core language features; any incompatibilities are implementation-defined.

An additional query is added to the kernel class, enabling an input value to be passed to get_info. The original get_info query from the SYCL_INTEL_device_specific_kernel_queries extension should be used for queries that do not specify an input type.

Member Functions Description

template <info::kernel_device_specific param>typename info::param_traits<info::kernel_device_specific, param>::return_type get_info(const device &dev, typename info::param_traits<info::kernel_device_specific, param>::input_type value) const

Query information from a kernel using the info::kernel_device_specific descriptor for a specific device and input parameter. The expected value of the input parameter depends on the information being queried.

The kernel descriptors below are added to the info::kernel_device_specific enumeration class:

Kernel Descriptors Input Type Return Type Description

info::kernel_device_specific::max_num_sub_groups

N/A

uint32_t

Returns the maximum number of sub-groups for this kernel.

info::kernel_device_specific::compile_num_sub_groups

N/A

uint32_t

Returns the number of sub-groups specified by the kernel, or 0 (if not specified).

info::kernel_device_specific::max_sub_group_size

range<D>

uint32_t

Returns the maximum sub-group size for this kernel launched with the specified work-group size.

info::kernel_device_specific::compile_sub_group_size

N/A

uint32_t

Returns the sub-group size of the kernel, set implicitly by the implementation or explicitly using a kernel attribute. Returns 0 if the requested size was automatic, and returns the device’s primary sub-group size if the requested size was primary.

The sub_group Class

The sub_group class encapsulates all functionality required to represent a particular sub-group within a parallel execution. It has common by-value semantics and is not default or user-constructible, and can only be accessed in ND-range kernels.

To provide access to the sub_group class, a new member function is added to the nd_item class:

Member Functions Description

sub_group get_sub_group() const

Return the sub-group to which the work-item belongs.

Core Member Functions

The core member functions of the sub-group class provide a mechanism for a developer to query properties of a sub-group and a work-item’s position in it.

Member Functions Description

id<1> get_local_id() const

Return an id representing the index of the work-item within the sub-group.

uint32_t get_linear_local_id() const

Return a uint32_t representing the index of the work-item within the sub-group.

range<1> get_local_range() const

Return a SYCL range representing the number of work-items in the sub-group.

range<1> get_max_local_range() const

Return a SYCL range representing the maximum number of work-items in any sub-group within the nd-range.

id<1> get_group_id() const

Return an id representing the index of the sub-group within the work-group.

uint32_t get_linear_group_id() const

Return a uint32_t representing the index of the sub-group within the work-group.

range<1> get_group_range() const

Return a SYCL range representing the number of sub-groups within the work-group.

range<1> get_max_group_range() const

Return a SYCL range representing the maximum number of sub-groups per work-group within the nd-range.

An example usage of the sub_group class is given below:

parallel_for<class kernel>(..., [&](nd_item item)
{
  sub_group sg = item.get_sub_group();
  for (int v = sg.get_local_id(); v < N; v += sg.get_local_range())
  {
    ...
  }
});

Synchronization Functions

A sub-group barrier synchronizes all work-items in a sub-group, and orders memory operations with a memory fence to all address spaces.

Member Functions Description

void barrier() const

Execute a sub-group barrier.

Shuffles

The shuffle sub-group functions perform arbitrary communication between pairs of work-items in a sub-group. Common patterns — such as shifting all values in a sub-group by a fixed number of work-items — are exposed as specialized shuffles that may be accelerated in hardware.

Member Functions Description

template <typename T> T shuffle(T x, id<1> local_id) const

Exchange values of x between work-items in the sub-group in an arbitrary pattern. Returns the value of x from the work-item with the specified id. The value of local_id must be between 0 and the sub-group size.

template <typename T> T shuffle_down(T x, uint32_t delta) const

Exchange values of x between work-items in the sub-group via a shift. Returns the value of x from the work-item whose id is delta larger than the calling work-item. The value returned when the result of id + delta is greater than or equal to the sub-group size is undefined.

template <typename T> T shuffle_up(T x, uint32_t delta) const

Exchange values of x between work-items in the sub-group via a shift. Returns the value of x from the work-item whose id is delta smaller than the calling work-item. The value of returned when the result of id - delta is less than zero is undefined.

template <typename T> T shuffle_xor(T x, id<1> mask) const

Exchange pairs of values of x between work-items in the sub-group. Returns the value of x from the work-item whose id is equal to the exclusive-or of the calling work-item’s id and mask. mask must be a compile-time constant value that is the same for all work-items in the sub-group.

Sample Header

namespace cl {
namespace sycl {
namespace intel {
struct sub_group {

  using id_type = id<1>;
  using range_type = range<1>;
  using linear_id_type = uint32_t;
  static constexpr int32_t dimensions = 1;

  id_type get_local_id() const;
  linear_id_type get_local_linear_id() const;
  range_type get_local_range() const;
  range_type get_max_local_range() const;

  id_type get_group_id() const;
  linear_id_type get_group_linear_id() const;
  range_type get_group_range() const;

  void barrier() const;

  template <typename T>
  T shuffle(T x, id<1> local_id) const;

  template <typename T>
  T shuffle_down(T x, uint32_t delta) const;

  template <typename T>
  T shuffle_up(T x, uint32_t delta) const;

  template <typename T>
  T shuffle_xor(T x, id<1> mask) const;

};
} // intel
} // sycl
} // cl

Issues

  1. Should sub-group query results for specific kernels depend on work-group size?

    RESOLVED: Yes, this is required by OpenCL devices. Devices that do not require the work-group size can ignore the parameter.

  2. Should sub-group "shuffles" be member functions?

    RESOLVED: Yes, the four shuffles in this extension are a defining feature of sub-groups. Higher-level algorithms (such as those in the SubGroupAlgorithms proposal) may build on them, the same way as higher-level algorithms using work-groups build on work-group local memory.

  3. What should the sub-group size compatible with all features be called?

    RESOLVED: The name adopted is "primary", to convey that it is an integral part of sub-group support provided by the device. Other names considered are listed here for posterity: "default", "stable", "fixed", "core". These terms are easy to misunderstand (i.e. the "default" size may not be chosen by default, the "stable" size is unrelated to the software release cycle, the "fixed" sub-group size may change between devices or compiler releases, the "core" size is unrelated to hardware cores).

  4. How does sub-group size interact with SYCL_EXTERNAL functions? The current behavior requires exact matching. Should this be relaxed to allow alternative implementations (e.g. link-time optimization, multi-versioning)?

    RESOLVED: Exact matching is required to ensure that developers can reason about the portability of their code across different implementations. Setting the default sub-group size to "primary" and providing an override flag to select "automatic" everywhere means that only advanced developers who are tuning sub-group size on a per-kernel basis will have to worry about potential matching issues.

Feature test macro

This extension provides a feature-test macro as described in the core SYCL specification section 6.3.3 "Feature test macros". Therefore, an implementation supporting this extension must predefine the macro SYCL_EXT_ONEAPI_SUB_GROUP to one of the values defined in the table below. Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro’s value to determine which of the extension’s APIs the implementation supports.

Value Description

1

Initial extension version. Base features are supported.

Revision History

Rev Date Author Changes

1

2019-04-19

John Pennycook

Initial public working draft

2

2020-03-16

John Pennycook

Separate class definition from algorithms

3

2020-04-21

John Pennycook

Update max_sub_group_size query

4

2020-04-21

John Pennycook

Restore missing barrier function

5

2020-04-21

John Pennycook

Restore sub-group shuffles as member functions

6

2020-04-22

John Pennycook

Align with SYCL_INTEL_device_specific_kernel_queries

7

2020-07-13

John Pennycook

Clarify that reqd_sub_group_size must be a compile-time constant

8

2020-10-21

John Pennycook

Define default behavior and reduce verbosity

9

2021-03-30

John Pennycook

Rename auto to automatic