The DPC++ Compiler compiles C++ and SYCL* source files with code for both CPU and a wide range of compute accelerators such as GPU and FPGA.
- Prerequisites
- Build DPC++ toolchain
- Use DPC++ toolchain
- C++ standard
- Known Issues and Limitations
- CUDA backend limitations
- Find More
git- Downloadcmakeversion 3.14 or later - Downloadpython- Downloadninja- Download- C++ compiler
Throughout this document DPCPP_HOME denotes the path to the local directory
created as DPC++ workspace. It might be useful to create an environment variable
with the same name.
Linux:
export DPCPP_HOME=~/sycl_workspace
mkdir $DPCPP_HOME
cd $DPCPP_HOME
git clone https://github.com/intel/llvm -b syclWindows (64-bit):
Open a developer command prompt using one of two methods:
- Click start menu and search for "x64 Native Tools Command Prompt for VS XXXX", where XXXX is a version of installed Visual Studio.
- Ctrl-R, write "cmd", click enter, then run
"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat" x64
set DPCPP_HOME=%USERPROFILE%\sycl_workspace
mkdir %DPCPP_HOME%
cd %DPCPP_HOME%
git clone https://github.com/intel/llvm -b syclThe easiest way to get started is to use the buildbot configure and compile scripts.
In case you want to configure CMake manually the up-to-date reference for variables is in these files.
Linux:
python $DPCPP_HOME/llvm/buildbot/configure.py
python $DPCPP_HOME/llvm/buildbot/compile.pyWindows (64-bit):
python %DPCPP_HOME%\llvm\buildbot\configure.py
python %DPCPP_HOME%\llvm\buildbot\compile.pyYou can use the following flags with configure.py:
--system-ocl-> Don't Download OpenCL deps via cmake but use the system ones--no-werror-> Don't treat warnings as errors when compiling llvm--cuda-> use the cuda backend (see Nvidia CUDA)--shared-libs-> Build shared libraries-t-> Build type (debug or release)-o-> Path to build directory--cmake-gen-> Set build system type (e.g.--cmake-gen "Unix Makefiles")
Ahead-of-time compilation for the Intel® processors is enabled by default. For more, see opencl-aot documentation.
There is experimental support for building and linking DPC++ runtime with libc++ library instead of libstdc++. To enable it the following CMake options should be used.
Linux:
-DSYCL_USE_LIBCXX=ON \
-DSYCL_LIBCXX_INCLUDE_PATH=<path to libc++ headers> \
-DSYCL_LIBCXX_LIBRARY_PATH=<path to libc++ and libc++abi libraries>
There is experimental support for DPC++ for CUDA devices.
To enable support for CUDA devices, follow the instructions for the Linux
DPC++ toolchain, but add the --cuda flag to configure.py
Enabling this flag requires an installation of CUDA 10.2 on the system, refer to NVIDIA CUDA Installation Guide for Linux.
Currently, the only combination tested is Ubuntu 18.04 with CUDA 10.2 using a Titan RTX GPU (SM 71), but it should work on any GPU compatible with SM 50 or above. The default SM for the NVIDIA CUDA backend is 5.0. Users can specify lower values, but some features may not be supported.
TODO: add instructions how to deploy built DPC++ toolchain.
The DPC++ toolchain support on CUDA platforms is still in an experimental phase. Currently, the DPC++ toolchain relies on having a recent OpenCL implementation on the system in order to link applications to the DPC++ runtime. The OpenCL implementation is not used at runtime if only the CUDA backend is used in the application, but must be installed.
The OpenCL implementation provided by the CUDA SDK is OpenCL 1.2, which is too old to link with the DPC++ runtime and lacks some symbols.
We recommend installing the low level CPU runtime, following the instructions in the next section.
Instead of installing the low level CPU runtime, it is possible to build and install the Khronos ICD loader, which contains all the symbols required.
To run DPC++ applications on OpenCL devices, OpenCL implementation(s) must be present in the system.
To run DPC++ applications on Level Zero devices, Level Zero implementation(s) must be present in the system. You can find the link to the Level Zero spec in the following section Find More.
The Level Zero RT for GPU, OpenCL RT for GPU, OpenCL RT for CPU, FPGA
emulation RT and TBB runtime which are needed to run DPC++ application
on Intel GPU or Intel CPU devices can be downloaded using links in
the dependency configuration file
and installed following the instructions below. The same versions are used in
PR testing.
Linux:
-
Extract the archive. For example, for the archives
oclcpuexp_<cpu_version>.tar.gzandfpgaemu_<fpga_version>.tar.gzyou would run the following commands# Extract OpenCL FPGA emulation RT mkdir -p /opt/intel/oclfpgaemu_<fpga_version> cd /opt/intel/oclfpgaemu_<fpga_version> tar zxvf fpgaemu_<fpga_version>.tar.gz # Extract OpenCL CPU RT mkdir -p /opt/intel/oclcpuexp_<cpu_version> cd /opt/intel/oclcpuexp_<cpu_version> tar -zxvf oclcpu_rt_<cpu_version>.tar.gz
-
Create ICD file pointing to the new runtime (requires root access)
# OpenCL FPGA emulation RT echo /opt/intel/oclfpgaemu_<fpga_version>/x64/libintelocl_emu.so > /etc/OpenCL/vendors/intel_fpgaemu.icd # OpenCL CPU RT echo /opt/intel/oclcpuexp_<cpu_version>/x64/libintelocl.so > /etc/OpenCL/vendors/intel_expcpu.icd
-
Extract TBB libraries. For example, for the archive tbb-<tbb_version>-lin.tgz
mkdir -p /opt/intel/tbb_<tbb_version> cd /opt/intel/tbb_<tbb_version> tar -zxvf tbb*lin.tgz
-
Copy files from or create symbolic links to TBB libraries in OpenCL RT folder:
# OpenCL FPGA emulation RT ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbb.so /opt/intel/oclfpgaemu_<fpga_version>/x64 ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/oclfpgaemu_<fpga_version>/x64 ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbb.so.2 /opt/intel/oclfpgaemu_<fpga_version>/x64 ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/oclfpgaemu_<fpga_version>/x64 # OpenCL CPU RT ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbb.so /opt/intel/oclcpuexp_<cpu_version>/x64 ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbbmalloc.so /opt/intel/oclcpuexp_<cpu_version>/x64 ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbb.so.2 /opt/intel/oclcpuexp_<cpu_version>/x64 ln -s /opt/intel/tbb_<tbb_version>/tbb/lib/intel64/gcc4.8/libtbbmalloc.so.2 /opt/intel/oclcpuexp_<cpu_version>/x64
-
Configure library paths (requires root access)
echo /opt/intel/oclfpgaemu_<fpga_version>/x64 > /etc/ld.so.conf.d/libintelopenclexp.conf echo /opt/intel/oclcpuexp_<cpu_version>/x64 >> /etc/ld.so.conf.d/libintelopenclexp.conf ldconfig -f /etc/ld.so.conf.d/libintelopenclexp.conf
Windows (64-bit):
-
If you need OpenCL runtime for Intel
GPUas well, then update/install it first. Do it before installing OpenCL runtime for IntelCPUruntime as OpenCL runtime for IntelGPUinstaller may re-write some important files or settings and make existing OpenCL runtime for IntelCPUruntime not working properly. -
Extract the archive with OpenCL runtime for Intel
CPUand/or for IntelFPGAemulation using links in the dependency configuration file. For example, toc:\oclcpu_rt_<cpu_version>. -
Extract the archive with TBB runtime using links in the dependency configuration file. For example, to
c:\tbb_<tbb_version>. -
Run
Command PromptasAdministrator. To do that clickStartbutton, typeCommand Prompt, click the Right mouse button on it, then clickRun As Administrator, then clickYesto confirm. -
In the opened windows run
install.batprovided with the extracted files to install runtime to the system and setup environment variables. So, if the extracted files are inc:\oclcpu_rt_<cpu_version>\folder, then type the command:# Install OpenCL FPGA emulation RT # Answer N to clean previous OCL_ICD_FILENAMES configuration c:\oclfpga_rt_<fpga_version>\install.bat c:\tbb_<tbb_version>\tbb\bin\intel64\vc14 # Install OpenCL CPU RT # Answer Y to setup CPU RT side-bi-side with FPGA RT c:\oclcpu_rt_<cpu_version>\install.bat c:\tbb_<tbb_version>\tbb\bin\intel64\vc14
To verify that built DPC++ toolchain is working correctly, run:
Linux:
python $DPCPP_HOME/llvm/buildbot/check.pyWindows (64-bit):
python %DPCPP_HOME%\llvm\buildbot\check.pyIf no OpenCL GPU/CPU runtimes are available, the corresponding tests are skipped.
If CUDA support has been built, it is tested only if there are CUDA devices available.
Khronos* SYCL* conformance test suite (CTS) is intended to validate implementation conformance to Khronos* SYCL* specification. DPC++ compiler is expected to pass significant number of tests, and it keeps improving.
Follow Khronos* SYCL* CTS instructions from README file to obtain test sources and instructions how build and execute the tests.
To configure testing of DPC++ toochain set
SYCL_IMPLEMENTATION=Intel_SYCL and
Intel_SYCL_ROOT=<path to the SYCL installation> CMake variables.
Linux:
cmake -DIntel_SYCL_ROOT=$DPCPP_HOME/deploy -DSYCL_IMPLEMENTATION=Intel_SYCL ...Windows (64-bit):
cmake -DIntel_SYCL_ROOT=%DPCPP_HOME%\deploy -DSYCL_IMPLEMENTATION=Intel_SYCL ...Building Doxygen documentation is similar to building the product itself. First, the following tools need to be installed:
- doxygen
- graphviz
Then you'll need to add the following options to your CMake configuration command:
-DLLVM_ENABLE_DOXYGEN=ON
After CMake cache is generated, build the documentation with doxygen-sycl
target. It will be put to $DPCPP_HOME/llvm/build/tools/sycl/doc/html
directory.
A simple DPC++ or SYCL* program consists of following parts:
- Header section
- Allocating buffer for data
- Creating SYCL queue
- Submitting command group to SYCL queue which includes the kernel
- Wait for the queue to complete the work
- Use buffer accessor to retrieve the result on the device and verify the data
- The end
Creating a file simple-sycl-app.cpp with the following C++/SYCL code:
#include <CL/sycl.hpp>
int main() {
// Creating buffer of 4 ints to be used inside the kernel code
cl::sycl::buffer<cl::sycl::cl_int, 1> Buffer(4);
// Creating SYCL queue
cl::sycl::queue Queue;
// Size of index space for kernel
cl::sycl::range<1> NumOfWorkItems{Buffer.get_count()};
// Submitting command group(work) to queue
Queue.submit([&](cl::sycl::handler &cgh) {
// Getting write only access to the buffer on a device
auto Accessor = Buffer.get_access<cl::sycl::access::mode::write>(cgh);
// Executing kernel
cgh.parallel_for<class FillBuffer>(
NumOfWorkItems, [=](cl::sycl::id<1> WIid) {
// Fill buffer with indexes
Accessor[WIid] = (cl::sycl::cl_int)WIid.get(0);
});
});
// Getting read only access to the buffer on the host.
// Implicit barrier waiting for queue to complete the work.
const auto HostAccessor = Buffer.get_access<cl::sycl::access::mode::read>();
// Check the results
bool MismatchFound = false;
for (size_t I = 0; I < Buffer.get_count(); ++I) {
if (HostAccessor[I] != I) {
std::cout << "The result is incorrect for element: " << I
<< " , expected: " << I << " , got: " << HostAccessor[I]
<< std::endl;
MismatchFound = true;
}
}
if (!MismatchFound) {
std::cout << "The results are correct!" << std::endl;
}
return MismatchFound;
}
To build simple-sycl-app put bin and lib to PATHs:
Linux:
export PATH=$DPCPP_HOME/llvm/build/bin:$PATH
export LD_LIBRARY_PATH=$DPCPP_HOME/llvm/build/lib:$LD_LIBRARY_PATHWindows (64-bit):
set PATH=%DPCPP_HOME%\llvm\build\bin;%PATH%
set LIB=%DPCPP_HOME%\llvm\build\lib;%LIB%and run following command:
clang++ -fsycl simple-sycl-app.cpp -o simple-sycl-app.exeWhen building for CUDA, use the CUDA target triple as follows:
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice \
simple-sycl-app.cpp -o simple-sycl-app-cuda.exeThis simple-sycl-app.exe application doesn't specify SYCL device for
execution, so SYCL runtime will use default_selector logic to select one
of accelerators available in the system or SYCL host device.
In this case, the behavior of the default_selector can be altered
using the SYCL_BE environment variable, setting PI_CUDA forces
the usage of the CUDA backend (if available), PI_OPENCL will
force the usage of the OpenCL backend.
SYCL_BE=PI_CUDA ./simple-sycl-app-cuda.exeThe default is the OpenCL backend if available. If there are no OpenCL or CUDA devices available, the SYCL host device is used. The SYCL host device executes the SYCL application directly in the host, without using any low-level API.
NOTE: nvptx64-nvidia-cuda-sycldevice is usable with -fsycl-targets
if clang was built with the cmake option SYCL_BUILD_PI_CUDA=ON.
Linux & Windows (64-bit):
./simple-sycl-app.exe
The results are correct!NOTE: Currently, when the application has been built with the CUDA target,
the CUDA backend must be selected at runtime using the SYCL_BE environment
variable.
SYCL_BE=PI_CUDA ./simple-sycl-app-cuda.exeNOTE: DPC++/SYCL developers can specify SYCL device for execution using
device selectors (e.g. cl::sycl::cpu_selector, cl::sycl::gpu_selector,
Intel FPGA selector(s)) as
explained in following section Code the program for a specific
GPU.
To specify OpenCL device SYCL provides the abstract cl::sycl::device_selector
class which the can be used to define how the runtime should select the best
device.
The method cl::sycl::device_selector::operator() of the SYCL
cl::sycl::device_selector is an abstract member function which takes a
reference to a SYCL device and returns an integer score. This abstract member
function can be implemented in a derived class to provide a logic for selecting
a SYCL device. SYCL runtime uses the device for with the highest score is
returned. Such object can be passed to cl::sycl::queue and cl::sycl::device
constructors.
The example below illustrates how to use cl::sycl::device_selector to create
device and queue objects bound to Intel GPU device:
#include <CL/sycl.hpp>
int main() {
class NEOGPUDeviceSelector : public cl::sycl::device_selector {
public:
int operator()(const cl::sycl::device &Device) const override {
using namespace cl::sycl::info;
const std::string DeviceName = Device.get_info<device::name>();
const std::string DeviceVendor = Device.get_info<device::vendor>();
return Device.is_gpu() && (DeviceName.find("HD Graphics NEO") != std::string::npos);
}
};
NEOGPUDeviceSelector Selector;
try {
cl::sycl::queue Queue(Selector);
cl::sycl::device Device(Selector);
} catch (cl::sycl::invalid_parameter_error &E) {
std::cout << E.what() << std::endl;
}
}
The device selector below selects an NVIDIA device only, and won't execute if there is none.
class CUDASelector : public cl::sycl::device_selector {
public:
int operator()(const cl::sycl::device &Device) const override {
using namespace cl::sycl::info;
const std::string DriverVersion = Device.get_info<device::driver_version>();
if (Device.is_gpu() && (DriverVersion.find("CUDA") != std::string::npos)) {
std::cout << " CUDA device found " << std::endl;
return 1;
};
return -1;
}
};- DPC++ runtime is built as C++14 library.
- DPC++ compiler is building apps as C++17 apps by default.
- DPC++ device compiler fails if the same kernel was used in different translation units.
- SYCL host device is not fully supported.
- 32-bit host/target is not supported.
- DPC++ works only with OpenCL low level runtimes which support out-of-order queues.
- On Windows linking DPC++ applications with
/MTdflag is known to cause crashes.
- Backend is only supported on Linux
- The only combination tested is Ubuntu 18.04 with CUDA 10.2 using a Titan RTX GPU (SM 71), but it should work on any GPU compatible with SM 50 or above
- The NVIDIA OpenCL headers conflict with the OpenCL headers required for this project and may cause compilation issues on some platforms
- DPC++ specification: https://spec.oneapi.com/versions/latest/elements/dpcpp/source/index.html
- SYCL* 1.2.1 specification: www.khronos.org/registry/SYCL/specs/sycl-1.2.1.pdf
- oneAPI Level Zero specification: https://spec.oneapi.com/versions/latest/oneL0/index.html
*Other names and brands may be claimed as the property of others.