Hello SYCL
At this point we assume that you have set up the prerequisites for developing with ComputeCpp, so we will proceed by writing our first SYCL application.
'Hello SYCL'
#include <iostream>
#include <CL/sycl.hpp>
namespace sycl = cl::sycl;
int main(int, char**) {
<<Setup host storage>>
<<Initialize device selector>>
<<Initialize queue>>
{
<<Setup device storage>>
<<Execute kernel>>
}
<<Print results>>
return 0;
}
The first thing we do is include the universal SYCL header. You only ever need to use this one header - it provides the entire cl::sycl namespace. For ease of use, we will rename the namespace to just sycl. This will be quicker to type, while still avoiding any name conflicts.
Setup host storage
sycl::float4 a = { 1.0, 2.0, 3.0, 4.0 };
sycl::float4 b = { 4.0, 3.0, 2.0, 1.0 };
sycl::float4 c = { 0.0, 0.0, 0.0, 0.0 };
In main, we begin by setting up host storage for the data that we want to operate on. Our goal is to compute c = a + b, where the variables are vectors. To help us achieve this, the API provides the vec<T, size> type, which is a vector of a basic scalar type. It has template parameters for the scalar type and the size. It is meant to be used more like a geometrical vector than std::vector, and so it only supports sizes of up to 16. But don't despair - there are plenty of ways to work on larger sets of data, which we'll see later. We use float4, which is just vec<float, 4>.
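Because vec is meant to behave like a geometrical vector, arithmetic on it works per element, on the host as well as on the device. A small illustrative sketch (the variables here are hypothetical, not part of our program):
sycl::float4 v = { 1.0f, 2.0f, 3.0f, 4.0f };
sycl::float4 sum = v + v;        // per-element addition, just like we will do on the device
sycl::float4 scaled = v * 2.0f;  // per-element multiplication by a scalar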
Initialize device selector
sycl::default_selector device_selector;
The SYCL model is built on top of the OpenCL model, so if you have experience with that API, you should be familiar with most of the terms used here. In the SYCL model, a computer consists of a host (the CPU) connected to zero or more OpenCL devices. Devices are made available to the user through platforms - for example, a vendor-specific driver might be a platform exposing that vendor's GPU and CPU as OpenCL devices.
To do anything on the device side, we need to have some representation of the device. SYCL provides a set of classes called selectors, which are used to choose platforms and devices. Here, we initialize a default_selector, which uses heuristics to find the most performant device of any type in a common configuration. If there is an accelerator (GPU, FPGA, ...) available, it will most likely select that; otherwise it will select the CPU.
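If you want a particular kind of device rather than the default heuristic, SYCL also provides selectors for specific device types, which can be passed to a queue in exactly the same way. A brief sketch (these selectors are part of the SYCL API, but not used in our program):
sycl::gpu_selector gpu_only;    // only considers OpenCL GPU devices
sycl::cpu_selector cpu_only;    // only considers OpenCL CPU devices
sycl::host_selector host_only;  // always selects the SYCL host device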
Initialize queue
sycl::queue queue(device_selector);
std::cout << "Running on "
<< queue.get_device().get_info<sycl::info::device::name>()
<< "\n";
After that, we initialize a queue with the device that the selector chooses. A SYCL queue encapsulates all the state necessary for execution, including the OpenCL context, platform and device to be used.
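Since the queue owns this state, we can retrieve it from the queue object whenever we need it - the device name query in the snippet above is one example. A short sketch of the related accessors:
sycl::device dev  = queue.get_device();   // the device chosen by the selector
sycl::context ctx = queue.get_context();  // the OpenCL context created for the queue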
Setup device storage
{ // start of scope, ensures data copied back to host
sycl::buffer<sycl::float4, 1> a_sycl(&a, sycl::range<1>(1));
sycl::buffer<sycl::float4, 1> b_sycl(&b, sycl::range<1>(1));
sycl::buffer<sycl::float4, 1> c_sycl(&c, sycl::range<1>(1));
On most systems, the host and the device do not share physical memory. For example, the CPU might use RAM and the GPU might use its own dedicated VRAM. SYCL needs to know which data it will be sharing between the host and the devices.
For this purpose, SYCL buffers exist. The buffer<T, dims> class is generic over the element type and the number of dimensions, which can be one, two or three. When passed a raw pointer, as in this case, the buffer(T* ptr, range size) constructor takes ownership of the memory it has been passed. This means that we absolutely cannot use that memory ourselves while the buffer exists, which is why we begin a C++ scope. At the end of their scope, the buffers will be destroyed and the memory returned to the user. The size argument is a range<dims> object, which has to have the same number of dimensions as the buffer and is initialized with the number of elements in each dimension. Here, we have one dimension with one element.
Buffers are not associated with a particular queue or context, so they are capable of handling data transparently between multiple devices. They also do not require read/write information, as this is specified per operation.
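The same constructors scale up to larger data. As a purely illustrative sketch (the arrays and sizes here are hypothetical, not part of this example), a one-dimensional buffer over 64 floats and a two-dimensional 8x8 buffer could look like this:
float flat_data[64] = { 0 };
float grid_data[64] = { 0 };
sycl::buffer<float, 1> flat(flat_data, sycl::range<1>(64));    // 64 elements in one dimension
sycl::buffer<float, 2> grid(grid_data, sycl::range<2>(8, 8));  // 64 elements arranged as an 8x8 grid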
Execute kernel
queue.submit([&] (sycl::handler& cgh) {
  auto a_acc = a_sycl.get_access<sycl::access::mode::read>(cgh);
  auto b_acc = b_sycl.get_access<sycl::access::mode::read>(cgh);
  auto c_acc = c_sycl.get_access<sycl::access::mode::discard_write>(cgh);
  cgh.single_task<class vector_addition>([=] () {
    c_acc[0] = a_acc[0] + b_acc[0];
  });
});
} // end of scope, ensures data copied back to host
This part is a little complicated, so let us go over it in more detail. The whole thing is technically a single function call to queue::submit. submit accepts a function object parameter, which encapsulates a command group. For this purpose, the function object accepts a command group handler constructed by the SYCL runtime and handed to us as the argument. All operations using a given command group handler are part of the same command group.
Note that the command group lambda captures by reference. This is fine, even though, as we will see later, kernel execution is asynchronous. The SYCL specification effectively guarantees that the host-side part of the command group will finish before the call to submit exits - otherwise a referenced variable could be modified. It's only the device-side part that can continue executing afterwards.
In general, the lambda doesn't have to capture by reference - it could also capture by value. For SYCL objects in particular, this is valid and has very little overhead, since the specification requires that a copy of any SYCL object refers to the same underlying resource. Nevertheless, capturing by reference is recommended as better practice here, to avoid unnecessary copies.
A command group is a way to encapsulate a device-side operation and all its data dependencies in a single object by grouping all the related commands (function calls). Effectively, what this achieves is preventing data races, resource leaks and other problems by letting the SYCL runtime know the prerequisites for executing device-side code correctly. Generally, it is a bad idea to try to move resources constructed within a command group out of the lambda scope, and the SYCL specification prevents it with move semantics.
In our command group, we first set up accessors. In general, these objects define the inputs and outputs of a device-side operation. The accessors also provide access to various forms of memory. In this case, they allow us to access the memory owned by the buffers created earlier. Remember that we passed ownership of our data to the buffers, so we can no longer use the float4 objects, and accessors are the only way to access data in buffer objects. The buffer::get_access(handler&) method has two template parameters, the second of which takes a default value.
The first is an access mode. We use access::mode::read for the arguments and access::mode::discard_write for the result. discard_write can be used whenever we write to the whole buffer and do not care about its previous contents. Since it will be overwritten entirely, we can discard whatever was there before.
The second parameter is the type of memory we want to access the data from. We will see the available types of memory in the section on memory accesses. For now we use the default value.
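For completeness, here is a rough sketch of spelling that second parameter out explicitly - for example, requesting read access through constant memory rather than the default global buffer target (not something this example needs):
auto a_const = a_sycl.get_access<sycl::access::mode::read,
                                 sycl::access::target::constant_buffer>(cgh);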
Finally, we submit a kernel function object to the command group. The kernel is code that will be executed on the device, and thus (hopefully) accelerated. There are a few ways to do this, and single_task is the simplest - as the name suggests, the kernel is executed once. Note that the kernel lambda has to capture by value.
Inside the kernel, we perform vector addition. The accessor class overloads operator[](size_t i), which returns a reference to the i-th element in the buffer. Note that since our buffer has float4 elements, the 0-th element is actually an entire vector rather than a single float. The vec class overloads various operators, in this case operator+ for per-element addition.
One thing that stands out is the class vector_addition template parameter. As is described in the CMake integration guide, the SYCL file has to be compiled with both the host and device compiler. We now know why - this bit of C++ code will be executed on an OpenCL device, so it needs to be compiled to machine code for that device.
The device compiler has to be able to find the C++ code that it needs, and a lambda expression doesn't have a well-defined name. For this reason, we need to supply a dummy class name as a template parameter. The class has to be unique per kernel. Here we forward-declare it in the invocation. We will see later that we can avoid this by defining our own function objects.
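As a preview, the sketch below shows roughly what such a function object could look like for our kernel (the name vector_addition_kernel and its members are hypothetical, not part of this example); because the functor type itself provides a name, no template argument is needed:
struct vector_addition_kernel {
  using read_acc  = sycl::accessor<sycl::float4, 1, sycl::access::mode::read,
                                   sycl::access::target::global_buffer>;
  using write_acc = sycl::accessor<sycl::float4, 1, sycl::access::mode::discard_write,
                                   sycl::access::target::global_buffer>;

  vector_addition_kernel(read_acc a, read_acc b, write_acc c) : a_(a), b_(b), c_(c) {}

  // The call operator plays the role of the lambda body above.
  void operator()() const { c_[0] = a_[0] + b_[0]; }

  read_acc a_, b_;
  write_acc c_;
};

// Inside the command group, instead of the lambda:
// cgh.single_task(vector_addition_kernel(a_acc, b_acc, c_acc));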
In general, submitting a kernel is the last thing you should do inside a command group. You have to submit exactly one kernel per group (per submit call).
Execution of most things in SYCL is asynchronous. submit returns immediately and the command group begins executing afterwards. There is no guarantee as to when it will be finished - for this, we need explicit synchronization. Here, we do it the RAII way - we end the buffer scope. The specification guarantees that after the buffers are destroyed, all operations using them will have finished, and they release ownership of the vectors back to the user. Under the hood, we can expect the SYCL runtime to wait for device operations to complete and a memory transfer to occur from device memory to host memory. While for the most part SYCL abstracts away manual memory management, it's still important to be aware of when and how memory transfers are executed. They are slow and often a bottleneck of accelerated applications, so it's best to do as few of them as possible. We will see how to do this in later sections.
Instead of relying on scopes, we could also create host-side accessors. These would force a synchronization and memory transfer back onto the host similarly to the buffer destructor, and choosing how to read memory back is up to the user.
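As a rough sketch of that alternative (not used in this example), calling get_access on a buffer without a command group handler returns a host accessor, and constructing it blocks until the device-side writes to that buffer have finished:
auto c_host = c_sycl.get_access<sycl::access::mode::read>();
sycl::float4 result = c_host[0];  // safe to read here - construction waited for the kernel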
Print results
std::cout << " A { " << a.x() << ", " << a.y() << ", " << a.z() << ", " << a.w() << " }\n"
<< "+ B { " << b.x() << ", " << b.y() << ", " << b.z() << ", " << b.w() << " }\n"
<< "------------------\n"
<< "= C { " << c.x() << ", " << c.y() << ", " << c.z() << ", " << c.w() << " }"
<< std::endl;
Finally, we print our results. The output on the machine that built this guide is:
Running on Intel(R) HD Graphics
 A { 1, 2, 3, 4 }
+ B { 4, 3, 2, 1 }
------------------
= C { 5, 5, 5, 5 }