Anatomy of a ComputeCpp app
This guide was created for versions: v0.8.0 - Latest
ComputeCpp Community Edition (CE) is a heterogeneous parallel programming platform that provides a beta, pre-conformant implementation of the SYCL™ 1.2.1 Khronos specification.
The ComputeCpp CE 0.8.0 beta release includes the ComputeCpp CE Package version 0.8.0 and the ComputeCpp CE SDK version 0.8.0. The supported OpenCL 1.2 platforms for the ComputeCpp CE beta release are AMD® and Intel®, although any platform with OpenCL SPIR 1.2, SPIR-V, or PTX support should work: for example, any ARM™ device that features an OpenCL platform able to consume SPIR-V.
ComputeCpp Package version 0.8.0 supports SYCL C++ libraries for better integration with C++ frameworks and applications, such as SYCL Parallel STL, math libraries such as the SYCL Eigen Tensor library, and machine vision libraries.

The ComputeCpp Package contains the components of the ComputeCpp Community Edition beta implementation of SYCL 1.2.1: the compute++ SYCL device compiler and the SYCL Runtime library, along with its headers. The ComputeCpp Package can be downloaded from the ComputeCpp Download Page, and further details on how to integrate the ComputeCpp Package into an existing C++ application can be found in the ComputeCpp Integration Guide.
The ComputeCpp SDK complements the ComputeCpp Package with build system integration, sample code and documentation. The ComputeCpp SDK is available at computecpp-sdk @ GitHub.
Further documentation of the system can be found in the ComputeCpp API Documentation. A local copy of the API documentation is available at doc/api_pages/index.html in the ComputeCpp Package.
ComputeCpp CE does not currently implement the complete SYCL 1.2.1 standard. Note that this is a beta version, and some performance optimizations are disabled.
Please refer to the ComputeCpp Platform Support Notes for details on the supported platforms and their limitations. A local copy is included in the ComputeCpp Package at doc/computecpp_platform_support_notes.pdf.
Getting started with SYCL™ using ComputeCpp™
SYCL™ for OpenCL™ is a heterogeneous parallel programming interface that integrates modern C++ techniques with the massively parallel device capabilities of OpenCL™ 1.2. A SYCL application is a C++ application that uses the SYCL API in order to make use of the OpenCL platforms available on the target system.
The following example depicts the basic structure of a SYCL application.
This example implements a simple addition of two vectors of scalar types on a SYCL device. The vectors are represented using std::array for simplicity.
#include <CL/sycl.hpp>
#include <array>

constexpr cl::sycl::access::mode sycl_read = cl::sycl::access::mode::read;
constexpr cl::sycl::access::mode sycl_write = cl::sycl::access::mode::write;

template <typename T>
class SimpleVadd;

template <typename T, size_t N>
void simple_vadd(const std::array<T, N> &A, const std::array<T, N> &B,
                 std::array<T, N> &C) {
  cl::sycl::queue deviceQueue;
  cl::sycl::range<1> numOfItems{N};
  cl::sycl::buffer<T, 1> bufferA(A.data(), numOfItems);
  cl::sycl::buffer<T, 1> bufferB(B.data(), numOfItems);
  cl::sycl::buffer<T, 1> bufferC(C.data(), numOfItems);

  deviceQueue.submit([&](cl::sycl::handler &cgh) {
    auto accessorA = bufferA.template get_access<sycl_read>(cgh);
    auto accessorB = bufferB.template get_access<sycl_read>(cgh);
    auto accessorC = bufferC.template get_access<sycl_write>(cgh);
    cgh.parallel_for<class SimpleVadd<T>>(
        numOfItems, [=](cl::sycl::id<1> wiID) {
          accessorC[wiID] = accessorA[wiID] + accessorB[wiID];
        });
  });
}

int main() {
  const size_t array_size = 4;
  std::array<cl::sycl::cl_int, array_size> A = {{1, 2, 3, 4}},
                                           B = {{1, 2, 3, 4}}, C;
  std::array<cl::sycl::cl_float, array_size> D = {{1.f, 2.f, 3.f, 4.f}},
                                             E = {{1.f, 2.f, 3.f, 4.f}}, F;
  simple_vadd<cl::sycl::cl_int, array_size>(A, B, C);
  simple_vadd<cl::sycl::cl_float, array_size>(D, E, F);

  for (unsigned int i = 0; i < array_size; i++) {
    if (C[i] != A[i] + B[i]) {
      return 1;
    }
    if (F[i] != D[i] + E[i]) {
      return 1;
    }
  }
  return 0;
}
In this example, the addition of the two vector arrays is defined using a templated function, simple_vadd. This function implements C = A + B, where A, B and C are vector arrays of the same type and size.
The contents of each std::array are initially stored in host memory, as usual in C++ programs. In order to make the contents of the arrays available on the device, developers must create a SYCL memory object.
On line number <1>, we include CL/sycl.hpp. The ComputeCpp SYCL implementation uses this single header to include the whole API interface and the dependent types defined in the API.
On line number <14>, we construct a cl::sycl::range object initialized to the size of the vectors (the template parameter passed to the function). Memory objects in SYCL are multi-dimensional, and the number of elements per dimension is expressed using cl::sycl::range objects. A cl::sycl::range is a vector with as many components as dimensions; the value stored in each component expresses the size in that dimension, starting from 1. The range is used to create the cl::sycl::buffer object, which manages the data on host and device. In this case, since we use vectors, we only use one-dimensional buffers (the second template parameter).
The objects bufferA, bufferB and bufferC are initialized with the pointers obtained from the std::array objects.
In SYCL, the execution of any computation on a SYCL device follows the OpenCL execution model. In this simple example, we use the default SYCL device, chosen by the ComputeCpp runtime library on cl::sycl::queue construction.
The SYCL command group functor encapsulates the vector addition kernel alongside the input and output operations. On line number <19>, we define the lambda function (our command group functor) and submit it to the queue object deviceQueue.
The command group defines the Command Group Scope. Any object that takes a cl::sycl::handler object can only be used in Command Group Scope, and the cl::sycl::handler object can only be constructed by the runtime. Device accessors and the kernel functor can only be defined in this scope.
The SYCL programming model uses accessors (the cl::sycl::accessor class) to keep track of the accesses that different command groups make to memory objects. The SYCL Runtime can then use the access modes of the accessors to create a dependency graph, and schedule the execution of multiple command groups on different devices while maintaining a consistent view of memory.
Accessors are obtained from buffers on lines <20> to <21>, by calling the get_access method on the buffers to create cl::sycl::accessor objects.
On line <20>, we require a read-only accessor on the queue's device. The sycl_read variable is a constexpr variable of the enum class cl::sycl::access::mode. The declarations of sycl_read and the corresponding sycl_write are on lines <4> and <5>.
This example implements the simple_vadd computation in a SIMD (single instruction, multiple data) manner: the computation will be executed on multiple processing elements in parallel. This is the default parallel programming mode in SYCL.
The lambda function object defined at line numbers <24> to <27> defines the Kernel Scope. The lambda is called by the parallel_for API function. In the example, we spawn as many work-items as there are elements in the vectors, as specified in the numOfItems cl::sycl::range variable.
The name of the kernel is specified as the template parameter of the parallel_for, using the class type SimpleVadd [1]. Note that this is not required when the parallel_for function is called with a functor object: the functor's type name is used as the SYCL kernel name.
The parallel_for invocation will execute the lambda function object as a SYCL kernel over an NDRange execution space. The NDRange consists of work-items organized into work-groups.
The parallel_for API used in the example is the simplest way of dispatching kernels to a SYCL device. The number of work-groups is determined by the heuristics of the ComputeCpp runtime: an nd-range will be implicitly constructed for the OpenCL execution of the SYCL kernel functor [2]. In this example, simple_vadd is called with array_size equal to four on line numbers <39> and <40>. The number of work-items, numOfItems, is equal to four, so the number of work-groups can only be one, two or four, as shown in the following diagram.

On line number <26>, accessorC (a write-only accessor) is used to store the result of the addition of the read-only accessors accessorA and accessorB.
Let’s take a simplified view of what the second work-item of our four-element one-dimensional NDRange will do. This work-item, with wiID equal to 1, will add accessorA[1] to accessorB[1] and store the result in accessorC[1] (depicted in the diagram).
On line number <25>, the lambda function object (which defines the Kernel Scope) takes as a parameter the current instance of the global index space class (cl::sycl::id). The cl::sycl::id class features various getter methods to obtain specific values of the instance. The value of wiID will change for each instance in the complete NDRange execution space. Note that SYCL allows using cl::sycl::id variables directly as indexes into accessors.
The cl::sycl::queue::submit method returns immediately after scheduling the command group functor on the SYCL queue for execution on the SYCL device. The command group functor will be executed on the device when the accessor dependencies are resolved and the device is ready to perform the Kernel Dispatch.
In the main function of the C++ SYCL application, we define the size of the vectors to be array_size. On line <33>, we define the std::array vectors A, B and C using the SYCL type cl::sycl::cl_int.
All cl_XXX types defined in the SYCL namespace are types that are portable between C++ host and device; ComputeCpp can convert the host type to the corresponding device type in this case. Normal host types (e.g. int) can still be used, but their interoperability is not defined, and sizes may vary between host and device [3].
On line number <36>, we call the simple_vadd templated function for cl_int and array_size, providing as parameters the array objects A, B and C. Note that on line number <37>, we call the simple_vadd templated function for cl_float and array_size, for the std::array objects D, E and F. This shows one of the main features of SYCL: since the SYCL interface is templated, we can write generic algorithms that can be instantiated with different types without changing the algorithm, integrating nicely with C++ programming.
Footnotes
- C++ lambda functions are unnamed function objects that capture variables in scope. Because ComputeCpp follows the single-source, multiple-pass technique, using a host and a device compiler, this name is used to identify the function object across the different compilers.
- If a cl::sycl::nd_range object is given to the parallel_for invocation, the SYCL runtime will use the explicitly specified NDRange.
- compute++ will not change the type defined by the developer on the device side unless it is defined as a SYCL OpenCL interoperability type, e.g. cl_int, cl_long, etc.