So far, we have defined several important concepts and performed a few simple operations on the device side. However, the way in which we have been executing kernels so far was quite ignorant of a parallel device's architecture and thus wasteful. It is now time to unleash the full power of device parallelism. To do that, we first have to understand roughly how a GPU is structured.
A modern GPU has a few more "cores" than the typical CPU - around 16 or 32. These are sometimes called 'compute units'. Moreover, a compute unit is not quite like a CPU core. While CPUs are general-purpose and can easily execute a wide variety of instructions, GPUs are really only fit for large numeric computations. That is because unlike a CPU core, a compute unit is kind of like a very wide SIMD unit. It can execute the same operation over a large array of elements in a vectorized manner. When we multiply the SIMD width of every compute unit by their amount, we get around 2048 operations that can execute in parallel - much more than on a CPU.
Of course, SYCL code can run on many more kinds of devices than just GPUs. In order to support this, it provides an abstraction over the design of parallel hardware. A single execution of a given kernel is organised into work-groups and work-items.
A work-item is a single instance of a running kernel, kind of (but not quite) like a CPU thread. Each work-item is uniquely identified by a global id. Ids are not necessarily single values - they might be one, two, or three-dimensional. In the multi-dimensional cases, an id is a 'point' in an index space, with each point corresponding to a work-item.
Work-items are then organised into work-groups. Each work-group contains the same number of work-items and is uniquely identified by a work-group id. Additionally, within a work-group a work-item can be identified by its local id, and the combination of a local id with a work-group id is equivalent to the global id. The number of work-items is the global size and the number of work-items within a work-group is the local size.
Roughly speaking, a work-group corresponds to a single parallel device core (e.g. GPU compute unit), while the work-items within it correspond to elements in the per-core SIMD array. This has huge implications on how we should write our code to achieve best performance. The vectorized unit performs best when all elements are inputs to the same computation. For example, multiplying the entire array by a constant is blazing fast. On the other hand, divergent computation might be slower than on a CPU. For example, if the kernel contains an if
statement that causes some number of work-items within a work-group to take one branch and the rest to take another, the parallel device will have to deal with the divergence in a non-optimal way. A GPU, for example, might execute the same bit of code twice, first with all SIMD units taking one branch and then with all of them taking the other branch. The results would be masked to only store the correct version for a given SIMD unit. We don't want this to happen, since it effectively doubles the runtime of a particular path.
Another way to think about it is to imagine that a work-group is a squad of work-item soldiers marching in a single direction. As long as they are in sync, the march progresses correctly. However, as soon as some of the soldiers change direction, the others will run into them and cause everyone to fall over.
On the other hand, it's okay for different work-groups to take different paths through the kernel, since they are independent. For this reason, if we need to have divergent computation, it is best if we can pick our work-groups such that the divergence is on the level of work-groups rather than work-items.
In this chapter we will encrypt a string with ROT-13 (do not actually do this if you need proper encryption) in parallel.
Parallel encrypt
#include <iostream>
#include <cstring>
#include <vector>
#include <CL/sycl.hpp>
namespace sycl = cl::sycl;
int main(int, char**) {
char text[] = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc interdum in erat non scelerisque.";
const size_t len = sizeof(text);
sycl::queue queue(sycl::default_selector{});
{
<<Submit kernel>>
}
std::cout << text << std::endl;
return 0;
}
For data storage, we initialize an array with the string that we want to encrypt.
Submit kernel
sycl::buffer<char, 1> buf(text, sycl::range<1>(len));
queue.submit([&] (sycl::handler& cgh) {
auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
cgh.parallel_for<class parrot13>(
sycl::range<1>(len - 1),
[=] (sycl::item<1> item) {
size_t id = item.get_linear_id();
// reference https://en.wikipedia.org/wiki/ROT13
auto const c = acc[id];
acc[id] = (c-1/(~(~c|32)/13*2-11)*13);
} );
} );
After the typical queue setup, in our command group we submit a kernel using parallel_for
. As the name suggests, this function will execute the kernel in parallel on a number of work-items. There are several variants of this function. Here, we use the one with a single range<n>
parameter. The range specifies the global size (we use len - 1
since we don't want to flip the newline character), but the local size is left unspecified. The data we are manipulating is not grouped in any significant way. For this reason, it is best to leave the choice of local size up to the runtime, which should find an optimal value. As a rule of thumb, only specify the local size if you need to control behaviour on the work-group level (e.g. for divergence) or when you know the best performing values for a particular piece of hardware.
Corresponding to the range<n>
parameter is the item<n>
argument we receive in the kernel lambda. It makes only the global id available. The method item::get_linear_id
combines an id in index space into a single size_t
value. In the one-dimensional case, these values are the same. We then transform each letter with the encryption algorithm and write that back into the buffer.
And the result is our secret code:
Yberz vcfhz qbybe fvg nzrg, pbafrpgrghe nqvcvfpvat ryvg. Ahap vagreqhz va reng aba fpryrevfdhr.