Memory

This guide was created for versions: v0.3.1 - Latest

While for simple computations it is okay to operate purely on work-items, any more complex workload will require finer-grained control. Unfortunately, this comes at the cost of introducing some complexity. Hopefully though, we can clear everything up!

You might remember that work-items are grouped into work-groups. The splitting into work-groups is not purely conceptual - it has very real implications on memory accesses and performance. Work-groups are independent of each other. In fact, there is no way to synchronize between them in a single kernel. For this reason, two work-groups should never write to the same memory location (although they can read shared data).

OpenCL and SYCL define a clear distinction between various regions in memory and rules that govern accesses to these. Everything on the CPU side is known as host memory. It is not directly accessible from kernels, but as we've seen, buffers and accessors provide facilities for copying host data to the device and accessing it there. The corresponding accessor target is access::target::host_buffer.

On the device side, more memory regions exist:

  • Global memory is available in the same form to all work-groups and items. It can be thought of as a device-side equivalent of RAM. The corresponding target, access::target::global_buffer, is the default target for buffer::get_access. In previous examples we didn't explicitly specify a target, so this one was used.
  • Local memory is specific to a single work-group. Work-groups cannot access others' local memory, but it is shared between all work-items in a group. It can be thought of as a user-controlled cache. It is especially useful for divide-and-conquer problems where each part of computation is handled by one work-group. Local memory can be used to store the result of such a computation. Local memory is allocated per kernel execution and it cannot be filled with host data, so you have to initialize it yourself. The canonical way to allocate it is to create a access::target::local accessor inside a command group, passing it the requested allocation size.
  • Private memory is a small region dedicated to each work-item. It is much like CPU register memory. All variables created in a kernel are stored in private memory. Additionally, dedicated private_memory objects can be created for this purpose.
  • Finally, constant memory is a read-only part of global memory, which similarly can reference a host-side buffer.

In this example we will try to compute an array reduction - the sum of all its elements. The overall structure of the example is as follows:

Parallel reduction 〉 ≡

#include <array>
#include <cstdint>
#include <iostream>
#include <random>
#include <cassert>

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

int main(int, char**) {
  std::array<int32_t, 16> arr;

  std::mt19937 mt_engine(std::random_device{}());
  std::uniform_int_distribution<int32_t> idist(0, 10);

  std::cout << "Data: ";
  for (auto& el : arr) {
    el = idist(mt_engine);
    std::cout << el << " ";
  }
  std::cout << std::endl;

  sycl::buffer<int32_t, 1> buf(arr.data(), sycl::range<1>(arr.size()));

  <<Read hardware information>>

  <<Reduction loop>>

  auto acc = buf.get_access<sycl::access::mode::read>();
  std::cout << "Sum: " << acc[0] << std::endl;

  return 0;
}

The first thing we do is initialize an array of random values to be added together and a buffer for that data. We then print the values.

Read hardware information 〉 ≡

sycl::device device = sycl::default_selector{}.select_device();
sycl::queue queue(device, [] (sycl::exception_list el) {
    for (auto ex : el) { std::rethrow_exception(ex); }
  } );

auto wgroup_size = device.get_info<sycl::info::device::max_work_group_size>();
if (wgroup_size % 2 != 0) {
  throw "Work-group size has to be even!";
}
auto part_size = wgroup_size * 2;

auto has_local_mem = device.is_host()
		     || (device.get_info<sycl::info::device::local_mem_type>()
			!= sycl::info::local_mem_type::none);
auto local_mem_size = device.get_info<sycl::info::device::local_mem_size>();
if (!has_local_mem
    || local_mem_size < (wgroup_size * sizeof(int32_t)))
{
      throw "Device doesn't have enough local memory!";
}

After that we initialize a queue in a slightly different manner than before. Instead of passing a selector to the queue constructor, we request it to return a selected device object. This allows us to access hardware information.

The device::get_info function has a single template parameter specifying the piece of information that we want to retrieve. info::device::max_work_group_size is defined to be the maximum number of work-items in a work-group executing on a single compute unit. Exceeding this size should result in an error. It is not necessarily the optimal size, but it can be expected to yield good performance.

We initialize a part_size variable to be the number of elements in the array that work-group reduces. Since each work-item initially reduces two elements, it is twice the work-group size.

We also test the device for the local memory size - we cannot perform the reduction if there is too little of it or if local memory is unsupported altogether. Of course, in a real-world application a special case would have to be made to also support such devices.

Reduction loop 〉 ≡

auto len = arr.size();
while (len != 1) {
  // division rounding up
  auto n_wgroups = (len + part_size - 1) / part_size;

  queue.submit([&] (sycl::handler& cgh) {
      sycl::accessor
      <int32_t,
       1,
       sycl::access::mode::read_write,
       sycl::access::target::local>
      local_mem(sycl::range<1>(wgroup_size), cgh);

      auto global_mem = buf.get_access<sycl::access::mode::read>();

      cgh.parallel_for<class reduction_kernel>(
	sycl::nd_range<1>(n_wgroups * wgroup_size, wgroup_size),
	[=] (sycl::nd_item<1> item) {
	  <<Perform load into local memory>>

	  <<Reduce into one element>>

	  <<Write group result to global memory>>
	}
      );
    } );
  queue.wait_and_throw();

  len = n_wgroups;
}

Inside the reduction loop, we first find the number of work-groups for this step of reduction. It is the length len left to be reduced divided by the number of elements that each work-group reduces.

Next, in the command group, we allocate a part of local memory by creating an accessor with access::target::local and a range equal to the work-group size. We checked the memory size earlier, so we know that it is available. As stated above, this region of memory looks different to each work-group and its use is for temporary storage.

You might wonder, why do we even bother with using local memory when we could carry out the whole operation in global? The answer is that it is much faster. Local memory is (usually) physically closer to the chip than global and it does not suffer from problems such as false sharing, since it is exclusive to each compute unit. It is therefore a good idea to always carry out all temporary operations in local memory for best performance.

We also obtain an accessor to the data available in global memory. This time get_access is explicitly qualified with access::target::global_buffer, while previously it took on that value by default.

Lastly, we launch a parallel kernel. We use the nd_range variant, which allows us to specify both the global and local size. The nd_range constructor takes in two range objects of the same dimensionality as itself. The first one describes the number of work-items per dimension (recall that there can be up to three dimensions). The second range argument to nd_range<n> describes the number of work-items in a work-group. To find the number of work-groups per dimension, divide the first argument by the second. In this case the result is n_wgroups, which is how many work-groups will be instantiated. In this variant the kernel lambda takes an nd_item argument. It represents the current work-item and features methods to get detailed information from it, such as local, global, and work-group info.

Since each step of the reduction loop produces one number per work-group, we set the len to n_wgroups on every iteration, which will continue reducing over the results.

Perform load into local memory 〉 ≡

size_t local_id = item.get_local_linear_id();
size_t global_id = item.get_global_linear_id();

local_mem[local_id] = 0;

if ((2 * global_id) < len) {
  local_mem[local_id] = global_mem[2 * global_id] + global_mem[2 * global_id + 1];
}

item.barrier(sycl::access::fence_space::local_space);

In the kernel, we firstly zero-initialize the local memory, since it can in fact contain garbage data. The key point here is that 0 is the invariant of our reduction, meaning that x + 0 = x, so we can add the whole array safely even if it isn't entirely filled with data to be reduced.

We divide our data into parts, each one being computed by a single work-group. The input data is required to be of even size, but it doesn't have to be a multiple of the work-group size. Hence, a few work-items in the last work-group can have no corresponding data. For this reason, the initial load from global to local memory is guarded by an if-statement. As mentioned in the "parallelism" section, this is usually a bad idea. In this case, however, it is okay, because at most one work-group will have divergent work-items. We use a small array for illustration purposes and a specialized kernel would technically be faster, but any real use case can be expected to have much more input data.

After the load is performed with an addition of the two elements corresponding to each work-item, we emit a barrier with a local memory fence. We have to stop for a bit and understand why this is necessary. In the OpenCL memory model, all operations across work-items have relaxed semantics. For example, in the following pseudocode we execute two functions in parallel over the same data:

Relaxed write 〉 ≡

int x = 0;
int y = 0;

void thread_a() {
  write(x, 1);
  write(y, 2);
}

void thread_b() {
  int ly = load(y);
  int lx = load(x);
  printf("%i %i", lx, ly);
}

in_parallel(thread_a, thread_b);

In a relaxed memory model, work-item B can in fact print 0 2. This looks wrong, because work-item A must have written x into memory before it wrote y. The key point is that operation work-item B can observe A's operations in a different order. This really helps hardware performance, but it comes at the cost of confusing behaviour. To deal with this problem, we have to emit memory fences. Moreover, even if we don't mind reordering, we might want to make sure that all results of write operations propagate between work-items - otherwise they could stay in per-work-item cache and not be visible across work-items.

To synchronize the state of memory, we use the item::barrier(access::fence_space) operation. A SYCL barrier does two things. Firstly, it makes sure that each work-item within the work-group reaches the barrier call. In other words, it guarantees that the work-group is synchronized at a certain point in the code. It is very important to make sure that either all work-items reach the barrier or none do. For example, the following code is invalid:

Branch barrier 〉 ≡

if (local_id < 5) {
  item.barrier(sycl::access::fence_space::local_space);
} else {
  item.barrier(sycl::access::fence_space::local_space);
}

It looks innocent, but the problem is that the two instructions are not the same barrier. Work-items below local id 5 will get to the first barrier while the rest will get to the other one, and the execution will stall, both groups waiting on each other forever. A simple transformation of factoring the barrier call out of the conditional would fix it.

Secondly, item::barrier emits a memory fence in the specified space. It can be either access::fence_space::local_space, ::global_space or ::global_and_local. A fence ensures that the state of the specified space is consistent across all work-items within the work-group. Importantly, it is not possible to synchronize between work-groups. They are entirely independent, and any write or read in the same global memory area done by two work-groups is a data race. For this reason, it is important to make sure each work-group only works on a dedicated region of global memory without crossover.

Next, we reduce each work-group's array in local memory:

Reduce into one element 〉 ≡

for (size_t stride = 1; stride < wgroup_size; stride *= 2) {
  auto idx = 2 * stride * local_id;
  if (idx < wgroup_size) {
    local_mem[idx] = local_mem[idx] + local_mem[idx + stride];
  }

  item.barrier(sycl::access::fence_space::local_space);
}

Since each iteration of the for loop depends on the previous one, we emit a barrier every time to synchronise work-items.

Lastly, write a single number which is the result of this work-group's reduction into global memory.

Write group result to global memory 〉 ≡

if (local_id == 0) {
  global_mem[item.get_group_linear_id()] = local_mem[0];
}

And the result is obtained:

Data: 0 10 5 6 3 9 9 4 1 5 10 5 4 3 8 0 
Sum: 82