Multiple Kernels

This guide was created for versions: v0.3.1 - Latest

As we've seen before, SYCL kernels are launched asynchronously. To retrieve the results of computation, we must either run the destructor of the buffer that manages the data or create a host accessor. A question comes up - what if we want to execute multiple kernels over the same data, one after another? Surely we must then manually synchronise the accesses? Luckily, we barely have to do anything. The SYCL runtime will guarantee that dependencies are met and that kernels which depend on other's results will not launch until the ones they depend on are finished.

All of this is managed under the hood and controlled through buffers and accessors. It is deterministic enough for us to be able to know exactly what will happen. Let's see an example:

Executing interdependent kernels 〉 ≡

#include <iostream>
#include <numeric>

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

int main(int, char**) {
  sycl::queue q(sycl::default_selector{});

  std::array<int, 16> a_data;
  std::array<int, 16> b_data;
  std::iota(a_data.begin(), a_data.end(), 1);
  std::iota(b_data.begin(), b_data.end(), 1);

  sycl::buffer<int, 1> a(a_data.data(), sycl::range<1>(16));
  sycl::buffer<int, 1> b(b_data.data(), sycl::range<1>(16));
  sycl::buffer<int, 1> c(sycl::range<1>(16));
  sycl::buffer<int, 1> d(sycl::range<1>(16));

  <<Read A, Write B>>

  <<Read A, Write C>>

  <<Read B and C, Write D>>

  <<Write D>>

  auto ad = d.get_access<sycl::access::mode::read>();
  for (size_t i = 0; i < 16; i++) {
    std::cout << ad[i] << " ";
  }
  std::cout << std::endl;

  return 0;
}

In this example, we submit four command groups. Their operations are not particularly important. What matters is which buffers they write to and read from:

Read A, Write B 〉 ≡

q.submit([&] (sycl::handler& cgh) {
    auto aa = a.get_access<sycl::access::mode::read>(cgh);
    auto ab = b.get_access<sycl::access::mode::discard_write>(cgh);

    cgh.parallel_for<class kernelA>(
      sycl::range<1>(16),
      [=] (sycl::item<1> item) {
	ab[item] = aa[item] * 2;
      } );
  } );

Read A, Write C 〉 ≡

q.submit([&] (sycl::handler& cgh) {
    auto aa = a.get_access<sycl::access::mode::read>(cgh);
    auto ac = c.get_access<sycl::access::mode::discard_write>(cgh);

    cgh.parallel_for<class kernelB>(
      sycl::range<1>(16),
      [=] (sycl::item<1> item) {
	ac[item] = aa[item] * 2;
      } );
  } );

Read B and C, Write D 〉 ≡

q.submit([&] (sycl::handler& cgh) {
    auto ab = b.get_access<sycl::access::mode::read>(cgh);
    auto ac = c.get_access<sycl::access::mode::read>(cgh);
    auto ad = d.get_access<sycl::access::mode::discard_write>(cgh);

    cgh.parallel_for<class kernelC>(
      sycl::range<1>(16),
      [=] (sycl::item<1> item) {
	ad[item] = ab[item] + ac[item];
      } );
  } );

Write D 〉 ≡

q.submit([&] (sycl::handler& cgh) {
    auto ad = d.get_access<sycl::access::mode::read_write>(cgh);

    cgh.parallel_for<class kernelD>(
      sycl::range<1>(16),
      [=] (sycl::item<1> item) {
	ad[item] /= 4;
      } );
  } );

As we can see, some buffers are reused between the kernels with different access modes, while others are used independently. The order in which the SYCL runtime schedules the kernels will mirror this usage.

The first two kernels will be scheduled concurrently, because they do not depend on each other. Both of them read from the same buffer (A), but they do not write to it. Since concurrent reading is not a data race, that part is independent. Then, they also write to different buffers, so writes do not conflict. The runtime is aware of all this and will exploit it for maximum parallelism.

The third kernel is not independent - it reads from the buffers B and C into which the first two kernels write. Hence, it will wait for them to finish and be scheduled immediately after that.

Finally, the fourth kernel does not read anything that a previous kernel wrote, but it does write to the same data - the D buffer. Since mutating shared state in parallel is a data race, this kernel has to wait for the third one to finish and will execute only then.

Our program outputs the correct results:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 

In this case we have a well-defined execution order, since all kernels are submitted from the same thread. What if we have a multithreaded application, with submit calls being made on several threads? The queue is thread-safe, and the order in which kernels are executed will be decided by the order of submission. If you want to guarantee a specific order between kernels submitted from different threads, you have to synchronise this manually and make submit calls in the right order - otherwise it could be random, depending on which thread happens to execute its operation on the queue first.