Debugging on a SYCL Host Device

Debugging on a SYCL Host Device

Debugging kernels running on a device back-end requires support in the underlying platform, so SYCL kernel debug support is currently not possible. However, developers can use standard tools such as valgrind, gdb, lldb and others to debug their SYCL applications on the host_device instead.

The host device is an emulated back-end that is executed as native C++ code and emulates the SYCL execution and memory models. This is a requirement for every SYCL implementation and allows developers to run a SYCL application without having the supported back-ends (e.g. OpenCL 1.2 or above for ComputeCpp) set up. The host device will not provide a fast multi-threading interface on the host but a working one that uses the core API of SYCL (excluding vendor-specific extensions) and needs only pure C++ (at least C++11) to do that.

Having clarified what exactly the host device is, let's see the purpose of having such an implementation.

  • The host device exists to be used as a fallback execution target for a SYCL application rather than being an optimal solution for parallel execution, it shouldn't be used in production code
  • It is a perfect environment for debugging SYCL code since all of the API calls and kernel code are basically standard C++, which can be compiled using any modern compiler
  • When SYCL parallel_for is invoked, the SYCL host implementation will spawn OS threads, allowing developers to use standard multi-threaded debugging methods

In order to enable debugging on the host device without having to modify the program source every time, a simple option would be to set a preprocessor macro when creating a custom debug build and use it as shown below.

The following code is a modified version of the reduction sample from the computecpp-sdk that uses a macro to select the host device.

#if DEBUG_IN_HOST_DEVICE
cl::sycl::queue queue(cl::sycl::host_selector{}, ...);
#else
cl::sycl::queue queue(cl::sycl::default_selector{}, ...);
#endif

If the DEBUG_IN_HOST_DEVICE macro is defined, the queue is initialized with a host device. Otherwise it uses the default selector which will select a GPU if one is available on the hardware. The rest of the program source remains unchanged but, when generating the executable, the program has to be compiled with -g (for gcc), or CMAKE_BUILD_TYPE=Debug in the case of using CMake to enable the use of extra debugging information.

However, code modifications are not the most advisable and convenient for such tasks, which is why the ComputeCpp SYCL implementation provides an environment variable COMPUTECPP_TARGET. When this variable is set to host, the default_selector will be forced to select the host device, thus eliminating the need for a custom preprocessor definition and changes in your code. It is important to note that this is not part of the standard and can be used only with ComputeCpp.

Once we are sure that the host device has been selected, we can run valgrind to debug our application (the reduction sample in this case).

Checking for memory leaks

Let's start by using memcheck as it comes built-in with valgrind and detects the use of uninitialized memory, out-of-bounds read/write access to memory and memory leaks.

To run valgrind with memcheck, type the following command:

valgrind --leak-check=full --show-leak-kinds=all ./reduction.

It's now possible to extract the leak summary, which in this case is as follows.

==10853== LEAK SUMMARY:
==10853==    definitely lost: 6,377 bytes in 4 blocks
==10853==    indirectly lost: 76 bytes in 2 blocks
==10853==      possibly lost: 0 bytes in 0 blocks
==10853==    still reachable: 138,748 bytes in 1,731 blocks
==10853==         suppressed: 0 bytes in 0 blocks

And you can also track any uninitialized values using the --track-origins=yes option.

Interpreting cache utilization

Another tool that we can use is Cachegrind.

Cachegrind simulates the first-level and last-level caches (usually the last-level cache has the most influence on runtime performance).

The experimental host device is an Intel CPU with 3 levels of cache, where the first level has separate instructions and data cache, and the rest are unified.

On Linux, you can check these details using the command:

lscpu | grep "cache".

The output looks something like this.

L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K

To run valgrind with cachegrind using this command:

valgrind --tool=cachegrind ./reduction.

This will simulate the reduction program's interaction with the machine's cache hierarchy as visible in the output below.

==9436==
==9436== I   refs:      299,487,394
==9436== I1  misses:        103,329
==9436== LLi misses:         49,491
==9436== I1  miss rate:        0.03%
==9436== LLi miss rate:        0.02%
==9436==
==9436== D   refs:      122,361,759  (74,394,550 rd   + 47,967,209 wr)
==9436== D1  misses:      3,461,492  ( 2,527,357 rd   +    934,135 wr)
==9436== LLd misses:      1,388,245  (   722,130 rd   +    666,115 wr)
==9436== D1  miss rate:         2.8% (       3.4%     +        1.9%  )
==9436== LLd miss rate:         1.1% (       1.0%     +        1.4%  )
==9436==
==9436== LL refs:         3,564,821  ( 2,630,686 rd   +    934,135 wr)
==9436== LL misses:       1,437,736  (   771,621 rd   +    666,115 wr)
==9436== LL miss rate:          0.3% (       0.2%     +        1.4%  )

We can interpret the results and see that we have a not so significant cache miss in the first level instruction cache I1 with rate of 0.03%, but a more significant cache miss rate of 2.8% in the data cache D1. That said, the reduction program couldn't fit in the L1 cache, therefore cache optimization approaches can be applied. However, there were hundreds of millions of instructions in L1 so some compulsory cache misses may be unavoidable. Furthermore, we can also see that there is a very small cache miss rate of 0.3% in L3 cache (shown as LL).

Detecting race conditions

You can also check for possible data races and observe the thread allocations by using the Hellgrind tool: valgrind --tool=hellgrind ./reduction In this case however, you need to careful of interpreting the output as there may be warnings saying that there is a possible data race if one thread is reading a block of bytes that is written by another thread without a lock being held. Hellgrind can't know if there are any other means in the program that prevents the certain condition to happen in the two threads simultaneously, thus it will flag it as a possibility.

Debugging by stepping-through the program

You can also use gdb or other source-line debuggers (e.g. lldb) to step through your program just as you are stepping through a standard C++ application.

Sections

    Select a Product

    Please select a product

    ComputeCpp enables developers to integrate parallel computing into applications using SYCL and accelerate code on a wide range of OpenCL devices such as GPUs.

    ComputeSuite for R-Car enables developers to accelerate their applications on Renesas R-Car based hardware such as the V3M and V3H, using the widely supported open standards SYCL and OpenCL.

    Also,

    part of our network