Debugging on a SYCL Host Device
Debugging kernels running on a device back-end requires support in the underlying platform, so debugging SYCL kernels directly on a device is currently not possible.
However, developers can use standard tools such as valgrind, gdb, lldb and others to debug their SYCL applications on the host device instead.
The host device is an emulated back-end that executes as native C++ code and emulates the SYCL execution and memory models. Every SYCL implementation is required to provide one, which allows developers to run a SYCL application without having any of the supported back-ends (e.g. OpenCL 1.2 or above for ComputeCpp) set up. The host device does not aim to provide a fast multi-threaded target on the host, but rather a working one that implements the core SYCL API (excluding vendor-specific extensions) using nothing but pure C++ (at least C++11).
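To see what that means in practice, here is a minimal sketch (not part of the reduction sample; written against the SYCL 1.2.1 API) that constructs the host device directly and queries it. It should compile and run even on a machine with no OpenCL runtime installed.

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // The host device is always available, even without an OpenCL platform.
  cl::sycl::device dev{cl::sycl::host_selector{}};
  std::cout << "is_host: " << dev.is_host() << "\n"
            << "name: " << dev.get_info<cl::sycl::info::device::name>()
            << "\n";
  return 0;
}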
Having clarified what exactly the host device is, let's see the purpose of having such an implementation.
- The host device exists to be used as a fallback execution target for a SYCL application, rather than as an optimal solution for parallel execution
- It is a perfect environment for debugging SYCL code, since all of the API calls and kernel code are essentially standard C++ which can be compiled using any modern compiler
- When sycl::parallel_for is invoked, the SYCL host implementation will spawn OS threads, allowing developers to use standard multi-threaded debugging methods (see the sketch below)
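To make the last point concrete, here is a minimal sketch of a parallel_for running on a host-device queue (the vec_add kernel name and the buffer size are illustrative, not taken from the sample). A breakpoint placed inside the kernel lambda behaves just like a breakpoint in any ordinary multi-threaded C++ program.

#include <CL/sycl.hpp>
#include <vector>

class vec_add; // kernel name, also visible to the debugger

int main() {
  std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024, 0.0f);
  cl::sycl::queue q{cl::sycl::host_selector{}};
  {
    cl::sycl::buffer<float, 1> bufA(a.data(), cl::sycl::range<1>(a.size()));
    cl::sycl::buffer<float, 1> bufB(b.data(), cl::sycl::range<1>(b.size()));
    cl::sycl::buffer<float, 1> bufC(c.data(), cl::sycl::range<1>(c.size()));
    q.submit([&](cl::sycl::handler& cgh) {
      auto A = bufA.get_access<cl::sycl::access::mode::read>(cgh);
      auto B = bufB.get_access<cl::sycl::access::mode::read>(cgh);
      auto C = bufC.get_access<cl::sycl::access::mode::write>(cgh);
      cgh.parallel_for<vec_add>(cl::sycl::range<1>(a.size()),
                                [=](cl::sycl::id<1> i) {
        // On the host device this lambda runs as plain C++ on OS threads,
        // so it can be stepped through like any other function.
        C[i] = A[i] + B[i];
      });
    });
    q.wait();
  } // the buffers write their results back to the vectors here
  return 0;
}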
In order to enable debugging on the host device without having to modify the program source every time, ComputeCpp has a built-in macro named __DEBUG_IN_HOST_DEVICE which, when defined, ensures that the queue is created for the host device.
The following code is a modified version of the reduction sample from the computecpp-sdk that uses a macro to select the host device.
#if __DEBUG_IN_HOST_DEVICE
cl::sycl::queue queue(cl::sycl::host_selector{}, ...);
#else
cl::sycl::queue queue(cl::sycl::default_selector{}, ...);
#endif
If the debug macro is enabled, the queue is initialized with a host device. Otherwise it uses the default selector which will select a GPU if one is available on the hardware.
The rest of the program source remains unchanged, but when generating the executable the program must be compiled with -g (or CMAKE_BUILD_TYPE=Debug in the case of CMake) to include debug information, and with __DEBUG_IN_HOST_DEVICE defined to select the host device.
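For example, with an out-of-tree CMake build, a configuration along these lines would work (the build directory layout and the reduction target name are assumptions):

mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-D__DEBUG_IN_HOST_DEVICE=1" ..
make reduction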
With the host device selected, we can run several of valgrind's tools on the reduction sample code.
Checking for memory leaks
Let's start by using memcheck, as it comes built-in with valgrind and detects the use of uninitialized memory, inappropriate reads/writes and memory leaks.
To run valgrind with memcheck, type the following command:
valgrind --leak-check=full --show-leak-kinds=all ./reduction
It's now possible to extract the leak summary, which in this case is as follows.
==10853== LEAK SUMMARY:
==10853== definitely lost: 6,377 bytes in 4 blocks
==10853== indirectly lost: 76 bytes in 2 blocks
==10853== possibly lost: 0 bytes in 0 blocks
==10853== still reachable: 138,748 bytes in 1,731 blocks
==10853== suppressed: 0 bytes in 0 blocks
You can also track any uninitialized values using the --track-origins=yes option.
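For convenience, origin tracking can be combined with the leak checks above into a single invocation (the same flags as before, plus --track-origins=yes):

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ./reduction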
Interpreting cache utilization
Another tool that we can use is Cachegrind.
Cachegrind simulates the first-level and last-level caches (usually the last-level cache has the most influence on runtime performance).
The machine used for these experiments has an Intel CPU with 3 levels of cache, where the first level is split into separate instruction and data caches, and the rest are unified.
On Linux, you can check these details using the command:
lscpu | grep "cache"
The output looks something like this.
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
To run valgrind with cachegrind, use this command:
valgrind --tool=cachegrind ./reduction
This will simulate the reduction program's interaction with the machine's cache hierarchy, as visible in the output below.
==9436==
==9436== I refs: 299,487,394
==9436== I1 misses: 103,329
==9436== LLi misses: 49,491
==9436== I1 miss rate: 0.03%
==9436== LLi miss rate: 0.02%
==9436==
==9436== D refs: 122,361,759 (74,394,550 rd + 47,967,209 wr)
==9436== D1 misses: 3,461,492 ( 2,527,357 rd + 934,135 wr)
==9436== LLd misses: 1,388,245 ( 722,130 rd + 666,115 wr)
==9436== D1 miss rate: 2.8% ( 3.4% + 1.9% )
==9436== LLd miss rate: 1.1% ( 1.0% + 1.4% )
==9436==
==9436== LL refs: 3,564,821 ( 2,630,686 rd + 934,135 wr)
==9436== LL misses: 1,437,736 ( 771,621 rd + 666,115 wr)
==9436== LL miss rate: 0.3% ( 0.2% + 1.4% )
We can interpret the results and see a fairly insignificant miss rate of 0.03% in the first-level instruction cache (I1), but a more significant miss rate of 2.8% in the first-level data cache (D1).
That said, the reduction program's data could not fit in the L1 cache, so cache optimization approaches could be applied. However, with hundreds of millions of references going through L1, some compulsory cache misses may be unavoidable.
Furthermore, we can also see that there is a very small miss rate of 0.3% in the L3 cache (shown as LL in the output).
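To go beyond these whole-program totals, valgrind ships a companion tool called cg_annotate that breaks the counts down per function and per source line. Cachegrind writes its raw data to a file named cachegrind.out.<pid>, so for the run above (process ID 9436) the invocation would be:

cg_annotate cachegrind.out.9436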
Detecting race conditions
You can also check for possible data races and observe the thread allocations by using the Helgrind tool:
valgrind --tool=helgrind ./reduction
In this case, however, you need to be careful when interpreting the output, as there may be warnings saying that there is a possible data race whenever one thread reads a block of bytes that is written by another thread without a lock being held. Helgrind cannot know whether the program prevents that condition from occurring in the two threads simultaneously by some other means, so it will flag it as a possibility.
Debugging by stepping-through the program
You can also use gdb or other source-line debuggers (e.g. lldb) to step through your program, just as you would step through a standard C++ application.
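A typical session on the reduction sample might look like this (the file name and line number in the break command are assumptions; pick any line inside a kernel lambda). The info threads command lists the OS threads the host device has spawned, and thread apply all backtrace shows where each of them currently is:

gdb ./reduction
(gdb) break reduction.cpp:42
(gdb) run
(gdb) info threads
(gdb) thread apply all backtrace
(gdb) next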