Debugging on an OpenCL Device

Debugging on an OpenCL Device

While debugging on the host device works for the most part, not every situation is the same. There may be cases where developers run into problems that can be only be observed in a device. If you happen to find yourself in a situation where an issue only occurs when running on an OpenCL device, you have to debug it on the device.

Fortunately, there are existing tools that are capable of simulating an OpenCL device as well as providing validation of the OpenCL (1.2) code.

Oclgrind is the go to tool for this purpose; it is an OpenCL device simulator and debugger that can interpret OpenCL SPIR (1.2). You can learn more about SPIR by having a look at its specification hosted by the Khronos Group. At its core, Oclgrind simulates how an OpenCL device executes a kernel independently from any specific architecture. This is particularly useful when facing portability issues during development.

Setting up Oclgrind

The easiest way to install oclgrind from an OS package manager that maintains a stable version of the project. For example, on Ubuntu, using the APT package manager, you can simply type the following command in the Terminal:

sudo apt-get install oclgrind

This will get you the latest stable version of the tool maintained for your Ubuntu version. For Ubuntu-16/17/18 this will get you oclgrind 15.5. To check the version you have install, simply type:

oclgrind --version

In case you want to install the latest version of oclgrind (18.3 at the time of writing) which is usually not maintained by most Linux package managers such Ubuntu's APT, it is best to install directly from the the project source files. The Building and Installing section on their Github wiki gives a thorough explanations on how to proceed this way, even explaining how to run oclgrind with via the Khronos' OpenCL ICD loader. As a prerequisite, you will need to CMake installed for you platform. This approach works almost identically on both Linux and OSX based platforms. Additionally, this section includes a paragraph on how to build the sources on Windows-based platforms as well.

Using Oclgrind

Fortunately, oclgrind is no hard to use. Because it implements the full OpenCL 1.2 runtime API, developers are not required to do any manual changes to their SYCL (1.2.1) or OpenCL programs.

Here are a few examples, simply to scratch the surface of what's possible as well as demonstrate how easy it is to do things like, for example, enabling data-race detection in the kernel or collect kernel information such as the count of instructions called during execution on the OpenCL device simulator. In addition to the above, you can do things like checking for errors in API calls, dump OpenCL SPIR and generate IR (intermediate representation) files.

Oclgrind incorporates a simple plugin for detecting Invalid memory access which is a well-known and common problem when the in area of GPU compute. It may be the case that the platform you are running on may not provide a meaningful feedback about what went wrong, while oclgrind comes with a plugin that checks each memory access that your OpenCL kernels perform to ensure they do not access memory that they shouldn't, and is enabled by default. If an error is encountered, the plugin will provide a diagnostic on it and a feedback including the failing code with an error message.

Checking for runtime API errors

There’s plenty of issues that can arise when using the various runtime API functions from the host. Unfortunately, the error messages returned from these API calls can often be confusing or point to multiple possible causes. To make the case of detecting API errors, oclgrind provides a plugin called check-api that needs to be manually specified as a flag when running the application with oclgrind.

Consider the following attempt to launching a kernel with NDRange where the value for the local work size does not divide that of the global work size. The simple-local-barrier sample from the SDK is a handy code for the purpose of this demonstration.

const int size = 65;

[...] // program logic.

cgh.parallel_for<example_kernel>(nd_range<1>(range<1>{size}, range<1>{2}), [=](nd_item<1> item) {
  [...] // kernel logic.
}

Now we can run the program by adding the --check-api flag to oclgrind to enable error checking for the underlying OpenCL API calls as follows.

oclgrind --check-api ./simple-local-barrier

In this scenario, the result of dividing the global work size, which is 65, by the local work size of 2 returns a fraction - 32.5 and this fraction can't be used to define the number of work groups for execution. Instead what we need is a natural, whole number to specify the work groups count. Thus, the code generated from the API call doesn't have a successful return due to the description of the error detected by oclgrind (below).

Oclgrind - OpenCL runtime error detected
    Function: clEnqueueNDRangeKernel
    Error:    CL_INVALID_WORK_GROUP_SIZE
    Dimension 0: local_work_size (2) does not divide global_work_size (65)

In addition, another case of invoking the CL_INVALID_WORK_GROUP_SIZE error could be having set a too large local work size which oclgrind will report with an appropriate error message.

Detecting race conditions

Similarly to the valgrind --memcheck tool, you can check for race conditions using oclgrind. If the said behavior can only be observed when running on an OpenCL device, we can enable the data-race detection plugin in oclgrind by passing the –data-races flags.

oclgrind --data-races ./application

In case of detecting a data race, the tool will output the address at which the condition has been detected, followed by the address memory space and the failing instruction alongside other useful debugging information if available.

An example for a race condition inside a kernel could be one where a work-item with one local id assigns a value to some variable and a work-item with another local id tries to read and use that same variable with no synchronization between the work-items in the work-group.

Consider the following SYCL kernel code:

auto inAcc = bufIn.get_access<access::mode::read>(cgh);
auto outAcc = bufOut.get_access<access::mode::write>(cgh);
accessor<int, 1, access::mode::read_write, access::target::local> localAcc(range<1>{1}, cgh);

cgh.parallel_for<racy_kernel>(nd_range<1>(globalRange, localRange),
[=](nd_item<1> item) {
  const auto localId = item.get_local_id(0);
  if (localId == 1) {
    localAcc[0] = inAcc[localId];
  }
  if (localId == 0) {
    outAcc[localId] = localAcc[0];
  }
});

This code implements a case of a race condition inside a SYCL kernel as described above. Running it through oclgrind with the data-races plugin enabled we will get the following output:

Read-write data race at local memory address 0x1000000000000
    Kernel: SYCL_class_racy_kernel

    First entity:  Global(1,0,0) Local(1,0,0) Group(0,0,0)
      store i32 %7, i32 addrspace(3)* %0, align 4, !tbaa !11
    Debugging information not available.


    Second entity: Global(0,0,0) Local(0,0,0) Group(0,0,0)
      %9 = load i32, i32 addrspace(3)* %0, align 4, !tbaa !11
    Debugging information not available.

Adding a work-group barrier with memory ordering on the local address space betweem the two lines of assignment.

item.barrier(access::fence_space::local_space);

With this the application will run without any data races being introduced. In the case of an OpenCL program, where the kernel is separated from the host source, you will get the line of where the race condition has been detected and the faulty code but this information is not available in our case. However, you can deduce where this is happening in the code by interpreting the returned SPIR instructions.

Collecting profiling information

Another interesting function is, if you are familiar with the LLVM instruction set, is the --inst-counts option that can be passed to oclgrind. Even if you are not that familiar, the SPIR spec includes a reference table (p. 20, 21) to all LLVM instructions that may be used in SPIR. Back to the reduction sample we used in the Debugging on Host Device section, we will run it with oclgrind passing the --inst-counts flag.

oclgrind --inst-counts ./reduction

Oclgrind outputs a histogram in the terminal, which shows the collected counts of instructions (as well as the actual instructions) that were executed while running the kernel. In more detail, the counts for the memory loads and stores which are in the Memory Access & Addressing LLVM Instruction Family are split into the separate address spaces to show the number of bytes read or written for the correct space (e.g., global or local).

Device Name: Oclgrind Simulator
Platform Name: Oclgrind
Instructions executed for kernel 'SYCL_class_sycl_reduction_int_':
           2,688 - br
           2,304 - icmp
           1,024 - call _Z7barrierj()
           1,024 - phi
             896 - lshr
             512 - getelementptr
             382 - add
             255 - load local (1,020 bytes)
             255 - store local (1,020 bytes)
             128 - call _Z12get_group_idj()
             128 - call _Z12get_local_idj()
             128 - call _Z13get_global_idj()
             128 - load global (512 bytes)
             128 - ret
             128 - sdiv
             128 - select
             128 - sext
             128 - trunc
               1 - store global (4 bytes)

This functionality is particularly helpful for optimizing kernels, involving synchronizations between the work-items in a work-group executing the kernel. For example, in the reduction sample we have 1024 barrier calls, which may be too big of a number in this case but in general is a helpful indicator in cases of slow kernels to have an overview of the number of synchronizations (between work-items) being performed.

Interactive-style debugger - Stepping through the kernel

In case you want to step through the kernel, you can do that by running oclgrind in interactive mode - that provides a simple gdb-style step-through debugger with a limited subset of the commands supported by gdb.

Command list:
  backtrace    (bt)
  break        (b)
  continue     (c)
  delete       (d)
  gmem         (gm)
  help         (h)
  info         (i)
  list         (l)
  next         (n)
  lmem         (lm)
  pmem         (pm)
  print        (p)
  quit         (q)
  step         (s)
  workitem     (wi)

Running an application with oclgrind in interactive mode is done as follows:

oclgrind --interactive ./application # or just type -i for short

Oclgrind will automatically break at the start of each kernel invocation. Additionally, the interactive debugger plugin also interacts perfectly with the other oclgrind plugins, which means it can automatically break into a prompt if an error is encountered (e.g. invalid memory access). You can manually set breakpoints for specific source lines is done just like in gdb using the break command. You can also switch between work-items. Although, oclgrind will do a sequential execution of the work-items, you can use the workitem command to do the switch.

Other command features include the viewing a variable and inspecting regions of memory.

  • Variables can't be watched but you can use the print command to view the contents of a variable.
  • To inspect regions of memory, you can use the mem commands - lmem, pmem and gmem.

Furthermore, to get a general overview of the NDRange of the kernel you are debugging, you can use the info command in the debugger. Example with the reduction kernel:

(oclgrind) info
Running kernel 'SYCL_class_sycl_reduction_int_'
-> Global work size:   (128,1,1)
-> Global work offset: (0,0,0)
-> Local work size:    (128,1,1)

The SYCL host device makes debugging much easier for developers. Since SYCL is written entirely in standard C++, we can use common and robust tools to debug SYCL applications. It's also possible to use OpenCL debugging tools out of the box without any code modifications. This makes the debugging of kernel specific errors possible without vendor-dependent tools.

If you'd like to find out about how to profile your SYCL application, read our Optimizing Your SYCL Code Using Profiling guide.

Sections

    Select a Product

    Please select a product

    ComputeCpp enables developers to integrate parallel computing into applications using SYCL and accelerate code on a wide range of OpenCL devices such as GPUs.

    ComputeSuite for R-Car enables developers to accelerate their applications on Renesas R-Car based hardware such as the V3M and V3H, using the widely supported open standards SYCL and OpenCL.

    Also,

    part of our network