This section describes common performance issues or pitfalls when using DPC++ and how to address them.
Index swapping
The SYCL Specification (Sec. 4.9.1) mandates that:
When constructing multi-dimensional ids or ranges from integers, the elements are written such that the right-most element varies fastest in a linearization of the multi-dimensional space.
For this reason, in Intel DPC++ the right-most dimension is mapped to the CUDA or HIP x-dimension, the second right-most dimension maps to the CUDA or HIP y-dimension, and so on. For example:
cgh.parallel_for(sycl::nd_range{sycl::range(WG_X * WI_X), sycl::range(WI_X)}, ...)
cgh.parallel_for(sycl::nd_range<2>{sycl::range<2>(WG_Y * WI_Y, WG_X * WI_X), sycl::range<2>(WI_Y, WI_X)}, ...)
cgh.parallel_for(sycl::nd_range<3>{sycl::range<3>(WG_Z * WI_Z, WG_Y * WI_Y, WG_X * WI_X), sycl::range<3>(WI_Z, WI_Y, WI_X)}, ...)
where WG_X and WI_X are the number of work-groups and work-items per work-group (i.e., grid size and threads per block in CUDA terminology) in the x-dimension, and likewise for _Y and _Z.
Be aware that this is particularly important in the case of a parallel_for execution of two- or three-dimensional kernels when:
There are manually linearized accesses into one-dimensional arrays within local or global memory. The dimension mapping needs to be considered here in order to avoid performance issues due to non-coalesced global memory accesses or bank conflicts in local memory (see the sketch at the end of this section). Further details on linearization can be found in Sec. 3.11, Multi-dimensional objects and linearization, of the SYCL Specification.
The following error (or similar) is present:
Number of work-groups exceed limit for dimension 1 : 379957 > 65535
This relates to the fact that on some platforms, such as CUDA, the x-dimension usually supports a higher number of work-groups than the y and z dimensions, for example:
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Further details on this topic can be found here.
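As a sketch of the first point above, consider a two-dimensional kernel that accesses a manually linearized array in global memory (global, local, data, and width are illustrative names, with data a pointer to rows of length width). Since the right-most dimension maps to the CUDA or HIP x-dimension, it should index the contiguous elements so that consecutive work-items touch consecutive addresses:
cgh.parallel_for(sycl::nd_range<2>{global, local}, [=](sycl::nd_item<2> item) {
  size_t y = item.get_global_id(0); // slower-varying dimension
  size_t x = item.get_global_id(1); // right-most dimension, maps to the x-dimension
  // Coalesced: consecutive work-items in x access consecutive addresses.
  data[y * width + x] += 1.0f;
});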
Set specific GPU architecture
When building for Nvidia GPUs it is possible to omit the specific GPU architecture, in which case the compiler defaults to sm_50. When using a newer GPU it may be beneficial to specify the exact architecture, which may allow the compiler to use newer, more performant features. Refer to the get started guide for instructions on how to specify a specific GPU architecture.
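For example, assuming the DPC++ clang++ driver and an sm_80 (Ampere) GPU, the architecture can be passed to the CUDA backend along the following lines (the exact flags may vary between releases, so consult the get started guide):
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -o app app.cpp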
Inlining
In some cases DPC++ can be too conservative with inlining, which may cause performance degradation.
A common case where this happens is when the SYCL application is written in such a way that the kernel lambda simply calls a large function containing the kernel implementation.
To work around this issue it is possible to add the always_inline attribute to specific functions to force their inlining, for example:
// always_inline forces the kernel implementation to be inlined into the lambda.
__attribute__((always_inline)) void function(...) {
  ...
}
...
q.submit([&](sycl::handler &cgh) {
  cgh.parallel_for(..., [=](...) {
    function(...);
  });
});
Fast Math Builtins
The SYCL math builtins are defined to match the precision requirements of the equivalent OpenCL 1.2 math builtins, which may be too precise for some applications, causing an unnecessary loss of performance.
To address this, the SYCL specification provides a native version of a subset of the math functions (full list in Section 4.17.5, Math functions), which trades precision for performance. These are defined within the native namespace; for example, the native version of sycl::cos() is sycl::native::cos().
In general, if precision is not an issue, using the native variants may provide significant improvements, although do note that not all backends make use of the relaxed precision for all builtins.
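For instance, a kernel applying a cosine element-wise could use the native variant as follows (a minimal sketch; q, data, and n are illustrative, with data a float pointer since the native builtins operate on single precision):
q.submit([&](sycl::handler &cgh) {
  cgh.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
    // sycl::native::cos trades precision for speed compared to sycl::cos.
    data[i] = sycl::native::cos(data[i]);
  });
});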
When targeting NVIDIA architectures, the following sycl::native:: functions are implemented using the .approx variants of the corresponding PTX instructions:
sycl::native::divide
sycl::native::sqrt
sycl::native::sin
sycl::native::cos
sycl::native::log2
sycl::ext::oneapi::experimental::native::exp2
sycl::ext::oneapi::experimental::native::tanh
Each of the functions above maps 1:1 to a PTX instruction, e.g., sycl::native::exp2 calls ex2.approx. Other native math functions are implemented through multiple .approx PTX instructions; for example, sycl::native::tan() is implemented with sin.approx, cos.approx, and div.approx. It must be noted that native math functions can be faster than their sycl:: counterparts at the cost of losing precision; further details on the precision of the .approx PTX instructions can be found in the PTX ISA documentation.
Note
The -ffast-math compilation flag swaps standard sycl:: math functions into the corresponding sycl::native:: ones if they are available. If there is no native version for a given math function, the -ffast-math flag has no effect on it.
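For example, a compile invocation enabling this substitution for an NVIDIA target could look as follows (illustrative; combine with whatever flags your setup already requires):
clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -ffast-math -o app app.cpp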
Loop Unrolling
While the compiler will handle some loop unrolling automatically, it can sometimes be beneficial to help the compiler by manually tuning the unrolling of the compute-intensive loops in the device code, for example by using the unrolling pragma as follows:
#pragma unroll <unroll factor>
for( ... ) {
...
}
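For instance, with a fixed unroll factor of 4 (a sketch; n, data, and factor are illustrative):
#pragma unroll 4
for (int i = 0; i < n; ++i) {
  // The loop body is replicated four times, reducing branch overhead.
  data[i] *= factor;
}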
Alias Analysis
Alias analysis can prove that two memory references do not alias each other, which may enable optimizations. By default, the compiler must assume that memory references do alias unless the alias analysis can prove otherwise. It is also possible, however, to explicitly signal to the compiler that a memory reference inside the device code is not aliased. This can be achieved using the respective keywords for the buffer/accessor and USM models.
For the former, one can add the no_alias property from the oneapi extension to an accessor:
q.submit([&](sycl::handler &cgh) {
  sycl::accessor acc{..., sycl::ext::oneapi::accessor_property_list{sycl::ext::oneapi::no_alias}};
  ...
});
For the latter, the __restrict__ qualifier can be added to a pointer. Note that __restrict__ is non-standard C++ and may not behave consistently across SYCL implementations. For DPC++, only restrict-qualified parameters of device functions (functions called from within a SYCL kernel) are taken into account.
For example:
void function(int *__restrict__ ptr) {
  ...
}
...
int *ptr = sycl::malloc_device<int>(..., q);
...
q.submit([&](sycl::handler &cgh) {
  cgh.parallel_for(..., [=](...) {
    function(ptr);
  });
});
A more brute-force approach is to add the [[intel::kernel_args_restrict]] attribute to a kernel. This signals the compiler to ignore all possible alias dependencies between the USM pointers, or buffer accessors if that model is used, inside the kernel.
Example (buffer/accessor model):
q.submit([&](sycl::handler &cgh) {
  sycl::accessor in_accessor(in_buf, cgh, sycl::read_only);
  sycl::accessor out_accessor(out_buf, cgh, sycl::write_only);
  cgh.single_task<NoAliases>([=]() [[intel::kernel_args_restrict]] {
    for (int i = 0; i < N; i++)
      out_accessor[i] = in_accessor[i];
  });
});