This section describes common performance issues or pitfalls when using DPC++ and how to address them.
Index swapping
The SYCL Specification (Sec. 4.9.1) mandates that:
When constructing multi-dimensional ids or ranges from integers, the elements are written such that the right-most element varies fastest in a linearization of the multi-dimensional space.
For this reason, in Intel DPC++ the right-most dimension is mapped to the CUDA or HIP x-dimension, the second right-most dimension maps to the CUDA or HIP y-dimension, and so on. For example:
cgh.parallel_for(sycl::nd_range{sycl::range(WG_X * WI_X), sycl::range(WI_X)}, ...)
cgh.parallel_for(sycl::nd_range<2>{sycl::range<2>(WG_Y * WI_Y, WG_X * WI_X), sycl::range<2>(WI_Y, WI_X)}, ...)
cgh.parallel_for(sycl::nd_range<3>{sycl::range<3>(WG_Z * WI_Z, WG_Y * WI_Y, WG_X * WI_X), sycl::range<3>(WI_Z, WI_Y, WI_X)}, ...)
Here WG_X and WI_X are the number of work-groups and work-items per work-group (i.e., the grid size and threads per block in CUDA) in the x-dimension, and similarly for _Y and _Z.
Be aware that this is particularly important for parallel_for executions of two- or three-dimensional kernels when:
There are manually linearized accesses into (1-d) arrays in local or global memory. This needs to be considered in order to avoid performance issues due to non-coalesced global memory accesses or bank conflicts in local memory (see the sketch at the end of this subsection). Further details on linearization can be found in Sec. 3.11, Multi-dimensional objects and linearization, of the SYCL Specification.
The following error (or similar) is present:
Number of work-groups exceed limit for dimension 1 : 379957 > 65535
which relates to the fact that on some platforms, such as CUDA, the x-dimension usually supports a much higher number of work-groups than the y and z dimensions, for example:
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Further details on this topic can be found here.
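To illustrate the linearization point above, here is a minimal sketch of a coalesced, manually linearized access in a two-dimensional kernel. It reuses the WG_/WI_ naming from the examples above; the flat array data and its row width width (equal to WG_X * WI_X) are hypothetical.
cgh.parallel_for(sycl::nd_range<2>{sycl::range<2>(WG_Y * WI_Y, WG_X * WI_X),
                                   sycl::range<2>(WI_Y, WI_X)},
                 [=](sycl::nd_item<2> item) {
  const size_t y = item.get_global_id(0); // slower-varying dimension
  const size_t x = item.get_global_id(1); // fastest-varying: CUDA/HIP x
  // Coalesced: consecutive work-items in x access consecutive addresses.
  data[y * width + x] = 0.0f;
});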
Set specific GPU architecture
When building for Nvidia GPUs it is possible to omit the specific GPU architecture, in which case the compiler will default to sm_50. When using a newer GPU it may be beneficial to specify the exact architecture, which may allow the compiler to use newer, more performant features. Refer to the get started guide for instructions on how to specify a specific GPU architecture.
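As a hedged sketch, assuming the DPC++ compiler driver with the CUDA backend and a device of compute capability 8.0 (check the get started guide for the exact flags of your release), an explicit architecture might be passed roughly as follows:
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
     -Xsycl-target-backend --cuda-gpu-arch=sm_80 \
     source.cpp -o app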
Inlining
The degree to which DPC++ automatically inlines functions is chosen to balance performance with compilation time across a range of devices. However, programmers can also add the always_inline attribute on specific functions to force their inlining, for example:
__attribute__((always_inline)) void function(...) {
...
}
...
q.submit([&](sycl::handler &cgh) {
cgh.parallel_for(..., [=](...) {
function(...);
});
});
We advise that due care is taken when manually marking functions inline. Be aware that whilst manually inlining a given function may lead to greater performance on some Nvidia devices, other Nvidia devices may encounter drops in performance for the same code. We will continue to improve the compiler optimization heuristics in future releases.
Fast Math Builtins
The SYCL math builtins are defined to match the precision requirements of the equivalent OpenCL 1.2 math builtins, which may be unnecessarily precise for some applications, causing an avoidable loss of performance.
To address this, the SYCL specification provides a native version of a subset of the math functions (full list in Section 4.17.5, Math functions), which trades precision for performance. They are defined within the native namespace; for example, the native version of sycl::cos() is sycl::native::cos().
In general, if precision is not an issue, using the native variants may provide significant performance improvements, although do note that not all backends make use of the relaxed precision for all builtins.
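As a brief illustration (the kernel body and the hypothetical in, out_precise, and out_fast pointers are not from this guide), swapping a standard builtin for its native counterpart is a one-line change:
cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
  // Full-precision builtin, matching the OpenCL 1.2 accuracy requirements.
  out_precise[i] = sycl::cos(in[i]);
  // Relaxed-precision builtin; may map to cos.approx on NVIDIA devices.
  out_fast[i] = sycl::native::cos(in[i]);
});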
When targeting NVIDIA architectures, the following sycl::native:: functions are implemented using the .approx variants of the corresponding PTX instructions:
sycl::native::divide
sycl::native::sqrt
sycl::native::sin
sycl::native::cos
sycl::native::log2
sycl::ext::oneapi::experimental::native::exp2
sycl::ext::oneapi::experimental::native::tanh
For the functions listed above there is a 1:1 mapping between the sycl::native:: math function and the corresponding PTX instruction; e.g., a call to sycl::native::exp2 results in the compiler generating a single ex2.approx instruction. In other cases, a native math function is implemented through multiple .approx PTX instructions; for example, sycl::native::tan() is implemented with sin.approx, cos.approx, and div.approx. It must be noted that native math functions can be faster than their sycl:: counterparts at the cost of losing precision; further details on the precision of the .approx PTX instructions can be found in the PTX ISA documentation.
Note
The -ffast-math compilation flag swaps standard sycl:: math functions into the corresponding sycl::native:: ones if they are available. If there is no native version for a given math function, the -ffast-math flag has no effect on it.
-ffast-math is the default for the icpx compiler. To disable -ffast-math for icpx, use -fno-fast-math.
Loop Unrolling
While the compiler will handle some loop unrolling automatically, it can sometimes be beneficial to help the compiler by manually tuning the unrolling of the compute-intensive loops in the device code, for example by using the unrolling pragma as follows:
#pragma unroll <unroll factor>
for( ... ) {
...
}
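As a concrete, hypothetical example, unrolling a reduction loop inside a kernel by a factor of 4 (the factor should be tuned per device) might look like:
float sum = 0.0f;
// Ask the compiler to unroll this loop by a factor of 4.
#pragma unroll 4
for (int i = 0; i < N; ++i) {
  sum += data[i];
}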
We advise that due care is taken when applying manual unrolling; be aware that whilst manually unrolling a given loop may lead to greater performance on some Nvidia devices, other Nvidia devices may encounter drops in performance for the same code. We will continue to improve the compiler optimization heuristics with each new release.
Downcasting Indexes
In SYCL, nd_range
, range
and other index types use size_t
as a value
type. This is in contrast to backend APIs such as CUDA or HIP, where
threadIdx.{x|y|z}
is an int
. In order to save register space within
kernels it can be beneficial to downcast any index types from the size_t
s
returned from index calculation functions, to the type that the backend API
would use for indexing, which is typically an int
.
auto bigIdx = item.get_id(0); // size_t
int intIdx = static_cast<int>(item.get_id(0));
This is especially beneficial for kernels with high register pressure.
In the case of member functions such as get_linear_id and get_global_linear_id, the SYCL implementation will combine different size_t indexes in a manner such as:
size_t item<2>::get_linear_id() {
return get_id(0) * get_range(1) + get_id(1);
}
SYCL is specified to use size_t for such calculations so as to avoid the overflow that may occur when using 32-bit types. Since the arithmetic is performed on the size_t values returned by get_range and get_id, casting the result of this member function to an int or some other type will not eliminate the use of size_t within these functions' implementations. Therefore, if the programmer desires to use only 32-bit types (or smaller) for these calculations, it is necessary to avoid calls that abstract away the linear index calculation. An implementation of an unsafe_get_linear_id such as:
inline int unsafe_get_linear_id(item<2> it) {
return static_cast<int>(it.get_id(0)) * static_cast<int>(it.get_range(1))
+ static_cast<int>(it.get_id(1));
}
would remove any 64-bit size types from the index calculation. This gives the SYCL code for index calculation the same device code footprint as the native CUDA or HIP code.
The above example is considered unsafe as the programmer then takes on the responsibility of ensuring that integer overflow does not take place.
Optimizing Accessor Usage
SYCL accessors provide a high-level interface for indexing into device memory. An accessor can be constructed with an accessOffset argument, which allows the base index of the accessor to be offset by some value from the true base index of the memory that the accessor refers to. While this feature can be useful in writing clean, portable code, it can make the computation of acc[idx] suboptimal, because every operator[] call has to take the accessOffset parameter into account:
T &operator[](size_t idx) {
return data[idx + accessOffset];
}
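For context, here is a hedged sketch of how such an offset (ranged) accessor might be constructed; the buffer, sizes, and offset are hypothetical.
sycl::buffer<float, 1> buf(data, sycl::range<1>(N));
q.submit([&](sycl::handler &cgh) {
  // Ranged accessor covering N - 16 elements, starting at offset 16.
  sycl::accessor acc(buf, cgh, sycl::range<1>(N - 16), sycl::id<1>(16));
  cgh.parallel_for(sycl::range<1>(N - 16), [=](sycl::id<1> i) {
    acc[i] += 1.0f; // operator[] internally adds the offset of 16
  });
});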
This can increase kernel setup time and register usage, which can hamper performance. If the user wishes to eliminate this overhead while still using accessors, it is recommended to index into an accessor's data through the get_multi_ptr() method. Since get_multi_ptr() always returns the base pointer of an allocation, it will not be offset by accessOffset.
// Won't internally combine idx with accessOffset
auto refVal = acc.get_multi_ptr()[idx];
Using this pattern will reduce register usage and kernel setup time.
Note that this is a problem with the accessor class only, and is not present when indexing into USM pointers.
Local Accessors
SYCL 2020 provides the local_accessor class for using local memory in SYCL programs. This is convenient and mirrors the accessor model for device memory. However, in some applications, this can inhibit certain optimizations, causing worse runtime performance. Consider the following snippet:
template <int lmem_size>
void MatrixMul(float *C, float *A, float *B, int wA, int wB,
sycl::nd_item<3> item_ct1,
sycl::local_accessor<float, 2> As,
sycl::local_accessor<float, 2> Bs) {
// continues
The local accessors are statically sized to match the work group size in the enclosing queue::submit scope. However, the compiler is unable to see this across the host-device divide, and therefore cannot make assumptions based on the accessor size. This can lead to many missed optimization opportunities.
Intel has published an extension to the SYCL specification which allows the user to specify local memory in a different way, inside the kernel.
Using this extension looks like the following:
template <int lmem_size>
void MatrixMul(float *C, float *A, float *B, int wA, int wB,
sycl::nd_item<3> item_ct1) {
using namespace sycl::ext::oneapi;
auto& As = *group_local_memory_for_overwrite<float[lmem_size][lmem_size]>(item_ct1.get_group());
auto& Bs = *group_local_memory_for_overwrite<float[lmem_size][lmem_size]>(item_ct1.get_group());
In this case we use float[lmem_size][lmem_size] as the type that the underlying allocation should hold, but it can be any statically sized array type. Knowing the size of the allocation at compile time lets the compiler reason about the code and can lead to better code generation when, for example, loading tiled data from device global memory to local memory. It is difficult to say in advance when this will pay off, so as ever, inspecting the generated code and running performance tests on the program are the best ways to know when to use this over local accessors.
Alias Analysis
Alias analysis can prove that two memory references do not alias each other, which may enable optimizations. By default, the compiler must assume that memory references do alias if not proven otherwise by the alias analysis. It is also possible, however, to explicitly signal to the compiler that a memory reference inside the device code is not aliased. This can be achieved using the respective mechanisms of the buffer/accessor and USM models.
For the former, one can add the no_alias property from the oneapi extension to an accessor:
q.submit([&](sycl::handler &cgh) {
sycl::accessor acc{..., sycl::ext::oneapi::accessor_property_list{sycl::ext::oneapi::no_alias}};
...
});
For the latter, the __restrict__ qualifier can be added to a pointer. Note that __restrict__ is non-standard C++ and may not behave consistently across SYCL implementations. For DPC++, only restrict-qualified parameters of device functions (functions called from within a SYCL kernel) will be taken into account.
For example:
void function(int *__restrict__ ptr) {
...
}
...
int *ptr = sycl::malloc_device<int>(..., q);
...
q.submit([&](sycl::handler &cgh) {
cgh.parallel_for(..., [=](...) {
function(ptr);
});
});
A more brute-force approach is to add the [[intel::kernel_args_restrict]] attribute to a kernel. This signals the compiler to ignore all possible alias dependencies inside the kernel between each of the USM pointers, or between buffer accessors if that model is used.
Example (buffer/accessor model):
q.submit([&](handler& cgh) {
accessor in_accessor(in_buf, cgh, read_only);
accessor out_accessor(out_buf, cgh, write_only);
cgh.single_task<NoAliases>([=]() [[intel::kernel_args_restrict]] {
for (int i = 0; i < N; i++)
out_accessor[i] = in_accessor[i];
});
});
Using the texture cache
On CUDA platforms, data that is constant for at least the lifetime of a kernel can be cached in the texture cache.
This can be achieved by using the sycl::ext::oneapi::experimental::cuda::ldg function. It takes a pointer to device memory and returns the value stored at that address, loading it through the L1/tex cache. For example:
float some_value = ldg(&some_data_in_device_memory[some_index]);
Warning
It is important to note that if the compiler detects that data loaded using this function is written to within a kernel, the program will still compile but will not use the texture cache!
There are many factors impacting the performance benefit of using the texture cache; as such it can be rather difficult to achieve the biggest possible speed-up. Many use-cases in fact receive little to no benefit. However, as performance degradation is both unlikely and small in magnitude when it does happen, and using ldg requires minimal code changes, it can be a great way to quickly improve kernel performance.
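For reference, a minimal hedged sketch of using ldg inside a kernel could look like the following; in and out are assumed to be hypothetical malloc_device allocations, with in only ever read by the kernel.
using sycl::ext::oneapi::experimental::cuda::ldg;

q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
  // Load the read-only input through the L1/texture cache on CUDA.
  out[i] = ldg(&in[i]) * 2.0f;
});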
Note that ldg can also be used portably on all other platforms, including HIP AMD. However, CUDA is currently the only platform where ldg usage will lead to special caching. The HIP AMD backend always loads all register data into both the L1 and L2 caches whether or not ldg is used.
A more detailed look into the texture cache can be found in this blog post, and the exact details of the ldg function can be found in the corresponding extension documentation.
AMDGPU Unsafe Atomics
For atomic operations on AMD GPUs performed on malloc_device allocations, it is recommended to use unsafe atomics to achieve better performance. This can be done by adding -mllvm --amdgpu-oclc-unsafe-int-atomics=true for int-type atomics, and -mllvm --amdgpu-oclc-unsafe-fp-atomics=true for floating-point types. However, this approach may not work for malloc_shared allocations, as support for unsafe atomics in that case depends on your PCI Express version, with only the newer generations supporting these fast instructions. Therefore, the default behaviour is to use safe CAS (compare-and-swap) atomics, which PCIe should support. CAS atomics are a lot slower than the unsafe equivalents, so to get the best performance it is recommended to use malloc_device allocations with the unsafe atomics flag. Note that both flags are global options and shouldn't be used when user code contains both malloc_shared and malloc_device atomics within the same translation unit.
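As a hedged illustration of the kind of code these flags affect (the sum and values pointers are hypothetical malloc_device allocations), a device-scope floating-point atomic add might be written as:
q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
  // Atomic add into a malloc_device allocation; with
  // --amdgpu-oclc-unsafe-fp-atomics=true this can use a fast hardware
  // atomic instead of a CAS loop on AMD GPUs.
  sycl::atomic_ref<float, sycl::memory_order::relaxed,
                   sycl::memory_scope::device,
                   sycl::access::address_space::global_space>
      ref(sum[0]);
  ref.fetch_add(values[i]);
});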