This section describes common performance issues or pitfalls when using DPC++ and how to address them.
Index swapping
The SYCL Specification (Sec. 4.9.1) mandates that:
When constructing multi-dimensional ids or ranges from integers, the elements are written such that the right-most element varies fastest in a linearization of the multi-dimensional space.
For this reason in Intel DPC++, the right-most dimension is mapped to the CUDA or HIP x-dimension, the second right-most dimension maps to the CUDA or HIP y-dimension and so on, for example:
cgh.parallel_for(sycl::nd_range{sycl::range(WG_X * WI_X), sycl::range(WI_X)}, ...)
cgh.parallel_for(sycl::nd_range<2>{sycl::range<2>(WG_Y * WI_Y, WG_X * WI_X), sycl::range<2>(WI_Y, WI_X)}, ...)
cgh.parallel_for(sycl::nd_range<3>{sycl::range<3>(WG_Z * WI_Z, WG_Y * WI_Y, WG_X * WI_X), sycl::range<3>(WI_Z, WI_Y, WI_X)}, ...)
Where WG_X
and WI_X
are the number of work-groups and work-items per
work-group (i.e., named grid size and threads per block in CUDA) in the
x-dimension and so on for _Y
and _Z
.
Be aware that this is particularly important in the case of a parallel_for
execution of two or three-dimensional kernels when:
There are manually linearized accesses in (1-d) arrays within local or global memory. This needs to be considered in order to avoid performance issues due to non-coalesced global memory accesses or bank conflicts in local memory. Further details on linearization can be found in Sec. 3.11, Multi-dimensional objects and linearization of the SYCL Specification.
The following error (or similar) is present:
Number of work-groups exceed limit for dimension 1 : 379957 > 65535
which relates to the fact that on some platforms such as CUDA, the x-dimension can usually support a higher number of work-groups compared to the y and z ones.
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Further details on this topic can be found here.
Inlining
The degree to which DPC++ automatically inlines functions is chosen to balance
performance with compilation time across a range of devices. However
programmers can also add the always_inline
attribute on specific functions
to force their inlining, for
example:
__attribute__((always_inline)) void function(...) {
...
}
...
q.submit([&](sycl::handler &cgh) {
cgh.parallel_for(..., [=](...) {
function(...);
});
}
We advise that due care is taken when manually marking functions inline. Be aware that whilst manually inlining a given function may lead to greater performance on some Nvidia devices, other Nvidia devices may encounter drops in performance for the same code. We will continue to improve the compiler optimization heuristics in future releases.
Fast Math Builtins
The SYCL math builtins are defined to match the precision requirements of the equivalent OpenCL 1.2 math builtins, which may be unnecessarily precise for some applications, causing an avoidable loss of performance.
To address this, the SYCL specification provides a native version of a subset
of the math functions (full list in Section 4.17.5, Math functions),
which trades precision for performance. They are defined within the native
namespace. For example, the native version of sycl::cos()
is
sycl::native::cos()
.
In general if precision is not an issue using the native variants may provide significant improvements, although do note that not all backends make use of the relaxed precision for all builtins.
Note
The
-ffast-math
compilation flag swaps standardsycl::
math functions into the correspondingsycl::native::
ones if they are available. If there is no native version for a given math function the-ffast-math
flag has no effect on it.
-ffast-math
is the default for the icpx
compiler. To disable
-ffast-math
for icpx
, use -fno-fast-math
.
Loop Unrolling
While the compiler will handle some loop unrolling automatically, it can sometimes be beneficial to help the compiler by manually tuning the unrolling of the compute intensive loops in the device code, for example by using the unrolling pragma as follows:
#pragma unroll <unroll factor>
for( ... ) {
...
}
We advise that due care is taken when applying manual unrolling; be aware that whilst manual unrolling a given loop may lead to greater performance on some Nvidia devices, other Nvidia devices may encounter drops in performance for the same code. We will continue to improve the compiler optimization heuristics with each new release.
Downcasting Indexes
In SYCL, nd_range
, range
and other index types use size_t
as a value
type. This is in contrast to backend APIs such as CUDA or HIP, where
threadIdx.{x|y|z}
is an int
. In order to save register space within
kernels it can be beneficial to downcast any index types from the size_t
s
returned from index calculation functions, to the type that the backend API
would use for indexing, which is typically an int
.
auto bigIdx = item.get_id(0); // size_t
int intIdx = static_cast<int>(item.get_id(0));
This is especially beneficial for kernels with high register pressure.
In the case of member functions such as get_linear_id
, and
get_global_linear_id
, the SYCL implementation will combine different
size_t
indexes in a manner such as:
size_t item<2>::get_linear_id() {
return get_id(0) * get_range(1) + get_id(1);
}
SYCL is specified to use size_t
s for such calculations so as to avoid the
overflow which may occur when using 32 bit types. Since arithmetic is being
done on size_t
types returned by get_range
and get_id
, casting the
result of this member function to an int
or some other type will not
eliminate the use of size_t
s within these functions’ implementations.
Therefore if the programmer desires to only use 32 bit types (or smaller) for
these calculations, it is necessary to omit calls that abstract away linear
index calculation. An implementation of an unsafe_get_linear_id
such as:
inline int unsafe_get_linear_id(item<2> it) {
return static_cast<int>(it.get_id(0)) * static_cast<int>(it.get_range(1))
+ static_cast<int>(it.get_id(1));
}
Would remove any 64 bit size types from index calculation. This gives the SYCL code for index calculation the same device code footprint as the native CUDA or HIP code.
The above example is considered unsafe as the programmer then takes on the responsibility of ensuring that integer overflow does not take place.
Alias Analysis
Alias Analysis can prove that two memory references do not alias each other. This may enable optimizations. By default, the compiler must assume that memory references do alias, if not proven otherwise by the alias analysis. It is also possible, however, to explicitly signal to the compiler that a memory reference inside the device code is not aliased. This can be achieved using respective keywords for the buffer/accessor and USM model.
For the former, one can add the no_alias
property from the oneapi
extension to an accessor:
q.submit([&](sycl::handler &cgh) {
sycl::accessor acc{..., sycl::ext::oneapi::accessor_property_list{sycl::ext::oneapi::no_alias}};
...
});
For the latter, the __restrict__ qualifier can be added to a pointer.
Note that __restrict__ is non-standard C++ and may not behave consistently across sycl implementations. For dpc++ only restrict-qualified device function (a function called from within a SYCL kernel) parameters will be taken into account.
For example:
void function(int *__restrict__ ptr) {
...
}
...
int *ptr = sycl::malloc_device<int>(..., q);
...
q.submit([&](sycl::handler &cgh) {
cgh.parallel_for(..., [=](...) {
function(ptr);
});
});
A more brute force approach is to add the [[intel::kernel_args_restrict]] attribute to a kernel. This signals the compiler to ignore all possible alias dependencies between each of the USM pointers, or buffer accessors if that model is used. inside the kernel.
Example (buffer/accessor model):
q.submit([&](handler& cgh) {
accessor in_accessor(in_buf, cgh, read_only);
accessor out_accessor(out_buf, cgh, write_only);
cgh.single_task<NoAliases>([=]() [[intel::kernel_args_restrict]] {
for (int i = 0; i < N; i++)
out_accessor[i] = in_accessor[i];
});
});
Using the texture cache
On CUDA platforms, data that is constant for at least the lifetime of a kernel can be cached in the texture cache.
This can be achieved by using the sycl::ext::oneapi::experimental::cuda::ldg
function. It takes a pointer to device memory and returns the value stored at
that address by loading it through the L1/tex cache. For example:
float some_value = ldg(&some_data_in_device_memory[some_index]);
Warning
It is important to note that if the compiler detects that data loaded using this function is written to within a kernel, the program will still compile but will not use the texture cache!
There are many factors impacting the performance benefit of using the texture
cache; as such it can be rather difficult to achieve the biggest possible
speed-up. Many use-cases in fact receive little to no benefit. However, as
performance degradation is both unlikely and small in magnitude when it does
happen, and using ldg
requires minimal code changes, it can be a great way
to quickly improve kernel performance.
Note that ldg
can be also
be portably used on all other platforms including HIP AMD. However, CUDA is
currently the only platform where ldg
usage will lead to special caching.
The HIP AMD backend always loads all register data into both the L1 and L2
caches whether or not ldg
is used.
A more detailed look into the texture cache can be found in
this
blog post, and the exact detail about the ldg
function can be found in the
corresponding
extension documentation.