This section covers troubleshooting tips and solutions to common issues. If the following doesn’t fix your problem, please submit a support request via Codeplay’s community support website. We cannot provide any guarantees of support, but we will try to help. Please ensure that you are using the most recent stable release of the software before submitting a support request.
Bugs, performance issues, and feature requests can be reported via the oneAPI DPC++ compiler open-source repository.
Missing Devices in sycl-ls Output
If sycl-ls does not list the expected devices within the system:
- Check that there is a compatible version of the CUDA® or ROCm™ SDK installed on the system (for the CUDA and HIP plugins respectively), as well as compatible drivers.
- Check that nvidia-smi or rocm-smi can correctly identify the devices.
- Check that the plugins are correctly loaded. This can be done by setting the environment variable SYCL_PI_TRACE to 1 and running sycl-ls again. For example:

SYCL_PI_TRACE=1 sycl-ls
You should see output similar to the following:
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 11.15.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_level_zero.so [ PluginVersion: 11.15.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 11.15.1 ]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 0.0 [CUDA 11.7]
If the plugin you’ve installed doesn’t show up in the sycl-ls output, you can run it again with SYCL_PI_TRACE this time set to -1 to see more details of the error:

SYCL_PI_TRACE=-1 sycl-ls
Within the output, which can be quite large, you may see errors like the following:
SYCL_PI_TRACE[-1]: dlopen(/opt/intel/oneapi/compiler/2024.1.0/linux/lib/libpi_hip.so) failed with <libamdhip64.so.4: cannot open shared object file: No such file or directory>
SYCL_PI_TRACE[all]: Check if plugin is present. Failed to load plugin: libpi_hip.so
The CUDA plugin requires libcuda.so and libcupti.so from the CUDA SDK. The HIP plugin requires libamdhip64.so from ROCm.
- Double-check your CUDA or ROCm installation and make sure that the environment is set up properly, i.e. that LD_LIBRARY_PATH points to the correct locations to find the above libraries.
- Check that there isn’t a device filtering environment variable set, such as ONEAPI_DEVICE_SELECTOR (note that sycl-ls will warn if this one is set) or SYCL_DEVICE_ALLOWLIST.
- Check permissions. On POSIX systems, access to accelerator devices is typically gated on membership of the proper groups. For example, on Ubuntu Linux GPU access may require membership of the video and render groups, but this can vary depending on your configuration.
Dealing with Invalid Binary Errors
Incorrect Platform
A common mistake is to execute a SYCL program using a platform for which the SYCL program does not have a compatible binary. For example, the SYCL program may have been compiled for a SPIR-V backend but then executed on a HIP device. In such a case the PI_ERROR_INVALID_BINARY error code will be thrown. In this scenario, check the following points:
- Make sure your target platform is in -fsycl-targets so that the program will be compiled for the required platform(s).
- Make sure that the program is using a SYCL platform or device selector that is compatible with the platforms for which the executable was compiled. Try running with the environment variable SYCL_PI_TRACE=1 to print which device is being selected at runtime; a run-time check like the sketch after this list can also help.
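As a complement to SYCL_PI_TRACE, a minimal run-time check (a sketch; the queue construction and device selection will depend on your application) can print the selected device and its backend so you can compare it against the targets the binary was compiled for:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;  // default device selection; substitute your own selector
  sycl::device dev = q.get_device();
  std::cout << "Selected device: "
            << dev.get_info<sycl::info::device::name>() << "\n";
  switch (dev.get_backend()) {
    case sycl::backend::ext_oneapi_cuda: std::cout << "Backend: CUDA\n"; break;
    case sycl::backend::ext_oneapi_hip:  std::cout << "Backend: HIP\n"; break;
    case sycl::backend::opencl:          std::cout << "Backend: OpenCL\n"; break;
    default:                             std::cout << "Backend: other\n"; break;
  }
  return 0;
}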
Correct Platform, Incorrect Device
When running SYCL™ applications targeting CUDA or HIP, under certain
circumstances the application may fail and report an error about an
invalid binary. For example, for CUDA it may report CUDA_ERROR_NO_BINARY_FOR_GPU.
This means that the SYCL device selected was provided with a binary for the correct platform but an incorrect architecture. In that scenario, check the following points:
- Make sure your target is in -fsycl-targets and that the correct architecture matching the available hardware is specified with the flags:
  - Flags for CUDA: -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=<arch>
  - Flags for HIP: -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=<arch>
- Ensure that the correct SYCL device (matching the architecture that the application was built for) is selected at run-time. The environment variable SYCL_PI_TRACE=1 can be used to display more information on which device was selected, for example:

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 11.16.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_level_zero.so [ PluginVersion: 11.16.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 11.16.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA GeForce GTX 1050 Ti
If an incorrect device is selected, the environment variable ONEAPI_DEVICE_SELECTOR may be used to help the SYCL device selector pick the correct one - see the Environment Variables section of the Intel® oneAPI DPC++/C++ Compiler documentation.
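For instance, to restrict the runtime to CUDA devices only (the executable name ./myApp is a placeholder):

ONEAPI_DEVICE_SELECTOR=cuda:* ./myApp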
Unresolved extern function ‘…’ / Undefined external symbol ‘…’
This may be caused by a number of things.
- There is currently no support for std::complex in DPC++. Please use sycl::complex instead.
- The icpx compiler driver uses -ffast-math mode by default, which can currently lead to issues resolving certain math functions such as ldexp or logf. This can be worked around by disabling -ffast-math with the -fno-fast-math flag. See Install oneAPI for NVIDIA GPUs for more information.
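Where the unresolved symbol is a C math library function used in device code, another option (an illustrative sketch, not the documented fix) is to call the SYCL built-in math functions, which are defined for device code on all backends:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  float *out = sycl::malloc_shared<float>(2, q);
  q.single_task([=] {
    // sycl::ldexp and sycl::log are SYCL 2020 built-ins, available in
    // device code regardless of the -ffast-math setting.
    out[0] = sycl::ldexp(1.5f, 3);  // 1.5 * 2^3 == 12
    out[1] = sycl::log(2.0f);
  }).wait();
  sycl::free(out, q);
  return 0;
}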
Compiler Error: “cannot find libdevice”
If the CUDA SDK is not installed in a standard location, clang++ may fail to find it - leading to errors during compilation such as:
clang-17: error: cannot find libdevice for sm_50; provide path to different CUDA installation via '--cuda-path', or pass '-nocudalib' to build without linking with libdevice
To fix this issue, specify the path to your CUDA installation using the --cuda-path option.
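For example (assuming the SDK lives under /usr/local/cuda-12.1; adjust the path for your system, and note app.cpp is a placeholder source file):

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda --cuda-path=/usr/local/cuda-12.1 app.cpp -o app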
Compiler Error: “needs target feature”
Some nvptx builtins that are used by the DPC++ runtime require a minimum
Compute Capability in order to compile. If you have not targeted a
sufficient Compute Capability for a builtin that you’re using in your
program (by using the compiler argument -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_xx), an error with the following pattern will be reported:
error: '__builtin_name' needs target feature (sm_70|sm_72|..),...
In order to avoid such an error, ensure that you are compiling for a device
with a sufficient Compute Capability.
If you are still getting such an error despite passing a supported Compute Capability to the compiler, this may be because you are passing the 32-bit triple, nvptx-nvidia-cuda, to -fsycl-targets. The nvptx-nvidia-cuda triple does not allow the compilation of target feature builtins and is not officially supported by DPC++. The 64-bit triple, nvptx64-nvidia-cuda, is supported by all modern NVIDIA® devices, so it is always recommended.
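For instance, a compile command targeting a device with Compute Capability 7.0 (sm_70 is illustrative; substitute your device's architecture, and app.cpp is a placeholder) would look like:

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_70 app.cpp -o app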
Compiler Warning: “CUDA version is newer than the latest supported version”
Depending on the CUDA version used with the release, the compiler may output the following warning:
clang++: warning: CUDA version is newer than the latest supported version 12.1 [-Wunknown-cuda-version]
In most cases this warning can safely be ignored. It simply means that DPC++ may not be able to use some of the latest CUDA features, but it will still work fine in most scenarios.
Out of resources on kernel launch
Observing one of the following error codes may indicate that a kernel launch did not occur because it did not have the appropriate resources:
- CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
- CUDA_ERROR_INVALID_VALUE
- PI_ERROR_INVALID_WORK_GROUP_SIZE
- PI_ERROR_INVALID_VALUE
Possible reasons are:
- The maximum number of work-items (threads in CUDA) for the device has been exceeded.
- The maximum work-group size (thread block in CUDA) for the device has been exceeded.
- The kernel’s resource usage (e.g. registers or shared memory) exceeds the device capabilities.
We can verify these possibilities by checking the device capabilities, and resolve them by configuring the kernel launch with those limitations in mind. The limitations per CUDA Compute Capability are summarised in a table in the CUDA documentation. The relevant limits can also be queried directly from SYCL, as in the sketch below.
However, the maximum work-group size for a kernel launch is not always the same as the theoretical capability of the device; this is where we need to understand the register usage of our kernel and take it into account. High register pressure with large work-groups can lead to an invalid kernel launch due to exceeding hardware limitations, such as the number of available registers.
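A minimal sketch of querying the relevant device limits (the descriptor names are standard SYCL 2020):

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  auto dev = q.get_device();
  std::cout << "Max work-group size: "
            << dev.get_info<sycl::info::device::max_work_group_size>() << "\n";
  auto wi = dev.get_info<sycl::info::device::max_work_item_sizes<3>>();
  std::cout << "Max work-item sizes: " << wi[0] << " x " << wi[1] << " x "
            << wi[2] << "\n";
  std::cout << "Local memory per work-group: "
            << dev.get_info<sycl::info::device::local_mem_size>() << " bytes\n";
  return 0;
}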
Exceeding work-group limitations
Submitting work with one of the work-group dimensions, or the total work-group size (the product of all dimensions), exceeding the corresponding maximum supported by the device results in the PI_ERROR_INVALID_WORK_GROUP_SIZE error. Similarly, submitting non-uniform work-groups, e.g. nd_range<1>{48, 32}, on a device which doesn’t support them (all CUDA devices) will result in the same error code. In all these cases, the following message is emitted:
Non-uniform work-groups are not supported by the target device -54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
Currently, this happens even if the work-groups are uniform but exceed one of the other limits mentioned above. The message will be improved in future versions of the plugin.
Exceeding the limit on the number of work-groups, either in one dimension or in the product of all dimensions, results in the CUDA_ERROR_INVALID_VALUE error, mapped to PI_ERROR_INVALID_VALUE. For example, all CUDA devices allow at most 65535 work-groups in the y- and z-dimensions. Submitting 65536 work-groups in the y-dimension results in the exception:
Number of work-groups exceed limit for dimension 1 : 65536 > 65535 -30 (PI_ERROR_INVALID_VALUE)
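If the global size is not naturally a multiple of the desired work-group size, a common pattern (a sketch, not the only option) is to round the global range up so the work-groups stay uniform, then mask off the padding work-items inside the kernel:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t N = 48;   // total elements to process
  constexpr size_t WG = 32;  // desired work-group size
  // Round the global size up to the next multiple of WG: 48 -> 64.
  constexpr size_t global = ((N + WG - 1) / WG) * WG;
  int *data = sycl::malloc_shared<int>(N, q);
  q.parallel_for(sycl::nd_range<1>{global, WG}, [=](sycl::nd_item<1> it) {
    size_t i = it.get_global_id(0);
    if (i < N)  // mask off the padding work-items
      data[i] = static_cast<int>(i);
  }).wait();
  sycl::free(data, q);
  return 0;
}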
Exceeding the maximum amount of shared memory
Currently, submitting work that exceeds the limit on the shared memory size results in the following error:
UR CUDA ERROR:
Value: 1
Name: CUDA_ERROR_INVALID_VALUE
Description: invalid argument
Function: urEnqueueKernelLaunch
Source Location: /tmp/tmp.7vgJ2wJCWQ/intel-llvm-mirror/sycl/plugins/unified_runtime/ur/adapters/cuda/enqueue.cpp:418
Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)
The error handling will be improved in future versions of the plugin to clearly report the excessive allocation of local memory on the device.
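To keep local (shared) memory allocations within the device limit, the limit can be queried before sizing a local accessor; a minimal sketch (the sizes here are illustrative):

#include <sycl/sycl.hpp>
#include <algorithm>

int main() {
  sycl::queue q;
  const size_t localBytes =
      q.get_device().get_info<sycl::info::device::local_mem_size>();
  // Size the per-work-group scratch buffer within the device limit.
  const size_t n = std::min<size_t>(localBytes / sizeof(float), 256);
  q.submit([&](sycl::handler &cgh) {
    sycl::local_accessor<float> scratch{sycl::range<1>{n}, cgh};
    cgh.parallel_for(sycl::nd_range<1>{256, 64}, [=](sycl::nd_item<1> it) {
      size_t lid = it.get_local_id(0);
      if (lid < n) scratch[lid] = 0.0f;  // stay within the allocation
    });
  }).wait();
  return 0;
}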
Out of available registers
The CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES error can be a result of using too many registers per CUDA block. It is mapped to the PI_ERROR_INVALID_WORK_GROUP_SIZE error code from DPC++, associated with a detailed exception message explaining the limits and showing the number of registers used:
Exceeded the number of registers available on the hardware.
The number registers per work-group cannot exceed 65536 for this kernel on this device.
The kernel uses 100 registers per work-item for a total of 1024 work-items per work-group.
-54 (PI_ERROR_INVALID_WORK_GROUP_SIZE)
Beyond the error message, a quick check on the number of registers allocated for the kernel by ptxas can be performed by specifying the -Xcuda-ptxas --verbose option when compiling. This enables verbose mode, which prints code generation statistics, including register usage for the kernel(s) in the binary.
Example ptxas verbose output:
ptxas info : Compiling entry function 'my_kernel' for 'sm_75'
ptxas info : Function properties for my_kernel
8192 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 100 registers, 256 bytes cmem[0], 1512 bytes cmem[2]
If the kernel has exceeded the number of registers available on the multiprocessor, reducing the work-group size reduces the number of threads executing in a CUDA block, which can lower register pressure. The effective maximum work-group size for a compiled kernel can be queried in SYCL in the following way:
// q is an existing sycl::queue; MyKernel is the kernel's type name.
auto b = sycl::get_kernel_bundle<MyKernel, sycl::bundle_state::executable>(q.get_context());
auto k = b.get_kernel<MyKernel>();
// Largest work-group size this compiled kernel supports on the given device.
auto maxWGSize = k.get_info<sycl::info::kernel_device_specific::work_group_size>(q.get_device());
std::cout << "MyKernel max WG size on this device: " << maxWGSize << std::endl;
However, if this is not the desired solution, we can also instruct the compiler to lower the register pressure and spill beyond a certain threshold, which can also result in a successful launch without having to size down the thread blocks. In DPC++, this can be achieved by:
1. Specifying the CUDA architecture or SM/Compute Capability of the target device, e.g. -Xsycl-target-backend --cuda-gpu-arch=sm_86 for an NVIDIA GeForce RTX 3060/Ti.
2. Instructing the PTX backend to limit the number of registers used in the kernel. This is done with the -Xcuda-ptxas --maxrregcount=<N> option, added to the compile command.
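Combined, an illustrative compile command might look like the following (sm_86 and the register limit of 64 are placeholders to adapt to your hardware and kernel; app.cpp is a stand-in source file):

clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda \
  -Xsycl-target-backend --cuda-gpu-arch=sm_86 \
  -Xcuda-ptxas --maxrregcount=64 -Xcuda-ptxas --verbose \
  app.cpp -o app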
Note
A downside to limiting register usage in the kernel via the -Xcuda-ptxas --maxrregcount compiler option is that values which no longer fit in registers may be spilled to device memory (DRAM), which may impact performance.
Sub-group size issues in codes ported across platforms/architectures
Consider code that uses the kernel attribute reqd_sub_group_size to set a specific sub-group size, and which is then ported to a different platform or executed on a different architecture to the one it was originally written for. In such a case, if the requested sub-group size is not supported by the platform/architecture, a runtime error will be thrown:
Sub-group size x is not supported on the device
On the CUDA platform only a single sub-group size is supported, hence only a warning is given:
CUDA requires sub_group size 32
and the runtime will use the sub-group size of 32 instead of the requested sub-group size. The reqd_sub_group_size kernel attribute is designed for platforms/architectures that support multiple sub-group sizes. Note that some SYCL code is not portable across different sub-group sizes. For example, the result of the sub-group collective reduce_over_group will depend on the sub-group size. Users who want to write code that is portable across platforms/architectures which use different sub-group sizes should either:
- Write code in a portable way such that the result does not depend on the sub-group size.
- Provide different versions of the sub-group-size-sensitive parts of the code for different platforms/architectures, to account for the different sub-group sizes. A run-time query of the supported sizes, as in the sketch below, can help here.
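A minimal sketch of such a query, using the standard SYCL 2020 device descriptor sycl::info::device::sub_group_sizes:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  // List every sub-group size the selected device supports.
  auto sizes = q.get_device().get_info<sycl::info::device::sub_group_sizes>();
  std::cout << "Supported sub-group sizes:";
  for (auto s : sizes) std::cout << " " << s;
  std::cout << "\n";
  q.parallel_for(sycl::nd_range<1>{128, 32}, [=](sycl::nd_item<1> it) {
    auto sg = it.get_sub_group();
    // Size-sensitive logic can branch on the actual sub-group size
    // (sg.get_max_local_range()[0]) instead of assuming a fixed value.
    (void)sg;
  }).wait();
  return 0;
}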