Please note that you are viewing a guide targeting an older version of oneAPI for NVIDIA® GPUs. This guide was designed for version 2025.0.0.

The Nvidia plugin is now available on Windows.

Improvements

Added sm90a architecture [f204869]
Added __PTX_VERSION__ macro to provide PTX version [58f829a]
Implemented sycl_ext_codeplay_enqueue_native_command extension [0f48227]
Improved C and C++ standard library support in kernels [da379ec, 9942378]
Improved compilation diagnostic for incorrect Nvidia triples [4240ef0]
Improved error message when running out of registers [9f1cee57]
Improve SYCL-Graph fill node implementation [unified-runtime bb589ca]
Implement handler::prefetch and handler::mem_advise as empty nodes enforcing the node dependencies in SYCL-Graph [unified-runtime 3c12bbc]
Remove some overheads from UR sync-points used to implement SYCL-Graph edges [unified-runtime 3c12bbc]
Use one platform for all Nvidia devices [unified-runtime f05c1c8]
Cubemap and image array support for Bindless Images [83bbea926ae7, 99635a0d214b]
Added texture fetch functionality [d13fdbe4ee02]
Added support for device to device copies for Bindless Images [unified-runtime f4898299]
DirectX 12 interop for Bindless Images [bd97f283c9f9, unified-runtime 487f4f8a]

Bug Fixes

Fix nextafter(-0.0,+0.0) [d6780ae7]
Fix issues in non-uniform group shuffles [a0c3b325]
Fix performance issues when using queue.fill() [0ccb0b7]
Fix multi_ptr relational operators for nullptr [4f91bbb]
Fix race condition in CUDA stream creation [unified-runtime cabf128]

2024.2.0

Improvements

Add support for sequentially consistent memory ordering (sm_70+) [c1e2957]

Bug Fixes

Fix fence implementation to match SYCL 2020 semantics [95e183e6]
Fix and improve local work size guessing [unified-runtime 43f0963]

2024.1.0

Improvements

Added support for sycl_ext_oneapi_graph [367b662a]
Added support for sycl_ext_oneapi_device_architecture [1ad69e59]
Added support for ext_oneapi_queue_priority [0c33fea5]
Added support for normalized channel types [fd5014ad]
Improve relevance of returned error codes [67a24f7b, b7a43a42]

Bug Fixes

Fix reported maximum local memory size [d2719b5]
Fix -fgpu-rdc option [f7595ac]
Fix missing rintf, nearbyint [3c327c73, 0ef26d3e]
Fix race condition in event profiling [e8ffd021]

Deprecation

Deprecate context interoperability; the primary context should be used instead [e213fe2f]

2024.0.2

No changes

2024.0.1

Added support for the sycl_ext_oneapi_bindless_images extension.

2024.0

Improvements

SYCL Compiler

Improved the work-group size selection for the parallel_for interface taking a sycl::range.

SYCL Library

Added support for the sycl_ext_oneapi_graph extension.
Added support for the sycl_ext_oneapi_non_uniform_groups extension.
Added support for the sycl_ext_oneapi_peer_access extension.
Introduced the max_registers_per_work_group device query.
Added support for passing the maxrregcount ptxas compiler option to the CUDA backend through the SYCL_PROGRAM_COMPILE_OPTIONS environment variable: SYCL_PROGRAM_COMPILE_OPTIONS="--maxrregcount=<value>". Note that this works for JIT-compiled programs only.

Bug Fixes

Fixed the double-precision floating-point group algorithms broadcast, joint_exclusive_scan, joint_inclusive_scan, exclusive_scan_over_group and inclusive_scan_over_group, which no longer hang when compiled with the icpx compiler.
Fixed a bug where atomic_fence used the wrong memory_scope when passed sycl::memory_scope::device as an argument.

2023.2.0

Improvements

SYCL Compiler

Improved clang++/CUDA performance by increasing the inline threshold multiplier in the NVPTX backend [22d98280]
Define __SYCL_CUDA_ARCH__ instead of __CUDA_ARCH__ for SYCL [8f5000c3]

SYCL Library

Introduced sycl_ext_oneapi_cuda_tex_cache_read to expose the __ldg* Clang builtins to SYCL as a CUDA-only extension - Read-only Texture Cache [5360825e]
Report cl_khr_subgroups as a subgroups supporting extension [8e6c092b]
atomic_fence device queries now return the minimum required capabilities rather than failing with an error [82ac98f8]
may be dropped by NVIDIA [1e88df54]
Support the query of theoretical peak memory bandwidth - Intel’s Extensions for Device Information [8ce0a6d5]
Add Support for device ID and UUID - Intel’s Extensions for Device Information [8213074d]
Support host-device memcpy2D [d0b25d4a]
Add support for sycl_ext_oneapi_memcpy2d on the CUDA backend - oneAPI memcpy2d [9008a5d2]

Bug Fixes

Replace the error reported for an invalid work-group size with PI_ERROR_INVALID_WORK_GROUP_SIZE [2357af0a]
Fix wrong results from the sycl::ctz function [5a9f601e]
Fix an issue that could cause events not to be waited on as intended [1b225447]

2023.1.0

Improvements

SYCL Compiler

Allow FTZ, prec-sqrt to override no-ftz, no-prec-sqrt [8096a6fb]
Implement support for NVIDIA architectures (such as nvidia_gpu_sm_80) as arguments to -fsycl-targets [e5de913f]

SYCL Library

Implement matrix extension using new “unified” interface [166bbc36]
Support zero-range kernels on the CUDA backend [a3958865]
Add missing macro to interop-backend-traits.cpp [a578c8141]
Allow varying program metadata in CUDA backend [25d05f3d]

Bug Fixes

ext_oneapi_cuda make_device no longer duplicates sycl::device [75302c53a]
Fix incorrectly constructed guards [ce7c594f]

Document

Demonstrate how to pass ptxas options [f48f96eb3f]
Add mention of the CUDA GPU architecture for enabling CUDA architecture-specific features [4e5d276f]

2023.0.0

Initial release of oneAPI for NVIDIA® GPUs!

This release was created from the intel/llvm repository at commit 0f579ba.

New Features

Support for CUDA® backend

SYCL Compiler

Support for sycl::half type
Support for bf16 builtins operating on storage types
Support for the SYCL builtins from relational, geometric, common and math categories
Support for sub_group extension
Support for group algorithms
Support for group_ballot intrinsic
Support for atomics with scopes and memory orders
Support for multiple streams in each queue to improve concurrent execution
Support for sycl::queue::mem_advise
Support for -ffast-math in CUDA libclc
Support for device side assert
Support for float and double exchange and compare exchange atomic operations in CUDA libclc
Enabled C++ standard library functions
The native event for a default-constructed sycl::event must be in the COMPLETE state

SYCL Library

Add bfloat16 builtins for fma, fmin and fmax
Support for sycl::aspect::fp16
Add tanh (for float and half) and exp2 (for half) native definitions
Support for sycl::get_native(sycl::buffer)
Implemented mem_advise reset and managed concurrent memory checks
Support for element-wise operations on joint_matrix including bfloat16 support, oneAPI matrix extension
Support for Unified Shared Memory (USM)
