2025.0.0
The Nvidia plugin is now available on Windows.
Improvements
Added sm90a architecture [f204869]
Added __PTX_VERSION__ macro to provide PTX version [58f829a]
Implemented sycl_ext_codeplay_enqueue_native_command extension [0f48227]
Improved C-CXX standard library support in kernels [da379ec, 9942378]
Improved compilation diagnostic for incorrect Nvidia triples [4240ef0]
Improved error message when running out of registers [9f1cee57]
Improve SYCL-Graph fill node implementation [unified-runtime bb589ca]
Implement handler::prefetch and handler::mem_advise as empty nodes enforcing the node dependencies in SYCL-Graph [unified-runtime 3c12bbc]
Remove some overheads from UR sync-points used to implement SYCL-Graph edges [unified-runtime 3c12bbc]
Use one platform for all Nvidia devices [unified-runtime f05c1c8]
Cubemap and image array support for Bindless Images [83bbea926ae7, 99635a0d214b]
Added texture fetch functionality [d13fdbe4ee02]
Added support for device to device copies for Bindless Images [unified-runtime f4898299]
DirectX 12 interop for Bindless Images [bd97f283c9f9, unified-runtime 487f4f8a]
Bug Fixes
Fix nextafter(-0.0,+0.0) [d6780ae7]
Fix issues in non-uniform group shuffles [a0c3b325]
Fix performance issues when using queue.fill() [0ccb0b7]
Fix multi_ptr relational operators for nullptr [4f91bbb]
Fix race condition in CUDA stream creation [unified-runtime cabf128]
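The nextafter entry above does not say what the incorrect result was. As a point of comparison only, the sketch below uses the host C++ standard library (not the SYCL device builtin) to show the result the C standard specifies for this input: when both arguments compare equal, the second argument is returned, so the expected value is +0.0. The helper names are hypothetical.

```cpp
#include <cmath>

// Host-side reference using std::nextafter, NOT the SYCL device builtin.
// -0.0 and +0.0 compare equal, and in that case nextafter returns its
// second argument, so the expected result here is +0.0.
inline double next_from_negative_zero() {
    return std::nextafter(-0.0, +0.0);
}

// Hypothetical helper: true when v is a zero with the sign bit clear (+0.0).
inline bool is_positive_zero(double v) {
    return v == 0.0 && !std::signbit(v);
}
```

Calling is_positive_zero(next_from_negative_zero()) on the host is expected to yield true; the same check could be adapted to run inside a kernel to compare device behaviour.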
2024.2.0
Improvements
Add support for sequentially consistent memory ordering (sm_70+) [c1e2957]
Bug Fixes
Fix fence implementation to match SYCL 2020 semantics [95e183e6]
Fix and improve local work size guessing [unified-runtime 43f0963]
2024.1.0
Improvements
Added support for sycl_ext_oneapi_graph [367b662a]
Added support for sycl_ext_oneapi_device_architecture [1ad69e59]
Added support for ext_oneapi_queue_priority [0c33fea5]
Added support for normalized channel types [fd5014ad]
Improve relevance of returned error codes [67a24f7b, b7a43a42]
Bug Fixes
Fix reported maximum local memory size [d2719b5]
Fix -fgpu-rdc option [f7595ac]
Fix missing rintf, nearbyint [3c327c73, 0ef26d3e]
Fix race condition in event profiling [e8ffd021]
Deprecation
Deprecate context interoperability, primary context should be used instead [e213fe2f]
2024.0.2
No changes
2024.0.1
Added support for the sycl_ext_oneapi_bindless_images extension.
2024.0
Improvements
SYCL Compiler
Improved the work_group size selection for the parallel_for interface taking a sycl::range.
SYCL Library
Added support for the sycl_ext_oneapi_graph extension.
Added support for the sycl_ext_oneapi_non_uniform_groups extension.
Added support for the sycl_ext_oneapi_peer_access extension.
Introduced the max_registers_per_work_group device query.
Added a mechanism to pass the maxrregcount ptxas compiler option to the CUDA backend via SYCL_PROGRAM_COMPILE_OPTIONS, in the following way: SYCL_PROGRAM_COMPILE_OPTIONS="--maxrregcount=<value>". Note that this works for JIT compiled programs only.
Bug Fixes
Double precision floating point Group algorithms: broadcast, joint_exclusive_scan, joint_inclusive_scan, exclusive_scan_over_group, inclusive_scan_over_group have been fixed and no longer hang when compiled with the icpx compiler.
Fixed a bug in atomic_fence where it was using the wrong memory_scope when passed sycl::memory_scope::device as an argument.
2023.2.0
Improvements
SYCL Compiler
clang++/cuda performance improvement by increasing inline threshold multiplier in NVPTX backend [22d98280]
Define __SYCL_CUDA_ARCH__ instead of __CUDA_ARCH__ for SYCL [8f5000c3]
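Code that previously branched on __CUDA_ARCH__ may need to check __SYCL_CUDA_ARCH__ when built as SYCL. The following is a minimal sketch of such a guard. The function name is hypothetical, and it assumes that __SYCL_CUDA_ARCH__ carries a numeric architecture value in the same way __CUDA_ARCH__ does.

```cpp
// Hypothetical helper: reports which architecture macro the current
// compilation pass sees, or 0 when neither macro is defined.
// Assumption: __SYCL_CUDA_ARCH__ expands to a numeric value like __CUDA_ARCH__.
constexpr int device_cuda_arch() {
#if defined(__SYCL_CUDA_ARCH__)
    return __SYCL_CUDA_ARCH__;  // SYCL device compilation for an NVIDIA target
#elif defined(__CUDA_ARCH__)
    return __CUDA_ARCH__;       // plain CUDA device compilation
#else
    return 0;                   // host compilation, or a non-NVIDIA target
#endif
}
```

When compiled by an ordinary host compiler neither macro is defined, so the function returns 0; checking both macros keeps a shared header usable from CUDA and SYCL builds.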
SYCL Library
Introduced sycl_ext_oneapi_cuda_tex_cache_read to expose the __ldg* clang builtins to sycl as a cuda only extension - Read-only Texture Cache [5360825e]
Report cl_khr_subgroups as a subgroups supporting extension [8e6c092b]
atomic_fence device queries now return the minimum required capabilities rather than failing with an error [82ac98f8] may be dropped by NVIDIA [1e88df54]
Support the query of theoretical peak memory bandwidth - Intel’s Extensions for Device Information [8ce0a6d5]
Add Support for device ID and UUID - Intel’s Extensions for Device Information [8213074d]
Support host-device memcpy2D [d0b25d4a]
Add support for sycl_ext_oneapi_memcpy2d on CUDA backend - OneAPI memcpy2d [9008a5d2]
Bug Fixes
Replace error on invalid work group size with PI_ERROR_INVALID_WORK_GROUP_SIZE [2357af0a]
Address wrong results from sycl::ctz function [5a9f601e]
Address the issue that can cause events not to be waited on as intended [1b225447]
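The sycl::ctz entry above does not describe which inputs gave wrong results. For anyone validating the fix, a plain host-side reference for count-trailing-zeros can serve as a comparison baseline. This is a sketch, not the library's implementation; the function name is hypothetical, and it assumes that an input of 0 yields the bit width of the type (32 here).

```cpp
#include <cstdint>

// Hypothetical host-side reference for count-trailing-zeros on 32-bit values.
// Assumption: an input of 0 returns the bit width of the type (32).
inline int ctz32_reference(std::uint32_t x) {
    if (x == 0) {
        return 32;
    }
    int count = 0;
    // Shift right until the lowest set bit reaches position 0.
    while ((x & 1u) == 0u) {
        x >>= 1;
        ++count;
    }
    return count;
}
```

Device results gathered from a kernel can then be compared element by element against this function over a set of test inputs.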
2023.1.0
Improvements
SYCL Compiler
Allow FTZ, prec-sqrt to override no-ftz, no-prec-sqrt [8096a6fb]
Implement support for NVIDIA architectures (such as nvidia_gpu_sm_80) as argument to -fsycl-targets [e5de913f]
SYCL Library
Implement matrix extension using new “unified” interface [166bbc36]
Support zero range kernel for cuda backends [a3958865]
Add missing macro to interop-backend-traits.cpp [a578c8141]
Allow varying program metadata in CUDA backend [25d05f3d]
Bug Fixes
ext_oneapi_cuda make_device no longer duplicates sycl::device [75302c53a]
Fix incorrectly constructed guards [ce7c594f]
Document
Demonstrate how to pass ptxas options [f48f96eb3f]
Add mention of cuda gpu arch for enabling cuda-arch specific features [4e5d276f]
2023.0.0
Initial release of oneAPI for NVIDIA® GPUs!
This release was created from the intel/llvm repository at commit 0f579ba.
New Features
Support for CUDA® backend
SYCL Compiler
Support for sycl::half type
Support for bf16 builtins operating on storage types
Support for the SYCL builtins from relational, geometric, common and math categories
Support for sub_group extension
Support for group algorithms
Support for group_ballot intrinsic
Support for atomics with scopes and memory orders
Support for multiple streams in each queue to improve concurrent execution
Support for sycl::queue::mem_advise
Support for --ffast-math in CUDA libclc
Support for device side assert
Support for float and double exchange and compare exchange atomic operations in CUDA libclc
Enabled CXX standard library functions
Native event for default-ctored sycl::event has to be in COMPLETE state
SYCL Library
Add bfloat16 builtins for fma, fmin and fmax
Support for sycl::aspect::fp16
Add tanh (for floats/halfs) and exp2 (for halfs) native definitions
Support for sycl::get_native(sycl::buffer)
Implemented mem_advise reset and managed concurrent memory checks
Support for element-wise operations on joint_matrix including bfloat16 support, oneAPI matrix extension
Support for Unified Shared Memory (USM)