2025.0.0
The Nvidia plugin is now available on Windows.
Improvements
Added sm90a architecture [f204869]
Added __PTX_VERSION__ macro to provide PTX version [58f829a]
Implemented sycl_ext_codeplay_enqueue_native_command extension [0f48227]
Improved C-CXX standard library support in kernels [da379ec, 9942378]
Improved compilation diagnostic for incorrect Nvidia triples [4240ef0]
Improved error message when running out of registers [9f1cee57]
Improve SYCL-Graph fill node implementation [unified-runtime bb589ca]
Implement handler::prefetch and handler::mem_advise as empty nodes enforcing the node dependencies in SYCL-Graph [unified-runtime 3c12bbc]
Remove some overheads from UR sync-points used to implement SYCL-Graph edges [unified-runtime 3c12bbc]
Use one platform for all Nvidia devices [unified-runtime f05c1c8]
Cubemap and image array support for Bindless Images [83bbea926ae7, 99635a0d214b]
Added texture fetch functionality [d13fdbe4ee02]
Added support for device to device copies for Bindless Images [unified-runtime f4898299]
DirectX 12 interop for Bindless Images [bd97f283c9f9, unified-runtime 487f4f8a]
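The __PTX_VERSION__ macro listed above can be probed with an ordinary preprocessor guard. The following is a minimal sketch, assuming the macro expands to an integer constant and is left undefined when no Nvidia target is being compiled; the release note states neither, and the helper name ptx_version_or_default is hypothetical.

```cpp
// Hypothetical helper: returns the value of __PTX_VERSION__ when the
// compiler defines it, or -1 otherwise.
// Assumption: the macro expands to an integer constant. The release
// note does not document its format.
constexpr long ptx_version_or_default() {
#ifdef __PTX_VERSION__
  return static_cast<long>(__PTX_VERSION__);
#else
  return -1;  // macro not defined for this compilation
#endif
}
```

Guarding with #ifdef keeps the same source buildable with compilers or targets that do not define the macro.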
Bug Fixes
Fix nextafter(-0.0,+0.0) [d6780ae7]
Fix issues in non-uniform group shuffles [a0c3b325]
Fix performance issues when using queue.fill() [0ccb0b7]
Fix multi_ptr relational operators for nullptr [4f91bbb]
Fix race condition in CUDA stream creation [unified-runtime cabf128]
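For the nextafter(-0.0,+0.0) entry, the release note does not say what the device previously returned. As a point of comparison, the sketch below shows the host-side behavior of the C++ standard library, where nextafter(x, y) returns y when x compares equal to y. That the SYCL builtin is meant to follow the same rule is an assumption here, and the helper name is hypothetical.

```cpp
#include <cmath>

// Hypothetical host-side reference check.
// The standard library specifies that nextafter(x, y) returns y when
// x == y. Since -0.0 == +0.0, the expected result is +0.0, meaning a
// zero value with the sign bit clear.
inline bool nextafter_zero_is_positive() {
  const double r = std::nextafter(-0.0, +0.0);
  return r == 0.0 && !std::signbit(r);
}
```

A device result copied back to the host could be checked against the same two conditions: the value compares equal to zero and std::signbit reports false.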
2024.2.0
Improvements
Add support for sequentially consistent memory ordering (sm_70+) [c1e2957]
Bug Fixes
Fix fence implementation to match SYCL 2020 semantics [95e183e6]
Fix and improve local work size guessing [unified-runtime 43f0963]
2024.1.0
Improvements
Added support for sycl_ext_oneapi_graph [367b662a]
Added support for sycl_ext_oneapi_device_architecture [1ad69e59]
Added support for ext_oneapi_queue_priority [0c33fea5]
Added support for normalized channel types [fd5014ad]
Improve relevance of returned error codes [67a24f7b, b7a43a42]
Bug Fixes
Fix reported maximum local memory size [d2719b5]
Fix -fgpu-rdc option [f7595ac]
Fix missing rintf, nearbyint [3c327c73, 0ef26d3e]
Fix race condition in event profiling [e8ffd021]
Deprecation
Deprecate context interoperability, primary context should be used instead [e213fe2f]
2024.0.2
No changes
2024.0.1
Added support for the sycl_ext_oneapi_bindless_images extension.
2024.0
Improvements
SYCL Compiler
Improved the work_group size selection for the parallel_for interface taking a sycl::range.
SYCL Library
Added support for the sycl_ext_oneapi_graph extension.
Added support for the sycl_ext_oneapi_non_uniform_groups extension.
Added support for the sycl_ext_oneapi_peer_access extension.
Introduced the max_registers_per_work_group device query.
Added a mechanism via SYCL_PROGRAM_COMPILE_OPTIONS such that the maxrregcount ptxas compiler option can be passed to the cuda backend in the following way: SYCL_PROGRAM_COMPILE_OPTIONS="--maxrregcount=<value>". Note that this works for JIT compiled programs only.
Bug Fixes
Double precision floating point Group algorithms: broadcast, joint_exclusive_scan, joint_inclusive_scan, exclusive_scan_over_group, inclusive_scan_over_group have been fixed and no longer hang when compiled with the icpx compiler.
Fixed a bug in atomic_fence where it was using the wrong memory_scope when passed sycl::memory_scope::device as an argument.
2023.2.0
Improvements
SYCL Compiler
clang++/cuda performance improvement by increasing inline threshold multiplier in NVPTX backend [22d98280]
Define __SYCL_CUDA_ARCH__ instead of __CUDA_ARCH__ for SYCL [8f5000c3]
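With this change, SYCL sources that used to branch on __CUDA_ARCH__ would test __SYCL_CUDA_ARCH__ instead. A minimal sketch of such a guard follows, assuming the macro is defined only while compiling device code for the CUDA backend; the helper name sycl_cuda_device_pass is hypothetical.

```cpp
// Hypothetical helper: reports whether this translation unit is being
// compiled for a CUDA device by a SYCL compiler.
// Assumption: __SYCL_CUDA_ARCH__ is defined only in that case.
constexpr bool sycl_cuda_device_pass() {
#if defined(__SYCL_CUDA_ARCH__)
  return true;   // CUDA-specific code path can be selected here
#else
  return false;  // host pass or a non-CUDA target
#endif
}
```

Keeping the check in one small function means a later macro rename touches a single place rather than every guarded region.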
SYCL Library
Introduced sycl_ext_oneapi_cuda_tex_cache_read to expose the __ldg* clang builtins to sycl as a cuda only extension - Read-only Texture Cache [5360825e]
Report cl_khr_subgroups as a subgroups supporting extension [8e6c092b]
atomic_fence device queries now return the minimum required capabilities rather than failing with an error [82ac98f8] may be dropped by NVIDIA [1e88df54]
Support the query of theoretical peak memory bandwidth - Intel’s Extensions for Device Information [8ce0a6d5]
Add Support for device ID and UUID - Intel’s Extensions for Device Information [8213074d]
Support host-device memcpy2D [d0b25d4a]
Add support for sycl_ext_oneapi_memcpy2d on CUDA backend - OneAPI memcpy2d [9008a5d2]
Bug Fixes
Replace error on invalid work group size with PI_ERROR_INVALID_WORK_GROUP_SIZE [2357af0a]
Address wrong results from sycl::ctz function [5a9f601e]
Address the issue that can cause events not to be waited on as intended [1b225447]
2023.1.0
Improvements
SYCL Compiler
Allow FTZ, prec-sqrt to override no-ftz, no-prec-sqrt [8096a6fb]
Implement support for NVIDIA architectures (such as nvidia_gpu_sm_80) as argument to fsycl-targets [e5de913f]
SYCL Library
Implement matrix extension using new “unified” interface [166bbc36]
Support zero range kernel for cuda backends [a3958865]
Add missing macro to interop-backend-traits.cpp [a578c8141]
Allow varying program metadata in CUDA backend [25d05f3d]
Bug Fixes
ext_oneapi_cuda make_device no longer duplicates sycl::device [75302c53a]
Fix incorrectly constructed guards [ce7c594f]
Documentation
Demonstrate how to pass ptxas options [f48f96eb3f]
Add mention of cuda gpu arch for enabling cuda-arch specific features [4e5d276f]
2023.0.0
Initial release of oneAPI for NVIDIA® GPUs!
This release was created from the intel/llvm repository at commit 0f579ba.
New Features
Support for CUDA® backend
SYCL Compiler
Support for sycl::half type
Support for bf16 builtins operating on storage types
Support for the SYCL builtins from relational, geometric, common and math categories
Support for sub_group extension
Support for group algorithms
Support for group_ballot intrinsic
Support for atomics with scopes and memory orders
Support for multiple streams in each queue to improve concurrent execution
Support for sycl::queue::mem_advise
Support for -ffast-math in CUDA libclc
Support for device side assert
Support for float and double exchange and compare exchange atomic operations in CUDA libclc
Enabled CXX standard library functions
Native event for default-ctored sycl::event has to be in COMPLETE state
SYCL Library
Add bfloat16 builtins for fma, fmin and fmax
Support for sycl::aspect::fp16
Add tanh (for floats/halfs) and exp2 (for halfs) native definitions
Support for sycl::get_native(sycl::buffer)
Implemented mem_advise reset and managed concurrent memory checks
Support for element-wise operations on joint_matrix including bfloat16 support, oneAPI matrix extension
Support for Unified Shared Memory (USM)