work_group size selection for the parallel_for interface taking a
Added support for the sycl_ext_oneapi_graph extension.
Added support for the sycl_ext_oneapi_non_uniform_groups extension.
Added support for the sycl_ext_oneapi_peer_access extension.
Introduced the max_registers_per_work_group device query.
Added a mechanism via SYCL_PROGRAM_COMPILE_OPTIONS such that the maxrregcount ptxas compiler option can be passed to the CUDA backend in the following way:
SYCL_PROGRAM_COMPILE_OPTIONS="--maxrregcount=<value>". Note that this works for JIT compiled programs only.
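As a sketch of how this mechanism can be used, assuming the icpx compiler and the generic nvptx64-nvidia-cuda JIT target (the application and file names are illustrative):

```shell
# Build for the CUDA backend with JIT compilation (PTX is finalized at runtime).
# "app.cpp" and "app" are illustrative names.
icpx -fsycl -fsycl-targets=nvptx64-nvidia-cuda app.cpp -o app

# Limit each thread to 32 registers when the CUDA backend JIT-compiles the
# kernels. This has no effect on AOT-compiled programs.
SYCL_PROGRAM_COMPILE_OPTIONS="--maxrregcount=32" ./app
```

Capping registers per thread can raise occupancy at the cost of register spills, so the right value is workload-dependent.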
Double precision floating point group algorithms, including inclusive_scan_over_group, have been fixed and no longer hang when compiled with the icpx compiler.
Fixed a bug in atomic_fence where the wrong memory scope was used when sycl::memory_scope::device was passed as an argument.
clang++/cuda performance improvement by increasing inline threshold multiplier in NVPTX backend [22d98280]
__CUDA_ARCH__ for SYCL [8f5000c3]
Added sycl_ext_oneapi_cuda_tex_cache_read to expose the __ldg* clang builtins to SYCL as a CUDA-only extension - Read-only Texture Cache [5360825e]
cl_khr_subgroups as a subgroups supporting extension [8e6c092b]
atomic_fence device queries now return the minimum required capabilities rather than failing with an error [82ac98f8]
may be dropped by NVIDIA [1e88df54]
Support the query of theoretical peak memory bandwidth - Intel’s Extensions for Device Information [8ce0a6d5]
Add Support for device ID and UUID - Intel’s Extensions for Device Information [8213074d]
Add support for sycl_ext_oneapi_memcpy2d on CUDA backend - OneAPI memcpy2d [9008a5d2]
Replace error on invalid work group size to
Address wrong results from
Address the issue that can cause events not to be waited on as intended [1b225447]
Allow FTZ, prec-sqrt to override no-ftz, no-prec-sqrt [8096a6fb]
Implement support for NVIDIA architectures (such as nvidia_gpu_sm_80) as argument to fsycl-targets [e5de913f]
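A minimal sketch of how such an architecture argument can be used, assuming the icpx compiler (the source file name is illustrative):

```shell
# AOT-compile for a specific NVIDIA architecture (sm_80, i.e. Ampere)
# instead of the generic nvptx64-nvidia-cuda JIT target, enabling
# cuda-arch specific features at compile time.
# "app.cpp" and "app" are illustrative names.
icpx -fsycl -fsycl-targets=nvidia_gpu_sm_80 app.cpp -o app
```

Targeting a concrete architecture avoids runtime PTX finalization and lets the compiler use instructions specific to that GPU generation, at the cost of portability to other architectures.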
Implement matrix extension using new “unified” interface [166bbc36]
Support zero range kernel for cuda backends [a3958865]
Add missing macro to interop-backend-traits.cpp [a578c8141]
Allow varying program metadata in CUDA backend [25d05f3d]
ext_oneapi_cuda make_device no longer duplicates sycl::device [75302c53a]
Fix incorrectly constructed guards [ce7c594f]
Demonstrate how to pass ptxas options [f48f96eb3f]
Add mention of cuda gpu arch for enabling cuda-arch specific features [4e5d276f]
Initial release of oneAPI for NVIDIA® GPUs!
This release was created from the intel/llvm repository at commit 0f579ba.
Support for CUDA® backend
Support for sycl::half type
Support for bf16 builtins operating on storage types
Support for the SYCL builtins from relational, geometric, common and math categories
Support for sub_group extension
Support for group algorithms
Support for atomics with scopes and memory orders
Support for multiple streams in each queue to improve concurrent execution
Support for --ffast-math in CUDA libclc
Support for device side
Support for float and double exchange and compare exchange atomic operations in CUDA libclc
Enabled CXX standard library functions
The native event for a default-constructed sycl::event is now in the COMPLETE state
tanh (for floats/halfs) and exp2 (for halfs) native definitions
mem_advise reset and managed concurrent memory checks
Support for element-wise operations and bfloat16 in the oneAPI matrix extension
Support for Unified Shared Memory (USM)