OpenCL extensions are split into three categories:
- KHR
Extension ratified by Khronos
- EXT
Extension with collaboration from multiple vendors, but not ratified by Khronos.
- Vendor
Extension defined by a single vendor, but which may be implemented by other vendors, e.g. oneAPI Construction Kit implements the vendor extensions cl_intel_unified_shared_memory and cl_intel_required_subgroup_size.
oneAPI Construction Kit implements several Codeplay vendor extensions, specified under the extension directory. When adding a new vendor extension, the official OpenCL-Docs cl_extension_template should be used for reference, but written in RST rather than AsciiDoc.
When defining new enum values for an extension, if those enums will be used in existing entry-points, then they need to be unique to avoid conflicts with enums defined by other vendors. To enable this, vendors reserve a range of values in blocks of 16 in cl.xml. Codeplay’s unique range is between 0x4260 and 0x426F inclusive. If our needs exceed this range then another block can be reserved, although it may not be contiguous.
Important
The next available value in our range is 0x4263. Once this value is claimed, update this text to the next value that is free to use.
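As an illustration of the range rule above, the following minimal sketch checks at compile time that a new enum value falls inside the reserved block. The helper `isInCodeplayRange`, the constant names, and the enum `CL_EXAMPLE_NEW_ENUM_CODEPLAY` are hypothetical and exist only for this example; only the bounds 0x4260 and 0x426F come from the text.

```cpp
#include <cstdint>

// Bounds of the block reserved for Codeplay in cl.xml, as stated above.
constexpr std::uint32_t kCodeplayEnumFirst = 0x4260;
constexpr std::uint32_t kCodeplayEnumLast = 0x426F;

// Returns true when `value` lies inside the reserved block (inclusive).
constexpr bool isInCodeplayRange(std::uint32_t value) {
  return value >= kCodeplayEnumFirst && value <= kCodeplayEnumLast;
}

// Hypothetical enum for a new extension, claiming a value from the block.
constexpr std::uint32_t CL_EXAMPLE_NEW_ENUM_CODEPLAY = 0x4263;

// Fails the build if the hypothetical enum strays outside the block.
static_assert(isInCodeplayRange(CL_EXAMPLE_NEW_ENUM_CODEPLAY),
              "new enum must come from the reserved Codeplay block");
```

A check like this only guards against leaving the reserved block; it does not detect two extensions claiming the same value, which is why the text above still needs to be kept up to date.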
OpenCL C 1.2 - khr_opencl_c_1_2
The OpenCL 1.2 extension specification defines the following set of OpenCL C extensions. These extensions have been grouped together into the khr_opencl_c_1_2 extension object.
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_byte_addressable_store
cl_khr_fp64
The include/extension/khr_opencl_c_1_2.h header file and source/extension/khr_opencl_c_1_2.cpp source file define how the extensions listed above are reported to the OpenCL application by extending the clGetDeviceInfo entry point. This extension does not provide any additional OpenCL API entry points.
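From the application side, a check for one of these extensions could look like the sketch below. It assumes the application has already retrieved the space-separated extension string, presumably through clGetDeviceInfo with CL_DEVICE_EXTENSIONS; the helper name `hasExtension` is hypothetical and not part of the oneAPI Construction Kit.

```cpp
#include <sstream>
#include <string>

// Returns true when `name` appears as a whole token in the space-separated
// `extensions` string. Matching whole tokens avoids false positives when
// one extension name is a prefix of another.
bool hasExtension(const std::string &extensions, const std::string &name) {
  std::istringstream stream(extensions);
  std::string token;
  while (stream >> token) {
    if (token == name) {
      return true;
    }
  }
  return false;
}
```

For example, `hasExtension(extensions, "cl_khr_fp64")` would tell the application whether double precision support was reported.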
Installable Client Driver (ICD) - cl_khr_icd
The ICD is provided by Khronos to allow multiple hardware vendors’ drivers to coexist on the same system and avoid an application suffering from linker errors when attempting to link against those drivers.
The OpenCL entry point definitions do not contain the implementation where the work is done; instead they call into the cl namespace. For each OpenCL entry point, such as clGetPlatformIDs, there is a matching function in the cl namespace that is invoked, in this example cl::GetPlatformIDs. This was done to provide a clean boundary between the ICD and the implementation of the OpenCL entry points.
CL_API_ENTRY cl_int CL_API_CALL clGetPlatformIDs(const cl_uint num_entries,
                                                 cl_platform_id *platforms,
                                                 cl_uint *const num_platforms) {
  return cl::GetPlatformIDs(num_entries, platforms, num_platforms);
}
Any object created by the driver, such as cl_command_queue, must contain a pointer to the ICD dispatch table. This is the mechanism used by the ICD to determine which driver an API object works with; it functions in much the same way as a C++ virtual function table. Because of this similarity, the OpenCL API objects must not themselves contain a virtual function table. The ICD specifies that any object created by the driver must reserve the first sizeof(void*) bytes in its structure for the ICD dispatch table.
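The layout requirement can be illustrated with the following sketch. The types `example_icd_dispatch` and `example_command_queue` and the helper `dispatchOf` are hypothetical stand-ins, not the real driver or ICD loader definitions; the point is only to show how the rule can be checked at compile time.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the ICD dispatch table: a table of entry point
// pointers shared by every object the driver creates.
struct example_icd_dispatch {
  int (*get_platform_ids)(unsigned, void *, unsigned *);
};

// Hypothetical driver object. The dispatch pointer is the first member and
// the type has no virtual functions.
struct example_command_queue {
  const example_icd_dispatch *dispatch;
  unsigned reference_count;
};

static_assert(std::is_standard_layout<example_command_queue>::value,
              "offsetof is only well defined for standard layout types");
static_assert(offsetof(example_command_queue, dispatch) == 0,
              "the dispatch pointer must occupy the first sizeof(void*) bytes");
static_assert(!std::is_polymorphic<example_command_queue>::value,
              "API objects must not contain a virtual function table");

// Shows how a loader could recover the table from an opaque handle, relying
// only on the dispatch pointer being stored at the start of the object.
inline const example_icd_dispatch *dispatchOf(const void *object) {
  return *static_cast<const example_icd_dispatch *const *>(object);
}
```

Adding a virtual function to `example_command_queue` would trip the assertions, which is the kind of mistake the rule above is meant to prevent.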
Kernel Debug - cl_codeplay_kernel_debug
The Kernel Debug - cl_codeplay_kernel_debug extension grants developers the ability to specify build options which enable attaching a debugger to a kernel being executed on a device.
- The -g flag, providing emission of debug symbols in the compiled kernel.
- The -S <path> flag, setting the source code location in debug info to the specified path. Our runtime then creates this file if it does not already exist on disk, allowing the debugger to display kernel source code without manual configuration.
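A minimal sketch of composing these flags into an options string follows. The helper `makeDebugBuildOptions` is hypothetical; the resulting string is assumed to be passed as the options argument of clBuildProgram or clCompileProgram. The sketch does not quote or escape the path, so paths containing spaces would need extra handling.

```cpp
#include <string>

// Builds an options string that enables debug symbols (-g) and, when a path
// is given, points the debug info at that source file (-S <path>).
std::string makeDebugBuildOptions(const std::string &sourcePath) {
  std::string options = "-g";
  if (!sourcePath.empty()) {
    options += " -S " + sourcePath;
  }
  return options;
}
```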
Extra build options - cl_codeplay_extra_build_options
The Extra Build Options - cl_codeplay_extra_build_options extension allows the user to specify additional flags handled by the clBuildProgram and clCompileProgram entry-points.
- The -cl-llvm-stats flag, allowing LLVM to print the statistics from all the passes that have any.
- The -cl-precache-local-sizes=<sizes> build option, allowing the pre-caching of kernel compilation for the specified local work group sizes.
Kernel Exec Info - cl_codeplay_kernel_exec_info
The Kernel Exec Info - cl_codeplay_kernel_exec_info extension adds support for passing additional information, other than argument values, to a kernel. This base extension doesn’t provide support for any particular parameter types but is intended to be built upon by future extensions that require this support. For example, the Intel USM extension USM - cl_intel_unified_shared_memory requires support for clSetKernelExecInfo, which is part of the 2.0 API. By building on this extension instead, USM can be supported on 1.2.
cl_int clSetKernelExecInfoCODEPLAY(cl_kernel kernel,
                                   cl_kernel_exec_info param_name,
                                   size_t param_value_size,
                                   const void* param_value);
Performance Counter - cl_codeplay_performance_counters
The Performance Counters - cl_codeplay_performance_counters extension allows the application to enable and get results from a set of supported, likely hardware, performance counters.
The flow of execution is as follows:
1. Query the devices’ supported performance counters using clGetDeviceInfo with a param_name of CL_DEVICE_PERFORMANCE_COUNTERS_CODEPLAY.
2. Create a cl_command_queue using clCreateCommandQueueWithProperties with a properties key of CL_QUEUE_PERFORMANCE_COUNTERS_CODEPLAY and a value of type cl_performance_counter_config_codeplay* containing a list of cl_performance_counter_desc_codeplay structures specifying which performance counters should be enabled, using uuids obtained in step 1.
3. Enqueue a workload, ensuring that a cl_event is provided.
4. Wait for the workload to complete execution.
5. Query the performance counter results using clGetEventProfilingInfo with a param_name of CL_PROFILING_COMMAND_PERFORMANCE_COUNTERS_CODEPLAY.
6. Read the performance counter results, using the storage member of cl_performance_counter_codeplay to select the correct anonymous union member of each cl_performance_counter_result_codeplay structure.
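The final step, selecting a union member based on the storage member, can be sketched as below. The types `example_counter` and `example_counter_result`, the enumerators, and the union member names are hypothetical stand-ins chosen for this illustration; the real definitions are given by the extension specification and may differ.

```cpp
#include <cstdint>

// Hypothetical storage tags; the real enumerators come from the extension.
enum example_storage { EXAMPLE_STORAGE_INT64, EXAMPLE_STORAGE_FLOAT64 };

// Stand-in for cl_performance_counter_codeplay: describes a counter and
// records which union member its results are stored in.
struct example_counter {
  example_storage storage;
};

// Stand-in for cl_performance_counter_result_codeplay: one result value held
// in an anonymous union.
struct example_counter_result {
  union {
    std::int64_t int64;
    double float64;
  };
};

// Reads the union member selected by the counter's storage tag and widens
// the value to double for uniform reporting.
double resultAsDouble(const example_counter &counter,
                      const example_counter_result &result) {
  switch (counter.storage) {
    case EXAMPLE_STORAGE_INT64:
      return static_cast<double>(result.int64);
    case EXAMPLE_STORAGE_FLOAT64:
      return result.float64;
  }
  return 0.0;  // Unknown storage tag.
}
```

Reading a union member other than the one the storage tag names would yield a meaningless value, which is why the tag must be consulted for every result.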
Soft Math - cl_codeplay_soft_math
The Soft Math - cl_codeplay_soft_math extension allows a developer to force the math builtins used to be sourced from Abacus. An additional build option, "-codeplay-soft-math", is supported for clCompileProgram and clBuildProgram to specify this.
When a customer compiler backend consumes an executable, it is free to replace builtin functions with optimized equivalents for the target platform. For instance, even though Abacus provides a conformant implementation of the count leading zeros clz builtin function, many mux targets will have a hardware instruction that allows this function to be implemented more efficiently. With the "-codeplay-soft-math" option specified, the mux backend will not use any hardware optimized builtins, and will instead rely on the Abacus functionality.
Being able to prevent the mux target from substituting more efficient implementations for Abacus builtin functionality makes testing and gathering performance metrics much easier.
Whole Function Vectorization - cl_codeplay_wfv
The Whole Function Vectorization - cl_codeplay_wfv extension provides a mechanism to vectorize an OpenCL kernel across the primary work-item dimension, using our whole function vectorization library VECZ.
An additional build option -cl-wfv={always|auto|never} is supported for clCompileProgram and clBuildProgram to enable/disable whole function vectorization. These choices are described in detail in the extension specification.
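As a small sketch of how an application might select one of the three documented values, the code below maps a mode to the corresponding option string. The `WfvMode` enum and the `wfvBuildOption` helper are hypothetical; only the three option spellings come from the text above.

```cpp
#include <string>

// Hypothetical application-side mirror of the three documented choices.
enum class WfvMode { Always, Auto, Never };

// Returns the build option string for the requested mode, suitable for
// appending to the options passed to clCompileProgram or clBuildProgram.
std::string wfvBuildOption(WfvMode mode) {
  switch (mode) {
    case WfvMode::Always:
      return "-cl-wfv=always";
    case WfvMode::Auto:
      return "-cl-wfv=auto";
    case WfvMode::Never:
      return "-cl-wfv=never";
  }
  return "";  // Unreachable for valid enumerators.
}
```

Using an enum rather than free-form strings keeps misspelled values out of the options string.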
A new entry point clGetKernelWFVInfoCODEPLAY is added to allow whole function vectorization information to be queried for a specified kernel, given a device and local work size.
clGetKernelWFVInfoCODEPLAY supports two parameter queries:
- CL_KERNEL_WFV_STATUS_CODEPLAY queries the status of whole function vectorization.
- CL_KERNEL_WFV_WIDTHS_CODEPLAY queries the list of widths for each work-item dimension.
SPIR-V USM Generic Storage Class - SPV_codeplay_usm_generic_storage_class
To support USM functionality in SYCL, ComputeCpp has found it necessary to generate SPIR-V without address space information in it. To enable this, the USM Generic Storage Class - spv_codeplay_usm_generic_storage_class extension was created. With this extension enabled, a SPIR-V module can pass the Generic storage class for all of its pointer type declarations to indicate that no address space information is included in the declaration. Our SPIR-V translator interprets all such pointer type declarations as having address space 0. This address space was chosen primarily because this is the address space ComputeCpp uses internally for this, and secondarily because it means that function scope variables can remain as allocas, which is helpful for later optimizations.
Note that this overrides the normal semantics of storage class Generic. We wrote the extension this way rather than adding a new storage class so we could keep using existing SPIR-V tools without needing to fork them to add support.
Command-Buffers (Provisional) - cl_khr_command_buffer
oneAPI Construction Kit implements version 0.9.1 of the provisional cl_khr_command_buffer extension.
The simultaneous-use optional capability is supported, which allows the same command-buffer to be repeatedly enqueued without blocking in user code.
cl_khr_command_buffer is intended as a base extension that other future extensions will layer upon. These will provide additional features such as recording a command-buffer across multiple command-queues, potentially associated with different devices, as well as the ability to modify a command-buffer recording in between replays.
Our implementation currently has the following limitations which need to be resolved before the extension can be turned on by default in the oneAPI Construction Kit:
- Event profiling is not implemented (see CA-3322).
Command Buffers: Mutable Dispatch - cl_khr_command_buffer_mutable_dispatch
oneAPI Construction Kit implements version 0.9.0 of the provisional cl_khr_command_buffer_mutable_dispatch extension.
cl_khr_command_buffer_mutable_dispatch builds upon cl_khr_command_buffer to allow users to modify kernel execution commands between enqueues of a command-buffer.
Extended Async Copies - cl_khr_extended_async_copies
oneAPI Construction Kit implements the cl_khr_extended_async_copies extension.
The default implementation performs simple memory copies. It can be replaced by a customer implementation where the copies can be accelerated.
Required subgroup sizes for kernels - cl_intel_required_subgroup_size
oneAPI Construction Kit implements the cl_intel_required_subgroup_size extension.
This has a default implementation which does not report any available subgroup sizes for any device. The compiler will report an error for any kernel given the intel_reqd_sub_group_size attribute with a size that is not reported by the device. There is currently no handling of the attribute if a target does report a set of sub-group sizes, as no in-tree target does so.
Furthermore, all kernels report the same non-zero value for CL_KERNEL_SPILL_MEM_SIZE_INTEL.