- OpenCL CMake
- How OpenCL Concepts Are Mapped Onto ComputeMux
- OpenCL Extensions
- OpenCL C 1.2 -
khr_opencl_c_1_2
- Installable Client Driver (ICD) -
cl_khr_icd
- Kernel Debug -
cl_codeplay_kernel_debug
- Extra build options -
cl_codeplay_extra_build_options
- Kernel Exec Info -
cl_codeplay_kernel_exec_info
- Performance Counter -
cl_codeplay_performance_counters
- Soft Math -
cl_codeplay_soft_math
- Whole Function Vectorization -
cl_codeplay_wfv
- USM -
cl_intel_unified_shared_memory
- SPIR-V USM Generic Storage Class -
SPV_codeplay_usm_generic_storage_class
- Command-Buffers (Provisional) -
cl_khr_command_buffer
- Command Buffers: Mutable Dispatch -
cl_khr_command_buffer_mutable_dispatch
- Extended Async Copies -
cl_khr_extended_async_copies
- Required subgroup sizes for kernels -
cl_intel_required_subgroup_size
- OpenCL C 1.2 -
- OpenCL External Extensions
- OpenCL ICD Loader
- OpenCL Intercept Layer
- Tools
- Tests
The oneAPI Construction Kit implementation of the OpenCL spec provides
definitions for its entry points which must not be changed. In addition the
OpenCL also defines the OpenCL object types, such as
_cl_platform_id
, which hold the state required to implement those objects.
Each header and source file is named after the OpenCL object it implements. For
example, the _cl_context
object can be found in include/cl/context.h
and
source/context.cpp
. This pattern is used to group associated functionality in
the same place. All entry points that take a cl_context
as the first parameter
or return a cl_context
, such as clCreateContext
, can be found in these
source files. This pattern is repeated for all of the OpenCL API objects.
There are two special cases of note which relate to the
cl_mem
object. To Separate the implementation of buffer and image objects two new types inheriting From_cl_mem
have been added;_cl_mem_buffer
representingcl_mem
objects Created withclCreateBuffer
; and_cl_mem_image
forcl_mem
’s representing Images. These definitions and their associated API entry points can be found in The buffer and image header and source files.
All objects, in OpenCL these specified to be reference counted. The
implementation of reference counting is shared between all API objects and can
be found in the include/cl/base.h
header file. Every API object inherits from
the cl::base<T>
class template, which makes use of the Curiously Recurring
Template Pattern (CRTP) to avoid the introduction of a virtual function table.
This is an important point because the OpenCL ICD requires the first
sizeof(void*)
bytes of each API object to contain the ICD dispatch table, not
a C++ virtual function table.
Supported Features and Implementation Details
Documentation on the OpenCL API supported by the oneAPI Construction Kit and implementation details for effective use. The API includes all standard, non-optional OpenCL 1.2 APIs and language features. oneAPI Construction Kit is also compatible with some deprecated OpenCL version 1.0 and 1.1 APIs, as well as those APIs defined in implemented extensions.
Data Types
OpenCL headers provide datatypes which are guaranteed to be a consistent size. This is important since the size of C/C++ datatypes is implementation defined, which leads to non-portable code and discrepancies between host and program data.
Note that the use of half precision floating point scalar and vector types is
optional, enabled using the cl_khr_fp16
extension.
oneAPI Construction Kit’s doesn’t currently support this as half is not
implemented in our maths library. However the oneAPI Construction Kit CPU target does support
cl_khr_fp64
, which is the double precision floating point
extension. Use clGetDeviceInfo
for information about what
floating point extensions are supported for your target device.
Scalar Data Types
API type |
Size |
---|---|
|
8-bit signed integer |
|
8-bit unsigned integer |
|
16-bit signed integer |
|
16-bit unsigned integer |
|
32-bit signed integer |
|
32-bit unsigned integer |
|
64-bit signed integer |
|
64-bit unsigned integer |
|
16-bit IEEE 754 floating point number |
|
32-bit IEEE 754 floating point number |
|
64-bit IEEE 754 floating point number |
cl_bool
is also available but unlike the other cl_
types is not guaranteed
to be the same size as the bool
in kernels.
Vector Data Types
API type |
Size |
---|---|
|
vector of n 8-bit signed integers |
|
vector of n 8-bit unsigned integers |
|
vector of n 16-bit signed integers |
|
vector of n 16-bit unsigned integers |
|
vector of n 32-bit signed integers |
|
vector of n 32-bit unsigned integers |
|
vector of n 64-bit signed integers |
|
vector of n 64-bit unsigned integers |
|
vector of n 16-bit IEEE 754 floating point numbers |
|
vector of n 32-bit IEEE 754 floating point numbers |
|
vector of n 64-bit IEEE 754 floating point numbers |
Built-in vector data types are supported by the oneAPI Construction Kit even if the underlying compute device does not support any or all of the vector data types. Vector widths defined by the standard are 2, 3, 4, 8, and 16.
Platform Info
The OpenCL platform layer implements platform-specific features that allow applications to find OpenCL devices, device configuration information, and to create OpenCL contexts using one or more devices.
Use clGetPlatformInfo to query the platform for information
such as available extensions, and the name & version of the implementation.
oneAPI Construction Kit typically will have platform name ComputeAorta
and
vendor name Codeplay Software Ltd.
The version string will also contain both
the version of the oneAPI Construction Kit and the LLVM version built against.
Device Info
A device is a collection of compute units typically correspond to a GPU, a multi-core CPU, and other processors such as DSPs.
To find the available devices on a platform use
clGetDeviceIDs. oneAPI Construction Kit should contain a host CPU device
of type CL_DEVICE_TYPE_CPU
, as well as optionally other accelerators. DSPs
fall under devices type CL_DEVICE_TYPE_ACCELERATOR
.
All the information about a device can be queried with
clGetDeviceInfo. Including device specific extensions and
information regarding memory size and work group limits. The CL_DEVICE_NAME
of
oneAPI Construction Kit’s CPU device target will be ComputeAorta ARCH
, where
ARCH
is replaced with the platform architecture, e.g., x86_64
.
Profiles
OpenCL devices can report supporting "FULL_PROFILE"
or "EMBEDDED_PROFILE"
by
passing CL_DEVICE_PROFILE
as the param_name
to
clGetDeviceInfo. oneAPI Construction Kit detects which profile
a device supports using the following table, as specified in [OpenCL 1.2][#opencl-1.2].
If any of these limits fall below the "FULL_PROFILE"
value then the device
will report support for "EMBEDDED_PROFILE"
.
OpenCL Device Property |
FULL |
EMBEDDED |
---|---|---|
|
128 MB |
1 KB |
|
1024 |
256 |
|
64 KB |
1 KB |
|
8 |
4 |
|
32 KB |
1 KB |
|
1 MB |
1 KB |
|
128 |
8 |
|
8 |
1 |
|
16 |
8 |
|
8192 |
2048 |
|
8192 |
2048 |
|
2048 |
0 |
|
2048 |
0 |
|
2048 |
0 |
|
65536 |
2048 |
|
2048 |
256 |
As cl_device_id
’s are created from a mux_device_t
, which expose a different
set of properties, an implementation of [Mux][../modules/mux/spec] can
control which OpenCL profile is reported using the following property mappings.
OpenCL Device Property |
Mux Device Property |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Program Compilation
An OpenCL program consists of a set of kernels that are identified as functions
declared with the __kernel
qualifier in the program source. Each program
object can represent program source or binary.
oneAPI Construction Kit tends to delay final compilation until kernel run time, e.g. [clEnqueueNDRangeKernel][clEnqueueNDRange] , where all the scheduling information will be available. Therefore if a user runs the same kernel kernel twice with identical work loads and scheduling, the first run may suffer some initial latency from compilation.
clCreateProgramWithSource
cl_program clCreateProgramWithSource(cl_context context,
cl_uint count,
const char **strings,
const size_t *lengths,
cl_int *errcode_ret);
Creates an OpenCL program object from OpenCL C source, oneAPI Construction Kit does no compilation at this stage.
clCreateProgramWithBinary
cl_program clCreateProgramWithBinary(cl_context context,
cl_uint num_devices,
const cl_device_id *device_list,
const size_t *lengths,
const unsigned char **binaries,
cl_int *binary_status,
cl_int *errcode_ret);
Create an OpenCL program object from a binary. If the binary is a pre-compiled binary, then it may be compiled again, but this will have no effect.
clCreateProgramWithBuiltInKernels
cl_program clCreateProgramWithBuiltInKernels(cl_context context,
cl_uint num_devices,
const cl_device_id *device_list,
const char *kernel_names,
cl_int *errcode_ret);
Create an OpenCL program object with built-in kernels. This is useful if you have a configurable, but not programmable, accelerator with predefined computations it is optimized for. oneAPI Construction Kit currently doesn’t support any built-in kernels but this will change in the future.
clCompileProgram
cl_int clCompileProgram(cl_program program,
cl_uint num_devices,
const cl_device_id *device_list,
const char *options,
cl_uint num_input_headers,
const cl_program *input_headers,
const char **header_include_names,
void (CL_CALLBACK *pfn_notify) (
cl_program program,
void *user_data
),
void *user_data);
Compile the program object. If the program was created from source oneAPI Construction Kit runs the clang compiler frontend at this point.
clLinkProgram
cl_program clLinkProgram(cl_context context,
cl_uint num_devices,
const cl_device_id *device_list,
const char *options,
cl_uint num_input_programs,
const cl_program *input_programs,
void (CL_CALLBACK *pfn_notify) (
cl_program program,
void *user_data
),
void *user_data,
cl_int *errcode_ret);
Link one of more program objects. In the spec wording it mentions that this may
create an executable, however in oneAPI Construction Kit we delay the creation
of an executable until [clEnqueueNDRangeKernel
][clEnqueueNDRange] for
performance reasons. We do however link separate LLVM Modules together here.
clBuildProgram
cl_int clBuildProgram(cl_program program,
cl_uint num_devices,
const cl_device_id *device_list,
const char *options,
void (CL_CALLBACK *pfn_notify) (
cl_program program,
void *user_data
),
void *user_data);
Build, or compile and link, the program object. This is equivalent to a call to
clCompileProgram
followed by clLinkProgram
.
The options
string parameter can be used to pass compiler flags, including
some only available as Codeplay vendor extensions.
clGetProgramInfo
cl_int clGetProgramInfo(cl_program program,
cl_program_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);
Query the program object for information. One use of this is to retrieve
compiled binaries which can we used in future to pass to
clCreateProgramFromBinary
to save compilation
time. In this case oneAPI Construction Kit will return LLVM IR for the binary
format.
clGetProgramBuildInfo
cl_int clGetProgramBuildInfo(cl_program program,
cl_device_id device,
cl_program_build_info param_name,
size_t param_value_size,
void *param_value,
size_t *param_value_size_ret);
Query the program for the latest build information. If compilation failed this function can be used to retrieve the error log, which oneAPI Construction Kit will output this as a LLVM diagnostic.
Image Support
An image object is used to store a 1, 2, or 3 dimensional texture, frame-buffer or image. The elements of an image object are selected from a list of predefined image formats. Samplers can then be used to read from multi-dimensional images.
Not all OpenCL devices will support images, check the CL_DEVICE_IMAGE_SUPPORT
property from clGetDeviceInfo
to find out for a specific
device. CL_INVALID_OPERATION
will be returned as an error code from some of
the image API functions if there are no devices in the context that support
images. Types such as image1d_t
and sampler_t
will also be unavailable.
The oneAPI Construction Kit CPU device target does support images, but these are emulated in software rather than hardware accelerated, so not performant. Other oneAPI Construction Kit platform devices may or may not support images.
Debugging OpenCL kernels
Although a developer can easily create a debug build of an OpenCL application, this will only allow debugging of the host side code. Successful debugging of kernels themselves is more involved. The sections below document the steps needed for smooth debugging of kernels.
Build flags
To best debug an OpenCL kernel the build flags -g
, -S
, and -cl-opt-disable
should be set in the options to clBuildProgram()
, where -g
and -S
are
part of our cl_codeplay_extra_build_options
platform extension.
-g
enables emission of debug info. Without this flag the debugger user will not be able to set a breakpoint on the kernel name.-S <path/to/source/file>
points the generated debug info to a source file on disk so that the debugger can display kernel source code. If this file does not exist already the OCL runtime will create a file from the string passed toclCreateProgramWithSource()
, if this API call occurred.-cl-opt-disable
disables performance optimizations. Omitting this flag will result in an inability to reliably inspect frame variables and source step. Analogous to debugging C/C++ code that hasn’t been built with-O0
.
Viewing kernel source in the debugger
Note: This section was created before the
-S
build option was added. Mapping the source in the debugger can be done as a more invasive fall-back but the suggested technique is to utilize-S
.
It is problematic for a debugger to show source code for cl programs since they
are created from an API call to clCreateProgramWithSource()
. Where the
source string passed in could be programmatically generated, not just read
directly from a .cl file on disk.
As a placeholder the OCL runtime sets the source filename to kernel.opencl
,
and directory to wherever libOpenCl.so
was dynamically loaded into.
This path then gets propagated into the debug info and picked up by the
debugger. So if a debugger user wants to set a breakpoint on a file & line,
then the file will have to be kernel.opencl
. This could be an issue if
multiple kernels from different cl programs are being run.
More significant for the user experience is that the debugger can’t show the
source code line for the current pc address of a stopped process. A workaround
to this problem in the case where the .cl file does exist on disk is to rename
the file to kernel.opencl
. The directories may still not match however, so an
example fix would be to copy the source .cl file to /tmp/kernel.opencl
, then
remap the directory in the debugger to point to /tmp
. lldb
uses a path
substitution, while in gdb
you can provide a catch all directory to search for
files in. Note lldb
doesn’t substitute environmental variables here so instead
of using $HOME
or $USER
set the absolute path.
(lldb) settings set target.source-map /home/foo/Aorta/build /tmp
(gdb) directory /tmp
LLDB sample session
Example debugging session of running the convolution UnitCL test and debugging
its kernel. Note UnitCL needs to be passed the -g
, -S
, and -cl-opt-disable
options via the --unitcl_build_options
command line argument.
We start by setting a function breakpoint on the kernel name with
‘b convolution’, which won’t be resolved immediately since libOpenCL.so
hasn’t been dynamically loaded yet.
$ lldb -- build/bin/UnitCL \
--gtest_filter=Execution.Task_07_04_Convolution \
--unitcl_build_options="-g -cl-opt-disable -S OCL/test/UnitCL/kernels/task_07.04_convolution.cl"
(lldb) b convolution
Breakpoint 1: no locations (pending)
WARNING: Unable to resolve breakpoint to any actual locations.
(lldb) r
lldb
shows all the active threads which are currently stopped which makes the
output cluttered, but it’s important to note that the thread IDs displayed have
no relation to OpenCL work item IDs. To see all the threads in lldb
run
‘thread list’, while to select an individual thread ‘thread select’ can be used.
Depending on the OCL host implementation scheduling each process thread could
execute several work items, and so may hit a kernel breakpoint several times.
Next in our debug session we step over the line int x = get_global_id(0);
with the ‘thread step-over’ command, also aliased to ‘next’. Then by printing
the frame variables we can see that the global id of the current work item is
130.
(lldb) thread select 7
(lldb) thread step-over
(lldb) frame var
(float *) src = 0x0000000000d28e80
(float *) dst = 0x00000000007fd800
(int) x = 130
(int) width = 0
We can also set a breakpoint on a file line, where the file must be set as
task_07.04_convolution.cl
. Additionally we can narrow this breakpoint scope
down to a single work-item with the condition ‘x==4’, since x holds the global
id. When the breakpoint is hit lldb
can list multiple threads as stopped, and
the debugger user may have to cycle through the threads to find the specific one
where x==4.
(lldb) break set -f task_07.04_convolution.cl -l 20 -c "x == 4"
(lldb) break del 1
(lldb) continue
convolution(src=0x0000000000d28e80, dst=0x00000000007fd800)
17 sum += weight * src[x + i];
18 }
19 sum /= totalWeight;
-> 20 dst[x] = sum;
21 }
22 else
23 {
(lldb) print x
(int) $0 = 4
(lldb) print dst[x]
(float) $1 = 0
(lldb) next
(lldb) print dst[x]
(float) $2 = 95.9932632
GDB sample session
Here we run through the same debugging scenario as for lldb
, but using the
equivalent gdb
command syntax.
$ gdb --args build/bin/UnitCL \
--gtest_filter=Execution.Task_07_04_Convolution \
--unitcl_build_options="-g -cl-opt-disable -S OCL/test/UnitCL/kernels/task_07.04_convolution.cl"
(gdb) b convolution
(gdb) run
Unlike lldb
, gdb
only displays a single thread when the process is stopped.
But you can see them all with command ‘info threads’, and select the individual
thread with thread $tid
.
(gdb) thread 6
[Switching to thread 6 (Thread 0x7fffeb8f2700 (LWP 4090))]
#0 convolution (src=0xd28e80, dst=0x7fd800) at task_07.04_convolution.cl:7
7 int x = get_global_id(0);
(gdb) next
8 int width = get_global_size(0);
(gdb) info locals
x = 130
width = 0
(gdb) info args
src = 0xd28e80
dst = 0x7fd800
We can set a conditional breakpoint on a file line only for work-item 4 using
the below gdb
command syntax.
(gdb) break task_07.04_convolution.cl:20 if x == 4
(gdb) delete 1
(gdb) continue
Continuing.
[Switching to Thread 0x7fffeb8f2700 (LWP 4106)]
Breakpoint 2, convolution (src=0xd28e80, dst=0x7fd800) at task_07.04_convolution.cl:20
20 dst[x] = sum;
(gdb) info locals
sum = 0.833333313
totalWeight = 12
x = 4
width = 256
(gdb) print dst[x]
$1 = 0
(gdb) next
7 int x = get_global_id(0);
(gdb) print dst[x]
$2 = 0.833333313
Deprecated Entry Points
Deprecated OpenCL 1.0 and 1.1 functions which are implemented by the oneAPI Construction Kit for conformance and backwards compatibility.
clGetExtensionFunctionAddress
void* clGetExtensionFunctionAddress(const char *func_name);
Query the platform for address of extension function, deprecated in OpenCL 1.2.
Replaced by
clGetExtensionFunctionAddressForPlatform
.
clEnqueueWaitForEvents
cl_int clEnqueueWaitForEvents(cl_command_queue queue,
cl_uint num_events,
const cl_event *event_list);
Enqueue an event wait list, deprecated in OpenCL 1.2. Replaced by
clWaitForEvents
.
clEnqueueBarrier
cl_int clEnqueueBarrier(cl_command_queue queue);
Enqueue a barrier on the command queue, deprecated in OpenCL 1.2. Replaced by
clEnqueueBarrierWithWaitList
clEnqueueMarker
cl_int clEnqueueMarker(cl_command_queue queue,
cl_event *event);
Enqueue a marker on the command queue, deprecated in OpenCL 1.2. Replaced by
clEnqueueMarkerWithWaitList
clUnloadCompiler
cl_int clUnloadCompiler();
Unload the compiler, deprecated in OpenCL 1.2. This is a hint from the application and does not guarantee that the compiler will actually be unloaded by the implementation. As this is just a hint oneAPI Construction Kit ignores it.
Replaced by clUnloadPlatformCompiler
.
clCreateImage2D
cl_mem clCreateImage2D(cl_context context,
cl_mem_flags flags,
const cl_image_format *image_format,
size_t image_width,
size_t image_height,
size_t image_row_pitch,
void *host_ptr,
cl_int *errcode_ret);
Create a 2D image memory object, deprecated in OpenCL 1.2. Replaced by
clCreateImage
.
clCreateImage3D
cl_mem clCreateImage3D(cl_context context,
cl_mem_flags flags,
const cl_image_format *image_format,
size_t image_width,
size_t image_height,
size_t image_depth,
size_t image_row_pitch,
size_t image_slice_pitch,
void *host_ptr,
cl_int *errcode_ret);
Create a 3D image memory object, deprecated in OpenCL 1.2. Replaced by
clCreateImage
.