This document aims to provide an overview of the design of the oneAPI Construction Kit. Its primary goal is to give developers a grounding in the project structure and a good idea of where specific components reside within the directory structure.
Project Structure
A common structure is used as much as possible throughout the oneAPI Construction Kit repository. Implementations of open standards, such as OpenCL and Vulkan, exist as subdirectories of the source directory, while shared components, or modules, reside in the modules directory.
Throughout the repository the following layout is adhered to when applicable:
include
  the public interface of the component, except for APIs, where the interface is defined by a third party
source
  source code implementing the component
test
  source code for building test suites for testing the component
tools
  tools which enable standalone usage of the component
examples
  example applications detailing basic usage of the component
scripts
  utilities to aid with building and testing components
external
  contains external dependencies, usually with different license agreements
Modules
Many components of the oneAPI Construction Kit are designed to be reused by multiple open standard implementations or externally. These components are referred to as modules and can be found in the modules directory. Modules follow the same directory layout as the root directory, described above, with the external interface being found in the header files located in the include directory and the implementation located in the source directory. Not all modules have test suites, but those that are shared between projects within the oneAPI Construction Kit umbrella do.
Builtins
OpenCL C specifies over 10000 builtin functions, which OpenCL C programs rely on. The oneAPI Construction Kit defines builtin functions in the builtins module.
The declarations of all the OpenCL C builtin functions can be found in include/builtins/builtins.h; this includes both type and function declarations. Builtin function definitions are spread across multiple files: those implemented in plain OpenCL C can be found in source/builtins.cl, while builtins implemented using C++, to take advantage of templates, can be found in source/builtins.cpp. These files are automatically generated by running the bash script scripts/generate_header.sh.
In our build we create a Pre-Compiled Header (PCH) file for builtins.h and embed it inside our library. This is then used as an implicit header for all OpenCL C kernels during compilation. Additionally, the implementations of our builtins are compiled down to LLVM bitcode and also embedded, which is a substantial part of our build time. To reduce this latency, a separate toolchain specifically for compiling the builtins can be set using the CMake option CA_BUILTINS_TOOLS_DIR, which is useful for pointing to a release toolchain in a debug build. Alternatively, to avoid compiling the builtins completely, CA_EXTERNAL_BUILTINS can be set to ON and CA_EXTERNAL_BUILTINS_DIR set to point to the directory containing pre-generated builtins.
All builtins that implement math operations are provided by abacus, which is shared across multiple projects, and all builtins implementing image functionality are provided by libimg.
Compilation
The builtin functions are compiled offline into LLVM bitcode (.bc files) using clang, not the platform compiler used to build the OpenCL driver's shared library. This matches the frontend used to compile OpenCL C source code in an application using the oneAPI Construction Kit's OpenCL driver. These bitcode files are compiled into the OpenCL driver on platforms which support linking binary blobs. This is performed by the linker (an .rc file on Windows and a small .asm file accessing the data section on Linux), while the fallback mechanism transforms the binary into a header file containing a char array.
Compiling the builtins source code to LLVM bitcode is the key to being able to use multiple input languages, both OpenCL C and C++. The resulting bitcode files are linked together using llvm-link into a single bitcode file. This mechanism allows the builtins module to take advantage of C++ function overloading and function templates to increase code reuse, especially for the OpenCL C conversion and type casting builtins.
Additionally, the include/builtins/builtins.h header file is compiled into a pre-compiled header (.pch file). This is done so that the frontend compiler in the oneAPI Construction Kit does not have to compile the entire header file, which is over 11000 lines long, each time an application invokes clCompileProgram or clBuildProgram. The pre-compiled header, along with the bitcode files containing the definitions of the OpenCL C builtin functions, is embedded into the OpenCL driver.
In order to access the embedded bitcode and the pre-compiled header from within the OpenCL driver, the builtins module provides a static library containing the binary along with an API to make the binary accessible. These are contained in the include/builtins/bakery.h and source/bakery.cpp header and source files, so named because the binary is baked into the library.
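A minimal sketch of what this arrangement might look like: the embedded binary is exposed as a byte array plus a size, and a small accessor API hides the raw symbols. EmbeddedFile and getBuiltinsBinary are hypothetical names used for illustration, not the actual bakery interface, and the array contents here are just a stub.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stub standing in for the build-time generated symbols; the real data is
// the embedded LLVM bitcode or pre-compiled header.
const uint8_t builtins_bc[] = {0x42, 0x43, 0xc0, 0xde};  // LLVM bitcode magic
const size_t builtins_bc_size = sizeof(builtins_bc);

// Hypothetical accessor API so clients never touch the raw symbols directly.
struct EmbeddedFile {
  const uint8_t *data;
  size_t size;
};

EmbeddedFile getBuiltinsBinary() {
  return {builtins_bc, builtins_bc_size};
}
```

A client would call getBuiltinsBinary() and hand the resulting pointer and size to the compiler frontend, without needing to know how the binary was linked in.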
Abacus
Abacus is our high-precision math library, crafted especially for OpenCL's demanding precision requirements. Key features include:
OpenCL 1.2 floating point math functions.
Heavily optimized for GPU, DSP and vector architectures.
Satisfies the high precision requirements for OpenCL conformance.
High code quality and documentation for easy maintainability.
Abacus Integration
Abacus is integrated into the build process such that the .cl and .cpp files that implement the functionality are built into an LLVM bitcode file. We pass this module to ComputeMux Compiler backends via the builtins parameter of compiler::Target::init(). ComputeMux backends can then link against the bitcode file to bring in the definitions their kernels require.
libimg
Image support in OpenCL is optional, based on the CL_DEVICE_IMAGE_SUPPORT device property, but when enabled the libimg module provides functionality for the OpenCL API and implements the OpenCL C builtin functions. The image support provided by libimg is a software implementation intended for targeting CPU architectures, but it can also be used on other architectures which do not have dedicated texture hardware.
Note that libimg is actually a shared module; however, since it implements the OpenCL C image builtins, it resides within the builtins module alongside abacus.
libimg Integration
In order to build libimg it is necessary to supply a header called image_library_integration.h that defines the types and functions used by the oneAPI Construction Kit. This filename is important as it is hard-coded into the libimg source files.
Host and Validate
When image support is enabled, the image functions in the api module, such as api::CreateImage, call analogous functions declared in the include/libimg/host.h header file, in this example libimg::HostCreateImage. Along with the implementations of OpenCL API calls, a set of helper functions is provided to aid with the integration into the OpenCL driver; these functions serve various purposes related to the data layouts and memory offsets of the image.
Functions that implement the guts of OpenCL API entry points, such as libimg::HostFillImage, do not perform any validation of input parameters as defined in the OpenCL specification; instead, the validation code resides in the include/libimg/validate.h and source/validate.cpp header and source files.
Kernel
The implementations of the OpenCL C builtin image functions are declared in the header include/libimg/kernel.h, with the definitions living in the source/kernel.cpp source file; this code is only executed in OpenCL C kernels. For each OpenCL C builtin, such as float4 read_imagef(image3d_t, sampler_t, int4), there is an analogous function, in this case Float4 __Codeplay_read_imagef_3d(Image, Sampler, Int4). Note that the Image type does not contain the dimensionality of the image; instead, this has been moved into the function name to retain the information.
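As a toy illustration of this naming scheme, the dimensionality lost from the libimg Image type is folded into the substituted function name. The exact mangling rules used by the real substitution are an assumption here; this sketch only demonstrates the idea.

```cpp
#include <cassert>
#include <string>

// Build the substituted name for an image builtin: the "__Codeplay_" prefix
// and the "_<N>d" suffix encode information that the Image type itself
// no longer carries. Illustrative only, not the actual mangling code.
std::string substituteImageBuiltinName(const std::string &builtin, int dims) {
  return "__Codeplay_" + builtin + "_" + std::to_string(dims) + "d";
}
```

For example, the 3D variant of read_imagef maps to "__Codeplay_read_imagef_3d", matching the example above.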
Because the libimg function signature is different from the signature of the builtin produced by the compiler frontend, the builtin must be replaced. Replacement of the builtin is performed in the compiler::ImageArgumentSubstitutionPass pass, documented above.
printf
The oneAPI Construction Kit's printf implementation works by adding an extra buffer argument to kernels and then replacing calls to printf with code that loads the arguments of the printf call into the buffer. Internally, the buffer argument is created and added to the kernel just before ND-range commands are executed; just after the ND-range command has finished, the buffer is read and its contents are unpacked and printed out using the host printf.
printf buffer
The size of the buffer is provided by the device through the CL_DEVICE_PRINTF_BUFFER_SIZE property.
Data in the printf buffer is organized in the following way:
[ wg0: [<length><overflow><id><args...> ...] | ... | wgn: [<length><overflow><id><args ...> ...] ]
First, the buffer is split per work group so that each work group has its own chunk of buffer to use; this is necessary since we can't synchronize between work groups. These chunks must be at least 8 bytes; if they are not, the kernel execution will fail, returning CL_OUT_OF_RESOURCES. This limitation only affects programs that use printf.
Then, in each work group's buffer chunk, the first 8 bytes are used to store the length of the data that was stored as well as the amount of this length that actually overflowed; these two values are used to synchronize between work items inside the work group. These 8 bytes are initialized on the host for each work group to 8 for the length (accounting for these 8 bytes) and to 0 for the overflow value (no overflow in the beginning).
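The host-side initialization described above can be sketched as follows. The 32-bit field widths and the function name are assumptions for illustration; only the layout (even per-work-group split, length initialized to 8, overflow to 0) follows the description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split the printf buffer evenly between work groups and initialize each
// chunk's 8-byte header on the host.
void initPrintfBuffer(std::vector<uint8_t> &buffer, size_t num_work_groups) {
  const size_t chunk_size = buffer.size() / num_work_groups;
  assert(chunk_size >= 8 && "each chunk must at least hold its 8-byte header");
  for (size_t wg = 0; wg < num_work_groups; ++wg) {
    uint32_t *header =
        reinterpret_cast<uint32_t *>(buffer.data() + wg * chunk_size);
    header[0] = 8;  // length: starts at 8 to account for the header itself
    header[1] = 0;  // overflow: nothing has overflowed yet
  }
}
```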
In work-group synchronization and overflows
In this section, "the buffer" refers to the chunk of buffer allocated to a single work group.
Data in the buffer for a work group is organized in the following way:
[ <length><overflow><id><arg><arg>...<id>... ]
The first field, length, stores the amount of data in bytes that printf calls attempted to write to the buffer; the second field, overflow, stores the amount of data that printf calls couldn't write to the buffer because they overflowed.
By subtracting the overflow field from the length field, we get exactly the amount of meaningful data that the buffer contains.
Each printf call first calls atomic_add on the length field with the amount of space it needs in the buffer. This effectively reserves a chunk of buffer for the printf call, starting at the value returned by the atomic_add and of the size required by the printf call.
Then the printf call will check whether the reserved chunk of buffer is actually within the bounds of the buffer. If it isn't, the printf call is overflowing; in this case the call will not write anything to the buffer and will return -1.
In addition, the overflowing printf call will also atomically add to the overflow field the amount of data it wanted to write to the buffer. This is necessary because at this point the length of this call is accounted for in the length field, but we can't simply subtract the size of the call back from the length field in a thread-safe way; instead, we keep track of how much of the length field is data that wasn't actually written to the buffer.
This also means that if a printf call overflows, every following printf call will overflow as well, because after a call has overflowed the length field will hold a value bigger than the size of the buffer, so the chunk of buffer that new printf calls attempt to reserve will necessarily be out of bounds. It also means that the part of the buffer that was reserved by the first printf call to overflow will be left unused.
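The reservation scheme above can be modelled on the host, with std::atomic standing in for the OpenCL atomics. All names and field widths here are illustrative; only the logic (atomic reservation, bounds check, overflow accounting) follows the description.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Model of one work group's 8-byte chunk header.
struct ChunkHeader {
  std::atomic<uint32_t> length{8};    // includes the 8 header bytes
  std::atomic<uint32_t> overflow{0};  // bytes that could not be written
};

// Returns the reserved offset within the chunk, or -1 if the call overflowed.
int64_t reservePrintfSpace(ChunkHeader &header, uint32_t chunk_size,
                           uint32_t wanted) {
  const uint32_t offset = header.length.fetch_add(wanted);
  if (offset + wanted > chunk_size) {
    // The length field now over-counts by `wanted`; record that in overflow
    // so the host can recover the meaningful size as length - overflow.
    header.overflow.fetch_add(wanted);
    return -1;
  }
  return offset;
}
```

Note how a second over-sized call keeps failing, because the length field never shrinks back; that is exactly the "every following printf call will overflow as well" behaviour described above.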
Argument packing
This section describes how a printf call writes its data into the buffer chunk it reserved.
First, the printf call writes four bytes corresponding to its id. The id is a value determined at compile time and is used by the host to recognize the printf calls and deduce how to unpack the argument data.
It will then write its argument data, which may be nothing if the printf call doesn't need to send data back to the host (typically calls with just the format string or just string arguments). The argument data, if present, contains the arguments packed one after the other in the buffer, as described by the printf descriptor matching the id of the printf call, which is created on the host at compile time.
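A sketch of this layout follows. The helper name and the byte-vector representation of arguments are assumptions; only the shape, a 4-byte id followed by tightly packed argument bytes, follows the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Pack one printf call: the 4-byte id first, so the host can find the
// matching descriptor, then each argument's bytes one after the other.
std::vector<uint8_t> packPrintfCall(
    uint32_t id, const std::vector<std::vector<uint8_t>> &args) {
  std::vector<uint8_t> out(sizeof(uint32_t));
  std::memcpy(out.data(), &id, sizeof(id));
  for (const auto &arg : args) {
    out.insert(out.end(), arg.begin(), arg.end());
  }
  return out;
}
```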
Format string and string arguments
During compilation, we go over all the printf calls and validate their format strings. If they are invalid, we simply replace their return value with -1, as is mandated by the specification.
If a printf call is valid, we give it an id and store data about it. Specifically, we store the format string and the string arguments; since these are known at compile time, there is no need to transfer them from the device. We also store information about the arguments of the printf call, which allows us to properly interpret the data retrieved from the device.
The compiler also transforms the OpenCL C printf format string into a C99 printf format string that can directly be used on the host.
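One concrete case of this transformation is OpenCL C's vector specifiers (e.g. %v4f), which have no C99 equivalent and are printed as comma-separated elements. The toy expansion below handles only a single-digit vector width and a single-letter conversion, with no length modifiers; it is not the actual compiler code.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Expand "%vNc" into N comma-separated "%c" scalar specifiers that the host
// C99 printf can consume directly, e.g. "%v4f" -> "%f,%f,%f,%f".
std::string expandVectorSpecifiers(const std::string &fmt) {
  std::string out;
  for (size_t i = 0; i < fmt.size(); ++i) {
    if (fmt[i] == '%' && i + 3 < fmt.size() && fmt[i + 1] == 'v' &&
        isdigit(static_cast<unsigned char>(fmt[i + 2]))) {
      const int n = fmt[i + 2] - '0';
      const char conversion = fmt[i + 3];
      for (int j = 0; j < n; ++j) {
        out += '%';
        out += conversion;
        if (j + 1 < n) out += ',';
      }
      i += 3;  // skip past "vNc"
    } else {
      out += fmt[i];
    }
  }
  return out;
}
```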
Limitations
The * specifier for width and precision is not supported.
The buffer is split per work group, so a high number of work groups will greatly limit the amount of space available for each work group, even if only one of them ever prints.
The host and the device are assumed to have the same endianness.
Arm denormal floating point support
Single precision: no support for denormals, they are flushed to zero.
Double precision: denormals are supported.
For floating point operations on Arm, we have access to the VFP, which is able to run single and double precision operations and which can be configured to either support denormals or flush them to zero. We also have the faster NEON SIMD extension, which can run vectors of single precision operations but doesn't have any denormal support (denormals are flushed to zero).
The OpenCL 1.2 specification mandates that if doubles are supported, denormal double precision floating points must be supported as well, so because we want double support we can’t disable denormal support altogether.
So we enable the neon LLVM feature to run single precision vector floating point operations with NEON, and we also enable the neonfp LLVM feature to force the use of NEON for scalar single precision floating points as well. With this we can then enable denormal support on the VFP for double precision floating point operations.
Note that in this setup, scalar single precision floating points are not fully flushed to zero, as some basic operations are still run on the VFP, which has denormal support enabled.
ComputeMux Runtime
The mux module defines an API layer providing an interface between hardware target specific code and general OpenCL implementation code. The mux API is set up to support multiple targets, one of which is the host CPU target described below. mux is a shared module and must support multiple open standards, not just OpenCL. For more detail see this section.
Documentation on how the OpenCL API maps onto the ComputeMux spec in our implementation is also available here.
The include/mux/mux.h header file is generated, using a Python script, from the tools/api/mux.xml schema. The API defined in this header is the public API used in OpenCL code. Additional files are also generated from the schema into the build directory, where you will find: the include/mux/select.h header, which marshals the selection of the desired mux target; and the include/mux/config.h header, which defines an array of all targets' device creation entry points used to initialize each target.
Each entry point in the API performs parameter checking before passing on the inputs to the selected target. For example, the error checking for the muxCreateBuffer entry point can be found in the source/buffer.cpp source file. The muxSelectCreateBuffer inline function is defined in the muxSelect.h header; it selects the desired mux target based on the device->id member of the mux_device_t object.
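The generated selection layer can be pictured as a table of per-target entry points indexed by the device id. Everything below is a simplified stand-in with illustrative names, not the generated code.

```cpp
#include <cassert>
#include <cstdint>

// Per-target entry point signature for a hypothetical buffer-creation call.
using create_buffer_fn = int (*)(uint64_t size);

int hostCreateBuffer(uint64_t) { return 0; }   // stand-in for the host target
int otherCreateBuffer(uint64_t) { return 1; }  // stand-in for a second target

// Table of each target's entry point, indexed by device id.
const create_buffer_fn kCreateBufferTable[] = {hostCreateBuffer,
                                               otherCreateBuffer};

struct mux_device {
  uint32_t id;  // identifies the target that created this device
};

// Dispatch to the target selected by device.id, as described above.
int muxSelectCreateBuffer(const mux_device &device, uint64_t size) {
  return kCreateBufferTable[device.id](size);
}
```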
A specification for mux is available describing, in more detail, the purpose of each entry point, its valid usage, and its expected error codes.
ComputeMux Compiler
The compiler module defines an API layer and a set of C++ classes providing a compiler suite for ComputeMux Runtime targets.
The compiler module is structured as a set of virtual interfaces and a loader library, with a number of concrete implementations for different ComputeMux targets. The virtual interfaces reside in include/compiler/*.h, the library entry point resides in library/include/library.h and library/source/library.cpp, the dynamic loader library resides in loader/include/loader.h and loader/source/loader.cpp, and the various implementations reside in targets/*/. More information on the structure can be found here.
All compiler implementations report a static compiler::Info object describing the mux_device_info_t that it targets and what features are available. An implementation can be selected by calling either compiler::compilers() or compiler::getCompilerForDevice().
A specification for compiler is available describing, in more detail, the purpose of each entry point, its valid usage, and its expected error codes.
Host
The host module is an implementation of the mux and compiler APIs targeting the host system's CPU; this includes targets such as x86, Arm, and AArch64. Documentation of the implementation detail of host can be found here. host is also shared with other oneAPI Construction Kit projects outside of OpenCL.
Following the same file structure as mux, host lays out code on a per-object basis. The host::queue_s, which inherits from mux_queue_s, is declared in the include/host/queue.h header file; its definition, and the entry points acting on it, are defined in the source file source/queue.cpp.
Vecz
Vecz is an LLVM IR level SPMD (Single Program Multiple Data) vectorizer. It's contained in a module so that it can be built as a standalone library and shipped as a separate product. This does not stop it being integrated into other modules, such as host, which utilizes the vectorizer.
For detailed information about the design and implementation of VECZ, see its documentation.
Cargo
The cargo module contains a collection of Standard Template Library (STL) like containers that take into consideration the specific needs of the oneAPI Construction Kit. The oneAPI Construction Kit embeds LLVM within the driver, and LLVM requires being built without exceptions. The constructors of containers like std::vector perform allocations from the free store and then immediately dereference them, yet with exceptions disabled there is no error reporting mechanism for allocation failures. The containers in cargo aim to never allocate in a constructor; this allows an error to be returned from member functions which may perform an allocation.
The cargo::small_vector class template, inspired by llvm::SmallVector, provides a std::vector like container with a tunable size small buffer optimization. Member functions like std::vector::insert, which are specified to return an iterator, may also perform an allocation, which might fail. In order for cargo::small_vector::insert to maintain a familiar API it returns a cargo::error_or<iterator> object; this is a class which either contains a suitable error code or the value of the iterator.
Being able to return an error code or the desired value is a step in the right direction; however, we can do better. Member functions in cargo that return error codes also have the [[nodiscard]] attribute specified on compilers which support this functionality. Failing to check the return value of functions marked with [[nodiscard]] results in an error, guarding against mistakenly not checking for an error code.
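The two design points above, no allocation in constructors and [[nodiscard]] error codes from allocating members, can be sketched with a toy container. cargo's real API differs; this only illustrates the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

enum class error { success, bad_alloc };

class int_vector {
  int *data_ = nullptr;  // no allocation until the first push_back
  size_t size_ = 0;
  size_t capacity_ = 0;

 public:
  int_vector() = default;  // constructor allocates nothing, so it cannot fail
  ~int_vector() { std::free(data_); }
  int_vector(const int_vector &) = delete;
  int_vector &operator=(const int_vector &) = delete;

  // The allocating member reports failure through its return value; with
  // [[nodiscard]], silently ignoring that value is a compile-time error.
  [[nodiscard]] error push_back(int value) {
    if (size_ == capacity_) {
      const size_t new_capacity = capacity_ ? capacity_ * 2 : 4;
      int *p =
          static_cast<int *>(std::realloc(data_, new_capacity * sizeof(int)));
      if (!p) return error::bad_alloc;  // no exceptions needed
      data_ = p;
      capacity_ = new_capacity;
    }
    data_[size_++] = value;
    return error::success;
  }

  size_t size() const { return size_; }
  int operator[](size_t i) const { return data_[i]; }
};
```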
Serialization Format
Serialized binaries have the following format: a null terminated string "codeplay" at the start, then a kernel header which contains the kernel details, followed by the ELF data for the kernels.
Pseudocode:
/* Binary Prefix */
char[strlen("codeplay") + 1];
/* Kernel Header */
uint32_t type;
uint64_t number_of_printf_calls;
for (number_of_printf_calls) {
  uint64_t format_string_length;
  char[format_string_length] format_string;
  uint64_t types_length;
  uint32_t[types_length] types;
  uint64_t number_of_strings;
  for (number_of_strings) {
    uint64_t string_length;
    char[string_length] string;
  }
}
uint64_t number_of_kernels;
for (number_of_kernels) {
  uint32_t number_of_arguments;
  for (number_of_arguments) {
    uint32_t argument_type;
    char has_meta_data; /* 1 or 0 */
    if (has_meta_data) {
      uint32_t address_qualifier;
      uint32_t access_qualifier;
      uint32_t type_qualifier;
      uint64_t type_name_length;
      char[type_name_length] type_name_string;
      uint64_t argument_name_length;
      char[argument_name_length] argument_name_string;
    }
  }
  uint64_t[3] reqd_work_group_size;
  uint64_t kernel_name_length;
  char[kernel_name_length] kernel_name;
}
/* Rest of the file is ELF data for the kernels */
Note:
De-serialization has two contact points within CA: clc, our offline compiler, and the compiler itself.
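A sketch of how a de-serializer might consume the start of this format, checking the null-terminated "codeplay" prefix and reading the first fixed-width header fields. Endianness and bounds checking are simplified for illustration, and the Reader type is a hypothetical helper, not the actual de-serialization code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Reader {
  const uint8_t *p;
  const uint8_t *end;

  // Consume the binary prefix: "codeplay" plus its null terminator (9 bytes).
  bool expectPrefix() {
    const char magic[] = "codeplay";
    if (end - p < static_cast<std::ptrdiff_t>(sizeof(magic))) return false;
    if (std::memcmp(p, magic, sizeof(magic)) != 0) return false;
    p += sizeof(magic);
    return true;
  }

  // Fixed-width field readers for the kernel header.
  uint32_t readU32() {
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    p += sizeof(v);
    return v;
  }

  uint64_t readU64() {
    uint64_t v;
    std::memcpy(&v, p, sizeof(v));
    p += sizeof(v);
    return v;
  }
};
```

After expectPrefix(), readU32() yields the `type` field and readU64() the `number_of_printf_calls` count, following the pseudocode layout above.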