The ComputeMux Compiler is an OpenCL C, SPIR and SPIR-V compiler that consumes the source code or IL provided by an application and compiles it into an executable that can be loaded by the ComputeMux Runtime.
The module aims to provide a boundary beyond which no LLVM type definitions pass in order to keep logical concerns separate.
The compiler module is structured as a set of virtual interfaces and a loader
library, with a number of concrete implementations for different ComputeMux
targets. The virtual interfaces reside in
include/compiler/*.h, the library
entry point resides in
library/source/library.cpp, the dynamic loader library resides in
loader/source/loader.cpp, and the various
implementations reside in their respective target directories.
Dynamic vs Static loading
The compiler is designed to be used either directly as a static library, or indirectly through a loader library.
The simplest way to use the compiler library is to link with the
compiler-static target. Through this target, the compiler is accessed through
the compiler/library.h header. This target is unavailable if the CMake option
CA_RUNTIME_COMPILER_ENABLED is set to OFF.
A more flexible option is to instead link with the
compiler-loader target. Through this target, the compiler is accessed through
the compiler/loader.h header. This header is similar to compiler/library.h,
however each method additionally requires a handle to the loaded compiler
library.
The purpose of the loader is to provide a compiler interface that will be
available when compiling the oneAPI Construction Kit regardless of the value of
CA_RUNTIME_COMPILER_ENABLED, and regardless of whether the
compiler is loaded at runtime or linked statically. If
CA_RUNTIME_COMPILER_ENABLED is set to ON then the
compiler loader can operate in two different ways:
If CA_COMPILER_ENABLE_DYNAMIC_LOADER is set to ON, then
compiler::loadLibrary will look for the compiler shared library
(e.g. libcompiler.so on Linux) in the default library search paths, depending
on the platform. If the environment variable CA_COMPILER_PATH is set, then its
value will be used as the library name instead. Additionally, if
CA_COMPILER_PATH is set to an empty string, then compiler::loadLibrary will
skip loading entirely and will operate as if no compiler is available.
In this configuration, targets which depend on compiler-loader should also add
the compiler target (the compiler shared library) as a dependency; see
source/cl/CMakeLists.txt for an example.
If CA_COMPILER_ENABLE_DYNAMIC_LOADER is set to OFF, then
compiler-loader will transitively depend on the static compiler library, and
compiler::loadLibrary will instead immediately return an instance of
compiler::Library that references the static functions directly.
If CA_RUNTIME_COMPILER_ENABLED is set to OFF, then
compiler::loadLibrary will always return nullptr, and therefore the compiler
will be disabled.
By default, the oneAPI Construction Kit is configured with
CA_RUNTIME_COMPILER_ENABLED set to ON and
CA_COMPILER_ENABLE_DYNAMIC_LOADER set to OFF.
Selecting a compiler implementation
A compiler implementation is represented by a singleton instance of a
compiler::Info object. A list of all available compilers can be obtained from
the compiler library, and can be used to select the relevant compiler for a
particular device. The compiler::Info struct (include/compiler/info.h)
describes a particular compiler implementation that can be used to compile
programs for a particular device. Info contains information about the
compiler's capabilities and metadata, and additionally acts as an interface for
creating compiler::Target instances.
The compiler::Context interface (
include/compiler/context.h) serves as
an opaque wrapper over the LLVM context object. This object can also contain
other shared state used by compiler modules, and contains a mutex that is locked
when interacting with a specific instance of LLVM.
The compiler::Target interface (
include/compiler/target.h) represents a
particular target device to generate machine code for. This object is also
responsible for creating instances of
compiler::Module (described below).
The compiler::Module interface (
include/compiler/module.h) is responsible
for driving the compilation process from source code all the way to machine
code. It acts as a container for LLVM IR by wrapping the LLVM Module object, and
executes the required passes.
Compile OpenCL C
The clang frontend is instantiated in the
compiler::Module::compileOpenCLC member function; this is where:
The OpenCL C language options are specified to the frontend
User specified macro definitions and include directories are set
Mux device force-include headers (if present) are set
A diagnostic handler is provided to report compilation errors
This compilation stage also introduces the pre-compiled builtins header
providing the OpenCL C builtin function declarations to the frontend.
Compilation occurs when the
clang::EmitLLVMOnlyAction is invoked, after which
ownership of the resulting
llvm::Module is transferred to the
compiler::Module to be used in the next stage. Any errors occurring
during compilation are returned in the error log specified during the creation
of the compiler::Module, where they can be queried by the application.
In OpenCL, the
compiler::Module::compileOpenCLC member function directly implements
clCompileProgram, but is also invoked by clBuildProgram.
The compiler::Module::compileSPIRV member function implements the SPIR-V
frontend. First, the SPIR-V module is handed to a SPIR-V translator
to turn it into an
llvm::Module, then some additional fixup passes are applied.
The compiler::Module::compileSPIR member function is responsible for
translating SPIR bitcode to an
llvm::Module by running additional fixup passes.
Finalization is the final compilation stage, which executes any remaining LLVM
passes and gets the module ready to be passed to the backend implementation.
This is where the majority of the LLVM passes are run, once again on a clone of
the
llvm::Module owned by the
compiler::Module object. Once the
llvm::PassManager has run all of the desired passes, the LLVM module is
ready to be turned into machine code, either through
compiler::Module::createBinary, or deferred until runtime through the
compiler::Kernel interface.
The compiler::Kernel interface (
include/compiler/kernel.h) represents a
single function entry point in a finalized
compiler::Module. Its main purpose
is to provide an opportunity for the backend to perform optimizations and code
generation as late as possible. Most of the work is driven by the
compiler::Kernel::createSpecializedKernel method, which creates a Mux runtime
kernel potentially optimized for a set of execution options that will be passed
to it during execution.
OpenCL C Passes
The compiler module provides a number of LLVM passes which are specific to
processing the LLVM IR produced by clang after compiling OpenCL C source code.
The IR is processed into a form that the backend can consume. The passes are
described immediately below in the order they are executed by the LLVM pass
manager.
The OpenCL standard defines an optional
-cl-fast-relaxed-math flag that can be
set when building programs, allowing optimizations on floating point arithmetic
that could violate the IEEE-754 standard. When this flag is used we run the LLVM
module level pass
FastMathPass to perform these optimizations straight after
frontend parsing from clang.
First the pass looks for any
llvm::FPMathOperator instructions and for those
found sets the
llvm::FastMathFlags attribute to enable all of:
Unsafe algebra - Operation can be algebraically transformed.
NaNs - Arguments and results can be treated as non-NaN.
Infs - Arguments and results can be treated as non-Infinity.
No Signed Zeros - Sign of zero can be treated as insignificant.
Allow Reciprocal - Reciprocal can be used instead of division.
As well as the above,
compiler::FastMathPass replaces maths and geometric
builtin functions with fast variants. Any math builtin functions which have a
native equivalent are replaced with the native function, specified as having an
implementation-defined maximum error. For example,
exp2(float4) is replaced with
native_exp2(float4). The geometric functions distance, length, and
normalize are all defined in
OpenCL as having fast variants such as
fast_normalize which use reduced precision maths. If any of these functions
are present we also replace them with the relaxed alternative.
These builtin replacements are done by searching the LLVM module for call instructions which invoke the mangled name of a builtin function we want to replace. If the fast version of the builtin isn’t already in the module, i.e. it wasn’t called explicitly somewhere else, then we also need to add a function declaration for the mangled name of the fast builtin. Finally a new call instruction is created invoking the fast function declaration and the old call it replaces is deleted.
Bit Shift Fixup
LLVM IR does not define the results of oversized shift amounts, however SPIR does. As a result shift instructions need to be updated to perform a ‘modulo N’ by the shift amount prior to the shift operation itself, where N is the bit width of the value to shift.
BitShiftFixupPass implements this as an LLVM function pass iterating over all
the function's instructions looking for shifts. For each shift found, the pass
uses the first operand to work out 'N' for the modulo based on the bit width of
the operand type. If, however, the shift amount from the second operand is less
than N, then we can skip the shift without inserting a modulo operation since
the shift is not oversized. We can also skip shift instructions that already
have the modulo applied, which can happen if the SPIR module was created by
clang. Otherwise the pass creates a modulo by generating a 'logical and'
instruction with operands
N-1 and the original shift amount; this masked value
is then used to replace the second operand of the shift.
The compiler pass
SoftwareDivisionPass is a function level pass designed to
prevent undefined behaviour in division operations. To do this the pass adds
runtime checks using
llvm::CmpInst instructions for two specific cases: divide
by zero, and
INT_MIN / -1. Because these cases invoke undefined behaviour, if
one of them is detected we are free to update the behaviour of the divide
operation. In both cases we set the divisor operand of the divide instruction
to
+1, using an
llvm::SelectInst to choose between the original operand and +1 based on the
result of our checks.
Since IEEE-754 defines these error cases for floating point types, our runtime
checks only need to be applied to integer divides. This is ensured in the pass
by checking if the instruction opcode is an integer division or remainder
opcode, whereas floating point divide instructions will have opcode FDiv.
Image Argument Substitution
OpenCL image builtin calls taking opaque image types are replaced with calls into the image library.
A manual implementation of LLVM's MemToReg pass, which promotes allocas
whose only uses are loads and stores to register references. This is needed
because, after LLVM 5.0,
llvm::MemToReg has regressed and is not removing all
the allocas it should be.
BuiltinSimplificationPass is a module level pass for simplifying builtin
function calls. The pass performs two kinds of optimization on builtins:
Converts builtins to more efficient variants where possible (for example, a call to the math function
pow(x, y), where
y is a constant that is representable by an integer, will be converted to a call to
pown).
Replaces builtins whose arguments are all constant with a folded result (for example, a call to the math function
cos(x), where
x is a constant, will be replaced by a new constant value that is the calculation of the cosine of
x).
Of the myriad of architectures that have ComputeMux back ends, most do not have
access to an implementation of
printf whereby they can route a call to
printf within a kernel to the
stdout of the process running on the host CPU.
To enable our ComputeMux back ends to call
printf, we provide an optimized
software implementation. An additional kernel argument buffer is implicitly
added to any kernel that uses
printf, and our implementation of
printf that is run on the ComputeMux backend will write the results of the
printf calls
into this buffer instead. Then, when the kernel has completed its execution, the
data that was written to this buffer is streamed out on the host CPU processor.
CombineFPExtFPTruncPass is a function level pass, rather than a module pass,
which removes pairs of
FPExt and
FPTrunc instructions that cancel each other out. This is
used after the
printf replacement pass because var-arg float arguments
will be expanded to double by clang even if the device doesn't support doubles.
So if the device doesn't support doubles, the
printf pass will truncate those
parameters back to float.
CombineFPExtFPTruncPass will then find and remove the
fpext (added by clang) and
fptrunc (added by the
printf pass) pairs to
get rid of the doubles.
The pass is implemented by iterating over all the instructions looking for any
llvm::FPExtInst instructions. If one is found then we check its uses: if the
fpext is unused, we remove it. Otherwise, if the instruction only has one use
and that use is an
llvm::FPTruncInst, then we can replace all uses of the
fptrunc with the first operand of the
fpext and delete both instructions.
In clang the
convergent attribute can be set on a function to indicate to
the optimizer that the function relies on cross work-item semantics. For
OpenCL we need this attribute to be set on the barrier function, for example,
since it's used to control the scheduling of threads. Recent versions of clang
will proactively mark such functions in OpenCL C kernels as convergent, but
we also set the attribute implicitly in the builtins header out of an abundance
of caution.
This pass iterates over all the functions in the module, including
declarations, requiring the pass to be a module pass instead of a function
pass. If the function inspected may be convergent, identified by the compiler's
BuiltinInfo analysis, then we assign the convergent
attribute to it. When the pass encounters a convergent function, all functions
calling that function are transitively marked convergent.