The ComputeMux Compiler is an OpenCL C, SPIR and SPIR-V compiler that consumes the source code or IL provided by an application and compiles it into an executable that can be loaded by the ComputeMux Runtime.
The module aims to provide a boundary beyond which no LLVM type definitions pass in order to keep logical concerns separate.
Structure
The compiler module is structured as a set of virtual interfaces and a loader library, with a number of concrete implementations for different ComputeMux targets. The virtual interfaces reside in include/compiler/*.h, the library entry point resides in library/include/library.h and library/source/library.cpp, the dynamic loader library resides in loader/include/loader.h and loader/source/loader.cpp, and the various implementations reside in targets/*/.
Dynamic vs Static loading
The compiler is designed to be used either directly as a static library, or indirectly through a loader library.
Static Library
The simplest way to use the compiler library is to link with the compiler-static target. Through this target, the compiler is accessed through the compiler/library.h header. This target is unavailable if CMake is configured with CA_RUNTIME_COMPILER_ENABLED set to OFF.
Dynamic Loader
A more flexible way to use the compiler is to instead link with the compiler-loader target. Through this target, the compiler is accessed through the compiler/loader.h header. This header is similar to compiler/library.h; however, each method additionally requires a compiler::Library object, created using compiler::loadLibrary.
The purpose of the loader is to provide a compiler interface that will be available when compiling the oneAPI Construction Kit regardless of the value of CA_RUNTIME_COMPILER_ENABLED, and regardless of whether the compiler is loaded at runtime or linked statically.
If CA_RUNTIME_COMPILER_ENABLED is set to ON, then the compiler loader can operate in two different ways:
- If CA_COMPILER_ENABLE_DYNAMIC_LOADER is set to ON, then compiler::loadLibrary will look for compiler.dll (Windows) or libcompiler.so (Linux) in the default library search paths, depending on the platform. If the environment variable CA_COMPILER_PATH is set, then its value will be used as the library name instead. Additionally, if CA_COMPILER_PATH is set to an empty string, then compiler::loadLibrary will skip loading entirely and will operate as if no compiler is available. In this configuration, targets which depend on compiler-loader should also add the compiler target (the compiler shared library) as a dependency using add_dependencies. See source/cl/CMakeLists.txt for an example.
- If CA_COMPILER_ENABLE_DYNAMIC_LOADER is set to OFF, then compiler-loader will transitively depend on compiler-static, and compiler::loadLibrary will instead immediately return an instance of compiler::Library that references the static functions directly.
If CA_RUNTIME_COMPILER_ENABLED is set to OFF, then compiler::loadLibrary will always return nullptr, and therefore the compiler will be disabled.
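Below is a minimal usage sketch of the loader entry point. The pointer-like return type of compiler::loadLibrary and the way it is checked here are assumptions made for illustration; the real declarations live in compiler/loader.h:

```cpp
// Hypothetical usage sketch; verify the actual signatures against
// compiler/loader.h before relying on them.
#include <compiler/loader.h>

#include <cstdio>

void initializeCompiler() {
  // loadLibrary yields a null result when CA_RUNTIME_COMPILER_ENABLED is OFF,
  // when CA_COMPILER_PATH is set to an empty string, or when the compiler
  // shared library cannot be found.
  auto library = compiler::loadLibrary();
  if (!library) {
    std::puts("no compiler available: running in offline-only mode");
    return;
  }
  // With a valid compiler::Library, the compiler entry points described in
  // the following sections become available.
}
```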
By default, the oneAPI Construction Kit is configured with CA_RUNTIME_COMPILER_ENABLED set to ON and CA_COMPILER_ENABLE_DYNAMIC_LOADER set to OFF.
Selecting a compiler implementation
A compiler implementation is represented by a singleton instance of a compiler::Info object. A list of all available compilers can be obtained by calling compiler::compilers(), whilst compiler::getCompilerForDevice can be used to select the relevant compiler for a particular mux_device_info_t.
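A sketch of how these entry points might be used follows. The precise return and parameter types are assumptions made for illustration (check include/compiler/info.h), and the Mux device type is assumed to be in scope via the included headers:

```cpp
// Illustrative sketch only; not the declared ComputeMux signatures.
#include <compiler/info.h>

const compiler::Info *selectCompiler(mux_device_info_t device_info) {
  // Enumerate every compiler implementation built into the library.
  for (const compiler::Info *info : compiler::compilers()) {
    (void)info;  // e.g. report capabilities and metadata to the user.
  }
  // Pick the implementation matching a particular Mux device, or nullptr if
  // no compiler supports it.
  return compiler::getCompilerForDevice(device_info);
}
```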
Info
The compiler::Info struct (include/compiler/info.h) describes a particular compiler implementation that can be used to compile programs for a particular mux_device_info_t. Info contains information about the compiler capabilities and metadata, and additionally acts as an interface for creating a compiler::Target object.
Context
The compiler::Context interface (include/compiler/context.h) serves as an opaque wrapper over the LLVM context object. This object can also contain other shared state used by compiler modules, and contains a mutex that is locked when interacting with a specific instance of LLVM.
Target
The compiler::Target interface (include/compiler/target.h) represents a particular target device to generate machine code for. This object is also responsible for creating instances of compiler::Module (described below) and listing snapshot stages.
Module
The compiler::Module interface (include/compiler/module.h) is responsible for driving the compilation process from source code all the way to machine code. It acts as a container for LLVM IR by wrapping the LLVM Module object, and executes the required passes.
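The typical flow through a compiler::Module is sketched below. The member function names are those used in the sections that follow, but the simplified arguments and the absence of error handling are assumptions made purely for illustration:

```cpp
// Hypothetical end-to-end flow; the real methods take more arguments and
// report errors through the module's error log.
#include <compiler/module.h>

#include <string>

void buildProgram(compiler::Module &module, const std::string &source,
                  const std::string &options) {
  // Front end: OpenCL C source to LLVM IR (see "Compile OpenCL C").
  module.compileOpenCLC(source, options);
  // Combine with any other compiled modules (see "Link").
  module.link(/* other modules */ {});
  // Run the remaining middle-end passes (see "Finalize").
  module.finalize();
  // Lower to machine code now, or defer via compiler::Kernel.
  module.createBinary();
}
```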
Compile OpenCL C
The clang frontend is instantiated in the compiler::Module::compileOpenCLC member function; this is where:
- The OpenCL C language options are specified to the frontend
- User specified macro definitions and include directories are set
- mux device force-include headers (if present) are set
- A diagnostic handler is provided to report compilation errors
This compilation stage also introduces the pre-compiled builtins header providing the OpenCL C builtin function declarations to the frontend. Compilation occurs when the clang::EmitLLVMOnlyAction is invoked; ownership of the resulting llvm::Module is then transferred to compiler::Module to be used in the next stage. Any errors occurring during compilation are returned in the error log specified during the construction of compiler::Module, where they can be queried by the application.
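The sketch below shows the general shape of driving clang in-process to produce LLVM IR. It uses the public clang C++ API for illustration and is not the ComputeMux implementation; setting up the clang::CompilerInstance (language options, macros, include paths, diagnostics) is assumed to have happened already:

```cpp
#include <clang/CodeGen/CodeGenAction.h>
#include <clang/Frontend/CompilerInstance.h>
#include <llvm/IR/Module.h>

#include <memory>

std::unique_ptr<llvm::Module> emitIR(clang::CompilerInstance &instance,
                                     llvm::LLVMContext &context) {
  // EmitLLVMOnlyAction runs the parser and IR generation without writing any
  // output files; the generated module stays in memory.
  clang::EmitLLVMOnlyAction action(&context);
  if (!instance.ExecuteAction(action)) {
    // Errors were reported through the registered diagnostics handler.
    return nullptr;
  }
  // Ownership of the llvm::Module is transferred to the caller.
  return action.takeModule();
}
```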
Note
In OpenCL, the compiler::Module::compileOpenCLC member function directly maps to clCompileProgram but is also invoked by clBuildProgram.
Compile SPIR-V
The compiler::Module::compileSPIRV member function implements the SPIR-V frontend. First, the SPIR-V module is handed to spirv_ll::Context::translate to turn it into an llvm::Module, then some additional fixup passes are applied.
Compile SPIR
The compiler::Module::compileSPIR member function is responsible for translating SPIR bitcode to an llvm::Module by running additional fixup passes.
Link
During compiler::Module::link, the LLVM module is first cloned, then the list of all provided compiler::Module objects is linked into the current module. As with compiler::Module::compileOpenCLC, a diagnostics handler is specified. If linking was successful, the previous module is destroyed and ownership of the linked module is moved to compiler::Module.
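The clone-then-link step can be expressed with standard LLVM APIs as in the sketch below; this illustrates the mechanism rather than reproducing the code inside compiler::Module::link:

```cpp
#include <llvm/ADT/ArrayRef.h>
#include <llvm/IR/Module.h>
#include <llvm/Linker/Linker.h>
#include <llvm/Transforms/Utils/Cloning.h>

#include <memory>

std::unique_ptr<llvm::Module> linkModules(
    const llvm::Module &current,
    llvm::ArrayRef<const llvm::Module *> others) {
  // Work on a clone so the original module survives a failed link.
  std::unique_ptr<llvm::Module> linked = llvm::CloneModule(current);
  for (const llvm::Module *other : others) {
    // linkModules consumes its source module, so each input is cloned too.
    // It returns true on error; diagnostics go through the handler.
    if (llvm::Linker::linkModules(*linked, llvm::CloneModule(*other))) {
      return nullptr;
    }
  }
  return linked;
}
```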
Note
In OpenCL, the compiler::Module::link member function directly maps to clLinkProgram but is also invoked by clBuildProgram.
Finalize
Finalization is the final compilation stage; it executes any remaining LLVM passes and gets the module ready to be passed to the backend implementation. This is where the majority of the LLVM passes are run, once again on a clone of the llvm::Module owned by the compiler::Module object. Once the llvm::PassManager has run all of the desired passes, the LLVM module is ready to be turned into machine code, either through compiler::Module::createBinary, or deferred until runtime through the compiler::Kernel object.
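The sketch below shows the clone-and-run-passes shape of this stage using LLVM's new pass manager. The O2 pipeline is a stand-in; the actual finalization pipeline is target-specific and assembled by the ComputeMux target:

```cpp
#include <llvm/IR/Module.h>
#include <llvm/IR/PassManager.h>
#include <llvm/Passes/PassBuilder.h>
#include <llvm/Transforms/Utils/Cloning.h>

#include <memory>

std::unique_ptr<llvm::Module> finalize(const llvm::Module &module) {
  // Finalization runs on a clone, leaving the linked module untouched.
  std::unique_ptr<llvm::Module> cloned = llvm::CloneModule(module);

  // Standard new pass manager boilerplate.
  llvm::LoopAnalysisManager LAM;
  llvm::FunctionAnalysisManager FAM;
  llvm::CGSCCAnalysisManager CGAM;
  llvm::ModuleAnalysisManager MAM;
  llvm::PassBuilder PB;
  PB.registerModuleAnalyses(MAM);
  PB.registerCGSCCAnalyses(CGAM);
  PB.registerFunctionAnalyses(FAM);
  PB.registerLoopAnalyses(LAM);
  PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

  // Stand-in pipeline; the real pipeline is provided by the target.
  llvm::ModulePassManager MPM =
      PB.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O2);
  MPM.run(*cloned, MAM);
  // The cloned module is now ready for createBinary or compiler::Kernel.
  return cloned;
}
```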
Kernel
The compiler::Kernel interface (include/compiler/kernel.h) represents a single function entry point in a finalized compiler::Module. Its main purpose is to provide an opportunity for the backend to perform optimizations and code generation as late as possible. Most of the work is driven by the compiler::Module::createSpecializedKernel method, which creates a Mux runtime kernel potentially optimized for a set of execution options that will be passed to it during muxCommandNDRange.
OpenCL C Passes
The compiler module provides a number of LLVM passes, which are specific to processing the LLVM IR produced by clang after compiling OpenCL C source code. The IR is processed into a form that the backend can consume. The passes are described immediately below in the order they are executed by the LLVM pass manager.
Fast Math
The OpenCL standard defines an optional -cl-fast-relaxed-math flag that can be set when building programs, allowing optimizations on floating point arithmetic that could violate the IEEE-754 standard. When this flag is used we run the LLVM module level pass FastMathPass to perform these optimizations straight after frontend parsing from clang.
First the pass looks for any llvm::FPMathOperator instructions and, for those found, sets the llvm::FastMathFlags attribute to enable all of:
- Unsafe algebra - Operation can be algebraically transformed.
- No NaNs - Arguments and results can be treated as non-NaN.
- No Infs - Arguments and results can be treated as non-Infinity.
- No Signed Zeros - Sign of zero can be treated as insignificant.
- Allow Reciprocal - Reciprocal can be used instead of division.
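The flag-setting step above looks roughly like the following sketch when written against standard LLVM APIs; the builtin replacement that compiler::FastMathPass also performs is described next:

```cpp
#include <llvm/IR/Function.h>
#include <llvm/IR/InstIterator.h>
#include <llvm/IR/Operator.h>

void enableFastMath(llvm::Function &function) {
  llvm::FastMathFlags flags;
  flags.setFast();  // Enables all fast-math flags, including those above.
  for (llvm::Instruction &inst : llvm::instructions(function)) {
    // Only floating-point math operations accept fast-math flags.
    if (llvm::isa<llvm::FPMathOperator>(&inst)) {
      inst.setFastMathFlags(flags);
    }
  }
}
```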
As well as the above, compiler::FastMathPass replaces maths and geometric builtin functions with fast variants. Any math builtin functions which have a native equivalent are replaced with the native function, specified as having an implementation defined maximum error. For example exp2(float4) is replaced with native_exp2(float4).
Geometric builtins distance, length, and normalize are all defined in OpenCL as having fast variants fast_distance, fast_length, and fast_normalize which use reduced precision maths. If any of these functions are present we also replace them with the relaxed alternative.
These builtin replacements are done by searching the LLVM module for call instructions which invoke the mangled name of a builtin function we want to replace. If the fast version of the builtin isn’t already in the module, i.e. it wasn’t called explicitly somewhere else, then we also need to add a function declaration for the mangled name of the fast builtin. Finally a new call instruction is created invoking the fast function declaration and the old call it replaces is deleted.
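A sketch of this call-rewriting mechanism is shown below for the exp2 to native_exp2 example. The mangled names are illustrative assumptions, and retargeting the existing call stands in for the create-new-call-and-delete-old step described above:

```cpp
#include <llvm/ADT/STLExtras.h>
#include <llvm/IR/Instructions.h>
#include <llvm/IR/Module.h>

void replaceWithNativeExp2(llvm::Module &module) {
  // Mangled name of exp2(float4); illustrative only.
  llvm::Function *precise = module.getFunction("_Z4exp2Dv4_f");
  if (!precise) {
    return;  // The builtin is not used in this module.
  }
  // Declare the fast variant if it was not already referenced elsewhere.
  llvm::FunctionCallee fast = module.getOrInsertFunction(
      "_Z11native_exp2Dv4_f", precise->getFunctionType());
  // Redirect every call of the precise builtin to the native one.
  for (llvm::User *user : llvm::make_early_inc_range(precise->users())) {
    if (auto *call = llvm::dyn_cast<llvm::CallInst>(user)) {
      call->setCalledFunction(fast);
    }
  }
}
```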
Bit Shift Fixup
LLVM IR does not define the results of oversized shift amounts; however, SPIR does. As a result, shift instructions need to be updated to apply a ‘modulo N’ to the shift amount prior to the shift operation itself, where N is the bit width of the value to shift.
BitShiftFixupPass implements this as an LLVM function pass iterating over all the function instructions looking for shifts. For each shift found, the pass uses the first operand to work out ‘N’ for the modulo based on the bit width of the operand type. If the shift amount from the second operand is less than N, however, then we can skip the shift without inserting a modulo operation since the shift is not oversized. We can also skip shift instructions that already have the modulo applied, which can happen if the SPIR module was created by clang. Otherwise the pass creates a modulo by generating a ‘logical and’ instruction with operands N-1 and the original shift amount; this masked value is then used to replace the second operand of the shift.
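In terms of standard LLVM APIs, the masking transform looks roughly like this sketch:

```cpp
#include <llvm/IR/Function.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/InstIterator.h>

void fixupBitShifts(llvm::Function &function) {
  for (llvm::Instruction &inst : llvm::instructions(function)) {
    const unsigned opcode = inst.getOpcode();
    if (opcode != llvm::Instruction::Shl &&
        opcode != llvm::Instruction::LShr &&
        opcode != llvm::Instruction::AShr) {
      continue;
    }
    // N is the bit width of the value being shifted (the first operand).
    const unsigned bitWidth =
        inst.getOperand(0)->getType()->getScalarSizeInBits();
    llvm::Value *amount = inst.getOperand(1);
    // A constant shift amount that is already in range needs no mask.
    if (auto *constant = llvm::dyn_cast<llvm::ConstantInt>(amount)) {
      if (constant->getZExtValue() < bitWidth) {
        continue;
      }
    }
    // amount & (N - 1) is equivalent to amount % N because N is a power of 2.
    llvm::IRBuilder<> builder(&inst);
    llvm::Value *masked = builder.CreateAnd(
        amount, llvm::ConstantInt::get(amount->getType(), bitWidth - 1));
    inst.setOperand(1, masked);
  }
}
```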
Software Division
The compiler pass SoftwareDivisionPass is a function level pass designed to prevent undefined behaviour in division operations. To do this the pass adds runtime checks using llvm::CmpInst instructions for two specific cases: divide by zero and INT_MIN / -1. Because these cases are undefined behaviour, if one of them is detected we are free to change the behaviour of the divide operation. In both cases we set the divisor operand of the divide instruction to +1, using an llvm::SelectInst that chooses between the original operand and +1 based on the result of our checks.
Since IEEE-754 defines these error cases for floating point types, our runtime checks only need to be applied to integer divides. This is ensured in the pass by checking whether the instruction opcode is one of SDiv, SRem, UDiv, or URem; floating point divide instructions instead have opcode FDiv or FRem.
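A sketch of the guarded-divisor transform described above, written against standard LLVM APIs:

```cpp
#include <llvm/IR/Function.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/InstIterator.h>

void guardIntegerDivision(llvm::Function &function) {
  for (llvm::Instruction &inst : llvm::instructions(function)) {
    const unsigned opcode = inst.getOpcode();
    const bool isSigned = opcode == llvm::Instruction::SDiv ||
                          opcode == llvm::Instruction::SRem;
    const bool isUnsigned = opcode == llvm::Instruction::UDiv ||
                            opcode == llvm::Instruction::URem;
    if (!isSigned && !isUnsigned) {
      continue;  // FDiv/FRem are well defined by IEEE-754 and left alone.
    }
    llvm::IRBuilder<> builder(&inst);
    llvm::Value *dividend = inst.getOperand(0);
    llvm::Value *divisor = inst.getOperand(1);
    llvm::Type *type = divisor->getType();
    llvm::Value *one = llvm::ConstantInt::get(type, 1);
    // Divide by zero check.
    llvm::Value *isBad =
        builder.CreateICmpEQ(divisor, llvm::ConstantInt::get(type, 0));
    if (isSigned) {
      // INT_MIN / -1 overflows, so guard against it as well.
      llvm::Value *intMin = llvm::ConstantInt::get(
          type, llvm::APInt::getSignedMinValue(type->getScalarSizeInBits()));
      llvm::Value *minusOne = llvm::Constant::getAllOnesValue(type);
      llvm::Value *overflow =
          builder.CreateAnd(builder.CreateICmpEQ(dividend, intMin),
                            builder.CreateICmpEQ(divisor, minusOne));
      isBad = builder.CreateOr(isBad, overflow);
    }
    // Select +1 as the divisor whenever one of the bad cases is detected.
    inst.setOperand(1, builder.CreateSelect(isBad, one, divisor));
  }
}
```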
Image Argument Substitution
Calls to OpenCL image builtins that take opaque image types are replaced with the equivalent functions from the image library.
MemToReg
A manual implementation of LLVM's MemToReg pass, which promotes allocas that have only loads and stores as uses to register references. This is needed because, after LLVM 5.0, llvm::MemToReg regressed and no longer removes all the allocas it should.
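For reference, promoting allocas through LLVM's utility API looks like the sketch below; this illustrates the mechanism rather than reproducing the ComputeMux pass:

```cpp
#include <llvm/ADT/SmallVector.h>
#include <llvm/Analysis/AssumptionCache.h>
#include <llvm/IR/Dominators.h>
#include <llvm/IR/Function.h>
#include <llvm/IR/Instructions.h>
#include <llvm/Transforms/Utils/PromoteMemToReg.h>

void promoteAllocas(llvm::Function &function, llvm::DominatorTree &domTree,
                    llvm::AssumptionCache &assumptions) {
  llvm::SmallVector<llvm::AllocaInst *, 8> worklist;
  // Only allocas in the entry block are considered for promotion.
  for (llvm::Instruction &inst : function.getEntryBlock()) {
    if (auto *alloca = llvm::dyn_cast<llvm::AllocaInst>(&inst)) {
      // Promotable means every use is a simple load or store.
      if (llvm::isAllocaPromotable(alloca)) {
        worklist.push_back(alloca);
      }
    }
  }
  if (!worklist.empty()) {
    // Rewrites loads/stores of each alloca into SSA values and deletes it.
    llvm::PromoteMemToReg(worklist, domTree, &assumptions);
  }
}
```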
Builtin Simplification
BuiltinSimplificationPass is a module level pass for simplifying builtin function calls. The pass performs two kinds of optimization on builtins:
- Converts builtins to more efficient variants where possible (for example, a call to the math function pow(x, y), where y is a constant that is representable by an integer, will be converted to pown(x, y)).
- Replaces builtins whose arguments are all constant (for example, a call to the math function cos(x), where x is a constant, will be replaced by a new constant value that is the calculation of the cosine of x).
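The constant-argument case can be sketched as follows; the unmangled name "cosf" is purely illustrative, since the real pass operates on the mangled builtin names:

```cpp
#include <cmath>

#include <llvm/ADT/STLExtras.h>
#include <llvm/IR/Constants.h>
#include <llvm/IR/Instructions.h>
#include <llvm/IR/Module.h>

void foldConstantCos(llvm::Module &module) {
  llvm::Function *cosDecl = module.getFunction("cosf");  // Illustrative name.
  if (!cosDecl) {
    return;
  }
  for (llvm::User *user : llvm::make_early_inc_range(cosDecl->users())) {
    auto *call = llvm::dyn_cast<llvm::CallInst>(user);
    if (!call) {
      continue;
    }
    if (auto *arg = llvm::dyn_cast<llvm::ConstantFP>(call->getArgOperand(0))) {
      // Evaluate the builtin at compile time and substitute the result.
      const float folded = std::cos(arg->getValueAPF().convertToFloat());
      call->replaceAllUsesWith(llvm::ConstantFP::get(call->getType(), folded));
      call->eraseFromParent();
    }
  }
}
```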
printf Replacement
Of the myriad architectures that have ComputeMux back ends, most do not have access to an implementation of printf whereby they can route a call to printf within a kernel to stdout of the process running on the host CPU.
To enable our ComputeMux back ends to call printf, we provide an optimized software implementation. An additional kernel argument buffer is implicitly added to any kernel that uses printf, and our implementation of printf that is run on the ComputeMux backend will write the results of the printf into this buffer instead. Then, when the kernel has completed its execution, the data that was written to this buffer is streamed out on the host CPU via stdout.
Combine fpext fptrunc
CombineFPExtFPTruncPass is a function level pass, rather than a module pass, for removing FPExt and FPTrunc instructions that cancel each other out. This is used after the printf replacement pass because var-args printf arguments will be expanded to double by clang even if the device doesn't support doubles. So if the device doesn't support doubles, the printf pass will fptrunc the parameters back to float. CombineFPExtFPTruncPass will find and remove the matching fpext (added by clang) and fptrunc (added by the printf pass) to get rid of the doubles.
The pass is implemented by iterating over all the instructions looking for any llvm::FPExtInst instructions. If one is found then we check its uses: if the fpext is unused, remove it. Otherwise, if the instruction only has one use and it's an llvm::FPTruncInst, then we can replace all uses of the fptrunc with the first operand of the fpext and delete both the fptrunc and fpext.
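In standard LLVM API terms, this elimination is roughly:

```cpp
#include <llvm/ADT/SmallVector.h>
#include <llvm/IR/Function.h>
#include <llvm/IR/InstIterator.h>
#include <llvm/IR/Instructions.h>

void combineFPExtFPTrunc(llvm::Function &function) {
  llvm::SmallVector<llvm::Instruction *, 8> toErase;
  for (llvm::Instruction &inst : llvm::instructions(function)) {
    auto *ext = llvm::dyn_cast<llvm::FPExtInst>(&inst);
    if (!ext) {
      continue;
    }
    if (ext->use_empty()) {
      toErase.push_back(ext);  // Dead fpext: just remove it.
      continue;
    }
    if (!ext->hasOneUse()) {
      continue;
    }
    // fptrunc(fpext(x)) back to the original type is a no-op, so forward x.
    auto *trunc = llvm::dyn_cast<llvm::FPTruncInst>(*ext->user_begin());
    if (trunc && trunc->getType() == ext->getOperand(0)->getType()) {
      trunc->replaceAllUsesWith(ext->getOperand(0));
      toErase.push_back(trunc);
      toErase.push_back(ext);
    }
  }
  for (llvm::Instruction *inst : toErase) {
    inst->eraseFromParent();
  }
}
```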
Set convergent Attr
In clang the convergent attribute can be set on a function to indicate to the optimizer that the function relies on cross work item semantics. For OpenCL we need this attribute to be set on the barrier function, for example, since it's used to control the scheduling of threads. Recent versions of clang will proactively set such functions in OpenCL-C kernels as convergent, but we also set the attribute implicitly in the builtins header out of an abundance of caution.
This pass iterates over all the functions in the module, including declarations, which requires the pass to be a module pass instead of a function pass. If the function being inspected may be convergent, identified by the compiler's BuiltinInfo analysis, then we assign the llvm::Attribute::Convergent attribute to it. When the pass encounters a convergent function, all functions calling that function are transitively marked convergent.
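The transitive marking step can be sketched as below; determining which functions may be convergent in the first place (the BuiltinInfo query) is assumed and out of scope here:

```cpp
#include <llvm/IR/Function.h>
#include <llvm/IR/Instructions.h>

// Mark a function as convergent and propagate the attribute to its callers.
void markConvergent(llvm::Function &function) {
  if (function.hasFnAttribute(llvm::Attribute::Convergent)) {
    return;  // Already processed; this also stops recursion on call cycles.
  }
  function.addFnAttr(llvm::Attribute::Convergent);
  // Every function that calls a convergent function is itself convergent.
  for (llvm::User *user : function.users()) {
    if (auto *call = llvm::dyn_cast<llvm::CallInst>(user)) {
      markConvergent(*call->getFunction());
    }
  }
}
```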