The compiler::utils
module provides a number of utility functions and LLVM
passes inside the compiler::utils
namespace. These are designed to be
reusable across multiple ComputeMux implementations to reduce code duplication.
Some of these utilities are described below.
TransferKernelMetadataPass & EncodeKernelMetadataPass
These passes are responsible for setting up metadata on kernels under compilation. Many other passes implicitly rely on the metadata and attributes added by these passes, so it is recommended to run them first, if possible.
The difference between the two passes concerns the list of kernels it runs over:
The
TransferKernelMetadataPass
runs over multiple kernels in the module, using!opencl.kernels
metadata if present, else all functions in the module with thellvm::CallingConv::SPIR_KERNEL
calling convention.The
EncodeKernelMetadataPass
runs on one kernel, supplied by name to the pass upon construction.
Their job is three-fold:
To add mux-kernel entry-point function attributes to the kernels covered by each pass.
To add !reqd_work_group_size function metadata if not already attached. It sets this information based on local work-group size information which is:
(
TransferKernelMetadataPass
) - taken from the kernel’s entry in the!opencl.kernels
module-level metadata.(
EncodeKernelMetadataPass
) - optionally passed to the pass on construction. The local sizes passed to the pass should either be empty or correspond 1:1 with the list of kernels provided.
To add mux-work-item-order work-item order function attributes. It uses optional data supplied to either pass on construction to encode this metadata. If not set, the default
xyz
order is used.
LowerToMuxBuiltinsPass
This pass replaces calls to the language implementation’s builtin functions with alternative sequences of instructions involving ComputeMux builtins.
Not all builtins must be lowered, but any builtins which can be re-expressed in the terms of the following ComputeMux builtins should be lowered:
Sync builtins:
__mux_mem_barrier
,__mux_(sub|work)_group_barrier
.Work-item builtins:
__mux_get_local_id
,__mux_get_group_id
, etc.Group builtins:
__mux_(sub|work)_group_(any|all|scan|reduce|broadcast)
.DMA builtins:
__mux_dma_(read|write)_(1D|2D|3D)
and__mux_dma_wait
.
Targets must lower any language builtins which can be expressed in terms of these ComputeMux builtins in order for other ComputeMux compiler passes to corectly recognise the program semantics.
This is because these builtins have special semantics that the compiler and LLVM are generally unable to intuit using built-in properties of functions in LLVM (e.g., attributes). They generally have some meaning “across” other invocations of the same program or that influence the behaviour of other invocations running in parallel.
In the case of DMA builtins, the pass assumes that the language builtins are
already using the target’s intended ‘event’ type - that type is forwarded on
directly to the __mux
builtins. If the target wishes to replace these event
types across the module, they may use the ReplaceTargetExtTysPass
to do
so (for LLVM 17 onwards). The target may also directly replace the event
types used by the __mux
DMA builtins at a later stage.
See the full list of builtins for more information.
AlignModuleStructsPass
In LLVM, padding is usually enforced in the compiler frontend. However, because
the frontends used by ComputeMux often generates code for a target-agnostic
intermediate triple such as spir64
or spir32
, the alignment will have
been enforced and assumed valid according to that.
However, the final target data layout used by the code generator may have looser requirements than the SPIR ABI, in which case the final generated code may inadvertently break the higher-level language specification.
As a concrete example, the SPIR ABI requires that vector types are aligned to their own byte size. However, 32-bit arm targets have 128-bit vectors aligned to only 64 bits. So when generating code for something like:
struct S {
int4 a;
int4 b;
};
The incoming spir32
IR will have generated a structure type assuming its
own data layout’s alignment requirements:
target datalayout = "e-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024"
%struct.S = type { <4 x i32>, <4 x i32> }
Once the compiler has switched the triple to arm32
and its corresponding
datalayout, the elements of the structure type will be aligned to only 64 bits.
For this reason, the AlignModuleStructsPass
adds explicit padding to match
the higher-level alignment requirements:
%struct.S = type { <4 x i32>, [8 x i8], <4 x i32>, [8 x i8] }
The job of AlignModuleStructsPass
is thus to find all the struct types
which need alignment padding and create matching types with the padded variant.
Once we have generated these we need to substitute all the references, which is
performed in replaceModuleTypes()
. Since the types differ we cannot simply
use llvm::ReplaceAllUsesWith()
, and as a result we need to use
llvm::CloneFunctionInto()
with the type map parameter to do the job of
replacing types. One particular instruction we need to manually modify however
is GEP, since the indices into the struct will have changed with the addition
of padding members.
All kernels which are cloned and replaced in order to replace struct types
steal the mux-kernel
function attributes from their source function.
ReplaceAtomicFuncsPass
The module pass ReplaceAtomicFuncsPass changes any calls to atomic functions
made in the module into atomic LLVM instructions. This is performed by
iterating over all the instructions looking for llvm::CallInst
calls, and
for each checking the mangled name of the called function. The SPIR mangling
validated against includes the address space (AS<n>) which is not part of
Itanium.
If the mangling matches any of the overloaded variants of the OpenCL atomic
functions we use a mapping of mangled functions to
llvm::AtomicRMWInst::BinOp
operators to find the operator to use when
creating the atomic instruction. When the operator is an atomic_cmpxchg
we
build a llvm::AtomicCmpXchgInst
instruction, otherwise a
llvm::AtomicRMWInst
instruction is generated. Finally the call instruction
is replaced with our atomic instruction using ReplaceAllUsesWith
and then
erased.
FixupCallingConventionPass
To make sure that the calling convention of functions in the module are
understood by the target code generator, the FixupCallingConventionPass
can
be run as a module pass.
This is often required as most LLVM backends aren’t able to generate code for
the llvm::CallingConv::SPIR_FUNC
or llvm::CallingConv::SPIR_KERNEL
calling conventions used by SPIR-V.
The pass accepts a single calling convention to be used across the entire module. Note that it does not perform any other transformations of the function or callees, meaning this pass is only valid when the target calling conventions are ABI compatible.
The FixupCallingConventionPass
iterates over all the functions in the
executable module and, if that function is not an intrinsic, updates the
calling convention of the function and all its call instruction callees.
If the pass is given either the SPIR_FUNC
or SPIR_KERNEL
calling
convention, the pass will automatically fix up any mismatches between
llvm::CallingConv::SPIR_FUNC
and llvm::CallingConv::SPIR_KERNEL
, using
functions with mux-kernel attributes as a source of truth to distinguish between kernels and
other functions.
WorkItemLoopsPass
The WorkItemLoopsPass
is responsible for adding explicit parallelism to
implicitly parallel SIMT kernels. It does so by wrapping each kernel up in a
triple-nested loop over all work-items in the work-group. Thus, kernels
scheduled by this pass can be invoked once per work-group.
The order in which work-items are executed is fairly flexible as per the programming models the oneAPI Construction Kit supports, but generally in ascending order from 0 to N-1 through the innermost X dimension, followed by the Y dimension, and lastly the Z dimension.
Conceptually, the pass transforms old_kernel
into new_kernel
in the
example below:
void old_kernel(int *in, int *out) {
size_t id = get_local_linear_id(0);
out[id] = in[id] * 4;
}
void new_kernel(int *in, int *out) {
for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) {
for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) {
for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) {
size_t id = (z * sizeY * sizeX) + (y * sizeX) + x;
out[id] = in[id] * 4;
}
}
}
}
To satisfy the programming model, the pass must be careful around control
barriers and barrier-like functions. The WorkItemLoopsPass
splits a
kernel into separately executing kernel functions using barrier calls as
boundaries. Each section of the kernel split by these barriers is known as a
barrier region.
void old_kernel(int *in, int *out) {
size_t id = get_local_linear_id(0);
out[id * 2] = in[id];
// All work-items in the work-group must encounter the barrier before any
// are allowed to continue execution beyond the barrier.
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
out[id * 2 + 1] = in[id] * 4;
}
void new_kernel(int *in, int *out) {
// Barrier region #0
for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) {
for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) {
for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) {
size_t id = (z * sizeY * sizeX) + (y * sizeX) + x;
out[id * 2] = in[id];
}
}
}
// The control aspect of the barrier has been satisfied by the loops, so
// it has been decomposed to just a memory barrier.
mem_fence(CLK_GLOBAL_MEM_FENCE);
// Barrier region #1
for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) {
for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) {
for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) {
size_t id = (z * sizeY * sizeX) + (y * sizeX) + x;
out[id * 2 + 1] = in[id] * 4;
}
}
}
}
To propagate data dependencies between these barrier regions, an analysis is performed to create a struct of live variables which is passed as an argument to each kernel. Generated kernels then reference this struct rather than the original values. A simplified example follows:
void old_kernel(int *in, int *out) {
size_t id = get_local_linear_id(0);
// X is a barrier-carried dependency: produced in one barrier region and
// accessed in another.
int x = in[id] * 4;
// All work-items in the work-group must encounter the barrier before any
// are allowed to continue execution beyond the barrier.
work_group_barrier(CLK_GLOBAL_MEM_FENCE);
// Use X, produced by the previous barrier region.
out[id] = x;
}
void new_kernel(int *in, int *out) {
struct kernel_live_vars {
int x;
};
// Illustrative purposes: this is in reality a stack allocation.
kernel_live_vars *live_vars =
malloc(get_local_size(0) * get_local_size(1)
* get_local_size(2) * sizeof(live_vars));
for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) {
for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) {
for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) {
size_t id = (z * sizeY * sizeX) + (y * sizeX) + x;
live_vars[id] = in[id] * 4;
}
}
}
mem_fence(CLK_GLOBAL_MEM_FENCE);
for (size_t z = 0, sizeZ = get_local_size(2); z != sizeZ; z++) {
for (size_t y = 0, sizeY = get_local_size(1); y != sizeY; y++) {
for (size_t x = 0, sizeX = get_local_size(0); x != sizeX; x++) {
size_t id = (z * sizeY * sizeX) + (y * sizeX) + x;
out[id] = live_vars[id];
}
}
}
}
The loop that reconstructs the kernels in the wrapper function uses the
vectorization dimension as innermost cycle, and it relies on
mux-work-item-order function attributes for the outermost loops. The logic for the
dimension unmarshalling lies in
modules/compiler/utils/include/utils/vecz_order.h
.
Preserving debug info is a problem for the WorkItemLoopsPass
due to live
variables getting stored in a struct passed as an argument to each of the
generated kernels. As a result the memory locations pointed to by the debug
info are out of date with respect to newly written values. By specifying the
IsDebug
flag when creating the pass we can resolve this problem at the
expense of performance.
When the IsDebug
flag is set the pass adds a new alloca
which contains a
pointer to the live variables struct of the currently executing work-item, since
there is a separate struct for each work-item in a work-group. A new store
instruction to this alloca
is also inserted before calls to each of the
separated kernels with the new address of the live variables struct for the
work-item about to be executed. These extra writes to the stack have a runtime
cost which is why this transformation is only done when compiling for debug.
The benefit of adding the extra alloca
is that it forces the address to be
placed on the stack, where we can point to it with llvm.dbg.declare()
intrinsics, rather than reading the address from a register where it won’t
persist. Not all source variables are classed as live however if they are not
used past the first barrier, so when the IsDebug
flag is set we also modify
the algorithm for finding live variables to mark these alloca
instructions
as live. Otherwise their values won’t be updated for the current work item past
the first barrier and the debugger will print incorrect values.
To point to the location in the live variables struct where each source
variable lives we use DWARF expressions, represented in LLVM by a
DIExpression
metadata node. In our expression we first use a
DW_OP_deref
DWARF operation to dereference the pointer in our debugging
alloca
to find the start of the live variables struct. Then next in the
expression we have a DW_OP_plus
operation taking an integer operand for the
byte offset into the struct for that particular variable.
In order to establish which values actually cross a barrier, we traverse the CFG and build inter-barrier regions. We start traversal at the beginning of the function, and at the barriers, and we end whenever we encounter another barrier or a return statement. We collect all values that are defined within one region, which have uses in any other region, which are called “external uses”. We also collect values that are defined within one region and used in the same region, but where the definition does not dominate the use. These are “internal uses” and can occur where a barrier is present in a loop, such that the same barrier that begins the inter-barrier region can also be hit at the end of that region. (The definition must have dominated all its uses in the original function, but a barrier inside a loop can result in the second part of the loop body preceding the first within the inter-barrier region.)
We also implement a “Barrier Tidying” optimization that posts-processes the set of live values to remove certain values where it is expected that loading and storing these values will incur more overhead than simply recalculating them from other available values (including other barrier-stored values and kernel parameters). Values considered removable are:
NOP casts,
Casts from a narrow type to a wider type,
All other casts where the source operand is already in the barrier,
Vector splats,
Calls to “rematerializable” builtins - see
compiler::utils::eBuiltinPropertyRematerializable
If the barrier contains scalable vectors, the size of the struct is dependent
on the value of vscale
, and so is the total number of struct instances for
a given work group size. In this case we create the barrier memory area as a
byte buffer (i.e. an array of i8
), instead of an array of barrier structs.
The address of the barrier struct for the subkernel invocations have to be
computed knowing the vscale, and pointer-cast to the barrier struct type. Any
scalable vector members of the barrier struct are put into a flexible array
member (of type i8
) at the end, so that GEPs to individual members can be
constructed by calculating their byte offsets into this array and the results
cast to pointers of the needed type. The position of individual scalable vector
members is calculated by multiplying their equivalent “fixed width” offset
(i.e. the same as if vscale were equal to 1) by the actual vscale.
Once we know which values are to be included in the barrier struct, we can split the kernel proper, creating a new function for each of the inter-barrier regions, cloning the Basic Blocks of the original function into it. We apply the barrier in the following order: external uses are remapped into loads from the barrier struct, then any barrier-resident values are stored into the barrier, and finally, internal uses are remapped into the barrier. External and internal uses are dealt with separately, since external uses can always be safely loaded only once at the beginning of the new function, where as internal uses may or may not need to load the loop-updated value. For this reason, stores are always created immediately after the definitions of the relevant values, rather than at the barrier at the end of the region. (This may have some scope for further optimization work.) When tidying has removed a value from the barrier, we have to also clone those values as well, in order to re-compute these values from the value actually stored in the barrier struct. Each subkernel returns an integer ID that maps to the barriers, corresponding to the barrier that was encountered at the end of the subkernel. There is a special barrier ID that represents the return statement of the original kernel, and also one that represents the kernel entry point.
This pass runs over all functions in the module which have mux-kernel entry-point attributes.
The new wrappers take the name of either the ‘tail’ or ‘main’ kernels – whichever is present – suffixed by “.mux-barrier-wrapper”. The wrappers call either the original kernel(s) if no barriers are present, or the newly-created barrier regions if barriers are present. The original kernels are left in the module in either case but are marked as internal so that later passes can optimize them if they are no longer called once inlined.
Newly-created functions preserve the original calling convention, unless they
are kernels. In that case, the new functions will have SPIR_FUNC
calling
convention. Newly-created functions steal the mux-kernel
attributes from
the original functions.
Once we have all of our subkernels, we apply the 3-dimensional work item loops individually to each subkernel. The return value of a subkernel is used to determine which subkernel loop to branch to next, or to exit the wrapper function, as appropriate.
Work-group scheduling (vectorized and scalar loops)
The WorkItemLoopsPass is responsible for stitching together multiple kernels to make a single kernel capable of correctly executing all work-items in the work-group.
In particular, when a kernel has been vectorized with Vecz it executes multiple work-items at once. Unless the work-group size in the vectorized dimension is known to be a multiple of the vectorization factor, there exists the possibility that some work-items will not be executed by the vectorized loop.
As such, the WorkItemLoopsPass is able to stitch together kernels in several different configurations:
Vector + scalar loop
Vector loop + vector-predicated tail
Vector loop only
Scalar loop only
Vector + Scalar
The vector + scalar kernel combination is considered the default behaviour. Most often the work-group size is unknown at compile time and thus it must be assumed that the vector loop may not execute all work-items.
This configuration is used if the WorkItemLoopsPass is asked to run on a vectorized function which has !codeplay_ca_vecz.derived function metadata linking it back to its scalar progenitor. In this case, both the vector and scalar kernel functions are identified and are used. The vector work-items are executed first, followed by the scalar work-items.
const size_t peel = group_size_x % vec_width;
const size_t peel_limit = group_size_x - peel;
if (group_size_x >= vector_width) {
for (size_t z = 0; z < group_size_z; ++z) {
for (size_t y = 0; y < group_size_y; ++y) {
for (size_t wi = 0; wi < peel_limit; wi += vec_width) {
// run vectorized kernel if vec_width > 1,
// otherwise the scalar kernel.
}
}
}
}
if (group_size_x < vector_width || group_size_x % vector_width != 0) {
for (size_t z = 0; z < group_size_z; ++z) {
for (size_t y = 0; y < group_size_y; ++y) {
// peeled loop running remaining work-items (if any) on the scalar
// kernel
for (size_t wi = peel_limit; wi < group_size_x; ++wi) {
// run scalar kernel
}
}
}
}
Barriers are supported in this mode by creating a separate barrier struct for both the vector and scalar versions of the kernel.
There are circumstances in which this mode is skipped in favour of “vector only” mode:
If the local work-group size is known to be a multiple of the vectorization factor.
This is identified through the !reqd_work_group_size function metadata. This is often automatically added to functions by compiler frontends if kernels are supplied with attributes (e.g.,
reqd_work_group_size
in OpenCL). Alternatively, if the work-group size is known at compile time, use the TransferKernelMetadataPass or EncodeKernelMetadataPass to encode functions with this information.
If the WorkItemLoopsPass has been created with the ForceNoTail option. * This is a global toggle for all kernels in the program.
If the kernel has been vectorized with vector predication. In this case the vector loop is known to handle scalar iterations itself.
If any of these conditions are true, the “vector only” mode is used.
Vector + Vector-predicated
The vector + vector-predicated kernel combination is a special case optimization of the default behaviour.
If the pass detects both a vector and vector-predicated kernel linked to the same original kernel with the same vectorization width, the scalar tail loop is replaced with a straight-line call to the vector-predicated kernel, which will perform all of the scalar iterations at once.
const size_t peel = group_size_x % vec_width;
const size_t peel_limit = group_size_x - peel;
if (group_size_x >= vector_width) {
for (size_t z = 0; z < group_size_z; ++z) {
for (size_t y = 0; y < group_size_y; ++y) {
for (size_t wi = 0; wi < peel_limit; wi += vec_width) {
// run vectorized kernel if vec_width > 1,
}
if (peel) {
// run vector-predicated kernel
}
}
}
}
Vector only
If the WorkItemLoopsPass is run on a vectorized kernel for which no vecz linking metadata is found to identify the scalar kernel, or if a scalar kernel is found but one of the conditions listed above hold, then the kernel is emitted using the vector kernel only. It is assumed that if no scalar kernel is found it is because targets know that one is not required.
Scalar only
If the WorkItemLoopsPass is run on a scalar kernel then only the scalar kernel is used.
OptimalBuiltinReplacementPass
The OptimalBuiltinReplacementPass
is an optimization call-graph pass designed
to replace calls to builtin functions with optimal equivalents.
The OptimalBuiltinReplacementPass
iterates over the call graph from kernels
inwards to their called functions, and visits all call sites in the caller
functions. If a call is made to a function that the pass is interested in, the
call is deleted and is replaced with a series of inline IR instructions. Using
the call graph guarantees that replacements are made on a priority basis;
outermost functions are replaced before any functions they themselves call.
Replacements are optionally made according to a specific BuiltinInfo
object, which may be passed to this pass. It defaults to nullptr
. If this
BuiltinInfo
is present then it is asked whether it recognizes any builtin
functions and is tasked with inlining a suitable sequence of instructions.
Replacements are also performed on two abacus-internal builtins: __abacus_clz
and __abacus_mul_hi
. Replacing these rather than their OpenCL user-facing
builtins allows replacements in more cases, as the abacus versions are used to
implement several other builtin functions.
The __abacus_clz
builtin – count leading zeros – can be exchanged for a
hardware intrinsic: llvm.ctlz
. However, some variants are skipped: 64-bit
scalar and vector variants are skipped, since Arm uses calls to an external
function to help it implement this case.
The __abacus_mul_hi
builtin – multiplication returning the “high” part of
the product – can be exchanged for a shorter series of LLVM instructions which
perform the multiplication in a wider type before shifting it down. This is
desirable because abacus has a rule that it never introduces larger types in
its calculations. LLVM, however, is able to match a specific sequence of
instructions against a “mul hi” node, which is canonical, well-optimized, and
many targets directly lower that node to a single instruction. 64-bit versions
(scalar and vector) are skipped since 64-bit “mul hi” and 128-bit integers are
not well supported on all targets.
The __abacus_fmin
and __abacus_fmax
builtins can be exchanged for
hardware intrinsics: llvm.minnum
and llvm.maxnum
. This is not performed
on ARM targets due to LLVM backend compiler bugs.
LinkBuiltinsPass
The LinkBuiltinsPass
will manually link in any functions required from a
given builtins module, into the current module. This pass allows us to strip
out unnecessary symbols whilst performing our link step resulting in the
equivalent of a simple global DCE pass with no overhead. Previously, we would
link our kernel module into the lazily-loaded builtins module (the recommended
way to link between a small and a large LLVM module), which we would not be
able to do in a pass (as the Module the pass refers to effectively dies as the
linking would occur).
LLVM’s LinkModules
function is destructive to the source module - it will
happily destroy the source module as it links it into the destination. This is
fine for most cases, but not ours. In our case, we want to load the builtins
module once (in our finalizer) and then re-use that loaded module multiple
times (saves significant memory & processing requirements on our hot path).
Note that in some cases linking builtins before vectorization is desirable,
except for special builtins such as get_global_id()
. This is particularly
the case for scalable vector support where there is no equivalent in the
builtins. To enable early linking, pass EarlyLinking = true
when
constructing the pass.
MakeFunctionNameUniquePass
The module pass MakeFunctionNameUniquePass
is used to give distinct names
to scheduled kernels. This is necessary since a single kernel can be run more
than once across different work sizes and we want to be able differentiate
them.
When creating the pass, one string parameter needs to be passed for for the new
unique kernel name. MakeFunctionNameUniquePass
then simply looks for all
functions with mux-kernel entry-point attributes and sets the function’s name to be the
first string argument.
Note
This pass is only used in-tree by the host target. When doing just-in-time compilation at execution time, only one kernel is under compilation.
This pass is not useful when doing ahead-of-time compilation, when many kernels may be in the same module.
ReduceToFunctionPass
The LLVM module when passed to scheduled kernel can contain multiple kernel
functions present in the device-side program, however by this stage of
compilation we are only interested in running a subset of these kernels. In order to
improve the speed of subsequent passes and reduce code size we therefore have
module pass ReduceToFunctionPass
, which removes dead functions not used by the
target kernels. The ReduceToFunction
pass runs over all functions with
mux-kernel
attributes by default. All top-level kernel functions that are required to be
preserved by this pass should have this attribute set.
Note
Like the MakeFunctionNameUniquePass, this is only used in-tree by
host
which does just-in-time compilation, when one kernel entry point
can be singled out.
Note
A deprecated version of this pass takes a string list of functions names to preserve, which should include the name of our enqueued kernel and any internal functions needed for later passes.
When ReduceToFunctionPass
is then run it iterates over the list of
preserved functions (obtained either through metadata or the deprecated list of
kernel names). Those functions are then marked to keep, and so are any
functions called inside it. Afterwards the pass looks through all the functions
in the module and erases any not marked for keeping.
RunVeczPass
The RunVeczPass
module pass provides a wrapper for using our
Vecz oneAPI Construction Kit IR vectorizer. This vectorizes
the kernel to a SIMD width specified when the pass is created. In our case
this is typically local size in the first dimension but there are other
factors to consider when picking the width, like being a power of 2.
We only enable the vectorizer in host when the -cl-wfv={always|auto}
option
is provided, a condition check which is the first thing this pass does. If this
check fails, the pass exits early, otherwise the vectorizer is invoked through
top level API vecz::Vectorizer::vectorize
. If the passed option is
-cl-wfv=auto
, then we first have to check the layout of the input kernel to
find out if it is advantageous to vectorize it, and only do so if it is the
case. If the passed option is -cl-wfv=always
, then we will try to vectorize
the kernel in any case. If successful, this will return a new vectorized kernel
function created in the LLVM module so that this vectorized kernel is used
instead of our scalar kernel from here on.
Cost Model Interface
User cost-modelling in vecz can be handled by the
vecz::VeczPassOptionsAnalsis
which takes a user defined query function on
construction. This pass is a required analysis pass for vecz, so be sure to add
it to your analysis manager.
Vecz queries the result of this analysis before operating on a kernel, and the
user function may fill an array of VeczPassOptions
which contain suitably
modelled widths, vectorization factors, and scalability options determined
suitable for the target.
The VeczPassOptionsAnalysis
pass can be default-constructed - in which case
vecz makes a conservative decision about kernel vectorization - or be
constructed passing in a user callback function. The function takes as its
parameters a reference to the function to be optionally vectorized, and a
reference to a vector of VeczPassOptions
which it is expected to fill in.
If it’s not interested in seeing the function vectorized, it returns false;
otherwise it fills in the VeczPassOptions
array with the choicest
vectorization options it can muster for the target. For example:
void InitMyAnalysisManager(llvm::ModuleAnalysisManager &MAM) {
MyCostModel CM;
MAM.registerPass([CM] {
return vecz::VeczPassOptionsAnalysis(
[CM](llvm::Function &F,
llvm::SmallVectorImpl<vecz::VeczPassOptions> &Opts) {
if (CM->getCostWFV(&F) > 0) {
// Vectorizing will make things worse, so don't
return false;
}
VeczPassOptions O;
vecz::VectorizationChoices &choices = O.choices;
if (!MyCostModel->hasDoubles()) {
choices.enable(eCababilityNoDoubleSupport);
}
if (CM->getCostPartialScalarization(&F) < 0) {
choices.enable(vecz::VectorizationChoices::ePartialScalarization);
}
if (CM->getCostBOSCC(&F) < 0) {
choices.enable(vecz::VectorizationChoices::eLinearizeBOSCC);
}
// Our silly target only has 42-wide SIMD units!
opts.factor = Vectorization::getFixedWidth(42);
Opts.emplace_back(std::move(O));
return true;
});
});
}
To access the VeczPassOptionsAnalysis
from inside any other pass in the
same pass manager, do the following:
auto queryPassOpts = getAnalysis<vecz::VeczPassOptionsAnalysis>();
The above returns a pointer to the cost model the wrapper pass was constructed
with, and may return nullptr
if no cost model was provided.
The Cost Model header file resides at utils/cost_model.h
.
Scheduling Parameters
Certain ComputeMux builtin functions may require extra-function data passed to them to be lowered. The AddSchedulingParametersPass can help to achieve this by modifying functions with extra parameters, known as “scheduling parameters”, to functions that require them.
Target Scheduling Parameters
Scheduling parameters are ultimately up to the ComputeMux target to define, but a default set of two pointer-to-struct parameters is used to achieve the default lowering of ComputeMux builtins:
MuxWorkItemInfo
:The 3-dimensional local ID: an array of 3 natural-width integers (e.g.,
[3 x i64]
when compiling for a 64-bit device).The sub-group ID: a 32-bit integer (
i32
).The kernel width: a 32-bit integer (
i32
). The kernel width represents how many work-items are being executed in parallel. This is1
by default, unless the kernel is vectorized, in which case the width is the vectorization factor.The number of sub-groups: a 32-bit integer (
i32
).The maximum sub-group size: a 32-bit integer (
i32
).
MuxWorkGroupInfo
:The group id: a 3-dimensional array of natural-width integers (
[3 x iN]
).The number of groups: a 3-dimensional array of natural-width integers (
[3 x iN]
).The global offset: a 3-dimensional array of natural-width integers (
[3 x iN]
).The local work-group size: a 3-dimensional array of natural-width integers (
[3 x iN]
).The number of work dimensions: a 32-bit integer (
i32
).
AddSchedulingParametersPass
The AddSchedulingParametersPass
pass requests the target-specific list of
scheduling parameters via BuiltinInfo::getMuxSchedulingParameters
. This
list is added to all kernel entry points and to all mux builtins that require
them; this is determined by BuiltinInfo::requiresSchedulingParameters
.
BuiltinInfo::getMuxSchedulingParameters
is virtual and thus may be overridden
by targets. See BuiltinInfo::SchedParamInfo
for the data that must be filled
in to communicate this information.
The pass starts by identifying which builtins require scheduling parameters. It then propagates this initial list of functions to all functions calling those functions, all functions calling those functions, and on in this fashion until the entire call graph is covered.
The list of scheduling parameters is emitted to the module under !mux-scheduling-params.
Once the list of functions to add scheduling parameters has been calculated,
each function is cloned in order to amend the functions’ type with new
scheduling parameters. All scheduling parameters are added to all functions
that require them. This is for ease of implementation and to reduce pass
interdependencies: any subsequent pass can know that scheduling parameters are
entirely present or entirely not present, without in-depth analysis. Unused
function parameters on functions with internal linkage are later pruned by
LLVM’s DeadArgumentEliminationPass
.
Functions for which BuiltinInfo::requiresSchedulingParameters
returns true
are cloned taking their old names with them. Old uses are suffixed .old
.
Other functions are cloned and given the suffix .mux-sched-wrapper
.
By way of an example, given a module with the following functions:
; Requires scheduling parameters
declare void @foo()
; Requires scheduling parameters only transitively
define void @bar() #0 {
call void @foo()
ret void
}
attributes #0 = { "mux-kernel" }
With the default set of scheduling parameters, the
AddSchedulingParametersPass
would produce:
; Old version of @foo - no scheduling parameters
declare void @foo.old()
; Old version of @bar - no longer a kernel - can be cleaned up later
define void @bar(i32) {
call void @foo.old()
ret void
}
; New version of @foo with scheduling parameters
declare !mux_scheduled_fn !1 void @foo(ptr %wi, ptr %wg)
; New version of @bar with scheduling parameters - the new kernel
define void @bar.mux-sched-wrapper(i32, ptr %wi, ptr %wg) #0 !mux_scheduled_fn !2 {
call void @foo(ptr %wi, ptr %wg)
ret void
}
attributes #0 = { "mux-kernel" }
!mux-scheduling-params = !{!0}
!0 = !{!"MuxWorkItemInfo", !"MuxWorkGroupInfo"}
!1 = !{i32 0, i32 1}
!2 = !{i32 1, i32 2}
DefineMuxBuiltinsPass
The DefineMuxBuiltinsPass
performs a scan over all functions in the module,
calling BuiltinInfo::defineMuxBuiltin
on all mux builtin function
declarations.
There is a soft dependency on the AddSchedulingParametersPass if the default
set of mux scheduling parameters is used. This is because the default lowering
of many work-item builtins requires data stored in the structure parameters. If
the target provides a custom implementation of
BuiltinInfo::defineMuxBuiltin
, the AddSchedulingParametersPass
may not
be required.
If a definition of a mux builtin requires calls to other mux builtins which
themselves need defining, such dependencies can be added to the end of the
module’s list of functions so that the DefineMuxBuiltinsPass
will visit
those in turn. One example of this is the lowering of __mux_get_global_id
which calls __mux_get_local_id
, among other functions.
AddKernelWrapperPass
To encapsulate all the original parameters of the kernel as a single struct
argument we use the pass AddKernelWrapperPass
. This creates a struct
composed of the individual kernel parameters which the runtime can create and
pass when invoking the kernel. If IsPackedStruct = true
is passed via the
pass options on construction, then all of the parameters are tightly packed,
otherwise each parameter is aligned to a power of 2 equal to or above the size.
As a first step towards achieving this the pass iterates over all the current
kernel parameter types and adds them as members to a new struct type called
MuxPackedArgs.<kernel name>
. The rules for parameter packing are as follows:
If
PassLocalBuffersBySize = true
, buffers in the local/Workgroup address space are passed via the devicesize_t
type. In this mode, the buffer is stack allocated upon kernel entry, creating analloca
instruction with the size passed throughMuxPackedArgs
. The stack-allocated pointer will then be passed to the old kernel via a pointer. The host runtime is expected to pass the size of the buffer allocation. IfPassLocalBuffersBySize
is nottrue
, then they are treated as regular pointers.If the kernel argument is a pointer type with the
byval
parameter attribute, the parameter is passed by itsbyval
type.If none of the above cases hold, then the parameter is packed as-is.
Next, the pass creates a wrapper function which will be used as the new kernel
entry point, taking a pointer to the MuxPackedArgs
struct as its first
parameter. Any scheduling parameters present in kernel are dealt with thus,
depending on the value of BuiltinInfo::SchedParamInfo::PassedExternally
:
If
true
, are passed through the wrapper functionIf
false
, must be initialized byBuiltinInfo::initalizeSchedulingParamForWrappedKernel
in the entry block, before being passed to the original wrapped kernel.
The wrapper pass takes its name from the original function name of the old
wrapped kernel, with a pass-specific suffix appended. The wrapped kernel is
marked inline. A call instruction to this wrapped kernel is then created inside
the wrapper, using GEPs to the appropriate MuxPackedArgs
element for each of
the parameters.
This pass runs on all functions in the module with mux-kernel attributes. The new wrapper functions take this attribute from the original functions.
Any !mux_scheduled_fn metadata is dropped on the wrapper function, as the old metadata is no longer accurate, and no further passes depend on it.
; Has two parameters (%a0, %a1) and two scheduling parameters (%x, %y)
declare !mux_scheduled_fn !1 void @foo(i8 %a0, i16 %y, i32 %a1, i64 %x) #0
attributes #0 = { "mux-kernel" }
!mux-scheduling-params = !{!0}
!0 = !{ !"x", !"y" }
!1 = !{ i32 1, i32 3 }
Assuming that for scheduling parameter %x
,
BuiltinInfo::SchedParamInfo::PassedExternally
is true
and for %y
it
is false
, after running this pass:
; A packed argument structure containing %a0 and %a1
%MuxPackedArgs.foo = type { i8, i32 }
declare !mux_scheduled_fn !1 void @foo(i8 %a0, i16 %y, i32 %a1, i64 %x)
; Has one packed-argument parameter and one pass-through scheduling parameter: %x
define void @foo.mux-kernel-wrapper(ptr %packed-args, i64 %x) #0 {
; Load the original kernel arguments from the packed structure
%a0 = load i8 ptr %packed-args
%a1.addr = getelementptr %MuxPackedArgs.foo, ptr %packed-args, i32 0, i32 1
%a1 = load i32, ptr %a1.addr
; Initialize %y as per BuiltinInfo::initalizeSchedulingParamForWrappedKernel
%y = ...
call void @foo(i8 %a0, i16 %y, i32 %a1, i64 %x)
}
attributes #0 = { "mux-base-fn-name"="foo" "mux-kernel" }
!mux-scheduling-params = !{!0}
!0 = !{ !"x", !"y" }
!1 = !{ i32 1, i32 3 }
ReplaceLocalModuleScopeVariablesPass
The ReplaceLocalModuleScopeVariables
pass identifies global variables in
the local/Workgroup address space
and places them in a struct called localVarTypes
, allocated in a newly
created wrapper function. A pointer to the struct is then passed via a
parameter to the original kernel. The wrapper function takes over function
attributes and metadata from the original kernel.
When creating the struct we need to be aware of the alignment of members so that they are OpenCL conformant for their type. To do this we manually pad the struct by keeping track of each elements offset and adding byte array entries for padding to meet alignment requirements. Finally the whole struct is aligned to the largest member alignment found.
Once the struct is created the pass replaces all instructions using each of the global variables identified in the previous step with instructions referencing the matching struct member instead. Finally the identified global variables are removed once all of their uses have been replaced.
ReplaceMuxMathDeclsPass
Replaces function declarations from the Builtins module. These functions are used internally in our builtins implementation to tune behaviour within algorithms. As such the pass should be run after the builtins have been linked into the LLVM module being modified.
The following builtins are replaced:
__mux_isftz
- Whether the target flushes to zero.__mux_usefast
- Whether to use faster, less accurate maths algorithms.__mux_isembeddedprofile
- Whether the mux target implements OpenCL embedded profile.
Declarations matching each of these function names are searched for by
ReplaceMuxMathDeclsPass
, and if found, a function body is created returning
a constant value. These constant return values are set from bool
parameters
passed by the runtime on pass creation, and may be derived from hardware
features like denormal support, or from compilation flags like fast-math. These
functions must not be declared with the noinline
attribute: later generic
optimization passes, such as Dead Code Elimination, should be able remove the
unused control-flow in kernel code once the definitions of these builtins have
been inlined.
UniqueOpaqueStructsPass
When linking two llvm::Module
s containing forward declarations of opaque
structure types with the same name, or deserializing an llvm::Module
in a
context which already declares an opaque structure with the same name as an
opaque structure type in the module, LLVM attempts to resolve the name clash by
appending a suffix to one of the types e.g., opencl.event_t
becomes
opencl.event_t.0
. This situation is problematic if passes rely on the
opaque struct type’s name to identify it.
The UniqueOpaqueStructsPass
can be used by targets after linking modules or
deserializing modules in a new context. Running this pass will replace all
instances of an opaque structure type with a suffix in its name with the
unsuffixed version, if the unsuffixed variant exists in the context and is also
opaque.
After this pass has run all opaque suffixed types will have been removed and replaced with the unique unsuffixed opaque struct type if it exists.
SimpleCallbackPass
Certain simple operations on llvm::Module
s that don’t warrant their own
dedicated pass can be accomplished using the SimpleCallbackPass
which
invokes a callback function when the pass is run. The callback returns void
and is provided the llvm::Module
as a parameter.
Note
It is undefined behaviour for the callback to modify the Module
in such a
way that analyses are invalidated.
ReplaceWGCPass
The ReplaceWGCPass
provides software implementations of the ComputeMux
work-group collective builtins. Targets wishing to support work-group
collectives in software may run this pass. This pass makes heavy use of
barriers, so do not expect performance. Because it introduces barriers into the
module, this pass must be run before any barrier analysis or
materialization e.g., the PrepareBarriersPass and WorkItemLoopsPass.
This pass introduces global variables into the module qualified with the
local/Workgroup address space and
therefore must be run before any pass that materializes __local
variables in another form, e.g., the ReplaceLocalModuleScopeVariablesPass.
AddMetadataPass<AnalysisTy, HandlerTy>
This pass converts kernel metadata retrieved through an analysis and encodes the metadata into a binary format through the provided handler. The serialized bytes are added to the IR as a global constant and tagged such that it will be placed into the “.notes” section in the ELF file, when the binary is created.
This pass relies on metadata to have been previously added to the IR before the
pass is run. In the Vectorize case this pass should be run after any vecz
passes and after the ComputeLocalMemoryUsagePass
to ensure that all
metadata is present.
ReplaceMemIntrinsicsPass
A pass that replaces calls to llvm.memcpy.*
, llvm.memset.*
and
llvm.memmove.*
with calls to a generated loop. This pass can be used for
targets which are not able to generate backend code for these intrinsics or do
not link with a library which supports this. Although some attempt is made not to
generate these intrinsics, they can exist in SPIRV
code. Note that
llvm.memmove.*
does not currently support different address spaces for the
pointer arguments.
PrepareBarriersPass
The PrepareBarriersPass
is useful in order to satisfy the requirements the
WorkItemLoopsPass has on kernels containing barrier-like functions if
running in conjunction with the RunVeczPass. If running, it should be run
before using the vectorizer.
It ensures that barriers are synchronized between two or more vectorized
versions of the same kernel. It gives each barrier a unique ID, which the
vectorizer preserves in each vectorized kernel, meaning the
WorkItemLoopsPass
can correctly schedule the work-item loops for each
barrier region.
RemoveLifetimeIntrinsicsPass
The LLVM intrinsics llvm.lifetime.start
and llvm.lifetime.end
take as an
argument a pointer to memory and define where in the code referencing that
memory objects is valid. Before the start intrinsic memory pointed to is
undefined and a load to it can be replaced with an undef
. Likewise the memory
is undefined after the end intrinsic and any stores can be removed as dead.
The function pass RemoveLifetimeIntrinsicsPass
removes these intrinsics
from a module by iterating over all the instructions and erasing any lifetime
intrinsics found, as well as the bit-casts they use for the pointer argument.
Removing this information is useful for debugging since the backend is less
likely to optimize away variables in the stack no longer used, as a result this
pass should only be run on debug builds of the module.
RemoveExceptionsPass
oneAPI Construction Kit does not support exceptions. However, functions without
the NoUnwind
attribute can still be generated in certain cases. This pass
adds the NoUnwind
attribute to every function in the module, for target code
generators that can’t handle exceptions.
VerifyReqdSubGroupSizeLegalPass & VerifyReqdSubGroupSizeSatisfiedPass
These passes check whether the compiler can handle, and has successfully handled, a kernel with a required sub-group size.
The VerifyReqdSubGroupSizeLegalPass
searches for any kernel with a required
sub-group size and checks whether the device supports such a size. It does this
using the target’s compiler::utils::DeviceInfo
analysis. Any unsupported
size results in a compiler diagnostic, which the compiler can handle (usually
via a build error).
The VerifyReqdSubGroupSizeSatisfiedPass
searches for any kernel entry point
with a required sub-group size and checks whether the vectorizer was able to
satisfy that requirement. As such, it should be run after vectorization. A
compiler diagnostic is raised for each kernel for which this does not hold.
ReplaceTargetExtTysPass
The ReplaceTargetExtTysPass
pass replaces certain target extension types found in the
initial compiler IR. It replaces them with new types reported by the
BuiltinInfo::getRemappedTargetExtTy
analysis function. This is conceptually
replacing abstract and target-agnostic opaque types with concrete ones ready
for the target.
This pass can replace any of the following types:
spirv.Image
spirv.Event
spirv.Sampler
It replaces any of the above types across the module, replacing any functions with any of these target extension types as function parameters or return types in-place, i.e., with a new function with the updated function signature.
If the target’s compiler backend is able to handle any of the above types
natively then the target may opt out of this process completely. Note
however that some aspects of the ComputeMux compiler may make assumptions
about some of the above types, such as the type of images passed to any of the
Image Library functions. This means that in such a situation,
it may be required to skip other passes such as the
compiler::ImageArgumentSubstitutionPass
.
ManualTypeLegalizationPass
The ManualTypeLegalizationPass
pass replaces half
operations with
float
operations, inserting conversions as needed. It does this to work
around LLVM issue 73805, where LLVM’s own legalization replaces whole chains of
operations rather than each operation individually, thus leaving out rounding
operations implied by the LLVM IR.
This replacement is only done on targets that promote half
to float
during type legalization. On targets where half
is a native type, or where
half
is known to be promoted using “soft-promotion” rules, LLVM is presumed
to translate half
correctly.
Metadata Utilities
There are several key pieces of metadata used for inter-communication between the oneAPI Construction Kit passes, documented in the ComputeMux Compiler Specification.
In order to avoid hard-coding assumptions about the metadata’s names, number of operands, types of operands, etc., utility functions should be used to access or manipulate the metadata. The specific names and/or operands of these metadata is not guaranteed to be stable between the oneAPI Construction Kit versions.
Attribute Utilities
There are several key attributes used for inter-communication between the oneAPI Construction Kit passes, documented in the ComputeMux Compiler Specification.
The modules/compiler/utils/include/utils/attributes.h
header contains all
such APIs, several of which are given here by way of example:
void setIsKernel(llvm::Function &F)
Adds the
mux-kernel
attribute to functionF
.
void setIsKernelEntryPt(llvm::Function &F)
Adds
"mux-kernel"="entry-point"
attribute to functionF
bool isKernel(const llvm::Function &F)
Returns true if function
F
has amux-kernel
attribute
bool isKernelEntryPt(const llvm::Function &F)
Returns true if function
F
has amux-kernel
attribute with the value"entry-point"
.
void dropIsKernel(llvm::Function &F)
Drops the
mux-kernel
attribute from functionF
, if present.
void takeIsKernel(llvm::Function &ToF, llvm::Function &FromF)
Transfers
mux-kernel
attributes from functionFromF
to function ToF, if present on the old function. Overwrites any such metadata in the new function.
Sub-groups
A implementation of OpenCL C sub-group builtins is provided by the default compiler pipeline.
The OpenCL C sub-group builtins are first translated into the corresponding
ComputeMux builtin functions. These functions are understood by the rest of the
compiler and can be identified and analyzed by the BuiltinInfo
analysis.
A definition of these mux builtins for where the sub-group size is 1 is
provided by BIMuxInfoConcept
used by the DefineMuxBuiltinsPass.
Vectorized definitions of the various sub-group builtins are provided by the VECZ pass, so any target running VECZ (and the above passes) will be able to support sub-groups of a larger size than 1. Note that VECZ does not currently interact “on top of” the mux builtins - it replaces them in the functions it vectorized. This is future work to allow the two to build on top of each other.
If a target wishes to provide their own sub-group implementation they should
provide a derived BIMuxInfoConcept
and override defineMuxBuiltin
for
the sub-group builtins.
Linker support
An interface to the lld
linker is provided through a function. It may be
desirable to call this after the compiler pipeline has produced object code.
This requires that the compiler target is linked with the lld
libraries, as
well as building the lld
when producing the LLVM
libraries.
The interface is provided as a header: compiler/utils/lld_linker.h
. Targets
should additionally link against the compiler-linker-utils
library. The
linker is exposed via the following interface:
llvm::Expected<std::unique_ptr<llvm::MemoryBuffer>> lldLinkToBinary(
const llvm::ArrayRef<uint8_t> rawBinary,
const std::string &linkerScriptStr, const uint8_t *linkerLib,
unsigned int linkerLibBytes,
const llvm::SmallVectorImpl<std::string> &additionalLinkArgs);
The rawBinary
is the object final output, and the function returns the
final output binary. An optional library may be passed in as a binary in
linkerLib
. Targets can pass in additional linker flags in
additionalLinkArgs
. Typical additional flags include -e0
to suppress
warnings for having no entry point.