Introduction
This document introduces users to the ComputeMux compiler pipeline. The reference compiler pipeline performs several key transformations across a number of phases, which can be difficult for new users to follow.
The pipeline is constructed and run in compiler::BaseModule::finalize (overridable but provided by default), which defers to the pure virtual compiler::BaseModule::getLateTargetPasses, which must be implemented by a target.
The pipeline described here is thus the reference model used by all in-tree ComputeMux targets, and serves as a suggestion for how to construct a basic target pipeline. Everything is ultimately customizable by the target.
Objective & Execution Model
The reference compiler pipeline’s objective is to compile incoming LLVM IR modules containing one or more kernel functions to object code, ready for execution when invoked by the host-side runtime. The assumptions placed on the input and output kernels are as follows:
The original kernel is assumed to adhere to an implicit SIMT execution model; it runs once for each work-item in an NDRange.
The final compiled kernel is assumed to be invoked from the host-side runtime once per work-group in the NDRange.
This execution model is referenced elsewhere in documentation as a work-item per thread or core per work-item scheduling model.
The following diagram provides an overview of the main phases of the reference ComputeMux compiler pipeline in terms of the underlying and assumed kernel execution model.
The inner-most function is the original input kernel, which is wrapped by new functions in successive phases, until it is in a form ready to be executed by the ComputeMux driver.
The WorkItemLoopsPass is the key pass which makes some of the implicit parallelism explicit. By introducing work-item loops around each kernel function, the new kernel entry point now runs once per work-group in an NDRange.
Note
Targets adhering to a work-item per thread or core per work-item scheduling model will likely wish to omit the WorkItemLoopsPass.
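As a rough illustration, the entry point produced by the WorkItemLoopsPass can be pictured as the following pseudo C (a sketch only; the names and exact loop structure are illustrative):
// Original kernel: implicitly runs once per work-item.
void foo(global int *a);

// Sketch of the new entry point: runs once per work-group, with explicit
// loops over the work-items in that group.
void foo.work-item-loops(global int *a) {
  for (size_t z = 0; z < local_size_z; z++)
    for (size_t y = 0; y < local_size_y; y++)
      for (size_t x = 0; x < local_size_x; x++) {
        // set the current local ID to (x, y, z), then run the original kernel
        foo(a);
      }
}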
Host
As mentioned, the above documents the reference compiler pipeline. The in-tree Host CPU target uses a different model, in which the driver splits the NDRange into multi-dimensional slices and calls the compiler-generated code once per slice. Each work-group in the slice is then traversed using multi-dimensional work-group loops generated by the AddEntryHookPass.
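A rough sketch of that structure in pseudo C (illustrative only; the Slice type, bounds, and function names are not the actual generated code):
// Sketch: per-slice entry point generated by the AddEntryHookPass.
void foo.entry-hook(void *packed_args, Slice slice) {
  // Multi-dimensional work-group loops over the slice handed to this
  // invocation by the Host driver.
  for (size_t gz = slice.begin[2]; gz < slice.end[2]; gz++)
    for (size_t gy = slice.begin[1]; gy < slice.end[1]; gy++)
      for (size_t gx = slice.begin[0]; gx < slice.end[0]; gx++) {
        // set the current work-group ID to (gx, gy, gz), then run the
        // work-group code (which contains the work-item loops)
        foo.work-group(packed_args);
      }
}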
Compiler Pipeline Overview
With the overall execution model established, we can start to dive deeper into the key phases of the compilation pipeline.
Input IR
The program begins as an LLVM module, either generated by a compiler frontend or deserialized from another intermediate form such as SPIR-V. Kernels in the module are assumed to obey a SIMT programming model, as described earlier in Objective & Execution Model.
Simple fix-up passes take place at this stage: the IR is massaged to conform to specifications or to fix known deficiencies in earlier representations.
Adding Metadata/Attributes
ComputeMux IR metadata and attributes are attached to kernels. This information is used by following passes to identify certain aspects of kernels which are not otherwise attainable or representable in LLVM IR.
The TransferKernelMetadataPass and EncodeKernelMetadataPass are responsible for adding this information.
See the ComputeMux Compiler Specification for details about specific metadata and attributes.
Early transformation passes and optimizations take place at this stage. Of note is LLVM’s SLP vectorizer.
Note
This takes place in BaseModule::finalize before calling BaseModule::getLateTargetPasses, so metadata and attributes can be assumed to be present as an input to that method.
Early Vectorization Passes
These passes are activated when the VectorizeSLP and VectorizeLoop bits are set in clang::CodeGenOptions (corresponding to the -cl-vec={none|loop|slp|all} command-line options). These options enable sequences of standard LLVM passes that attempt to vectorize the kernel for a single work-item. The vectorization passes used are Loop Vectorization, SLP Vectorization, and Load/Store Vectorization. Some basic simplification passes are also applied (Loop Rotation, Instruction Combine, CFG Simplification and Dead Code Removal). Loop Rotation was found to be necessary for Loop Vectorization to work as expected; it can also generate redundant code when the number of iterations is known at compile time to be a multiple of the vector width, which CFG Simplification is able to clean up. Load/Store Vectorization is applied if either the Loop or SLP option is selected.
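For example, a per-work-item loop of the following shape is what these passes target (illustrative only; whether it actually vectorizes depends on the target and the selected -cl-vec option):
kernel void scale(global float *data, uint n) {
  size_t id = get_global_id(0);
  // A loop within a single work-item: Loop Rotation puts it into a form
  // the Loop Vectorizer can handle, which then widens the body so each
  // iteration processes several elements at once.
  for (uint i = 0; i < n; i++) {
    data[id * n + i] *= 2.0f;
  }
}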
Whole Function Vectorization
The Vecz whole-function vectorizer is optionally run.
Note that Vecz may perform its own scalarization, depending on the options passed to it, potentially undoing the work of any of the Early Vectorization Passes, although it is able to preserve or even widen pre-existing vector operations in many cases.
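Conceptually, whole-function vectorization turns a kernel that computes one work-item into one that computes several at once, along these lines (a sketch: the real transformation operates on IR and handles control flow, masking and scatter/gather; the vectorized kernel's name is illustrative):
// Original scalar kernel: one work-item per invocation.
kernel void add4(global int *a) {
  size_t id = get_global_id(0);
  a[id] += 4;
}

// Sketch of a vectorized form (width 4): four work-items per invocation.
kernel void __vecz_v4_add4(global int *a) {
  size_t id = get_global_id(0);      // ID of the first of the four items
  int4 v = vload4(0, &a[id]);        // loads a[id] .. a[id+3]
  vstore4(v + 4, 0, &a[id]);
}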
Linking Builtins
Abacus builtins are linked into the module by the LinkBuiltinsPass.
This is historically done after whole-function vectorization, with the vectorizer generating scaled-up vector forms of known builtins (e.g., round(float2) -> x4 -> round(float8)).
Note
It is also possible to link most builtins before vectorization, where the vectorizer will rely on inlining of any used builtins, and vectorize accordingly. Some builtins must still be linked after vectorization, however. See the pass’s documentation for more information.
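To make the round(float2) example above concrete, in pseudo C (a sketch only; the exact widening depends on the builtin and the vectorization factor):
// In the scalar kernel, a work-item calls a known builtin:
float2 r = round(v);     // v is a float2

// After whole-function vectorization by 4, the call is widened in place:
float8 r4 = round(v4);   // v4 packs four work-items' float2 values

// The LinkBuiltinsPass then links in the Abacus definition of round(float8)
// so that the widened call has a body.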
Work-item Scheduling & Barriers
The work-item loops are added to each kernel by the WorkItemLoopsPass.
The kernel execution model changes at this stage to replace some of the implicit parallelism with explicit looping, as described earlier in Objective & Execution Model.
Barrier Scheduling takes place at this stage, as well as Vectorization Scheduling if the vectorizer was run.
Adding Scheduling Parameters
Scheduling parameters are added to builtin functions that need them. These extra parameters are used by software implementations of mux work-item builtins to provide extra data, which is consumed when the builtins are lowered in a later phase.
The AddSchedulingParametersPass is responsible for this transformation.
A concrete example of this is the OpenCL get_global_id(uint) builtin. This is defined (by default) as calling the ComputeMux equivalent __mux_get_global_id, whose function body will have been materialised when linking builtins earlier.
While some hardware may have all of the necessary features for implementing this builtin, many architectures don’t. Thus the software implementation needs to source extra data from somewhere external to the function. This is the role that scheduling parameters fill.
Note
The BuiltinInfo analysis controls which scheduling parameters are added. Targets may override BuiltinInfo to change their scheduling parameters whilst making use of this pass. See the tutorial on Custom Lowering Work-Item Builtins for how this may be accomplished.
Pseudo C code:
size_t __mux_get_global_id(uint);

void foo() {
  size_t id = __mux_get_global_id(0);
}

// The AddSchedulingParametersPass produces the following
// scheduling structures:
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// And this wrapper function
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// And a new version of __mux_get_global_id with scheduling parameters
size_t __mux_get_global_id(uint, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg);
A combination of the ComputeMux driver and the outer work-item loops is responsible for filling in all of the scheduling parameter data. For example:
The WorkItemLoopsPass sets the local ID at each loop level.
The ComputeMux driver externally sets up the work-group information, such as the work-group IDs and sizes.
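A minimal sketch of how these pieces interact (illustrative only; only the innermost X loop is shown, and the surrounding setup is simplified):
// Sketch: inside the work-item loops generated by the WorkItemLoopsPass.
// The MuxWorkGroupInfo pointed to by 'wg' has already been filled in by
// the ComputeMux driver (group IDs, group sizes, global offsets, ...).
for (size_t x = 0; x < local_size_x; x++) {
  wi->local_ids[0] = x;              // the loop sets the current local ID
  foo.mux-sched-wrapper(wi, wg);     // run one work-item
}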
Defining mux Builtins
The bodies of mux builtin function declarations are provided. They may use the extra information passed through the parameters added by the AddSchedulingParametersPass in a previous phase.
The DefineMuxBuiltinsPass performs this transformation.
Some builtins may rely on others to complete their function. These dependencies are handled transitively.
Note
The BuiltinInfo analysis controls how mux builtins are defined. Targets may override BuiltinInfo to change how specific builtins are defined whilst making use of this pass. See the tutorial on Custom Lowering Work-Item Builtins for how this may be accomplished.
Pseudo C code:
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// And this wrapper function
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// and __mux_get_local_id, etc.
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}
Kernel Wrapping
The kernel’s ABI is finalized, ready for it to be called by the ComputeMux driver.
The AddKernelWrapperPass performs this transformation.
Kernel parameters are packed together into an auto-generated struct type. A pointer to this structure is passed as the first parameter to the new kernel. Scheduling parameters such as the work-group info are also preserved and passed to this new kernel.
Scheduling parameters such as the work-item info that do not constitute the kernel ABI are initialized by the wrapper before being passed to the wrapped kernel.
Note
The BuiltinInfo analysis controls scheduling parameters and how they interact with the kernel ABI. Targets may override BuiltinInfo to change how specific builtins are defined whilst making use of this pass. See the tutorial on Custom Lowering Work-Item Builtins for how this may be accomplished.
Pseudo C code:
struct MuxWorkItemInfo { ... };
struct MuxWorkGroupInfo { ... };

void foo(global int *a, double f, MuxWorkItemInfo *, MuxWorkGroupInfo *);

// AddKernelWrapperPass produces the following packed-argument struct:
struct MuxPackedArgs.foo { global int *a; double f; };

// AddKernelWrapperPass produces the following wrapper:
void foo.mux-kernel-wrapper(MuxPackedArgs.foo *args, MuxWorkGroupInfo *wg) {
  // Note - the default behaviour is to stack-allocate MuxWorkItemInfo,
  // leaving all fields uninitialized. The previous compiler passes always
  // 'set' values before they 'get' them. Targets can customize this
  // behaviour: see the tutorial linked above.
  MuxWorkItemInfo wi;
  global int *a = args->a;
  double f = args->f;
  return foo(a, f, &wi, wg);
}
Binary Object Creation
Any final passes are run now before the module is passed off to generate an object file. For ahead-of-time targets, this may involve calling on LLVM to generate an ELF file. For just-in-time targets, nothing further may be done at this stage; compilation is instead deferred until execution time.
When presented with a binary object, the host runtime needs to identify the kernel to call. A common approach used by ComputeMux compiler targets is to run the AddMetadataPass, which helps to encode kernel metadata into the final ELF file. This can then be decoded by the host runtime when the binary is loaded.
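A rough sketch of the runtime side (illustrative only; lookup_symbol and the loop bounds are hypothetical, and the actual metadata decoding is target- and driver-specific):
// Sketch: the host runtime loads the binary, uses the encoded metadata to
// locate the kernel entry point, and invokes it once per work-group.
typedef void (*KernelEntry)(MuxPackedArgs.foo *args, MuxWorkGroupInfo *wg);

KernelEntry entry = (KernelEntry)lookup_symbol(elf, "foo.mux-kernel-wrapper");
for (/* each work-group in the NDRange */) {
  entry(&packed_args, &wg);   // wg holds this work-group's IDs and sizes
}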
Barrier Scheduling
The fact that the WorkItemLoopsPass handles both work-item loops and barriers can be confusing to newcomers. These two concepts are in fact linked. Taking the kernel code below, this section will show how the WorkItemLoopsPass lays out and schedules a kernel’s work-item loops in the face of barriers.
kernel void foo(global int *a, global int *b) {
  // pre barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}
The kernel has one global barrier, and one statement on either side of it. The WorkItemLoopsPass conceptually breaks down the kernel into barrier regions, which constitute the code following the control flow between all barriers in the kernel. The example above has two regions: the first contains the call to get_global_id and the read/update/write of the global memory pointed to by a; the second contains the read/update/write of the global memory pointed to by b.
To correctly observe the barrier’s semantics, all work-items in the work-group need to execute the first barrier region before beginning the second. The WorkItemLoopsPass therefore produces two sets of work-item loops to schedule this kernel, one per barrier region.
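Conceptually, the scheduled kernel looks something like the following pseudo C (a sketch: only the innermost X loops are shown, and the wrapper name is illustrative):
void foo.mux-barrier-wrapper(global int *a, global int *b) {
  // First set of work-item loops: every work-item runs region 0.
  for (size_t x = 0; x < local_size_x; x++) {
    foo.mux-barrier-region.0(a, b);   // a[id] += 4;
  }
  // All work-items have now reached the barrier.
  // Second set of work-item loops: every work-item runs region 1.
  for (size_t x = 0; x < local_size_x; x++) {
    foo.mux-barrier-region.1(a, b);   // b[id] += 4;
  }
}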
Live Variables
Note also that id is a live variable whose lifetime traverses the barrier. The WorkItemLoopsPass creates a structure of live variables which is passed between the successive barrier regions, containing data that needs to be live in future regions.
In this case, however, calls to certain builtins like get_global_id are treated specially and are materialized anew in each barrier region where they are used.
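As an illustration, if the first region computed a value that could not be cheaply rematerialized, it would be stored in a per-work-item live-variables structure along these lines (the structure, field, and helper names here are hypothetical):
// Sketch: values that must survive the barrier live in a per-work-item struct.
struct foo.live-vars { int tmp; };

// Written by the first barrier region...
void foo.mux-barrier-region.0(global int *a, foo.live-vars *live) {
  size_t id = get_global_id(0);            // rematerialized, not stored
  live->tmp = expensive_computation(a[id]);
  a[id] += live->tmp;
}

// ...and read back by the second.
void foo.mux-barrier-region.1(global int *b, foo.live-vars *live) {
  size_t id = get_global_id(0);            // rematerialized again
  b[id] += live->tmp;
}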
Vectorization Scheduling
The WorkItemLoopsPass is responsible for laying out kernels which have been vectorized by the Vecz whole-function vectorizer.
The vectorizer creates multiple versions of the original kernel. Vectorized kernels on their own are generally unable to fulfill work-group scheduling requirements, as they operate only on a number of work-items equal to a multiple of the vectorization factor. As such, for the general case, several kernels must be combined to cover all work-items in the work-group; the WorkItemLoopsPass is responsible for this.
Note
The following diagram uses a vectorization width of 4.
For brevity, the diagram below only details the inner-most work-item loops. Most kernels will in reality have 2 outer levels of loops over the full Y and Z work-group dimensions.
In the above example, the vectorized kernel is called to execute as many work-items as possible, up to the largest multiple of the vectorization factor less than or equal to the work-group size.
If there are work-items remaining (i.e., if the work-group size is not a multiple of 4), the original scalar kernel is called on the (up to 3) remaining work-items. These remaining work-items are typically called the ‘peel’ iterations by ComputeMux.
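In pseudo C, the combined layout of the innermost work-item loops looks roughly like this (a sketch, assuming a vectorization width of 4; the kernel names are illustrative):
size_t x = 0;
// Main loop: the vectorized kernel covers 4 work-items per iteration.
for (; x + 4 <= local_size_x; x += 4) {
  foo.vectorized(/* work-items x .. x+3 */);
}
// 'Peel' loop: the original scalar kernel covers the 0-3 remaining items.
for (; x < local_size_x; x++) {
  foo.scalar(/* work-item x */);
}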
Note that other vectorized kernel layouts are possible. See the documentation for the WorkItemLoopsPass to find out other possibilities.