Codeplay’s reference ComputeMux target, implemented for the host CPU (x86, Arm & AArch64).
## Host Device
The information reported by a host device can vary depending on the build configuration of the oneAPI Construction Kit. See the [Developers Guide](developer-guide.md#oneapi-construction-kit-cmake-options) for details on the effects of host specific CMake options.
### Interesting Properties
The Host implementation of ComputeMux can get away with various assumptions that a ComputeMux implementation for an independent process cannot.
Assumptions possible for a “host” ComputeMux implementation:

- `sizeof(void*)` is the same on host and device.
- `sizeof(size_t)` is the same on host and device.
- All memory is shared and coherent between host and device.
- All address spaces map to the same physical memory.
- A single function pointer can be executed on host and device, i.e. we can compile kernels directly to memory without needing to reload them.
- Compiler-rt type functions will be present without building them, e.g. `__udivsi3` or `__floatdidf`.
Other properties for a “host” ComputeMux implementation:

- Running kernels takes CPU resources, so the user application, the runtime, and the kernels are all competing for resources. E.g. an OpenCL application running on a GPU-style device may busy-wait to get low latency, but for “host” that wastes a CPU that could be used for compute.
- Because “host” does not need to upload kernels to a separate memory it is conceptually easier to defer compilation until close to the point of execution.
### Float Support
On 32-bit Arm builds we run both half and single precision floats on NEON, which has Flush To Zero (FTZ) behaviour. As a result the host device doesn’t report support for denormal numbers. This does not apply to double precision floats, where denormals are supported.
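As a standalone illustration (not part of the oneAPI Construction Kit sources), an OpenCL application can observe this by checking the `CL_FP_DENORM` bit of the device’s floating point configuration:

```cpp
#include <CL/cl.h>
#include <cstdio>

// Hypothetical standalone check: reports whether a device advertises denormal
// support for single and double precision. Error handling is omitted.
int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);

  cl_device_fp_config single_fp = 0, double_fp = 0;
  clGetDeviceInfo(device, CL_DEVICE_SINGLE_FP_CONFIG, sizeof(single_fp),
                  &single_fp, nullptr);
  clGetDeviceInfo(device, CL_DEVICE_DOUBLE_FP_CONFIG, sizeof(double_fp),
                  &double_fp, nullptr);

  // On a 32-bit Arm host build we would expect CL_FP_DENORM to be clear for
  // single precision (NEON is FTZ) but set for double precision (VFP).
  std::printf("single denorms: %s\n", (single_fp & CL_FP_DENORM) ? "yes" : "no");
  std::printf("double denorms: %s\n", (double_fp & CL_FP_DENORM) ? "yes" : "no");
  return 0;
}
```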
### Compilation Options
On builds where both `CA_ENABLE_DEBUG_SUPPORT` is set and a compiler is available, the host device reports the following custom build options for the `compilation_options` member of `mux_device_info_s`.
```
$ ./clc --help
ComputeAorta x86_64 device specific options:
  --dummy-host-flag          no-op build flag
  --dummy-host-flag2         no-op build flag
  --dummy-host-option value
                             no-op option which takes a value
```
These are provided to test the mechanism for reporting and setting device
specific build options. The only effect they have on kernel compilation is
being propagated as program metadata, to assist testing so we can check options
have been correctly parsed. The LLVM metadata node `host.build_options` is set to a string matching the contents of `compiler::Options::device_args`.
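A rough sketch of reading that metadata back with LLVM’s C++ API, assuming each operand of the named metadata node wraps a single `MDString` (the usual layout for named module metadata):

```cpp
#include <llvm/IR/Metadata.h>
#include <llvm/IR/Module.h>
#include <llvm/Support/raw_ostream.h>

// Sketch: print the host.build_options metadata from a module, assuming each
// operand of the named metadata node wraps a single MDString.
void printHostBuildOptions(const llvm::Module &M) {
  if (const llvm::NamedMDNode *NMD = M.getNamedMetadata("host.build_options")) {
    for (const llvm::MDNode *Node : NMD->operands()) {
      if (Node->getNumOperands() == 0) continue;
      if (auto *Str = llvm::dyn_cast<llvm::MDString>(Node->getOperand(0))) {
        llvm::outs() << Str->getString() << "\n";
      }
    }
  }
}
```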
### Performance Counters
Support for counter type queries is implemented in host with PAPI, a low-level performance counter API. This support can be enabled with the `CA_HOST_ENABLE_PAPI_COUNTERS` CMake option, and it requires that the PAPI development libraries can be found on the system. PAPI can be built on Windows, but the way we measure on our worker threads is platform specific, so PAPI performance counters are currently only supported on Linux. Before setting up a query pool with a list of counters you want to measure, you should query the available counters and pass the selection you want to use to the PAPI utility `papi_event_chooser`, which will tell you if your chosen events are compatible with one another. See the PAPI documentation for a general overview and detailed API documentation.
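For context, a minimal standalone PAPI measurement, independent of the oneAPI Construction Kit and assuming the preset events below exist on the machine, looks roughly like this:

```cpp
#include <papi.h>
#include <cstdio>

// Minimal PAPI sketch: count total instructions and cycles around a workload.
// Event availability is machine dependent; check with papi_avail and
// papi_event_chooser before relying on a particular combination.
int main() {
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

  int event_set = PAPI_NULL;
  PAPI_create_eventset(&event_set);
  PAPI_add_named_event(event_set, "PAPI_TOT_INS");
  PAPI_add_named_event(event_set, "PAPI_TOT_CYC");

  long long counts[2] = {0, 0};
  PAPI_start(event_set);
  volatile double x = 0.0;
  for (int i = 0; i < 1000000; ++i) x += i * 0.5;  // workload to measure
  PAPI_stop(event_set, counts);

  std::printf("instructions: %lld, cycles: %lld\n", counts[0], counts[1]);
  return 0;
}
```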
## Host Binaries
Host can generate and accept binary executables, possibly containing multiple kernels each. They use the object file format produced by LLVM’s TargetMachine to store the executable code after optimizations and other LLVM passes, which is then relinked into the running program using LLVM’s RuntimeDyld dynamic linker wrapper. The object file has an additional section called `.notes`, which stores information about the kernels such as their names and local memory usage.
### `.notes` section binary format
The `.notes` section contains a valid Metadata API binary. You should use the metadata API to deserialize any metadata it contains. The metadata in this section is created by the AddMetadataPass.
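As an illustrative sketch using LLVM’s generic object-file reader (not the Construction Kit’s own metadata API; section accessors return `Expected` values in recent LLVM releases), the raw bytes of the section can be located like this:

```cpp
#include <llvm/Object/ObjectFile.h>
#include <llvm/Support/Error.h>

// Sketch: find the .notes section in a host binary and return its raw bytes,
// ready to be handed to a metadata deserializer.
llvm::Expected<llvm::StringRef>
getNotesSection(const llvm::object::ObjectFile &Obj) {
  for (const llvm::object::SectionRef &Sec : Obj.sections()) {
    llvm::Expected<llvm::StringRef> Name = Sec.getName();
    if (!Name) return Name.takeError();
    if (*Name == ".notes") return Sec.getContents();
  }
  return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                 "no .notes section found");
}
```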
## Host Scheduled Kernel
For host, scheduled kernels take a kernel residing within its own LLVM module, and (a sketch of the generated wrappers follows this list):

- Assert that the local work sizes for the `x`, `y` & `z` dimensions are `1`
- Clone the module into a new LLVM context owned by the host scheduled kernel
- Add declarations for work item functions that aren’t used in the kernel, but will be called by other work item functions that are used
- Change the function signature for all function definitions to include a new work group information parameter
- Add definitions for `get_local_id` and `get_local_size` based on local work sizes
- Add definitions for `get_global_size`, `get_global_offset` & `get_group_id` based on the work group information parameter
- Add a definition for `get_global_id` calculated from `get_local_id` and `get_group_id`
- Create a packed struct of the parameter types for the kernel
- Add a new kernel wrapper function that takes the packed struct, unpacks each parameter and then calls the actual kernel
- Add a new wrapper function that loops over the `x`, `y` & `z` global dimensions; the `x`, `y` & `z` values are set into the work group information parameter
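A loose C++ sketch of the packed-argument wrapper’s shape, with made-up names such as `packed_args_t` and `wg_info_t` (the real pass generates LLVM IR, not C++ source):

```cpp
#include <cstddef>

// Hypothetical stand-ins; the real types live inside the host target.
struct wg_info_t {
  size_t group_id[3];
  size_t global_offset[3];
};

// Packed struct of the kernel's parameter types, as described above.
struct packed_args_t {
  const float *in;
  float *out;
};

// The kernel after its signature gained the work group information parameter.
// With local sizes asserted to 1, get_global_id(0) reduces to
// group_id[0] + global_offset[0].
static void my_kernel(const float *in, float *out, const wg_info_t *wg) {
  const size_t gid = wg->group_id[0] + wg->global_offset[0];
  out[gid] = in[gid] * 2.0f;
}

// Generated wrapper: take the packed struct, unpack each parameter, and call
// the actual kernel. The outer wrapper (not shown) loops over the x, y & z
// dimensions, writing them into the work group information parameter before
// each call.
static void my_kernel_packed(packed_args_t *args, wg_info_t *wg) {
  my_kernel(args->in, args->out, wg);
}
```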
## LLVM Passes
### HostBIMuxInfo
The host target subclasses `BIMuxInfoConcept` via `HostBIMuxInfo` to override the list of scheduling parameters added to functions, used to lower work-item builtins and add work-group scheduling loops.
`HostBIMuxInfo` is used to add the three scheduling parameters detailed below to kernel entry points and work-item builtins via the AddSchedulingParametersPass, to lower work-item builtins in the DefineMuxBuiltinsPass, and to initialize these custom parameters in the AddKernelWrapperPass.
In addition to the default work-item info struct, the host target adds two custom structures.
#### Schedule Info
The `Mux_schedule_info_s` structure (or “scheduling info” structure) is a kernel ABI parameter for the host target. It must therefore be passed to the kernel by the driver.
It is largely a copy of the default work-group info structure, but with two additional parameters - `slice` and `total_slices` - to help construct the work-group scheduling loops.
```c
struct Mux_schedule_info_s {
  size_t global_size[3];
  size_t global_offset[3];
  size_t local_size[3];
  size_t slice;
  size_t total_slices;
  uint32_t work_dim;
};
```
#### Mini Work-Group Info
Since many of the default work-group info fields are present in `Mux_schedule_info_s`, the “mini work-group info” struct contains only the group id and the number of groups.
```c
struct MiniWGInfo {
  size_t group_id[3];
  size_t num_groups[3];
};
```
This structure does not present itself as an ABI parameter. Its `num_groups` fields are initialized from calculations on `Mux_schedule_info_s`, and its `group_id` fields are initialized by the AddEntryHookPass at each level of the work-group loops.
### AddEntryHookPass
AddEntryHookPass performs work-group scheduling. It adds scheduling code inside a new kernel wrapper function which calls the previous kernel entry function for the work-groups in a work-group slice. A “work-group slice” is defined here as a 3D set of work-groups covering the Z and Y dimensions and a subset of the X dimension, split evenly across the number of threads used by the host mux target.
This pass assumes that the AddSchedulingParametersPass has been run, and that the necessary scheduling parameters have been added to kernel entry points, detailed above.
Once (up to) three levels of work-group loops have been added, the MiniWGInfo structure’s group_id fields are updated by the scheduling code in each loop level before the call to the original kernel.
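A rough sketch of that scheduling scheme, written as plain C++ with made-up names rather than the IR the pass actually emits (the exact slicing policy of the real pass may differ):

```cpp
#include <cstddef>
#include <cstdint>

// Copies of the structures above, repeated so this sketch is self-contained.
struct Mux_schedule_info_s {
  size_t global_size[3];
  size_t global_offset[3];
  size_t local_size[3];
  size_t slice;
  size_t total_slices;
  uint32_t work_dim;
};

struct MiniWGInfo {
  size_t group_id[3];
  size_t num_groups[3];
};

using kernel_fn = void (*)(void *packed_args, Mux_schedule_info_s *sched,
                           MiniWGInfo *wg);

// Sketch of the scheduling code AddEntryHookPass conceptually emits: each
// worker thread is handed a slice index and runs an even share of the X
// work-groups together with all of the Y and Z work-groups.
void entry_hook(kernel_fn kernel, void *packed_args,
                Mux_schedule_info_s *sched, MiniWGInfo *wg) {
  const size_t groups_x = sched->global_size[0] / sched->local_size[0];
  const size_t groups_y = sched->global_size[1] / sched->local_size[1];
  const size_t groups_z = sched->global_size[2] / sched->local_size[2];
  wg->num_groups[0] = groups_x;
  wg->num_groups[1] = groups_y;
  wg->num_groups[2] = groups_z;

  // Split the X dimension evenly across total_slices threads; the last slice
  // picks up any remainder.
  const size_t per_slice = groups_x / sched->total_slices;
  const size_t begin_x = sched->slice * per_slice;
  const size_t end_x = (sched->slice + 1 == sched->total_slices)
                           ? groups_x
                           : begin_x + per_slice;

  for (size_t z = 0; z < groups_z; ++z)
    for (size_t y = 0; y < groups_y; ++y)
      for (size_t x = begin_x; x < end_x; ++x) {
        // group_id is updated in each loop level before the call.
        wg->group_id[0] = x;
        wg->group_id[1] = y;
        wg->group_id[2] = z;
        kernel(packed_args, sched, wg);
      }
}
```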
### AddFloatingPointControlPass
AddFloatingPointControlPass is a hardware-aware pass which calls specialized hardware helper functions to do its work. If the targeted architecture isn’t supported then the pass exits early without modifying the module.
The pass is designed to set the hardware’s floating point control register in a wrapper before calling the kernel, then afterwards restore the register to its original state before exiting.
An important configuration in the pass is setting FTZ (flush to zero) behaviour for floating point denormal numbers. FTZ mode means that numbers smaller in magnitude than the smallest normal value, called denormals, are treated as 0.0. Disabling it gives us extra precision at the cost of performance.
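For a concrete sense of the magnitudes involved, a standalone illustration (unrelated to the pass itself):

```cpp
#include <cstdio>
#include <limits>

int main() {
  // Smallest normal and smallest denormal single precision values. Under FTZ
  // anything below the normal minimum (~1.18e-38) is treated as 0.0f.
  std::printf("smallest normal float:   %g\n", std::numeric_limits<float>::min());
  std::printf("smallest denormal float: %g\n",
              std::numeric_limits<float>::denorm_min());
  return 0;
}
```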
#### Arm
In our host implementation this pass only affects double precision floats, since our single precision float operations are run on NEON, which is FTZ by design and cannot be changed. This is configured by using the compiler flag `+neonfp` when creating a host finalizer. Double precision floats are run on the VFP unit.
To disable FTZ we clear the flush-to-zero bit in the floating point status and control register (FPSCR). This is implemented in the pass using the LLVM intrinsic `arm_set_fpscr` to zero out the whole FPSCR, disabling FTZ and setting the rounding mode to ‘round to nearest even’. The pass also uses the intrinsic `arm_get_fpscr` on kernel entry to grab the previous FPSCR state so it can be restored again on exit with `arm_set_fpscr`.
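A hedged sketch of the kind of wrapper the pass builds, written with LLVM’s C++ API (the exact pass plumbing is omitted, and `Intrinsic::getDeclaration` is named `getOrInsertDeclaration` in the newest LLVM releases):

```cpp
#include <llvm/ADT/SmallVector.h>
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/Intrinsics.h>
#include <llvm/IR/IntrinsicsARM.h>
#include <llvm/IR/Module.h>

using namespace llvm;

// Sketch: build a wrapper that zeroes the FPSCR (FTZ off, round to nearest
// even) around a call to the kernel and restores the saved value afterwards.
// Assumes the wrapper shares the kernel's signature and returns void.
void buildFpscrWrapper(Module &M, Function &Kernel, Function &Wrapper) {
  IRBuilder<> B(BasicBlock::Create(M.getContext(), "entry", &Wrapper));

  Function *GetFpscr = Intrinsic::getDeclaration(&M, Intrinsic::arm_get_fpscr);
  Function *SetFpscr = Intrinsic::getDeclaration(&M, Intrinsic::arm_set_fpscr);

  // Save the current FPSCR, then zero it before calling the kernel.
  Value *Saved = B.CreateCall(GetFpscr);
  B.CreateCall(SetFpscr, {B.getInt32(0)});

  SmallVector<Value *, 8> Args;
  for (Argument &A : Wrapper.args()) Args.push_back(&A);
  B.CreateCall(&Kernel, Args);

  // Restore the original FPSCR state on exit.
  B.CreateCall(SetFpscr, {Saved});
  B.CreateRetVoid();
}
```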
#### AArch64
In the AArch64 LLVM target there are no intrinsics for getting or setting the FPCR register, and as a result we need to do it using inline assembly. This involves linking the LLVMAArch64AsmParser library, which is extra overhead. Therefore, if intrinsics become available, we should reimplement this functionality to use them instead of inline assembly.
Unlike Arm, both single and double precision floating point instructions are run on the same, now unified, VFP/NEON unit.
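For illustration, the FPCR can be read and written with IR-level inline assembly roughly like this (helper names are made up; the real pass’s constraint strings and types may differ):

```cpp
#include <llvm/IR/IRBuilder.h>
#include <llvm/IR/InlineAsm.h>

using namespace llvm;

// Sketch: read and write the AArch64 FPCR via IR-level inline assembly as a
// stand-in for the missing intrinsics, using plain "=r"/"r" register
// constraints.
Value *readFPCR(IRBuilder<> &B) {
  auto *FT = FunctionType::get(B.getInt64Ty(), /*isVarArg=*/false);
  InlineAsm *IA =
      InlineAsm::get(FT, "mrs $0, fpcr", "=r", /*hasSideEffects=*/true);
  return B.CreateCall(IA);
}

void writeFPCR(IRBuilder<> &B, Value *V) {
  auto *FT =
      FunctionType::get(B.getVoidTy(), {B.getInt64Ty()}, /*isVarArg=*/false);
  InlineAsm *IA =
      InlineAsm::get(FT, "msr fpcr, $0", "r", /*hasSideEffects=*/true);
  B.CreateCall(IA, {V});
}
```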
#### X86
On x86 and x86_64 the MXCSR register is used to configure the behaviour of the SSE instructions on which we run floating point computations.
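For reference, the equivalent control from ordinary host C++ code (the pass itself manipulates MXCSR from generated IR instead) looks like this:

```cpp
#include <xmmintrin.h>  // _mm_getcsr / _mm_setcsr
#include <cstdio>

int main() {
  // Save the current MXCSR so it can be restored afterwards, mirroring what
  // the pass does around a kernel invocation.
  const unsigned saved = _mm_getcsr();

  // Bit 15 is flush-to-zero (FTZ); bit 6 is denormals-are-zero (DAZ).
  _mm_setcsr(saved & ~(0x8000u | 0x0040u));  // denormals enabled
  std::printf("MXCSR with FTZ/DAZ cleared: 0x%x\n", _mm_getcsr());

  _mm_setcsr(saved);  // restore original state
  return 0;
}
```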
### DisableNeonAttributePass
The DisableNeonAttribute module pass is specific to 64-bit Arm and exits straight away for other targets. Its purpose is to detect whether there are any `llvm::UIToFP` or `llvm::SIToFP` vector cast instructions from `[u]long` to `float`. If so, all the functions in the module are marked with the attribute `-neon` to disable Neon in the backend. We have to do this for all functions, rather than just the function containing the instruction, to prevent a calling convention ABI mismatch.
Disabling Neon is needed to fix a failing CTS conversions test for `[u]long` to `float`, where the AArch64 Neon backend does the conversion in two stages, i64 to double then double to float, losing precision because of incorrect rounding in the intermediate value.
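A loose sketch of the detection logic (not the actual pass implementation; in particular, the exact way the `-neon` feature is attached to functions is simplified here):

```cpp
#include <llvm/IR/InstIterator.h>
#include <llvm/IR/Instructions.h>
#include <llvm/IR/Module.h>

using namespace llvm;

// Sketch: if any vector UIToFP/SIToFP cast goes from 64-bit integers to float,
// mark every function so Neon is disabled in the backend. All functions must
// agree to avoid a calling convention ABI mismatch.
bool disableNeonIfNeeded(Module &M) {
  bool Found = false;
  for (Function &F : M) {
    for (Instruction &I : instructions(F)) {
      if (!isa<UIToFPInst>(&I) && !isa<SIToFPInst>(&I)) continue;
      auto *SrcTy = dyn_cast<VectorType>(I.getOperand(0)->getType());
      auto *DstTy = dyn_cast<VectorType>(I.getType());
      if (SrcTy && DstTy && SrcTy->getScalarSizeInBits() == 64 &&
          DstTy->getScalarType()->isFloatTy()) {
        Found = true;
        break;
      }
    }
    if (Found) break;
  }
  if (Found) {
    for (Function &F : M) {
      // A real pass would merge this with any existing target-features string.
      F.addFnAttr("target-features", "-neon");
    }
  }
  return Found;
}
```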