Targets may support a number of bespoke instructions which don’t typically map directly to IR generated through the normal compilation process. These include:
Operations provided by the hardware that perform ‘standard’ operations. Standard operations are those created from the normal compilation process or from math builtins which the hardware supports. Many common operations, such as fused multiply-add and transcendental math functions (including their vector variants), map in this way.
Proprietary operations. Examples include DMA or thread control.
Standard operations should be handled either in the target backend compiler or by an additional pass in the ComputeMux compiler pipeline. Such a pass can look for patterns of IR and replace them with LLVM intrinsics which the target supports. Alternatively, it could emit a call to a function that is resolved when a linker such as lld is invoked. An example of such a replacement is matching the pattern of a DSP-style instruction which converts a float to a fixed-point integer and stores it to memory.
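As a rough illustration, the source-level pattern below is the kind of sequence such a pass might match in IR form (the Q16.16 format and function name are assumptions for the example, not taken from any particular target):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical C++ equivalent of the pattern a backend pass might match:
// convert a float to Q16.16 fixed point and store it. A DSP with a
// combined convert-and-store instruction could replace the multiply,
// round, and store below (fmul + fptosi + store in LLVM IR) with a
// single operation.
void store_fixed_q16_16(float x, int32_t *dst) {
  // Scale by 2^16 and round to nearest before the integer store.
  *dst = static_cast<int32_t>(std::lround(x * 65536.0f));
}
```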
The IR patterns and types may not exactly match the intrinsics and may require splitting vectors or casting types. If the whole function vectorizer is used, vector types not present in the original kernel may be created.
If this is handled in the backend, then nothing special needs to be done. If intrinsic mapping is required in the ComputeMux passes, then a bespoke pass may be written which maps onto intrinsics defined in a target-specific intrinsics file in LLVM, as described in Adding a new intrinsic function.
Proprietary operations such as DMA or thread control may map onto existing builtins provided by languages such as OpenCL. This requires passes that look for DMA-style operations in builtins such as the OpenCL async_work_group_copy builtin or similar extensions, as well as builtins that Codeplay has defined. See DMA for more details.
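The observable behaviour a DMA-replacement pass must preserve is a bulk copy between memory regions. The following host-side C++ sketch models only those semantics (it is not the OpenCL builtin itself, which also returns an event and runs on the device):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Host-side model of the data movement performed by OpenCL's
// async_work_group_copy: a bulk copy of num_elements items that a
// target could hand off to a DMA engine. A replacement pass matches
// the builtin call and substitutes the target's DMA intrinsic; the
// visible result is equivalent to this copy.
template <typename T>
void model_async_work_group_copy(T *dst, const T *src,
                                 size_t num_elements) {
  // On real hardware this would program a DMA transfer and return an
  // event to wait on; here we model only the memory traffic.
  std::memcpy(dst, src, num_elements * sizeof(T));
}
```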
Complex operations or intrinsics may be captured in bespoke extensions which effectively extend languages such as OpenCL to support additional builtin functions. This can be used to call intrinsics directly, or to allow function calls to be replaced with other code, such as a call to a DMA function.
Mapping to existing OpenCL builtins
The simplest way to expose a particular instruction from the hardware is to find a functionally equivalent standard OpenCL builtin function (e.g. fma, which performs a fused multiply-add operation) and to ensure that compiling this builtin function results in the desired instruction being generated by the compiler in the kernel object. As a result, the user does not need to use a non-standard function, and existing kernel code can be executed more efficiently on the hardware without requiring any changes to the kernel itself.
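For instance, the fma builtin requires a single rounding of a * b + c, which is exactly what a hardware fused multiply-add instruction provides; std::fma models the same semantics on the host:

```cpp
#include <cassert>
#include <cmath>

// The OpenCL fma builtin computes a * b + c with a single rounding
// step. When the hardware has a fused multiply-add instruction,
// mapping the builtin to that instruction (rather than linking a
// software implementation) is the replacement described above.
double fused_multiply_add(double a, double b, double c) {
  return std::fma(a, b, c);
}
```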
Since many OpenCL builtins provided by the oneAPI Construction Kit are implemented “in software” (for example through the Abacus math library), this involves replacing calls to selected OpenCL builtins with a custom LLVM IR implementation. A common approach is to create a new “builtin replacement” LLVM pass that looks for calls to a set of specific builtins and generates LLVM IR to replace those calls, e.g. using compiler intrinsics. The pass is then added to the ComputeMux target’s compilation pipeline prior to the stage where the oneAPI Construction Kit-provided builtin functions are linked with the kernel. Finally, the target’s LLVM backend generates optimized machine code from the LLVM IR that the pass emitted for the builtin.
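The matching logic of such a pass can be sketched schematically. The snippet below is a deliberately simplified model using plain strings, not the real LLVM API; an actual pass would walk CallInst users of each Function and emit intrinsic calls, and the intrinsic name shown is hypothetical:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Schematic model of a "builtin replacement" pass: walk the calls in a
// kernel and rewrite those whose callee appears in a replacement table.
// A real pass would operate on llvm::CallInst and create intrinsic
// calls; here calls are just callee names.
std::vector<std::string> replace_builtin_calls(
    const std::vector<std::string> &calls,
    const std::map<std::string, std::string> &replacements) {
  std::vector<std::string> out;
  for (const auto &callee : calls) {
    auto it = replacements.find(callee);
    // Matched builtins are rewritten to the target intrinsic; all other
    // calls are left for the normal builtin-linking stage.
    out.push_back(it != replacements.end() ? it->second : callee);
  }
  return out;
}
```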
Creating New Custom Builtin Functions That Map To Compiler Intrinsics
In order to expose instructions that are proprietary or are too complex for the compiler to generate from specific code patterns in the kernel function, new custom builtin functions can be created. These functions can be called by the user from within a kernel the same way OpenCL and SYCL builtin functions are called. The ComputeMux target exposing these custom builtins will then turn calls to these functions into other code, most commonly calls to a compiler intrinsic that is recognized by the target’s LLVM backend. The LLVM backend in turn will replace the intrinsic call with instructions that are specific to the hardware being targeted.
This can be done both at the OpenCL and SYCL level, each through a different pathway.
Creating New Custom OpenCL Builtins
With the first pathway, new custom builtin functions can be created and then used from OpenCL C kernels. A so-called “force include” header file is configured within the ComputeMux target, which is then automatically and transparently included when compiling any OpenCL C kernel from source. Compilation could be done using the OpenCL C API or with the oneAPI Construction Kit compiler tools such as oclc and clc. Declarations of the new builtin functions to expose can be added to this header using standard OpenCL C syntax (including data types and address spaces, for example).
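A force-include header of this kind might contain declarations along the following lines; the function names, parameter types, and address spaces here are purely illustrative, not part of any real target:

```c
// Hypothetical declarations added to a target's "force include" header.
// All names below are invented for the example.

// Convert a float to Q16.16 fixed point using the DSP's conversion unit.
int my_target_float_to_fixed(float x);

// Issue a DMA transfer of size bytes from global to local memory.
void my_target_dma_copy(__local char *dst, __global const char *src,
                        size_t size);
```

At compile time, a matching ComputeMux pass would then lower calls to these declarations into the target's intrinsics, as described below.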
Once the “force include” header has been written and is used during the oneAPI Construction Kit build process, an LLVM pass which turns calls to these builtin functions into the appropriate LLVM IR code (e.g. an intrinsic call) can then be added to the ComputeMux target’s compilation pipeline. This pass acts as the bridge between the front-end (the header containing the builtin declarations) and the compiler backend that can generate the desired proprietary instructions.
Since new custom OpenCL builtins are inherently non-standard, a new OpenCL vendor extension should be created for the ComputeMux target to advertise the presence of these custom functions.
Creating New Custom SYCL Builtins
Creating and exposing new custom builtins so that they can be used from SYCL kernels follows a similar process to OpenCL, with some differences. Builtin declarations can be added to a C++ SYCL header that is specific to the target. This header needs to be explicitly included by the kernel source file and used by the ComputeCpp device compiler to generate declarations for these builtin functions when compiling the kernel to SPIR-V.
SPIR-V code generated by ComputeCpp is then translated into LLVM IR by the oneAPI Construction Kit and passed to the ComputeMux target. From that point on, the compilation process is the same for custom builtins exposed through OpenCL and SYCL: an LLVM pass is needed to translate custom builtin calls into compiler intrinsics.
In some cases, SPIR-V extensions to add new custom SPIR-V opcodes may be necessary. Please refer to the ComputeCpp documentation for more details on the SPIR-V compilation process.