This is version 0.73.0 of the specification.
ComputeMux is Codeplay’s proprietary API for executing compute workloads across heterogeneous devices. ComputeMux is an extremely lightweight, bare-to-the-metal abstraction on modern hardware, and is used to power a suite of open source standards. ComputeMux consists of a Runtime Specification and Compiler Specification. This document refers to the Compiler component.
Throughout this specification, ComputeMux has been shortened to Mux for brevity, but for the avoidance of all doubt, “Mux” refers to “ComputeMux”.
Glossary
The key statements must, must not, shall, shall not, should, should not, may, in this document are to be interpreted as described in IETF RFC 2119.
must or shall - means that the definition is an absolute requirement of the specification.
must not or shall not - means that the definition is an absolute prohibition of the specification.
should - means that the definition is highly likely to occur, but there are extraneous circumstances that could result in it not occurring.
should not - means that the definition is highly likely to not occur, but there are extraneous circumstances that could result in it occurring.
may - means that an item is optional.
Introduction
This document describes the Mux compiler API, how it is used, and requirements
to follow on its usage. The Mux compiler API is a C++ API that provides the
compiler implementation used by a particular customer target (specified by a
mux_device_info_t
from the Runtime Specification).
This specification describes a number of C++ classes with methods that a customer compiler must implement. Some entry points are pure virtual methods and are therefore mandatory, but others already have base implementations which must be called if the function is overridden by a customer compiler. Note that the oneAPI Construction Kit does not use exceptions, and therefore there are no exception guarantee requirements.
Cargo
Cargo is the oneAPI Construction Kit’s STL like container library used by a number of compiler API methods that conforms to stricter memory requirements than the C++ STL e.g. constructors do not allocate memory and exceptions are never thrown from any container.
Context
The Context
object serves as an opaque wrapper over the LLVM context
object. This object satisfies the C++ named requirement ‘Lockable’, and must
be locked when the compiler is interacting with a specific instance of LLVM.
Context
exposes the following interface:
namespace compiler {
class Context {
public:
bool isValidSPIR(cargo::array_view<const uint8_t> binary);
bool isValidSPIRV(cargo::array_view<const uint32_t> code);
cargo::expected<spirv::SpecializableConstantsMap, std::string>
getSpecializableConstants(cargo::array_view<const uint32_t> code);
void lock();
bool try_lock();
void unlock();
};
}
The compiler::Context
object is fully implemented by the Mux compiler
library, and customer targets are not required to implement it.
Info
An Info
object is a description of a particular compiler implementation
that can be used to compile programs for a particular mux_device_info_t
.
Info
objects are expected to have a static get
method that returns a
singleton instance containing information about the compiler capabilities and
metadata, similar to mux_device_info_t
objects in the Runtime
Specification.
namespace compiler {
struct Info {
public:
virtual std::unique_ptr<compiler::Target> createTarget(
compiler::Context *context,
cargo::optional<mux_device_t> device,
mux_allocator_info_t allocator_info) = 0;
builtins::file::capabilities_bitfield getBuiltinCapabilities();
mux_device_info_t device_info;
bool supports_deferred_compilation;
const char *compilation_options;
bool vectorizable;
bool dma_optimizable;
bool scalable_vector_support;
};
}
device_info
- The singleton instance ofmux_device_info_t
which this compiler targets.supports_deferred_compilation
- Is true if this compiler supports deferred compilation by implementingcompiler::Module::createKernel
and thecompiler::Kernel
class, otherwise false.compilation_options
- A null-terminated C string of semicolon-separated compilation options specific to this compiler.vectorizable
- Is true if the device supports vectorization otherwise false.dma_optimizable
- Is true if the device supports DMA optimizations otherwise false.scalable_vector_support
- Is true if the device supports scalable vectors otherwise false.
Valid Usage
compilation_options
must conform to the compilation options syntax, defined below.
Info::createTarget
Info::createTarget
creates a new instance of a subclass of
compiler::Target
.
std::unique_ptr<compiler::Target> createTarget(
compiler::Context *context);
context
- an instance ofcompiler::Context
.callback
- an optional callback used to provide a message back to the user.
Return Value
If there was an allocation failure,
nullptr
must be returned.If
context
isnullptr
,nullptr
must be returned.Otherwise an instance of
compiler::Target
should be returned.
Info::getBuiltinCapabilities
Info::getBuiltinCapabilities
retrieves a bitfield describing the builtin
capabilities of the target device, based on Info::device_info
.
builtins::file::capabilities_bitfield getBuiltinCapabilities();
Return Value
A bitfield describing the builtin capabilities should be returned.
Compilation Options Syntax
compilation_options
must follow the syntax of a comma separated
tuple of (name, [1|0], help) with the following rules:
Argument name in the first tuple entry must start with a double hyphen and not contain whitespace characters.
The second element must be a ‘1’ or a ‘0’ denoting if a value needs to be provided for the option.
The final tuple entry is a help message to be displayed by compiler tools. All help whitespace must only be `` `` characters; other whitespace characters (
\t
,\n
, etc.) must not be used.If multiple options are reported then each tuple must be separated by a semi-colon.
Example of valid options reported by a device, including both an option which requires a value and an option which is just a build flag.
info_ptr->compilation_options =
"--dummy-device-option,1,takes an integer value;"
"--dummy-device-flag,0,enables device optimization";
Enumerating compiler::Info
’s
Compiler targets are required to provide a free-standing function that lists one
or more static instances of the compiler::Info
object for each compiler
configuration that this target supports. The name of this function does not
matter, but it is named getCompilers
in this example.
void getCompilers(compiler::AddCompilerFn add_compiler);
add_compiler
- an object that overloadsoperator()
which informs the oneAPI Construction Kit about a static instance ofcompiler:Info
. Used to register a specific compiler configuration.
One way of implementing this requirement is to add a static function to the
compiler::Info
object:
struct MyCompilerInfo : public compiler::Info {
// ...
static void get(compiler::AddCompilerFn add_compiler) {
static MyCompilerInfo info;
add_compiler(&info);
}
};
Then, provide the fully qualified name to this function in CMake:
add_mux_compiler_target(MyCompiler
COMPILER_INFO MyCompilerInfo::get
HEADER_DIR my_compiler/info.h)
Target
A Target
object is an instance of the compiler which “targets” a particular
Mux device. It is used as the entry point into customer code from the compiler
library.
namespace compiler {
class BaseTarget {
public:
BaseTarget(
const compiler::Info *compiler_info,
compiler::Context *context,
compiler::NotifyCallbackFn callback);
virtual Result initWithBuiltins(std::unique_ptr<llvm::Module> builtins) = 0;
virtual Result listSnapshotStages(mux_device_t device, uint32_t count,
const char **out_stages,
uint32_t *out_count) = 0;
virtual std::unique_ptr<compiler::Module> createModule(
uint32_t &num_errors,
std::string &log) = 0;
const compiler::Info *getCompilerInfo() const;
};
}
BaseTarget Constructor
A Target
object which extends BaseTarget
must have a constructor
which calls BaseTarget
’s constructor with the following arguments
BaseTarget(
const compiler::Info *compiler_info,
compiler::Context *context,
compiler::NotifyCallbackFn callback);
compiler_info
- the compiler info used to create this object.context
- an instance ofcompiler::Context
.callback
- an optional callback used to provide a message back to the user.
BaseTarget::initWithBuiltins
BaseTarget::initWithBuiltins
initializes the given target object after
loading builtins.
compiler::Result initWithBuiltins(
std::unique_ptr<llvm::Module> builtins);
builtins
- an LLVM module containing the embedded builtins provided by the oneAPI Construction Kit.
Return Value
If there was an allocation failure,
compiler::Result::OUT_OF_MEMORY
must be returned.Otherwise
compiler::Result::SUCCESS
should be returned.
BaseTarget::createModule
BaseTarget::createModule
creates a new instance of a subclass of
compiler::BaseModule
that supports this target.
std::unique_ptr<compiler::Module> createModule(
uint32_t &num_errors,
std::string &log);
num_errors
- a reference to an integer that will contain the number of errors reported by the Module object during compilation.log
- a reference to astd::string
that will contain errors reported by the Module object during compilation.
Return Value
If there was an allocation failure,
nullptr
must be returned.Otherwise an instance of
compiler::Module
should be returned.
BaseTarget::listSnapshotStages
Warning
Snapshots are deprecated and will be removed in a future version of the oneAPI Construction Kit.
BaseTarget::listSnapshotStages
lists all snapshot stages supported by
this compiler target.
compiler::Result listSnapshotStages(
uint32_t count,
const char **out_stages,
uint32_t *out_count);
count
- element count of theout_stages
array.out_stages
- array of C strings to be populated with snapshot names, may benullptr
.out_count
- number of snapshot stages available.
Return Value
If there was an allocation failure,
compiler::Result::OUT_OF_MEMORY
must be returned.If
count
is 0, andout_stages
is not NULL,compiler::Result::INVALID_VALUE
must be returned.If
out_stages
isnullptr
, andcount
is not 0,compiler::Result::INVALID_VALUE
must be returned.Otherwise
compiler::Result::SUCCESS
should be returned.
Module
A Module
object is the top level container for a device program compiled
from one of the supported source types. A Module may contain multiple entry
points and may have one or more named kernels unless it is a library module.
Module
is used to drive the compilation process, starting with the OpenCL C,
SPIR or SPIR-V front-ends, optionally linking against other Modules, then
applying further optimizations before passing it to the back-end.
BaseModule
implements all of the front-end functionality, and it is left to
the Mux target implementation to implement the remaining pure virtual methods
that handle the back-end and code generation.
namespace compiler {
class BaseModule {
public:
BaseModule(compiler::BaseTarget &target,
compiler::ContextImpl &context,
uint32_t &num_errors,
std::string &log);
virtual Result createBinary(cargo::array_view<std::uint8_t> &buffer) = 0;
virtual std::unique_ptr<compiler::utils::PassMachinery> createPassMachinery();
protected:
virtual Kernel *createKernel(const std::string &name) = 0;
public:
virtual void clear();
virtual Result compileSPIR(std::string &output_options);
virtual cargo::expected<spirv::ModuleInfo, Result> compileSPIRV(
cargo::array_view<const std::uint32_t> buffer,
const spirv::DeviceInfo &spirv_device_info,
cargo::optional<const spirv::SpecializationInfo &> spirv_spec_info);
virtual Result compileOpenCLC(
cargo::string_view device_profile,
cargo::string_view source,
cargo::array_view<compiler::InputHeader> input_headers);
virtual Result link(cargo::array_view<Module *> input_modules);
virtual Result finalize(
KernelInfoCallback kernel_info_callback,
std::vector<builtins::printf::descriptor> &printf_calls);
virtual Kernel *getKernel(const std::string &name);
virtual std::size_t size();
virtual std::size_t serialize(std::uint8_t *output_buffer);
virtual bool deserialize(cargo::array_view<const std::uint8_t> buffer);
virtual std::unique_ptr<compiler::utils::PassMachinery> createPassMachinery();
virtual void initializePassMachineryForFrontend(
compiler::utils::PassMachinery &,
const clang::CodeGenOptions &) const;
virtual void initializePassMachineryForFinalize(
compiler::utils::PassMachinery &) const;
protected:
// Utility functions.
virtual llvm::ModulePassManager getLateTargetPasses(
compiler::utils::PassMachinery &) = 0;
virtual Kernel *createKernel(const std::string &name) = 0;
cargo::optional<SnapshotDetails> shouldTakeSnapshot(
const char *stage) const;
void addDiagnostic(cargo::string_view message);
void addBuildError(cargo::string_view message);
// Member variables.
std::unique_ptr<llvm::Module> finalized_llvm_module;
compiler::BaseContext &context;
compiler::BaseTarget ⌖
struct SnapshotDetails {
const char *stage;
compiler_snapshot_callback_t callback;
SnapshotFormat format;
void *user_data;
};
std::vector<SnapshotDetails> snapshot;
compiler::Options options;
private:
std::unique_ptr<llvm::Module> llvm_module;
};
}
BaseModule Constructor
A Module
object which extends BaseModule
must have a constructor
which calls BaseModule
’s constructor with the following arguments:
BaseModule(
compiler::BaseTarget &target,
compiler::ContextImpl &context,
uint32_t &num_errors,
std::string &log);
target
- thecompiler::Target
object used to create this module.context
- an instance ofcompiler::Context
.num_errors
- a reference to an integer that will contain the number of errors reported by the Module object during compilation.log
- a reference to astd::string
that will contain errors reported by the Module object during compilation.
BaseModule::finalize
BaseModule::finalize
runs IR passes on the llvm_module
which prepare it
for binary creation.
The passes run by the default implementation are a mixture of LLVM middle-end
optimizations and ComputeMux-specific passes that lower the incoming
llvm_module
from a higher-level form dependent on the original kernel
source-language (e.g., being produced by BaseModule::compileOpenCLC
,
BaseModule::compileSPIRV
, or BaseModule::compileSPIR
) into a canonical
“ComputeMux” form.
Note
- Note that most of the lower-level target-specific passes are left to
BaseModule::getLateTargetPasses
which must be implemented.
Targets may override this method to customize the pipeline.
BaseModule::getLateTargetPasses
BaseModule::getLateTargetPasses
is an internal method called at the end of
BaseModule::finalize
, and is reponsible for adding any final
target-specific IR passes to the pipeline, in preparation for the creation of
the final binary in BaseModule::createBinary
. Note that no
BaseModule::finalize
passes have actually been run by the time at which
this method is called, neither is the llvm::Module
that the passes will be
run on exposed.
This method receives the same PassMachinery
used throughout the
BaseModule::finalize
pipeline, that has been initialized with
BaseModule::initializePassMachineryForFinalize
. Targets may therefore rely
on any analyses they’ve previously registered.
BaseModule::createPassMachinery
The PassMachinery
class manages the lifetime and initialization of all
components required to set up a new-style LLVM pass manager. It includes
various methods for registering debug information and parsing pipeline text
(for PassBuilder.parsePassPipeline
) and initalizing of state. The default
implementation will cover common passes, but if a user wants to register their
own for debug and parse they can create their own by deriving from
BasePassMachinery
.
The PassMachinery
class takes an llvm TargetMachine
pointer in the
constructor. By default this can only be known in the derived class, and so to
support the TargetMachine
being known throughout the compilation pipeline, it
is advised to override the BaseModule::createPassMachinery
, even if only to
create the BaseModulePassMachinery
with a known TargetMachine
. A derived
version of PassMachinery
is also advised to support parsing and debugging of
target specific passes. This should generally be derived from
BaseModulePassMachinery
and the various register*
methods of
BaseModulePassMachinery
called from the derived class.
BaseModule::initializePassMachineryForFrontend
BaseModule::initializePassMachineryForFrontend
sets up a PassMachinery
for use in the pipelines run by BaseModule::compileOpenCLC
,
BaseModule::compileSPIRV
, and BaseModule::compileSPIR
. A default
implementation is provided, though targets may override this method to register
custom analyses or tune the pipeline.
BaseModule::initializePassMachineryForFinalize
BaseModule::initializePassMachineryForFinalize
sets up a PassMachinery
for use in the pipeline run by BaseModule::finalize
(and by extension
BaseModule::getLateTargetPasses
). A default implementation is provided,
though targets may override this method to register
custom analyses or tune the pipeline.
BaseModule::createBinary
BaseModule::createBinary
creates a compiled binary which can be loaded by
the corresponding Mux implementation using muxCreateExecutable
.
compiler::Result createBinary(cargo::array_view<std::uint8_t> &buffer);
buffer
- an array view over the binary buffer. This array view is valid until the next call tocreateBinary
.
Return Value
If there was an allocation failure,
compiler::Result::OUT_OF_MEMORY
must be returned.Otherwise
compiler::Result::SUCCESS
should be returned.
BaseModule::createKernel
BaseModule::createKernel
creates a deferred kernel, an object which
represents a specific kernel function within the Module which can have its
compilation deferred. Note that this function should not create a new kernel
function in the module, but instead creates a new compiler::Kernel
object
that represents an existing kernel in the module.
This method must return nullptr
if the compiler::Module
does not
support deferred compilation of kernels and
compiler::Info::supports_deferred_compilation
is false
.
BaseModule::getKernel
will either look up compiler::Kernel
objects by
kernel name, or call BaseModule::createKernel
to create compiler::Kernel
objects lazily.
compiler::Kernel *createKernel(
const std::string &name);
name
- the name of the kernel function to select from the module.
Return Value
If there was an allocation failure,
nullptr
must be returned.If this module does not support deferred compilation,
nullptr
must be returned.Otherwise an instance of
compiler::Kernel
should be returned.
Kernel
A Kernel
object represents a single kernel function inside a Module whose
compilation into a mux_kernel_t
can be deferred at any point up to the point
we enqueue the kernel into a command buffer. The Kernel
class is not
required to be implemented if the compiler implementation does not support
deferred compilation.
Kernel
may be used to perform further optimizations to specific kernels
once additional information is provided, such as local or global work-group
sizes, and/or descriptors.
namespace compiler {
class BaseKernel {
public:
BaseKernel(size_t preferred_local_size_x,
size_t preferred_local_size_y,
size_t preferred_local_size_z,
size_t local_memory_size);
virtual Result precacheLocalSize(size_t local_size_x,
size_t local_size_y,
size_t local_size_z) = 0;
virtual cargo::expected<uint32_t, Result> getDynamicWorkWidth(
size_t local_size_x,
size_t local_size_y,
size_t local_size_z) = 0;
virtual cargo::expected<cargo::dynamic_array<uint8_t>, Result> createSpecializedKernel(
const mux_ndrange_options_t &specialization_options) = 0;
virtual cargo::expected<uint32_t, Result> getSubGroupSize() = 0;
virtual cargo::expected<uint32_t, Result> querySubGroupSizeForLocalSize(
size_t local_size_x, size_t local_size_y, size_t local_size_z) = 0;
virtual cargo::expected<std::array<size_t, 3>, Result>
queryLocalSizeForSubGroupCount(size_t sub_group_count) = 0;
virtual cargo::expected<size_t, Result> queryMaxSubGroupCount() = 0;
};
}
Constructor
A Kernel
object which extends BaseKernel
must have a constructor
which calls BaseKernel
’s constructor with the following arguments:
BaseKernel(
size_t preferred_local_size_x,
size_t preferred_local_size_y,
size_t preferred_local_size_z,
size_t local_memory_size);
preferred_local_size_x
- the preferred local size in the x dimension for this kernel object.preferred_local_size_y
- the preferred local size in the y dimension for this kernel object.preferred_local_size_z
- the preferred local size in the z dimension for this kernel object.local_memory_size
- the amount of local memory used by this kernel object.
BaseKernel::precacheLocalSize
BaseKernel::precacheLocalSize
signals to the compiler to optionally
pre-cache a specific local work-group size configuration that may be required
later by BaseKernel::createSpecializedKernel
.
compiler::Result precacheLocalSize(
size_t local_size_x,
size_t local_size_y,
size_t local_size_z);
local_size_x
- the size of the x dimension of the local work-group.local_size_y
- the size of the y dimension of the local work-group.local_size_z
- the size of the z dimension of the local work-group.
Return Value
If there was an allocation failure,
compiler::Result::OUT_OF_MEMORY
must be returned.If
local_size_x
is 0,compiler::Result::INVALID_VALUE
must be returned.If
local_size_y
is 0,compiler::Result::INVALID_VALUE
must be returned.If
local_size_z
is 0,compiler::Result::INVALID_VALUE
must be returned.Otherwise
compiler::Result::SUCCESS
should be returned.
BaseKernel::getDynamicWorkWidth
BaseKernel::getDynamicWorkWidth
obtains the dynamic work width of this
kernel for a given local work-group size.
The work width indicates the number of work-items in a work-group that will execute together. Note that the work width may be less than the size of the work-group but never greater than, and may be 1.
Commonly the work width will relate to the hardware vector/wave-front/warp
width (likely the device’s max_work_width
), but may be lowered if a
particular kernel cannot fully exploit the hardware. The work width may be less
than or greater than the hardware width, depending on factors such as what data
types are used in the kernel.
cargo::expected<uint32_t, compiler::Result> getDynamicWorkWidth(
size_t local_size_x,
size_t local_size_y,
size_t local_size_z);
local_size_x
- the size of the x dimension of the local work-group.local_size_y
- the size of the y dimension of the local work-group.local_size_z
- the size of the z dimension of the local work-group.
Return Value
If there was an allocation failure,
cargo::make_unexpected(compiler::Result::OUT_OF_MEMORY)
must be returned.If
local_size_x
is 0,compiler::Result::INVALID_VALUE
must be returned.If
local_size_y
is 0,compiler::Result::INVALID_VALUE
must be returned.If
local_size_z
is 0,compiler::Result::INVALID_VALUE
must be returned.Otherwise, a work width should be returned. The work width must be greater than 0.
BaseKernel::createSpecializedKernel
BaseKernel::createSpecializedKernel
creates a compiled binary containing (at least)
the kernel represented by this compiler::Kernel
object, which may have
been cloned and optimized further from the original module given all the
information required to execute. This binary should be loadable by the
corresponding Mux implementation using muxCreateExecutable
Parameter information consists of descriptions of each parameter passed to the kernel function. Execution information consists of information on the number of work-groups to execute, and a work-group offset.
The compiler::Kernel
object used to create this binary is guaranteed to
be destroyed after the mux_executable_t
created from this binary is
destroyed.
cargo::expected<cargo::dynamic_array<uint8_t>, Result> createSpecializedKernel(
const mux_ndrange_options_t &options);
options
- the execution options that will be used when the kernel is executed bymuxCommandNDRange
.
Return Value
If there was an allocation failure,
cargo::make_unexpected(compiler::Result::OUT_OF_MEMORY)
must be returned.If
options.descriptors
is not NULL anddescriptors_length
is 0,cargo::make_unexpected(compiler::Result::INVALID_VALUE)
must be returned.If
options.descriptors
is NULL anddescriptors_length
is not 0,cargo::make_unexpected(compiler::Result::INVALID_VALUE)
must be returned.If any element in
options.local_size
is 0,cargo::make_unexpected(compiler::Result::INVALID_VALUE)
must be returned.If
options.global_offset
is NULL,cargo::make_unexpected(compiler::Result::INVALID_VALUE)
must be returned.If
options.global_size
is NULL,cargo::make_unexpected(compiler::Result::INVALID_VALUE)
must be returned.If
options.length
is 0 or greater than 3,cargo::make_unexpected(compiler::Result::INVALID_VALUE)
must be returned.If
options.descriptors
contains an element where thetype
data member ismux_descriptor_info_type_custom_buffer
anddevice->info->custom_buffer_capabilities
is0
,cargo::make_unexpected(compiler::Result::INVALID_VALUE)
must be returned.If there was a failure during any code generation,
cargo::make_unexpected(compiler::Result::FINALIZE_PROGRAM_FAILURE)
must be returned.Otherwise an instance of
cargo::dynamic_array<uint8_t>
containing a valid binary should be returned.
BaseKernel::querySubGroupSizeForLocalSize
BaseKernel::querySubGroupSizeForLocalSize
calculates the maximum sub-group
size that would result from enqueing the kernel with the given local size.
Enqueuing the kernel with the specified local size shall result in at least
one sub-group of the size returned in out_sub_group_size
and may
additionally result in exactly one sub-group of size less than that returned
when the local size is not evenly divisible by the sub-group size.
virtual cargo::expected<uint32_t, Result> querySubGroupSizeForLocalSize(
size_t local_size_x, size_t local_size_y, size_t local_size_z);
local_size_x
- the size of the x dimension of the local work-group.local_size_y
- the size of the y dimension of the local work-group.local_size_z
- the size of the z dimension of the local work-group.
Return Value
If there was an allocation failure,
compiler::Result::OUT_OF_MEMORY
must be returned.If any of
local_size_x
,local_size_y
orlocal_size_z
are zero,compiler::Result::INVALID_VALUE
must be returned.If the device targeted by this kernel does not support sub-groups,
compiler::Result::FEATURE_UNSUPPORTED
must be returned. ``Otherwise, a sub-group size should be returned. The sub-group size must be greater than 0.
BaseKernel::queryLocalSizeForSubGroupCount
BaseKernel::queryLocalSizeForSubGroupCount
calculates the local size that
when enqueued with the kernel would result in the specified number of
sub-groups.
virtual cargo::expected<std::array<size_t, 3>, Result>
queryLocalSizeForSubGroupCount(size_t sub_group_count);
sub_group_count
- the requested number of sub-groups.
Return Value
If there was an allocation failure,
compiler::Result::OUT_OF_MEMORY
must be returned.If the device targeted by this kernel does not support sub-groups,
compiler::Result::FEATURE_UNSUPPORTED
must be returned.Otherwise, a local size should be returned. The local size must be 1 dimensional, that is, at least two of the elements in the array must be 1. The local size must be evenly divisible by the sub-group size in the kernel. If no local size would result in the requested number of sub-groups this function may return a local size of zero.
BaseKernel::queryMaxSubGroupCount
BaseKernel::queryMaxSubGroupCount
calculates the maximum number of
sub-groups that can be supported by the kernel for any local size.
virtual cargo::expected<size_t, Result> queryMaxSubGroupCount();
Return Value
If there was an allocation failure,
compiler::Result::OUT_OF_MEMORY
must be returned.If the device targeted by this kernel does not support sub-groups,
compiler::Result::FEATURE_UNSUPPORTED
must be returned. ``Otherwise, a sub-group count should be returned. The sub-group count must be greater than zero.
LLVM intermediate representation
Mangling
Mangling is used by the vectorizer to declare, define and use internal overloaded builtin functions. In general, the mangling scheme follows Appendix A of the SPIR 1.2 specification, itself an extension of the Itanium C++ mangling scheme.
Vector Types
The Itanium specification under-specifies vector types in general, so vendors
are left to establish their own system. In the vectorizer, fixed-length vector
types follow the convention that LLVM, GCC, ICC and others use. The first
component is Dv
followed by the number of elements in the vector, followed by
an underscore (_
) and then the mangled element type:
<2 x i32> -> Dv2_i
<32 x double> -> Dv32_d
Scalable-vector IR types do not have an established convention. Certain vendors such as ARM SVE2 provide scalable vector types at the C/C++ language level, but those are mangled in a vendor-specific way.
The vectorizer chooses its own mangling scheme using the Itanium
vendor-extended type syntax, which is u
, followed by the length of the
mangled type, then the mangled type itself.
Scalable-vectors are first mangled with nx
to indicate the scalable
component. The next part is an integer describing the known multiple of the
scalable component. Lastly, the element type is mangled according to the
established vectorizer mangling scheme (i.e. Itanium).
Example:
<vscale x 1 x i32> -> u5nxv1j
<vscale x 2 x float> -> u5nxv2f
<vscale x 16 x double> -> u6nxv16d
<vscale x 4 x i32 addrspace(1)*> -> u11nxv4PU3AS1j
define void @__vecz_b_interleaved_storeV_Dv16_dPU3AS1d(<16 x double> %0, double addrspace(1)* %1, i64 %2) {
define void @__vecz_b_interleaved_storeV_u6nxv16dPU3AS1d(<vscale x 16 x double> %0, double addrspace(1)* %1, i64 %2) {
Builtins
The LLVM intermediate representation stored in
compiler::BaseModule::finalized_llvm_module
may contain declarations
and calls to any of the following __mux
functions: __mux_isftz()
,
__mux_dma_read_1D()
, __mux_dma_read_2D()
, __mux_dma_read_3D()
,
__mux_dma_write_1D()
, __mux_dma_write_2D()
, __mux_dma_write_3D()
,
__mux_dma_wait()
, __mux_get_global_size()
,
__mux_get_global_id()
, __mux_get_global_offset()
,
__mux_get_local_size()
, __mux_get_local_id()
,
__mux_get_sub_group_id()
, __mux_get_num_groups()
,
__mux_get_num_sub_groups()
, __mux_get_max_sub_group_size()
,
__mux_get_group_id()
, __mux_get_work_dim()
,
__mux_mem_barrier()
, __mux_work_group_barrier()
,
__mux_sub_group_barrier()
, __mux_usefast()
,
__mux_isembeddedprofile()
, __mux_get_global_linear_id()
,
__mux_get_local_linear_id()
or __mux_get_enqueued_local_size()
.
bool __mux_isftz(void)
- Returns whether the device flushes floating-point values to 0.bool __mux_usefast(void)
- Returns whether we should use faster, but less accurate, algorithms for maths builtins used in the LLVM module.bool __mux_isembeddedprofile(void)
- Returns whether the device implements OpenCL 1.2 Embedded Profile.size_t __mux_get_global_size(uint i);
- Returns the number of global invocations for thei
’th dimension.size_t __mux_get_global_id(uint i);
- Returns the unique global invocation identifier for thei
’th dimension.size_t __mux_get_global_offset(uint i);
- Returns the global offset (in invocations) for thei
’th dimension.size_t __mux_get_local_size(uint i);
- Returns the number of local invocations within a work-group for thei
’th dimension.size_t __mux_get_local_id(uint i);
- Returns the unique local invocation identifier for thei
’th dimension.uint __mux_get_sub_group_id(void);
- Returns the subgroup ID.size_t __mux_get_num_groups(uint i);
- Returns the number of work-groups for thei
’th dimension.uint __mux_get_num_sub_groups();
- Returns the number of subgroups for the current work-group.uint __mux_get_max_sub_group_size();
- Returns the maximum subgroup size in the current kernel.size_t __mux_get_group_id(uint i);
- Returns the unique work-group identifier for thei
’th dimension.uint __mux_get_work_dim(void);
- Returns the number of dimensions in use.__mux_dma_event_t __mux_dma_read_1D(_attribute((address_space(3)))
uint8_t *dst, _attribute((address_space(1))) uint8_t *src, size_t width,
__mux_dma_event_t event)
- DMA 1D read fromsrc
todst
ofwidth
bytes. May useevent
from previous DMA call. Returns event used.__mux_dma_event_t __mux_dma_read_2D(_attribute((address_space(3)))
uint8_t *dst, _attribute((address_space(1))) uint8_t *src, size_t width,
size_t dst_stride, size_t src_stride, size_t height __mux_dma_event_t event)
- DMA 2D read fromsrc
todst
ofwidth
bytes andheight
rows, withdst_stride
bytes between dst rows andsrc_stride
bytes between src rows. May useevent
from previous DMA call. Returns event used.__mux_dma_event_t __mux_dma_read_3D(_attribute((address_space(3)))
uint8_t *dst, _attribute((address_space(1))) uint8_t *src, size_t width,
size_t dst_line_stride, size_t src_line_stride, size_t height, size_t
dst_plane_stride, size_t src_plane_stride, size_t depth, __mux_dma_event_t
event)
- DMA 3D read fromsrc
todst
ofwidth
bytes,height
rows, anddepth
planes, withdst_line_stride
bytes between dst rows,src_line_stride
bytes between src rows,dst_plane_stride
bytes between dst planes, andsrc_plane_stride
between src planes. May useevent
from previous DMA call. Returns event used.__mux_dma_event_t __mux_dma_write_1D(_attribute((address_space(1)))
uint8_t *dst, _attribute((address_space(3))) uint8_t *src, size_t width,
__mux_dma_event_t event)
- DMA 1D write fromsrc
todst
ofwidth
bytes. May useevent
from previous DMA call. Returns event used.__mux_dma_event_t __mux_dma_write_2D(_attribute((address_space(1)))
uint8_t *dst, _attribute((address_space(1))) uint8_t *src, size_t width,
size_t dst_stride, size_t src_stride, size_t height __mux_dma_event_t event)
- DMA 2D write fromsrc
todst
ofwidth
bytes andheight
rows, withdst_stride
bytes between dst rows andsrc_stride
bytes between src rows. May useevent
from previous DMA call. Returns event used.__mux_dma_event_t __mux_dma_write_3D(_attribute((address_space(3)))
uint8_t *dst, _attribute((address_space(1))) uint8_t *src, size_t width,
size_t dst_line_stride, size_t src_line_stride, size_t height, size_t
dst_plane_stride, size_t src_plane_stride, size_t depth, __mux_dma_event_t
event)
- DMA 3D write fromsrc
todst
ofwidth
bytes,height
rows, anddepth
planes, withdst_line_stride
bytes between dst rows,src_line_stride
bytes between src rows,dst_plane_stride
bytes between dst planes, andsrc_plane_stride
between src planes. May useevent
from previous DMA call. Returns event used.void __mux_dma_wait(uint num_events, __mux_dma_event_t*)
- Wait on events initiated by a DMA read or write.size_t __mux_get_global_linear_id(void)
- Returns a linear ID equivalent to(__mux_get_global_id(2) - __mux_get_global_offset(2)) *
__mux_get_global_size(1) * __mux_get_global_size(0) +
(__mux_get_global_id(1) - __mux_get_global_offset(1)) *
__mux_get_global_size(0) + (__mux_get_global_id(0) -
__mux_get_global_offset(0))
.size_t __mux_get_local_linear_id(void)
- Returns a linear ID equivalent to__mux_get_local_id(2) * __mux_get_local_size(1) *
__mux_get_local_size(0) + __mux_get_local_id(1) * __mux_get_local_size(0)
+ __mux_get_local_id(0)
.size_t __mux_get_enqueued_local_size(uint i)
- Returns the enqueued work-group size in thei
’th dimension, for uniform work-groups this is equivalent tosize_t __mux_get_local_size(uint i)
.void __mux_mem_barrier(i32 %scope, i32 %semantics)
- Controls the order that memory accesses are observed (serves as a fence instruction). This control is only ensured for memory accesses issued by the invocation calling the barrier and observed by another invocation executing within the memory%scope
. Additional control over the kind of memory controlled and what kind of control to apply is provided by%semantics
. See below for more information.void __mux_work_group_barrier(i32 %id, i32 %scope, i32 %semantics)
andvoid __mux_sub_group_barrier(i32 %id, i32 %scope, i32 %semantics)
- Wait for other invocations of the work-group/sub-group to reach the current point of execution (serves as a control barrier). A barrier identifier is provided by%id
(note that implementations must ensure uniqueness themselves, e.g., by running thecompiler::utils::PrepareBarriersPass
). These builtins may also atomically provide a memory barrier with the same semantics as__mux_mem_barrier(i32 %scope, i32 %semantics)
. See below for more information.
The LLVM intermediate representation stored in
compiler::BaseModule::finalized_llvm_module
may contain declarations of
the following type: __mux_dma_event_t
.
__mux_dma_event_t
- Represents an event object that may be passed to and shall be returned from the various__mux_dma
builtins.__mux_dma_event_t
objects may be waited on by the__mux_dma_wait
builtin.
A Mux implementation shall provide definitions for the above declarations.
Memory and Control Barriers
The mux barrier builtins synchronize both memory and execution flow.
The specific semantics with which they synchronize are defined using the following enums.
The %scope
parameter defines which other invocations observe the memory
ordering provided by the barrier. Only one of the values may be chosen
simultaneously.
enum MemScope : uint32_t {
MemScopeCrossDevice = 0,
MemScopeDevice = 1,
MemScopeWorkGroup = 2,
MemScopeSubGroup = 3,
MemScopeWorkItem = 4,
};
The %semantics
parameter defines the kind of memory affected by the
barrier, as well as the ordering constraints. Only one of the possible
orderings may be chosen simultaneously. The memory field is a
bitfield.
enum MemSemantics : uint32_t {
// The 'ordering' to apply to a barrier. A barrier may only set one of the
// following at a time:
MemSemanticsRelaxed = 0x0,
MemSemanticsAcquire = 0x2,
MemSemanticsRelease = 0x4,
MemSemanticsAcquireRelease = 0x8,
MemSemanticsSequentiallyConsistent = 0x10,
MemSemanticsMask = 0x1F,
// What kind of 'memory' is controlled by a barrier. Acts as a bitfield, so
// a barrier may, e.g., synchronize both sub-group, work-group and cross
// work-group memory simultaneously.
MemSemanticsSubGroupMemory = 0x80,
MemSemanticsWorkGroupMemory = 0x100,
MemSemanticsCrossWorkGroupMemory = 0x200,
};
Atomics and Fences
The LLVM intermediate representation stored in
compiler::BaseModule::finalized_llvm_module
may contain any of the
following atomic instructions:
cmpxchg for the monotonic ordering with strong semantics only
atomicrmw for the following opcodes:
add
,and
,sub
,min
,max
,umin
,umax
,or
,xchg
,xor
for the monotonic ordering only
A compiler shall correctly legalize or select these instructions to ISA specific operations.
The LLVM intermediate representation stored in
compiler::BaseModule::finalized_llvm_module
may also contain any of the
following atomic instructions:
cmpxchg for the monotonic ordering with weak semantics
load with the instruction marked as atomic for the monotonic ordering only
store with the instruction marked as atomic for the monotonic ordering only
fence for the acquire ordering, release ordering and acq_rel ordering only
A compiler may choose not to support these instructions depending on which open standards it wishes to enable through the oneAPI Construction Kit. For example; support for the OpenCL C 3.0 standard requires support for these instructions.
The atomic instructions listed above shall not have a syncscope argument.
No lock free requirements are made on the above atomic instructions. A target may choose to provide a software implementation of the atomic instructions via some other mechanism such as a hardware mutex.
Metadata
The following table describes metadata which can be introduced at different stages of the pipeline:
Name |
Fields |
Description |
---|---|---|
|
i32, i32, i32 |
Required work-group size encoded as X, Y, Z. If not present, no required size is assumed. |
|
i32 |
Maximum dimension used for work-items. If not present, |
|
various (incl. vectorization options) |
Information about a kernel entry point regarding its work-item
iteration over sub-kernels as stitched together by the
|
|
vectorization options, |
Links one function to another, indicating that the function acts as the base - or source - of vectorization with the given vectorization options, and the linked function is the result of a successful vectorization. A function may have many such pieces of metadata, if it was vectorized multiple times. |
|
vectorization options, |
Links one function to another, indicating that the function is the result of a successful vectorization with the given vectorization options, using the linked function as the base - or source - of vectorization. A function may only have one such piece of metadata. |
|
vectorization options |
Metadata indicating a failure to vectorize with the provided vectorization options. |
|
i32, i32(, i32, i32)? |
Metadata indicating the function parameter indices of the pointers to MuxWorkItemInfo and MuxWorkGroupInfo structures, respectively. A negative value (canonicalized as -1) indicates the function has no such parameter. Up to two additional custom parameter indices can be used by targets. |
Users should not rely on the name, format, or operands of these metadata.
Instead, utility functions are provided by the utils
module to work with
accessing, setting, or updating each piece of metadata.
Note
The metadata above which refer to vectorization options have no concise
metadata form as defined by the specification and are not guaranteed to
be backwards compatible. See the C++ utility APIs in the utils
module as
described above for the specific information encoded/decoded by
vectorization.
Name |
Fields |
Description |
---|---|---|
|
A single operand, itself containing !{i32, i32} |
The major/minor OpenCL C version that this module is compatible with. If unset the compiler assumes 1.2. The compiler will infer different semantics and supported builtin functions depending on this metadata. |
|
string, string, … |
A list of scheduling parameter names used by this target. Emitted into
the module at the time scheduling parameters are added to functions that
requires them (see |
Function Attributes
The following table describes function attributes which can be introduced at different stages of the pipeline:
Attribute |
Description |
---|---|
|
Denotes a “kernel” function. Additionally denotes a
“kernel entry point” if the value is |
|
Denotes the name of the “original function” of a function. This original function may or may not exist in the module. The original function name is propagated through the compiler pipeline each time ComputeMux creates a new function to wrap or replace a function. |
|
Denotes the “base name component” of a function. Used by several passes when creating new versions of a kernel, rather than appending suffix upon suffix. For example, a pass that suffixes newly-created functions with
|
|
Estimated local-memory usage for the function. Value must be a positive integer. |
|
Work-item order (the dimensions over which work-items are executed from
innermost to outermost) as defined by the |
|
Typically found on call sites. Determines the ordering of work-item execution after a berrier. See the BarrierSchedule enum. |
mux-kernel
attribute
ComputeMux programs generally consist of a number of kernel functions, which have a certain programming model and may be a subset of all functions in the module.
ComputeMux compiler passes often need to identity kernel functions amongst other functions in the module. Further to this, a ComputeMux implementation may know that an even smaller subset of kernels are in fact considered kernels under compilation. In the interests of compile-time it is not desirable to optimize kernels that are known to never run.
Under this scheme, it is further possible to distinguish between kernels that are entry points and those that aren’t. Entry points are kernels which may be invoked from the runtime. Other kernels in the module may only be run when invoked indirectly: called from kernel entry points.
The mux-kernel
function attribute is used to communicate kernels under
compilation and kernel entry points (a subset of those) between passes. This
approach has a myriad of advantages. It provides a stable, consistent, kernel
identification method which other data do not: names cannot easily account for
new kernels introduced by optimizations like vectorization; calling conventions
are often made target-specific at some point in the pipeline; pointers to
functions are unstable when kernels are replaced/removed.
Passes provided by ComputeMux ensure this attribute is updated when adding, removing, or replacing kernel functions. Each ComputeMux pass in its documentation lists whether it operates on kernels or kernel entry points, if applicable.