ComputeCpp version releases are currently focused on the following goals:
- Develop new features for developers to try out and to support our customers
- Provide feature completeness for the SYCL 2020 standard
- Replace the current device compiler with a brand new one, using Address Space Inference
ComputeCpp 2.11.0 brings improvements on all of these fronts.
New features
Initial prototype for USM/Buffer interop
SYCL 2020 brought Unified Shared Memory as an alternative to the buffer/accessor model used for managing device data. However, while the two models can co-exist within the same SYCL application, the standard didn't specify an efficient way to use data from one model in the other. This becomes problematic when using multiple libraries which were programmed with different models in mind.
For example, if you're writing a new SYCL 2020 application that uses USM pointers, but want to perform calls into a SYCL library that was written in a way that uses buffers as input parameters, you need to have a way to get a buffer from a USM pointer in order to pass that to the library. Similarly, if that library constructs a buffer for external use, you need to be able to obtain a USM pointer from that buffer so you can use it elsewhere in your new USM-aware application.
We did consider some existing proposals from other SYCL vendors for USM/Buffer interop, but because of the differences in our own implementation we decided on a different approach. We do intend to collaborate with the other vendors on standardizing the extension once we have something that we know works well for ComputeCpp.
Please note that USM only works with the experimental compiler.
The initial prototype focuses only on one part: retrieving a buffer from a USM pointer. This is done using the get_buffer extension function:
queue exampleQueue;
context exampleContext = exampleQueue.get_context();
// Allocate USM memory that is accessible on both host and device
float* usmPtr = malloc_shared<float>(1024, exampleQueue);
// Retrieve a buffer backed by the USM allocation
buffer<float, 1> buf = ext::codeplay::get_buffer(usmPtr, exampleContext);
From this point on you can generally use buf as you would any other buffer. Calling get_buffer multiple times with the same USM pointer and context will always return the same buffer object.
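As a minimal sketch of what that enables (the kernel name is our own illustrative choice, and freeing the USM allocation is omitted):
exampleQueue.submit([&](handler& cgh) {
  // Access the USM-backed buffer like any other buffer
  accessor acc{buf, cgh};
  cgh.parallel_for<class usm_buffer_example>(range{1024}, [=](id<1> i) {
    acc[i] = static_cast<float>(i[0]);
  });
});
exampleQueue.wait();
// After synchronization, the data written through the buffer should
// also be visible through the original USM pointer
float first = usmPtr[0];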
There are currently some limitations to the prototype that we plan to remove with further updates to this interop.
For more information on this extension please refer to ComputeCpp Extensions.
New way of specifying arguments to offline compilation
Offline Compilation is a Professional Edition feature
The usual compilation flow for ComputeCpp is to use the device compiler to generate some Intermediate Representation (IR), such as SPIR-V, which is passed on to the OpenCL driver at run-time for the final compilation. ComputeCpp PE users can perform Offline Compilation (also known as Ahead-Of-Time Compilation) of device kernels using the device compiler option --sycl-custom-tool. This generates code using the native device instructions instead of producing an IR binary.
Previously, it was possible to specify certain arguments for the provided tool by using the flag -sycl-custom-args, which passes any provided arguments on to the offline compilation tool, as sketched below.
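For instance, an invocation using this interface might look like the following; the tool path is a placeholder and the quoted options are hypothetical arguments for that tool, assuming -sycl-custom-args accepts a quoted argument string in the same way -sycl-custom-command-line does:
compute++ <other arguments> \
--sycl-custom-tool=/path/to/tool/my-offline-compiler \
-sycl-custom-args "-some-tool-flag -another-tool-flag"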
Additionally, the device compiler adds a few hidden arguments when invoking the tool, namely the output and the input, in the form of -o {output} {input}. However, not all tools can follow this interface.
For these tools, ComputeCpp 2.11.0 adds a new device compiler flag, -sycl-custom-command-line. Here is an example usage that invokes the Intel ocloc tool (OpenCL Offline Compiler):
compute++ <other arguments> \
--sycl-custom-tool=/path/to/tool/ocloc \
-sycl-custom-command-line "-out_dir {output_dirname} -output {output_basename} -output-no-suffix -file {input} -device skl -spirv_input" \
-sycl-custom-output-file-suffix ".gen"
Please refer to the ComputeCpp Integration Guide to learn more about this topic and the new flag.
SYCL 2020 improvements
This section provides a short code sample for each of the new SYCL 2020 features introduced in ComputeCpp 2.11.0.
Work-group algorithms in device code
See the SYCL specification (4.17.4. Group algorithms library) for more details.
exampleQueue.submit([&](handler& cgh) {
  // inputBuf and outputBuf are buffers of int with `size` elements
  auto accIn = accessor{inputBuf, cgh};
  auto accOut = accessor{outputBuf, cgh};
  cgh.parallel_for<kernel_name>(
      nd_range{range{size}, range{32}}, [=](nd_item<1> i) {
        // Setup
        auto linearID = i.get_global_linear_id();
        auto workGroup = i.get_group();
        auto beginPtr = accIn.data();
        auto endPtr = beginPtr + size;
        // Use different work-group algorithms
        accOut[linearID] = group_broadcast(workGroup, 7);
        accOut[linearID] =
            joint_reduce(workGroup, beginPtr, endPtr, plus<int>{});
        joint_exclusive_scan(workGroup, beginPtr, endPtr, accOut.data(),
                             plus<int>{});
        bool isAnyTrue = any_of_group(workGroup, linearID == 17);
      });
});
Kernel bundles
See the SYCL specification (section 4.11.2. Synopsis) for more details.
// Retrieve the kernel bundle in input state for the requested context
// and some devices associated with the context
auto inputBundle =
    get_kernel_bundle<bundle_state::input>(exampleContext, contextDevices);
auto objectBundle = compile(inputBundle);
// Linking the same bundle multiple times here, but we could link different bundles
std::vector<kernel_bundle<bundle_state::object>> manyBundles{objectBundle,
                                                             objectBundle};
auto builtBundle = link(manyBundles, contextDevices);
exampleQueue.submit([&](handler& cgh) {
  // Need to instruct the handler to use this kernel bundle
  cgh.use_kernel_bundle(builtBundle);
  cgh.parallel_for<kernel_name>(range{1024}, [=](item<1>) {
    // Do some computation
  });
});
marray
See the SYCL specification (4.14.3.1. Math array interface) for more details.
marray<float, 8> arr1, arr2;
// Two marrays and a scalar combine into a single 17-element marray
marray<float, 17> arr3{arr1, arr2, 99.f};
// Element-wise arithmetic operators are supported
arr1 *= arr2;
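As an extra illustrative sketch (not from the release notes themselves), SYCL 2020 also defines the math functions for marray arguments, applying them element-wise:
// Broadcast-construct an marray, then take the square root of each element
marray<float, 8> values{4.f};
marray<float, 8> roots = sqrt(values); // every element becomes 2.f
float firstRoot = roots[0];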
Backend interop improvements
See the SYCL specification (4.5.1.2. Template function get_native) and (4.5.1.3. Template functions make_*) for more details.
context syclContext;
backend_return_t<backend::opencl, platform> clPlatform =
    get_native<backend::opencl>(syclContext.get_platform());
platform syclPlatform = make_platform<backend::opencl>(clPlatform);
is_device_copyable
See the SYCL specification (4.12.3. is_device_copyable type trait) for more details.
Let's assume we have a type that is not standard layout and not trivially copyable. The SYCL specification doesn't guarantee this kind of type can be copied from host to the device, and encourages SYCL implementations to raise a warning/error if the user tries to do so.
Let's construct a simple type that cannot be copied to the device, at least according to the SYCL specification:
struct int_wrapper {
 private:
  int value;

 public:
  bool dummy; // Breaks standard layout
  int_wrapper(int value) : value(value) {}
  int_wrapper(const int_wrapper&) = default;
  int_wrapper& operator=(int_wrapper Other) { // Breaks trivially copyable
    value = Other.value;
    return *this;
  }
  operator int() const { return value; }
};
However, for this particular type we know that it's OK to copy it to the device - it's essentially just an integer, so it should be safe to copy and use it.
With this knowledge, SYCL allows us to specialize is_device_copyable for the type:
template <>
struct is_device_copyable<int_wrapper> : std::true_type {};
From now on we can pass variables of type int_wrapper into a SYCL kernel.
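Here is a minimal sketch of doing so; the queue, the result buffer, and the kernel name are illustrative assumptions:
int_wrapper wrapped{42};
buffer<int, 1> resultBuf{range{1}};
exampleQueue.submit([&](handler& cgh) {
  accessor resultAcc{resultBuf, cgh};
  // Capturing int_wrapper by value copies it to the device, which is
  // now allowed thanks to the is_device_copyable specialization
  cgh.single_task<class int_wrapper_example>([=]() {
    resultAcc[0] = static_cast<int>(wrapped);
  });
});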
atomic_ref on device
See the SYCL specification (4.15.3. Atomic references) for more details.
// acc is some write-enabled accessor
auto syclKernel = [acc](id<1> i) {
  auto atomicRef =
      atomic_ref<float, memory_order::relaxed, memory_scope::system>{acc[99]};
  atomicRef.fetch_add(3.f);
};
info::kernel_device_specific
See the SYCL specification (A.5. Kernel information descriptors) for more details.
device exampleDevice;
auto workSize =
    exampleKernel.get_info<info::kernel_device_specific::global_work_size>(
        exampleDevice);
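For context, a kernel object such as exampleKernel can be retrieved from an executable kernel bundle; the following is a minimal sketch, assuming kernel_name names a kernel defined elsewhere:
auto execBundle = get_kernel_bundle<bundle_state::executable>(exampleContext);
kernel exampleKernel = execBundle.get_kernel(get_kernel_id<kernel_name>());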
Address Space Inference
We've been working on replacing our current device compiler with a newer one, based on the latest LLVM instead of LLVM 8, and with a brand new Address Space Inference engine. See Compute++ Experimental Compiler for more information.
Even though this compiler is called experimental and only ships in the experimental ComputeCpp package, with ComputeCpp 2.11 we consider it to be feature complete (barring any bugs we've missed). The new compiler already passes our entire test suite, which was designed with the current compiler, and it can also compile new tests that aren't possible with the current compiler.
This section highlights some of the latest changes. You can download the experimental compiler from our website.
Support for hierarchical parallelism
See the SYCL specification (4.9.4.2.3. Parallel for hierarchical invoke) for more details.
In most SYCL kernels, any variables created inside the kernel use the work-item private memory, and functions called within the kernel execute once per work-item. SYCL hierarchical parallelism introduces parallel_for_work_group, where variables are stored in local memory and function calls are performed once per work-group. It's then possible to have sections of the kernel that operate per work-item by calling parallel_for_work_item.
These layers of parallelism pose multiple challenges for the SYCL implementation. The ComputeCpp experimental compiler needs to be able to properly infer address spaces based on the current scope, place variables into the correct memory, and duplicate functions according to the address space. This took a while to get right as it exposed a lot of corner cases that weren't covered before, and now the experimental compiler is able to pass the hierarchical code test in the SYCL 1.2.1 CTS.
// buf is some buffer<int>
exampleQueue.submit([&](handler& cgh) {
  accessor globalAcc{buf, cgh};
  local_accessor<int> localAcc{range{32}, cgh};
  cgh.parallel_for_work_group<hierarchical_kernel>(
      range{1024}, [=](group<1> groupId) {
        groupId.parallel_for_work_item(localAcc.get_range(), [&](h_item<1> i) {
          localAcc[i.get_local_id()] += 7;
        });
        globalAcc[0] =
            std::accumulate(std::begin(localAcc), std::end(localAcc), 0);
      });
});
Improved diagnostics
A compiler's job is not just to compile valid code, but also to issue diagnostics on invalid code. The latest experimental compiler features much better diagnostics for common warnings and errors triggered when writing SYCL code. Most of these diagnostics have been taken from DPC++ in order to try to provide a consistent experience between implementations.
Let's take a look at a (very wrong) example SYCL program:
#include "sycl/sycl.hpp"

using namespace sycl;
class kernel_name;
void some_func() {}

int main() {
  queue testQueue;
  testQueue.submit([&](handler& cgh) {
    auto fnPtr = &some_func;
    cgh.single_task<kernel_name>([=]() {
      throw(5);
      fnPtr();
      asm("mov %1, %0\n\t");
    });
  });
  testQueue.wait_and_throw();
}
The SYCL kernel contains three lines of code, all of which are forbidden by the SYCL 2020 specification. Here's what an example compiler invocation might produce:
sycl-main.cpp:13:7: error: SYCL kernel cannot call through a function pointer
      fnPtr();
      ^
sycl-main.cpp:14:7: error: SYCL kernel cannot use inline assembly
      asm("mov %1, %0\n\t");
      ^
sycl-main.cpp:12:7: error: SYCL kernel cannot use exceptions
      throw(5);
      ^
/opt/Codeplay/ComputeCpp/include/SYCL/compiler_hooks.h:132:3: note: called by 'kernelgen_single_task<kernel_name, (lambda at sycl-main.cpp:11:34)>'
  functor();
  ^
3 errors generated.
Process only code reachable by SYCL device code
The device compiler is able to see all SYCL code, both host and device (unless hidden by a macro). For the experimental compiler, this used to mean generating and optimizing IR for host code as well. That work isn't relevant to producing device code, so the latest release ensures that only SYCL device code is processed.
This should remove any issues caused by code reachable only from the host erroneously being rejected for using features unsupported on the SYCL device (inline assembly, SSE, etc.). It should also lead to faster compile times.
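To illustrate the kind of code this affects, here is a hypothetical sketch; host_only_helper is never reachable from device code, so its inline assembly should no longer trip up the device compiler:
#include "sycl/sycl.hpp"

using namespace sycl;

// Host-only helper: never called from device code
int host_only_helper(int x) {
  asm("nop"); // inline assembly is fine in host-only code
  return x + 1;
}

int main() {
  queue q;
  int seed = host_only_helper(41);
  q.submit([&](handler& cgh) {
    cgh.single_task<class host_reachability_example>([=]() {
      // The kernel only uses the captured value, so the device
      // compiler never needs to process host_only_helper
      int value = seed + 1;
      (void)value;
    });
  });
  q.wait_and_throw();
}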