ComputeCpp version releases are currently focused on the following goals:
- Develop new features for developers to try out and to support our customers
- Provide feature completeness for the SYCL 2020 standard
- Replace the current device compiler with a brand new one, using Address Space Inference
ComputeCpp 2.11.0 brings improvements on all of these fronts.
New features
Initial prototype for USM/Buffer interop
SYCL 2020 brought Unified Shared Memory as an alternative to the buffer/accessor model used for managing device data. However, while the two models can co-exist within the same SYCL application, the standard didn't specify an efficient way to use data from one model in the other. This becomes problematic when using multiple libraries which were programmed with different models in mind.
For example, if you're writing a new SYCL 2020 application that uses USM pointers, but want to perform calls into a SYCL library that was written in a way that uses buffers as input parameters, you need to have a way to get a buffer from a USM pointer in order to pass that to the library. Similarly, if that library constructs a buffer for external use, you need to be able to obtain a USM pointer from that buffer so you can use it elsewhere in your new USM-aware application.
We did consider some existing proposals from other SYCL vendors for USM/Buffer interop, but because of the differences in our own implementation we decided on a different approach. We do intend to collaborate with the other vendors on standardizing the extension once we have something that we know works well for ComputeCpp.
Please note that USM only works with the experimental compiler.
The initial prototype focuses only on one part: retrieving a buffer from a USM pointer. This is done using the get_buffer extension function:
queue exampleQueue;
context exampleContext = exampleQueue.get_context();
// Allocate USM memory that is accessible on both host and device
float* usmPtr = malloc_shared<float>(1024, exampleQueue);
// Retrieve a buffer backed by the USM allocation
buffer<float, 1> buf = ext::codeplay::get_buffer(usmPtr, exampleContext);
From this point on you can generally use buf as you would any other buffer. Calling get_buffer multiple times with the same USM pointer and context will always return the same buffer object.
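As a minimal sketch of what that enables (the kernel name is our own illustrative choice, and freeing the USM allocation is omitted):
exampleQueue.submit([&](handler& cgh) {
  // Access the USM-backed buffer like any other buffer
  accessor acc{buf, cgh};
  cgh.parallel_for<class usm_buffer_example>(range{1024}, [=](id<1> i) {
    acc[i] = static_cast<float>(i[0]);
  });
});
exampleQueue.wait();
// After synchronization, the data written through the buffer should
// also be visible through the original USM pointer
float first = usmPtr[0];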
There are currently some limitations to the prototype that we plan to remove with further updates to this interop.
For more information on this extension please refer to ComputeCpp Extensions.
New way of specifying arguments to offline compilation
Offline Compilation is a Professional Edition feature
The usual compilation flow for ComputeCpp is to use the device compiler to generate some Intermediate Representation (IR), such as SPIR-V, which is passed on to the OpenCL driver at run-time for the final compilation. ComputeCpp PE users can perform Offline Compilation (also known as Ahead-Of-Time Compilation) of device kernels using the device compiler option --sycl-custom-tool. This generates code using the native device instructions instead of producing an IR binary.
Previously, it was possible to specify certain arguments for the provided tool by using the flag -sycl-custom-args, which passes any provided arguments on to the offline compilation tool, as sketched below.
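For instance, an invocation using this interface might look like the following; the tool path is a placeholder and the quoted options are hypothetical arguments for that tool, assuming -sycl-custom-args accepts a quoted argument string in the same way -sycl-custom-command-line does:
compute++ <other arguments> \
--sycl-custom-tool=/path/to/tool/my-offline-compiler \
-sycl-custom-args "-some-tool-flag -another-tool-flag"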
Additionally, the device compiler adds a few hidden arguments when invoking the tool, namely the output and the input, in the form of -o {output} {input}. However, not all tools can follow this interface.
For these tools, ComputeCpp 2.11.0 adds a new device compiler flag, -sycl-custom-command-line. Here is an example usage that invokes the Intel ocloc tool (OpenCL Offline Compiler):
compute++ <other arguments> \
--sycl-custom-tool=/path/to/tool/ocloc \
-sycl-custom-command-line "-out_dir {output_dirname} -output {output_basename} -output-no-suffix -file {input} -device skl -spirv_input" \
-sycl-custom-output-file-suffix ".gen"
Please refer to the ComputeCpp Integration Guide to learn more about this topic and the new flag.
SYCL 2020 improvements
This section provides a short code sample for each of the new SYCL 2020 features introduced in ComputeCpp 2.11.0.
Work-group algorithms in device code
See the SYCL specification (4.17.4. Group algorithms library) for more details.
exampleQueue.submit([&](handler& cgh) {
  // inputBuf and outputBuf are buffers of int with `size` elements
  auto accIn = accessor{inputBuf, cgh};
  auto accOut = accessor{outputBuf, cgh};
  cgh.parallel_for<kernel_name>(
      nd_range{range{size}, range{32}}, [=](nd_item<1> i) {
        // Setup
        auto linearID = i.get_global_linear_id();
        auto workGroup = i.get_group();
        auto beginPtr = accIn.data();
        auto endPtr = beginPtr + size;
        // Use different work-group algorithms
        accOut[linearID] = group_broadcast(workGroup, 7);
        accOut[linearID] =
            joint_reduce(workGroup, beginPtr, endPtr, plus<int>{});
        joint_exclusive_scan(workGroup, beginPtr, endPtr, accOut.data(),
                             plus<int>{});
        bool isAnyTrue = any_of_group(workGroup, linearID == 17);
      });
});
Kernel bundles
See the SYCL specification (section 4.11.2. Synopsis) for more details.
// Retrieve the kernel bundle in input state for the requested context
// and some devices associated with the context
auto inputBundle =
    get_kernel_bundle<bundle_state::input>(exampleContext, contextDevices);
auto objectBundle = compile(inputBundle);
// Linking the same bundle multiple times here, but we could link different bundles
std::vector<kernel_bundle<bundle_state::object>> manyBundles{objectBundle,
                                                             objectBundle};
auto builtBundle = link(manyBundles, contextDevices);
exampleQueue.submit([&](handler& cgh) {
  // Need to instruct the handler to use this kernel bundle
  cgh.use_kernel_bundle(builtBundle);
  cgh.parallel_for<kernel_name>(range{1024}, [=](item<1>) {
    // Do some computation
  });
});
marray
See the SYCL specification (4.14.3.1. Math array interface) for more details.
marray<float, 8> arr1, arr2;
// Two marrays and a scalar combine into a single 17-element marray
marray<float, 17> arr3{arr1, arr2, 99.f};
// Element-wise arithmetic operators are supported
arr1 *= arr2;
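As an extra illustrative sketch (not from the release notes themselves), SYCL 2020 also defines the math functions for marray arguments, applying them element-wise:
// Broadcast-construct an marray, then take the square root of each element
marray<float, 8> values{4.f};
marray<float, 8> roots = sqrt(values); // every element becomes 2.f
float firstRoot = roots[0];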
Backend interop improvements
See the SYCL specification (4.5.1.2. Template function get_native) and (4.5.1.3. Template functions make_*) for more details.
context syclContext;
backend_return_t<backend::opencl, platform> clPlatform =
    get_native<backend::opencl>(syclContext.get_platform());
platform syclPlatform = make_platform<backend::opencl>(clPlatform);
is_device_copyable
See the SYCL specification (4.12.3. is_device_copyable type trait) for more details.
Let's assume we have a type that is not standard layout and not trivially copyable. The SYCL specification doesn't guarantee this kind of type can be copied from host to the device, and encourages SYCL implementations to raise a warning/error if the user tries to do so.
Let's construct a simple type that cannot be copied to the device, at least according to the SYCL specification:
struct int_wrapper {
 private:
  int value;

 public:
  bool dummy; // Breaks standard layout
  int_wrapper(int value) : value(value) {}
  int_wrapper(const int_wrapper&) = default;
  int_wrapper& operator=(int_wrapper Other) { // Breaks trivially copyable
    value = Other.value;
    return *this;
  }
  operator int() const { return value; }
};
However, for this particular type we know that it's OK to copy it to the device - it's essentially just an integer, so it should be safe to copy and use it.
With this knowledge, SYCL allows us to specialize is_device_copyable for the type:
template <>
struct is_device_copyable<int_wrapper> : std::true_type {};
From now on we can pass variables of type int_wrapper into a SYCL kernel.
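Here is a minimal sketch of doing so; the queue, the result buffer, and the kernel name are illustrative assumptions:
int_wrapper wrapped{42};
buffer<int, 1> resultBuf{range{1}};
exampleQueue.submit([&](handler& cgh) {
  accessor resultAcc{resultBuf, cgh};
  // Capturing int_wrapper by value copies it to the device, which is
  // now allowed thanks to the is_device_copyable specialization
  cgh.single_task<class int_wrapper_example>([=]() {
    resultAcc[0] = static_cast<int>(wrapped);
  });
});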
atomic_ref on device
See the SYCL specification (4.15.3. Atomic references) for more details.
// acc is some write-enabled accessor
auto syclKernel = [acc](id<1> i) {
  auto atomicRef =
      atomic_ref<float, memory_order::relaxed, memory_scope::system>{acc[99]};
  atomicRef.fetch_add(3.f);
};
info::kernel_device_specific
See the SYCL specification (A.5. Kernel information descriptors) for more details.
device exampleDevice;
auto workSize =
    exampleKernel.get_info<info::kernel_device_specific::global_work_size>(
        exampleDevice);
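For context, a kernel object such as exampleKernel can be retrieved from an executable kernel bundle; the following is a minimal sketch, assuming kernel_name names a kernel defined elsewhere:
auto execBundle = get_kernel_bundle<bundle_state::executable>(exampleContext);
kernel exampleKernel = execBundle.get_kernel(get_kernel_id<kernel_name>());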
Address Space Inference
We've been working on replacing our current device compiler with a newer one, based on the latest LLVM instead of LLVM 8, and with a brand new Address Space Inference engine. See Compute++ Experimental Compiler for more information.
Even though this compiler is called experimental and only ships in the experimental ComputeCpp package, with ComputeCpp 2.11 we consider it to be feature complete (barring any bugs we've missed). The new compiler already passes our entire test suite, which was designed with the current compiler, and it can also compile new tests that aren't possible with the current compiler.
This section highlights some of the latest changes. You can download the experimental compiler from our website.
Support for hierarchical parallelism
See the SYCL specification (4.9.4.2.3. Parallel for hierarchical invoke) for more details.
In most SYCL kernels, any variables created inside the kernel use the work-item private memory, and functions called within the kernel execute once per work-item. SYCL hierarchical parallelism introduces parallel_for_work_group, where variables are stored in local memory and function calls are performed once per work-group. It's then possible to have sections of the kernel that operate per work-item by calling parallel_for_work_item.
These layers of parallelism pose multiple challenges for the SYCL implementation. The ComputeCpp experimental compiler needs to be able to properly infer address spaces based on the current scope, place variables into the correct memory, and duplicate functions according to the address space. This took a while to get right as it exposed a lot of corner cases that weren't covered before, and now the experimental compiler is able to pass the hierarchical code test in the SYCL 1.2.1 CTS.
// buf is some buffer<int>
exampleQueue.submit([&](handler& cgh) {
  accessor globalAcc{buf, cgh};
  local_accessor<int> localAcc{range{32}, cgh};
  cgh.parallel_for_work_group<hierarchical_kernel>(
      range{1024}, [=](group<1> groupId) {
        groupId.parallel_for_work_item(localAcc.get_range(), [&](h_item<1> i) {
          localAcc[i.get_local_id()] += 7;
        });
        globalAcc[0] =
            std::accumulate(std::begin(localAcc), std::end(localAcc), 0);
      });
});
Improved diagnostics
A compiler's job is not just to compile valid code, but also to issue diagnostics on invalid code. The latest experimental compiler features much better diagnostics for common warnings and errors triggered when writing SYCL code. Most of these diagnostics have been taken from DPC++ in order to try to provide a consistent experience between implementations.
Let's take a look at a (very wrong) example SYCL program:
#include "sycl/sycl.hpp"

using namespace sycl;
class kernel_name;
void some_func() {}

int main() {
  queue testQueue;
  testQueue.submit([&](handler& cgh) {
    auto fnPtr = &some_func;
    cgh.single_task<kernel_name>([=]() {
      throw(5);
      fnPtr();
      asm("mov %1, %0\n\t");
    });
  });
  testQueue.wait_and_throw();
}
The SYCL kernel contains three lines of code, all of which are forbidden by the SYCL 2020 specification. Here's what an example compiler invocation might produce:
sycl-main.cpp:13:7: error: SYCL kernel cannot call through a function pointer
      fnPtr();
      ^
sycl-main.cpp:14:7: error: SYCL kernel cannot use inline assembly
      asm("mov %1, %0\n\t");
      ^
sycl-main.cpp:12:7: error: SYCL kernel cannot use exceptions
      throw(5);
      ^
/opt/Codeplay/ComputeCpp/include/SYCL/compiler_hooks.h:132:3: note: called by 'kernelgen_single_task<kernel_name, (lambda at sycl-main.cpp:11:34)>'
  functor();
  ^
3 errors generated.
Process only code reachable by SYCL device code
The device compiler is able to see all SYCL code, both host and device (unless hidden by a macro). For the experimental compiler, this used to mean generating and optimizing IR for host code as well. That work isn't relevant to producing device code, so the latest release ensures that only SYCL device code is processed.
This should remove any issues caused by code reachable only from the host erroneously being rejected for using features unsupported on the SYCL device (inline assembly, SSE, etc.). It should also lead to faster compile times.
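To illustrate the kind of code this affects, here is a hypothetical sketch; host_only_helper is never reachable from device code, so its inline assembly should no longer trip up the device compiler:
#include "sycl/sycl.hpp"

using namespace sycl;

// Host-only helper: never called from device code
int host_only_helper(int x) {
  asm("nop"); // inline assembly is fine in host-only code
  return x + 1;
}

int main() {
  queue q;
  int seed = host_only_helper(41);
  q.submit([&](handler& cgh) {
    cgh.single_task<class host_reachability_example>([=]() {
      // The kernel only uses the captured value, so the device
      // compiler never needs to process host_only_helper
      int value = seed + 1;
      (void)value;
    });
  });
  q.wait_and_throw();
}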