Migration

Migrating from CUDA to SYCL

In this chapter we provide mapping tables for the main nomenclature differences between SYCL, CUDA, and OpenCL. Knowing the equivalent nomenclature for each platform is essential when migrating CUDA code to SYCL code. The corresponding CUDA function calls can be found in the CUDA C Programming Guide, the OpenCL function call syntax in the OpenCL 1.2 reference pages, and the SYCL function call syntax in the Khronos SYCL 1.2.1 specification.

Execution model equivalence

CUDA    | SYCL       | OpenCL
SM      | CU         | CU
SM core | PE         | PE
thread  | work-item  | work-item
block   | work-group | work-group

Memory model equivalence

CUDA            | SYCL                  | OpenCL
register        | private memory        | private memory
shared memory   | local memory          | local memory
constant memory | constant memory       | constant memory
global memory   | global memory         | global memory
local memory    | N/A (device-specific) | N/A (device-specific)

Host API equivalence

Platform API equivalence

CUDA | SYCL | OpenCL | Description
cudaStreamCreate() | queue class | clCreateCommandQueue() | If no stream is created explicitly, the default CUDA stream is constructed by the CUDA driver.
cudaStreamDestroy() | N/A | clReleaseCommandQueue() | In CUDA this function is required if the stream is created explicitly. In SYCL this is handled by the runtime.
cuStreamSynchronize(), cudaDeviceSynchronize() | queue::wait() | clEnqueueBarrierWithWaitList() |
cuStreamWaitEvent() | event::wait() | clWaitForEvents() |
cudaStreamAddCallback() | N/A | clSetEventCallback() | SYCL does not support a callback function.
N/A | platform::get_platforms() | clGetPlatformIDs() | This is optional in SYCL; it can be handled by the SYCL runtime via a device_selector.
cuCtxCreate() | context class | clCreateContext() | In CUDA the context is constructed by the CUDA driver by default. In SYCL it is constructed by the runtime via the device selector by default.
cudaGetDevice() | device class | clGetDeviceInfo() |
cudaSetDevice() | device_selector class | N/A |

Memory management API equivalence

CUDA | SYCL | OpenCL | Description
cudaMalloc(), cudaMallocHost(), cudaHostAlloc() | buffer class | clCreateBuffer(), clEnqueueMapBuffer() |
cudaHostGetDevicePointer() | accessor class | N/A | OpenCL does not support a unified memory system.
cudaMemset() | handler::fill() | clEnqueueFillBuffer() |
cudaMemcpy(), cudaMemcpyAsync() | handler::copy() | clEnqueueReadBuffer(), clEnqueueWriteBuffer(), clEnqueueCopyBuffer() | In SYCL an explicit copy is optional: a user can explicitly copy data between host and device, and if no explicit copy is provided the data transfer is handled implicitly.
cudaFree() | N/A | clReleaseMemObject() | Buffer deletion is handled by the SYCL runtime when the buffer reaches the end of its SYCL scope {}.
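
As a hedged illustration of the mapping above, the following sketch contrasts an explicit CUDA allocate-copy-free sequence (shown in comments) with the SYCL buffer and accessor classes, where the data movement is deduced by the runtime; the names used here are illustrative.

#include <CL/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t N = 1024;
  std::vector<float> host_data(N, 1.0f);
  cl::sycl::queue my_queue;

  // CUDA equivalent (explicit):
  //   float *d_ptr;
  //   cudaMalloc(&d_ptr, N * sizeof(float));
  //   cudaMemcpy(d_ptr, host_data.data(), N * sizeof(float), cudaMemcpyHostToDevice);
  //   ... kernel launch ...
  //   cudaMemcpy(host_data.data(), d_ptr, N * sizeof(float), cudaMemcpyDeviceToHost);
  //   cudaFree(d_ptr);
  {
    // the buffer class covers the allocation; accessors express the copies
    cl::sycl::buffer<float, 1> buf(host_data.data(), cl::sycl::range<1>(N));
    my_queue.submit([&](cl::sycl::handler &cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<class scale>(cl::sycl::range<1>(N),
                                    [=](cl::sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  } // buffer destruction writes the result back to host_data; no explicit free is needed
  return 0;
}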

Runtime API equivalence

CUDA | SYCL | OpenCL
<<<...>>> | nd_range class | global_work_size and local_work_size variables
"kernel function name"<<<...>>>() | queue::submit() | clCreateProgramWithSource()/clCreateProgramWithBinary(), clBuildProgram(), clCreateKernel(), clSetKernelArg(), clEnqueueNDRangeKernel()
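
The mapping of the launch syntax can be sketched as follows; my_queue, num_blocks and block_size are illustrative names assumed to be defined elsewhere.

// CUDA launch:
//   my_kernel<<<num_blocks, block_size>>>(args...);
// SYCL equivalent: the nd_range carries the sizes that CUDA encodes in <<<...>>>
my_queue.submit([&](cl::sycl::handler &cgh) {
  cl::sycl::nd_range<1> launch_range{
      cl::sycl::range<1>{num_blocks * block_size},  // global size
      cl::sycl::range<1>{block_size}};              // local (work-group) size
  cgh.parallel_for<class my_kernel>(launch_range,
                                    [=](cl::sycl::nd_item<1> item) {
                                      // kernel body
                                    });
});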

Device API equivalence

Kernel function qualifiers

Qualifiers are not needed in SYCL, as they are all abstracted by the SYCL runtime classes; the OpenCL equivalents are provided for reference.

CUDA | SYCL | OpenCL
__global__ function | N/A | __kernel function
__device__ function | N/A | N/A
__constant__ variable declaration | N/A | __constant variable declaration
__device__ variable declaration | N/A | __global variable declaration
__shared__ variable declaration | N/A | __local variable declaration
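
For example, the role played by a CUDA __shared__ array is taken in SYCL by an accessor with the local access target, requested inside the command group rather than declared with a qualifier. The following is a minimal sketch; my_queue, tile_size and global_size are illustrative names assumed to be defined elsewhere.

// CUDA:  __shared__ float tile[TILE_SIZE];
// SYCL:  request work-group local memory through a local accessor
my_queue.submit([&](cl::sycl::handler &cgh) {
  cl::sycl::accessor<float, 1, cl::sycl::access::mode::read_write,
                     cl::sycl::access::target::local>
      tile(cl::sycl::range<1>(tile_size), cgh);
  cgh.parallel_for<class uses_local_memory>(
      cl::sycl::nd_range<1>{cl::sycl::range<1>{global_size},
                            cl::sycl::range<1>{tile_size}},
      [=](cl::sycl::nd_item<1> item) {
        // each work-item initializes its own slot of the local tile
        tile[item.get_local_id(0)] = 0.0f;
        // make the writes visible to the whole work-group
        item.barrier(cl::sycl::access::fence_space::local_space);
      });
});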

Indexing equivalence

Please note that this is the general mapping between CUDA, OpenCL, and SYCL. Some architectures may not support three dimensions.

CUDA | SYCL | OpenCL
N/A | nd_item class | N/A
gridDim.{x, y, z} | nd_item::get_group_range({0,1,2}) | get_num_groups({0,1,2})
blockDim.{x, y, z} | nd_item::get_local_range({0,1,2}) | get_local_size({0,1,2})
blockIdx.{x, y, z} | nd_item::get_group({0,1,2}) | get_group_id({0,1,2})
threadIdx.{x, y, z} | nd_item::get_local_id({0,1,2}) | get_local_id({0,1,2})
N/A | nd_item::get_global_id({0,1,2}) | get_global_id({0,1,2})
N/A | nd_item::get_group_linear_id() | N/A
N/A | nd_item::get_local_linear_id() | N/A
N/A | nd_item::get_global_linear_id() | N/A
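
As a brief illustration of the indexing table, the CUDA expression blockIdx.x * blockDim.x + threadIdx.x maps onto the nd_item methods as sketched below; cgh and my_nd_range are assumed to be defined in an enclosing command group.

cgh.parallel_for<class indexing_example>(my_nd_range, [=](cl::sycl::nd_item<1> item) {
  // equivalent of blockIdx.x * blockDim.x + threadIdx.x
  size_t i = item.get_group(0) * item.get_local_range(0) + item.get_local_id(0);
  // the same value is available directly
  size_t j = item.get_global_id(0);
  (void)i;
  (void)j;
});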

Synchronization equivalence

CUDA | SYCL | OpenCL
__syncthreads() | nd_item::barrier() | barrier()
__threadfence_block() | nd_item::mem_fence() | mem_fence()
N/A | nd_item::mem_fence() | read_mem_fence()
N/A | nd_item::mem_fence() | write_mem_fence()
__threadfence() | N/A | N/A
__threadfence_system() | N/A | N/A
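
For example, a CUDA block-level barrier translates into a work-group barrier on the nd_item inside an nd_range kernel. A minimal sketch follows; cgh and my_nd_range are assumed to exist.

cgh.parallel_for<class barrier_example>(my_nd_range, [=](cl::sycl::nd_item<1> item) {
  // ... write to local memory ...
  // equivalent of __syncthreads(): all work-items in the work-group wait here
  item.barrier(cl::sycl::access::fence_space::local_space);
  // ... safely read values written by other work-items in the same work-group ...
});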

How to rewrite CUDA multi-GPU code for SYCL

NVIDIA CUDA has supported multiple GPUs since version 4.0. CUDA can dispatch kernels and handle data movement, communication and synchronization across all NVIDIA GPUs.

OpenCL supports using multiple OpenCL-enabled devices for dispatching kernels, data movement, communication and synchronization only if they reside in the same OpenCL context. NVIDIA devices do not have this context restriction, as all NVIDIA devices share the same context created for the NVIDIA platform. In general, a computer can accommodate more than one OpenCL-enabled device, provided by different vendors. An example could be a desktop computer with both Intel and AMD GPUs, where each of them has its own OpenCL driver. In that case, each device belongs to a different OpenCL platform and must have a different context. Two separate OpenCL buffers must be created, and data movement, communication and synchronization must be handled manually. SYCL, on the other hand, not only supports dispatching kernels to multiple OpenCL devices, but can also implicitly handle the data movement, synchronization and communication among all OpenCL-enabled devices, irrespective of the platform they belong to.

CUDA multiple GPU

A single CPU thread can be used to handle multiple GPUs in CUDA. The following function returns the number of available NVIDIA GPU devices:

cudaError_t cudaGetDeviceCount(int* num_gpu);

cudaSetDevice can be used to choose a specific device for dispatching a CUDA kernel. The following pseudo-code demonstrates dispatching a kernel on multiple NVIDIA GPUs:

....
int num_gpu;
// finding the available number of NVIDIA GPU devices
cudaGetDeviceCount(&num_gpu);

// looping over the devices and dispatching a kernel per device
for (int i = 0; i < num_gpu; i++) {
    // selecting the current device
    cudaSetDevice(i);
    // executing my_kernel on the selected device
    my_kernel<<<num_blocks, block_size>>>(...);
    // transferring data between the host and the selected device
    cudaMemcpy(...);
}
....

Note that the above code dispatches the kernels from a single thread. It is possible to parallelize the loop that launches the kernels; however, it is the user's responsibility to handle locking when multiple threads use multiple devices. See the CUDA C Programming Guide for further information.

SYCL multiple devices

The list of supported SYCL platforms can be obtained by calling get_platforms(), and the list of devices for each platform by calling platform::get_devices(). Once we have all the devices, we can construct one queue per device and dispatch different kernels to different queues.

When there is only one device, multiple kernels can be submitted to that same device. The following code snippet shows how to dispatch multiple kernels to a single SYCL device:

...;

// constructing the queue for a specific device
auto my_queue = cl::sycl::queue(device_selector);

// submitting a kernel to the SYCL queue
my_queue.submit([&](cl::sycl::handler &cgh) {
    ....
  // sycl kernel 1
  cgh.parallel_for(....);
});

my_queue.submit([&](cl::sycl::handler &cgh) {
    ....
  // sycl kernel 2
  cgh.parallel_for(....);
});

my_queue.submit([&](cl::sycl::handler &cgh) {
    ....
  // sycl kernel 3
  cgh.parallel_for(....);
});

...;

Moreover, when there are multiple devices, the kernels can be distributed among them. The following code snippet shows how to dispatch kernels to multiple SYCL devices:

...;

// getting the list of all supported sycl platforms
auto platform_list = cl::sycl::platform::get_platforms();
// looping over platforms
for (const auto &platform : platform_list) {
  // getting the list of devices for this platform
  auto device_list = platform.get_devices();
  // looping over devices
  for (const auto &device : device_list) {
    auto queue = cl::sycl::queue(device);
    // submitting a kernel to the SYCL queue
    queue.submit([&](cl::sycl::handler &cgh) {
      ....
      // sycl kernel
      cgh.parallel_for(....);
    });
  }
}
...;

Unlike CUDA, dispatching kernels to multiple SYCL devices is a thread-safe operation and can be done from different threads of execution.
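
As a sketch of this, each device can be driven from its own C++ thread without any additional locking; the function and kernel names below are illustrative, and one queue is created per device.

#include <CL/sycl.hpp>
#include <thread>
#include <vector>

void dispatch_to_all_devices() {
  std::vector<std::thread> workers;
  for (const auto &platform : cl::sycl::platform::get_platforms()) {
    for (const auto &device : platform.get_devices()) {
      // one host thread per device, each with its own queue
      workers.emplace_back([device]() {
        cl::sycl::queue queue(device);
        queue.submit([&](cl::sycl::handler &cgh) {
          cgh.single_task<class per_device_kernel>([]() { /* kernel body */ });
        });
        queue.wait();
      });
    }
  }
  for (auto &worker : workers) {
    worker.join();
  }
}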

Porting CUDA CPU library functions to SYCL

At the time of writing this document, the cuBLAS library, which is part of the CUDA 6.0 (and above) ecosystem, is the only library that can directly replace a CPU BLAS library simply by changing the link flags in the build system. In particular, NVBLAS is an interception layer that replaces calls to a CPU-based BLAS library with their cuBLAS counterparts. No source code modification is required.

Although technically possible, at the time of writing this document such an interception layer for SYCL-BLAS has not been added to the SYCL ecosystem. In general, the interception layer mechanism is suitable for C-based interfaces, where the symbol name is clearly identifiable in the generated binary. C++-based interfaces rely on template metaprogramming, which makes the interception of symbols at link time difficult. However, a new BLAS library project with a C interface could implement this feature.

Porting CUDA debug code to SYCL

Error-handling and reporting

In CUDA, all synchronous functions return an error code. Asynchronous functions (such as kernel launches) cannot return an error code through their interface, so a later API call may return an error that originates from a previous one. Typically, the error code returned by cudaDeviceSynchronize is used to query for the error of a previous kernel execution, although more advanced functionality is available. Developers are encouraged to write their own checking macros to wrap API calls and catch synchronous errors.
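
A typical user-defined checking macro of the kind described above might look like the following sketch (this is not part of the CUDA API; the macro name is illustrative):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                             \
  do {                                                               \
    cudaError_t status = (call);                                     \
    if (status != cudaSuccess) {                                     \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                   cudaGetErrorString(status), __FILE__, __LINE__);  \
      std::exit(EXIT_FAILURE);                                       \
    }                                                                \
  } while (0)

// Usage: wrap synchronous calls directly, and use cudaDeviceSynchronize()
// to surface an error raised by a previous asynchronous kernel launch:
//   CUDA_CHECK(cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice));
//   my_kernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaDeviceSynchronize());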

In SYCL, error reporting is done via the C++ exception mechanism. Synchronous operations throw exceptions derived from cl::sycl::exception, which can be caught in place by developers, as in the example below:

try {
    mySyclProgram.build_with_kernel_type<class myKernel>();
} catch (const cl::sycl::compile_program_error &e) {
    std::cout << " The program failed to build " << std::endl;
}

Asynchronous operations are captured and stored automatically by the SYCL runtime. Multiple exceptions are stored, in the order they are captured, in a list of type cl::sycl::exception_list. On construction of a queue, an optional functor - the asynchronous handler - can be specified. Whenever a user calls the wait_and_throw or throw_asynchronous methods of the SYCL queue, this functor is called. Developers can use this functor to implement custom behavior for dealing with exceptions, such as reducing the multiple exceptions in the list to the most severe one and re-throwing it. If neither of those methods is called, the exceptions are discarded when the queue object is destroyed. The async_handler example of the ComputeCpp SDK illustrates the usage of this mechanism to rethrow and display exceptions.
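
A minimal sketch of this mechanism is shown below; the handler body simply rethrows and prints each stored exception, but any user-defined policy can be applied instead.

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  // the asynchronous handler receives the list of captured exceptions
  auto exception_handler = [](cl::sycl::exception_list exceptions) {
    for (const std::exception_ptr &e : exceptions) {
      try {
        std::rethrow_exception(e);
      } catch (const cl::sycl::exception &ex) {
        std::cerr << "Asynchronous SYCL exception: " << ex.what() << std::endl;
      }
    }
  };
  // the handler is passed on queue construction ...
  cl::sycl::queue my_queue{cl::sycl::default_selector{}, exception_handler};
  // ... command groups are submitted to my_queue here ...
  // ... and the handler is invoked from wait_and_throw()
  my_queue.wait_and_throw();
  return 0;
}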

In both synchronous and asynchronous cases, when a SYCL-related error occurs, an object of type cl::sycl::exception will be thrown. SYCL exception objects are derived from the standard exception type std::exception. Whenever a specific exception type is available, the SYCL exception type can inherit from the particular exception type and throw a more elaborate report regarding the captured error. Refer to the SYCL specification Section 4.9.2 for details on the Exception Class Interface. All SYCL exception objects contain a description message, whose contents are implementation defined. Whenever possible, a low-level OpenCL error and/or a pointer to a SYCL context is also provided.
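
For instance, the underlying OpenCL error code carried by a SYCL exception can be inspected as sketched below; my_queue is assumed to exist, and the exact contents of the message are implementation defined.

try {
  my_queue.wait_and_throw();
} catch (const cl::sycl::exception &e) {
  std::cerr << "SYCL exception: " << e.what()
            << " (OpenCL error code: " << e.get_cl_code() << ")" << std::endl;
}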

Debugging SYCL code

The CUDA ecosystem includes a CUDA debugger, cuda-gdb, which is based on gdb with some extensions to facilitate debugging of GPU code. Developers can also use printf inside CUDA kernels to display intermediate results. In addition, a functional correctness checking suite called cuda-memcheck is available to check different aspects of CUDA applications.

In SYCL, we have to distinguish two kinds of debugging: (A) debugging the host application and (B) debugging the device-specific behavior. There is no standard solution for (B), as code executed on the device can only be debugged with a debugger that supports the given platform; this is inherently device-specific. On the other hand, (A) can be dealt with easily in SYCL. Any code that executes on a device can also run on the host using a host queue. Hence, replacing a device queue with a host queue will produce the same expected output (minus device-specific behavior) as the original device queue. Since the host queue is implemented in pure C++, normal C++ debuggers can be used to inspect the behavior of the kernels on the host. A host queue is created simply by constructing a queue from a host device. A simple option to enable debugging on the host is to use a pre-processor macro when creating a custom debug build, as shown below:

#if __DEBUG_IN_HOST_DEVICE
// Construct a selector for the host device
host_selector selector;
// Create a queue on the host device
queue myQueue(selector);
#else
// Use the default selector of the platform
queue myQueue;
#endif

The rest of the code remains unchanged.

Using the same mechanism, any C++ check tool can be used to inspect the behavior of the application on the host. For example, valgrind can be used to detect out of bounds access to arrays when using the host device without requiring special configuration options.

To display intermediate results, SYCL developers can still use the low-level printf function, with behavior equivalent to the OpenCL printf function. In device kernels, C++ iostream operations (e.g., std::cout or std::cerr) are not available. However, a replacement SYCL stream class with equivalent functionality is available on the device. The SYCL stream class is defined in Section 4.12 of the SYCL 1.2.1 specification. The hello-world example in the ComputeCpp SDK illustrates the usage of the stream object. The output of the stream object is displayed when the kernel execution completes.
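
A minimal sketch of the stream class on the device follows; the buffer sizes are illustrative, and myQueue is assumed to be an existing SYCL queue.

myQueue.submit([&](cl::sycl::handler &cgh) {
  // total buffer size of 1024 bytes, maximum statement size of 256 bytes
  cl::sycl::stream out(1024, 256, cgh);
  cgh.parallel_for<class hello_stream>(
      cl::sycl::range<1>{4}, [=](cl::sycl::id<1> idx) {
        out << "Hello from work-item " << idx[0] << cl::sycl::endl;
      });
});
// the accumulated output is displayed once the kernel has completed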

Measuring SYCL application performance

SYCL application performance can be measured by enabling profiling on a SYCL queue. Queue profiling is enabled by passing a property list containing cl::sycl::property::queue::enable_profiling() to the queue on construction.

Once profiling is enabled, the kernel submission time, kernel execution start time, and kernel execution end time can be extracted.

The following code snippet shows how to enable profiling and extract the execution times in a SYCL application.

int main() {

  // ...

  // enabling SYCL queue profiling
  auto property_list =
      cl::sycl::property_list{cl::sycl::property::queue::enable_profiling()};

  // adding the property list with profiling enabled option to the sycl queue. 
  auto sycl_queue = cl::sycl::queue(property_list);

  //....

  // submitting sycl kernel
  auto event = sycl_queue.submit([&](cl::sycl::handler& cgh){

    //...

  });

  // waiting for the kernel to finish
  event.wait();

  // getting kernel submission time
  auto submit_time =
        event.get_profiling_info<
                cl::sycl::info::event_profiling::command_submit>();

  // getting kernel start time
  auto start_time = event.get_profiling_info<
                              cl::sycl::info::event_profiling::command_start>();

  // getting kernel end time
  auto end_time = event.get_profiling_info<
                            cl::sycl::info::event_profiling::command_end>();

  // calculating the duration between submitting a sycl kernel and executing 
  // sycl kernel in milliseconds.
  auto submission_time = (start_time - submit_time) / 1000000.0f;

  // calculating the kernel execution time in milliseconds
  auto execution_time = (end_time - start_time) / 1000000.0f;

  // .....

}

Porting CUDA build system to SYCL

The CUDA platform offers a single unified compiler, nvcc, that is capable of generating a final application binary from CUDA code. Different versions of the CUDA platform offer different capabilities; refer to the nvcc documentation for details. The CUDA compiler can also be used to build only the device code, using another standard compiler (e.g., gcc) to compile and link on the host.

In the SYCL standard, two mechanisms of compilation are described: (A) Single unified compiler and (B) Multiple compilers for the same source. Different SYCL implementations may implement (A), (B) or both. In this document we focus on Codeplay's ComputeCpp implementation. The details of the build system can be found in the ComputeCpp integration guide. The integration guide contains detailed examples on integration of different build systems and different compilation options.

Switching from nvcc to compute++

The simplest way to switch from the NVIDIA CUDA compiler to the ComputeCpp SYCL compiler is to replace the former with the latter. compute++ is a clang driver, hence any standard clang and/or llvm options will be valid. For example, building a simple hello world CUDA file:

$ nvcc hello_world.cu -o hello_world.exe

Is equivalent to using compute++ to build the SYCL C++ source:

$ compute++ hello_world.cpp -o hello_world.exe

Note that compute++ builds C++ files, whereas nvcc builds .cu files. CUDA is not standard C++ and hence uses a different file extension. This implies that CUDA files need to be converted to C++ sources before they can be built as SYCL C++.

The following table shows the equivalence between nvcc-specific options and compute++ options. nvcc-specific options not listed here are not supported or are not applicable to compute++. For all standard options, refer to the clang command-line documentation.

nvcc | compute++ | Notes
(default behavior) | -sycl-driver | Compiles both host and device code into a single output binary
--fatbin, --ptx, --cubin, --device-c | -sycl -c or -sycl-device-only | Outputs only the integration header
--gpu-architecture, --gpu | -sycl-target ptx64 --cuda-gpu-arch | Only when using the ptx64 target

If the SYCL compiler driver is used, there is no need to install a host-side compiler. If the single-source, multiple-compiler-passes approach is used, a host-side compiler, such as g++, must be installed.

Porting CUDA stream to SYCL queue

CUDA stream

A CUDA stream is a sequence of CUDA operations submitted from host code. These operations are executed asynchronously, in order of submission. It is always the user's responsibility to synchronize an operation before using its result. If no CUDA stream is given, a default CUDA stream is created and all operations are submitted to it.

It is possible to overlap the execution of multiple CUDA operations by creating multiple CUDA streams and submitting operations to different streams. This can be used to overlap the submission of kernels with data transfer operations.

The following code snippet demonstrates how to define, create, and release a CUDA stream object. See the CUDA C programming guide for more information.

// defining a CUDA stream object
cudaStream_t stream;
// creating a CUDA stream object
cudaError_t err = cudaStreamCreate(&stream);
// releasing a CUDA stream object
err = cudaStreamDestroy(stream);

SYCL queue

In a similar fashion to CUDA streams, SYCL queues submit command groups for execution asynchronously. However, SYCL is a higher-level programming model, and data transfer operations are implicitly deduced from the dependencies of the kernels submitted to any queue. Furthermore, SYCL queues can map to multiple OpenCL queues, enabling transparent overlapping of data-transfer and kernel execution. The SYCL runtime handles the execution order of the different command groups (kernel + dependencies) automatically across multiple queues in different devices.

We can create a SYCL queue by instantiating the queue class.

cl::sycl::queue myQueue{device_selector};

The device_selector determines the SYCL device we are going to use. The system resources required by the queue are released automatically once it goes out of scope, following C++ RAII rules.

Event mechanism

CUDA event function

A CUDA event is a marker associated with a certain point in a stream. A CUDA event can be used either to synchronize the stream execution or to monitor progress on the device. The following code snippet shows the declaration, construction and release of a CUDA event. More information can be found in the CUDA C Programming Guide.

// declaration of a CUDA event
cudaEvent_t event;
// creation of a CUDA event
cudaError_t err = cudaEventCreate(&event);

// using the event here ...

// releasing a CUDA event
err = cudaEventDestroy(event);

SYCL event object

An event in SYCL is an abstraction of the OpenCL cl_event object. An OpenCL event is used to synchronize memory transfers, dispatch of kernels and signaling barriers. As an abstraction layer on OpenCL events, SYCL events accommodate synchronization between different contexts, devices and platforms. The submit method in a SYCL queue returns a cl::sycl::event. SYCL events can be used to manually define synchronization points irrespective of the underlying device and, thus, can be used also to synchronize operations on host queues.

{
// SYCL events are returned from a command group submission
auto event = syclQueue.submit([&](cl::sycl::handler &cgh) {
        /** SYCL command group **/
      });
// Users can wait on a specific event, regardless of where it was submitted
event.wait();
// Event is released automatically at the end of the scope
}

Triggering user functions from device events

CUDA callback function

A stream callback function enables the execution of a host function provided by the application after all the previous operations in the stream have completed. A stream callback can be submitted to a CUDA stream by calling the following function:

cudaError_t cudaStreamAddCallback(cudaStream_t stream,
cudaStreamCallback_t callback, void *userData, unsigned int flags);

where callback is the callback function and userData is the user-provided parameter passed to the callback function.

A CUDA stream callback has the following restrictions:

  • A callback function cannot call any CUDA API function
  • A callback function cannot contain any synchronization function.

The following code snippet shows an example of using a CUDA stream callback.

void CUDART_CB callback_func(cudaStream_t stream, cudaError_t status, void *) {
  printf("callback function is called\n");
}
// defining a CUDA stream object
cudaStream_t stream;
// creating a CUDA stream object
cudaError_t err = cudaStreamCreate(&stream);
// launching the kernels on the stream
first_kernel<<<numBlock, blockSize, 0, stream>>>();
second_kernel<<<numBlock, blockSize, 0, stream>>>();
third_kernel<<<numBlock, blockSize, 0, stream>>>();
// registering the callback, executed after the three kernels complete
cudaStreamAddCallback(stream, callback_func, NULL, 0);
// releasing a CUDA stream object
err = cudaStreamDestroy(stream);

SYCL callback function

There is no direct functionality in SYCL to trigger a host callback from a device function. Developers can manually wait on an event and call the callback function afterwards. The following code snippet represents an example of a callback function implementation using this mechanism.

std::function<void(std::string)> my_callback =
    [](std::string my_string) { std::cout << my_string << std::endl; };
// constructing the SYCL queue
auto queue = cl::sycl::queue{cl::sycl::default_selector{}};
// data on the host: 100 elements initialized to 1
auto vec = std::vector<float>(100, 1);
// constructing a SYCL buffer
auto buffer = cl::sycl::buffer<float, 1>{vec.data(), cl::sycl::range<1>{100}};

// submitting the kernel to the queue
auto event = queue.submit([&](cl::sycl::handler &cgh) {
  // getting a read_write accessor over the SYCL buffer
  auto acc = buffer.get_access<cl::sycl::access::mode::read_write>(cgh);
  // SYCL device kernel
  cgh.parallel_for<class kernel>(
      cl::sycl::range<1>{100},
      [=](cl::sycl::id<1> id) { acc[id] += acc[id]; });
});

// The future can be waited on or stored elsewhere; the callback will execute
// asynchronously.
auto fut = std::async([&]() {
  // wait for the kernels to finish
  event.wait();
  // calling the callback function after the kernel execution
  my_callback("This is a callback function");
});

This achieves the same result as a CUDA callback, using pure C++ features.

Codeplay has proposed to the SYCL standard a new type of handler, the host_handler, that is capable of executing a single task on the host and can be submitted directly to the SYCL device queue. It ensures that this task is executed asynchronously and in order with respect to the other command groups on the same queue. See the Enqueuing host tasks on SYCL queues proposal for further information. This feature has been available as a vendor extension since ComputeCpp 0.5.1.
