Examples

This guide was created for versions: v0.1.0 - Latest

In this chapter we show different SYCL and CUDA examples and demonstrate the similarities and differences between them.

Depending on how the code has been written, there are three approaches for how to maintain it.

  • In the first approach, for the maintenance of CUDA/SYCL applications we encapsulate SYCL and CUDA using C++ abstractions. Therefore, a developer can have a single source file that can be compiled with a CUDA compiler for NVIDIA GPUs, or ComputeCpp for SYCL. A pre-processor macro can be used to specify which code is used for SYCL and which is used for CUDA, as shown below:
const size_t N = 10u;
float * ptrA = nullptr;
#ifdef SYCL_ENABLED
// Create a SYCL buffer of 10 floats
// This pointer is a number that identifies the buffer
// in the pointer mapper structure
ptrA = static_cast<float *>(SYCLmalloc(N * sizeof(float), pMap));
#elif __CUDA__
cudaMalloc(&ptrA, N*sizeof(float)); 
#endif
  • In the second approach, we use template specialization where the CUDA and SYCL versions implement their own back-ends and at compile time, based on the chosen compiler, either the CUDA or SYCL back-ends will be selected. In the example below, different back-ends implement different functionality for the library, but the main functionality and the interface is the same. This is the approach followed by the standard C++ Parallel STL library, among others.
#if SYCL_ENABLED
    // SYCL specific library backend
    #include <SYCL/sycl.hpp>
    #include <sycl_backend.hpp>
#else 
    // CUDA specific library backend
    #include <cuda.h>
    #include <cuda_backend.hpp>
#endif

// General library interface
// The actual implementation is provided by the specific
// backends.
#include <library_interface.hpp>

/*.... rest of the code ....*/

int main () {
  Library<Backend> library;
}
  • A third possible approach is simply to port all the CUDA code to SYCL, and use a SYCL implementation capable of running on NVIDIA devices via a PTX backend. This is the simplest to maintain, as once the code is on C++ and SYCL, there is only one single source to maintain. However, it may not offer all the features of the latest NVIDIA hardware, since they may not be available on the SYCL implementation.

The first two approaches require the SYCL and the CUDA source separated in the source using pre-processor macros and different build systems. In the following subsections we describe how to convert existing CUDA code into its SYCL equivalent, which can be used for either option above.

For the last case, there is only one source code that can be executed on NVIDIA GPU devices.

Vector Addition

This section represents a step-by-step CUDA and SYCL example for adding two vectors together. The purpose of this example is to compare the CUDA and SYCL programming model, demonstrating how to map both API and concepts from the former to the latter.

The completed runnable code samples for CUDA and SYCL are available at end of this section.

Memory Objects

In SYCL, the buffer represents the memory storage, however accessing memory is handled via an accessor object. The following code snippet from the SYCL code for vector addition represents the buffer creation for SYCL.

    auto A_buff =
        cl::sycl::buffer<float>(A.data(), cl::sycl::range<1>(array_size));
    // input SYCL buffer B
    auto B_buff =
        cl::sycl::buffer<float>(B.data(), cl::sycl::range<1>(array_size));
    // output SYCL buffer C
    auto C_buff =
        cl::sycl::buffer<float>(C.data(), cl::sycl::range<1>(array_size));

A SYCL accessor can be either a host accessor or a device accessor. A device accessor is created within a queue submit scope and it is only usable inside the kernel lambda. For creating different types of accessor see the section 4.7.6 of the SYCL 1.2.1 specification. The following code snippet represents the device accessor creation for the SYCL vector addition code.

    auto A_acc = A_buff.get_access<cl::sycl::access::mode::read>(cgh);
    // getting read access over the sycl buffer B inside the device kernel
    auto B_acc = B_buff.get_access<cl::sycl::access::mode::read>(cgh);
    // getting write access over the sycl buffer C inside the device kernel
    auto C_acc = C_buff.get_access<cl::sycl::access::mode::write>(cgh);

However, CUDA uses unified memory so the device memory will be created as a standard pointer using the cudaMalloc function. The following code snippet from CUDA code for vector addition represents the device memory allocation for CUDA.

    // allocating device memory                                                   
    float *A_dev;                                                                 
    float *B_dev;                                                                 
    float *C_dev;                                                                 
    cudaMalloc((void **)&A_dev, array_size * sizeof(float));                      
    cudaMalloc((void **)&B_dev, array_size * sizeof(float));                      
    cudaMalloc((void **)&C_dev, array_size * sizeof(float));  

However, unlike SYCL which handles the memory transfer implicitly, in CUDA we need to explicitly declare the data transfer between host and device.

The following code snippets from CUDA vector addition code represent the data transfer between host and device.

Transferring data from host to device:

 // explicitly copying data from host to device                                
  cudaMemcpy(A_dev, A.data(), array_size * sizeof(float),                       
             cudaMemcpyHostToDevice);                                           
  cudaMemcpy(B_dev, B.data(), array_size * sizeof(float),                       
             cudaMemcpyHostToDevice); 

Transferring data from device to host:

    cudaMemcpyDeviceToHost);                                           
  // releasing the cuda memory objects                                          
  cudaFree(A_dev);   

Device kernel

The following code snippet represents the CUDA kernel for vector addition, which is passed as a function annotated with __global__.

__global__ void vector_add(const float *A, const float *B, float *C,            
                           size_t array_size) {                                 
  // local thread id                                                            
  size_t id = threadIdx.x;                                                      
  // calculating global id                                                      
  size_t total_threads = gridDim.x * blockDim.x;                                
  for (size_t i = id; i < array_size; i += total_threads) {                     
    C[i] = A[i] + B[i];                                                         
  }                                                                             
}   

For SYCL, the vector addition is passed as a lambda expression to the parallel-for. The SYCL kernel for vector addition is represented in the following code snippet from the SYCL vector addition code.

cgh.parallel_for<class vec_add>(
          cl::sycl::range<1>{total_threads}, [=](cl::sycl::item<1> itemId) {
            auto id = itemId.get_id(0);
            for (auto i = id; i < C_acc.get_count(); i += itemId.get_range()[0])
              C_acc[i] = A_acc[i] + B_acc[i];
          });

Device kernel execution

In CUDA, we need to set up the number of blocks per kernel and the block size in order to distribute the workload among the threads. In SYCL, we need to set the total number of threads executing the kernel and the work group size in order to distribute the workload among the threads. We use the same approach in CUDA and SYCL for launching the kernel. The following code snippet from the SYCL vector addition sample represents the number of threads calculation for SYCL.

    auto num_groups =
        sycl_queue.get_device()
            .get_info<cl::sycl::info::device::max_compute_units>();
    // getting the maximum work group size per thread
    auto work_group_size =
        sycl_queue.get_device()
            .get_info<cl::sycl::info::device::max_work_group_size>();
    // building the best number of global thread
    auto total_threads = num_groups * work_group_size;

The following code snippet from a CUDA vector addition represents the number of threads calculation for CUDA.

// getting device property in order to query device parameters                
  cudaDeviceProp prop;                                                          
  cudaGetDeviceProperties(&prop, 0);                                                                         
  const size_t max_thread_per_block = prop.maxThreadsPerBlock;                  
  const size_t num_thread_per_block =                                           
      std::min(max_thread_per_block, array_size);                               
  const size_t num_block_per_grid =                                                                                                  
               (size_t)std::ceil(((float)array_size) / num_thread_per_block);  
  // constructing block size                                                    
  dim3 block_size(num_thread_per_block, 1, 1);                                  
  // constructing number of blocks (grid size)                                  
  dim3 num_blocks(num_block_per_grid, 1, 1);

In CUDA, we calculate the maximum thread per block which is equivalent of the SYCL local size (work group size). SYCL has the concept of global size (total number of thread). However, in CUDA, they use the concept of number of blocks per grid where:

(SYCL global size)= (CUDA number of block per grid) * (CUDA thread per block).

In SYCL, dispatching the kernel is handled via queue submit command, while in CUDA dispatching the kernel is performed via <<...>>. In both SYCL and CUDA the kernel execution will be handled automatically through the run-time system. The following code snippet from the CUDA vector addition sample code represents dispatching the kernel for CUDA.

// launching and executing cuda kernel                                        
  vector_add<<<num_blocks, block_size>>>(A_dev, B_dev, C_dev, array_size);      

The following code snippet from the SYCL vector addition sample code shows how to
dispatch the kernel with SYCL.

    // submitting the SYCL kernel to the cvengine SYCL queue.
    sycl_queue.submit([&](cl::sycl::handler &cgh) {

Otherwise, in CUDA the output result will be returned to the host by using memcpy, as shown in the following code snippet.

       cudaMemcpyDeviceToHost);                                           
  // releasing the cuda memory objects                                          
  cudaFree(A_dev); 

By leveraging SYCL C++ scopes, the data will be automatically returned to the host and there is no need for explicit copy. The following code snippets from the SYCL vector addition sample code use C++ scopes (marked by { and }) to trigger memory synchronization points.

{ // beginning of SYCL objects' scope
 } // end of SYCL objects' scope

Releasing Object

CUDA follows an explicit approach to resource management and relies on developers to allocate and de-allocate memory. SYCL follows the RAII (Resource Adquisition is Initialization) rules of C++ to manage resources, hence, the scope of objects will automatically handle the low level resource managements. The following vector addition code snippet in the CUDA vector addition sample code demonstrates the releasing of CUDA memory objects.

 // releasing the cuda memory objects                                          
  cudaFree(A_dev);                                                              
  cudaFree(B_dev);                                                              
  cudaFree(C_dev);  

In SYCL, ending the scope represented in the following code snippet will automatically destroy all SYCL objects and return the output data back to the host vector C.

} // end of SYCL objects' scope

The following code represents vector addition for SYCL.

#include <SYCL/sycl.hpp>
int main() {
  // array size
  auto array_size = 256;
  // We initialised the A and B as an input vector
  std::vector<float> A(array_size, 1.0f);
  std::vector<float> B(array_size, 1.0f);
  // The output vector does not need to be initialized
  std::vector<float> C(array_size);
  { // beginning of SYCL objects' scope
    // constructing a SYCL queue for CVengine OpenCL device where automatically
    // build the underlying context and command_queue for the chosen device.
    auto sycl_queue = cl::sycl::queue;
    // input SYCL buffer A
    auto A_buff =
        cl::sycl::buffer<float>(A.data(), cl::sycl::range<1>(array_size));
    // input SYCL buffer B
    auto B_buff =
        cl::sycl::buffer<float>(B.data(), cl::sycl::range<1>(array_size));
    // output SYCL buffer C
    auto C_buff =
        cl::sycl::buffer<float>(C.data(), cl::sycl::range<1>(array_size));
    // getting the total number of compute untis
    auto num_groups =
        sycl_queue.get_device()
            .get_info<cl::sycl::info::device::max_compute_units>();
    // getting the maximum work group size per thread
    auto work_group_size =
        sycl_queue.get_device()
            .get_info<cl::sycl::info::device::max_work_group_size>();
    // building the best number of global thread
    auto total_threads = num_groups * work_group_size;
    // submitting the SYCL kernel to the cvengine SYCL queue.
    sycl_queue.submit([&](cl::sycl::handler &cgh) {
      // getting read access over the sycl buffer A inside the device kernel
      auto A_acc = A_buff.get_access<cl::sycl::access::mode::read>(cgh);
      // getting read access over the sycl buffer B inside the device kernel
      auto B_acc = B_buff.get_access<cl::sycl::access::mode::read>(cgh);
      // getting write access over the sycl buffer C inside the device kernel
      auto C_acc = C_buff.get_access<cl::sycl::access::mode::write>(cgh);
      // constructing the kernel
      cgh.parallel_for<class vec_add>(
          cl::sycl::range<1>{total_threads}, [=](cl::sycl::item<1> itemId) {
            auto id = itemId.get_id(0);
            for (auto i = id; i < C_acc.get_count(); i += itemId.get_range()[0])
              C_acc[i] = A_acc[i] + B_acc[i];
          });
    });
  } // end of SYCL objects' scope
  return 0;

The following code represents the CUDA equivalent of vector addition.

#include <stdio.h>                                                              
#include <vector>                                                               

// CUDA device kernel                                                           
__global__ void vector_add(const float *A, const float *B, float *C,            
                           size_t array_size) {                                 
  // local thread id                                                            
  size_t id = threadIdx.x;                                                      
  // calculating global id                                                      
  size_t total_threads = gridDim.x * blockDim.x;                                
  for (size_t i = id; i < array_size; i += total_threads) {                     
    C[i] = A[i] + B[i];                                                         
  }                                                                             
}                                                                               

int main() {                                                                    
  const size_t array_size = 256;                                                
  std::vector<float> A(array_size, 1.0f);                                       
  std::vector<float> B(array_size, 1.0f);                                       
  std::vector<float> C(array_size);                                             

  // allocating device memory                                                   
  float *A_dev;                                                                 
  float *B_dev;                                                                 
  float *C_dev;                                                                 
  cudaMalloc((void **)&A_dev, array_size * sizeof(float));                      
  cudaMalloc((void **)&B_dev, array_size * sizeof(float));                      
  cudaMalloc((void **)&C_dev, array_size * sizeof(float));                      

  // explicitly copying data from host to device                                
  cudaMemcpy(A_dev, A.data(), array_size * sizeof(float),                       
             cudaMemcpyHostToDevice);                                           
  cudaMemcpy(B_dev, B.data(), array_size * sizeof(float),                       
             cudaMemcpyHostToDevice);                                           

  // getting device property in order to query device parameters                
  cudaDeviceProp prop;                                                          
  cudaGetDeviceProperties(&prop, 0);                                                                         
  const size_t max_thread_per_block = prop.maxThreadsPerBlock;                  
  const size_t num_thread_per_block =                                           
      std::min(max_thread_per_block, array_size);                               
  const size_t num_block_per_grid =                                                                                                  
               (size_t)std::ceil(((float)array_size) / num_thread_per_block);  
  // constructing block size                                                    
  dim3 block_size(num_thread_per_block, 1, 1);                                  
  // constructing number of blocks (grid size)                                  
  dim3 num_blocks(num_block_per_grid, 1, 1);                                    
  // launching and executing cuda kernel                                        
  vector_add<<<num_blocks, block_size>>>(A_dev, B_dev, C_dev, array_size);      
  // retruning result to the host vector                                        
  cudaMemcpy(C.data(), C_dev, array_size * sizeof(float),                       
             cudaMemcpyDeviceToHost);                                           
  // releasing the cuda memory objects                                          
  cudaFree(A_dev);                                                              
  cudaFree(B_dev);                                                              
  cudaFree(C_dev);                                                              
  return EXIT_SUCCESS;                                                          
}

Reduction

This section complements the vector addition section that introduces different concepts of a queue, buffer, and kernel in SYCL. This section covers applying step-by-step optimization strategy for a SYCL reduction algorithm. The CUDA equivalent of step-by-step optimization for a reduction algorithm is located in the CUDA Toolkit Documentation under NVIDIA license. Please refer to CUDA Code Samples for further information about downloading and installing CUDA samples on different platforms.

The reduction algorithm is one of the most common data parallel algorithms that uses a collective communication primitive to combine multiple elements of a vector into a single one, using the associative binary operator. The reduction algorithm requires local memory to communicate the partial result calculated by different work items within a work-group. We are using a tree-based algorithm to calculate the reduction. Figure @fig:8 illustrates the different concepts behind the reduction. The optimization steps for parallel reduction algorithm are based on the CUDA parallel reduction.

parallel Reduction algorithm

For simplicity, we are using the input size of power of two. However, it is easy to use the non-power of two as data size by checking the boundaries of the kernel before executing.

Naive algorithm

In the naive algorithm, we simply use the interleaved thread access. Therefore, at each step, we are disabling half of the thread until one thread is left.

The following picture represents the architectural view of the naive algorithm.

Architectural view of the naive algorithm

The following code represents the naive reduction algorithm in SYCL. Following the tree represented in the figure @fig:8 , we are using n threads and in log n step, we are able to reduce the output into one element.

// Naive algorithm using tree based architecture, and using even threads
// to calculate the result. The algorithm time is log(n), and we use n threads.
// we use modulo operator to distinguish the even threads.
template <>
class reduce_t<0> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    local_acc[local_id] = (global_id < in_size) ? in_acc[global_id] : 0;
    for (index_t i = 1; i < item_id.get_local_range(0); i *= 2) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      if (local_id % (2 * i) == 0)
        local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

This kernel uses interleaved threads which is costly, especially in SIMD(Simple Instruction Multiple Data) architectures, where in each step half of the threads that can be group together and executed in one lockstep are disabled. Also, we are using the modulo algorithm and the cost of calculating this algorithm is high.

Using contiguous threads

By using contiguous threads, the expensive modulo access has been removed.

The following picture represents the architectural view of the naive algorithm.

Architectural view of using contigues threads

The following code represents the modified algorithm that uses the contiguous threads and eliminates the modulo operation.

// using consecutive thread to calculate the result instead of even threads.
template <>
class reduce_t<1> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    local_acc[local_id] = (global_id < in_size) ? in_acc[global_id] : 0;
    for (index_t i = 1; i < item_id.get_local_range(0); i *= 2) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      auto id = local_id * 2 * i;
      if (id < item_id.get_local_range(0)) local_acc[id] += local_acc[id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

However, the memory access is still not coalesced, as each thread accesses every i-th (i is the power of 2 and increased by for loop step) memory. Also, the way that each thread accesses the data can result in bank conflict. For example, suppose that we have 32 threads and the local memory on a device has 16 banks. For each work-group, the first 16 threads are used to reduce the data. The first 8 threads will use the same 8 banks as the second 8 threads, leaving 8 banks un- used. Therefore, the access to the local memory for the second 8 threads would be serialized. This could reduce the level of parallelism by a factor of 2.

Using Coalesced memory access

By modifying contiguous threads to access the memory contiguously, we can avoid both bank conflict and have coalesced memory access.

The following picture represents the architectural view of the reduction algorithm, using contiguous threads and coalesced memory access.

Architectural view of sequential thread accessing memory in a coalesced way

The following code shows how to use contiguous threads with coalesced memory accesses.

// using consecutive thread to calculate the result instead of even threads.
template <>
class reduce_t<2> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    local_acc[local_id] = (global_id < in_size) ? in_acc[global_id] : 0;
    for (index_t i = item_id.get_local_range(0) / 2; i > 0; i >>= 1) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      if (local_id < i) local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

Adding more workload per thread

Applying Brent’s theorem, each thread sequentially reduces k elements and sums the in private memory. Then, each thread writes the result into the local memory. This will increase the workload per thread by a factor of k. For simplicity, we set the k to the power of 2.

// using consecutive thread to calculate the result instead of even threads.
template <>
class reduce_t<3> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    using output_type = typename write_accessor_t::value_type;
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    output_type private_sum = output_type(0);
    // per thread reduction
    for (int i = global_id; i < in_size; i += item_id.get_global_range(0)) {
      private_sum += ((i < in_size) ? in_acc[i] : output_type(0));
    }
    local_acc[local_id] = private_sum;
    for (index_t i = item_id.get_local_range(0) / 2; i > 0; i >>= 1) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      if (local_id < i) local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

Unrolling the for loop

By converting the local size to a compile-time value, the start, end, and step for the for loop is defined at compile-time. Therefore, the compiler is able to automatically unroll the for loop. The following kernel receives the local size as a template parameter.

// with static value for local size to allow compiler to unroll the parallel for
// loop
template <>
class reduce_t<4> {
 public:
  template <int local_size, typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    using output_type = typename write_accessor_t::value_type;
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    output_type private_sum = output_type(0);
    // per thread reduction
    for (int i = global_id; i < in_size; i += item_id.get_global_range(0)) {
      private_sum += ((i < in_size) ? in_acc[i] : output_type(0));
    }
    local_acc[local_id] = private_sum;
    // reduction for loop
    for (index_t i = local_size / 2; i > 0; i >>= 1) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      if (local_id < i) local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

Launching multiple kernels

When the number of work-groups is bigger than one, the partial result of each work-group must be written to a temporary global buffer. This is due to the fact that there is no global barrier in OpenCL among the work-groups. Therefore, we need more than one step to reduce the data. In this case, we use the n-step reduction to reduce the result, where the result of each reduction step i is stored in a temporary buffer temp_buff. This temporary buffer is passed as an input buffer to the reduction step i+1. However, launching too many kernels can be expensive. In order to optimize the number of kernels in the n-step reduction algorithm, we set a threshold value, called work_group_load. If the size of temp_buff is bigger than work_group_load we launch a new kernel and pass the temp_buff as an input to it. Otherwise, we bring the data on the host and reduce it in a for loop.

The following picture represents the architectural view of launching reduction kernels recursively.

Recursively launching of Multiple reduction Kernel

The following code demonstrates launching n-step reduction kernels recursively with the assist of threshold value work_group_load.

// launching multiple kernel where the partial result is bigger than work group
// load
template <int work_group_load, int k_factor, int reduction_class,
          typename index_t, typename data_t>
void reduction(index_t in_size, cl::sycl::queue &q,
               cl::sycl::buffer<data_t> in_buff,
               cl::sycl::buffer<data_t> out_buffer) {
  using read_accessor_t =
      cl::sycl::accessor<data_t, 1, cl::sycl::access::mode::read,
                         cl::sycl::access::target::global_buffer>;
  using write_accessor_t =
      cl::sycl::accessor<data_t, 1, cl::sycl::access::mode::write,
                         cl::sycl::access::target::global_buffer>;

  using local_accessor_t =
      cl::sycl::accessor<data_t, 1, cl::sycl::access::mode::read_write,
                         cl::sycl::access::target::local>;

  const constexpr index_t local_size = work_group_load / k_factor;
  const index_t global_size = round_up(in_size / k_factor, local_size);
  index_t num_group = global_size / local_size;

  bool condition = (num_group > work_group_load) ? true : false;
  auto temp_buff = get_out_buffer(num_group, out_buffer);

  // submitting the SYCL kernel to the cvengine SYCL queue.
  q.submit([&](cl::sycl::handler &cgh) {
    // getting read access over the sycl buffer A inside the device kernel
    auto in_acc =
        in_buff.template get_access<cl::sycl::access::mode::read>(cgh);
    // getting write access over the sycl buffer C inside the device kernel

    auto out_acc =
        temp_buff.template get_access<cl::sycl::access::mode::write>(cgh);

    auto local_acc = local_accessor_t(local_size, cgh);

    // constructing the kernel
    cgh.parallel_for(
        cl::sycl::nd_range<1>{cl::sycl::range<1>{size_t(global_size)},
                              cl::sycl::range<1>{size_t(local_size)}},
        reduction_t<reduction_class, local_size, index_t, read_accessor_t,
                    local_accessor_t, write_accessor_t>(in_acc, local_acc,
                                                        out_acc, in_size));
  });
  if (condition) {
    // launching a new kernel and passing tem_buff as an input
    reduction<work_group_load, k_factor, reduction_class>(
        num_group, q, temp_buff, out_buffer);
  } else if (num_group > 1) {
    // The temp_buff size is smaller than the work_group_load
    auto host_out_acc =
        out_buffer.template get_access<cl::sycl::access::mode::write>();
    auto host_in_acc =
        temp_buff.template get_access<cl::sycl::access::mode::read>();
    // reduce the remaining on the host
    for (index_t i = 0; i < num_group; i++) {
      host_out_acc[0] += host_in_acc[i];
    }
  }
}

Full Reduction Application

The following code demonstrates the whole reduction application example containing all the above step-by-step optimizations.

#include <SYCL/sycl.hpp>
#include <iostream>
#include <sstream>

template <int reduction_class>
class reduce_t;

// Naive algorithm using tree based architecture, and using even threads
// to calculate the result. The algorithm time is log(n), and we use n threads.
// we use modulo operator to distinguish the even threads.
template <>
class reduce_t<0> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    local_acc[local_id] = (global_id < in_size) ? in_acc[global_id] : 0;
    for (index_t i = 1; i < item_id.get_local_range(0); i *= 2) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      if (local_id % (2 * i) == 0)
        local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

// using consecutive thread to calculate the result instead of even threads.
template <>
class reduce_t<1> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    local_acc[local_id] = (global_id < in_size) ? in_acc[global_id] : 0;
    for (index_t i = 1; i < item_id.get_local_range(0); i *= 2) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      auto id = local_id * 2 * i;
      if (id < item_id.get_local_range(0)) local_acc[id] += local_acc[id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

// using consecutive thread to calculate the result instead of even threads.
template <>
class reduce_t<2> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    local_acc[local_id] = (global_id < in_size) ? in_acc[global_id] : 0;
    for (index_t i = item_id.get_local_range(0) / 2; i > 0; i >>= 1) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      if (local_id < i) local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

// using consecutive thread to calculate the result instead of even threads.
template <>
class reduce_t<3> {
 public:
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    using output_type = typename write_accessor_t::value_type;
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    output_type private_sum = output_type(0);
    // per thread reduction
    for (int i = global_id; i < in_size; i += item_id.get_global_range(0)) {
      private_sum += ((i < in_size) ? in_acc[i] : output_type(0));
    }
    local_acc[local_id] = private_sum;
    for (index_t i = item_id.get_local_range(0) / 2; i > 0; i >>= 1) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      if (local_id < i) local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

// with static value for local size to allow compiler to unroll the parallel for
// loop
template <>
class reduce_t<4> {
 public:
  template <int local_size, typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    using output_type = typename write_accessor_t::value_type;
    // in_size is equivalent of total number of thread
    index_t global_id = item_id.get_global(0);
    index_t local_id = item_id.get_local(0);
    output_type private_sum = output_type(0);
    // per thread reduction
    for (int i = global_id; i < in_size; i += item_id.get_global_range(0)) {
      private_sum += ((i < in_size) ? in_acc[i] : output_type(0));
    }
    local_acc[local_id] = private_sum;
    // reduction for loop
    for (index_t i = local_size / 2; i > 0; i >>= 1) {
      // wait for all thread to put the data in the local memory
      item_id.barrier(cl::sycl::access::fence_space::local_space);
      // replacing odd threads with contiguous threads
      if (local_id < i) local_acc[local_id] += local_acc[local_id + i];
    }
    if (item_id.get_local(0) == 0) {
      out_acc[item_id.get_group(0)] = local_acc[0];
    }
  }
};

template <int reduction_class, int local_size>
struct reduction_factory {
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    reduce_t<reduction_class>::template reduce<local_size>(
        in_size, in_acc, local_acc, out_acc, item_id);
  }
};

template <int reduction_class>
struct reduction_factory<reduction_class, -1> {
  template <typename index_t, typename read_accessor_t,
            typename local_accessor_t, typename write_accessor_t>
  static void inline reduce(const index_t &in_size,
                            const read_accessor_t &in_acc,
                            local_accessor_t &local_acc,
                            write_accessor_t &out_acc,
                            cl::sycl::nd_item<1> &item_id) {
    reduce_t<reduction_class>::reduce(in_size, in_acc, local_acc, out_acc,
                                      item_id);
  }
};

template <int reduction_class, int local_size, typename index_t,
          typename read_accessor_t, typename local_accessor_t,
          typename write_accessor_t>
class reduction_t {
 private:
  const read_accessor_t in_acc;
  local_accessor_t local_acc;
  write_accessor_t out_acc;
  const index_t in_size;

 public:
  reduction_t(const read_accessor_t in_acc_, local_accessor_t local_acc_,
              write_accessor_t out_acc_, const index_t in_size_)
      : in_acc(in_acc_),
        local_acc(local_acc_),
        out_acc(out_acc_),
        in_size(in_size_) {}
  // kernel code
  void inline operator()(cl::sycl::nd_item<1> item_id) {
    reduction_factory<reduction_class,
                      ((reduction_class > 3) ? local_size
                                             : -1)>::reduce(in_size, in_acc,
                                                            local_acc, out_acc,
                                                            item_id);
  }
};

//#define static_reduction_class 4;
template <typename index_t, typename data_t>
cl::sycl::buffer<data_t> inline get_out_buffer(
    const index_t num_group, cl::sycl::buffer<data_t> out_buffer) {
  return (num_group > 1)
             ? cl::sycl::buffer<data_t>(cl::sycl::range<1>{size_t(num_group)})
             : out_buffer;
}
// to make global size multiple of local size
template <typename index_t>
inline index_t round_up(const index_t x, const index_t y) {
  return ((x + y - 1) / y) * y;
}
// launching multiple kernel where the partial result is bigger than work group
// load
template <int work_group_load, int k_factor, int reduction_class,
          typename index_t, typename data_t>
void reduction(index_t in_size, cl::sycl::queue &q,
               cl::sycl::buffer<data_t> in_buff,
               cl::sycl::buffer<data_t> out_buffer) {
  using read_accessor_t =
      cl::sycl::accessor<data_t, 1, cl::sycl::access::mode::read,
                         cl::sycl::access::target::global_buffer>;
  using write_accessor_t =
      cl::sycl::accessor<data_t, 1, cl::sycl::access::mode::write,
                         cl::sycl::access::target::global_buffer>;

  using local_accessor_t =
      cl::sycl::accessor<data_t, 1, cl::sycl::access::mode::read_write,
                         cl::sycl::access::target::local>;

  const constexpr index_t local_size = work_group_load / k_factor;
  const index_t global_size = round_up(in_size / k_factor, local_size);
  index_t num_group = global_size / local_size;

  bool condition = (num_group > work_group_load) ? true : false;
  auto temp_buff = get_out_buffer(num_group, out_buffer);

  // submitting the SYCL kernel to the cvengine SYCL queue.
  q.submit([&](cl::sycl::handler &cgh) {
    // getting read access over the sycl buffer A inside the device kernel
    auto in_acc =
        in_buff.template get_access<cl::sycl::access::mode::read>(cgh);
    // getting write access over the sycl buffer C inside the device kernel

    auto out_acc =
        temp_buff.template get_access<cl::sycl::access::mode::write>(cgh);

    auto local_acc = local_accessor_t(local_size, cgh);

    // constructing the kernel
    cgh.parallel_for(
        cl::sycl::nd_range<1>{cl::sycl::range<1>{size_t(global_size)},
                              cl::sycl::range<1>{size_t(local_size)}},
        reduction_t<reduction_class, local_size, index_t, read_accessor_t,
                    local_accessor_t, write_accessor_t>(in_acc, local_acc,
                                                        out_acc, in_size));
  });
  if (condition) {
    // launching a new kernel and passing tem_buff as an input
    reduction<work_group_load, k_factor, reduction_class>(
        num_group, q, temp_buff, out_buffer);
  } else if (num_group > 1) {
    // The temp_buff size is smaller than the work_group_load
    auto host_out_acc =
        out_buffer.template get_access<cl::sycl::access::mode::write>();
    auto host_in_acc =
        temp_buff.template get_access<cl::sycl::access::mode::read>();
    // reduce the remaining on the host
    for (index_t i = 0; i < num_group; i++) {
      host_out_acc[0] += host_in_acc[i];
    }
  }
}

int main(int argc, char *argv[]) {
  using data_t = double;
  using index_t = int;
  index_t in_size;
  static constexpr index_t work_group_load = 256;
  static constexpr index_t reduction_class = 4;
  const constexpr index_t k_factor = (reduction_class > 2) ? 2 : 1;
  std::istringstream ss(argv[1]);
  if (!(ss >> in_size))
    std::cerr << "Invalid input size " << argv[1]
              << ". Please insert the correct input size " << '\n';
  // auto global_size = round_up(in_size / k_factor, local_size);
  // We initialised the A and B as an input vector
  std::vector<data_t> input(in_size, data_t(1));
  // The output vector does not need to be initialized
  std::vector<data_t> output(1);
  {  // beginning of SYCL objects' scope
    // constructing a SYCL queue for CVengine OpenCL device where automatically
    // build the underlying context and command_queue for the chosen device.
    auto q = cl::sycl::queue(
        (cl::sycl::default_selector()), [&](cl::sycl::exception_list l) {
          bool error = false;
          for (auto e : l) {
            try {
              std::rethrow_exception(e);
            } catch (const cl::sycl::exception &e) {
              auto clError = e.get_cl_code();
              std::cout << e.what() << "CLERRORCODE : " << clError << std::endl;
              error = true;
            }
          }
          if (error) {
            throw std::runtime_error("SYCL errors detected");
          }
        });
    // input SYCL buffer A
    auto in_buff =
        cl::sycl::buffer<data_t>(input.data(), cl::sycl::range<1>(in_size));
    // output SYCL buffer C
    auto out_buff = cl::sycl::buffer<data_t>(output.data(), 1);

    // call reduction function
    reduction<work_group_load, k_factor, reduction_class>(in_size, q, in_buff,
                                                          out_buff);
  }  // end of SYCL objects' scope
  auto reference = 0;
  for (int i = 0; i < in_size; i++) {
    reference += input[i];
  }
  if (output[0] != reference) {
    std::cout << "The result is wrong. expected : " << reference
              << " vs  calculated: " << output[0] << "\n";
    return 1;
  } else {
    std::cout << "The result is correct."
              << "\n";
  }
  return 0;
}

Build and Execute

  • Set the data_t, index_t, work_group_load, reduction_class, and k_factor for the device you want to run the code. They can be found in the following lines.
  using data_t = double;
  using index_t = int;
  index_t in_size;
  static constexpr index_t work_group_load = 256;
  static constexpr index_t reduction_class = 4;
  const constexpr index_t k_factor = (reduction_class > 2) ? 2 : 1;
  • Compile the code using:

/path/to/computecpp/bin/compute++ reduction.cpp -O2 -mllvm -inline-threshold=1000 -sycl-driver -sycl-target spir -no-serial-memop -I/path/to/computecpp/include -std=c++11 -o reduction -L/path/to/computecpp/lib -lComputeCpp -lOpenCL

  • Execute the code by calling:

./reduction {input_size}