
Developing a Sobel Algorithm

Developing a Sobel algorithm for the R-Car V3M2 using ComputeCpp and SYCL

Embedded systems vary in their architectures, and for the Renesas R-Car hardware there are some specific considerations to take into account when writing code for this platform. This article focuses on the specifics of writing code to perform Sobel edge detection on the V3M2 architecture, and the following points need to be taken into account when developing for this particular hardware. These vary between R-Car hardware designs, so they may be different for the V3H, for example.

  • The OpenCL image built-ins are not supported on the CVengine processor
  • For buffer allocation on V3M hardware the CVengine processor is limited to 112MB
  • The stack size of each work-item is limited to 1024 bytes
  • Floating point: there is no support for double precision

The code discussed in this article is based on the optimized tiled convolution kernel code from our R-Car optimization guide, which was tuned for the CVEngine processor and takes the limitations of the hardware into account. Convolutions are one of the most fundamental operations used in vision processing, so this code provides a good starting point for many common algorithms, including Sobel.

Since Yocto, the operating system used on the R-Car hardware, can be customized, we have implemented some utilities that make it possible to manipulate and display image files on the device using either OpenCV or CImg. The CImg library is particularly useful for embedded platforms because it is header-only, so no library source code needs to be built to produce the binaries. It enables image data to be written straight to the frame buffer of the device, so we can easily see the results of any algorithms. Yocto does not provide a display manager, so although OpenCV can be used when running the code on a desktop machine, it is not used on the device. One of the benefits of SYCL is that the same code can be compiled, executed and debugged on a desktop platform before it is deployed to the embedded device.
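
To give an idea of what the CImg side of these utilities looks like, below is a minimal, hypothetical sketch of loading an image and reducing it to grayscale using CImg on its own. The file name and conversion route are assumptions made for illustration; the utils::displaying wrapper used in this article hides these details and also handles writing to the frame buffer on the device.

#include "CImg.h"

int main() {
  using namespace cimg_library;
  // load an image from disk (format support depends on how CImg is configured)
  CImg<unsigned char> color("input.bmp");
  // reduce the image to a single grayscale channel, as the Sobel sample expects
  CImg<unsigned char> gray = color.get_RGBtoYCbCr().get_channel(0);
  // the raw pixel pointer is what would later be handed to a SYCL buffer
  unsigned char* pixels = gray.data();
  (void)pixels;
  // on a desktop platform the result can simply be shown in a window
  CImgDisplay window(gray, "grayscale input");
  while (!window.is_closed()) window.wait();
  return 0;
}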

The objective of writing this code is to experiment with the implementation of a Sobel edge detection algorithm and then to enhance it using a Gaussian blur in order to remove noise and enhance the final image. Throughout the article we will explain how performance is achieved. If you want to dive into the details of performance for CVengine on R-Car, read our optimization guide.
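
The Gaussian blur stage uses a convolution mask in the same way as the Sobel stage. As a point of reference only, a 3x3 Gaussian mask laid out in the same row-major form as the Sobel masks defined later in this article could look like the following; these are the standard 1-2-1 binomial weights and are not taken from the sample's source.

// hypothetical 3x3 Gaussian mask, normalized so that its weights sum to 1
constexpr float gaussian3x3[9] = {
    1.0f / 16.0f, 2.0f / 16.0f, 1.0f / 16.0f,
    2.0f / 16.0f, 4.0f / 16.0f, 2.0f / 16.0f,
    1.0f / 16.0f, 2.0f / 16.0f, 1.0f / 16.0f};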

Designing the Application

For the regular Sobel edge detection, the application is structured as follows: the "host code" runs on the CPU of the device and the "device code" runs on the CVengine processor, which handles the parallel execution of the various algorithms used to process the image data.

Application Architecture

Host code

The Host is run on the CPU and is used to configure the SYCL kernel execution. Let's examine how the host code is put together.

Setting up the data

This sample uses some hard-coded values in order to keep the code as simple as possible. In the code section below we set up the global constants used by the program: the dimensions of the images to be processed, the masks applied to the images, and some values used for the Sobel operation.

// initializing the image size dimensions
namespace dimensions {
constexpr size_t cols = 1024;
constexpr size_t rows = 1024;
constexpr size_t channels = 1;
}  // namespace dimensions

// initializing the image mask memories
namespace filters {
// sobel filter array (x and y axis)
constexpr float sobelX[9] = {-1.0f, 0.0f, 1.0f, -2.0f, 0.0f, 2.0f, -1.0f, 0.0f, 1.0f};
constexpr float sobelY[9] = {-1.0f, -2.0f, -1.0f, 0.0f, 0.0f, 0.0f, 1.0f, 2.0f, 1.0f};
}  // namespace filters

// initializing type conversion ratios
namespace converters {
constexpr float floatToByte = 255.0f / 1.0f;
constexpr float byteToFloat = 1.0f / 255.0f;
}  // namespace converters

// initializing custom intensity multipliers
namespace intensity {
constexpr float gaussian = 25.0f;
constexpr float sobel = 50.0f;
}  // namespace intensity

Loading the data

This code sets up the display object and loads the input image data.

// initialize object to handle image displaying capabilities: window manager / frame buffer
utils::displaying::image_disp<dimensions::cols, dimensions::rows,            
      dimensions::channels> disp_handler;

// initialize IO handler for the images: read, save, get data, etc.
auto image_io = disp_handler.get_image_io();

// Load the input image in grayscale
auto input = image_io->read_image(file_path, utils::displaying::channels::grayscale);

We construct a class object that handles everything related to the image, such as input and output operations, displaying the processed data on the screen, and getting the image data as a pointer. The image is then read as grayscale, using a single channel. Under the hood, the image_io object is a smart pointer managed by the display handler, and the input image is an image object (either CImg<T> or cv::Mat).

Below we create a pointer to store the output image data.

// initializing pointers to store the processed (in stages) image data.
// used for the final result that is converted back to cl::sycl::cl_uchar
std::shared_ptr<cl::sycl::cl_uchar> output_ptr(
      new cl::sycl::cl_uchar[dimensions::cols * dimensions::rows *
                            dimensions::channels],
      [](cl::sycl::cl_uchar* data_mem) { delete[] data_mem; });

// initializing an output image
auto output = image_io->create_image(output_ptr.get(), dimensions::cols,
      dimensions::rows, dimensions::channels);

The create_image method creates and returns an output image object whose data memory is initialized from output_ptr, so its lifespan is tied to the lifetime of the shared pointer (smart pointers in C++ follow RAII). Under the hood, when the image object is created it is also resized to a width of dimensions::cols and a height of dimensions::rows. This output image object is used to display the resulting data from the edge detection computation.

Device code

The Device contains the processing units that will execute the kernel, in this case the Renesas CVengine.

The device side code is more involved than the host code and led us to use the buffer::reinterpret function to manage buffers better. It is very similar to the standard C++ reinterpret_cast and you can learn more about it in this blog post, which includes a step-by-step guide on how to use it. In short, buffer::reinterpret is used to keep buffers alive for as long as possible and makes it possible to re-use them in kernels that accept data with a different type or range. This is a key factor in writing the best performing SYCL code for the CVengine. To demonstrate this we examine two versions of the code, showing the difference between using the reinterpret function and the naive approach of destroying buffers and creating new ones. These approaches have a very different effect on the performance of the code.
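
As a quick illustration of the mechanism, here is a minimal sketch (not taken from the sample) showing how buffer::reinterpret provides a view of the same device allocation with a different dimensionality, without any copy or re-allocation.

constexpr size_t rows = 1024, cols = 1024;
// a flat 1D float buffer with no host pointer (device-only storage)
cl::sycl::buffer<float, 1> flat{cl::sycl::range<1>(rows * cols)};
// reinterpret the same memory as a 2D buffer; nothing is allocated or copied
auto flat2d = flat.reinterpret<float, 2>(cl::sycl::range<2>(rows, cols));
// a kernel written against a 2D accessor can now consume the data that an
// earlier kernel produced through the 1D view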

Initialize the queue

// selecting device using SYCL as backend (and pass SYCL debug options)
auto queue = framework::device_helper::create_queue(option,
     cl::sycl::default_selector{});
{
  // initializing total (global) buffer size - 1D range
  const cl::sycl::range<1> num_items1d{dimensions::cols * dimensions::rows};
  // initializing total (global) buffer size - 2D range
  const cl::sycl::range<2> num_items2d{dimensions::cols, dimensions::rows};

  ... <<buffers allocation and kernels execution scopes here>> ...
}

SYCL is based on modern C++, which means the resources of the buffers you allocate are automatically released when they go out of scope. This is made possible by lightweight smart pointers that follow C++ RAII; this style of resource management is also known as Scope-based Resource Management (SBRM), where the scope of an object automatically handles the low-level resource management. Because the SYCL code is wrapped inside {} braces, the SYCL buffers and the SYCL queue are destroyed automatically when execution leaves the scope, and the output data becomes available in the host container.
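
The stand-alone sketch below, which uses a generic kernel rather than the sample's framework, shows this behaviour: once the scope that owns the queue and buffer closes, the buffer destructor copies the results back into the host container.

#include <CL/sycl.hpp>
#include <vector>

int main() {
  std::vector<float> host(1024, 1.0f);
  {
    cl::sycl::queue q{cl::sycl::default_selector{}};
    cl::sycl::buffer<float, 1> buf(host.data(),
                                   cl::sycl::range<1>(host.size()));
    q.submit([&](cl::sycl::handler& cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<class double_values>(
          cl::sycl::range<1>(host.size()),
          [=](cl::sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  }  // queue and buffer are destroyed here; data is written back to 'host'
  // every element of 'host' now holds 2.0f
  return 0;
}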

Non-optimized code

Setup device storage

In the non-optimized version of the code, we set up data buffers on the host, and these host buffers are used every time data needs to be read or written, resulting in performance overheads.

// used for storing the converted to float image data
std::shared_ptr<float> fin_ptr(
    new float[dimensions::cols * dimensions::rows * dimensions::channels],
    [](float* data_mem) { delete[] data_mem; });
// used for storing the data after applying sobel filter in the x-axis
std::shared_ptr<float> sobel_x_ptr(
    new float[dimensions::cols * dimensions::rows * dimensions::channels],
    [](float* data_mem) { delete[] data_mem; });
// used for storing the data after applying sobel filter in the y-axis
std::shared_ptr<float> sobel_y_ptr(
    new float[dimensions::cols * dimensions::rows * dimensions::channels],
    [](float* data_mem) { delete[] data_mem; });
// used for storing the normalized values from the sobel-x and sobel-y data
std::shared_ptr<float> sobel_norm_ptr(
    new float[dimensions::cols * dimensions::rows * dimensions::channels],
    [](float* data_mem) { delete[] data_mem; });
// used for the final result that is converted back to unsigned char
std::shared_ptr<unsigned char> output_ptr(
    new unsigned char[dimensions::cols * dimensions::rows *
                      dimensions::channels],
    [](unsigned char* data_mem) { delete[] data_mem; });

Execute kernels

Convert data to float : type conversion kernel

New buffers in_buff and out_buff are set up for storing the input and output data. This creates an overhead within the code because data needs to be copied from the host to the device.

{
  cl::sycl::buffer<unsigned char, 1> in_buff(image_io->get_image_ptr(input),
        num_items1d);
  cl::sycl::buffer<float, 1> out_buff(fin_ptr.get(), num_items1d);

  std::cout << "\nConverting to float ...";
  // begin uchar to float conversion
  framework::kernels::convert_type<unsigned char, float>(
            queue, option, in_buff, out_buff, num_items1d, converters::byteToFloat);
}

Apply edge detection on x- and y-axis : convolution kernel

Again more buffers are set up with data, creating further overhead due to the data copy from the host to the device.

{
  cl::sycl::buffer<float, 2> in_buff(fin_ptr.get(), num_items2d);

  // apply over x-axis / cols
  {
    cl::sycl::buffer<float, 2> fil_x_buff(filters::sobelX, num_fil3x3);
    cl::sycl::buffer<float, 2> out_x_buff(sobel_x_ptr.get(), num_items2d);

    std::cout << "\nApplying sobel filter (x-axis) ...";
    // compute sobel x-axis
    framework::kernels::convolution<float, dimensions::cols,
            dimensions::rows, 3, 3>(queue, option, in_buff, fil_x_buff, out_x_buff);
  }
  // apply over y-axis / rows
  {
    cl::sycl::buffer<float, 2> fil_y_buff(filters::sobelY, num_fil3x3);
    cl::sycl::buffer<float, 2> out_y_buff(sobel_y_ptr.get(), num_items2d);

    std::cout << "\nApplying sobel filter (y-axis) ...";
    // compute sobel y-axis
    framework::kernels::convolution<float, dimensions::cols,
            dimensions::rows, 3, 3>(queue, option, in_buff, fil_y_buff, out_y_buff);
  }
}

Normalize the data : magnitude calculation kernel

Three new buffers are set up to hold the data for processing.

{
  cl::sycl::buffer<float, 1> in_x_buff(sobel_x_ptr.get(), num_items1d);
  cl::sycl::buffer<float, 1> in_y_buff(sobel_y_ptr.get(), num_items1d);
  cl::sycl::buffer<float, 1> out_buff(sobel_norm_ptr.get(), num_items1d);

  std::cout << "\nCalculating normalization ...";
  // apply normalization using the magnitude operator
  framework::kernels::calc_magnitude<float>(
        queue, option, in_x_buff, in_y_buff, out_buff,
            num_items1d, intensity::sobel);
}

Convert data to uchar

More buffers are created.

{
  cl::sycl::buffer<float, 1> in_buff(sobel_norm_ptr.get(), num_items1d);
  cl::sycl::buffer<unsigned char, 1> out_buff(
        image_io->get_image_ptr(output), num_items1d);

  std::cout << "\nConverting to uchar ...";
  // begin float to uchar conversion
  framework::kernels::convert_type<float, unsigned char>(
            queue, option, in_buff, out_buff,
                    num_items1d, converters::floatToByte);
}

Optimized version

Setup device storage

In the optimized version of the code, the data buffers are set up in device-only memory and used for the reading and writing of data. Performance is better because data is only read from and written to the host at the beginning and end of the code.

// read from: uchar
const cl::sycl::buffer<cl::sycl::cl_uchar, 1> uin_buff(
      image_io->get_image_ptr(input), num_items1d);
// write to: cl::sycl::cl_float -> uchar
cl::sycl::buffer<cl::sycl::cl_uchar, 1> uout_buff(
      image_io->get_image_ptr(output), num_items1d);

// used as temp/local memory
cl::sycl::buffer<cl::sycl::cl_float, 1> fin_buff(
      num_items1d, {cl::sycl::property::buffer::context_bound(queue.get_context())});
cl::sycl::buffer<cl::sycl::cl_float, 1> fout_buff(num_items1d);
cl::sycl::buffer<cl::sycl::cl_float, 2> sobel_x_buff(num_items2d);
cl::sycl::buffer<cl::sycl::cl_float, 2> sobel_y_buff(num_items2d);
const cl::sycl::buffer<cl::sycl::cl_float, 2> sobelX_fil_buff(
      filters::sobelX, cl::sycl::range<2>(3, 3));
const cl::sycl::buffer<cl::sycl::cl_float, 2> sobelY_fil_buff(
      filters::sobelY, cl::sycl::range<2>(3, 3));

Execute kernels

Convert data to float : type conversion kernel

Existing buffers are passed to the kernel so no data is copied or requires any management.

{
  // begin uchar to cl::sycl::cl_float conversion
  framework::kernels::cve_convert_type<cl::sycl::cl_uchar,
        cl::sycl::cl_float>(queue, option, uin_buff, fin_buff, num_items1d, 
            converters::byteToFloat);
}

Apply edge detection on x- and y-axis : convolution kernel

Buffers are re-used by taking advantage of the "reinterpret" method, which removes the overhead seen in the non-optimized version.

const auto fin_buff_re =
      fin_buff.reinterpret<cl::sycl::cl_float, 2>(num_items2d);
{
  // compute sobel x-axis
  {
    // apply convolution
    framework::kernels::cve_convolution<
          cl::sycl::cl_float, dimensions::cols, dimensions::rows, 3, 3>(
                queue, option, fin_buff_re, sobelX_fil_buff, sobel_x_buff);
  }
  // compute sobel y-axis
  {
    // apply convolution
    framework::kernels::cve_convolution<
          cl::sycl::cl_float, dimensions::cols, dimensions::rows, 3, 3>(
                queue, option, fin_buff_re, sobelY_fil_buff, sobel_y_buff);
  }
}

Normalize the data : magnitude calculation kernel

The existing buffers are accessed using the "reinterpret" method.

const auto sobel_x_buff_re =
      sobel_x_buff.reinterpret<cl::sycl::cl_float, 1>(num_items1d);
const auto sobel_y_buff_re =
      sobel_y_buff.reinterpret<cl::sycl::cl_float, 1>(num_items1d);
{
  // apply normalization using the magnitude operator
  framework::kernels::cve_calc_magnitude<cl::sycl::cl_float>(
          queue, option, sobel_x_buff_re, sobel_y_buff_re, fout_buff,
                num_items1d, intensity::sobel);
}

Convert data to uchar

The existing buffers can be used.

{
  // begin cl::sycl::cl_float to uchar conversion
  framework::kernels::cve_convert_type<cl::sycl::cl_float,
        cl::sycl::cl_uchar>(queue, option, fout_buff, uout_buff, num_items1d, converters::floatToByte);
}

The next section will provide a more thorough explanation of the actual kernels being used, but first let's focus on the difference between the two versions of the device side code.

In this code each consecutive kernel in the execution chain accepts buffer data with a different type or range from the previous one, which is why we use buffer::reinterpret in the optimized version. It is helpful not just because it is convenient and saves extra typing on both the host side and device side, but because in this situation it is the only way to keep our buffers alive instead of destroying them and creating new ones that read new memory from the host, which increases both the memory use and the run time of the application. Thanks to the buffer::reinterpret function, it is easy to avoid that cost.

In general, when developing in any heterogeneous environment, memory operations between the host and device are expensive and can affect performance significantly. Always keep this in mind when developing your software.
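
One practical way of keeping those operations to a minimum, sketched below under the same SYCL 1.2.1 assumptions as the rest of this article, is to construct intermediate buffers from a range only, with no host pointer, so that they never copy data to or from the host.

// intermediate storage that lives entirely in device memory: it is never
// initialized from, or copied back to, a host allocation
cl::sycl::buffer<float, 1> scratch{cl::sycl::range<1>(1024 * 1024)};
// kernels chained on 'scratch' exchange their data through device memory;
// only the final output buffer needs to be constructed with a host pointer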

A closer look inside the kernels

Before diving into the source code, for context here's a little information on how the memory model of OpenCL/SYCL is mapped to the CVEngine processor.

  • OpenCL/SYCL private memory is mapped to the CVEngine local working memory.
  • OpenCL/SYCL local memory is mapped to the CVEngine scratchpad memory.
  • OpenCL/SYCL constant memory is mapped to the CVEngine DRAM.
  • OpenCL/SYCL global memory is mapped to the CVEngine DRAM.

Each CVEngine thread has access to a 32 byte zero-level cache, which is private to each OpenCL work-item. When an access to a location in global memory is requested, the 32 bytes of data containing that location and its surrounding area are loaded into the zero-level cache. The cache line size for the first-level cache in the R-Car V3M CVengine is 128 bytes. In general, it is recommended to dedicate an entire cache line of data to one thread in order to achieve the maximum cache hit ratio; however, this is not a hard rule for all cases. Several factors can affect the performance, including the data cache size, the work-group size and the data load per thread, amongst other things.

CVEngine threads are executed independently from each other, rather than in the lockstep execution model typical of GPUs. This can have a significant performance impact, so in order to achieve the best performance it is recommended to reserve an entire cache line for each thread.
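
The sketch below, which is illustrative rather than the sample's kernel, shows the shape of that access pattern: each work-item owns a contiguous block of cache_line elements, so a thread reads whole cache lines instead of interleaving with its neighbors.

#include <CL/sycl.hpp>

void copy_per_cache_line(cl::sycl::queue& q, cl::sycl::buffer<float, 1>& in,
                         cl::sycl::buffer<float, 1>& out, size_t total) {
  constexpr int cache_line = 32;  // consecutive items handled by one work-item
  q.submit([&](cl::sycl::handler& cgh) {
    auto in_acc = in.get_access<cl::sycl::access::mode::read>(cgh);
    auto out_acc = out.get_access<cl::sycl::access::mode::discard_write>(cgh);
    cgh.parallel_for<class cache_line_copy>(
        cl::sycl::range<1>(total / cache_line), [=](cl::sycl::id<1> id) {
          const size_t base = id[0] * cache_line;
          // one work-item touches one whole block of consecutive elements
          for (int j = 0; j < cache_line; ++j) {
            out_acc[base + j] = in_acc[base + j];
          }
        });
  });
}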

Optimized OpenCL memory size constraints

template <typename index_t>
struct opencl_configuration_t {
 // represents the number of items per row
 // best case 32
 static constexpr int cache_line = 32;
 // best case : 1024 cols
 static constexpr int col_per_thread = 1024;
 // best case : (rows / global_work_size) 32
 static constexpr int row_per_thread = 32;
 // reducing factor for total work-items per work-group
 // best case 2
 static constexpr int work_group_reduction_factor = 2;
 static constexpr int row_per_work_item = 2;
};

Note: it is advisable to instantiate index_t as a signed int rather than an unsigned type in order to get the best performance.
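
For example, following that advice the configuration (and any kernel parameterized on index_t) would be instantiated with int:

// instantiate the configuration with a signed int, as recommended above
using kernel_config = opencl_configuration_t<int>;
static_assert(kernel_config::cache_line == 32, "expected 32 items per cache line");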

Convolution kernel

The source code below shows the implementation of a tile-based convolution kernel using SYCL for the CVEngine processor.

The kernel is used to apply a mask filter on different areas of an image. It can be used to achieve effects such as blurring, sharpening or, in our case, edge detection. On the host API, an image of size (M, N) can be divided into (M/m, N/n) tiles when applying the convolution to large images.
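
As a rough illustration of that tiling step (the tile sizes here are hypothetical, since the host-side launch code is not shown in this article), dividing a 1024 x 1024 image into 32 x 32 pixel tiles gives 32 x 32 = 1024 tiles:

constexpr size_t M = 1024, N = 1024;  // image size
constexpr size_t m = 32, n = 32;      // tile size (illustrative values)
// round up so that partial tiles at the image edges are still covered
constexpr size_t tiles_m = (M + m - 1) / m;  // 32 tiles along the rows
constexpr size_t tiles_n = (N + n - 1) / n;  // 32 tiles along the columns
static_assert(tiles_m * tiles_n == 1024, "1024 tiles of 32 x 32 pixels");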

For tile-based algorithms, coalesced memory access reduces performance because neighboring operations are involved, so we aim to avoid it.

Performance gain methods and techniques used

For-loop unrolling via #pragma unroll is used whenever possible, because loop iterations can cause a cycle stall when the branch condition is satisfied. It has to be used carefully, however, as too much unrolling can exhaust the resources available per work-item and significantly reduce performance.

mem_fence can be used to control the ordering of loads and stores for all the work-items in a work group executing the kernel.

Private memory is used to hold the partial input/output data for each work-item. By loading the input data into private memory first and then calculating the result, data locality is retained and performance is improved.

// tile based convolution kernel for best performance on CVEngine
template <typename data_t, typename index_t>
class CVEConvolutionKernel {
private:
 const read_accessor_t<data_t, 2> _fil_ptr;
 const read_accessor_t<data_t, 2> _in_ptr;
 write_accessor_t<data_t, 2> _out_ptr;
 const matrix_size_t<index_t> _total_size;
 const matrix_size_t<index_t> _mat_size;
 const matrix_size_t<index_t> _data_offset;
 const matrix_size_t<index_t> _num_group;

public:
 // constructing the functor
 CVEConvolutionKernel(read_accessor_t<data_t, 2> fil_ptr,
                      read_accessor_t<data_t, 2> in_ptr,
                      write_accessor_t<data_t, 2> out_ptr,
                      matrix_size_t<index_t> total_size,
                      matrix_size_t<index_t> mat_size,
                      matrix_size_t<index_t> data_offset,
                      matrix_size_t<index_t> num_group)
     : _fil_ptr(fil_ptr),
       _in_ptr(in_ptr),
       _out_ptr(out_ptr),
       _total_size(total_size),
       _mat_size(mat_size),
       _data_offset(data_offset),
       _num_group(num_group) {}
 void inline operator()(cl::sycl::nd_item<1> item_id) {
   using opencl_config = opencl_configuration_t<index_t>;

   const index_t group_m = item_id.get_group(0) / _num_group.n;
   const index_t group_n = item_id.get_group(0) - group_m * _num_group.n;
   const index_t work_item =
       (group_m * (item_id.get_local_range()[0])) + item_id.get_local_id(0);
   const index_t total_threads_m =
       (item_id.get_local_range()[0]) * _num_group.m;
   constexpr auto row_per_work_item =
       opencl_config::row_per_work_item;
   constexpr auto col_per_work_item = opencl_config::cache_line - 2;
   constexpr auto fil_size_m = 3;
   constexpr auto fil_size_n = 3;
   // the LWM tile of 2 * 6 for output that reads 2*8 inout
   data_t private_result[row_per_work_item][col_per_work_item];
   data_t private_in[row_per_work_item + fil_size_m - 1]
                    [opencl_config::cache_line];
    // this is used to keep the filter in LWM to prevent the input zero-level
    // cache from being flushed before it has been used by all threads
   data_t filter[fil_size_m][fil_size_n];
   // set private result to zero
#pragma unroll
   for (index_t p_m = 0; p_m < row_per_work_item; p_m++) {
#pragma unroll 6
     for (index_t p_n = 0; p_n < col_per_work_item; p_n++) {
       private_result[p_m][p_n] = data_t(0);
     }
   }
    // set filter to LWM to prevent the zero-level cache from being modified
#pragma nounroll
   for (index_t p_m = 0; p_m < fil_size_m; p_m++) {
#pragma nounroll
     for (index_t p_n = 0; p_n < fil_size_n; p_n++) {
       filter[p_m][p_n] =
           _fil_ptr[p_m][p_n] / static_cast<data_t>(fil_size_m * fil_size_n);
     }
   }
   item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
    // the outer for-loop, which dedicates 2 consecutive output rows per
    // thread. We dedicate two consecutive rows because only 4 input rows
    // need to be read to calculate the 2 output rows.
#pragma nounroll
   for (index_t index_m = (work_item * row_per_work_item);
        index_m < _mat_size.m;
        index_m += total_threads_m * row_per_work_item) {
     const index_t row = index_m + _data_offset.m;
     const index_t index_n_offset = group_n * opencl_config::col_per_thread;
#pragma nounroll
     for (index_t index_n = index_n_offset;
          index_n < index_n_offset + opencl_config::col_per_thread;
          index_n += col_per_work_item) {
       index_t m, f_m;
       const index_t base_col = _data_offset.n + index_n;
#pragma nounroll
       for (f_m = 0, m = -(fil_size_m >> 1);
            f_m < row_per_work_item + fil_size_m - 1; m++, f_m++) {
         index_t in_id_m = (row + m >= 0) ? row + m : 0;
         in_id_m = (in_id_m < _total_size.m) ? in_id_m : _total_size.m - 1;
         index_t p_n, g_n;
#pragma unroll
         for (p_n = 0, g_n = -(fil_size_n >> 1);
              p_n < opencl_config::cache_line; p_n++, g_n++) {
           index_t in_id_n = (base_col + g_n >= 0) ? base_col + g_n : 0;
           in_id_n = (in_id_n < _total_size.n) ? in_id_n : _total_size.n - 1;
           private_in[f_m][p_n] = _in_ptr[in_id_m][in_id_n];
         }
       }
       item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
       // loop over the image column
#pragma unroll 6
       for (index_t private_n = 0; private_n < col_per_work_item;
            private_n++) {
         // loop over the filter column
#pragma unroll
         for (index_t f_n = 0; f_n < fil_size_n; f_n++) {
#pragma unroll
           for (index_t in_id_m = 0;
                in_id_m < fil_size_m + row_per_work_item - 1; in_id_m++) {
             // compute both rows output for the read input element. This if
             // statement is flattened at compile time as the for loop is
             // static
             auto input = private_in[in_id_m][private_n + f_n];
             if (row_per_work_item == 2) {
               if (in_id_m == 0) {
                 private_result[0][private_n] +=
                     (input * filter[in_id_m][f_n]);
               } else if (in_id_m != 0 &&
                          in_id_m != fil_size_m + row_per_work_item - 2) {
                 private_result[0][private_n] +=
                     (input * filter[in_id_m][f_n]);
                 private_result[1][private_n] +=
                     (input * filter[in_id_m - 1][f_n]);
               } else if (in_id_m == fil_size_m + row_per_work_item - 2) {
                 private_result[1][private_n] +=
                     (input * filter[in_id_m - 1][f_n]);
               }
             } else if (row_per_work_item == 1) {
               private_result[0][private_n] += (input * filter[in_id_m][f_n]);
             }
           }
         }
       }
       // flush the partial tile of LWM to the global output memory
#pragma unroll
       for (index_t p_m = 0; p_m < row_per_work_item; p_m++) {
#pragma unroll 6
         for (index_t p_n = 0; p_n < col_per_work_item; p_n++) {
           _out_ptr[row + p_m][base_col + p_n] = private_result[p_m][p_n];
           private_result[p_m][p_n] = 0;
         }
       }
     }
   }
 }
};

Type conversion kernel

Type conversion is a memory-bound kernel without any neighboring operations involved. The number of loads and stores per data item increases because each work-item works on a separate area. It is worth pointing out that CVEngine uses a multi-port implementation of the first-level cache, which mitigates the performance decrease when coalesced memory access is used in the way it is in the following kernel, which mostly performs memory load and store operations rather than heavy computation.

The same performance techniques can be applied to this code as well; in fact, a naive implementation of this kernel executed more slowly than the following version, which uses loop unrolling, memory fences, and private memory to load the data before working on it.

The code below shows the implementation of a type conversion kernel for SYCL on CVEngine. The kernel performs a type conversion (type cast) from data1_t to data2_t, using a conversion ratio so that values are converted correctly to or from floating point.
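
Per element, the work amounts to a multiply followed by a cast. Using the ratios from the converters namespace defined earlier, a single value is handled like this (shown here as plain scalar code for clarity):

// what the kernel does for one element: scale by the ratio, then cast
unsigned char byte_in = 200;
float as_float =
    static_cast<float>(byte_in) * converters::byteToFloat;  // maps 0-255 onto 0.0-1.0
unsigned char byte_out =
    static_cast<unsigned char>(as_float * converters::floatToByte);  // back to the 0-255 range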

// type conversion kernel (data1_t -> data2_t) that achieves best
// performance on CVEngine
template <typename data1_t, typename data2_t, typename index_t>
class CVETypeConvertKernel final {
private:
 const read_accessor_t<data1_t, 1> _in_ptr;  // input global memory accessor
 discard_write_accessor_t<data2_t, 1>
   _out_ptr;  // output global memory accessor

 const float _ratio;

public:
 CVETypeConvertKernel(read_accessor_t<data1_t, 1> in_ptr,
                      discard_write_accessor_t<data2_t, 1> out_ptr,
                      float ratio)
     : _in_ptr(in_ptr), _out_ptr(out_ptr), _ratio(ratio) {}
 void operator()(cl::sycl::nd_item<1> item_id) {
    using opencl_config = opencl_configuration_t<index_t>;

    // consecutive elements per work-item
    constexpr index_t elem_per_work_item = opencl_config::cache_line;
    const index_t id = item_id.get_global_id(0) * elem_per_work_item;
    // num. of elements of the buffer that '_out_ptr' accessor is accessing
    const index_t size = _out_ptr.get_count();
    data1_t private_in[elem_per_work_item];
    data2_t private_out[elem_per_work_item];

#pragma nounroll
    for (index_t i = id; i < size;
         i += (item_id.get_global_range()[0] * elem_per_work_item)) {
#pragma unroll
      for (index_t k = 0; k < elem_per_work_item;
           k += opencl_config::cache_line) {
#pragma unroll
        for (index_t j = 0; j < opencl_config::cache_line; j++) {
          private_in[j] = _in_ptr[k + i + j];
        }
        item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
#pragma unroll
        for (index_t j = 0; j < opencl_config::cache_line; j++) {
          private_out[j] = static_cast<data2_t>(private_in[j] * _ratio);
        }
        item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
#pragma unroll
        for (index_t j = 0; j < opencl_config::cache_line; j++) {
          _out_ptr[k + i + j] = private_out[j];
        }
        item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
      }
    }
 }
};

Magnitude calculation kernel

Similar to the type conversion kernel above, the magnitude calculation kernel is memory bound and has the same or similar limitations. Since both kernels load memory in a similar way, apply a simple computation and then store the result in an output buffer, they can both benefit from the same performance optimization techniques.

The source below shows the implementation of a magnitude calculation kernel for SYCL, tailored for the CVEngine processor. The kernel normalizes a vector and applies a custom, user-specified intensity value when a brighter or darker final image is desired.

// magnitude calculation kernel that achieves best performance on CVEngine
template <typename data_t, typename index_t>
class CVEMagnitudeKernel final {
private:
 const read_accessor_t<data_t, 1>
     _in_x_ptr;  // input (cols/x-axis) global memory accessor
 const read_accessor_t<data_t, 1>
     _in_y_ptr;  // input (rows/y-axis) global memory accessor
 discard_write_accessor_t<data_t, 1>
     _out_ptr;  // output global memory accessor

 const float _intensity;

private:
 const data_t normalize(data_t l_pixel, data_t r_pixel) {
   return cl::sycl::clamp(cl::sycl::sqrt(_intensity * l_pixel * l_pixel +
                                         _intensity * r_pixel * r_pixel),
                          0.0f, 1.0f);
 }

public:
 CVEMagnitudeKernel(read_accessor_t<data_t, 1> in_x_ptr,
                    read_accessor_t<data_t, 1> in_y_ptr,
                    discard_write_accessor_t<data_t, 1> out_ptr,
                    float intensity)
     : _in_x_ptr(in_x_ptr),
       _in_y_ptr(in_y_ptr),
       _out_ptr(out_ptr),
       _intensity(intensity) {}

 void operator()(cl::sycl::nd_item<1> item_id) {
   using opencl_config = opencl_configuration_t<index_t>;

    // consecutive elements per work-item
   constexpr index_t elem_per_work_item = opencl_config::cache_line;
   const index_t id = item_id.get_global_id(0) * elem_per_work_item;
   // num. of elements of the buffer that '_out_ptr' accessor is accessing
   const index_t size = _out_ptr.get_count();
   data_t private_in_x[elem_per_work_item];
   data_t private_in_y[elem_per_work_item];
   data_t private_out[elem_per_work_item];

#pragma nounroll
   for (index_t i = id; i < size;
     i += (item_id.get_global_range()[0] * elem_per_work_item)) {
#pragma unroll
     for (index_t k = 0; k < elem_per_work_item;
          k += opencl_config::cache_line) {
#pragma unroll
       for (index_t j = 0; j < opencl_config::cache_line; j++) {
         private_in_x[j] = _in_x_ptr[k + i + j];
       }
       item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
#pragma unroll
        for (index_t j = 0; j < opencl_config::cache_line; j++) {
          private_in_y[j] = _in_y_ptr[k + i + j];
        }
        item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
#pragma unroll
        for (index_t j = 0; j < opencl_config::cache_line; j++) {
          private_out[j] = normalize(private_in_x[j], private_in_y[j]);
        }
        item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
#pragma unroll
        for (index_t j = 0; j < opencl_config::cache_line; j++) {
          _out_ptr[k + i + j] = private_out[j];
        }
        item_id.mem_fence(cl::sycl::access::fence_space::global_and_local);
     }
   }
 }
};

We've shown how to write a simple Sobel algorithm with some additional methods of enhancing the final output image. By doing this we have been able to examine how to write code and kernels that will perform well on the R-Car CVengine processor using ComputeCpp and SYCL.

The Results

To summarise some of the advice contained in this guide, here are the recommendations we have discovered that should be followed in order to write the most optimal code for the CVengine processor.

  • Avoid memory movement and loading
  • Use signed integers over unsigned integers
  • Float division or multiplication should be done with another float value; implicit conversion from other types such as int should be avoided because it affects performance.
  • For example, for a float divided by an int
    • float result = array[3] / 10;
    • should be avoided and rewritten as:
    • constexpr auto some_float_value = 10.0f;
    • float result = array[3] / some_float_value;
  • Use private memory where possible
  • Avoid using coalesced memory access
  • Use unrolling of for loops

If you want to learn more about optimizing performance for R-Car hardware take a look at our guides.
