
Examples

This chapter presents a step-by-step OpenCL example for adding two vectors together. The purpose of this example is to cover different aspects of the OpenCL programming model. Our main focus in this example is to explain how to write an OpenCL program and evenly distribute the workload among the different threads of the CVengine.

Preparing platform, devices and context

In order to run a kernel in OpenCL, we need to first set up the device we want to run the code on.

There are two editions of the OpenCL platform for both R-Car V3M and R-Car V3H: the Community Edition and the Professional Edition. The table below describes the differences between the two editions at the time of writing.

Edition               R-Car ComputeSuite Community Edition    R-Car ComputeSuite Professional Edition
Devices available     CVengine                                ARM Host CPU and CVengine
Additional features   N/A                                     Offline compilation, others TBA
Support level         Community based                         Professional support included (TBA)
Tooling               N/A                                     Integrated with ComputeSuite tools (TBA)

Contact Codeplay (sales@codeplay.com) for details of the Professional Edition.

Note that, at the time of writing this document, the CVengine device is not OpenCL 1.2 conformant.

A user can query the list of available devices by calling clGetDeviceIDs on an OpenCL platform, and must create an OpenCL context for the devices of that platform. The clGetDeviceIDs function is also used to choose a specific device: the cl_device_id for the Host CPU can be retrieved by passing CL_DEVICE_TYPE_CPU, and the CVengine device by passing CL_DEVICE_TYPE_ACCELERATOR. To retrieve both devices, a user must pass CL_DEVICE_TYPE_ALL. The following code snippet from the OpenCL example code for vector addition shows the retrieval of the CVengine device and the creation of the context.

// retrieving the CVengine device
  cl_device_id cvengine_device;  // the CVengine device
  error_report = clGetDeviceIDs(rcar_platform, CL_DEVICE_TYPE_ACCELERATOR, 1,
                                &cvengine_device, NULL);
  // creating the context for RCar OpenCL devices
  cl_context rcar_context =
      clCreateContext(0, 1, &cvengine_device, NULL, NULL, &error_report);
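
The rcar_platform handle used above is assumed to have been obtained earlier. A minimal sketch using the standard clGetPlatformIDs call, assuming the R-Car platform is the first (or only) platform installed on the system, could look like this:

  // retrieving the first available OpenCL platform; on systems with several
  // platforms, clGetPlatformInfo can be used to identify the R-Car platform
  cl_platform_id rcar_platform;
  cl_int error_report = clGetPlatformIDs(1, &rcar_platform, NULL);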

Command queue object

In the following code, ComputeAorta Community Edition for R-Car is used and the selected device for executing the example is the CVengine. The next step is to create a command queue for the CVengine device. The following code snippet from the OpenCL example code for vector addition illustrates how to construct a command queue for the CVengine.

  // creating a command queue for CVengine accelerator.
  cl_command_queue cvengine_command_queue = clCreateCommandQueue(
      rcar_context, cvengine_device, CL_QUEUE_PROFILING_ENABLE, &error_report);  

OpenCL C kernel

OpenCL kernels are written in the OpenCL C language, and they are passed as strings to the OpenCL host API.
The following code, extracted from the OpenCL example code, shows an OpenCL C kernel that sums the two input buffers A and B and puts the result in the output buffer C.

 const char *vector_add =
      "__kernel void vector_add_kernel(                                      \n"
      "   __global const " OPENCL_KERNEL_DATA_TYPE
      " * restrict A,                                                        \n"
      "   __global const " OPENCL_KERNEL_DATA_TYPE
      " * restrict B,                                                        \n"
      "   __global " OPENCL_KERNEL_DATA_TYPE
      " * restrict C,                                                        \n"
      "   " OPENCL_KERNEL_INDEX_TYPE
      " array_size   )                                                       \n"
      "{                                                                     \n"
      "    " OPENCL_KERNEL_DATA_TYPE " private_A[" OPENCL_KERNEL_CHACHE_LINE
      "];                                                                    \n"
      "    " OPENCL_KERNEL_DATA_TYPE " private_B[" OPENCL_KERNEL_CHACHE_LINE
      "];                                                                    \n"
      "    " OPENCL_KERNEL_DATA_TYPE " private_res[" OPENCL_KERNEL_CHACHE_LINE
      "];                                                                    \n"
      "   " OPENCL_KERNEL_INDEX_TYPE
      "   id = get_global_id(0) * " OPENCL_KERNEL_CHACHE_LINE
      ";                                                                     \n"
      "   for (" OPENCL_KERNEL_INDEX_TYPE
      "         i = id; i < array_size; i +=" OPENCL_KERNEL_CHACHE_LINE
      "   * get_global_size(0) ) {                                           \n"
      "   #pragma unroll                                                     \n"
      "     for (" OPENCL_KERNEL_INDEX_TYPE
      "          j = 0; j <" OPENCL_KERNEL_CHACHE_LINE
      "; j++) {                                                              \n"
      "          private_A[j] = A[i + j];                                    \n"
      "        }                                                             \n"
      "   #pragma unroll                                                     \n"
      "     for (" OPENCL_KERNEL_INDEX_TYPE
      "          j = 0; j <" OPENCL_KERNEL_CHACHE_LINE
      "; j++) {                                                              \n"
      "          private_B[j] = B[i + j];                                    \n"
      "        }                                                             \n"
      "   #pragma unroll                                                     \n"
      "     for (" OPENCL_KERNEL_INDEX_TYPE
      "          j = 0; j <" OPENCL_KERNEL_CHACHE_LINE
      "; j++) {                                                              \n"
      "       private_res[j] = private_A[j] + private_B[j];                  \n"
      "     }                                                                \n"
      "   #pragma unroll                                                     \n"
      "     for (" OPENCL_KERNEL_INDEX_TYPE
      "          j = 0; j < " OPENCL_KERNEL_CHACHE_LINE
      "      ; j++) {                                                        \n"
      "       C[i + j] = private_res[j];                                     \n"
      "        }                                                             \n"
      "  }                                                                   \n"
      "}                                                                     \n"
      "                                                                     \n";
  IndexType kernel_size = strlen(vector_add);     

Developers can also load OpenCL C kernel source by reading it from a text file using standard I/O operations.
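
A minimal sketch of this approach, assuming the kernel source lives in a file called vector_add.cl (a hypothetical filename) and that <stdio.h> and <stdlib.h> are included; error handling is omitted for brevity:

  // reading the OpenCL C kernel source from a text file into a string
  FILE *source_file = fopen("vector_add.cl", "rb");
  fseek(source_file, 0, SEEK_END);
  long source_length = ftell(source_file);
  rewind(source_file);
  char *kernel_source = (char *)malloc(source_length + 1);
  fread(kernel_source, 1, source_length, source_file);
  kernel_source[source_length] = '\0';  // NUL-terminate the source string
  fclose(source_file);
  // kernel_source can now be passed to clCreateProgramWithSource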

Program object

In order to use the OpenCL kernels encapsulated in a string variable, a user needs to create a program object inside a pre-existing context. Once the program object has been created, the program can then be built for all the devices within the context. The following code snippet from the OpenCL example code for vector addition creates and builds an OpenCL program object.

  // constructing the OpenCL program for the vector add function
  cl_program vec_add_program = clCreateProgramWithSource(
      rcar_context, 1, (const char **)&vector_add, NULL, &error_report);
  // building the OpenCL program for the CVengine device
  error_report =
      clBuildProgram(vec_add_program, 1, &cvengine_device, NULL, NULL, NULL);
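
If clBuildProgram returns an error, the compiler output for the device can be retrieved with clGetProgramBuildInfo. The following is a minimal sketch of such a query; it is not part of the original example and omits error checking for brevity.

  // querying the build log for the CVengine device when the build fails
  if (error_report != CL_SUCCESS) {
    size_t log_size = 0;
    clGetProgramBuildInfo(vec_add_program, cvengine_device,
                          CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *build_log = (char *)malloc(log_size);
    clGetProgramBuildInfo(vec_add_program, cvengine_device,
                          CL_PROGRAM_BUILD_LOG, log_size, build_log, NULL);
    printf("%s\n", build_log);
    free(build_log);
  }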

Buffer object

In order to achieve the best performance for CVengine, a user should use CL_MEM_ALLOC_HOST_PTR and map/unmap the memory to/from the host. Using this option, it is possible to access the buffer on both host and device by using the zero-copy technique. The A and B buffers are CL_MEM_READ_ONLY, and C is CL_MEM_WRITE_ONLY. The following code snippet from the OpenCL example code for vector addition creates the OpenCL buffer objects.

  cl_mem A =
      clCreateBuffer(rcar_context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                     array_size * sizeof(DataType), NULL, &error_report);

  cl_mem B =
      clCreateBuffer(rcar_context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                     array_size * sizeof(DataType), NULL, &error_report);
  cl_mem C =
      clCreateBuffer(rcar_context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                     array_size * sizeof(DataType), NULL, &error_report); 

Kernel object

The built program object contains all the required kernels for the OpenCL application. In order to choose a particular kernel for execution, a user needs to create an OpenCL kernel object. Note that an OpenCL program object may contain more than one kernel; OpenCL kernel objects are therefore handles that give access to the specific kernel instances a user wants to run on a particular device. Constructing the kernel object is shown in the following code snippet from the OpenCL example code for vector addition.

  cl_kernel kernel_vector_add =
      clCreateKernel(vec_add_program, "vector_add_kernel", &error_report);

Setting up kernel arguments

The next step is to map OpenCL buffers and other variables to the kernel parameters using clSetKernelArg. The parameters are the input/output arguments consumed by the OpenCL C kernel. Here, we determine the way that a buffer/variable maps to an input parameter of an OpenCL C kernel. Setting up the kernel arguments is demonstrated in the following code snippet from the OpenCL example code for vector addition.

  error_report = clSetKernelArg(kernel_vector_add, 0, sizeof(cl_mem), &A);

  error_report = clSetKernelArg(kernel_vector_add, 1, sizeof(cl_mem), &B);

  error_report = clSetKernelArg(kernel_vector_add, 2, sizeof(cl_mem), &C);

  error_report =
      clSetKernelArg(kernel_vector_add, 3, sizeof(cl_uint), &array_size);       

Executing kernel

To execute a kernel, a user first needs to calculate the global and local work sizes. Once these are calculated, the user can launch the kernel on the CVengine.

Calculating local and global work size

The maximum value of local_work_size must not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE, as described in the OpenCL specification. In the CVengine, the maximum number of threads per OpenCL work group is 32.

The number can be obtained using the OpenCL API, as shown in the following code snippet:

   // getting max workgroup size
  IndexType local_size;
  error_report = clGetDeviceInfo(cvengine_device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                                 sizeof(IndexType), &local_size, NULL);       

In order to efficiently distribute the workload among the processing threads of the CVengine, it is necessary to calculate the number of global work-items, which we refer to as global_work_size. Note that the global_work_size must be divisible by the local_work_size, in order to accommodate the same number of work-items in each work group.

In order to fully utilize all processing threads in the CVengine, the global_work_size must be at least local_work_size * total_number_of_compute_unit. The total_number_of_compute_unit in the CVengine represents the total number of CVengine clusters. For the R-Car V3M CVengine, the total_number_of_compute_unit is 2 and for R-Car V3H CVengine, the total_number_of_compute_unit is 5.
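
As an illustration of this divisibility requirement (this is not part of the example code, which sizes its global work directly from the compute-unit count), an arbitrary desired number of work-items could be rounded up to the next multiple of local_size as follows; desired_work_size here is a hypothetical value:

  // rounding a desired work size up to the nearest multiple of local_size so
  // that every work group contains the same number of work-items
  IndexType rounded_global_size =
      ((desired_work_size + local_size - 1) / local_size) * local_size;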

The following code snippet shows how to obtain the number of compute units and calculate the total number of threads (global_size).

  cl_uint compute_unit;
  error_report = clGetDeviceInfo(cvengine_device, CL_DEVICE_MAX_COMPUTE_UNITS,
                                 sizeof(cl_uint), &compute_unit, NULL);
  // total number of threads for executing the kernel
  IndexType global_size = compute_unit * local_size;

Launching an OpenCL kernel

The function clEnqueueNDRangeKernel is used to dispatch the OpenCL kernel on the CVengine. The following code snippet from the OpenCL vector addition code shows the kernel dispatch.

    // dispatching the kernel
    error_report = clEnqueueNDRangeKernel(
        cvengine_command_queue, kernel_vector_add, 1, NULL, &global_size,
        &local_size, 0, NULL, &event);
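
The event returned in the last argument can be used to wait for the kernel to finish and, because the command queue was created with CL_QUEUE_PROFILING_ENABLE, to query profiling information afterwards. A minimal sketch of waiting for completion, assuming event was declared earlier as a cl_event:

    // blocking until the dispatched kernel has finished executing on the CVengine
    error_report = clWaitForEvents(1, &event);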

Data transfer

In OpenCL, the data transfer between host and device is explicit. On the CVengine, in order to achieve the best performance, a user should use CL_MEM_ALLOC_HOST_PTR and map/unmap the memory to/from the host. Once mapped, the memory is pinned on the host and the device can directly access this region of memory. Once the host operation on the mapped buffer has been performed, the host pointer returned by clEnqueueMapBuffer must be unmapped, which returns control of the buffer back to the device. A mapped memory region can be unmapped using the clEnqueueUnmapMemObject function. No copy between host and device is performed; this is known as zero-copy data transfer.

For the CVengine, leveraging zero-copy functionality is important, as the input data of applications running on the CVengine will reside in DDR SDRAM. This DDR SDRAM is shared between the CVengine and the ARM CPU. Using zero-copy eliminates any unnecessary copies from one part of DDR SDRAM to another. The following code snippet shows the enqueueing of a map buffer and the copying of the input data to this region for kernel usage.

  // Enqueueing mapbuffer to put the input data A on the map region between
  // host and device
  DataType *hptr_a = (DataType *)clEnqueueMapBuffer(
      cvengine_command_queue, A, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0,
      sizeof(DataType) * array_size, 0, 0, NULL, &error_report);

  // Enqueueing mapbuffer to put the input data B on the map region between host
  // and device
  DataType *hptr_b = (DataType *)clEnqueueMapBuffer(
      cvengine_command_queue, B, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0,
      sizeof(DataType) * array_size, 0, 0, NULL, &error_report);
  // Copying the input data into the pinned host memory, which is used by the
  // CVengine automatically
  for (IndexType i = 0; i < array_size; i++) {
    hptr_a[i] = (DataType)1;
    hptr_b[i] = (DataType)1;
  }
  // Unmapping the pointer, this will return the control to the device
  error_report =
      clEnqueueUnmapMemObject(cvengine_command_queue, A, hptr_a, 0, NULL, NULL);
  // Unmapping the pointer, this will return the control to the device
  error_report =
      clEnqueueUnmapMemObject(cvengine_command_queue, B, hptr_b, 0, NULL, NULL);

Similarly, we can use clEnqueueMapBuffer to read the result from the device without any additional copying of the data back to the host.

The following code snippet illustrates how to enqueue a map operation for the C buffer so that it can be accessed directly on the host.

  DataType *host_C = (DataType *)clEnqueueMapBuffer(
      cvengine_command_queue, C, CL_TRUE, CL_MAP_READ, 0,
      sizeof(DataType) * array_size, 0, 0, NULL, &error_report);
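
Once the results have been read on the host, the C buffer should be unmapped so that control of the memory returns to the device before the objects are released. A minimal sketch of verifying the output (each element is expected to be 2, since both inputs were set to 1 above) and unmapping the pointer:

  // verifying the results computed by the CVengine; each element should be 1 + 1
  for (IndexType i = 0; i < array_size; i++) {
    if (host_C[i] != (DataType)2) {
      printf("mismatch at index %lu\n", (unsigned long)i);
      break;
    }
  }
  // Unmapping the pointer, this will return the control to the device
  error_report =
      clEnqueueUnmapMemObject(cvengine_command_queue, C, host_C, 0, NULL, NULL);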

Releasing objects

Once the execution has finished and the data has been returned to the host, we release the OpenCL objects created for the execution. The following code snippet from the OpenCL example code for vector addition demonstrates releasing the OpenCL objects.

  error_report = clReleaseMemObject(A);
  error_report = clReleaseMemObject(B);
  error_report = clReleaseMemObject(C);
  error_report = clReleaseKernel(kernel_vector_add);
  error_report = clReleaseProgram(vec_add_program);
  error_report = clReleaseCommandQueue(cvengine_command_queue);
  error_report = clReleaseContext(rcar_context);   

OpenCL vector addition

The OpenCL C implementation of a vector addition that runs on both R-Car V3M CVengine and R-Car V3H CVengine is available in our GitHub repository.

Instructions for building and executing the vector addition on R-Car V3M/V3H CVengine

Prerequisites

  • Source file: vector_add.c
  • libOpenCL.so is located at /usr/local/lib

Build binary

  • gcc -O3 vector_add.c -L/usr/local/lib -lOpenCL -o vec_add

Run

  • ./vec_add