Profiling SYCL applications
Profiling is an important activity when optimizing any application: it can help pinpoint where the most time is being spent and identify where improvements will have the biggest impact on performance. This article provides guidance on how to profile SYCL applications using both ComputeCpp Community Edition and ComputeCpp Professional Edition.
ComputeCpp Professional Edition includes built-in profiling support that automates the whole process using hardware counters and provides readable profiling data for you as a developer. The ComputeCpp run-time takes responsibility for injecting SYCL events for each SYCL-related function call and writes the profiling data gathered from the hardware counters in a JSON format, which can be viewed in the built-in tracing tool of any Chromium-based browser. This includes the ability to view the state of each buffer object as well as the different stages of a SYCL queue, from initialization to the end of kernel execution.
By the end of this article you will be equipped with the skills to optimize your own SYCL code and increase the performance of your applications.
ComputeCpp Community Edition Profiling: Manual profiling using SYCL events
The performance of SYCL code can be profiled via event objects, which synchronize API calls on the device and provide time points for each command submitted to a queue. This is possible because SYCL events contain OpenCL event objects that can be used to obtain accurate profiling information, measuring the execution time of a command using the hardware counters of the device.
In SYCL, the submit method of a queue returns an event object, which makes it easy to get all of the information from submission to the end of execution. However, in order to get the profiling data we need to enable profiling with events when constructing the queue, by adding sycl::property::queue::enable_profiling() as a property_list argument. This will enable profiling for memory and kernel objects.
The code below demonstrates how to manually profile a simple SYCL program. We are using the simple-vector-addition.cpp sample from the computecpp-sdk.
// Forward declaration of the templated kernel name
template <typename T>
class SimpleVadd;

// sycl_read / sycl_write are the access-mode constants and profiling_iters is
// the number of profiled iterations, all defined at file scope in the sample
template <typename T, size_t N>
void simple_vadd(const std::array<T, N>& VA,
                 const std::array<T, N>& VB,
                 std::array<T, N>& VC) {
  // Choose device to run the kernels on
  cl::sycl::default_selector deviceSelector{};
  // Initialize property list with profiling information
  cl::sycl::property_list propList{
      cl::sycl::property::queue::enable_profiling()};
  // Build the command queue (constructed to handle event profiling)
  cl::sycl::queue deviceQueue(deviceSelector, propList);

  // Set up profiling data containers
  using wall_clock_t = std::chrono::high_resolution_clock;
  using time_point_t = std::chrono::time_point<wall_clock_t>;
  std::vector<cl::sycl::event> eventList(profiling_iters);
  std::vector<time_point_t> startTimeList(profiling_iters);

  // Create the buffers for the input and output data
  cl::sycl::range<1> numOfItems{N};
  cl::sycl::buffer<T, 1> bufferA(VA.data(), numOfItems);
  cl::sycl::buffer<T, 1> bufferB(VB.data(), numOfItems);
  cl::sycl::buffer<T, 1> bufferC(VC.data(), numOfItems);

  // Submit a kernel to the queue; submit() returns a SYCL event
  for (size_t i = 0; i < profiling_iters; ++i) {
    startTimeList.at(i) = wall_clock_t::now();
    eventList.at(i) = deviceQueue.submit([&](cl::sycl::handler& cgh) {
      auto accessorA = bufferA.template get_access<sycl_read>(cgh);
      auto accessorB = bufferB.template get_access<sycl_read>(cgh);
      auto accessorC = bufferC.template get_access<sycl_write>(cgh);
      // The kernel lambda takes the work-item id, not the handler
      auto kern = [=](cl::sycl::id<1> wiID) {
        accessorC[wiID] = accessorA[wiID] + accessorB[wiID];
      };
      cgh.parallel_for<class SimpleVadd<T>>(numOfItems, kern);
    });
  }
  /*use the event data to compute submission and execution times*/
}
When using events to profile the time taken for a command to execute on the device, we have to consider the following four states that are valid for that command:
QUEUED
SUBMITTED
RUNNING
COMPLETE
We can get a time point for each state to calculate the submission and kernel execution time, as well as the overhead and total elapsed (real/wall-clock) time.
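For instance, the time spent in each state can be derived from the SYCL event using the standard command_submit, command_start and command_end profiling descriptors. The following is a minimal sketch, reusing the eventList, startTimeList and wall_clock_t names from the snippet above; since the QUEUED point is not exposed through the SYCL event, it is approximated here by the host-side timestamp recorded before submission.

// Sketch: derive the timings for one profiled iteration from its SYCL event.
// The timestamps returned by get_profiling_info are in nanoseconds.
auto ev = eventList.at(0);
ev.wait();  // make sure the command has reached the COMPLETE state

auto submit_ns = ev.get_profiling_info<
    cl::sycl::info::event_profiling::command_submit>();
auto start_ns = ev.get_profiling_info<
    cl::sycl::info::event_profiling::command_start>();
auto end_ns = ev.get_profiling_info<
    cl::sycl::info::event_profiling::command_end>();

double submission_ms = (start_ns - submit_ns) * 1e-6;  // time spent SUBMITTED
double execution_ms = (end_ns - start_ns) * 1e-6;      // time spent RUNNING
// Total elapsed (wall-clock) time, measured on the host from the timestamp
// recorded just before the submission:
double wall_ms = std::chrono::duration<double, std::milli>(
                     wall_clock_t::now() - startTimeList.at(0))
                     .count();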
Creating a generic wrapper for the profiling features using events will require parsing and exporting the information gathered from the underlying cl_events, and some sort of graphical user interface front-end that interprets the data and displays it in a meaningful way to the end user for analysis. The wrapper class should be able to compute the profiling results automatically.
For example:
[...] // after the command group submission scope
sycl_profiler profiler(eventList, startTimeList);
cout << "Kernel execution:\t" << profiler.get_kernel_execution_time() << endl;
The complete implementation of such a wrapper, referenced at the /*use the event data to compute submission and execution times*/ comment in the snippet above, is not going to be fully discussed here, but a simple example can be seen in the following GitHub gist.
This is a very good example to get you started in building your own SYCL profiling wrapper class or utility. However, ComputeCpp addresses the need for a more complete profiler, one that is both powerful and requires no code modifications, through features in the Professional Edition that are explained below. It is also important to note that this profiling strategy only works when running on an OpenCL device; when running on the host device, the ComputeCpp implementation cannot provide any profiling information from the SYCL events (use a standard host timer for that purpose instead).
ComputeCpp Professional Edition: Automatic Profiling
ComputeCpp PE provides an automatic profiling feature that contains information about the run-time and the underlying OpenCL implementation. This means it will not only enable the profiling property in the queue and inject events for us, but also give us a detailed overview of the entire SYCL program configuration. This information is conveniently provided in a single JSON file that can be viewed in the Chromium tracer (chrome://tracing), allowing us to see a graphical representation of the profiling data.
Setup
The ComputeCpp PE profiler is configurable, meaning that you have to define a configuration file for the run-time to know it needs to provide profiling data. This also enables some flexibility as to what event information you can capture.
Set the option enable_profiling = true in the configuration file to enable profiling in your application. For a full list of options to configure the profiler, see the Profiling Manual.
In order to enable the JSON profiler, we need to set the COMPUTECPP_CONFIGURATION_FILE environment variable and point it to the configuration file that was created. Inside the configuration file we can define the profiler options described above.
After the profiler behavior is set up, just run your application and the JSON file that contains the profiling data will be generated at the end of the program execution.
By default, when the application finishes, the run-time will write the JSON file to the current working directory, which is usually the same directory as the application binary, using the following format:
[executable_name]_[current_date].json
You can also override the default output file via the COMPUTECPP_PROFILING_OUTPUT environment variable (the file doesn't need to exist already, but the location must have read-write permissions).
Profiling Information
Based on the selected profiling options, the Chrome tracing view will display information about the memory objects (buffers) and the queue activity. All times are measured in nanoseconds and converted to microseconds for the JSON file; the Chrome interface then displays them in milliseconds. The Start Time is also provided alongside the Wall Duration.
Below is a table that shows the meaning of the different categories in the Buffers and Queue views.
Buffers View | Queue View |
---|---|
Count of the elements | CREATED: Initialization of the kernel function object (includes setting the buffer arguments). |
Size in bytes for the type of object stored in the buffer | RUNNING: Transaction that indicates the kernel execution on the device. Memory transfers are included here as well under the Requisites section. |
Range of the buffer (1, 2, or 3 dimensions) | COMMAND: Shows the type of SYCL command (e.g., "kernel enqueue" - enqueues a command to execute a kernel on a device). |
Size in bytes for the entire buffer | kernels: Snapshot of the kernels that are executed by the command group handler. |
The Queue view represents several more states, describing everything a transaction can go through in order to be resolved. Here is a screenshot of the complete transaction state machine, including a brief overview of each state:
Now, let's revisit the vector addition sample. In this experiment we will try to optimize it, using the full mode of the automated JSON profiler to see the performance impact of our changes.
First we have to set up the profiler:
# in the sdk root
cd build && mkdir config
touch config/sycl_config.txt && echo "enable_profiling = true" >> config/sycl_config.txt
export COMPUTECPP_CONFIGURATION_FILE=config/sycl_config.txt
and set the desired output file for the original version of the program (we will switch to a separate file for the optimized version later).
# in the sdk root
cd build && mkdir profiling
touch profiling/vadd_orig.json
export COMPUTECPP_PROFILING_OUTPUT=profiling/vadd_orig.json
Now we only have to compile and run the SYCL vector addition program and the ComputeCpp run-time will generate the file with all the profiling data.
Let's have a look at the visual output, starting with the buffer information:
These are all buffer objects for the six vectors we use for the vadd operation - 3 int and 3 float, where 2 of each are input vectors and 1 is the output.
Clicking on the green circle for Buffer 1 allows us to inspect the object, including information on the time it was created and the arguments that define it, which were explained in the Buffers View column of the table above.
Here is the selection output:
args: {Count: "8", Element Size: "4", Range: "(8)", Size in Bytes: "32"}
It can be interpreted as follows:
- The buffer object holds an array of eight elements
- Each element of the array is 4 bytes
- The buffer range is 8 elements in the first dimension
- The total size of the buffer object is 32 bytes
Additionally, if we were to use the two and three dimensional buffer counterparts, they would be represented almost identically, with the only difference being the range of the elements in the corresponding dimension(s).
2 Dimensional buffer:
args: {Count: "8", Element Size: "4", Range: "(2, 4)", Size in Bytes: "32"}
3 Dimensional buffer:
args: {Count: "8", Element Size: "4", Range: "(2, 2, 2)", Size in Bytes: "32"}
Next up is the "Queue" view where we can inspect the performance of the kernel-related commands.
We can also see the access mode and the address space for the buffer (device) data accessors. Clicking on the blue circle for Buffer 1 gives:
args: {Mode: "Read", Space: "Global"}
In order to see how long it takes to enqueue the kernel, we click on the COMMAND slice in the Req. & Cmds. category, which is shortened to C there. You can zoom in on the tracer window if you wish to focus on a particular item in the trace.
This says that the enqueue took 1.879 ms. Here are the more important details from the selection:

Title: COMMAND
Category: command
Start: 52.766 ms
Wall Duration: 1.879 ms
Args: Type: "Enqueue Kernel"
The overall device run-time (submission + kernel execution) can be viewed in the "RUNNING" category by clicking on the purple R there. This shows a Wall Duration of 2.041 ms.
Again, here are the important bits of the selection output:

Title: RUNNING
Category: transaction
Start: 52.729 ms
Wall Duration: 2.041 ms
Args:
Buffer 0: Snapshot of Buffer 0 object @ 52.729 ms
Buffer 1: Snapshot of Buffer 1 object @ 52.756 ms
Buffer 2: Snapshot of Buffer 2 object @ 52.761 ms
kernel: "SYCL_class_SimpleVadd_int_"
The RUNNING transaction starts at 52.729 ms, while the enqueue kernel command we discussed above was fired at 52.766 ms. This is because the transaction first acquires access to the data in buffers 0, 1, and 2.
And now let's look at the actual kernel execution time by clicking on the kernel slice in the kernels category. In the image below you can see that the slice is really small but you can select it by using the free selection tool of the tracer.
As you can see, you can track the start and duration of every command on the device. The same can be done for Queue (pid 1), which is the second reference of the queue, with the SimpleVadd kernel for vectors of type float.
If you are looking for a more summarized view of the execution rather than inspecting the timeline, you can make use of another great feature - the "Metadata for Trace" interface. Click the M button in the top right of the tracer UI and a window will pop out containing information about the Application, the Device it was run on, the Queue instances and their properties, and Kernel Execution Statistics. Here is an example from the same application we have been profiling.
Having profiled the current version of the vector addition kernel, let's attempt an optimization.
First, we need to change the profiling output destination:
# in the sdk build/
export COMPUTECPP_PROFILING_OUTPUT=profiling/vadd_optimized.json
There isn't much that can be done for such a simple kernel, but we can try to coalesce the global memory accesses and run the kernel on a single work-group.
Here is how the kernel code looks after the modification:
auto kern = [=](cl::sycl::nd_item<1> wi) {
  size_t wiID = wi.get_global_id(0);
  size_t groupSize = wi.get_global_range(0);
  size_t elementsCount = N;
  // Stride through the data so that neighbouring work-items access
  // neighbouring elements, allowing the reads to be coalesced
  for (auto i = wiID; i < elementsCount; i += groupSize) {
    accessorC[i] = accessorA[i] + accessorB[i];
  }
};
Briefly, what hopefully happens with this modification is that as each work-item reads its next elements, the reads are combined by the hardware so that we get 64 bytes per read. This is very specific to the device used to execute this kernel, which in this case is an Intel integrated GPU with a 64-byte global memory cache line size. This should work equally well for both the int and float element types used to test the program.
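For context, here is a sketch of how this kernel could be enqueued on a single work-group, as described above; the work-group size of 64 is an assumption for illustration, not a value taken from the sample.

// Sketch: enqueue the strided kernel on a single work-group.
// workGroupSize is an assumed value; tune it for your device.
constexpr size_t workGroupSize = 64;
deviceQueue.submit([&](cl::sycl::handler& cgh) {
  auto accessorA = bufferA.template get_access<sycl_read>(cgh);
  auto accessorB = bufferB.template get_access<sycl_read>(cgh);
  auto accessorC = bufferC.template get_access<sycl_write>(cgh);
  auto kern = [=](cl::sycl::nd_item<1> wi) {
    size_t wiID = wi.get_global_id(0);
    size_t groupSize = wi.get_global_range(0);
    for (size_t i = wiID; i < N; i += groupSize) {
      accessorC[i] = accessorA[i] + accessorB[i];
    }
  };
  cgh.parallel_for<class SimpleVadd<T>>(
      // A single work-group: the global range equals the local range
      cl::sycl::nd_range<1>(cl::sycl::range<1>(workGroupSize),
                            cl::sycl::range<1>(workGroupSize)),
      kern);
});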
Now let's have a look at the profiling output in the Chrome tracer.
The RUNNING transaction now takes 1.832 ms, as opposed to 2.041 ms for the non-optimized version.
As for the kernel execution time, which was 0.009 ms for both the int and float kernel instances, it is now 0.008 ms.
What is more interesting, however, is that the memory bandwidth is 1 MB higher as well - 21 MB in the optimized version versus 20 MB in the original version for the int type.
The purpose of this demonstration was not necessarily to optimize the vector addition sample, but to show how handy the visual representation of the profiling data provided by the ComputeCpp Professional Edition built-in JSON profiler is. It can also benefit the optimization process when working on SYCL code, enabling developers to examine every bit of performance information.
We are adding new features to the profiler; for example, there will be a Python tool that can merge two profiling output JSON files into a single one, making it easier to analyze differences in performance and the impact of varying approaches.
Tracy Profiler
Tracy is a real time, nanosecond resolution, remote telemetry, hybrid frame and sampling profiler for games and other applications. Tracy is also open-source.
Tracy support is still experimental, so it is disabled by default. To enable it, use the configuration option enable_tracy_profiling.
Tracy will spawn threads that are responsible for processing and sending the information to the Tracy Server standalone application, which means leaving it disabled will prevent extra threads from being spawned by your application, thus saving resources.
Launch the Tracy server application, which can be found in the bin folder of the ComputeCpp package. When it opens, you can select the host where the application is running or leave it to start listening for connections on the local machine.
After clicking on the Connect button, the profiler will wait for incoming connections:
Now, just run your ComputeCpp application. If you are on Windows, please see the section Known Limitations.
When an application is connected, profiling data will be streamed in real-time and you can follow the data being captured.
Tracy provides a rich interface with several interesting features. In the following image we highlight the four main regions of Tracy and the data they display.
- This is the region of device side events. Here you can see when a kernel was executed and how long it took to run.
- This is the main thread of the application. Events shown here were produced by calls to ComputeCpp from the main thread of your application.
- Tracy can display data from multiple threads. The thread highlighted in this region is actually the ComputeCpp Scheduler Thread. Here you can trace back events to things you did in your application, so hotspots can be identified and fixed.
- This is the region where we display performance counters. By default, Tracy will capture the CPU Usage through the lifetime of the application, but ComputeCpp supports capturing performance counters from the selected performance counter backend. If this option is enabled and the device where the SYCL kernels are running exposes these counters, they will be displayed here.
Please refer to the Tracy Manual for a complete list of all features Tracy has to offer.
When running the application with elevated privileges you get even more information, like a visual display of context switches (i.e., which CPU core a thread is running on), thread names, and system-wide call stacks:
Known Limitations
- Tracy uses a custom communication protocol between the client (your application) and the server. The protocol version is required to match in order for a connection to be established, so make sure you are using the Tracy server application that is shipped with ComputeCpp.
- On Windows, for short lived applications, profiling information won't be captured due to DLL shutdown behaviour in the OS. Basically, Windows will terminate threads after the main function returns, leaving no time for the data to be sent across to the Tracy Server. To overcome this limitation, make sure the main function doesn't return before connecting to the Tracy Server.