Using the Tracy Profiler with ComputeCpp
When developing performance-sensitive applications, it is important to understand which parts of the code are critical to performance. Good profiling support is paramount for any application aiming to solve a problem more efficiently in a constrained environment. Efficiency is context-dependent: it could mean lowering the power consumption of a battery-powered embedded device, or getting peak performance from the hardware in a supercomputer.
In the context of SYCL applications, there are many factors that can affect performance. How well is the application written? How well does the compiler understand your code? Are you using the right compiler flags? Could you be doing more work in parallel? Why is this kernel taking so long to execute? What is the application doing while this kernel is running?
These are some of the questions that you might want to answer when developing your SYCL application. To help you answer these questions, we are adding native support for the Tracy Profiler to ComputeCpp Professional Edition.
This is a screenshot of Tracy showing details of a profiling session for the NBody demo available in the ComputeCpp SDK.
Tracy Profiler
Tracy is a real-time, nanosecond-resolution, remote telemetry, hybrid frame and sampling profiler for games and other applications. It is an open-source profiler that supports CPU (C, C++, Lua), GPU (OpenGL, Vulkan, OpenCL, Direct3D 12), memory, locks, context switches and more.
By adding native support for the Tracy profiler in ComputeCpp, you can connect Tracy to your application by simply enabling a configuration option. When connected, your application will immediately start sending data to Tracy, forming a nanosecond resolution execution profile that can be analyzed, searched and inspected.
Tracy can handle large amounts of data; the main requirement is enough RAM on the machine running the server. Because it is designed as a client-server application, Tracy can be used to analyze remote applications, making it well suited to embedded devices and development boards.
Enabling Tracy Profiler in ComputeCpp
In scenarios where remote profiling is not an option, due to network restrictions or lack of connectivity, it is still possible to use the ComputeCpp JSON profiler. After running the application, you can load the generated file in Google Chrome, the new Microsoft Edge browser, or Tracy itself, since it supports importing files in JSON format.
The profilers in ComputeCpp are not mutually exclusive: you can have both a real-time capture with Tracy and a JSON file at the end of your application's execution.
To enable Tracy, add enable_tracy_profiling = true to your configuration file. Note that profiling is disabled by default, so you may also need to add enable_profiling = true. When profiling support is enabled, ComputeCpp automatically activates the JSON profiler; to turn it off, use enable_json_profiling = false.
Here is an example of a configuration file that will enable Tracy and disable the JSON output:
enable_profiling = true
enable_tracy_profiling = true
enable_json_profiling = false
The ComputeCpp integration with Tracy also supports the display of performance counter data. To enable performance counters, add enable_perf_counters_profiling = true to your configuration file.
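For instance, a configuration file that enables Tracy together with performance counters, while keeping the JSON output switched off, could look like this (using only the options introduced above):
enable_profiling = true
enable_tracy_profiling = true
enable_json_profiling = false
enable_perf_counters_profiling = true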
Profiling your application with Tracy
You will now find a binary called Tracy.exe on Windows, or Tracy on Linux, when you download ComputeCpp Professional Edition. This binary is the Tracy server, and it is guaranteed to be compatible with the ComputeCpp runtime in the same package. Tracy uses a custom communication protocol, so the protocol version used by ComputeCpp must match the protocol version of the server application. For this reason, you should use the Tracy server included in the ComputeCpp PE release package to avoid any compatibility issues.
Example of a Profiling Session with Tracy
To demonstrate the capabilities of the Tracy integration with ComputeCpp, we selected the NBody simulation from the ComputeCpp SDK (see the screenshot below). This application is a good example because it launches many kernels every second and doesn't finish until it is interrupted.
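As a rough illustration of the kind of workload Tracy captures here, the sketch below (not the actual NBody demo code, which lives in the ComputeCpp SDK) repeatedly submits a SYCL kernel from a loop; this is the pattern that produces a steady stream of kernel activity in the capture. The buffer size, kernel name and update step are placeholders.
#include <CL/sycl.hpp>
#include <vector>

namespace sycl = cl::sycl;

int main() {
  constexpr size_t N = 1024;                  // placeholder body count
  std::vector<float> positions(N, 0.0f);

  sycl::queue q;                              // default-selected device
  sycl::buffer<float, 1> buf(positions.data(), sycl::range<1>(N));

  // Each iteration submits one kernel; with profiling enabled, every
  // submission contributes to the timeline Tracy displays.
  for (int step = 0; step < 1000; ++step) {
    q.submit([&](sycl::handler& cgh) {
      auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<class nbody_step>(sycl::range<1>(N),
          [=](sycl::id<1> i) {
            acc[i] += 0.1f;                   // stand-in for the real force/position update
          });
    });
  }
  q.wait_and_throw();
  return 0;
}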