ComputeCpp Configuration File
Using the configuration file
A configuration file may be used to configure the behavior of several aspects of the ComputeCpp runtime.
WARNING: ComputeCpp is designed to work out-of-the-box, so a configuration file is not required. Also note that some configuration options are highly experimental and may cause undesired behavior in your SYCL application.
Some configuration options are only available in the ComputeCpp Professional Edition.
How to set a configuration file
To set a configuration file, follow these steps:
- Create a file in your system that the application can read
- Fill the configuration file with options as specified below
- Point the COMPUTECPP_CONFIGURATION_FILE environment variable to the newly created file
The following are one-line examples showing how to create and set the configuration file:
Windows:
type NUL > sycl_config.txt & set COMPUTECPP_CONFIGURATION_FILE=%CD%\sycl_config.txt
Linux:
touch sycl_config.txt; export COMPUTECPP_CONFIGURATION_FILE=`pwd`/sycl_config.txt
Note that if COMPUTECPP_CONFIGURATION_FILE points to a file that doesn't exist, the runtime will output a warning and continue using the defaults shown below. Reading the configuration file happens at program initialization, before entering main.
Example of a configuration file with all default options
The following example shows a configuration file with all default options available. Using the following configuration file won't change the default behavior of ComputeCpp but is a good summary of all available options.
verbose_output = false
use_memory_mapping = true
enable_spirv_check = false
yield_scheduler_loop = true
enable_profiling = false
enable_json_profiling = true
enable_tracy_profiling = false
enable_kernel_profiling = true
enable_buffer_profiling = false
enable_perf_counters_profiling = false
profiling_collapse_transactions = false
force_queue_finish = false
submit_memory_check = false
lock_memory_tracking = false
kernel_checks = false
host_ptr_flag = automatic
page_size = 4096
cache_line_size = 64
max_running_transactions = 0
scheduler_sleep_time_ns = 0
enable_autobinding = false
secondary_queue_mode = none
flushing_policy = on_submit
use_out_of_order_queue = false
reduction_workgroup_size = 0
Configuration options
The options in the configuration file are key-value pairs in the format
[option] = [value]
where option is the name of the configuration option and value is the corresponding value.
Each option has a default value that the runtime will use if no value is specified for that option. Note that invalid options will be ignored.
verbose_output
- Type:
bool
- Default:
false
Enables the runtime verbose output mode. The runtime will print debug
information, including but not limited to:
* The selected device
* Details of a thrown exception, including its what() message and the location of the throw
use_memory_mapping
- Type:
bool
- Default:
true
- Professional Edition Feature
When moving data to and from devices, ComputeCpp maps memory buffers on the host to manipulate their contents. This is the best option for most platforms, but on some it might be better to use explicit data movement through reads and writes. Disabling this option will force ComputeCpp to perform explicit reads and writes instead of mapping.
enable_spirv_check
- Type:
bool
- Default:
false
- Professional Edition Feature
If enabled, ComputeCpp will try to perform additional checks on whether or not SPIR-V is supported by a device. This includes checking for the cl_khr_il_program extension on devices that don't support OpenCL 2.1 and querying CL_DEVICE_IL_VERSION. SPIR-V might be supported by platforms that don't report the cl_khr_il_program extension, hence the check is disabled by default. Enabling this will cause the runtime to throw an exception if spirv64 is used on a device that doesn't report SPIR-V support.
yield_scheduler_loop
- Type:
bool
- Default:
true
- Professional Edition Feature
- See scheduler_sleep_time_ns
ComputeCpp has an advanced command group scheduler that uses its own thread to schedule work created by the user. The scheduler's thread will, by default, call std::this_thread::yield in each iteration. The behavior of this function is implementation-defined, but it should be a hint to the OS that other threads may perform some work and this thread should be rescheduled. The call to the yield function can be disabled using this configuration option.
enable_profiling
- Type:
bool
- Default:
false
- Professional Edition Feature
Enables ComputeCpp automatic profiling capabilities.
For more information please refer to Profiling SYCL applications.
enable_json_profiling
- Type:
bool
- Default:
true
- Professional Edition Feature
- This option only has an effect when enable_profiling is enabled.
Enables the ComputeCpp embedded profiler, which records a timeline of events that can help identify performance issues. Enabling this option will output a JSON file that can be visualized in Chrome or Chromium by opening chrome://tracing and loading the file. Note that the profiling output will not be written until the application exits.
When profiling is enabled, the JSON profiler is enabled by default, however, this option can be used to disable it in favor of another profiler.
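For example, a minimal configuration that turns on profiling with the JSON timeline output could look like this (listing enable_json_profiling is only for explicitness, since it already defaults to true):

```
enable_profiling = true
enable_json_profiling = true
```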
enable_tracy_profiling
- Type:
bool
- Default:
false
- Professional Edition Feature
- This option only has an effect when enable_profiling is enabled.
Tracy is not enabled by default in ComputeCpp and requires this option to function. The profilers are not mutually exclusive: both enable_json_profiling and enable_tracy_profiling can be enabled at the same time. Enabling this option will allow the application to connect to the Tracy server so profiling can be visualized in real-time while the application is running. Tracy support is still experimental, so it is disabled by default.
enable_kernel_profiling
- Type:
bool
- Default:
true
- Professional Edition Feature
- This option only has an effect when enable_profiling is enabled.
When both enable_profiling and enable_kernel_profiling are set to true, the ComputeCpp profiler will record kernel timings. This works by creating OpenCL queues with the property CL_QUEUE_PROFILING_ENABLE, which might affect the behavior of the underlying OpenCL queue on some platforms. Recording kernel timings can be disabled by turning off this option.
enable_buffer_profiling
- Type:
bool
- Default:
false
- Professional Edition Feature
- This option only has an effect when enable_profiling is enabled.
ComputeCpp's profiler will record events on which buffers are used by each kernel. This allows the profiler to display the size and usage of each buffer as well as where each one was used. For small applications this is useful, but when applications create thousands of buffers the interface can become difficult to handle, so this option is turned off by default.
enable_perf_counters_profiling
- Type:
bool
- Default:
false
- Professional Edition Feature
- This option only has an effect when enable_profiling is enabled.
ComputeCpp supports reading performance counters from a number of devices (see perf_counter_backend), but this feature is disabled by default. Enabling this option will allow performance counter data to be displayed in the profilers.
perf_counter_backend
- Type: multiple choice
- Default:
default
- Professional Edition Feature
- This option only has an effect when enable_profiling and enable_perf_counters_profiling are enabled.
ComputeCpp currently supports performance counters on Intel GPU devices, ARM Mali devices and devices supporting the OpenCL extension Codeplay Performance Counters. This option can be set to one of the following:
- default: Collect ARM Mali performance counters if on an ARM platform, otherwise collect Intel GPU performance counters.
- intel_gpu: Collect Intel GPU performance counters.
- arm_mali: Collect ARM Mali performance counters.
- codeplay: Collect Codeplay Performance Counters.
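For example, collecting Intel GPU performance counters requires all three options involved to be set together (a sketch; pick the backend that matches your hardware):

```
enable_profiling = true
enable_perf_counters_profiling = true
perf_counter_backend = intel_gpu
```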
profiling_collapse_transactions
- Type:
bool
- Default:
false
- Professional Edition Feature
- This option only has an effect when enable_profiling is enabled.
The unit of work in ComputeCpp is known as a transaction. A transaction consists of the statements written inside a command_group submission. Each transaction is treated as an atomic operation by the runtime, but a transaction goes through several stages during its lifetime. By default, the ComputeCpp profiler will record each stage of a transaction as a separate event. This is useful for identifying bottlenecks in the transaction scheduling mechanism, as one stage might be taking more time than it should due to an unsatisfied requisite. For very large applications, when the number of submissions is on the order of tens of thousands, it might be useful to represent all the individual transaction stages as a single event, which can be done by enabling this option.
force_queue_finish
- Type:
bool
- Default:
false
- Professional Edition Feature
By default, ComputeCpp will call clFlush after each submit to ensure execution can start as soon as possible. However, some platforms require clFinish to be called, or the submitted work will take a long time to start executing, so this option is provided to handle that situation. Note that clFinish is a blocking call that will wait for all the work in the queue to finish before returning. In the majority of scenarios, this option should not be enabled.
submit_memory_check
- Type:
bool
- Default:
false
- Professional Edition Feature
ComputeCpp tracks the amount of memory allocated on a device and will throw an exception if the device cannot allocate the requested memory. By enabling this option, the ComputeCpp runtime will perform extra checks before the allocation takes place and throw a different exception with a more specific error message. However, this option is not enabled by default because the checks don't take virtual memory into account, so they might be too restrictive.
lock_memory_tracking
- Type:
bool
- Default:
false
- Professional Edition Feature
- This option only has an effect when submit_memory_check is enabled
When submit_memory_check is used in the context of multi-threaded applications, it might be difficult to identify which submission is causing the device to run out of memory. Enabling this option will force the runtime to block the submission of work while these checks are being performed. It will cause a performance degradation, but it might make it easier to find the problematic submission so that the application can be refactored.
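For example, a debugging configuration for tracking down an out-of-memory submission in a multi-threaded application might combine both options (for debugging builds only, given the performance cost):

```
submit_memory_check = true
lock_memory_tracking = true
```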
kernel_checks
- Type:
bool
- Default:
false
- Professional Edition Feature
- WARNING: Highly experimental feature
- Enabling this option will implicitly enable verbose_output
Debugging compute kernels is a difficult process. To facilitate this, ComputeCpp can verify whether a kernel is accessing memory outside of its available bounds and will try to report information to help developers identify where the problem is. Enabling this option will cause all memory accesses inside a kernel to be validated and will add a synchronization point after each submission, since the verification needs to wait for a kernel to finish before executing the next one. This is a debugging feature that will cause a large performance degradation, and as such it should be used with caution.
host_ptr_flag
- Type: multiple choice
- Default:
automatic
- Professional Edition Feature
This option controls which host pointer flag to use when creating buffers. More information on these flags can be found in the clCreateBuffer documentation. The possible values are:
- automatic: The runtime is allowed to choose the most appropriate flag.
- none: Do not use any flag.
- cl_alloc_host_ptr: Force the use of CL_MEM_ALLOC_HOST_PTR.
- cl_use_host_ptr: Force the use of CL_MEM_USE_HOST_PTR.
page_size
- Type:
size_t
- Default:
4096
- Professional Edition Feature
- See host_ptr_flag
Controls the page size used by ComputeCpp's default_allocator, the allocator used to allocate memory for buffers and images. When CL_MEM_USE_HOST_PTR is used, the application tells the OpenCL runtime that the memory referenced by the host pointer should be used as storage for the OpenCL memory object. In some implementations, using this option can cause data to be pinned so that data movement between host and device is very efficient; however, one of the requirements is that the memory needs to be aligned with the page size the device uses to transfer data. If the page size for a device is not equal to 4096, this option can be used to change it.
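For instance, on a hypothetical device that transfers data in 8192-byte pages, a configuration forcing CL_MEM_USE_HOST_PTR with a matching page size might look like this (the 8192 value is purely illustrative; use your device's actual page size):

```
host_ptr_flag = cl_use_host_ptr
page_size = 8192
```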
cache_line_size
- Type:
size_t
- Default:
64
- Professional Edition Feature
- See host_ptr_flag
Controls the size multiplier used by the ComputeCpp default_allocator to allocate memory. This value should, ideally, be the size of the cache line of a given compute unit so that memory allocation can be properly aligned.
max_running_transactions
- Type:
unsigned int
- Default:
0
- Professional Edition Feature
Controls the maximum number of simultaneously running transactions ComputeCpp is allowed to have. 0 means the runtime is allowed to choose this number. Setting this number to 1, for instance, means the runtime will wait for every kernel to finish before submitting another.
This number can be interpreted as the maximum number of command groups the runtime will submit to the OpenCL driver before waiting for others to finish. Note that while a queue only executes one kernel at a time, depending on the platform and device, the driver might lose performance when several kernels are submitted in a short period of time, so this option can be used to limit the number of simultaneous submissions.
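For example, a sketch of a configuration that serializes submissions so that only one command group is in flight at a time, which can help when investigating whether rapid submissions are degrading driver performance:

```
max_running_transactions = 1
```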
scheduler_sleep_time_ns
- Type:
unsigned long long
- Default:
0
- Professional Edition Feature
- WARNING: Highly experimental feature
ComputeCpp's command group scheduler uses a thread to process incoming transactions. This thread uses, by default, a polling mechanism to see if the user has submitted a command group to be executed. When the application targets an accelerator device, like a GPU, this mechanism works well because the CPU is free to run the scheduler thread as fast as it can. In platforms where the device is the CPU itself, setting a sleep time for the scheduler thread can have performance benefits and can reduce power consumption as well.
The polling mechanism may not be suitable for all platforms, in particular, having a thread looping all the time will keep one CPU core at maximum utilization. This configuration option can be used to specify a time duration that the scheduler thread is allowed to sleep in each iteration. Note that the thread will not always sleep for the specified time as it will be awakened as soon as the application submits a new command group to be processed and also when a command group finishes work.
It is recommended to enable this option when running on a CPU device, whereas for a GPU or accelerator device the benefit depends on the execution time of the kernels involved. An application with larger kernels would benefit less from this option or might even see a performance degradation. It is also recommended to turn off yield_scheduler_loop when using the sleep time.
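Following the recommendations above, a sketch of a configuration for a CPU device might set a sleep time and disable the yield call (the 100000 ns value is illustrative and should be tuned for your workload):

```
scheduler_sleep_time_ns = 100000
yield_scheduler_loop = false
```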
enable_autobinding
- Type:
bool
- Default:
false
- Professional Edition Feature
WARNING: The autobinding feature is not compatible with the host device.
Auto binding is a feature that allows ComputeCpp to automatically bind placeholder accessors to command groups. A placeholder accessor normally needs to be bound to a command group by calling cl::sycl::handler::require inside the command group. By turning on this feature, ComputeCpp can automatically bind the placeholder accessors that a command group requires.
secondary_queue_mode
- Type: multiple choice
- Default:
none
- Professional Edition Feature
For each queue created, ComputeCpp can create a secondary queue that is used according to the policy defined by this configuration option. The secondary queue can be used to automatically overlap data movement and computation. Note that the secondary queue will be created for the same device; thus, it might not be beneficial to use a secondary queue on all platforms. This option can be set to one of the following:
- none: No secondary queue is created.
- data_ops: The secondary queue is allowed to take compute tasks from the primary queue.
- data_steal: The secondary queue is allowed to take memory movement operations from the primary queue.
- any_steal: The secondary queue is allowed to take any task from the primary queue.
flushing_policy
- Type: multiple choice
- Default:
on_submit
- Professional Edition Feature
- WARNING: Highly experimental feature
ComputeCpp will call a clFlush operation after each command group to ensure that the OpenCL driver starts to execute a command group as soon as it can. Even though clFlush is an asynchronous call, it can cause some OpenCL drivers to start execution immediately, which can effectively make subsequent submissions wait until the current submission has started. This can also cause performance degradation if a large number of small kernels are being submitted, as some drivers prefer to accumulate several submissions before a flush in order to schedule work that makes full use of a device.
The flushing policy can be one of the following:
- on_submit: Always perform a clFlush after each command group submission.
- on_wait: Only perform a clFlush when necessary. clFlush is called when the user waits on some runtime event. This wait can be an explicit call to cl::sycl::queue::wait or cl::sycl::event::wait, a cl::sycl::buffer destruction, cl::sycl::stream operations, or when submissions are made to different queues. The runtime will try to flush only when required, to avoid excessive calls to clFlush.
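As a sketch, an application that submits a large number of small kernels might switch to the deferred policy described above, letting the driver accumulate submissions between flushes:

```
flushing_policy = on_wait
```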
use_out_of_order_queue
- Type:
bool
- Default:
false
- Professional Edition Feature
ComputeCpp uses an in-order OpenCL queue. This option enables the runtime to use an out-of-order queue instead, if the device supports it.
reduction_workgroup_size
- Type:
size_t
- Default:
0
- Professional Edition Feature
Controls the workgroup size used for SYCL reduction operations. A value > 0 here will override the workgroup size calculated by the runtime. The value set here is not checked, so it may result in errors if set higher than your OpenCL implementation's maximum workgroup size. Reductions may also use some local memory, so high workgroup sizes combined with kernels that use a lot of local memory may result in local memory exhaustion.