The purpose of this section is to describe Nvidia GPU architecture and related performance considerations within the context of the SYCL programming model. Readers are encouraged to consult the appropriate Nvidia documentation for up-to-date details of Nvidia architecture-specific performance considerations.
Architecture
Note
This section only provides a quick parallel between CUDA and SYCL terminology; for a more thorough description, refer to ComputeCpp’s SYCL for CUDA guide. Note, however, that the ComputeCpp guide only covers SYCL 1.2, so several important features introduced in SYCL 2020, such as USM, are not discussed.
The basic compute unit of an NVIDIA GPU is called a streaming multiprocessor or SM. An SM executes sub-groups composed of 32 work-items. NVIDIA calls the sub-group a warp. The entity that executes a work-item is called a thread.
A work-group is called a thread block or a cooperative thread array (CTA). The usual rules apply: a CTA is guaranteed to run concurrently on a single SM, can use local resources such as SYCL local memory (which CUDA calls shared memory), and can synchronize between its work-items.
This table summarizes the mapping between NVIDIA/CUDA terminology and SYCL terminology:
| NVIDIA/CUDA | SYCL |
|---|---|
| streaming multiprocessor (SM) | compute unit (CU) |
| warp | sub-group |
| thread | work-item |
| thread block | work-group |
| cooperative thread array (CTA) | work-group |
| shared memory | local memory |
| global memory | global memory |
| local memory | private memory |
The work-group size should always be a multiple of the sub-group size of 32, and cannot exceed 1024. The optimal work-group size is typically chosen to maximize occupancy and depends on the resources used by the particular kernel.
On NVIDIA GPUs, the device query sycl::info::device::sub_group_sizes returns a vector with a single element with the value 32.
Once the work-group size is chosen, we can pick the global size. It is often useful to make the global size a multiple of the number of compute units to help with load balancing. The device query sycl::info::device::max_compute_units returns the number of compute units.
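For example, a minimal sketch (assuming the default device selector picks the NVIDIA GPU) for printing these device limits might look like this:
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;  // assumes the default device selector picks the NVIDIA GPU
  auto dev = q.get_device();

  // Sub-group sizes supported by the device; on NVIDIA GPUs this is a single value of 32.
  for (auto s : dev.get_info<sycl::info::device::sub_group_sizes>())
    std::cout << "sub-group size: " << s << "\n";

  // Number of compute units (SMs) and the maximum work-group size (1024 on NVIDIA GPUs).
  std::cout << "compute units: "
            << dev.get_info<sycl::info::device::max_compute_units>() << "\n";
  std::cout << "max work-group size: "
            << dev.get_info<sycl::info::device::max_work_group_size>() << "\n";
}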
Then, we can write the kernel code so that each work-item (kernel instance) operates on multiple items of the problem. Thus the launch parameters can be tuned based on hardware layout without regard to the specific problem size.
For example, the following snippet of SYCL code determines the launch parameters based on the hardware layout, and then uses a loop inside the kernel to adjust to the problem size. In CUDA this type of inner loop is called a grid-stride loop.
int N = some_big_number;
int wgsize = 256;         // work-group size: a multiple of the sub-group size of 32
int ncus = dev.get_info<info::device::max_compute_units>();
int nglobal = 32 * ncus;  // number of work-groups: a multiple of the CU count

cgh.parallel_for(nd_range<1>(nglobal * wgsize, wgsize),
                 [=](nd_item<1> item) {
                   int global_size = item.get_global_range()[0];
                   // Grid-stride loop: each work-item handles multiple elements of the problem.
                   for (int i = item.get_global_id(0); i < N; i += global_size)
                     y[i] = a * x[i] + y[i];
                 });
SMs
NVIDIA SMs are partitioned into four processing blocks, called SM sub-partitions. Individual warps reside in a single sub-partition for their entire duration. When a warp is stalled, the hardware warp scheduler is able to switch to another ready warp, which implies that there should be an ample number of warps available to execute at any given time. Later we will see how to use hardware metrics to evaluate the efficiency of the warp scheduler and the time spent in stalls.
Memory
Several kinds of memory are available:
SYCL global memory (NVIDIA global memory) is located on the device, though some global addresses may refer to CUDA managed memory (SYCL shared USM) that resides on the host or on another device. All global memory accesses go through the L2 cache that is shared between all CUs. Data that is read-only for the lifetime of the kernel can also be cached in the per-CU L1 cache using the sycl::ext::oneapi::experimental::cuda::ldg function.
SYCL private memory (NVIDIA local memory) is accessible only by a specific work-item, but maps to global memory, so it has no performance advantage over global memory. It is mapped at the same address for every work-item. It is used for work-item stacks, register spills, and other work-item-local data.
SYCL local memory (NVIDIA shared memory) can be used by a work-group. It has higher bandwidth and lower latency than global memory. SYCL local memory has 32 banks; consecutive 32-bit words are assigned to different banks.
SYCL shared memory (NVIDIA managed memory) may be located on the host, or on any device where it is accessed (see the global memory description above). The migration of such memory between host and device, or device and device, is managed by the runtime. However, the user may provide performance hints to influence this behavior via the prefetch and mem_advise SYCL 2020 APIs, as illustrated in the sketch below.
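A minimal sketch of allocating shared USM and applying these hints follows; the advice value passed to mem_advise is backend-specific, and 0 is used here purely as a placeholder:
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  constexpr size_t n = 1 << 20;

  // Shared USM (NVIDIA managed memory): may migrate between host and device.
  float *data = sycl::malloc_shared<float>(n, q);
  for (size_t i = 0; i < n; ++i)
    data[i] = 1.0f;

  // Hint that the data will soon be needed on the device associated with q.
  q.prefetch(data, n * sizeof(float));

  // mem_advise takes a backend-specific advice value; 0 is only a placeholder here.
  q.mem_advise(data, n * sizeof(float), 0);

  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { data[i] *= 2.0f; }).wait();

  sycl::free(data, q);
}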
Bank conflicts
Imagine a stride-1 loop accessing an array of 32-bit floats in SYCL local memory. In the first sub-group, work-item 0 would access array element 0, work-item 1 array element 1, and so on. Thus the sub-group as a whole would access array elements 0:31, which fall into 32 different banks, so there would be no bank conflicts.
If, on the other hand, the loop had a stride of 32, then work-item 0 would access element 0, work-item 1 element 32, work-item 2 element 64, and so on. The sub-group would access 32 locations that are 32 elements apart, all falling into the same bank, resulting in a 32-way bank conflict and reducing SYCL local memory bandwidth by a factor of 32.
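To make the two access patterns concrete, here is a minimal sketch (assuming a queue q; the sizes and names are illustrative):
q.submit([&](sycl::handler &cgh) {
  // Scratch space in SYCL local memory, large enough for both access patterns below.
  sycl::local_accessor<float, 1> scratch(sycl::range<1>(32 * 32), cgh);
  cgh.parallel_for(sycl::nd_range<1>(256, 256), [=](sycl::nd_item<1> item) {
    size_t lid = item.get_local_id(0);

    // Stride-1: the work-items of a sub-group touch consecutive 32-bit words,
    // which fall into 32 different banks -- no bank conflicts.
    scratch[lid] = 1.0f;

    // Stride-32: the work-items of a sub-group touch words 32 elements apart,
    // which all map to the same bank -- a 32-way bank conflict.
    scratch[(lid % 32) * 32] = 2.0f;
  });
});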
This talk on Volta has a good description of shared memory bank conflicts on slides 54-72.
There are hardware metrics to measure bank conflicts.
Local memory size
By default, kernels are limited to 48KB of SYCL local memory per thread block. On platforms supporting larger allocations (compute capability 7.0 and above), these have to be explicitly enabled by setting the SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE environment variable to the desired maximum allocation in bytes. The device query sycl::info::device::local_mem_size returns the current SYCL local memory maximum.
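For example, a minimal fragment (assuming a queue q bound to the NVIDIA device and the usual iostream include) for checking the current limit:
// Run with e.g. SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE=65536 ./app to raise the limit,
// assuming the device's compute capability supports larger allocations.
auto local_mem = q.get_device().get_info<sycl::info::device::local_mem_size>();
std::cout << "SYCL local memory per work-group: " << local_mem << " bytes\n";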
Since compute capability 7.0, SYCL local memory and the L1 cache are physically located on the same memory die, which means that their combined size is fixed. As such, allocating a lot of local memory may impact performance by shrinking the available L1 cache.
Warning
As mentioned above, for compute capability 7.0 and newer cards the L1 cache and local memory are linked together in hardware. While local memory can be set to an arbitrary size within the platform's capabilities, the L1 cache does not have such flexibility and can only use one of a set of predetermined sizes (called carveouts). As such, allocating only slightly more local memory may cause the driver to snap to the next carveout value, limiting the L1 cache by a more noticeable amount.
The split between the memory types is chosen by the driver at runtime. It bases its decision on the actual amount of memory allocated using sycl::local_accessor. As such, setting an overly generous SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE will not have a negative impact on the chosen L1 cache amount.
The exact values for these settings for any given compute capability can be found in the CUDA C programming guide. It is important to keep in mind that Nvidia uses the terms “L1 cache” and “texture cache” interchangeably: they mean the same thing.
Memory coalescence
Different work-items will usually access different locations in global memory. If the addresses accessed by a given sub-group in the same load instruction fall into the same set of cache lines, then the memory system will issue the minimal number of global memory accesses. This is called memory coalescing. This requirement can easily be satisfied by having adjacent work-items access adjacent memory locations. Additional performance improvement can be achieved by aligning data structures on 32-byte boundaries. Obviously, indirect accesses and/or large strides may make this hard to achieve.
Recent NVIDIA GPUs have much more sophisticated cache and memory systems that make it more likely that global memory accesses will be coalesced. There are hardware metrics that can help to measure coalescence.
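As a simple illustration (assuming a queue q, USM pointers in and out, an element count n, and an illustrative stride constant), the first kernel below produces coalesced accesses while the second, strided one generally does not:
constexpr size_t stride = 32;

// Coalesced: consecutive work-items read consecutive elements, so a sub-group
// touches a minimal number of cache lines per load.
q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
  size_t i = idx[0];
  out[i] = in[i];
});

// Strided: consecutive work-items read elements 32 apart, so a sub-group
// touches many different cache lines per load.
q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> idx) {
  size_t i = idx[0];
  out[i] = in[(i * stride) % n];
});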
Caches
All GPU units share an L2 cache. The L2 cache is physically addressed. It includes facilities for data compression and global atomics (e.g., floating-point add).
Each SM has an L1 cache which is used for multiple functions. L1 throughput can be a performance limiter.
Occupancy
NVIDIA defines occupancy as the number of simultaneously active warps divided by the maximum possible number of simultaneously active warps. Increasing occupancy increases GPU utilization, which is expected to increase performance. Theoretical occupancy is determined by hardware limits, by the number of registers used by a kernel, and by the amount of shared memory used by a kernel.
Theoretical occupancy gives some guidelines for determining work-group sizes and launch parameters for a specific kernel. NVIDIA profiling tools can provide actual and theoretical occupancy for each kernel launch. It can be useful to vary the work-group size at runtime and benchmark the results to pick the best work-group size. However, it is often helpful to have an estimate of theoretical occupancy without having to run the code.
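For instance, a minimal sketch of such a benchmark using SYCL event profiling (the kernel body, problem size, and candidate work-group sizes are illustrative) might look like this:
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{sycl::property::queue::enable_profiling{}};
  constexpr size_t n = 1 << 24;
  float *x = sycl::malloc_device<float>(n, q);

  // Time the same kernel for several work-group sizes and compare.
  for (size_t wgsize : {64, 128, 256, 512, 1024}) {
    auto e = q.parallel_for(
        sycl::nd_range<1>(n, wgsize),
        [=](sycl::nd_item<1> item) { x[item.get_global_id(0)] = 1.0f; });
    e.wait();
    auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto end = e.get_profiling_info<sycl::info::event_profiling::command_end>();
    std::cout << "work-group size " << wgsize << ": "
              << (end - start) * 1e-6 << " ms\n";
  }
  sycl::free(x, q);
}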
NVIDIA provides an online occupancy calculator spreadsheet. Although it is officially deprecated and NVIDIA suggests using Nsight Compute instead, it provides the same functionality as the Nsight Compute Occupancy section and can be very helpful when one is trying to understand how theoretical occupancy is determined.
Performance Tools
In general, all the performance tools provided by Nvidia for CUDA also seamlessly handle SYCL applications using the DPC++ CUDA plugin.
This section contains a non-exhaustive introduction to some of these performance tools.
Nsight Systems (nsys)
NVIDIA Nsight Systems includes a command line tool nsys and a GUI nsys-ui. Additionally, the NVTX tracing library can be used in an application to restrict nsys analysis to regions of interest and to add additional application-specific events to the nsys analysis.
Basic Usage
nsys profile <command> <arguments>
This command will run the given command with the given arguments and collect a profile with the default settings. The output is written to a report file in an NVIDIA-specific format.
Note that nsys monitors the entire system. The command can be anything; nsys will simply start the command, start monitoring, and continue monitoring until the command exits.
nsys stats <report file>
This command will create an sqlite database from the report file and run several reports on the database. Reports include API usage, memory copies involving GPUs, and GPU kernel timings. The results are printed to the console by default. There are also options to write reports to files in csv format.
The nsys-ui command can be used to graphically examine a report file. This requires an attached display, either locally or through some sort of remote X viewer like VNC. The report file can be moved to a local system such as a laptop and the GUI run there. The report files can be quite large, and may require more memory than is available on many laptops.
With strategic use of NVTX annotation in the application and additional command line arguments, the report size can be dramatically reduced, facilitating the analysis workflow.
Reports can also be exported in JSON format. This contains the same data as the sqlite database but may be easier to parse for those who aren’t experts in SQL.
NVTX Annotations
NVTX is an instrumentation API provided by NVIDIA for use with their performance tools. It is a header-only API made available by including nvToolsExt.h; there are no libraries that need to be linked in.
NVTX can be very helpful for very large applications, where it can be used to restrict the data collection of the other Nvidia tools to a specific part of the application.
Note that NVTX can only be used to instrument host code, not device code.
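As an illustration, a minimal sketch of marking a region of interest in host code with an NVTX range (the function name is purely illustrative) might look like this:
#include <nvToolsExt.h>

void run_solver_step() {
  // Start a named range; the label shows up in the nsys timeline and can be
  // used to filter collection (see the nsys and ncu NVTX options).
  nvtxRangePushA("solver step");

  // ... host code that submits SYCL kernels goes here ...

  // End the range.
  nvtxRangePop();
}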
Refer to the Nvidia documentation for more information on how to use NVTX and what it can do.
Nsight Compute (ncu)
Previous sections discussed Nsight Systems (nsys). Nsight Compute (ncu) is a companion tool that focuses on GPU hardware performance. ncu provides access to the many hardware counters available in NVIDIA GPUs, and presents a number of predefined analysis types (called sections) that aid in understanding how well a given kernel is using the GPU.
ncu has a GUI but as in the nsys discussion we will focus on using the command line together with some processing scripts. The ncu CLI is documented at https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html.
ncu Overhead
ncu can greatly inflate the runtime of an application. There are several reasons for this:
Collecting specified metrics may require that a kernel be executed multiple times.
Some metrics require binary instrumentation of the kernel which can have high overhead.
Kernel execution is usually serialized, reducing concurrency.
See the Kernel Profiling Guide for more details.
Because of this slowdown, it is usually necessary to limit the collection process. NVIDIA provides several methods to do this:
Instrument the application with NVTX as described in the previous section on nsys, and instruct ncu to include or exclude specific NVTX ranges using the command-line options --nvtx and --nvtx-include or --nvtx-exclude. Note that the syntax to specify a range in a domain is backwards from nsys: domain@range rather than range@domain. Also, the ncu syntax is quite rich and complicated; see the NVTX Filtering section.
Collect data for only one kernel using -k kernelname. Regular expressions can be used in the kernel name. Note: C++ users may want to use the mangled name by specifying --kernel-name-base=mangled.
Collect a specified number of kernel launches using --launch-count and --launch-skip.
The above methods can be combined.
Sections
ncu has a number of predefined sets of metrics called sections. Each section is designed to help answer a specific performance question, e.g., is this application memory bound?
The sections can be listed with ncu --list-sections. Here is the output for ncu version 2022.1.1.0:
| Identifier | Display Name |
|---|---|
| ComputeWorkloadAnalysis | Compute Workload Analysis |
| InstructionStats | Instruction Statistics |
| LaunchStats | Launch Statistics |
| MemoryWorkloadAnalysis | Memory Workload Analysis |
| MemoryWorkloadAnalysis_Chart | Memory Workload Analysis Chart |
| MemoryWorkloadAnalysis_Deprecated | (Deprecated) Memory Workload Analysis |
| MemoryWorkloadAnalysis_Tables | Memory Workload Analysis Tables |
| Nvlink | NVLink |
| Nvlink_Tables | NVLink Tables |
| Nvlink_Topology | NVLink Topology |
| Occupancy | Occupancy |
| SchedulerStats | Scheduler Statistics |
| SourceCounters | Source Counters |
| SpeedOfLight | GPU Speed Of Light Throughput |
| SpeedOfLight_HierarchicalDoubleRooflineChart | GPU Speed Of Light Hierarchical Roofline Chart (Double Precision) |
| SpeedOfLight_HierarchicalHalfRooflineChart | GPU Speed Of Light Hierarchical Roofline Chart (Half Precision) |
| SpeedOfLight_HierarchicalSingleRooflineChart | GPU Speed Of Light Hierarchical Roofline Chart (Single Precision) |
| SpeedOfLight_HierarchicalTensorRooflineChart | GPU Speed Of Light Hierarchical Roofline Chart (Tensor Core) |
| SpeedOfLight_RooflineChart | GPU Speed Of Light Roofline Chart |
| WarpStateStats | Warp State Statistics |
Metrics
Instead of sections, ncu can be directed to collect specific metrics with --metrics. The available metrics can be queried with --query-metrics.
Output
The examples in this section use the --csv option to create output that is easy to parse. The --log-file option will send this output, together with some other output, to a file rather than to stdout.
The profiling report is used by the ncu GUI. It can be saved using the --export option. That file can then be moved to a different machine and viewed locally with ncu-ui.
Mangled names in the output can be selected with --print-kernel-base=mangled.
Extract kernel assembly
In some cases, to understand the performance behavior of a given kernel, it may be helpful to inspect the assembly, or in the Nvidia case the PTX, generated by the compiler for the kernel.
To extract the PTX from a SYCL application built for Nvidia GPUs, simply run the application with the environment variable SYCL_DUMP_IMAGES set to 1. This will create files in the current working directory with names such as sycl_nvptx641.bin. These files are CUDA fat binaries: they contain PTX, which is Nvidia's target-independent virtual assembly language, and SASS, which is machine code for one or multiple specific targets.
From that fat binary, the CUDA tool cuobjdump can then be used to extract both PTX and SASS, as follows:
# Extract PTX
cuobjdump --dump-ptx sycl_nvptx641.bin
# Extract SASS
cuobjdump -sass sycl_nvptx641.bin
Note that the GUI version of NVIDIA’s Nsight Compute tool should also be able to show disassembly.
Module split and loading times
Compiled device code is embedded in the output binary (executable or shared library) in self-contained sections called “modules”. Each module may comprise one or multiple kernels and may contain just the PTX or both PTX and SASS. When a program needs to execute a kernel, it will load the module into the host memory and ask the driver to load the corresponding device code onto the device. If the right SASS for the given device architecture is available, it is loaded directly. Otherwise, just-in-time compilation from PTX to SASS is triggered. If a module contains multiple kernels, the entire module is loaded on the first use of one of its kernels.
Merging multiple kernels into a single module is generally better for performance if most of the module's kernels are used over the course of the program's run time. It can reduce the overheads of reading the device code from the binary file and of launching the JIT compilation and memory transfers. On the other hand, it will lead to unnecessary work being performed if a module contains many kernels and only a small fraction of them are ever used: the unused kernels will still need to be loaded and potentially compiled, and will take up memory space. There are arguments for and against each approach, but it is worth noting that CUDA (NVCC) by default creates a separate module for each kernel, whereas DPC++ tends to combine multiple kernels into a larger module.
This may be problematic when building shared libraries with templated kernels which need to provide many template instantiations - tens or even hundreds. Imagine a function template with three template type parameters, each supporting five types and resulting in 125 combinations. If the device code for all these kernels is bundled into a single module, all of them will have to be loaded even when a typical use case only requires one combination. When analysing the performance of such a program in nsys, this may show up in the CUDA API section as a significant amount of time spent in the cuModuleLoadDataEx call. If this turns out to be a limitation for the application's performance, it might be worth splitting the kernels into separate modules.
The module splitting in DPC++ can be modified using the -fsycl-device-code-split flag. Its default value (auto) instructs the compiler to use its built-in heuristics to find an optimal split. The outcome may depend on how the source code is structured and divided into translation units, and may not always be the best choice for the intended use case (as in the shared library example above). Forcing the compiler to produce a separate module for each kernel by setting -fsycl-device-code-split=per_kernel may be beneficial in this situation.