info

Please note that you are viewing a guide targeting an older version of oneAPI for NVIDIA® GPUs. This guide was designed for version 2024.0.2.

Performance on Nvidia GPUs

The purpose of this section is to describe Nvidia GPU architecture and related performance considerations within the context of the SYCL programming model. Readers are encouraged to consult appropriate Nvidia documentation for up to date details of Nvidia architecture specific performance considerations.

Architecture

link

Note

This section only provides a quick parallel between CUDA and SYCL terminology, for a more thorough description, refer to ComputeCpp’s SYCL for CUDA guide. Note however that the ComputeCpp guide only covers SYCL 1.2; therefore several important features introduced in SYCL 2020 such as USM are not discussed.

The basic compute unit of an NVIDIA GPU is called a streaming multiprocessor or SM. An SM executes sub-groups composed of 32 work-items. NVIDIA calls the sub-group a warp. The entity that executes a work-item is called a thread.

A work-group is called a thread block or a cooperative thread array (CTA). The usual rules apply: a CTA is guaranteed to run concurrently on the same SM, and can use local resources like SYCL local memory (which CUDA calls shared memory), and can synchronize between work-items.

This table summarizes the mapping between NVIDIA/CUDA terminology and SYCL terminology:

NVIDIA/CUDA	SYCL
streaming multiprocessor (SM)	compute unit (CU)
warp	sub-group
thread	work-item
thread block	work-group
cooperative thread array (CTA)	work-group
shared memory	local memory
global memory	global memory
local memory	private memory

The work-group size should always be a multiple of the sub-group size of 32. The optimal work-group size is typically chosen to maximize occupancy, and depends on the resources used by the particular kernel. It cannot exceed 1024.

On NVIDIA GPUS, the device query sycl::info::device::sub_group_sizes returns a vector with a single element with the value 32.

Once the work-group size is chosen, we can pick the global size. It is often useful to make the global size a multiple of the number of compute units to help in load balancing. The device query sycl::info::device::max_compute_units returns the number of compute units.

Then, we can write the kernel code so that each work-item (kernel instance) operates on multiple items of the problem. Thus the launch parameters can be tuned based on hardware layout without regard to the specific problem size.

For example the following snippet of SYCL code determines the launch parameters based on the hardware layout, and then uses a loop inside the kernel to adjust to the problem size. In CUDA this type of inner loop is called a grid-stride.

    int N = some_big_number;
    int wgsize = 256;
    int ncus = dev.get_info<info::device::max_compute_units>();
    int nglobal = 32 * ncus;
    cgh.parallel_for(nd_range<1>(nglobal * wgsize, wgsize),
      [=](nd_item<1> item)
      {
        int global_size = item.get_global_range()[0];
        for (int i = item.get_global_id(0); i < N; i += global_size)
          y[i] = a * x[i] + y[i];
      });

SMs

link

NVIDIA SMs are partitioned into four processing blocks, called SM sub partitions. Individual warps reside in a single sub partition for their entire duration. When a warp is stalled, the hardware warp scheduler is able to switch to another ready warp. This all implies that there should be an ample number of warps available to execute at any given time. Later we will see how to use hardware metrics to evaluate the efficiency of the warp scheduler and the time spent in stalls.

Memory

link

Several kinds of memory are available:

SYCL global memory (NVIDIA global memory) is located on the device, though some global addresses may refer to cuda managed memory (using SYCL shared USM) that resides on the host or another device. All global memory accesses go through the L2 cache that is shared between all CUs. Data that is read only for the lifetime of the kernel can also be cached in the per-CU L1 cache using the sycl::ext::oneapi::experimental::cuda::ldg function.
SYCL private memory (NVIDIA local memory) is accessible only by a specific work-item, but maps to global memory, so has no performance advantage over global memory. It is mapped at the same address for every work-item. It is used for work-item stacks, register spills, and other work-item local data.
SYCL local memory (NVIDIA shared memory) can be used by a work-group. It has higher bandwidth and lower latency than global memory. SYCL local memory has 32 banks. Consecutive 32-bit words are assigned to different banks.
SYCL shared memory (NVIDIA managed memory) may be located on the host, or any device (see global memory description above) where it is accessed. The migration of such memory between host and device, or device and device, is managed by the runtime. However, the user may provide performance hints in order to influence this behavior via the prefetch and mem_advise SYCL 2020 APIs.

Bank conflicts

link

Imagine a stride-1 loop accessing a 32-byte float array in SYCL local memory. In the first sub-group, work-item 0 would access array element 0, work item 1 array element 1, and so on. Thus the sub-group as a whole would access array elements 0:31 and there would be no bank conflicts.

If on the other hand the loop was stride-32, then the work items would access elements 32, 64, 96, .. and the sub-group would access 32 locations that were 32 elements apart, resulting in a 32-way bank conflict, and reducing SYCL local memory bandwidth by a factor of 32.

This talk on Volta has a good description of shared memory bank conflicts on slides 54-72.

There are hardware metrics to measure bank conflicts.

Local memory size

link

By default kernels are limited to 48KB of SYCL local memory per thread block. For platforms supporting larger allocations (compute capability 7.0 and above) they have to be explicitly enabled by setting the SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE environmental variable to the desired allocation maximum in bytes. The device query sycl::info::device::local_mem_size returns the current SYCL local memory maximum.

Since compute capability 7.0 the SYCL local memory and the L1 cache are physically located on the same memory die. This means that their sum is a fixed value. As such allocating a lot of shared memory may impact performance by shrinking the available L1 cache.

Warning

As mentioned above for compute capability 7.0 and newer cards the L1 cache and local memory are linked together by hardware. While shared memory can be set to any arbitrary size within the platform capabilities L1 does not have such flexibility and can only use one of the predetermined values (called carveouts). As such allocating only slightly more local memory may cause the driver to snap to the next carveout value limiting L1 by a more noticeable amount.

The split between the memory types is chosen by the driver at runtime. It bases its decision on the actual amount of memory allocated using sycl::local_accessor. As such setting an overly generous SYCL_PI_CUDA_MAX_LOCAL_MEM_SIZE will not have a negative impact on the chosen L1 cache amount.

The exact values for these settings for any given compute capability can be found in the CUDA C programming guide. It is important to keep in mind that Nvidia uses the terms “L1 cache” and “texture cache” interchangeably: they mean the same thing.

Memory coalescence

link

Different work-items will usually access different locations in global memory. If the addresses accessed by a given sub-group in the same load instruction are to the same set of of cache lines, then the memory system will issue the minimal number of global memory accesses. This is called memory coalescing. This requirement can easily be satisfied by work-items accessing adjacent memory locations. Additional performance improvement can be achieved by aligning data structures on 32-byte boundaries. Obviously, indirect accesses and/or large strides may make this hard to achieve.

Recent NVIDIA GPUs have much more sophisticated cache and memory systems that make it more likely that global memory accesses will be coalesced. There are hardware metrics that can help to measure coalescence.

Caches

link

All GPU units share an L2 cache. The L2 cache is physically addressed. It includes facilities for data compression and global atomic (e.g., floating point add).

Each SM has an L1 cache which is used for multiple functions. L1 throughput can be a performance limiter.

Occupancy

link

NVIDIA defines occupancy as the number of simultaneously active CUs divided by the maximum possible number of simultaneous CUs. Obviously increasing occupancy will increase GPU utilization, which is expected to increase performance. Theoretical occupancy is determined by hardware limits, by the number of registers used for a kernel, and by the amount of shared memory used for a kernel.

Theoretical occupancy gives some guidelines for determining work-group sizes and launch parameters for a specific kernel. NVIDIA profiling tools can provide actual and theoretical occupancy for each kernel launch. It can be useful to vary the work-group size at runtime and benchmark the results to pick the best work-group size. However, it is often helpful to have an estimate of theoretical occupancy without having to run the code.

NVIDIA provides an online spreadsheet. While it is officially deprecated, it provides the same functionality as the Nsight Compute Occupancy section and can be very helpful when one is trying to understand how theoretical occupancy is determined. They suggest to use Nsight Compute instead.

Performance Tools

link

In general all the performance tools provided by Nvidia for CUDA also seamlessly handle SYCL applications using the DPC++ CUDA plugin.

This section contains a non-exhaustive introduction to some of these performance tools.

Nsight Systems (nsys)

link

NVIDIA Nsight Systems includes a command line tool nsys and a GUI nsys-ui. Additionally, the NVTX tracing library can be used in an application to restrict nsys analysis to regions of interest and to add additional application-specific events to the nsys analysis.

Basic Usage

link

nsys profile <command> <arguments>

This command will run the given command with the given arguments and collect a profile with the default settings. The output is written to a report file in an NVIDIA-specific format.

Note that nsys monitors the entire system. The command can be anything; nsys will simply start the command, start monitoring, and continue monitoring until the command exits.

nsys stats <report file>

This command will create an sqlite database from the report file and run several reports on the database. Reports include API usage, memory copies involving GPUs, and GPU kernel timings. The results are printed to the console by default. There are also options to write reports to files in csv format.

The nsys-ui command can be used to graphically examine a report file. This requires an attached display, either locally or through some sort of remote X viewer like VNC. The report file can be moved to a local system such as a laptop and the GUI run there. The report files can be quite large, and may require more memory than is available on many laptops.

With strategic use of NVTX annotation in the application and additional command line arguments, the report size can be dramatically reduced, facilitating the analysis workflow.

Reports can also be exported in JSON format. This contains the same data as the sqlite database but may be easier to parse for those who aren’t experts in SQL.

NVTX Annotations

link

NVTX is an instrumentation API provided by NVIDIA for use with their performance tools. It is a header-only API made available by including nvToolsExt.h; there are no libraries that need to be linked in.

NVTX can be very helpful for very large application to restrict data collection of the other Nvidia tools to a specific part of the application.

Note that NVTX can only be used to instrument host code, not device code.

Refer to the Nvidia documentation for more information on how to use NVTX and what it can do.

Nsight Compute (ncu)

link

Previous sections discussed Nsight Systems (nsys). Nsight Compute (ncu) is a companion tool that focuses on GPU hardware performance. ncu allows access to the many available hardware counters in NVIDIA GPUs, as well as presenting a number of predefined analysis types (called sections) that aid in understanding how well a given kernel is using the GPU.

ncu has a GUI but as in the nsys discussion we will focus on using the command line together with some processing scripts. The ncu CLI is documented at https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html.

ncu Overhead

link

ncu can greatly inflate the runtime of an application. There are several reasons for this:

Collecting specified metrics may require that a kernel be executed multiple times.
Some metrics require binary instrumentation of the kernel which can have high overhead.
Kernel execution is usually serialized, reducing concurrency.

See the Kernel Profiling Guide for more details.

Because of this slowdown, it is usually necessary to limit the collection process. NVIDIA provides several methods to do this:

Instrument the application with NVTX as described in the previous section on nsys, and instruct ncu to include or exclude specific NVTX ranges using the command-line options --nvtx and --nvtx-include or --nvtx-exclude. Note that the syntax to specify a range in a domain is backwards from nsys: domain@range rather than range@domain. Also, the ncu syntax is quite rich and complicated. See the NVTX Filtering section.
Collect data for only one kernel using -k kernelname. Regular expressions can be used in the kernel name. Note: C++ users may want to use the mangled name by specifying --kernel-name-base=mangled.
Collect a specified number of kernel launches using --launch-count and --launch-skip.

The above methods can be combined.

Sections

link

ncu has a number of predefined sets of metrics called sections. Each section is designed to help answer a specific performance question, e.g., is this application memory bound?

The sections can be listed with ncu --list-sections. Here is the output for ncu version 2022.1.1.0:

Identifier	Display Name
ComputeWorkloadAnalysis	Compute Workload Analysis
InstructionStats	Instruction Statistics
LaunchStats	Launch Statistics
MemoryWorkloadAnalysis	Memory Workload Analysis
MemoryWorkloadAnalysis_Chart	Memory Workload Analysis Chart
MemoryWorkloadAnalysis_Deprecated	(Deprecated) Memory Workload Analysis
MemoryWorkloadAnalysis_Tables	Memory Workload Analysis Tables
Nvlink	NVLink
Nvlink_Tables	NVLink Tables
Nvlink_Topology	NVLink Topology
Occupancy	Occupancy
SchedulerStats	Scheduler Statistics
SourceCounters	Source Counters
SpeedOfLight	GPU Speed Of Light Throughput
SpeedOfLight_HierarchicalDoubleRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Double Precision)
SpeedOfLight_HierarchicalHalfRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Half Precision)
SpeedOfLight_HierarchicalSingleRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Single Precision)
SpeedOfLight_HierarchicalTensorRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Tensor Core)
SpeedOfLight_RooflineChart	GPU Speed Of Light Roofline Chart
WarpStateStats	Warp State Statistics

Metrics

link

Instead of sections, ncu can be directed to collect specific metrics with --metrics. The available metrics can be queried with --query-metrics.

Output

link

The examples in this section use the --csv option to create output that is easy to parse. The --log-file option will send this output, together with some other output, to a file rather than to stdout.

The profiling report is used by the ncu gui. It can be saved using the --export option. That file could then be moved to a different machine and viewed locally with ncu-gui:.

Mangled names in the output can be selected with --print-kernel-base=mangled.

Extract kernel assembly

link

In some cases to understand the performance behavior of a given kernel, it may be helpful to inspect the assembly, or in the Nvidia case the PTX generated by the compiler for the kernel.

To extract the PTX from a SYCL application built for Nvidia GPU, simply run the application with the environment variable SYCL_DUMP_IMAGES set to 1. This will create files in the current working directory with names such as sycl_nvptx641.bin. These files are CUDA fat binaries, they contain PTX, which is Nvidia’s target independent virtual assembly language, and SASS, which is machine code for one or multiple specific targets.

From that fat binary, the CUDA tool cuobjdump can then be used to extract both PTX and SASS, as follows:

# Extract PTX
cuobjdump --dump-ptx sycl_nvptx641.bin

# Extract SASS
cuobjdump -sass sycl_nvptx641.bin

Note that the GUI version of NVIDIA’s Nsight Compute tool should also be able to show disassembly.

Rate this Guide

oneAPI Menu

Main Menu

Products

menu_bookGuides

Performance on Nvidia GPUs

Architecture

SMs

Memory

Bank conflicts

Local memory size

Memory coalescence

Caches

Occupancy

Performance Tools

Nsight Systems (nsys)

Basic Usage

NVTX Annotations

Nsight Compute (ncu)

ncu Overhead

Sections

Metrics

Output

Extract kernel assembly

Performance Analysis

Common Optimizations

assignmentJump to Section

Select a Product

oneAPI

Dark Mode

Light Mode

Also,

part of our network

Codeplay.com

SYCL.tech

Codeplay Developer

Codeplay Open Source