info

Please note that you are viewing a guide targeting an older version of oneAPI for NVIDIA® GPUs. This guide was designed for version 2023.2.0.

NVIDIA has had many different GPU architectures over the years. Here we consider only GPUs with compute capability >= 7.0; that is, with the Volta, Turing, or Ampere architectures (and the recently announced Hopper architecture).

Note

This section only provides a quick parallel between CUDA and SYCL terminology, for a more thorough description, refer to ComputeCpp’s SYCL for CUDA guide.

The basic compute unit of an NVIDIA GPU is called a streaming multiprocessor or SM. An SM executes sub-groups composed of 32 work-items. NVIDIA calls the sub-group a warp. The entity that executes a work-item is called a thread.

A work-group is called a thread block or a cooperative thread array (CTA). The usual rules apply: a CTA is guaranteed to run concurrently on the same SM, and can use local resources like SYCL local memory (which CUDA calls shared memory), and can synchronize between work-items.

This table summarizes the mapping between NVIDIA/CUDA terminology and SYCL terminology:

NVIDIA/CUDA	SYCL
streaming multiprocessor (SM)	compute unit (CU)
warp	sub-group
thread	work-item
thread block	work-group
cooperative thread array (CTA)	work-group
shared memory	local memory
global memory	global memory
local memory	private memory

The work-group size should always be a multiple of the sub-group size of 32. The optimal work-group size is typically chosen to maximize occupancy, and depends on the resources used by the particular kernel. It cannot exceed 1024.

On NVIDIA GPUS, the device query sycl::info::device::sub_group_sizes returns a vector with a single element with the value 32.

Once the work-group size is chosen, we can pick the global size. It is often useful to make the global size a multiple of the number of compute units to help in load balancing. The device query sycl::info::device::max_compute_units returns the number of compute units.

Then, we can write the kernel code so that each work-item (kernel instance) operates on multiple items of the problem. Thus the launch parameters can be tuned based on hardware layout without regard to the specific problem size.

For example the following snippet of SYCL code determines the launch parameters based on the hardware layout, and then uses a loop inside the kernel to adjust to the problem size. In CUDA this type of inner loop is called a grid-stride.

    int N = some_big_number;
    int wgsize = 256;
    int ncus = dev.get_info<info::device::max_compute_units>();
    int nglobal = 32 * ncus;
    cgh.parallel_for(nd_range<1>(nglobal * wgsize, wgsize),
      [=](nd_item<1> item)
      {
        int global_size = item.get_global_range()[0];
        for (int i = item.get_global_id(0); i < N; i += global_size)
          y[i] = a * x[i] + y[i];
      });

Identifier	Display Name
ComputeWorkloadAnalysis	Compute Workload Analysis
InstructionStats	Instruction Statistics
LaunchStats	Launch Statistics
MemoryWorkloadAnalysis	Memory Workload Analysis
MemoryWorkloadAnalysis_Chart	Memory Workload Analysis Chart
MemoryWorkloadAnalysis_Deprecated	(Deprecated) Memory Workload Analysis
MemoryWorkloadAnalysis_Tables	Memory Workload Analysis Tables
Nvlink	NVLink
Nvlink_Tables	NVLink Tables
Nvlink_Topology	NVLink Topology
Occupancy	Occupancy
SchedulerStats	Scheduler Statistics
SourceCounters	Source Counters
SpeedOfLight	GPU Speed Of Light Throughput
SpeedOfLight_HierarchicalDoubleRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Double Precision)
SpeedOfLight_HierarchicalHalfRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Half Precision)
SpeedOfLight_HierarchicalSingleRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Single Precision)
SpeedOfLight_HierarchicalTensorRooflineChart	GPU Speed Of Light Hierarchical Roofline Chart (Tensor Core)
SpeedOfLight_RooflineChart	GPU Speed Of Light Roofline Chart
WarpStateStats	Warp State Statistics

menu_bookGuides

Architecture

SMs

Memory

Bank conflicts

Local memory size

Memory coalescence

Caches

Occupancy

Performance Tools

Nsight Systems (nsys)

Basic Usage

NVTX Annotations

Nsight Compute (ncu)

ncu Overhead

Sections

Metrics

Output

Extract kernel assembly

Performance Analysis

Common optimizations

assignmentJump to Section

Select a Product

oneAPI

Dark Mode

Light Mode

Codeplay.com

SYCL.tech

Codeplay Developer

Codeplay Open Source