Graphics Processing Units can deliver more floating-point operations per second and higher memory bandwidth than CPUs thanks to their massively parallel architectures. These capabilities can be exploited only by taking the GPU architecture into account during code development.
In this regard, a programming model that can express the massive parallelism of GPUs is fundamental. SYCL adopts a programming model similar to OpenCL and CUDA, where kernels (i.e., functions executed by the GPU) are expressed by means of operations performed by work-items.
From paragraph 3.7.2 of the SYCL Specification (rev. 8):
When a kernel is submitted for execution, an index space is defined. An instance of the kernel body executes for each point in this index space. This kernel instance is called a work-item and is identified by its point in the index space, which provides a global id for the work-item. Each work-item executes the same code but the specific execution pathway through the code and the data operated upon can vary by using the work-item global id to specialize the computation.
SYCL allows for the use of two different kernel execution models:
From paragraph 3.7.2.1 of the SYCL Specification (rev. 8):
A simple execution model in which a kernel is invoked over an N-dimensional index space defined by range<N>, where N is one, two or three. Each work-item in such a kernel executes independently. Each work-item is identified by a value of type item<N>. The type item<N> encapsulates a work-item identifier of type id<N> and a range<N> representing the number of work-items executing the kernel.
From paragraph 3.7.2.2 of the SYCL Specification (rev. 8):
An ND-range execution model in which work-items can be organized into work-groups, providing a more coarse-grained decomposition of the index space. Each work-group is assigned a unique work-group id with the same dimensionality as the index space used for the work-items. Work-items are each assigned a local id, unique within the work-group, so that a single work-item can be uniquely identified by its global id or by a combination of its local id and work-group id. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit. When work-groups are used in SYCL, the index space is called an ND-range. An ND-range is an N-dimensional index space, where N is one, two or three. In SYCL, the ND-range is represented via the nd_range<N> class. An nd_range<N> is made up of a global range and a local range, each represented via values of type range<N>. Additionally, there can be a global offset, represented via a value of type id<N>; this is deprecated in SYCL 2020. The types range<N> and id<N> are each N-element arrays of integers. The iteration space defined via an nd_range<N> is an N-dimensional index space starting at the ND-range's global offset whose size is its global range, split into work-groups of the size of its local range. Each work-item in the ND-range is identified by a value of type nd_item<N>. The type nd_item<N> encapsulates a global id, local id and work-group id, all of type id<N> (the iteration space offset is also of type id<N>, but this is deprecated in SYCL 2020), as well as global and local ranges and synchronization operations necessary to make work-groups useful. Work-groups are assigned ids using a similar approach to that used for work-item global ids. Work-items are assigned to a work-group and given a local id with components in the range from zero to the size of the work-group in that dimension minus one. Hence, the combination of a work-group id and the local id within a work-group uniquely defines a work-item.
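For comparison, the sketch below (illustrative sizes and names, not from the specification) launches a 1D ND-range with a global range of 1024 and work-groups of 128 work-items, and uses nd_item<1> to query the different ids.

```cpp
#include <sycl/sycl.hpp>

int main() {
  sycl::queue q{sycl::default_selector_v};

  const sycl::range<1> global{1024}; // total number of work-items
  const sycl::range<1> local{128};   // work-items per work-group

  q.parallel_for(sycl::nd_range<1>{global, local}, [=](sycl::nd_item<1> it) {
    const size_t gid   = it.get_global_id(0); // unique across the ND-range
    const size_t lid   = it.get_local_id(0);  // unique within the work-group
    const size_t group = it.get_group(0);     // work-group id
    // With no global offset: gid == group * it.get_local_range(0) + lid
    (void)gid; (void)lid; (void)group;
  }).wait();

  return 0;
}
```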
The work-items can access three distinct memory regions, following the OpenCL memory model (a short usage sketch follows this list):
global memory: shared among all work-items of all work-groups;
local memory: shared among all work-items in the same work-group;
private memory: private to each work-item.
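As a hedged sketch of how these regions surface in SYCL code (the kernel, sizes, and names are illustrative assumptions, not prescribed by the specification): a buffer accessor targets global memory, a local_accessor targets local memory, and ordinary variables declared inside the kernel live in private memory (typically registers).

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 256, wg = 64;
  std::vector<float> data(n, 1.0f);

  sycl::queue q{sycl::default_selector_v};
  {
    sycl::buffer<float> buf{data};

    q.submit([&](sycl::handler &cgh) {
      // Global memory: visible to all work-items of all work-groups.
      sycl::accessor global_acc{buf, cgh, sycl::read_write};
      // Local memory: one tile per work-group, shared by its work-items.
      sycl::local_accessor<float, 1> tile{sycl::range<1>{wg}, cgh};

      cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>{n}, sycl::range<1>{wg}},
                       [=](sycl::nd_item<1> it) {
        // Private memory: each work-item has its own copy of these variables.
        const size_t gid = it.get_global_id(0);
        const size_t lid = it.get_local_id(0);

        tile[lid] = global_acc[gid];         // global -> local
        sycl::group_barrier(it.get_group()); // make the tile visible
        global_acc[gid] = 2.0f * tile[lid];  // local -> global
      });
    });
  }
  return 0;
}
```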
Architecture
The SYCL Specification follows OpenCL 1.2 in modeling a device as one or more compute units (CUs) working independently. NVIDIA calls the CUs streaming multiprocessors, whereas AMD refers to them simply as compute units. Each CU is composed of one or more processing elements (PEs) and local memory. A work-group executes on a single CU, whereas a work-item may be executed by one or more PEs. Generally speaking, the CUs execute small sets of work-items (called sub-groups) in SIMD fashion. The sub-groups are called warps by NVIDIA and wavefronts by AMD. The sub-group size is 32 on NVIDIA GPUs and usually 64 on AMD GPUs (32 on some architectures).
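These quantities can be queried at runtime. The following sketch (the GPU device selection is an illustrative assumption) prints the number of CUs and the sub-group sizes supported by the chosen device, using standard SYCL 2020 device information descriptors.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{sycl::gpu_selector_v};
  sycl::device dev = q.get_device();

  std::cout << "Device: "
            << dev.get_info<sycl::info::device::name>() << "\n";
  std::cout << "Compute units: "
            << dev.get_info<sycl::info::device::max_compute_units>() << "\n";

  std::cout << "Supported sub-group sizes:";
  for (size_t s : dev.get_info<sycl::info::device::sub_group_sizes>())
    std::cout << " " << s;
  std::cout << "\n";
  return 0;
}
```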
Compute
The work-groups forming a kernel are scheduled across the CUs. Each CU then executes one or more sub-groups on its processing elements. A compute unit includes different kinds of processing elements, such as integer and floating-point units for arithmetic operations, load and store units for memory operations, and special units for transcendental operations (such as sine, cosine, reciprocal, and square root) or for the matrix operations useful in AI. The time (measured in clock cycles) required by a processing element to complete an operation is called latency. The latency depends on the kind of operation: for example, global memory transactions have latencies that are orders of magnitude larger than register accesses, and different arithmetic operations likewise have different latencies.
Throughput is the ratio between the number of operations executed and the time needed to complete them. This ratio can be increased in two ways: by reducing the instruction latency or by increasing the number of instructions running concurrently. Traditionally, CPUs improve throughput by minimizing instruction latencies, for instance by increasing the clock frequency. GPUs, on the other hand, increase throughput by hiding latency: CUs can switch “context” (registers, instruction counters, etc.) between sub-groups with very little overhead, so if an operation takes many clock cycles, the CU can hide them by switching context and running operations from another sub-group. Depending on the architecture, there is a maximum number of sub-groups that can be resident concurrently. The ratio between the number of actually resident sub-groups and this maximum is called occupancy and will be discussed in a following section.
The concurrent execution of work-items on a GPU is realized on multiple levels:
different work-items within the same sub-group are executed synchronously in a SIMD fashion, i.e., the same operation is executed on different data points;
a CU concurrently executes multiple sub-groups, from the same or from different work-groups, in order to hide latencies, as discussed in the previous paragraph;
the CUs forming a GPU concurrently run different sub-groups belonging to different work-groups.
These parallel execution capabilities can be fully exploited if the launched kernels have a number of work-items large enough to keep the entire GPU busy.
Memory
The figure below depicts a connection scheme which is common in systems equipped with discrete GPUs: [1] connects host and device, [2] connects the CUs to global memory. For example, the indicative bandwidths for an NVIDIA GA100 GPU are: [1] 31 GB/s for PCIe x16 4.0, and [2] 1555 GB/s for HBM2.
The connection between the CPU and the GPU [1] can be a major bottleneck, so it is important to carefully consider data transfers between the host and the device and to keep data resident on the GPU as much as possible. However, the latency introduced by PCIe memory transactions can be hidden by overlapping them with kernel executions.
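A common way to express such overlap in SYCL is to split the data into chunks and issue the copy of one chunk while the kernel for the previous chunk is still running; whether copies and kernels actually overlap depends on the backend and hardware. The sketch below uses USM device allocations and event dependencies on an out-of-order queue; the chunking scheme and names are illustrative assumptions.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t n = 1 << 20, chunks = 4, chunk = n / chunks;
  std::vector<float> host(n, 1.0f);

  sycl::queue q{sycl::gpu_selector_v}; // out-of-order queue by default
  float *dev = sycl::malloc_device<float>(n, q);

  std::vector<sycl::event> done(chunks);
  for (size_t c = 0; c < chunks; ++c) {
    const size_t off = c * chunk;
    // Copy chunk c; it is independent of the other chunks, so it may overlap
    // with the kernel launched for chunk c-1.
    sycl::event copy = q.memcpy(dev + off, host.data() + off,
                                chunk * sizeof(float));
    // Process chunk c once its copy has completed.
    done[c] = q.parallel_for(sycl::range<1>{chunk}, copy,
                             [=](sycl::item<1> it) {
      dev[off + it.get_id(0)] *= 2.0f;
    });
  }
  sycl::event::wait(done);
  // A real code would copy the results back with q.memcpy(host.data(), dev, ...).

  sycl::free(dev, q);
  return 0;
}
```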
One of the key characteristics of GPUs is the high bandwidth between CUs and global memory [2]. This is due to the number and the width of the memory controllers interconnecting them, e.g., NVIDIA GA100 GPUs have a total of twelve 512-bit HBM memory controllers. This allows for the transfer of a lot of data per clock cycle, e.g., 6144 bits per clock for NVIDIA GA100 GPUs. However, in order to fully exploit this high memory bandwidth, memory accesses need to be coalesced, i.e., work-items must access memory in a cache-friendly way.
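The difference can be illustrated with two access patterns (a sketch with assumed pointer names, not a benchmark): consecutive work-items reading consecutive elements coalesce into a few wide memory transactions, while a large stride scatters the accesses across many transactions.

```cpp
#include <sycl/sycl.hpp>

// Coalesced: work-item i reads element i, so a sub-group of consecutive
// work-items touches consecutive addresses and the hardware can serve them
// with few wide transactions.
void copy_coalesced(sycl::queue &q, const float *in, float *out, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::item<1> it) {
    const size_t i = it.get_id(0);
    out[i] = in[i];
  });
}

// Strided: work-item i reads element i * stride; for a large stride each
// work-item hits a different cache line and the effective bandwidth drops.
// `in` is assumed to hold at least n * stride elements.
void copy_strided(sycl::queue &q, const float *in, float *out,
                  size_t n, size_t stride) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::item<1> it) {
    const size_t i = it.get_id(0);
    out[i] = in[i * stride];
  });
}
```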
Several memory layers sit between the work-items and global memory. Below, they are listed from lower to higher access latency (a device-query sketch follows the list):
the registers store the data private to a work-item and serve as its working memory;
the constant memory is a read-only memory available to the CU;
the local memory resides within each CU and is shared among the work-items in the same work-group; local memory is faster than global memory and is therefore used to cache global memory data that need to be reused;
the L1 and L2 caches, which form the memory system connecting the global memory (GDDR or HBM) to the CUs.
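The sizes of some of these layers can be queried through standard SYCL 2020 device information descriptors, as in the sketch below (the GPU device selection is an illustrative assumption).

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::device dev{sycl::gpu_selector_v};

  std::cout << "Local memory size:    "
            << dev.get_info<sycl::info::device::local_mem_size>() << " bytes\n";
  std::cout << "Global memory cache:  "
            << dev.get_info<sycl::info::device::global_mem_cache_size>() << " bytes\n";
  std::cout << "Cache line size:      "
            << dev.get_info<sycl::info::device::global_mem_cache_line_size>() << " bytes\n";
  std::cout << "Global memory size:   "
            << dev.get_info<sycl::info::device::global_mem_size>() << " bytes\n";
  return 0;
}
```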