Priorities
The main causes that can limit the performance of GPU codes are listed below in order of importance:
Non-coalesced global memory accesses. A memory access is coalesced when the accesses of neighboring work-items are served by as few memory transactions as possible, so that caches are fully exploited and higher bandwidth is reached. How to achieve coalescing depends on the architecture, but in general it is obtained when the work-items within the same sub-group access sequential memory locations.
Bank conflicts in local memory. The local memory is divided into banks, which can be accessed simultaneously by different work-items. A bank conflict arises when different work-items access different addresses in the same memory bank, which serializes the transactions.
Divergent execution. Divergence occurs when work-items belonging to the same sub-group execute different instructions due to conditional statements, e.g., if statements or loop-iteration counts that differ between work-items. When a sub-group executes in lock-step, the divergent paths are executed one after the other. Recent architectures relax the lock-step constraint, reducing this performance penalty.
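The first two items can be made concrete with a small counting model. The sketch below is not tied to any particular GPU: the sub-group size, element size, cache-line size and bank count are assumptions chosen for illustration, and the function names are hypothetical. It counts how many cache lines a sub-group touches in global memory, and how many distinct local-memory addresses fall into the same bank, when work-item k accesses element k * stride.

```python
# Illustrative model only: the sizes below are assumptions, not values
# taken from the text or from a specific vendor's documentation.
SUB_GROUP_SIZE = 32      # work-items per sub-group (assumed)
ELEMENT_BYTES = 4        # one 32-bit value per work-item (assumed)
CACHE_LINE_BYTES = 128   # size of one global-memory transaction (assumed)
NUM_BANKS = 32           # local-memory banks, each one element wide (assumed)


def cache_lines_touched(stride):
    """Number of distinct cache lines read by one sub-group when
    work-item k loads element k * stride from global memory."""
    lines = {(k * stride * ELEMENT_BYTES) // CACHE_LINE_BYTES
             for k in range(SUB_GROUP_SIZE)}
    return len(lines)


def bank_conflict_degree(stride):
    """Largest number of distinct addresses that map to the same bank
    when work-item k accesses local-memory element k * stride.
    A result of 1 means the access is conflict-free."""
    addresses_per_bank = {}
    for k in range(SUB_GROUP_SIZE):
        index = k * stride
        addresses_per_bank.setdefault(index % NUM_BANKS, set()).add(index)
    return max(len(a) for a in addresses_per_bank.values())


for stride in (1, 2, 32, 33):
    print(stride, cache_lines_touched(stride), bank_conflict_degree(stride))
```

Under these assumptions, a stride of 1 touches a single cache line and is conflict-free, whereas a stride of 32 touches 32 cache lines and maps every access to the same bank. A stride of 33 is again conflict-free in local memory, which is why padding an array by one element is a common remedy for bank conflicts.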
Different kinds of computations have different optimization priorities. For instance, consider a memory-bound task, in which arithmetic operations are few compared to memory transactions. In this case it is particularly important to have coalesced memory accesses in order to fully exploit the GPU. On the other hand, in compute-bound tasks the number of arithmetic operations is high compared to the number of memory transactions. In such cases, it can be helpful to avoid divergent execution. The ratio between the number of arithmetic operations and the bytes of data read or written is defined as the arithmetic intensity:
I = (# floating-point operations) / (# bytes of read/write data) [FLOP/Byte]
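As a worked instance of this definition, the sketch below estimates the arithmetic intensity of an AXPY update (y[i] = a * x[i] + y[i]) on double-precision data. The kernel choice and the helper name `arithmetic_intensity` are illustrative assumptions. The byte count also assumes that every element is moved to or from global memory exactly once and that the scalar a stays in a register.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Arithmetic intensity in FLOP/Byte."""
    return flops / bytes_moved


# AXPY on double-precision data: y[i] = a * x[i] + y[i]
# Per element: 1 multiply + 1 add = 2 floating-point operations;
# read x[i], read y[i], write y[i] = 3 * 8 bytes.
n = 1_000_000
axpy = arithmetic_intensity(flops=2 * n, bytes_moved=3 * 8 * n)
print(f"AXPY: {axpy:.4f} FLOP/Byte")  # AXPY: 0.0833 FLOP/Byte
```

An intensity this low, well below 1 FLOP/Byte, is typical of streaming kernels whose performance is limited by memory traffic.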
The roofline model allows us to establish whether a kernel is memory bound or compute bound by relating its arithmetic intensity to the hardware characteristics. The roofline model is depicted as a two-dimensional plot, where the x-axis shows the arithmetic intensity and the y-axis the floating-point throughput (in FLOPS, i.e. floating-point operations per second).
The first segment of the roofline is given by y = x * B, where B is the bandwidth of the global memory system. The second, horizontal segment (i.e., y = Pmax) depends on the maximum floating-point throughput (Pmax) for a given operation, e.g., FMA. The point at which the two segments meet is called the ridge point.
The performance of a kernel is indicated by a point within the roofline plot: the x-coordinate gives the kernel's arithmetic intensity, and the y-coordinate its measured FLOPS. If this point lies to the left of the ridge point, the kernel is memory bound; if it lies to the right, it is compute bound.
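The roofline described above can be written as y = min(Pmax, x * B), with the ridge point at x = Pmax / B. The following sketch evaluates it for a hypothetical device. The values of `P_MAX` and `B` are round numbers chosen for illustration, not the specification of any real GPU, and the function names are assumptions.

```python
def ridge_point(p_max, bandwidth):
    """Arithmetic intensity (FLOP/Byte) at which the two segments meet."""
    return p_max / bandwidth


def attainable_flops(intensity, p_max, bandwidth):
    """Roofline: y = min(P_max, I * B)."""
    return min(p_max, intensity * bandwidth)


def classify(intensity, p_max, bandwidth):
    """'memory bound' to the left of the ridge point, otherwise
    'compute bound' (a point exactly on the ridge counts as compute bound)."""
    if intensity < ridge_point(p_max, bandwidth):
        return "memory bound"
    return "compute bound"


# Hypothetical device, chosen only to give round numbers.
P_MAX = 10.0e12   # peak throughput in FLOP/s (assumed)
B = 1.0e12        # global-memory bandwidth in Byte/s (assumed)

for intensity in (0.25, 10.0, 40.0):
    print(intensity,
          classify(intensity, P_MAX, B),
          attainable_flops(intensity, P_MAX, B))
```

For this hypothetical device the ridge point sits at 10 FLOP/Byte, so a kernel with an intensity of 0.25 FLOP/Byte is memory bound and can reach at most a fortieth of the peak throughput, however well its arithmetic is optimized.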
Occupancy
One way to evaluate the performance of a kernel is to consider its occupancy, which measures the utilization of a compute unit (CU) and is defined as
occupancy = active-sub-groups / max-active-sub-groups
An active sub-group is a sub-group actually executed by a CU. The maximum number of active sub-groups depends on the compute unit architecture, e.g., 64 in the case of the NVIDIA GA100 CU architecture.
In order to increase the occupancy we need to maximize the number of active sub-groups, keeping in mind the constraints imposed by the compute unit architecture, which read as follows:
Maximum number of work-items per work-group
Maximum number of work-groups running simultaneously on a CU: If the work-group size is too small, the CU cannot run the maximum number of active sub-groups.
Limited number of registers: Complex kernel code increases register usage. Keeping kernels simple allows developers to reduce register utilization. For this purpose, splitting the code across multiple kernels can be helpful.
Limited amount of local memory: If a work-group uses too much local memory, fewer work-groups can run simultaneously.
If the work-groups use too many registers or too much local memory, the occupancy is limited. The user can improve the occupancy by modifying the work-group size, which should be a multiple of the sub-group size and chosen so that the number of sub-groups per work-group divides the maximum number of active sub-groups.
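The interplay of these constraints can be explored with a rough estimator. The sketch below is a simplified model, not a vendor occupancy calculator: it ignores register and local-memory allocation granularity, it does not model the per-work-group size limit, and the function name is hypothetical. The default limits are the per-CU values listed for NVIDIA S.M. 8.0 in the table at the end of this section, together with an assumed sub-group size of 32.

```python
def estimate_occupancy(wg_size, regs_per_item, local_mem_per_wg, *,
                       sub_group_size=32, max_sub_groups=64,
                       max_work_groups=32, regs_per_cu=65536,
                       local_mem_per_cu=65536):
    """Rough occupancy estimate for one CU (ignores allocation granularity)."""
    sub_groups_per_wg = -(-wg_size // sub_group_size)  # ceiling division

    # Number of work-groups per CU allowed by each constraint.
    by_sub_groups = max_sub_groups // sub_groups_per_wg
    by_registers = regs_per_cu // (
        regs_per_item * sub_groups_per_wg * sub_group_size)
    by_local_mem = (local_mem_per_cu // local_mem_per_wg
                    if local_mem_per_wg > 0 else max_work_groups)

    resident_wgs = min(max_work_groups, by_sub_groups,
                       by_registers, by_local_mem)
    return resident_wgs * sub_groups_per_wg / max_sub_groups


print(estimate_occupancy(256, 32, 0))      # 1.0
print(estimate_occupancy(256, 64, 0))      # 0.5, limited by registers
print(estimate_occupancy(32, 32, 0))       # 0.5, limited by work-group count
print(estimate_occupancy(256, 32, 16384))  # 0.5, limited by local memory
```

The third call shows the effect of a work-group that is too small: with one sub-group per work-group, the limit of 32 resident work-groups is reached before the 64 sub-group slots are filled.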
For example, on an NVIDIA GA100 GPU each work-item can use at most 32 registers if full occupancy is to be achieved:
r_max = (# regs per CU) / ((max. # active sub-groups) * (sub-group size)) = 32
The same reasoning applies to local memory. Let wg_max be the maximum number of work-groups that can run concurrently on the same CU when each work-item uses at most 32 registers:
wg_max = (max. # active sub-groups) * (sub-group size) / (actual work-group size)
Full occupancy is obtained if each work-group allocates at most 48 KB / wg_max of local memory.
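The two budgets can be checked numerically. The sketch below uses the GA100 figures quoted in this section (65536 registers and 64 active sub-groups per CU, sub-groups of 32 work-items, 48 KB of local memory). The work-group size of 256 and the helper name `local_mem_budget` are assumptions made for illustration.

```python
REGS_PER_CU = 65536          # registers per CU
MAX_ACTIVE_SUB_GROUPS = 64   # per CU
SUB_GROUP_SIZE = 32          # work-items per sub-group
LOCAL_MEM_BYTES = 48 * 1024  # the 48 KB local-memory budget quoted above

# Registers available to each work-item at full occupancy.
r_max = REGS_PER_CU // (MAX_ACTIVE_SUB_GROUPS * SUB_GROUP_SIZE)


def local_mem_budget(work_group_size):
    """Return (wg_max, bytes of local memory per work-group)
    that still allow full occupancy."""
    wg_max = (MAX_ACTIVE_SUB_GROUPS * SUB_GROUP_SIZE) // work_group_size
    return wg_max, LOCAL_MEM_BYTES // wg_max


print(r_max)                  # 32
print(local_mem_budget(256))  # (8, 6144)
```

With work-groups of 256 work-items, eight work-groups share a CU, so each of them may allocate up to 6 KB of local memory without reducing the occupancy.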
The effective occupancy is important, but it is not the ultimate metric of performance. Sometimes a low occupancy is enough to hide latency, provided there is enough instruction-level parallelism, i.e. independent instructions belonging to the same sub-group that can be executed concurrently.
Moreover, to utilize all CUs of a GPU, a kernel must launch at least the following number of work-items (the value shown corresponds to 128 CUs, each running 64 active sub-groups of 32 work-items):
wi_min = (max. # active sub-groups) * (sub-group size) * (#CUs) = 262144
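The same product can be evaluated for other devices. In the sketch below the function name is hypothetical, and the CU count of 128 is not stated explicitly in the text but inferred from the result quoted above (262144 / (64 * 32) = 128).

```python
def min_work_items(max_active_sub_groups, sub_group_size, num_cus):
    """Smallest launch size that can populate every CU with its
    maximum number of active sub-groups."""
    return max_active_sub_groups * sub_group_size * num_cus


print(min_work_items(64, 32, 128))  # 262144
```

A launch smaller than this value necessarily leaves some sub-group slots empty, regardless of how the work-items are divided into work-groups.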
All these parameters can be found in the specific vendor documentation for their architecture, but the table below lists the numbers for a few common GPU architectures:
Architecture | Max. # active sub-groups | Max. # work-items | Max. # work-groups | Registers (#) | Local mem. (bytes) |
---|---|---|---|---|---|
NVIDIA S.M. 7.0 | 64 | 2048 | 32 | 65536 | 65536 |
NVIDIA S.M. 7.5 | 32 | 1024 | 16 | 65536 | 65536 |
NVIDIA S.M. 8.0 | 64 | 2048 | 32 | 65536 | 65536 |
AMD GFX9xx | 40 [1] | 1024 | 16 | 29184 (?) | 65536 |