Performance analysis and optimization is an iterative process. The analyst measures the performance of an application, uses tools to identify performance bottlenecks, addresses those bottlenecks, and repeats the process. Every iteration is likely to uncover previously hidden bottlenecks.
At some point it becomes important to identify the best possible performance of the portion of the application that is the limiting factor. This is sometimes called the speed of light or the roofline. It gives an estimate of the theoretical peak performance of an application and of how close we are to attaining that performance.
In the remaining sections we discuss analysis tools and limiting factors in more detail.
Analysis Methods
Tools used to analyze performance are often called profilers. The term profiling is often overloaded; here we use it in the generic sense of performance analysis, while later descriptions of specific performance tools may use it in a narrower sense.
Methods used for performance analysis fall into two broad categories: tracing and sampling. Tracing records every occurrence of one or more events while an application is running. Sampling periodically inspects the state of a running application and records that state. For events that occur very frequently, tracing can create huge amounts of data. The data volume of sampling can be controlled by selecting the sampling interval; longer intervals reduce the data volume but may miss fine-grained behavior. There are refinements of both methods; for example, either tracing or sampling can be combined with on-the-fly data reduction.
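The contrast can be sketched in a few lines of C++ (a hypothetical example; all names are invented): a tracer appends a record for every event, while a sampler thread wakes at a fixed interval and records whatever state the application happens to be in. The trace grows with the event count; the sample log grows only with elapsed time.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Illustration only: contrast tracing (record every event) with sampling
// (periodically record the current state).
std::atomic<int> current_phase{0};   // state the sampler inspects
std::vector<int> trace_log;          // one record per event (tracing)
std::vector<int> sample_log;         // one record per sampling tick

int main() {
  std::atomic<bool> done{false};

  // Sampler thread: wake every 10 ms and record whatever phase is running.
  std::thread sampler([&] {
    while (!done) {
      sample_log.push_back(current_phase.load());
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
  });

  // Worker: the trace grows with the number of events, however frequent.
  for (int step = 0; step < 10000; ++step) {
    current_phase = step % 3;
    trace_log.push_back(step % 3);
    std::this_thread::sleep_for(std::chrono::microseconds(100)); // "work"
  }
  done = true;
  sampler.join();

  std::printf("trace records: %zu, sample records: %zu\n",
              trace_log.size(), sample_log.size());
}
```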
With any analysis tool, there are two aspects to consider:
Overhead: How much the tool increases normal program execution time. Ideally the increase is small; however, if the increase is well understood, it may be possible to compensate for it when interpreting the data. For example, a tool may give accurate results for GPU execution while increasing the time of the CPU portion of the code.
Data volume: How large the output files are. Large volume is often associated with increased overhead. Other problems with large volume include the difficulty of managing very large output datasets and the reduced responsiveness of post-processing tools, especially when the output must be moved to a remote machine for viewing.
System Level Analysis
In system-level analysis we look at the interaction between processes on the same node or on different nodes, and at the interaction between CPU and GPU.
Analyzing the interaction between CPU and GPU for complex workloads may be particularly challenging. Vendors often provide a tracing tool to assist in such analysis. It records timestamps and durations for GPU-related API calls, including memory allocation, memory transfers, kernel launches, and synchronization. Such tools often include timeline viewers to assist in visually identifying bottlenecks such as serialization or undue idle time.
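As a minimal illustration of the underlying mechanism (a sketch assuming a SYCL queue with profiling support; vendor tools automate this and cover every GPU-related API call): each submission returns an event whose start and end timestamps can be read back and placed on a timeline.

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  // Profiling must be enabled on the queue for events to carry timestamps.
  sycl::queue q{sycl::default_selector_v,
                sycl::property_list{sycl::property::queue::enable_profiling{}}};

  constexpr size_t n = 1 << 20;
  float *data = sycl::malloc_device<float>(n, q);

  // Time a single kernel launch; a tracing tool records every submission this way.
  auto e = q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    size_t k = i[0];
    data[k] = static_cast<float>(k) * 2.0f;
  });
  e.wait();

  auto start = e.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto end   = e.get_profiling_info<sycl::info::event_profiling::command_end>();
  std::printf("kernel time: %.3f us\n", static_cast<double>(end - start) * 1e-3);

  sycl::free(data, q);
}
```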
Sometimes it is helpful to use OS kernel tracing (e.g., Linux ftrace) and correlate it with application execution. This normally requires some form of root privilege. Further, the particular kernel activity that correlates with a performance issue may not be known in advance. In this case it can be useful to record all the OS activity into a circular buffer and to dump that buffer under application control when the performance issue is detected (for example, when a timestep takes significantly longer than the average or expected time). The circular-buffer technique is useful in any case where recording the entire trace data stream would be prohibitively expensive.
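The following is a minimal sketch of the circular-buffer idea (the event type, buffer size, and threshold are hypothetical): events are recorded into a fixed-size ring that overwrites its oldest entries, and the buffer is written out only when the application detects a slow timestep.

```cpp
#include <array>
#include <cstdio>

// Hypothetical trace record; a real one would carry timestamps and payloads.
struct Event { int id; double t; };

class RingTrace {
  std::array<Event, 4096> buf_{};   // fixed memory cost regardless of run length
  size_t next_ = 0;
  bool wrapped_ = false;
public:
  void record(const Event &e) {
    buf_[next_] = e;
    next_ = (next_ + 1) % buf_.size();
    if (next_ == 0) wrapped_ = true;
  }
  // Dump only when the application decides something interesting happened.
  void dump(std::FILE *out) const {
    size_t count = wrapped_ ? buf_.size() : next_;
    size_t start = wrapped_ ? next_ : 0;
    for (size_t k = 0; k < count; ++k) {
      const Event &e = buf_[(start + k) % buf_.size()];
      std::fprintf(out, "%d %f\n", e.id, e.t);
    }
  }
};

int main() {
  RingTrace trace;
  const double average_step = 1.0, threshold = 2.0;   // hypothetical timings
  for (int step = 0; step < 100000; ++step) {
    double step_time = (step == 73000) ? 5.0 : 1.0;   // one anomalous step
    trace.record({step, step_time});
    if (step_time > threshold * average_step)          // anomaly detected:
      trace.dump(stdout);                              // dump recent history
  }
}
```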
Scaling for distributed applications (typically using Message Passing Interface) deserves special mention. There are two commonly used definitions of scaling. Strong scaling keeps the problem size constant and measures elapsed time as the number of MPI ranks increases. Weak scaling increases the problem size proportionally to the number of MPI ranks.
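As an illustration of how a strong-scaling sweep is typically summarized (the timings below are hypothetical), speedup is the single-rank elapsed time divided by the N-rank elapsed time, and parallel efficiency is that speedup divided by N:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

int main() {
  // Hypothetical elapsed times (seconds) from a strong-scaling sweep.
  std::vector<std::pair<int, double>> runs = {
      {1, 512.0}, {2, 260.0}, {4, 135.0}, {8, 76.0}, {16, 48.0}};
  double t1 = runs.front().second;

  std::printf("%6s %10s %10s %12s\n", "ranks", "time (s)", "speedup", "efficiency");
  for (auto [ranks, t] : runs) {
    double speedup = t1 / t;                        // relative to the 1-rank run
    std::printf("%6d %10.1f %10.2f %11.0f%%\n",
                ranks, t, speedup, 100.0 * speedup / ranks);
  }
}
```

A drop in efficiency at larger rank counts is the cue to compare MPI profiles across the sweep.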
Strong scaling is the more difficult problem. Often there is just not enough work to keep all the MPI ranks busy. It is useful to run a sweep over different numbers of ranks while using an MPI profiling tool, then compare the MPI profiles.
Other MPI issues will often arise, especially at very large scale. The cost of reduction operations grows as log(N) in the number of ranks; further, small reductions (like MPI_Allreduce to a scalar value) may be impacted by OS noise. Networks may become congested, so even point-to-point operations can be impacted, especially on large shared clusters. In the strong scaling case, message sizes usually get smaller as the number of ranks grows, making MPI latency more important.
The analyst should expect application behavior to be very different at scale versus when run on only a few nodes. As always, use of MPI profiling tools can help understand this behavior; low overhead tools are particularly important at large scale.
Kernel Level Analysis
In kernel level analysis we focus on the time spent in GPU kernel execution and on the performance of individual GPU kernels.
Tools like those mentioned in the previous section usually provide a per-kernel summary of kernel execution, including launch parameters, number of launches, and time consumed by the kernel. The total time of an application can often be estimated as the elapsed time on the CPU plus the elapsed time on the GPU, where the GPU time is approximated by the sum of the kernel execution times. This gives an idea of how much overall improvement can be gained by improving GPU kernel execution time. The estimate is not exact when execution is overlapped or when data transfer time is significant, but it is still a good rule of thumb.
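As a back-of-the-envelope example of this estimate (all numbers hypothetical), the per-kernel summary supplies the GPU kernel times, and the resulting Amdahl-style bound shows the most that kernel tuning alone could gain:

```cpp
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical profile: CPU-side elapsed time plus per-kernel GPU times (seconds).
  double cpu_time = 120.0;
  std::vector<double> kernel_times = {45.0, 30.0, 15.0, 10.0};

  double gpu_time = 0.0;
  for (double t : kernel_times) gpu_time += t;
  double total = cpu_time + gpu_time;   // ignores overlap and transfer time

  // Upper bound on whole-application speedup if kernels became infinitely fast.
  std::printf("total %.0f s, GPU share %.0f%%, max speedup from kernel tuning %.2fx\n",
              total, 100.0 * gpu_time / total, total / cpu_time);
}
```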
Further analysis of a kernel’s performance requires:
Inspection of the kernel’s source code
Inspection of the assembly language the compiler generates for the kernel
Collection of hardware performance metrics during the kernel’s execution
Methods for extracting assembly language vary between different compilers and GPUs and are described in detail later in the document.
In the remainder of this section we describe the sorts of metrics that are available for GPUs (and often for CPUs) and general methods to interpret them and to use them in the performance improvement process. Specific detail for different GPUs is given later in the document.
Important GPU Metrics
Rate Metrics
The reason that applications use GPUs is often to increase available computational resources. Computational throughput is typically expressed as a rate of operations per unit time; for example, double precision floating point operations per second, or 32-bit integer operations per second. A particular GPU will have documented peak values for these rates.
Peak performance for a given application is often constrained by limits on non-computational resources, particularly access to different memory regions such as main memory or scratchpad memory. Here too there are peak values; for example, main memory bandwidth expressed as bytes per unit time.
Models like the classic roofline model attempt to quantify the achievable computational performance in the presence of other resource constraints, most typically main memory bandwidth. If an application is hitting the peak of some non-computational metric, it will not be able to achieve peak computational performance. This gives the analyst an estimate of the peak achievable performance for a given application.
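A minimal sketch of the roofline estimate, using hypothetical peak values: the attainable throughput of a kernel is the smaller of the compute peak and the kernel's arithmetic intensity multiplied by peak memory bandwidth.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  // Hypothetical device peaks; use the documented values for your GPU.
  double peak_flops = 10.0e12;   // 10 TFLOP/s double precision
  double peak_bw    = 1.5e12;    // 1.5 TB/s main memory bandwidth

  // Kernel's arithmetic intensity: FLOPs per byte moved to/from main memory.
  double ai = 0.25;              // e.g. a streaming, bandwidth-bound kernel

  double attainable = std::min(peak_flops, ai * peak_bw);
  std::printf("attainable: %.2f TFLOP/s (%.0f%% of compute peak)\n",
              attainable * 1e-12, 100.0 * attainable / peak_flops);
}
```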
Utilization Metrics
It can be useful to know how busy a given resource or functional unit is. Utilization metrics are different from rate metrics: a resource may be highly utilized while achieving only a small fraction of its peak rate. One example is a kernel with a very sparse memory access pattern; utilization of the memory unit might be quite high even though memory bandwidth is nowhere near peak. Utilization metrics can help identify bottlenecks that are not apparent from roofline-type models.
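The sparse-access case can be sketched as follows (SYCL, with a hypothetical index array): the gather keeps the memory unit busy issuing requests, but only a few bytes of each cache line fetched are useful, so delivered bandwidth remains far below peak.

```cpp
#include <sycl/sycl.hpp>

// Sketch: a gather whose indices are scattered in memory. Memory-unit
// utilization can be high while useful bandwidth stays low, because most of
// each fetched cache line is wasted.
void sparse_gather(sycl::queue &q, const float *in, const int *idx,
                   float *out, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    size_t k = i[0];
    out[k] = in[idx[k]];   // idx[k] values are far apart in memory
  }).wait();
}
```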
Utilization metrics are commonly available for memory units and compute units. They may also exist for different micro-architectural blocks such as caches and local memory.
Divergence
As previously discussed, a GPU consists of many compute units (CUs) that execute multiple work items simultaneously in SIMD (Single Instruction, Multiple Data) fashion.
The programmer writes code that specifies the actions to be performed on a single work-item. The compiler transforms this code into instructions that process multiple work items simultaneously. Each GPU has a native minimum number of work items that execute together, called the sub-group size.
Divergence occurs when different work items follow different execution paths. Since one instruction is executed for many work items at once, the compiler must generate instructions covering every path that any work item may take, and work items that are not active on a given instruction are masked off (disabled). The net effect is low utilization, since only some of the SIMD lanes do useful work at any given time.
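The effect can be sketched in SYCL (the two kernels are illustrative and do not compute the same result): in the first version adjacent work-items take different branches, so every sub-group executes both paths with roughly half its lanes masked off; in the second version the branch changes only at a granularity no smaller than an assumed sub-group size, so each sub-group follows a single path.

```cpp
#include <sycl/sycl.hpp>

// Divergent: adjacent work-items alternate between branches, so every
// sub-group executes both paths with roughly half of its lanes disabled.
void divergent(sycl::queue &q, float *x, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    size_t k = i[0];
    if (k % 2 == 0) x[k] = x[k] * 2.0f;
    else            x[k] = x[k] + 1.0f;
  }).wait();
}

// Uniform within a sub-group: branching on blocks of work-items at least as
// large as the sub-group size (assumed to be 64 here) avoids divergence.
void uniform(sycl::queue &q, float *x, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    size_t k = i[0];
    if ((k / 64) % 2 == 0) x[k] = x[k] * 2.0f;
    else                   x[k] = x[k] + 1.0f;
  }).wait();
}
```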
GPUs provide metrics to measure divergence, usually in the form of the number of work-items active per sub-group; this can be compared to the native sub-group size.
Occupancy
GPU occupancy has been discussed previously; briefly, it is the ratio of the actual number of active sub-groups to the theoretical maximum number of active sub-groups for a given kernel. Occupancy is important because it tells the analyst how close the kernel is to exploiting the maximum available parallelism.
Some GPUs have hardware to measure actual occupancy. The theoretical occupancy can be computed using properties of the compiled kernel and of the hardware.
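A simplified sketch of the theoretical-occupancy arithmetic follows; the hardware limits and kernel properties used here are hypothetical (real values come from the GPU documentation and the compiled kernel, and real calculations include additional granularity rules).

```cpp
#include <algorithm>
#include <cstdio>

int main() {
  // Hypothetical per-CU hardware limits; consult your GPU's documentation.
  int max_subgroups_per_cu = 64;      // scheduler slots per CU
  int registers_per_cu     = 65536;   // register file size per CU
  int local_mem_per_cu     = 65536;   // bytes of local memory per CU
  int subgroups_per_wg     = 4;       // workgroup size / sub-group size

  // Hypothetical compiled-kernel properties (from the launch report).
  int regs_per_subgroup    = 2048;    // registers used by one sub-group
  int local_mem_per_wg     = 8192;    // bytes of local memory per workgroup

  // Each resource caps how many sub-groups can be resident at once.
  int by_regs = registers_per_cu / regs_per_subgroup;
  int by_lmem = (local_mem_per_cu / local_mem_per_wg) * subgroups_per_wg;
  int active  = std::min({max_subgroups_per_cu, by_regs, by_lmem});

  std::printf("theoretical occupancy: %d/%d = %.0f%%\n", active,
              max_subgroups_per_cu, 100.0 * active / max_subgroups_per_cu);
}
```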
Launch Parameters
A kernel is launched with a global range and a local range. The latter is the workgroup size. The workgroup size should be a multiple of the sub-group size. Note that this may require rounding up the global problem size and adding code to the kernel to avoid processing work-items outside of the global problem size.
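A common pattern in SYCL is sketched below (the workgroup size of 256 is an assumption for illustration, not a recommendation): round the global range up to a multiple of the workgroup size and guard out-of-range work-items inside the kernel.

```cpp
#include <sycl/sycl.hpp>

void launch_padded(sycl::queue &q, float *data, size_t n) {
  constexpr size_t wg = 256;                     // chosen local (workgroup) size
  size_t global = ((n + wg - 1) / wg) * wg;      // round n up to a multiple of wg

  q.parallel_for(sycl::nd_range<1>{sycl::range<1>{global}, sycl::range<1>{wg}},
                 [=](sycl::nd_item<1> it) {
    size_t i = it.get_global_id(0);
    if (i < n)                                   // guard the padded work-items
      data[i] = data[i] * 2.0f;
  }).wait();
}
```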
There may be other constraints on the workgroup size for specific GPU hardware to improve occupancy. It may also be advantageous to choose a global problem size that has some relationship to hardware aspects of a given GPU, such as the number of CUs. The point is that it is not necessary to base the global and local problem sizes on the natural size of the problem; they can be chosen to better match the hardware.
All GPUs provide a mechanism to see the actual launch parameters for every kernel launch; these include global and local problem sizes and usually kernel properties such as number of registers and local memory size.