The purpose of this section is to describe AMD GPU architecture and related performance considerations within the context of the SYCL programming model. Readers are encouraged to consult the appropriate AMD documentation for up-to-date details of AMD architecture-specific performance considerations.
AMD GPU Architecture
AMD has three GPU architecture families supported by ROCm: GCN, RDNA, and CDNA. RDNA is designed for graphics and gaming. CDNA is designed for compute performance, typically in a data-center environment. The older GCN architecture was used in both graphics- and compute-specific cards.
The CDNA whitepaper has an overview of the CDNA architecture. Much more detail is contained in the CDNA1 ISA and CDNA2 ISA documents.
The basic execution unit of an AMD GPU is the compute unit, or CU. On the GCN and CDNA architectures, a CU executes sub-groups of 64 work-items. RDNA is optimized for a sub-group size of 32 but also supports a sub-group size of 64; however, only the sub-group size of 32 is currently supported for RDNA GPUs in the oneAPI release. AMD calls a sub-group a wavefront, or wave.
AMD also uses the OpenCL/SYCL term work-group. As always, a work-group is a collection of sub-groups (or wavefronts) that are guaranteed to run concurrently on the same CU, can use local resources such as SYCL local memory, and can synchronize between work-items.
The work-group size should always be a multiple of the sub-group size. The optimal work-group size is typically chosen to maximize occupancy and depends on the resources used by the particular kernel. It cannot exceed 1024 work-items.
Memory and Caches
Several kinds of memory are available. All global (device) memory reads go through the per-CU L1 cache and the device-wide shared L2 cache. Store operations go through a write-combining cache and then through the L2 cache, which can also perform atomic operations.
Each CU has dedicated local memory (sometimes called LDS, for Local Data Share), which maps directly to SYCL local memory. The amount of local memory per CU is architecture-dependent. Local memory can be used by a work-group, and it has higher bandwidth and lower latency than global memory. Local memory has 32 banks. Simultaneous accesses to different banks deliver the highest performance; otherwise bank conflicts occur and reduce performance. Hardware metrics exist to measure bank conflicts: see the AMD tools page for a summary of the rocProf application, which can measure bank conflicts on AMD GPUs.
Different work-items will usually access different locations in global memory. If the addresses accessed by a given sub-group in the same load instruction fall on the same set of cache lines, then the memory system will issue the minimal number of global accesses. This is called memory coalescing. It is assisted by the write-combining cache. This requirement can easily be satisfied by assigning memory-adjacent elements to adjacent work-items. Additional performance improvement can be achieved by aligning large data structures on 64-byte boundaries. Indirect accesses and/or large strides may make this hard to achieve.
Occupancy
GPUs use hardware queues to keep a number of sub-groups available, so that an executing sub-group that stalls can be swapped for a different sub-group, keeping the hardware busy. The sub-groups in the hardware queue(s) are called the active sub-groups. The maximum theoretical number of active sub-groups is limited by the work-group size, the register and local memory usage of the kernel, and the hardware queue size. Occupancy is defined as the ratio of the actual number of active sub-groups to this theoretical maximum. Consult the appropriate AMD documentation for vendor-specific details.
It is important to remember, though, that high occupancy does not always imply high performance; many other factors may come into play.