It is common for accelerators to have dedicated (e.g. on-chip or on-package) memory that is much faster to access than DDR (system RAM) but also much smaller. This kind of memory is typically used as a ‘scratchpad’: chunks or tiles of input data are streamed into it from DDR, processed by the accelerator cores, and streamed back out to DDR. This differs from approaches where accelerator cores read from and write to DDR directly, which is likely to be much slower due to DDR latency.
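As a rough illustration of this pattern, the sketch below processes a large DDR buffer one scratchpad-sized tile at a time. The `dma_read_tile`/`dma_write_tile` helpers and the `TILE_SIZE` constant are hypothetical placeholders for whatever mechanism the hardware provides to move data between DDR and the scratchpad; here they are modelled with `memcpy` so the sketch is self-contained.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers: on real hardware these would program a DMA engine,
 * here they are stand-ins modelled with memcpy. */
static void dma_read_tile(void *scratch, const void *ddr, size_t bytes) {
  memcpy(scratch, ddr, bytes); /* DDR -> scratchpad */
}
static void dma_write_tile(void *ddr, const void *scratch, size_t bytes) {
  memcpy(ddr, scratch, bytes); /* scratchpad -> DDR */
}

#define TILE_SIZE 1024u /* illustrative tile size in elements */

/* Process a large DDR buffer one scratchpad-sized tile at a time. */
void process_buffer(const float *ddr_in, float *ddr_out, size_t count,
                    float *scratch /* assumed to point into on-chip memory */) {
  for (size_t base = 0; base < count; base += TILE_SIZE) {
    size_t n = (count - base) < TILE_SIZE ? (count - base) : TILE_SIZE;
    dma_read_tile(scratch, ddr_in + base, n * sizeof(float));
    for (size_t i = 0; i < n; ++i) {
      scratch[i] *= 2.0f; /* the actual computation operates on fast memory */
    }
    dma_write_tile(ddr_out + base, scratch, n * sizeof(float));
  }
}
```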
While dedicated accelerator memory comes in many configurations and serves different purposes, in this section we focus on two different use cases: dedicated global memory and dedicated shared memory.
Dedicated Global Memory
With the first kind, kernels can use global buffers in just the same way as if they were stored in DDR, with no or minimal changes to the kernel source. Performance improves not only because of lower access latency, but also because data does not need to be transferred between the CPU and the accelerator, nor cache management operations performed, between kernel invocations. This kind of memory is managed on the host side, by passing a special flag when creating buffers at the SYCL level or by using a hardware-specific entry point at the OpenCL level.
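On an OpenCL target this could, for example, take the form of a vendor-specific `cl_mem_flags` bit. The flag below and its value are made up for illustration only; the actual mechanism, whether a buffer flag or a separate entry point, is defined by the hardware-specific extension.

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

/* CL_MEM_DEVICE_SCRATCHPAD_EXAMPLE is a hypothetical vendor flag bit standing
 * in for whatever a target's extension defines; everything else is standard
 * OpenCL. */
#define CL_MEM_DEVICE_SCRATCHPAD_EXAMPLE (1 << 16) /* placeholder value */

cl_mem create_scratchpad_buffer(cl_context context, size_t size, cl_int *err) {
  /* Kernels consume this buffer exactly like a DDR-backed global buffer;
   * only the allocation site differs. */
  return clCreateBuffer(context,
                        CL_MEM_READ_WRITE | CL_MEM_DEVICE_SCRATCHPAD_EXAMPLE,
                        size, NULL, err);
}
```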
For this use case, the memory needs to have certain characteristics. It needs to be accessible to all execution units in the same way (i.e. cores do not have their own ‘private’ chunk of this memory) and to persist across kernel invocations. The memory does not need to be mapped into the CPU’s address space, but there needs to be a mechanism such as DMA to transfer data to and from global (DDR) memory. It is also desirable for memory transfers to happen concurrently with kernel execution to allow double buffering. The accelerator ISA also needs to provide load and store instructions that can handle both DDR and scratchpad memory addresses.
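The host-side double-buffering pattern can be expressed with standard OpenCL commands, as in the hedged sketch below. It assumes a command queue that allows the commands to overlap (e.g. one created with out-of-order execution enabled, or an equivalent multi-queue setup), two device buffers `buf[0]`/`buf[1]` that the target places in dedicated memory, and a kernel that operates in place on its single buffer argument; error handling and event releases are omitted for brevity.

```c
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <stddef.h>

void run_double_buffered(cl_command_queue queue, cl_kernel kernel,
                         cl_mem buf[2], const float *host_in, float *host_out,
                         size_t num_tiles, size_t tile_elems) {
  const size_t tile_bytes = tile_elems * sizeof(float);
  cl_event upload_done[2] = {NULL, NULL};
  cl_event compute_done[2] = {NULL, NULL};
  cl_event readback_done[2] = {NULL, NULL};

  for (size_t t = 0; t < num_tiles; ++t) {
    size_t slot = t % 2;

    /* Upload tile t; it must not start before the previous tile that used
     * this slot has been read back. With two slots, the upload of tile t+1
     * can overlap the kernel working on tile t. */
    clEnqueueWriteBuffer(queue, buf[slot], CL_FALSE, 0, tile_bytes,
                         host_in + t * tile_elems,
                         readback_done[slot] ? 1 : 0,
                         readback_done[slot] ? &readback_done[slot] : NULL,
                         &upload_done[slot]);

    /* Process tile t once its upload has completed. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[slot]);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &tile_elems, NULL,
                           1, &upload_done[slot], &compute_done[slot]);

    /* Read tile t back once the kernel has finished with it. */
    clEnqueueReadBuffer(queue, buf[slot], CL_FALSE, 0, tile_bytes,
                        host_out + t * tile_elems,
                        1, &compute_done[slot], &readback_done[slot]);
  }
  clFinish(queue);
}
```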
Implementation Notes
With both approaches it might not be possible to expose scratchpad memory using existing concepts in the OpenCL/SYCL memory model, and it may be necessary to tag in-kernel pointers so that they are distinct from global and local pointers. This can be the case when a standard address space is backed by different kinds of memory (e.g. DDR or scratchpad in the dedicated global memory use case), or when hardware-specific features that can only be used with scratchpad memory need to be exposed. In this case a new (custom) address space needs to be implemented in the compiler front-end (e.g. by adding a new language keyword to extend the OpenCL C language) as well as in other parts of the compiler. Special handling of kernel arguments that use this new address space may also be needed in the runtime (ComputeMux target). See Address Spaces for more details on how address spaces are handled in the compiler.
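As a purely illustrative example, a front-end extended with a hypothetical `__scratchpad` keyword might accept a kernel like the one below, with the new qualifier lowered to its own LLVM address space distinct from `__global` and `__local`. The keyword itself, and the assumption that such buffers arrive as kernel arguments, are invented for the sake of the example.

```c
/* Hypothetical OpenCL C kernel using a made-up __scratchpad address space
 * qualifier; the front-end would map it to a dedicated LLVM address space,
 * and the runtime would need to recognise such kernel arguments. */
__kernel void scale(__scratchpad float *data, float factor) {
  size_t gid = get_global_id(0);
  data[gid] *= factor;
}
```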
While it might be easier for the user if dedicated memories were exposed using standard compute concepts that fit the OpenCL/SYCL model, hardware idiosyncrasies sometimes require a proprietary extension to the standard in order to take full advantage of hardware-specific features. With the oneAPI Construction Kit this involves creating a new OpenCL extension, which can introduce new entry point functions on the OpenCL host as well as new builtin functions that can be called from OpenCL C kernels. Exposing hardware-specific functionality through builtin functions is described in more detail in the Mapping Algorithms To Vector Hardware section.
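The interface of such an extension might look roughly like the declarations below. The extension entry point, the builtins, and their names and signatures are all invented for illustration and do not correspond to any real interface; they merely show the split between host-side entry points and device-side builtins.

```c
/* --- Hypothetical extension entry point (host side) --------------------- */
/* A made-up entry point that allocates a buffer directly in dedicated
 * accelerator memory; name and signature are illustrative only. */
cl_mem clCreateScratchpadBufferEXAMPLE(cl_context context, cl_mem_flags flags,
                                       size_t size, cl_int *errcode_ret);

/* --- Hypothetical builtins (OpenCL C, device side) ----------------------- */
/* Made-up builtins a kernel could call to move data between DDR and the
 * scratchpad asynchronously, in the spirit of async_work_group_copy(). */
event_t scratchpad_copy_in_EXAMPLE(__scratchpad void *dst,
                                   const __global void *src, size_t bytes);
void scratchpad_wait_EXAMPLE(event_t event);
```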