Memory Model
GPUs employ a memory hierarchy similar to that of CPUs, with multiple levels of memory that differ in latency, bandwidth, and capacity. The closer a level of memory is to the processor, the lower its latency and the smaller its capacity. In the following sections we explain how these levels of memory are exposed in both CUDA and SYCL.
CUDA Memory Model
The CUDA memory model is explicit: it exposes full control of the memory hierarchy to the programmer for allocation, de-allocation, and data movement between host and device. CUDA memory allocations are performed using cudaMalloc, which is the equivalent of the malloc function and returns a pointer that is not dereferenceable on the host. The following image represents the overall architecture of the CUDA memory model.
Register
The fastest memory type on NVIDIA GPUs is the register file. Registers are used for variables that are private to each thread. The registers of an SM are divided among its resident warps, and only a limited number of registers is available per SM. Therefore, using fewer registers in a kernel allows more thread blocks to execute concurrently on an SM, which improves GPU occupancy and performance. Kernel variables declared without any memory space qualifier are allocated in registers; arrays whose indices are constant and can be deduced at compile time may also be allocated in registers.
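As a minimal illustration (the kernel below is a hypothetical example, not taken from this guide), the unqualified variables tid and value are expected to live in registers private to each thread:
__global__ void scale(float *data, float factor) {
  // tid and value carry no memory space qualifier, so the compiler
  // normally places them in registers private to each thread
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  float value = data[tid] * factor;
  data[tid] = value;
}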
Shared memory
Variables annotated with __shared__
are stored in shared memory.
Shared memory is on chip, and it has lower latency than global memory.
Also, its memory bandwidth is higher
than that of global memory. As the name suggests, the shared memory is shared
among the threads in a thread-block. Each SM has a limited amount of shared
memory. Using too much shared memory per thread block will decrease the number
of active warps and can consequently cause a drop in performance.
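A minimal sketch of using __shared__ memory is shown below; the kernel and tile size are illustrative assumptions rather than code from this guide:
#define TILE 256

__global__ void reverse_tile(int *data) {
  // one tile of shared memory per thread block, visible to all of its threads
  __shared__ int tile[TILE];
  int tid = threadIdx.x;
  tile[tid] = data[blockIdx.x * TILE + tid];
  __syncthreads();                       // make sure the whole tile is loaded
  data[blockIdx.x * TILE + tid] = tile[TILE - 1 - tid];
}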
Global memory
The global memory is the device memory, visible to all SMs in the GPU architecture. Global memory can be allocated in host code through cudaMalloc and freed from host code by cudaFree. The allocated memory is passed to the CUDA device kernel function as a pointer parameter. A variable can also be placed in global memory by adding the __device__ qualifier in front of its declaration at file scope.
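For illustration, the hypothetical snippet below shows both ways of obtaining global memory: a statically declared __device__ variable at file scope, and a pointer allocated with cudaMalloc and passed to the kernel:
__device__ int counter;                  // statically declared in global memory

__global__ void increment(int *data, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) {
    data[tid] += 1;                      // dynamically allocated global memory
    atomicAdd(&counter, 1);              // statically declared global variable
  }
}

int main() {
  int *dev_data;
  cudaMalloc((void**)&dev_data, 1024 * sizeof(int));
  increment<<<4, 256>>>(dev_data, 1024);
  cudaDeviceSynchronize();
  cudaFree(dev_data);
  return 0;
}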
Constant memory
The constant memory is a read-only device memory that stores variables annotated with the __constant__ attribute. Constant variables are declared statically, have global scope, and must be defined outside of any kernel; they are visible to all kernels. On each SM, constant variables are served from a separate cache area dedicated to constant data. To get the best performance from constant memory, all threads in a warp should read the same memory address.
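A hedged example of constant memory usage: the coefficients array below is an illustrative assumption, defined at file scope with __constant__ and uploaded with cudaMemcpyToSymbol, so that every thread of a warp reads the same address on each iteration:
__constant__ float coeffs[4];            // visible to all kernels, read-only on the device

__global__ void apply(float *data) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  // every thread in the warp reads the same constant address per iteration
  float result = 0.0f;
  for (int k = 0; k < 4; ++k) {
    result += coeffs[k] * data[tid];
  }
  data[tid] = result;
}

// on the host, before launching the kernel:
// float host_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
// cudaMemcpyToSymbol(coeffs, host_coeffs, 4 * sizeof(float));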
Texture memory
The texture memory is a type of read-only device memory. Each SM has a dedicated read-only cache called the texture cache. The texture cache is optimized for two-dimensional spatial locality, so it performs best when the threads of a warp access nearby locations in two-dimensional data.
Local memory
The local memory is a virtual concept: its actual location is in global memory. Local memory is used for register spills (including large arrays), local structures, variables that cannot fit into registers, and arrays whose indices cannot be deduced at compile time. On devices of compute capability 3.x, accesses to local memory are always cached in the per-SM L1 cache and the per-device L2 cache. On devices of compute capability 5.x and 6.x, local memory accesses are cached only in the per-device L2 cache.
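The hypothetical kernel below sketches a typical cause of local memory usage: a per-thread array indexed with a value only known at run time, which the compiler generally cannot keep in registers:
__global__ void histogram_per_thread(const int *input, int *output, int n) {
  // 'bins' is private to each thread, but because it is indexed dynamically
  // it is likely to be placed in local memory (backed by global memory)
  int bins[32] = {0};
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) {
    bins[input[tid] % 32] += 1;          // index not known at compile time
    output[tid] = bins[input[tid] % 32];
  }
}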
CUDA Memory Management
The CUDA programming model uses pointer annotations to indicate where the different
objects are allocated in memory. When allocating global memory, C-style
malloc
functions can be used to create device-pointers.
CUDA memory allocation
CUDA memory allocation is triggered by calling :
cudaError_t cudaMalloc(void** dev_ptr, size_t num_bytes)
This function allocates num_bytes bytes of global memory on the device and returns the address of the allocated memory in dev_ptr. The allocated pointer is suitably aligned for any variable type.
Since the memory model is explicit, it is the user's responsibility to transfer the data between host and device using the right memory transfer operations, described in the next section.
A device pointer can be initialized by:
cudaError_t cudaMemset(void * dev_ptr, int val, size_t num_bytes);
This function sets the first num_bytes bytes of the memory region starting at dev_ptr to the value val.
It is also the user's responsibility to free the memory. Releasing the memory can be triggered by calling:
cudaError_t cudaFree(void * dev_ptr)
The function will return an error if the memory was never allocated or has already been freed.
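Putting these three calls together, a minimal (illustrative) allocation lifecycle looks like this:
int *dev_ptr = nullptr;
size_t num_bytes = 1024 * sizeof(int);
cudaMalloc((void**)&dev_ptr, num_bytes);   // allocate global memory on the device
cudaMemset(dev_ptr, 0, num_bytes);         // zero-initialise the allocation
// ... launch kernels that use dev_ptr ...
cudaFree(dev_ptr);                         // release the device memory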
CUDA memory transfer
The following function is used to transfer data from source to destination in CUDA:
cudaError_t cudaMemcpy(void *destination, const void *source, size_t num_bytes, enum cudaMemcpyKind kind);
cudaMemcpyKind can be one of the following types:
- cudaMemcpyHostToHost : the source and destination are host pointers.
- cudaMemcpyHostToDevice : the source is a host pointer and the destination is a device pointer.
- cudaMemcpyDeviceToHost : the source is a device pointer and the destination is a host pointer.
- cudaMemcpyDeviceToDevice : the source and destination are device pointers.
These API entries always block the calling host thread. If the source and destination do not match the directionality determined by kind, the result is undefined.
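As a short illustration (the host array h_data, the device pointer dev_ptr, and the size N are assumptions for the example), a typical blocking round trip between host and device looks like this:
// copy N integers from the host to the device
cudaMemcpy(dev_ptr, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
// ... launch kernels operating on dev_ptr ...
// copy the results back from the device to the host
cudaMemcpy(h_data, dev_ptr, N * sizeof(int), cudaMemcpyDeviceToHost);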
The asynchronous version of data transfer can be called using the following function:
cudaError_t cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )
The above function is non-blocking with respect to the host, so the call may return before the copy has completed. By default the copy is issued to the default stream, but it can be associated with a specific stream by passing a non-zero stream argument. When a non-default stream is specified, a cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost copy may overlap with operations in other streams.
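A minimal sketch of an asynchronous copy associated with a non-default stream (the pointers h_data and dev_ptr and the size N are again assumptions):
cudaStream_t stream;
cudaStreamCreate(&stream);
// enqueue the copy on 'stream'; the call may return before the copy completes
// (for the copy to overlap reliably, h_data should be pinned host memory)
cudaMemcpyAsync(dev_ptr, h_data, N * sizeof(int), cudaMemcpyHostToDevice, stream);
// ... work submitted to other streams can overlap with this copy ...
cudaStreamSynchronize(stream);   // wait for the copy to finish
cudaStreamDestroy(stream);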
CUDA pinned memory
By default, host memory allocated with regular C/C++ allocators is pageable virtual memory. However, CUDA allows programmers to pin host memory. When host memory is not pinned, the GPU device has no control over it: the operating system may page it out at any time.
Therefore, whenever a data transfer function is called on pageable memory, the CUDA driver allocates a temporary pinned buffer, copies the data into it, and then transfers the data from the pinned buffer to the device. Once memory has been pinned on the host side, it is no longer pageable.
If the user provides pinned host memory directly, CUDA does not need the temporary buffer, eliminating the extra intermediate copy. To allocate pinned memory in CUDA, the following function can be used:
cudaError_t cudaMallocHost(void** dev_ptr, size_t num_bytes )
num_bytes
represents the number of allocated bytes and dev_ptr
is the
actual non-pageable host pointer which is accessible from the device. Since
this memory is directly accessible from the device, the read/write bandwidth is
higher than that of pageable memory.
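For example (the size N and the device pointer dev_ptr are illustrative assumptions), pinned host memory can be allocated and released as follows:
float *h_pinned = nullptr;
// allocate page-locked (pinned) host memory
cudaMallocHost((void**)&h_pinned, N * sizeof(float));
// transfers from/to h_pinned avoid the intermediate pageable-to-pinned copy
cudaMemcpy(dev_ptr, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);
cudaFreeHost(h_pinned);          // pinned host memory is released with cudaFreeHost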
CUDA zero-copy memory
CUDA zero-copy memory is pinned host memory mapped into the device address space. CUDA zero-copy avoids explicit data movement between host and device. Because the memory is pinned, transfers over PCIe achieve higher rates; however, accesses must be synchronized whenever data is shared between host and device, otherwise the behavior is undefined. Zero-copy memory can also be useful when there is not enough memory available on the device. Zero-copy memory can be allocated by calling the following function:
cudaError_t cudaHostAlloc(void**ptr_host, size_t num_bytes, unsigned int flags)
The possible flags are:
- cudaHostAllocDefault : the equivalent of CUDA pinned memory.
- cudaHostAllocPortable : allocates pinned memory that can be accessed by all CUDA contexts.
- cudaHostAllocWriteCombined : allocates pinned memory suited to buffers written by the host and read by the device.
- cudaHostAllocMapped : allocates host memory mapped into the device address space.
Calling the following function provides the device pointer for mapped, pinned memory:
cudaError_t cudaHostGetDevicePointer(void **ptr_device, void *ptr_host,
unsigned int flags);
If an NVIDIA GPU has the ability to access host registered memory at the same
virtual address as the CPU, the ptr_device
will match the ptr_host
pointer, meaning that the memory can be accessed on the device by using the
same host pointer ptr_host
. Otherwise, the device pointer may or may not
match the original host pointer.
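A minimal zero-copy sketch, under the assumption that the device supports mapped host memory, combines cudaHostAlloc with cudaHostGetDevicePointer (my_kernel, blocks, threads and N are assumptions for illustration):
float *h_mapped = nullptr;
float *d_mapped = nullptr;
// allocate pinned host memory that is mapped into the device address space
cudaHostAlloc((void**)&h_mapped, N * sizeof(float), cudaHostAllocMapped);
// obtain the device pointer corresponding to the mapped host allocation
cudaHostGetDevicePointer((void**)&d_mapped, h_mapped, 0);
// d_mapped can now be passed to a kernel; its accesses travel over PCIe
my_kernel<<<blocks, threads>>>(d_mapped);
cudaDeviceSynchronize();         // synchronise before the host reads h_mapped
cudaFreeHost(h_mapped);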
Although zero-copy memory avoids explicit memory transfers between host and device by allocating pinned host memory, the kernel performance generally suffers from high-latency accesses to zero-copy memory over the PCIe bus.
CUDA Memory Access Patterns
Global memory is a logical memory space residing in DRAM. Kernel code accesses data moving between DRAM and SM on-chip memory through the L1 and L2 caches whenever possible. Each data transfer is either a 128-byte or a 32-byte memory transaction: if only the L2 cache is used, the access is serviced by 32-byte transactions, whereas if both L1 and L2 are used, the access is serviced by 128-byte transactions. Depending on the GPU architecture, the L1 cache can be enabled or disabled manually, and a memory load is considered a cached load when the L1 cache is enabled. The L1 cache can be explicitly enabled or disabled by passing -Xptxas -dlcm=FLAG to the compiler, where FLAG can either be ca to enable the L1 cache or cg to disable it. However, on some NVIDIA GPU architectures such as Kepler K10, K20, and K20x, the L1 cache is disabled for global loads by default and is used instead for caching register spills to local memory. With these L1 and L2 transaction sizes in mind, the following subsections explain the concepts of aligned memory access and coalesced memory access.
Aligned memory access
Memory accesses are called aligned when the address referenced in device memory is a multiple of 32 bytes for L2 cache or 128 bytes for L1 cache. If the memory load is misaligned, the memory bandwidth usage is reduced.
Coalesced memory access
Memory accesses are called coalesced if all 32 threads in a warp access a contiguous chunk of memory. The following image shows coalesced memory access by all threads of a warp.
Cache line
The L1 cache line is 128 bytes mapped to a 128 byte aligned segment in device memory. Therefore, an aligned coalesced access of 4 bytes per thread in a warp will perfectly map to the L1 cache line and will be loaded to register/shared memory by only one single 128-byte memory transaction.
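The hypothetical kernel below gives a perfectly aligned and coalesced access: consecutive threads of a warp read consecutive 4-byte elements, so one warp maps onto a single 128-byte transaction (assuming the allocations returned by cudaMalloc are suitably aligned):
__global__ void copy_coalesced(const float *in, float *out, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < n) {
    // thread k of a warp reads element k of a contiguous, aligned segment:
    // 32 threads * 4 bytes = one 128-byte transaction
    out[tid] = in[tid];
  }
}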
SYCL Memory Model
The SYCL memory model is based on the OpenCL memory model, but operates at a higher level of abstraction. The key difference of this abstraction is the separation of storage from access to memory which is expressed through the relationship between memory objects (buffers or images) and accessors.
From the device point of view, the SYCL 1.2.1 memory model is fundamentally the same as the OpenCL 1.2 memory model.
From a user's point of view, SYCL creates one memory object in the form of a buffer or image. This memory object can be accessed on the host via a host accessor. The same memory object can be accessed on the device side via a device accessor.
It is important to understand that the SYCL buffer/image is a host API managing data allocation and data transfer across the available devices. By creating an accessor, we create a requirement on the SYCL buffer/image. Satisfying this requirement may involve allocating memory, synchronizing between two accessors, or transferring data between host and device. An accessor is an object that tells the SYCL buffer/image where the data needs to be. If a host accessor is requested from a buffer/image, this implies that the data needs to be on the host, so the SYCL buffer/image allocates the data on the host. If a device accessor is requested from a SYCL buffer/image, this implies that the data needs to be on the device, so the SYCL buffer/image allocates the memory on the device. The SYCL buffer/image also handles the data transfer and synchronization, based on the access mode. For further information about data transfer using access modes, please refer to the SYCL memory transfer section.
Private Memory
In SYCL, private memory is a region of memory allocated per work item, visible only to that work item. Private memory is used for kernel parameters and local stack variables. Private memory cannot be accessed from the host.
Private memory is typically mapped to hardware registers. There is no mechanism in SYCL to query the number of registers available to a particular device. Developers must refer to the documentation of the SYCL implementation and/or the hardware vendor to understand the limits of private memory.
Local Memory
Local memory is a contiguous region of memory allocated per work group and visible to all work items within that work group. Local memory is allocated and accessed using an accessor created inside a command group at submission time, and it cannot be accessed from the host. From the device, local memory is accessed as an array of contiguous elements of a particular type, and multiple local memory arrays can be allocated for the same kernel. The maximum available local memory depends on the hardware and on the kernel configuration.
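A minimal sketch of requesting local memory via an accessor in a command group (the queue q, the kernel name, and the sizes are illustrative assumptions):
q.submit([&](cl::sycl::handler& cgh) {
  // request 64 floats of local memory per work group
  cl::sycl::accessor<float, 1, cl::sycl::access::mode::read_write,
                     cl::sycl::access::target::local>
      local_acc(cl::sycl::range<1>(64), cgh);
  cgh.parallel_for<class local_example>(
      cl::sycl::nd_range<1>(cl::sycl::range<1>(256), cl::sycl::range<1>(64)),
      [=](cl::sycl::nd_item<1> item) {
        // each work item writes to its own slot in local memory
        local_acc[item.get_local_id(0)] = static_cast<float>(item.get_global_id(0));
        item.barrier(cl::sycl::access::fence_space::local_space);
      });
});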
Global Memory
In SYCL (and OpenCL), global memory represents allocations from a pool of memory visible by all threads of the device. In addition, different devices in the same SYCL context can see this memory. Note that not all devices will be available in all contexts - this is an implementation specific detail that depends on the hardware architecture.
There are two ways of allocating global memory in SYCL: using Buffers or Images.
Buffer Memory
In SYCL, a buffer represents a contiguous region of memory which is managed by the SYCL runtime and can be accessed on the host and on a device. When accessed on the host, the runtime may copy it via temporary host storage. When accessed from the device, the memory is stored in the global address space (visible to all work items), or in the constant address space (visible to all work items, but read-only) if the runtime works out that the memory is read-only.
Accessing memory on the host requires construction of a host accessor, whereas on the device it is accessed via command group accessors.
Once executing inside a device kernel or in a host accessor, the buffer is represented as an array of contiguous elements of a particular type.
In the following code snippet, we access the buffer buff_a
on the host side
by creating a host accessor host_acc
, and access the same buffer buff_a
on
the device side by using the device accessor dev_acc
on the device.
// creating a SYCL buffer which provides the data storage
cl::sycl::buffer<float, 1> buff_a(cl::sycl::range<1>{100});
{
  // getting a host accessor on buff_a with write permission
  auto host_acc = buff_a.get_access<cl::sycl::access::mode::write>();
  // using the host accessor to fill the data in the buffer
  for (int i = 0; i < 100; i++) {
    host_acc[i] = 1.0f;
  }
} // destroying the host accessor releases the buffer for use on the device
cl::sycl::queue q;
q.submit([&](cl::sycl::handler& cgh) {
  // creating a device accessor on buff_a with read/write permission
  auto dev_acc = buff_a.get_access<cl::sycl::access::mode::read_write>(cgh);
  cgh.parallel_for<class hello_world>(cl::sycl::range<1>{100},
                                      [=](cl::sycl::id<1> id) {
    // using the device accessor inside the kernel to update the data
    dev_acc[id] += 1.0f;
  });
});
Image Memory
In SYCL, image objects represent a region of memory managed by the SYCL runtime. The data layout of the image memory is deliberately unspecified so as to allow implementations to provide a layout optimal to a given device. When accessed on the host, image memory may be stored in temporary host memory. When accessed on the device, image data is stored in device image memory, which is often texture memory if the device supports it. Image memory can also be accessed using an image array representation.
Accessing image memory on the host is done by constructing an accessor with the
host_image
access target. Accessing image memory on the device is done by
constructing an accessor with the image
access target. Accessing image memory
on the device in an image array representation is done by constructing an
accessor with the image_array
access target.
Image memory is accessed via member functions that perform read, sampled read, and write operations on the memory on the device.
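As a rough sketch (the image sizes, the queue q, and the kernel name are assumptions; the types follow the SYCL 1.2.1 image interface), image data can be read and written through the accessor's read and write member functions:
// two four-channel float images of 16x16 texels backed by host data
std::vector<cl::sycl::float4> in_data(16 * 16), out_data(16 * 16);
cl::sycl::image<2> img_in(in_data.data(), cl::sycl::image_channel_order::rgba,
                          cl::sycl::image_channel_type::fp32,
                          cl::sycl::range<2>(16, 16));
cl::sycl::image<2> img_out(out_data.data(), cl::sycl::image_channel_order::rgba,
                           cl::sycl::image_channel_type::fp32,
                           cl::sycl::range<2>(16, 16));
q.submit([&](cl::sycl::handler& cgh) {
  auto in  = img_in.get_access<cl::sycl::float4, cl::sycl::access::mode::read>(cgh);
  auto out = img_out.get_access<cl::sycl::float4, cl::sycl::access::mode::write>(cgh);
  cgh.parallel_for<class image_copy>(
      cl::sycl::range<2>(16, 16), [=](cl::sycl::item<2> item) {
        cl::sycl::int2 coord(static_cast<int>(item[0]), static_cast<int>(item[1]));
        out.write(coord, in.read(coord));   // unsampled read and write
      });
});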
Constant Memory
Constant memory is allocated by accessing a SYCL buffer through the read-only constant_buffer access target. See the Buffer Memory and Image Memory sections for more details on how to create a buffer or image memory.
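A short, hedged sketch of placing read-only data in constant memory through the constant_buffer access target (the queue q, the coeffs and data buffers, and the kernel name are assumptions; coeffs is assumed to have been filled on the host):
q.submit([&](cl::sycl::handler& cgh) {
  // read-only accessor targeting the constant address space
  auto const_acc =
      coeffs.get_access<cl::sycl::access::mode::read,
                        cl::sycl::access::target::constant_buffer>(cgh);
  auto data_acc = data.get_access<cl::sycl::access::mode::read_write>(cgh);
  cgh.parallel_for<class constant_example>(
      cl::sycl::range<1>(100), [=](cl::sycl::id<1> id) {
        data_acc[id] *= const_acc[0];   // all work items read the same constant value
      });
});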
SYCL Memory Management
In SYCL, memory is managed through buffer and image objects which maintain a memory region across the host and one or more devices. Memory is accessed via accessors: objects which represent a request to access the memory in a particular region with particular properties. Accessors are provided to a command group as a way of describing data dependencies for a kernel function. These data dependencies are used by the SYCL runtime to both satisfy data requirements for kernel functions by making the memory available as requested and to perform data movement optimizations.
Accessors are parameterized by an access target, which specifies where the memory is to be accessed, and an access mode, which specifies how the memory is to be accessed. The access target of an accessor can be host_buffer, global_buffer, constant_buffer, host_image, image, image_array or local. The access mode of an accessor can be read, write, read_write, discard_write, discard_read_write or atomic.
SYCL memory allocation
Memory allocation on the device is typically asynchronous in SYCL, and where it happens varies according to the implementation. SYCL accessors are templated on the cl::sycl::access::target enum class, which specifies the memory space in which memory is to be allocated. When an accessor is passed to a SYCL kernel, it becomes a requirement for the kernel, and memory is then allocated in the memory space associated with the specified access target.
| Access Target | Memory Space |
|---|---|
| cl::sycl::access::target::host_buffer, cl::sycl::access::target::host_image | Host memory |
| cl::sycl::access::target::global_buffer | Global memory |
| cl::sycl::access::target::constant_buffer | Constant memory |
| cl::sycl::access::target::local | Local memory |
| cl::sycl::access::target::image, cl::sycl::access::target::image_array | Image memory |
SYCL memory transfer
Memory transfers in SYCL are asynchronous and are performed implicitly by the SYCL runtime when an accessor requires access to memory that is not already available on the device. SYCL accessors are templated on the cl::sycl::access::mode enum class, which specifies the mode by which the memory is to be accessed. When an accessor is passed to a SYCL kernel, it becomes a requirement for the kernel, and the access mode affects the data dependencies between SYCL kernels. The implicit transfers, and the rules determining whether or not data should be transferred, are defined by the type of cl::sycl::access::mode.
The table below describes the different access modes available for SYCL accessors, and whether they can trigger a copy into the device memory from the host or not.
| SYCL data access | Should the data be transferred to/from host/device? | Is the accessed data modifiable on host/device? |
|---|---|---|
| cl::sycl::access::mode::read | Yes | No |
| cl::sycl::access::mode::write | Yes | Yes |
| cl::sycl::access::mode::read_write | Yes | Yes |
| cl::sycl::access::mode::discard_write | No | Yes |
| cl::sycl::access::mode::discard_read_write | No | Yes |
| cl::sycl::access::mode::atomic | Yes | Yes |
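As an illustrative example of how access modes affect the implicit transfers, a buffer that is only written by the device can be accessed with discard_write so that the runtime does not copy stale host data to the device first (the queue q, buffer name, and kernel name are assumptions):
cl::sycl::buffer<float, 1> results(cl::sycl::range<1>(100));
q.submit([&](cl::sycl::handler& cgh) {
  // discard_write: no host-to-device copy is needed before the kernel runs
  auto out = results.get_access<cl::sycl::access::mode::discard_write>(cgh);
  cgh.parallel_for<class discard_example>(
      cl::sycl::range<1>(100),
      [=](cl::sycl::id<1> id) { out[id] = 0.0f; });
});
// reading on the host triggers the device-to-host transfer
auto host_out = results.get_access<cl::sycl::access::mode::read>();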
SYCL pinned memory
There is no explicit mechanism to request pinned memory in SYCL. The SYCL runtime will always aim to allocate memory in the most optimal way. However, the specific requirements for this vary from one architecture to another and from one OS to another, and there is no guarantee that SYCL devices support pinned memory, so vendor documentation should be consulted for architecture-specific requirements.
There are generally two ways in which host memory can be allocated:
- When the cl::sycl::property::buffer::use_host_ptr property is not used, the SYCL runtime allocates host memory when required. This uses an implementation-specific mechanism, which may attempt to request pinned memory.
- If the cl::sycl::property::buffer::use_host_ptr property is used, the SYCL runtime will not allocate host memory and will use the pointer provided when the buffer is constructed. In this case, it is the user's responsibility to ensure that any requirements for the allocation to be usable as pinned memory are satisfied.
Users can manually allocate pinned memory on the host and hand it over to the SYCL implementation. This will often involve allocating host memory with a suitable alignment and size multiple, and can sometimes be managed manually using OS-specific operations such as mmap and munmap.
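The sketch below illustrates that idea under the assumption that a page-aligned allocation is acceptable to the underlying implementation as pinnable memory; posix_memalign and the 4096-byte page size are used purely as an example of an OS-specific aligned allocation:
// allocate page-aligned host memory (4096 is an assumed page size)
void *host_mem = nullptr;
posix_memalign(&host_mem, 4096, 1024 * sizeof(float));
// hand the pointer to SYCL; the runtime uses it instead of allocating its own
cl::sycl::buffer<float, 1> buf(static_cast<float *>(host_mem),
                               cl::sycl::range<1>(1024),
                               {cl::sycl::property::buffer::use_host_ptr()});
// ... use buf in command groups ...
// the allocation must outlive the buffer and is freed by the user afterwards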
SYCL zero-copy memory
In SYCL there is no explicit mechanism to request zero-copy behaviour. If a SYCL buffer is backed by pinned memory, allocated according to the requirements above, then the SYCL runtime will attempt to use zero-copy where possible.
SYCL Memory Access Patterns
A SYCL nd_range
class describes the iteration space over which the SYCL kernel
is to be executed, which can be 1, 2 or 3 dimensional.
SYCL buffer
and image
memory objects, and the subsequent accessor
objects
used to access the memory managed by them can also be 1, 2 or 3 dimensional. The
memory, however, is still allocated as a single contiguous array of elements,
regardless of the dimensionality. Therefore, when accessing an element of a
multi-dimensional accessor
the position within that multi-dimensional space
must be translated to a linear position in memory.
For any given multi-dimensional range R = {r0, r1, r2} and a point I = {i0, i1, i2} within that range:
- When a buffer or local accessor is accessed via a multi-dimensional id or multiple subscript operators, the linear position in memory is computed in row-major order, i.e. L = i2 + (i1 * r2) + (i0 * r2 * r1).
- When an image accessor is accessed via a multi-dimensional id, the linear position in memory is computed in column-major order, i.e. L = i0 + (i1 * r0) + (i2 * r0 * r1).
The following diagram demonstrates accessing the id {2, 1} within the range {3, 4} using both a buffer accessor (buffer_acc) linearizing in row-major order and an image accessor (image_acc) linearizing in column-major order.
Alternatively, the position with respect to the origin for an id index can be calculated manually from the id and range objects.
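A small helper (illustrative, not part of the SYCL API) that performs the row-major calculation described above for a 3-dimensional id and range:
// row-major linearization of a 3D index {i0, i1, i2} within range {r0, r1, r2}
size_t linearize_row_major(const cl::sycl::id<3> &i, const cl::sycl::range<3> &r) {
  return i[2] + (i[1] * r[2]) + (i[0] * r[2] * r[1]);
}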
Aligned memory access
Memory access is said to be aligned when the first address of a memory allocation on the device is an even multiple of the global cache line size. In SYCL the global memory cache line size can be queried from a device, as shown below.
Aligned access typically increases performance.
auto cacheLineSize = device.get_info<
cl::sycl::info::device::global_mem_cache_line_size>();
Coalesced memory access
Memory access is called coalesced in SYCL if contiguously executed work items
access a contiguous chunk of memory. This depends on the mapping between
the execution order of work items and the linearization of the id
being used
to access memory. There is no guarantee as to the execution order of
work items: this may vary from one architecture to another.
Refer to the vendor documentation to find the right mapping between the hardware access pattern and the memory linearization.
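As an illustrative sketch (the queue q, the buffer buf, and the kernel name are assumptions), a kernel in which consecutively linearized work items access consecutive elements of an accessor keeps the access pattern contiguous; whether it is also coalesced in hardware depends on the device's work-item execution order:
q.submit([&](cl::sycl::handler& cgh) {
  auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
  cgh.parallel_for<class contiguous_access>(
      cl::sycl::range<1>(1024), [=](cl::sycl::id<1> id) {
        // work item id accesses element id: contiguous ids touch contiguous memory
        acc[id] *= 2.0f;
      });
});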
Similarities and differences
Memory model comparison for CUDA and SYCL
The following table shows the equivalence between the different components of the memory architecture in SYCL and CUDA.
| CUDA name | SYCL name |
|---|---|
| Per-thread local memory | Private memory |
| Per-block shared memory | Local memory |
| Global memory | Global memory |
| Constant memory | Constant memory |
| Texture memory | Image memory |
| Local memory | N/A |
Note that developers cannot make assumptions about where the different memories are actually placed in the hardware when using SYCL. The implementation documentation describes where in the actual hardware the different memories are mapped. However, which architectural element can access each memory remains the same across all architectures, e.g. private memory is only visible to the work item it belongs to, whereas global memory is visible to all work items in the ND-range.
Using C-style pointers on CUDA and SYCL
It is important to note that CUDA memory allocations are handled using C-style pointers that are not directly accessible on the host, which forces users to handle the host and the device copies of the data independently, except when using either CUDA unified memory or CUDA unified virtual addressing; see the sections below.
The following example illustrates a trivial CUDA memory allocation.
int *dev_a;
// allocate the memory on the GPU
HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );
The call to cudaMalloc allocates the dev_a pointer, which will be accessible only from the device. Pointer arithmetic is possible, so dev_a + 3 will point to the element at index 3 of the array on the device. Also, the pointer address is guaranteed to be the same across multiple kernel executions, since allocations are performed directly by the hardware's low-level driver.
This is different from the approach to memory allocation exposed by SYCL. In SYCL, memory allocation is handled via C++ objects that encapsulate the implementation-specific behavior. The developer must use accessors to specify when using the data on the host or the device. This enables the SYCL runtime to schedule optimal memory synchronization operations (e.g, copy or DMA operations).
cl::sycl::buffer<int, 1> bufA(cl::sycl::range<1>(N));
The buffer above can be accessed on the host and the device using a host or a device accessor. The buffer objects are not pointers, hence pointer arithmetic is not possible. The pointer that can be obtained when using the accessor is only valid for the duration of the access, as there is no guarantee the underlying memory is the same between multiple accesses.
This mechanism is typically better suited to C++ code bases, as it is type-safe and enables further runtime optimizations. However, when porting existing CUDA code it may not be possible to replace the existing code completely to use this mechanism. For this reason, we provide a number of utilities in the computecpp-sdk that help expose a CUDA-like pointer to the legacy application whilst keeping SYCL objects underneath. We focus on the software-managed virtual pointer, since it is the one that has been used to port Eigen and TensorFlow to SYCL. When using the vptr utility, users can create a PointerMapper structure that holds the mapping of virtual addresses to SYCL buffers. This enables using C-style malloc and free functions that return non-dereferenceable pointers which can be used in pointer arithmetic. The example below, extracted from the computecpp-sdk, illustrates its usage:
The following code shows how to mimic the CUDA-like pointer in SYCL, by creating a C-style pointer and applying arithmetic operations on the pointer without dereferencing it on the host side.
// Create the PointerMapper structure
PointerMapper pMap;
// Create a SYCL buffer of 10 floats.
// The returned pointer is a number that identifies the buffer
// in the pointer mapper structure.
float *a = static_cast<float *>(SYCLmalloc(10 * sizeof(float), pMap));
// Create a SYCL buffer of 25 integers
int *b = static_cast<int *>(SYCLmalloc(25 * sizeof(int), pMap));
// Create a pointer to the 5th element.
// This simply adds 5 * sizeof(float) to the base address.
float *c = a + 5;
// Both pointers map to the same underlying buffer
assert(pMap.get_buffer(a) == pMap.get_buffer(c));
// Subtracting the base address of the buffer from the pointer
// recovers the offset into it
assert(pMap.get_offset(c) == 5 * sizeof(float));
// Invalid usage: no dereferencing on the host
// float myVal = *c;
// Valid access on the host: use a host accessor
{
auto syclAcc = pMap.get_buffer(a).get_access<access::mode::read>();
float myVal = syclAcc[0];
}
// Free the pointers
SYCLfree(a, pMap);
SYCLfree(b, pMap);
CUDA Unified virtual addressing
When a CUDA application is built on a 64-bit system, on hardware supporting compute capability 2.0 and higher, the host and multiple devices share a common virtual address space. This is possible due to the interaction of the CUDA application with the NVIDIA system driver. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices lie within this virtual address range. This makes it possible to detect the device that owns a given pointer using specific query functions, and to use specific host allocation functions. Once the owning device is known, it is possible to migrate data automatically between host and device. This feature requires a specific system-level driver provided by the vendor, and it is not available in SYCL 1.2.1 (or OpenCL 1.2). Note, however, that SYCL objects do automatically handle the transition between the host and the device, and no explicit copy operation is required (as opposed to OpenCL).
In more recent versions of SYCL and OpenCL, new API features partially enable this feature. However, hardware support is not widely available.
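For example, with unified virtual addressing the owner of a pointer can be queried with cudaPointerGetAttributes; the snippet below is a minimal sketch (the pointer ptr is an assumption, and the exact fields of the attributes structure differ between CUDA versions):
cudaPointerAttributes attrs;
if (cudaPointerGetAttributes(&attrs, ptr) == cudaSuccess) {
  // attrs.device identifies the device owning a device allocation;
  // older CUDA versions expose attrs.memoryType, newer ones attrs.type
  printf("pointer belongs to device %d\n", attrs.device);
}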
CUDA unified memory
CUDA unified memory enables using the same pointer across host and device, and provides memory management services to a wide range of programs (either CUDA runtime API or directly from within the kernel). It enables, for example, kernels to trigger page-faults to read memory from the host. It can also automatically alter the memory allocation for cases where the requested size of memory is larger than the available memory size.
Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations, the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.
CUDA unified memory is also called managed memory. There are two ways to create managed memory.
- By using the __managed__ attribute in front of a variable declaration. This can only be done at global (file) scope.
The following code snippet shows how to define a __managed__ variable in CUDA.
__device__ __managed__ int y;

__global__ void custom_kernel()
{
  y++;
}

int main() {
  y = 1;                      // managed variable accessed directly on the host
  custom_kernel<<<1, 1>>>();  // the kernel updates the same variable on the device
  cudaDeviceSynchronize();    // wait before the host reads y again
  return 0;
}
- By using CUDA runtime functions.
cudaError_t cudaMallocManaged(void **devPtr, size_t size, unsigned int flags=0);
The following code shows how to define managed memory using CUDA runtime functions.
#define N 100

__global__ void custom_kernel(int *dev_a)
{
  for (int i = 0; i < N; i++) {
    dev_a[i] += i;
  }
}

int main() {
  int *dev_a;
  // allocate managed memory, accessible from both host and device
  cudaMallocManaged((void**)&dev_a, N * sizeof(int));
  // initialise the managed memory directly on the host
  for (int i = 0; i < N; i++) {
    dev_a[i] = i;
  }
  custom_kernel<<<1, 1>>>(dev_a);
  cudaDeviceSynchronize();   // wait for the kernel before the host touches dev_a again
  cudaFree(dev_a);
  return 0;
}
All CUDA operations that are valid on device memory are also valid on managed memory. The primary difference is that the host is also able to reference and access managed memory.
Like in the case of CUDA unified virtual addressing, this feature requires a specific system-level driver provided by the vendor, and it is not available on SYCL 1.2.1 (or OpenCL 1.2). Note, however, that SYCL objects do automatically handle the transition between the host and the device, and no explicit copy operation is required (as opposed to OpenCL).
In newer versions of SYCL and OpenCL, new API features will partially enable this feature. In particular, SVM (Shared Virtual Memory), enables equivalent behavior.