Memory Model

This guide was created for versions: v0.1.0 - Latest

GPUs employ a similar memory hierarchy to the one used in CPU devices, including multiple levels of memory with different latency, bandwidths, and capacity. The lower the memory latency is, the closer the memory position is to the processor, and the lower its capacity is. In the following sections we explain how these levels of memory are exposed in both CUDA and SYCL.

CUDA Memory Model

The CUDA memory model is explicit. This will expose the full control of memory hierarchy to the programmer for allocation, de-allocation and data movement between host and device. CUDA memory allocations are performed using cudaMalloc which is equivalent to a malloc function, and returns a pointer that is not dereferencable on the host. The following image represents the overall architecture of the CUDA memory model.

CUDA memory architecture

Register

The fastest memory type on NVIDIA GPUs is the register file. Registers are used for allocating variables which are private to each threads. Registers are divided among warps in an SM. However, there is a very limited number of registers available per SM. Therefore, using fewer registers in the kernel allows more thread blocks to be executed concurrently on an SM and therefore the GPU occupancy and performance will improve. Kernel variables without a type qualifier will be allocated in registers. Also, if the array indices are constant and can be deduced at compile time, they will be allocated in a register as well.

Shared memory

Variables annotated with __shared__ are stored in shared memory. Shared memory is on chip, and it has lower latency than global memory. Also, its memory bandwidth is higher than that of global memory. As the name suggests, the shared memory is shared among the threads in a thread-block. Each SM has a limited amount of shared memory. Using too much shared memory per thread block will decrease the number of active warps and can consequently cause a drop in performance.

Global memory

The global memory is the device memory, visible by all SMs in the GPU architecture. Global memory can be allocated in the host code through cudaMalloc and freed from host code by cudaFree. The created memory will be passed as a pointer to the CUDA device kernel function as a parameter. Also a variable can be declared as global inside the kernel by adding __device__ qualifier in front of it.

Constant memory

The constant memory is a read-only device memory that stores variables annotated with __constant__ attribute. Constant variables are declared statically. Their scope is global and they must be defined outside of the kernel as they will be visible to all kernels. For each SM, the constant variables will be allocated in a separate cache area dedicated for the constant variables. To get the best performance from the constant memory, all threads in a warp should read the same memory address.

Texture memory

The texture memory is a type of read-only device memory. For each SM, there is a dedicated read only cache called a texture cache. The texture cache is optimized for two-dimensional array access, and is optimized to be accessed by warps accessing two-dimensional data.

Local memory

The local memory is a virtual concept where the actual location for it is in the global memory. Local memory is used for allocating register spills (including large arrays); local structures; variables that cannot fit into registers, or arrays whose indices cannot be deduced at compile-time. The accesses to local memory are always cached in per-SM L1 caches and per-device L2 caches for compute capability 3.x. For devices with compute capability of 5.x and 6.x, the data accesses are always cached in per-device L2 caches.

CUDA memory Management

The CUDA programming model uses pointer annotations to indicate where the different objects are allocated in memory. When allocating global memory, C-style malloc functions can be used to create device-pointers.

CUDA memory allocation

CUDA memory allocation is triggered by calling :

cudaError_t cudaMalloc(void** dev_ptr, size_t num_bytes)

This function allocates a device pointer of size num_bytes bytes of global memory on the device and returns the address of the allocated memory in dev_ptr. The allocated pointer is aligned for any variable type.

Since the memory model is explicit, it is the user's responsibility to transfer the data between host and device using the right memory transfer operations, described in the next section.

A device pointer can be initialized by:

cudaError_t cudaMemset(void * dev_ptr, int val, size_t num_bytes);

This function will set the following num_bytes of the memory position starting in dev_ptr with the value of val.

It is also the user's responsibility to free the memory. Releasing the memory can be triggered by calling:

cudaError_t cudaFree(void * dev_ptr)

The function will return an error if the memory is not created or has been already freed.

CUDA memory transfer

The following function is used to transfer data from source to destination in CUDA:

cudaError_t cudaMemcpy(void *destination, const void * source, size_t num_bytes, enum cudaMemcpykind kind);

cudaMemcpyKind can be one of the following types:

  • cudaMemcpyHostToHost : determines that the source and destination are host pointers.

  • cudaMemcpyHostToDevice : determines that the source is a host pointer and the destination is a device pointer.

  • cudaMemcpyDeviceToHost : determines that the source is a device pointer and the destination is a host pointer.

  • cudaMemcpyDeviceToDevice : determines that the source and destination are device pointers.

These API entries always block the user thread. If the source and destination does not match the directionality determined by kind , the result is undefined.

The asynchronous version of data transfer can be called using the following function:

cudaError_t cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )

The above function is non-blocking with respect to the host, so calling the function may return before completing the copy. The default behaviour of the above function is independent from the stream, however, it is possible to associate the copy to a stream by passing a non-zero stream argument. When the stream is specified using the cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost may result in overlapping copy with other operations in other streams.

CUDA pinned memory

By default, CUDA allocated memory is pageable virtual memory. However, CUDA allows programmers to pin the host memory. When the host memory is not pinned, the GPU device has no control over the host memory in the event of a page fault on the host memory.

Therefore, whenever the data transfer function is called, the CUDA driver allocates a temporary pinned memory and copies the data to the pinned memory and then transfers the data from the pinned memory to the device. Once the memory has been pinned on the host side, it is not pageable from the host side anymore.

If the user provides pinned host memory, CUDA will not allocate temporary pageable memory, eliminating the extra intermediate copy operation. To allocate pinned memory in CUDA, the following function can be used:

cudaError_t cudaMallocHost(void** dev_ptr, size_t num_bytes )

num_bytes represents the number of allocated bytes and dev_ptr is the actual non-pageable host pointer which is accessible from the device. Since this memory is directly accessible from the device, the read/write bandwidth is higher than that of pageable memory.

CUDA zero-copy memory

CUDA zero-copy memory is pinned host memory mapped into the device address space. CUDA zero-copy will avoid the explicit data movement between host and device. Although zero-copy improves the PCIe transfer rates, it is required to be synchronized whenever data is shared between host and device. Otherwise, the behavior is undefined. Zero-copy memory can also be useful when there is not enough memory available on device. Zero-copy memory can be allocated by calling the following function:

cudaError_t cudaHostAlloc(void**ptr_host, size_t num_bytes, unsigned int flags)

The possible flags are : * cudaHostAllocDefault : this is the equivalent of CUDA pinned memory.

  • cudaHostAllocPortable : allocates pinned memory that can be accessed by all CUDA contexts.

  • cudaHostAllocWriteCombined: allocates a pinned memory useful to create a buffer to be written by the host and read by the device.

  • cudaHostAllocMapped : allocates host memory mapped into the device address space.

Calling the following function provides the device pointer for mapped, pinned memory:

cudaError_t cudaHostGetDevicePointer(void **ptr_device, void *ptr_host,
unsigned int flags);

If an NVIDIA GPU has the ability to access host registered memory at the same virtual address as the CPU, the ptr_device will match the ptr_host pointer, meaning that the memory can be accessed on the device by using the same host pointer ptr_host. Otherwise, the device pointer may or may not match the original host pointer.

Although zero-copy memory avoids explicit memory transfers between host and device by allocating pinned host memory, the kernel performance generally suffers from high-latency accesses to zero-copy memory over the PCIe bus.

CUDA Memory Access Patterns

Global Memory is a logical memory space residing in DRAM. Kernel code accesses the data between DRAM and SM on-chip memory using L1 and L2 cache as a shortcut whenever it is possible. The data transfer is either a 128-byte or 32-byte memory transaction. If only L2 cache is used, the access to data will be 32-byte memory transaction. However, if both L1 and L2 are used the access to data would be a 128-byte memory transaction. The L1 cache can be enabled or disabled manually, depending on the type of GPU architecture. The memory load is considered as cached load if the L1 cache is enabled. The L1 cache can be explicitly enabled or disabled by passing the -Xptxas -dlcm=FLAG, where the FLAG can either be ca to enable the L1 cache or cg to disable it. However, on some NVIDIA GPU architectures such Kepler K10, K20, and K20x, the L1 cache is disabled by default, in order to be used for caching register spills to local memory. Knowing the data size in L1 and L2 cache in CUDA, in the following sub sections we explain the concept of aligned memory access and coalesced memory access.

Aligned memory access

Memory accesses are called aligned when the address referenced in device memory is a multiple of 32 bytes for L2 cache or 128 bytes for L1 cache. If the memory load is misaligned, the memory bandwidth usage is reduced.

Coalesced memory access

Memory accesses are called coalesced if all 32 threads in a warp access a contiguous chunk of memory. The following image shows coalesced memory access by all threads of a warp.

Coalesced memory

Cache line

The L1 cache line is 128 bytes mapped to a 128 byte aligned segment in device memory. Therefore, an aligned coalesced access of 4 bytes per thread in a warp will perfectly map to the L1 cache line and will be loaded to register/shared memory by only one single 128-byte memory transaction.

SYCL Memory Model

The SYCL memory model is based on the OpenCL memory model, but operates at a higher level of abstraction. The key difference of this abstraction is the separation of storage from access to memory which is expressed through the relationship between memory objects (buffers or images) and accessors.

From the device point of view, the SYCL 1.2.1 memory model is fundamentally the same as the OpenCL 1.2 memory model.

From a user's point of view, SYCL creates one memory object in the form of a buffer or image. This memory object can be accessed on the host via a host accessor. The same memory object can be accessed on the device side via a device accessor.

It is important to understand that the SYCL buffer/image is a host API managing the data allocation and data transfer across variable devices. By creating an accessor, we create a requirement on the SYCL buffer/image. This requirement can be allocating memory, synchronization between two accessors, and data transfer between host and device. An accessor is an object telling the SYCL buffer/image where the data needs to be. If the host accessor is requested from a buffer/image, it implicitly implies that the data needs to be on the host. So the SYCL buffer/image allocates the data on the host. If a device accessor is requested from a SYCL buffer/image, it implicitly implies that the data needs to be on the device. Therefore, the SYCL buffer/image allocates the memory on the device. The SYCL buffer/image also handles the data transfer and synchronization, based on the accessor mode. For further information about data transfer using access modes, please refer to the SYCL memory transfer section.

Private Memory

In SYCL, private memory is a region of memory allocated per work item, visible only to that work item. Private memory is used for kernel parameters and local stack variables. Private memory cannot be accessed from the host.

Private memory is typically mapped to hardware registers. There is no mechanism in SYCL to query the number of registers available to a particular device. Developers must refer to the documentation of the SYCL implementation and/or the hardware vendor to understand the limits of private memory.

Local Memory

Local memory is a contiguous region of memory allocated per work group, and visible to all work items within that work group. Local memory is allocated and accessed using an accessor on submission. This memory cannot be accessed from the host. From the device, local memory is visible accessed as an array of contiguous elements of a particular type. Multiple arrays of local memory can be allocated on the device. The maximum available local memory depends on the hardware and the kernel configuration.

Global Memory

In SYCL (and OpenCL), global memory represents allocations from a pool of memory visible by all threads of the device. In addition, different devices in the same SYCL context can see this memory. Note that not all devices will be available in all contexts - this is an implementation specific detail that depends on the hardware architecture.

There are two ways of allocating global memory in SYCL: using Buffers or Images.

Buffer Memory

In SYCL, a buffer represents a contiguous region of memory which is managed by the SYCL runtime, and can be accessed on the host and on device. When accessed on host, the runtime may copy it via temporary host storage. When accessed from the device, memory is stored in global address space (globally visible to all work items) or the constant address space (globally visible to all work items, but read-only) if the runtime works-out that the memory is read-only.

Accessing memory on the host requires construction of a host accessor, whereas on the device it is accessed via command group accessors.

Once executing inside a device kernel or in a host accessor, the buffer is represented as an array of contiguous elements of a particular type.

In the following code snippet, we access the buffer buff_a on the host side by creating a host accessor host_acc, and access the same buffer buff_a on the device side by using the device accessor dev_acc on the device.


// creating a SYCL buffer which a data storage.
cl::sycl::buffer<float, 1> buff_a(cl::sycl::range{100});

// geting a host access on the buff_a with write permission
auto host_acc = buff_a.get_access<cl::sycl::access::mode::write>();

// using host accessor to fill the data in the buffer
for(int i=0; i< 100; i++) {
  host_acc[i]=1.0f;
}

cl::sycl::queue q;

q.submit([&](cl::sycl::handler& cgh){
  // creating a device accessor on the buffer_a with write permission
  auto dev_acc = buff_a.get_access<cl::sycl::access::write>(cgh); 
  cgh.parallel_for<class hellow_word>(cl::sycl::range<1>{100}, [=](cl::sycl::item<1> id) {
     // using device accessor inside the kernel to fill the data
     device_acc[id] +=1.0f; 
  }); 
});

Image Memory

In SYCL, image objects represent a region of memory managed by the SYCL runtime. The data layout of the image memory is deliberately unspecified so as to allow implementations to provide a layout optimal to a given device. When accessed on host, image memory may be stored on temporary host memory. When accessed on device, image data is stored in device image memory, which can often be texture memory if the device supports it. Image memory can also be accessed as an array of images representation.

Accessing image memory on the host is done by constructing an accessor with the host_image access target. Accessing image memory on the device is done by constructing an accessor with the image access target. Accessing image memory on the device in an image array representation is done by constructing an accessor with the image_array access target.

Image memory is accessed via member functions which perform read; sampled read, and write functions on the memory on the device.

Constant Memory

Constant memory is allocated using buffers that are read only or using constant SYCL buffer objects. See the Buffer Memory and Image Memory sections for more details on how to create a buffer or image memory.

SYCL memory Management

In SYCL, memory is managed through buffer and image objects which maintain a memory region across the host and one or more devices. Memory is accessed via accessors: objects which represent a request to access the memory in a particular region with particular properties. Accessors are provided to a command group as a way of describing data dependencies for a kernel function. These data dependencies are used by the SYCL runtime to both satisfy data requirements for kernel functions by making the memory available as requested and to perform data movement optimizations.

Accessors are parameterized by an access target which specifies where the memory is to be accessed and an access mode which specify how the memory is to be accessed. The access target of an accessor can be host_buffer, global_buffer, constant_buffer, host_image, image , image_array or local. The access mode of an accessor can be read, write, read_write, discard_write, discard_read_write and atomic.

SYCL memory allocation

Memory allocation on the device is typically asynchronous in SYCL, and varies according to the implementation where this happens.

SYCL accessors are templated by the cl::sycl::access::target enum class which specify the memory space which memory is to be allocated in. When an accessor is passed to a SYCL kernel, it becomes a requirement for the kernel and is then allocated on the associated memory space for the specified access target.

Access Target Memory Space
cl::sycl::access::target::host_buffer, cl::sycl::access::target::host_image Host memory
cl::sycl::access::target::global_buffer Global memory
cl::sycl::access::target::constant_buffer Constant memory
cl::sycl::access::target::local Local memory
cl::sycl::access::target::image, cl::sycl::access::target::image_array Image memory

SYCL memory transfer

Memory transfers in SYCL are asynchronous, and performed implicitly by the SYCL runtime when an accessor requires access to memory not available already on the device. SYCL accessors are templated by the cl::sycl::access::mode enum class which specifies the mode by which the memory is to be accessed. When an accessor is passed to a SYCL kernel, it becomes a requirement for the kernel, and the access mode effects the data dependencies between SYCL kernels. The implicit transfer, and the rule determining whether or not the data should be transferred is defined by the type of cl::sycl::access::mode.

The table below describes the different access modes available for SYCL accessors, and whether they can trigger a copy into the device memory from the host or not.

SYCL data access Should the data be transferred to/from host/device Is the accessed data modifyable on host/device
cl::sycl::access::mode::read Yes No
cl::sycl::access::mode::write Yes Yes
cl::sycl::access::mode::read_write Yes Yes
cl::sycl::access::mode::discard_write No Yes
cl::sycl::access::mode::discard_read_write No Yes
cl::sycl::access::mode::atomic Yes Yes

SYCL pinned memory

There is no explicit mechanism to request pinned memory in SYCL. The SYCL runtime will always aim to allocate memory in the most optimal way. However, the specific requirements for this varies from one architecture to another and from one OS to another, and there is no guarantee that SYCL devices support pinned memory, so vendor documentation should be consulted for architecture specific requirements.

There are generally two ways which host memory can be allocated: When not using* the cl::sycl::property::buffer::use_host_pointer property, the SYCL runtime will allocate host memory when required. This uses the implementation-specific mechanism, which can attempt to request pinned memory.

  • If the cl::sycl::property::buffer::use_host_pointer property is used, then the SYCL runtime will not allocate host memory and will use the pointer provided when the buffer is constructed. In this case, it is the users responsibility to ensure any requirements for memory allocation to allow pinned memory are satisfied.

Users can manually allocate pinned memory on the host, and hand it over to the SYCL implementation. This will often involve allocating host memory with a suitable alignment and multiple, and sometimes can be managed manually using OS specific operations such as mmap and munmap.

SYCL zero-copy memory

In SYCL there is no explicit mechanism to request zero-copy. If a SYCL buffer is allocated in pinned memory - according to the requirements above - then the SYCL runtime will attempt to initialize with zero-copy if possible.

SYCL Memory Access Patterns

The SYCL nd_range class describes the range over which the SYCL kernel is to be executed. SYCL buffer and image memory objects, alongside subsequent accessor objects, can all be 1, 2 or 3 dimensional. The memory is still allocated as a single contiguous buffer of memory, therefore individual memory positions can be accessed both using the subscript operator or by manually computing the offset into memory.

  • When buffer and local accessors are accessed via a multi-dimensional id, data is accessed in row-major order.

  • When image accessors are accessed via a multi-dimensional id, data is accessed in column-major order.

The following picture demonstrates accessing the position of (0,1) for buffer (buffer_acc) and image (image_acc). As demonstrated in this picture, buffer_acc[0][1] and image_acc[0][1] point to different locations for coordinate (0,1). This is because buffer_acc[0][1] access the location by using row-major order, while image_acc[0][1] access the location by using column-major order.

Accessing the position of (0, 1) for on image accessor and buffer accessor.

Alternatively, the position w.r.t the origin for an id index can be calculated manually from the id and range objects.

Aligned memory access

Memory access is said to be aligned when the first address of a memory allocation on the device is an even multiple of the global cache line size. In SYCL the global memory cache line size can be queried from a device, as shown below.

Aligned access typically increases performance.

auto cacheLineSize = device.get_info<
  cl::sycl::info::device::global_mem_cache_line_size>();

Coalesced memory access

Memory access is called coalesced in SYCL if contiguously executed work items access a contiguous chunk of memory. This depends on the mapping between the execution order of work items and the linearization of the id being used to access memory. There is no guarantee as to the execution order of work items: this may vary from one architecture to another. Refer to the vendor documentation to find the right mapping between the hardware access pattern and the memory linearization.

Similarities and differences

Memory model comparison for CUDA and SYCL

The following table shows the equivalence between the different components of the memory architecture in SYCL and CUDA.

CUDA name SYCL name
Per-thread local memory Private memory
Per-block shared memory Local memory
Global memory Global memory
Constant memory Constant memory
Texture memory Image memory
Local-memory N/A

Note that developers cannot make assumptions as to where the different memory is actually placed in the hardware when using SYCL. Implementation documentation shows where in the actual hardware the different memories are mapped. However, which architectural element can access the different memories remains the same across all architectures, e.g private-memory is only visible by the work-item it belongs, whereas global memory is visible by all work-items in the ND-range.

Using C-style pointers on CUDA and SYCL

It is important to note that CUDA memory allocations are handled using C-style pointers that are not accessible directly on the host, and forces users to handle the host and the device copy of the information independently - except when using either CUDA unified memory or CUDA virtual addressing, see the sections below.

The following example illustrates a trivial CUDA memory allocation.

 int *dev_a;
 // allocate the memory on the GPU
 HANDLE_ERROR( cudaMalloc( (void**)&dev_a, N * sizeof(int) ) );

The calls to cudaMalloc will allocate dev_a, dev_b and dev_c pointers, which will be accessible only from the device. Pointer arithmetic is possible, so dev_a + 3 will point to the 3rd integer in the array on the device. Also, the pointer address is guaranteed to be the same across multiple kernel executions, since allocations are performed directly by the hardware low-level driver.

This is different from the approach to memory allocation exposed by SYCL. In SYCL, memory allocation is handled via C++ objects that encapsulate the implementation-specific behavior. The developer must use accessors to specify when using the data on the host or the device. This enables the SYCL runtime to schedule optimal memory synchronization operations (e.g, copy or DMA operations).

  cl::sycl::buffer<int, 1> bufA(range<1>(N));

The buffer above can be accessed on the host and the device using a host or a device accessor. The buffer objects are not pointers, hence pointer arithmetic is not possible. The pointer that can be obtained when using the accessor is only valid for the duration of the access, as there is no guarantee the underlying memory is the same between multiple accesses.

This mechanism is typically more suited to C++ code-bases, as it is type-safe and enables further runtime optimizations. However, in order to port existing CUDA code, it may not be possible to replace the existing code completely to use this mechanism. For this reason, we provide a number of utilities in the computecpp-sdk that help to expose a CUDA-like pointer to the legacy application whilst keeping SYCL objects underneath. We focus on the Software-managed virtual pointer since it is the one that has been used to port Eigen and Tensorflow to SYCL. When using the vptr utility, users can create a PointerMapper structure that holds the mapping of virtual addresses to SYCL buffers. This enables using C-style malloc and free functions that will return non-dereferenciable pointers that can be used in pointer arithmetic. The example below, extracted from the computecpp-sdk, illustrates its usage:

The following code shows how to mimic the CUDA-like pointer in SYCL, by creating a C-style pointer and applying arithmetic operations on the pointer without dereferencing it on the host side.

// Create the Pointer Mapper structure
PointerMapper pMap;
// Create a SYCL buffer of 10 floats
// This pointer is a number that identifies the buffer
// in the pointer mapper structure
float * a = static_cast<float *>(SYCLmalloc(10 * sizeof(float), pMap));
// Create a SYCL buffer of 25 integers
int * b = static_cast<int *>(SYCLmalloc(25 * sizeof(int), pMap));
// Create a pointer to the 5th element
// This simply adds 5 * sizeof(float *) to the base address.
float * c = a + 5;
// Retrieve the buffer
assert(pMap.get_buffer(a) == pMap.get_buffer(c))
// Substracting the value of the offset from the base address of the
// buffer recovers the offset into it
assert(pMap.get_offset(c) == 5 * sizeof(float))

// Invalid usage: no-dereference on the host
// float myVal = *c;
// Valid access on host: Use host-accessor
{
  auto syclAcc = pMap.get_buffer(a).get_access<access::mode::read>();
  float myVal = syclAcc[0];
}

// Free the pointers
SYCLfree(a, pMap);
SYCLfree(b, pMap);

CUDA Unified virtual addressing

When a CUDA application is built on a 64-bit system, on hardware supporting the compute capability 2.0 and higher, the host and multiple devices share a common virtual address space. This is possible due to the interaction of the CUDA application with the NVIDIA system driver. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices are within this virtual address range. This enables detecting the device that contains a given pointer by using specific query functions, and using specific host allocation functions. Once the device is found, it is possible to enable automatic porting of data across host and device. This feature requires a specific system-level driver provided by the vendor, and it is not available on SYCL 1.2.1 (or OpenCL 1.2). Note, however, that SYCL objects do automatically handle the transition between the host and the device, and no explicit copy operation is required (as opposed to OpenCL).

In more recent versions of SYCL and OpenCL, new API features partially enable this feature. However, hardware support is not widely available.

CUDA unified memory

CUDA unified memory enables using the same pointer across host and device, and provides memory management services to a wide range of programs (either CUDA runtime API or directly from within the kernel). It enables, for example, kernels to trigger page-faults to read memory from the host. It can also automatically alter the memory allocation for cases where the requested size of memory is larger than the available memory size.

Unified Memory offers a “single-pointer-to-data” model that is conceptually similar to CUDA’s zero-copy memory. One key difference between the two is that with zero-copy allocations, the physical location of memory is pinned in CPU system memory such that a program may have fast or slow access to it depending on where it is being accessed from. Unified Memory, on the other hand, decouples memory and execution spaces so that all data accesses are fast.

CUDA unified memory is also called managed memory. There are two ways to create managed memory.

  • By using __managed__ attribute in front of memory declaration. This can only be done in global-scope or file-scope.

The following code snippet shows how to define __managed_memory__ in CUDA.


__device__ __managed__ int y;

//...

__global__ void custom_kernel()
{
  y++;
}

int main(){

  //....

  y=1;

  //...

}

  • By using CUDA runtime functions.

cudaError_t cudaMallocManaged(void **devPtr, size_t size, unsigned int flags=0);

The following code shows how to define managed memory using CUDA runtime functions.

__global__ void custom_kernel(int * dev_a)                                                 
{
  for(int i=0; i<N; i++) {
    dev_a[i] += i;
  }

}

int main() {

  //...
  int *dev_a;
  // allocate the memory on the GPU
  cudaMallocManaged( (void**)&dev_a, N * sizeof(int) );

  for(int i=0; i<N; i++) {
    dev_a[i] =i;
  }

  //....
}

All CUDA operations that are valid on device memory are also valid on managed memory. The primary difference is that the host is also able to reference and access managed memory.

Like in the case of CUDA unified virtual addressing, this feature requires a specific system-level driver provided by the vendor, and it is not available on SYCL 1.2.1 (or OpenCL 1.2). Note, however, that SYCL objects do automatically handle the transition between the host and the device, and no explicit copy operation is required (as opposed to OpenCL).

In newer versions of SYCL and OpenCL, new API features will partially enable this feature. In particular, SVM (Shared Virtual Memory), enables equivalent behavior.