This document details how to use CUDA-aware MPI with the DPC++ CUDA backend.
Prerequisites
In this document we assume that you have a working installation of the Intel oneAPI DPC++ compiler supporting the CUDA backend. For instructions on how to build DPC++ refer to the getting-started-guide. You will also need a CUDA-aware build of an MPI implementation, with the compiler wrapper (e.g. mpicxx) built to point to your DPC++ compiler.
Using MPI with the CUDA backend
Compiling and running applications
The send_recv_buff.cpp and send_recv_usm.cpp samples are introductory examples of how CUDA-aware MPI can be used with DPC++, using either buffers or USM. Using buffers with CUDA-aware MPI requires that the MPI calls be made within a host_task; see the send_recv_buff.cpp sample for full details. When using SYCL USM with MPI, users should always call the MPI function directly from the main thread; calling MPI functions that take SYCL USM from within a host_task is currently undefined behavior. In addition, the scatter_reduce_gather.cpp sample demonstrates how MPI can be used together with the SYCL 2020 reduction and parallel_for interfaces for simple yet optimized multi-rank reductions.
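To illustrate the two patterns described above, the following sketch contrasts the buffer/host_task approach with the direct USM approach. It is a hedged outline, not a copy of the samples: it assumes an initialized MPI environment, a sycl::queue q on a CUDA device, and a matching MPI_Recv posted on the peer rank.

```cpp
// Hedged sketch of the two usage patterns; assumes MPI is initialized,
// q is a sycl::queue on a CUDA device, and rank 1 posts a matching recv.
#include <mpi.h>
#include <sycl/sycl.hpp>

constexpr size_t N = 1024;

void send_with_buffer(sycl::queue &q, sycl::buffer<double, 1> &buff) {
  // Buffers: the MPI call must be made inside a host_task, using the
  // interop handle to obtain the native device pointer for the accessor.
  q.submit([&](sycl::handler &h) {
    sycl::accessor acc{buff, h};
    h.host_task([=](sycl::interop_handle ih) {
      auto *ptr = reinterpret_cast<double *>(
          ih.get_native_mem<sycl::backend::ext_oneapi_cuda>(acc));
      MPI_Send(ptr, N, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    });
  });
  q.wait();
}

void send_with_usm(sycl::queue &q, double *devPtr) {
  // USM: call MPI directly from the main thread, after any kernels that
  // produce the data have completed. Never pass SYCL USM to MPI from
  // inside a host_task -- that is currently undefined behavior.
  q.wait();
  MPI_Send(devPtr, N, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
}
```

Note the difference in synchronization: the buffer version relies on the SYCL runtime ordering the host_task after any kernels writing the buffer, while the USM version must wait on the queue explicitly before calling MPI.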
In order to compile the samples your compiler wrapper (e.g. mpicxx) must point to your DPC++ compiler.
Firstly, make sure that you have the wrapper in your path:
$ export PATH=/path/to/your-mpi-install/bin:$PATH
Then you can compile a sample:
$ mpicxx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_xx send_recv_usm.cpp -o ./res
Where “sm_xx” indicates the Compute Capability of the target device (for example, sm_80 for an NVIDIA A100).
The samples require two ranks to execute correctly. To run the samples simply follow standard MPI invocations, e.g.:
$ mpirun -n 2 ./res
Where “-n 2” indicates that two MPI ranks are used. Note that this will run two ranks using a single GPU unless the user makes specific changes to the sample or sets special environment variables for each rank. See the following section for more details.
Mapping MPI ranks to specific GPUs
For CUDA-aware MPI within a single node, users will need the ability to control whether each rank uses a unique GPU. One way to do this is to map a given rank to a given GPU within the program itself. Currently, in the ext_oneapi_cuda backend each CUDA device has its own platform. The vector of platforms returned from sycl::platform::get_platforms() is ordered by CUDA device ID, with the lowest available ID coming first. This means that if devices 0 and 1 are available, we can assign these devices to ranks 0 and 1 in a two-rank MPI program by initializing the queues in the following way:
std::vector<sycl::device> Devs;
for (const auto &plt : sycl::platform::get_platforms()) {
  if (plt.get_backend() == sycl::backend::ext_oneapi_cuda)
    Devs.push_back(plt.get_devices()[0]);
}
// rank is this process's MPI rank, e.g. obtained from MPI_Comm_rank.
sycl::queue q{Devs[rank]};
NOTE: In the long-term CUDA devices will be contained within a single platform, and the above code will have to be adjusted.
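The rank used to index Devs above is the process's MPI rank. A minimal sketch of how a program would typically obtain it and build the per-rank queue (assuming MPI has not yet been initialized elsewhere):

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's rank
  MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of ranks

  // Collect one device per CUDA platform, as shown above.
  std::vector<sycl::device> Devs;
  for (const auto &plt : sycl::platform::get_platforms()) {
    if (plt.get_backend() == sycl::backend::ext_oneapi_cuda)
      Devs.push_back(plt.get_devices()[0]);
  }
  sycl::queue q{Devs[rank]}; // one GPU per rank

  // ... kernels and MPI communication ...

  MPI_Finalize();
  return 0;
}
```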
If only NVIDIA GPUs are available, an identical list can be populated in a simpler way:
std::vector<sycl::device> Devs =
    sycl::device::get_devices(sycl::info::device_type::gpu);
An alternative means of mapping an MPI rank to a unique GPU is to set the CUDA_VISIBLE_DEVICES environment variable to a single device index for each MPI process. A common solution is to set the value of CUDA_VISIBLE_DEVICES equal to the local MPI rank ID. Consult the documentation of your MPI implementation to find out how to retrieve the local rank ID.
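For example, Open MPI exposes the local rank as OMPI_COMM_WORLD_LOCAL_RANK. A minimal wrapper-script sketch follows; the script name and the environment variable are implementation-specific assumptions (MPICH and Intel MPI expose MPI_LOCALRANKID instead):

```shell
#!/bin/bash
# select_gpu.sh -- hedged sketch: pin each local rank to one GPU.
# OMPI_COMM_WORLD_LOCAL_RANK is Open MPI specific; adjust the variable
# name for your MPI implementation (e.g. MPI_LOCALRANKID), and default
# to device 0 if it is unset.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
exec "$@"
```

Launched as, e.g., mpirun -n 2 ./select_gpu.sh ./res, each rank then sees only its own GPU as device 0.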
Finally, if you are using a SLURM system, you can achieve the same effect as with CUDA_VISIBLE_DEVICES using the GPU affinity flag --gpu-bind. Consult the SLURM documentation for details.
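For instance, a two-rank launch binding one GPU per task might look like the following; the exact flag syntax varies across SLURM versions, so treat this as a sketch:

```shell
# Hedged example: two tasks, one GPU bound to each.
# Check your SLURM version's docs for the supported --gpu-bind modes.
srun -n 2 --gpus-per-task=1 --gpu-bind=single:1 ./res
```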
Current limitations
Using CUDA-aware MPI with DPC++ currently does not support SYCL shared USM allocations for inter-node MPI.