MPI and SYCL are standards that work together whenever the chosen backend supports implementations of both. Programmers can combine SYCL with GPU-aware MPI to write code that is portable across a range of backends. In fact, the more general term “device-aware” MPI is more appropriate, since it encompasses MPI implementations that support any kind of SYCL device-allocated memory, including that of a CPU backend.
There are still some backend-specific considerations to be made when using MPI + SYCL code. This guide details such issues for the cuda and hip backends of DPC++. For specific issues with using GPU-aware MPI with Intel GPUs (using the level_zero backend), consult the appropriate Intel documentation.
Using GPU-aware MPI with either the cuda or hip backend of DPC++ is straightforward and similar to how it is used with the native CUDA (nvcc) or HIP (hipcc) compilers. The required compilation invocations are almost identical to those used with these native compilers. If you are following documentation that details how to use GPU-aware MPI with a native compiler, specialized for a particular computing cluster, then migrating the instructions to icpx is often as simple as replacing the native compiler invocations with icpx and adjusting any associated compiler flags (e.g. architecture specifiers) accordingly. The following sections of this document provide more complete details.
Prerequisites
In this document we assume that you have a working installation of the Intel oneAPI DPC++ compiler supporting the cuda backend. Refer to Install oneAPI for NVIDIA GPUs for instructions on how to install the Codeplay oneAPI Plugins supporting this backend.
You will also need a CUDA-aware build of an MPI implementation. Note that the Intel oneAPI toolkit comes with an Intel implementation of MPI which is not CUDA/ROCm-aware. You will need to use a different implementation, for example OpenMPI or MPICH built from source with CUDA-awareness enabled, or a Cray MPICH module with appropriate device acceleration.
The code examples used in this guide are available in the SYCL-samples repository.
Using MPI with Codeplay oneAPI Plugins
The send_recv_buff.cpp and send_recv_usm.cpp samples are introductory examples of how device-aware MPI can be used with DPC++ using either buffers or USM.
The use of sycl::buffer with device-aware MPI requires that the user make the MPI calls within a host_task: see the send_recv_buff.cpp sample for full details.
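The following condensed sketch (not the full sample) shows the pattern. It assumes the cuda backend, a device-aware MPI build and exactly two ranks; the buffer size, message tag and gpu_selector_v queue are illustrative choices, and it relies on interop_handle::get_native_mem to obtain the native allocation backing the buffer.

// Condensed sketch, not the full send_recv_buff.cpp sample: rank 0 sends the
// contents of a sycl::buffer to rank 1, with the MPI calls issued inside a
// host_task. Assumes the cuda backend and a device-aware MPI build; size,
// tag and device selector are illustrative.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  sycl::queue q{sycl::gpu_selector_v};
  constexpr int N = 1024;
  std::vector<double> host(N, rank == 0 ? 1.0 : 0.0);
  {
    sycl::buffer<double> buf{host.data(), sycl::range<1>{N}};
    q.submit([&](sycl::handler& h) {
      sycl::accessor acc{buf, h, sycl::read_write};
      h.host_task([=](sycl::interop_handle ih) {
        // Reinterpret the native memory handle as a device pointer that a
        // device-aware MPI implementation can consume directly.
        auto* ptr = reinterpret_cast<double*>(
            ih.get_native_mem<sycl::backend::ext_oneapi_cuda>(acc));
        if (rank == 0)
          MPI_Send(ptr, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
          MPI_Recv(ptr, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);
      });
    });
    q.wait();
  } // destroying the buffer copies the received data back into `host`

  MPI_Finalize();
  return 0;
}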
When using SYCL USM with MPI, users should always call the MPI function directly from the main thread; calling MPI functions that take SYCL USM from within a host_task is currently undefined behavior.
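A condensed sketch of the USM pattern follows (not the full send_recv_usm.cpp sample). It assumes a device-aware MPI build and exactly two ranks; the size, fill value, tag and gpu_selector_v queue are illustrative.

// Condensed sketch, not the full send_recv_usm.cpp sample: rank 0 fills a
// device USM allocation and sends it directly to rank 1. The MPI calls are
// made from the main thread, never from inside a host_task.
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  sycl::queue q{sycl::gpu_selector_v};
  constexpr int N = 1024;
  double* dev = sycl::malloc_device<double>(N, q);

  if (rank == 0) {
    q.fill(dev, 1.0, N).wait();  // populate the device allocation
    MPI_Send(dev, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    // Receive directly into device memory on rank 1.
    MPI_Recv(dev, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  }

  sycl::free(dev, q);
  MPI_Finalize();
  return 0;
}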
In addition, the scatter_reduce_gather.cpp sample shows how MPI can be used together with the SYCL 2020 reduction and parallel_for interfaces for optimized yet simple multi-rank reductions.
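As a condensed illustration rather than the full sample, each rank might reduce its local data on the device with a SYCL 2020 reduction and then combine the per-rank partial results with MPI_Allreduce; the sizes, values and plus<double> operator below are illustrative assumptions.

// Condensed illustration, not the full scatter_reduce_gather.cpp sample:
// each rank computes a local sum on its device with a SYCL 2020 reduction,
// then combines the per-rank partial sums with MPI_Allreduce.
#include <mpi.h>
#include <sycl/sycl.hpp>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  sycl::queue q{sycl::gpu_selector_v};
  constexpr size_t N = 1 << 20;
  double* data = sycl::malloc_device<double>(N, q);
  q.fill(data, 1.0, N).wait();

  double local_sum = 0.0;
  {
    sycl::buffer<double> sum_buf{&local_sum, sycl::range<1>{1}};
    q.submit([&](sycl::handler& h) {
      auto sum = sycl::reduction(sum_buf, h, sycl::plus<double>());
      h.parallel_for(sycl::range<1>{N}, sum,
                     [=](sycl::id<1> i, auto& acc) { acc += data[i]; });
    });
  } // destroying the buffer writes the reduced value back to local_sum

  double global_sum = 0.0;
  MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);

  sycl::free(data, q);
  MPI_Finalize();
  return 0;
}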
Compile using an MPI wrapper
In order to compile the samples, your compiler wrapper (e.g. mpicxx) must point to your DPC++ compiler (icpx). Consult the documentation of your MPI implementation on how to build it this way. It may also be possible to change the default compiler without rebuilding, using command-line arguments or environment variables (e.g. MPICH_CXX, OMPI_CXX), depending on the implementation.
Firstly, make sure that you have the wrapper in your path:
export PATH=/path/to/your-mpi-install/bin:$PATH
Then you can compile a sample:
mpicxx -fsycl -fsycl-targets=TARGET send_recv_usm.cpp -o ./res
Refer to the get-started-guide for details on how to correctly specify TARGET for the cuda backend.
The samples require two ranks in order to execute correctly. To run the samples simply follow standard MPI invocations, e.g.:
mpirun -n 2 ./res
Here, -n 2 specifies that two MPI ranks are used. Note that this will run both ranks on a single device unless you modify the sample or set appropriate environment variables for each rank (see “Mapping MPI ranks to specific devices” below).
Compile using Cray modules
Some clusters have a Cray MPICH module available that can be used to link with DPC++ directly. In order to do this you will generally have to ensure that a hardware-specific Cray module is also loaded, e.g.:
module load craype-accel-nvidia80
Then you can compile directly with icpx. You will need to include and link the appropriate Cray MPICH libraries. The correct libraries to link will depend on the cluster you use, and you should consult the appropriate documentation, but the invocation might look something like this:
icpx -fsycl -fsycl-targets=TARGET send_recv_usm.cpp -I/opt/cray/pe/mpich/8.1.25/ofi/cray/10.0/include/ -L/opt/cray/pe/mpich/8.1.25/ofi/cray/10.0/lib -lmpi -o res
You may also have to set the following environment variable:
export MPICH_GPU_SUPPORT_ENABLED=1
You will then be able to execute the program using standard job submission instructions that will depend on your particular cluster.
Mapping MPI ranks to specific devices
For GPU-aware MPI, users need to be able to control which GPU each rank uses. MPI interfaces such as MPI_Send do not specify which GPU device is targeted to receive the data; they only specify the MPI rank. In the hip and cuda backends it is currently necessary to use environment variables for this mapping in order to prevent memory leaks. For example, if you wish to use a 1:1 mapping between MPI processes and GPUs, it is recommended to use an appropriate environment variable to expose only a single, unique GPU to each process.
Refer to the “Controlling Nvidia devices exposed to DPC++” section of the get-started-guide for details of how to use environment variables ONEAPI_DEVICE_SELECTOR or CUDA_VISIBLE_DEVICES in order to ensure that each process is only exposed to a device ID matching the local MPI rank ID.
Consult the documentation of your MPI implementation to find out how to retrieve the local rank ID.
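As a sanity check (a sketch, not part of the samples), each rank can count the GPUs it is exposed to after such an environment variable has been set; with a correct 1:1 mapping every rank should report exactly one device.

// Sanity-check sketch, not part of the samples: after restricting visibility
// with ONEAPI_DEVICE_SELECTOR or CUDA_VISIBLE_DEVICES, each rank should see
// exactly one GPU and can simply build its queue on that device.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
  std::cout << "Rank " << rank << " sees " << gpus.size() << " GPU(s): "
            << (gpus.empty()
                    ? std::string("none")
                    : gpus.front().get_info<sycl::info::device::name>())
            << std::endl;

  if (gpus.size() == 1) {
    sycl::queue q{gpus.front()};  // the only device visible to this rank
    // ... device-aware MPI communication as in the earlier sketches ...
  }

  MPI_Finalize();
  return 0;
}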
Finally, if you are using a Slurm system you may be able to achieve similar effects to CUDA_VISIBLE_DEVICES using the GPU affinity flag --gpu-bind. Note, however, that the --gpu-bind options are specific to a particular compute cluster, so you should consult your cluster's documentation. You may also wish to consult the Slurm documentation.
Current limitations
CUDA-aware MPI with DPC++ currently does not support SYCL shared USM for inter-node MPI.
CUDA-aware MPI with DPC++ is currently only tested with a mapping of either a single MPI rank or multiple MPI ranks to a single GPU. Mapping multiple GPUs to a single MPI rank is untested.