MPI and SYCL are standards that seamlessly work together when a chosen backend supports implementations of both standards. Programmers can combine SYCL with GPU-aware MPI to write 100% portable code across a range of backends. In fact it is more appropriate to use a more general term, “device-aware” MPI, to encompass MPI that supports any kind of SYCL device allocated memory, including that of a CPU backend.
There are still some minor backend specific considerations to be made when using MPI + SYCL code. This guide details such issues for the cuda
or hip
backends of DPC++.
For specific issues with using GPU-aware MPI with Intel GPUs (using the level_zero
backend) consult appropriate Intel documentation.
Using GPU-aware MPI with either the cuda
or hip
backends of DPC++ is very straightforward and similar to how it is used with native cuda (nvcc
) or hip (hipcc
) compilers.
The required compilation invocations are almost identical to those used with these native compilers. If you are using documentation that details how to use GPU-aware MPI with a native compiler, specialized for a particular computing cluster,
then migrating instructions to icpx
is often as simple as replacing native compiler invocations with icpx
(and adjusting any associated (e.g. architecture specifier) compiler flags accordingly). The following sections of this document provide more complete details.
Prerequisites
In this document we assume that you have a working installation of the Intel oneAPI DPC++ compiler supporting the hip
backend. Refer to Install oneAPI for AMD GPUs (beta) for instructions on how to install Codeplay oneAPI Plugins supporting this backend.
You will also need a ROCM-aware (for hip
) build of an MPI implementation. Note that the Intel oneAPI toolkit comes with an Intel implementation of MPI which is not CUDA/ROCM-aware.
You will need to use a different implementation, for example OpenMPI or MPICH built from source with ROCM-awareness enabled, or a CRAY MPICH module with appropriate device acceleration.
The code examples used in this guide are available in the SYCL-samples repository.
Using MPI with Codeplay oneAPI Plugins
The send_recv_buff.cpp and send_recv_usm.cpp samples are introductory examples of how device-aware MPI can be used with DPC++ using either buffers or USM.
The use of buffers with device-aware MPI requires that the user make the MPI calls within a host_task
: see the send_recv_buff.cpp sample for full details.
When using SYCL USM with MPI, users should always call the MPI function directly from the main thread; calling MPI functions that take SYCL USM from within a host_task
is currently undefined behavior.
In addition, the scatter_reduce_gather.cpp sample exhibits how MPI can be used together with the SYCL 2020 reduction
and parallel_for
interfaces for optimized but simple multi-rank reductions.
Compile using an MPI wrapper
In order to compile the samples, your compiler wrapper (e.g. mpicxx
) must point to your DPC++ compiler (icpx
). Consult the documentation of your MPI implementation on how to build it this way. It may also be possible to change the default compiler without rebuilding using command-line arguments or environment variables (e.g. MPICH_CXX
, OMPI_CXX
) depending on the implementation.
Firstly, make sure that you have the wrapper in your path:
export PATH=/path/to/your-mpi-install/bin:$PATH
Then you can compile a sample:
mpicxx -fsycl -fsycl-targets=TARGET send_recv_usm.cpp -o ./res
Refer to the get-started-guide for details on how to correctly specify TARGET
in the hip
backend.
The samples require two ranks in order to execute correctly. To run the samples simply follow standard MPI invocations, e.g.:
mpirun -n 2 ./res
Where -n 2
indicates that two MPI ranks are used. Note that this will run two ranks using a single device unless the user makes specific changes to the sample or sets special environment variables for each rank.
Compile using Cray modules
Some clusters have a Cray MPICH module available that can be used to link with DPC++ directly. In order to do this you will generally have to ensure that a hardware specific Cray module is also loaded, e.g.
module load craype-accel-amd-gfx90a
Then you can compile directly with icpx
. You will need to include/link appropriate Cray MPICH libraries. The correct libraries to link will depend on the cluster you use, and you should consult appropriate documentation, but they might look something like this:
icpx -fsycl -fsycl-targets=TARGET send_recv_usm.cpp -o res -I/opt/cray/pe/mpich/8.1.25/ofi/cray/10.0/include/ -L/opt/cray/pe/mpich/8.1.25/ofi/cray/10.0/lib -lmpi -o res
You may also have to set the following environment variable:
MPICH_GPU_SUPPORT_ENABLED=1
You will then be able to execute the program using standard job submission instructions that will depend on your particular cluster.
Mapping MPI ranks to specific devices
For device-aware MPI within a single node, users will need the ability to control whether or not each rank uses a unique device. One way to do this is to map a given rank to a given device within the program itself.
std::vector<sycl::device> Devs;
for (const auto &plt : sycl::platform::get_platforms()) {
if (plt.get_backend() == sycl::backend::hip) {}
Devs=plt.get_devices();
break;
}
}
sycl::queue q{Devs[rank]};
If only AMD GPUs are available, an identical list can be populated in a simpler way:
std::vector<sycl::device> Devs =
sycl::device::get_devices(sycl::info::device_type::gpu);
An alternative means of mapping an MPI rank to a unique GPU is to set the HIP_VISIBLE_DEVICES environment variable to a single value for each MPI process.
A common solution is to set the value of HIP_VISIBLE_DEVICES
equal to the local MPI rank ID. Consult the documentation of your MPI implementation to find out how to retrieve the local rank ID.
Finally, if you are using a Slurm system you may be able achieve similar effects as with HIP_VISIBLE_DEVICES
using the GPU affinity flag --gpu-bind
. However note that the gpu-bind options are compute cluster specific and you should consult the specific documentation of your cluster.
You may wish to also consult the Slurm documentation.