This document details how to use CUDA-aware MPI with the DPC++ CUDA backend.
Prerequisites
In this document we assume that you have a working installation of the Intel oneAPI DPC++ compiler supporting the CUDA backend. For instructions on how to build DPC++ refer to the getting-started-guide. You will also need a CUDA-aware build of an MPI implementation, with the compiler wrapper (e.g. mpicxx) built to point to your DPC++ compiler.
Using MPI with the CUDA backend
Compiling and running applications
The send_recv_buff.cpp and send_recv_usm.cpp samples are introductory examples of how CUDA-aware MPI can be used with DPC++, using either buffers or USM. Using buffers with CUDA-aware MPI requires that the MPI calls be made within a host_task; see the send_recv_buff.cpp sample for full details. When using SYCL USM with MPI, users should always call the MPI function directly from the main thread; calling MPI functions that take SYCL USM from within a host_task is currently undefined behavior. In addition, the scatter_reduce_gather.cpp sample demonstrates how MPI can be used together with the SYCL 2020 reduction and parallel_for interfaces for simple yet optimized multi-rank reductions.
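To illustrate the two patterns described above, the following sketch contrasts the buffer/host_task approach with the direct USM approach. It is a hedged outline, not a copy of the samples: it assumes an initialized MPI environment, a sycl::queue q on a CUDA device, and a matching MPI_Recv posted on the peer rank.

```cpp
// Hedged sketch of the two usage patterns; assumes MPI is initialized,
// q is a sycl::queue on a CUDA device, and rank 1 posts a matching recv.
#include <mpi.h>
#include <sycl/sycl.hpp>

constexpr size_t N = 1024;

void send_with_buffer(sycl::queue &q, sycl::buffer<double, 1> &buff) {
  // Buffers: the MPI call must be made inside a host_task, using the
  // interop handle to obtain the native device pointer for the accessor.
  q.submit([&](sycl::handler &h) {
    sycl::accessor acc{buff, h};
    h.host_task([=](sycl::interop_handle ih) {
      auto *ptr = reinterpret_cast<double *>(
          ih.get_native_mem<sycl::backend::ext_oneapi_cuda>(acc));
      MPI_Send(ptr, N, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
    });
  });
  q.wait();
}

void send_with_usm(sycl::queue &q, double *devPtr) {
  // USM: call MPI directly from the main thread, after any kernels that
  // produce the data have completed. Never pass SYCL USM to MPI from
  // inside a host_task -- that is currently undefined behavior.
  q.wait();
  MPI_Send(devPtr, N, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
}
```

Note the difference in synchronization: the buffer version relies on the SYCL runtime ordering the host_task after any kernels writing the buffer, while the USM version must wait on the queue explicitly before calling MPI.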
In order to compile the samples your compiler wrapper (e.g. mpicxx) must point to your DPC++ compiler.
Firstly, make sure that you have the wrapper in your path:
$ export PATH=/path/to/your-mpi-install/bin:$PATH
Then you can compile a sample:
$ mpicxx -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_xx send_recv_usm.cpp -o ./res
Where “sm_xx” indicates the Compute Capability of the target device (for example, sm_80 for an NVIDIA A100).
The samples require two ranks to execute correctly. To run the samples simply follow standard MPI invocations, e.g.:
$ mpirun -n 2 ./res
Where “-n 2” indicates that two MPI ranks are used. Note that this will run two ranks using a single GPU unless the user makes specific changes to the sample or sets special environment variables for each rank. See the following section for more details.
Mapping MPI ranks to specific GPUs
For CUDA-aware MPI within a single node, users will need the ability to control whether each rank uses a unique GPU. One way to do this is to map a given rank to a given GPU within the program itself. Currently, in the ext_oneapi_cuda backend each CUDA device has its own platform. The vector of platforms returned from sycl::platform::get_platforms() is ordered by CUDA device ID, with the lowest available ID coming first. This means that if devices 0 and 1 are available, we can assign these devices to ranks 0 and 1 in a two-rank MPI program by initializing the queues in the following way:
std::vector<sycl::device> Devs;
for (const auto &plt : sycl::platform::get_platforms()) {
  if (plt.get_backend() == sycl::backend::ext_oneapi_cuda)
    Devs.push_back(plt.get_devices()[0]);
}
// rank is this process's MPI rank, e.g. obtained from MPI_Comm_rank.
sycl::queue q{Devs[rank]};
NOTE: In the long-term CUDA devices will be contained within a single platform, and the above code will have to be adjusted.
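The rank used to index Devs above is the process's MPI rank. A minimal sketch of how a program would typically obtain it and build the per-rank queue (assuming MPI has not yet been initialized elsewhere):

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's rank
  MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of ranks

  // Collect one device per CUDA platform, as shown above.
  std::vector<sycl::device> Devs;
  for (const auto &plt : sycl::platform::get_platforms()) {
    if (plt.get_backend() == sycl::backend::ext_oneapi_cuda)
      Devs.push_back(plt.get_devices()[0]);
  }
  sycl::queue q{Devs[rank]}; // one GPU per rank

  // ... kernels and MPI communication ...

  MPI_Finalize();
  return 0;
}
```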
If only NVIDIA GPUs are available, an identical list can be populated in a simpler way:
std::vector<sycl::device> Devs =
    sycl::device::get_devices(sycl::info::device_type::gpu);
An alternative means of mapping an MPI rank to a unique GPU is to set the CUDA_VISIBLE_DEVICES environment variable to a single device index for each MPI process. A common solution is to set the value of CUDA_VISIBLE_DEVICES equal to the local MPI rank ID. Consult the documentation of your MPI implementation to find out how to retrieve the local rank ID.
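For example, Open MPI exposes the local rank as OMPI_COMM_WORLD_LOCAL_RANK. A minimal wrapper-script sketch follows; the script name and the environment variable are implementation-specific assumptions (MPICH and Intel MPI expose MPI_LOCALRANKID instead):

```shell
#!/bin/bash
# select_gpu.sh -- hedged sketch: pin each local rank to one GPU.
# OMPI_COMM_WORLD_LOCAL_RANK is Open MPI specific; adjust the variable
# name for your MPI implementation (e.g. MPI_LOCALRANKID), and default
# to device 0 if it is unset.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
exec "$@"
```

Launched as, e.g., mpirun -n 2 ./select_gpu.sh ./res, each rank then sees only its own GPU as device 0.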
Finally, if you are using a SLURM system, you can achieve the same effect as with CUDA_VISIBLE_DEVICES using the GPU affinity flag --gpu-bind. Consult the SLURM documentation for details.
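For instance, a two-rank launch binding one GPU per task might look like the following; the exact flag syntax varies across SLURM versions, so treat this as a sketch:

```shell
# Hedged example: two tasks, one GPU bound to each.
# Check your SLURM version's docs for the supported --gpu-bind modes.
srun -n 2 --gpus-per-task=1 --gpu-bind=single:1 ./res
```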
Current limitations
Using CUDA-aware MPI with DPC++ currently does not support SYCL shared USM allocations for inter-node MPI.