CUDA and SYCL Ecosystems
CUDA has been available to developers since early 2007 and has since grown a large ecosystem of libraries and support tools. Developers targeting NVIDIA hardware can draw on many existing libraries, provided either as part of the CUDA toolkit or as separate downloads from the CUDA developer website.
SYCL, on the other hand, has only been public since 2015, and the latest specification was published in December 2017. Because it is a more recent standard, fewer applications and libraries are currently available in the SYCL ecosystem than in CUDA's. This section focuses on libraries that have an equivalent in both CUDA and SYCL, so that developers can compare the two directly when migrating code.
Basic Linear Algebra Subprograms (BLAS) is a specification that describes a set of low-level routines for common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. The routines are organized into three levels: Level 1 BLAS performs scalar, vector and vector-vector operations; Level 2 BLAS performs matrix-vector operations; and Level 3 BLAS performs matrix-matrix operations.
Most system and hardware vendors provide optimized BLAS libraries, because these routines are the building blocks of many other libraries, from linear algebra packages to, more recently, machine learning frameworks.
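To make the three levels concrete, the sketch below gives one simplified routine per level in plain C++. The names ref_dot, ref_gemv and ref_gemm are hypothetical and are not part of any BLAS library; real BLAS routines also take scaling factors, strides and transposition flags that are omitted here, and the column-major storage is an assumption borrowed from the Fortran BLAS convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Level 1 (vector-vector): dot product of x and y. One loop, O(n) work.
float ref_dot(const std::vector<float>& x, const std::vector<float>& y) {
  assert(x.size() == y.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
  return sum;
}

// Level 2 (matrix-vector): y = A * x, where A is m x n in column-major order.
// Two nested loops, O(m * n) work.
std::vector<float> ref_gemv(std::size_t m, std::size_t n,
                            const std::vector<float>& A,
                            const std::vector<float>& x) {
  assert(A.size() == m * n && x.size() == n);
  std::vector<float> y(m, 0.0f);
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t i = 0; i < m; ++i) y[i] += A[i + j * m] * x[j];
  return y;
}

// Level 3 (matrix-matrix): C = A * B, where A is m x k, B is k x n and
// C is m x n, all column-major. Three nested loops, O(m * n * k) work.
std::vector<float> ref_gemm(std::size_t m, std::size_t n, std::size_t k,
                            const std::vector<float>& A,
                            const std::vector<float>& B) {
  assert(A.size() == m * k && B.size() == k * n);
  std::vector<float> C(m * n, 0.0f);
  for (std::size_t j = 0; j < n; ++j)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t i = 0; i < m; ++i)
        C[i + j * m] += A[i + p * m] * B[p + j * k];
  return C;
}
```

The loop depth grows with the level, which is why Level 3 routines offer the most arithmetic per memory access and tend to benefit most from GPU acceleration.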
The BLAS interface defines C and Fortran routines.
NVIDIA cuBLAS is an implementation of BLAS optimized for NVIDIA GPUs. The library supports single- and multi-GPU configurations, and offers the complete BLAS interface for all types. Using cuBLAS requires rewriting source code to include CUDA calls and cuBLAS library calls. An alternative, NVBLAS, is available for Level 3 operations: at runtime it intercepts calls made to a CPU BLAS library and re-routes them to the GPU implementation.
The snippet below initializes data with cuBLAS and performs a general matrix multiplication. More complete examples can be found in the CUDA Code Samples.
    /* Allocate device memory using the standard CUDA allocation call.
     * Note sizeof(d_C[0]): the size of one element, not of the pointer. */
    CHECK_ERROR(cudaMalloc((void **)&d_C, n2 * sizeof(d_C[0])));

    /* Initialize the device vectors with data from the host.
     * cublasSetVector copies n2 elements from host memory to device memory. */
    CHECK_ERROR(cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1));
    CHECK_ERROR(cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1));
    CHECK_ERROR(cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1));

    /* Call Sgemm (single-precision general matrix multiply) in cuBLAS.
     * "handle" is a cuBLAS-specific type that stores the library and
     * device initialization state. */
    CHECK_ERROR(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                            &alpha, d_A, N, d_B, N, &beta, d_C, N));

    /* Allocate host memory for reading back the result from device memory. */
    h_C = (float *)malloc(n2 * sizeof(h_C[0]));

    /* Read the result back. cublasGetVector copies the data from the
     * device to the host. */
    CHECK_ERROR(cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1));
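For readers unfamiliar with the GEMM parameter list, the following plain C++ sketch spells out what a non-transposed single-precision GEMM computes: C = alpha * A * B + beta * C on column-major matrices with explicit leading dimensions. The function reference_sgemm is a hypothetical CPU helper, not a cuBLAS routine; a loop like this could be used, for instance, to check a GPU result against a host computation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for C = alpha * A * B + beta * C with no transposition.
// A is m x k, B is k x n, C is m x n, all stored column-major.
// lda, ldb and ldc are the leading dimensions (the distance in elements
// between the starts of two consecutive columns).
void reference_sgemm(int m, int n, int k, float alpha,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float beta, float* C, int ldc) {
  assert(lda >= m && ldb >= k && ldc >= m);
  for (int col = 0; col < n; ++col) {
    for (int row = 0; row < m; ++row) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        acc += A[row + p * lda] * B[p + col * ldb];
      }
      C[row + col * ldc] = alpha * acc + beta * C[row + col * ldc];
    }
  }
}
```

For square N x N matrices stored contiguously, every dimension and leading dimension is N, which matches the argument pattern of the cublasSgemm call in the snippet.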
SYCL-BLAS is a BLAS interface implementation written in SYCL. SYCL-BLAS leverages C++ expression tree templates to generate SYCL kernels via kernel composition. Expression tree templates are a widely used technique for implementing expressions in C++, and they ease the development and composition of operations. SYCL-BLAS can be tuned for different platforms through compile-time parameters.
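As an illustration of the expression tree technique in general, the sketch below builds a small tree for alpha * x + y out of template nodes and evaluates it in a single loop. It is plain C++ with hypothetical names (VecView, Scale, Add, scale, add, evaluate) and does not reproduce the actual SYCL-BLAS classes; the single evaluation loop stands in for the one fused kernel that a composed expression could produce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Leaf node: a non-owning view over an existing vector.
struct VecView {
  const std::vector<float>& data;
  float operator[](std::size_t i) const { return data[i]; }
  std::size_t size() const { return data.size(); }
};

// Interior node: multiplies every element of a sub-expression by alpha.
template <typename Expr>
struct Scale {
  float alpha;
  Expr expr;
  float operator[](std::size_t i) const { return alpha * expr[i]; }
  std::size_t size() const { return expr.size(); }
};

// Interior node: element-wise sum of two sub-expressions.
template <typename Lhs, typename Rhs>
struct Add {
  Lhs lhs;
  Rhs rhs;
  float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  std::size_t size() const { return lhs.size(); }
};

// Helper functions so the node types are deduced by the compiler.
template <typename Expr>
Scale<Expr> scale(float alpha, Expr e) { return {alpha, e}; }

template <typename Lhs, typename Rhs>
Add<Lhs, Rhs> add(Lhs l, Rhs r) { return {l, r}; }

// Walks the whole tree once per element: no temporaries are created for
// intermediate results such as alpha * x.
template <typename Expr>
std::vector<float> evaluate(const Expr& e) {
  std::vector<float> out(e.size());
  for (std::size_t i = 0; i < out.size(); ++i) out[i] = e[i];
  return out;
}
```

With vectors x and y in scope, an AXPY-like expression would be written as evaluate(add(scale(alpha, VecView{x}), VecView{y})). The type of the tree encodes the whole computation at compile time, which is what allows a library to turn it into a single kernel.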
The example below illustrates the basic usage of SYCL-BLAS to dispatch an AXPY operation. More complete examples can be found in the tests from the project repository.
    // Create an Executor for the library interface using SYCL and a pre-existing queue.
    Executor<SYCL> ex(syclQueue);

    // Instantiate the data in the SYCL runtime
    auto gpu_vX = ex.template allocate<ScalarT>(size);
    auto gpu_vY = ex.template allocate<ScalarT>(size);

    // Explicit SYCL copy from the host to the device
    ex.copy_to_device(vX.data(), gpu_vX, size);
    ex.copy_to_device(vY.data(), gpu_vY, size);

    // Call the axpy routine. Note that the scalar type is inferred automatically,
    // so the routine name needs no type prefix letter (as in saxpy or daxpy).
    _axpy(ex, (size + strd - 1) / strd, alpha, gpu_vX, strd, gpu_vY, strd);

    // Update the host pointer
    ex.copy_to_host(gpu_vY, vY.data(), size);
SYCL-BLAS aims to support all the original BLAS APIs. At the time of writing, it supports all Level 1 APIs, the GEMV and GER APIs from Level 2, and the GEMM APIs from Level 3.
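The element count passed to the AXPY call, (size + strd - 1) / strd, is the number of elements reachable in a buffer of size elements when stepping by strd. The plain C++ sketch below shows that calculation together with a strided AXPY reference loop. The names strided_count and reference_axpy are hypothetical and not part of SYCL-BLAS, and the sketch assumes positive strides only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of elements visited in a buffer of `size` elements when stepping
// by `stride`: the ceiling of size / stride.
std::size_t strided_count(std::size_t size, std::size_t stride) {
  assert(stride > 0);
  return (size + stride - 1) / stride;
}

// Strided AXPY on the host: y[i * incy] += alpha * x[i * incx] for i in [0, n).
// Elements between the strides are left untouched.
void reference_axpy(std::size_t n, float alpha,
                    const std::vector<float>& x, std::size_t incx,
                    std::vector<float>& y, std::size_t incy) {
  for (std::size_t i = 0; i < n; ++i) {
    assert(i * incx < x.size() && i * incy < y.size());
    y[i * incy] += alpha * x[i * incx];
  }
}
```

For example, with five elements and a stride of 2 the count is 3, so indices 0, 2 and 4 are updated and the last access stays inside the buffer.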
Deep Neural Networks
Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms. Learning can be supervised, semi-supervised or unsupervised. Recent advances in this area have pushed hardware and system vendors to offer highly optimized versions of common algorithms for their platforms.
cuDNN is the NVIDIA Deep Neural Network library, a CUDA-based library that contains a number of primitives to accelerate deep neural network frameworks. It contains a set of the most commonly used routines in machine learning, such as convolution, pooling, normalization and activation layers.
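To give a sense of what two of these primitives compute, the sketch below implements a ReLU activation and a 2x2 max pooling step in plain C++. The functions relu and max_pool_2x2 are hypothetical illustrations and not cuDNN calls; the single-channel, row-major layout and the even input dimensions are simplifying assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Activation layer: ReLU clamps every negative value to zero.
std::vector<float> relu(const std::vector<float>& in) {
  std::vector<float> out(in.size());
  std::transform(in.begin(), in.end(), out.begin(),
                 [](float v) { return std::max(v, 0.0f); });
  return out;
}

// Pooling layer: 2x2 max pooling with stride 2 over a row-major
// height x width image. Both dimensions are assumed to be even.
std::vector<float> max_pool_2x2(const std::vector<float>& in,
                                std::size_t height, std::size_t width) {
  assert(in.size() == height * width);
  assert(height % 2 == 0 && width % 2 == 0);
  const std::size_t out_h = height / 2;
  const std::size_t out_w = width / 2;
  std::vector<float> out(out_h * out_w);
  for (std::size_t r = 0; r < out_h; ++r) {
    for (std::size_t c = 0; c < out_w; ++c) {
      // Index of the top-left element of the current 2x2 window.
      const std::size_t top = (2 * r) * width + 2 * c;
      out[r * out_w + c] = std::max({in[top], in[top + 1],
                                     in[top + width], in[top + width + 1]});
    }
  }
  return out;
}
```

A library such as cuDNN provides tuned GPU versions of operations like these, so frameworks do not have to write and optimize the kernels themselves.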
SYCL-DNN is a portable machine learning convolution library written in SYCL. It implements highly optimized convolution algorithms for different platforms. SYCL-DNN is a work-in-progress project from Codeplay Software, and can be obtained from their open-source repository.
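As a baseline for what a convolution library computes, the sketch below is a naive direct 2D convolution in plain C++. The function conv2d_valid is hypothetical and is not the SYCL-DNN API; it assumes a single channel, stride 1, no padding, row-major storage, and the unflipped-filter (cross-correlation) form commonly used in deep learning frameworks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive direct 2D convolution: slides a k_h x k_w filter over an
// in_h x in_w input and accumulates the element-wise products.
// The output is (in_h - k_h + 1) x (in_w - k_w + 1).
std::vector<float> conv2d_valid(const std::vector<float>& input,
                                std::size_t in_h, std::size_t in_w,
                                const std::vector<float>& filter,
                                std::size_t k_h, std::size_t k_w) {
  assert(input.size() == in_h * in_w && filter.size() == k_h * k_w);
  assert(k_h >= 1 && k_h <= in_h && k_w >= 1 && k_w <= in_w);
  const std::size_t out_h = in_h - k_h + 1;
  const std::size_t out_w = in_w - k_w + 1;
  std::vector<float> output(out_h * out_w, 0.0f);
  for (std::size_t r = 0; r < out_h; ++r)
    for (std::size_t c = 0; c < out_w; ++c)
      for (std::size_t i = 0; i < k_h; ++i)
        for (std::size_t j = 0; j < k_w; ++j)
          output[r * out_w + c] +=
              input[(r + i) * in_w + (c + j)] * filter[i * k_w + j];
  return output;
}
```

Optimized implementations typically replace this four-level loop nest with platform-specific strategies such as tiling or transforming the problem into a matrix multiplication, which is where per-platform tuning pays off.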
TensorFlow is an open-source software library for dataflow programming across a range of tasks. Although it was designed initially as a math library, its main use is the development of machine learning applications, with a particular focus on neural networks. TensorFlow was developed by Google and is released under the Apache 2.0 open source license.
Upstream TensorFlow supports both the CUDA and SYCL programming models natively, selected through compilation options. Codeplay also maintains a development branch of TensorFlow that carries the latest SYCL support and performance optimizations.
Instructions on how to build TensorFlow with SYCL can be found on the Codeplay developer website.
The TensorFlow user interface is the same regardless of the backend, so any TensorFlow model that runs on the CUDA backend should also run on the SYCL backend. Note that different hardware imposes different restrictions, which can limit the portability of some models.