The Eigen project has a SYCL backend that implements operations such as tensor operations and runs on a range of devices. You can find the repository for the Eigen SYCL implementation here, and the additions are being upstreamed to the main project repository. The supported devices range from desktop CPUs and GPUs through to embedded accelerators such as the Renesas CVengine or Arm Mali GPU. Each device has different capabilities; the embedded devices in particular have different memory models and restricted memory sizes.
To tune the Eigen expressions for these devices and make the most of the hardware's capabilities, various compile-time flags can be passed to CMake when compiling your code to specialize the Eigen expressions for a particular device. These flags are outlined on this page.
They can be passed to CMake using -Dmacroname or set in the source code using a #define.
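As a minimal sketch of how the two routes can coexist, the fragment below guards each #define with #ifndef, so a value supplied on the compiler command line (for example -DLOCAL_THREAD_DIM0=8) takes precedence over the in-source fallback. The constant names kLocalThreadDim0 and kLocalThreadDim1 are hypothetical and exist only to make the resolved values visible; the fallback of 16 mirrors the values shown later on this page. It is assumed here that the definitions need to appear before the Eigen headers are included.

```cpp
// Sketch only: in-source fallbacks that a -D definition can override.
// If the build passes -DLOCAL_THREAD_DIM0=8, the #ifndef branch is skipped
// and the command-line value is used instead.
#ifndef LOCAL_THREAD_DIM0
#define LOCAL_THREAD_DIM0 16
#endif

#ifndef LOCAL_THREAD_DIM1
#define LOCAL_THREAD_DIM1 16
#endif

// Assumption: Eigen would be included here, after the macros are settled.
// #include <unsupported/Eigen/CXX11/Tensor>

// Hypothetical constants that expose the resolved values to the rest of the code.
constexpr int kLocalThreadDim0 = LOCAL_THREAD_DIM0;
constexpr int kLocalThreadDim1 = LOCAL_THREAD_DIM1;
```

With this pattern the same source file builds unchanged for a desktop GPU and for an embedded target; only the -D values differ.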
1. Running the CMake Command to Build the SYCL Tests for Eigen
The SYCL Eigen implementation includes a set of tests. The following commands build these tests so that they can be executed.
Create a build folder in the root of Eigen and run the following command.
cmake -DEIGEN_TEST_CXX11=1 -DEIGEN_TEST_SYCL=1 -DOpenCL_LIBRARY="PathTOLib.so" -DCOMPUTECPP_PACKAGE_ROOT_DIR="PathToComputeCppRootDirectory" -DCMAKE_CXX_FLAGS="-FlagsForCXXCompiler" -DCOMPUTECPP_USER_FLAGS="-fsycl-split-modules=20" ../
To compile only the SYCL tests, run this make command from the root of your build folder.
make -j 8 $(make help | grep -E "sycl" | sed "s/\.\.\.//")
Similarly the command with ninja looks like this.
time ninja $(ninja -t targets | grep -E "sycl" | sed "s/:.*//")
2. Eigen CMake flags used with the SYCL Compiler
There is a set of specific CMake flags that are used when building code that uses the SYCL implementation of Eigen.
If you use the compute++ compiler directly, you should use these flags.
//Typical flags to append to COMPUTECPP_USER_FLAGS
-sycl-compress-name -Xclang -cl-mad-enable -O2
If you use the CXX compiler, use this flag.
//Typical flags to append to CMAKE_CXX_FLAGS
-O2
3. Eigen Local Work Group Size
Different devices might have different work group sizes, and the following can be set to accommodate different hardware configurations and architectures.
This is the static value for itemID.get_local(0). The value must be a power of 2. It is advised to set the same value for LOCAL_THREAD_DIM0 and LOCAL_THREAD_DIM1.
#define LOCAL_THREAD_DIM0 16
This is the static value for itemID.get_local(1). The value must be a power of 2. It is advised to set the same value for LOCAL_THREAD_DIM0 and LOCAL_THREAD_DIM1.
#define LOCAL_THREAD_DIM1 16
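The power-of-2 requirement can be checked at compile time. The following is a sketch, not part of Eigen: is_power_of_two and work_group_size are hypothetical helpers, and the product of the two dimensions is taken as the work-group size on the assumption that they describe a 2D work-group.

```cpp
// Sketch only: compile-time validation of the local work-group dimensions.
#ifndef LOCAL_THREAD_DIM0
#define LOCAL_THREAD_DIM0 16
#endif
#ifndef LOCAL_THREAD_DIM1
#define LOCAL_THREAD_DIM1 16
#endif

// True when v is a positive power of two, i.e. exactly one bit is set.
constexpr bool is_power_of_two(unsigned long v) {
  return v != 0 && (v & (v - 1)) == 0;
}

// Fail the build early if either dimension violates the requirement.
static_assert(is_power_of_two(LOCAL_THREAD_DIM0),
              "LOCAL_THREAD_DIM0 must be a power of 2");
static_assert(is_power_of_two(LOCAL_THREAD_DIM1),
              "LOCAL_THREAD_DIM1 must be a power of 2");

// Assumption: the two dimensions form a 2D work-group, so the number of
// work-items per group is their product.
constexpr unsigned long work_group_size() {
  return static_cast<unsigned long>(LOCAL_THREAD_DIM0) * LOCAL_THREAD_DIM1;
}
```

With both dimensions set to 16 this gives 256 work-items per group, which is worth comparing against the maximum work-group size the target device reports.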
If the device has resource constraints, you should restrict the maximum number of threads. This is important for certain processors, for example Intel CPUs and GPUs, and Arm devices. This macro is used for TensorScanOp on GPUs in the SYCL backend. The default value below is equivalent to 1024 blocks on an AMD device with a work group size of 256 (1024 * 256 = 262144).
#define MAX_GLOBAL_RANGE 262144
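The arithmetic behind the default can be written out as a small sketch. The helper max_global_range and the second set of numbers (a device limited to 64 work-items per group) are illustrative assumptions, not values taken from Eigen; keeping the block count at 1024 is just one possible way to scale the limit.

```cpp
// Sketch only: derive a global range cap from a block count and a
// work-group size. max_global_range is a hypothetical helper.
constexpr unsigned long max_global_range(unsigned long num_blocks,
                                         unsigned long work_group_size) {
  return num_blocks * work_group_size;
}

// The default on this page: 1024 blocks of 256 work-items each.
constexpr unsigned long kDefaultRange = max_global_range(1024, 256);
static_assert(kDefaultRange == 262144, "matches the default MAX_GLOBAL_RANGE");

// Illustrative numbers for a device limited to 64 work-items per group.
constexpr unsigned long kSmallDeviceRange = max_global_range(1024, 64);
```

Under these assumptions the smaller device would be built with MAX_GLOBAL_RANGE set to 65536.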
4. Local Memory Usage
It might be necessary to disable local memory on hardware that does not have that type of memory. The macros below can be used to enable and disable local memory. If both or neither are defined, the decision to use local memory is made at runtime. The default is to use local memory, which improves performance when there is dedicated shared memory.
#define LOCAL_MEM 1
On devices such as Arm GPUs that do not have dedicated shared memory, local memory is emulated in global memory, so it is best to leave the LOCAL_MEM macro undefined and define the NO_LOCAL_MEM macro.
#define NO_LOCAL_MEM 1
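The rule for combining the two macros can be sketched as a three-way policy. The enum LocalMemPolicy, the constant kLocalMemPolicy, and the function use_local_memory are hypothetical names, and the runtime criterion used here (whether the device reports dedicated local memory) is an assumption; the text above only states that the decision is made at runtime.

```cpp
// Sketch only: map LOCAL_MEM / NO_LOCAL_MEM onto a three-way policy.
enum class LocalMemPolicy { ForceOn, ForceOff, DecideAtRuntime };

#if defined(LOCAL_MEM) && !defined(NO_LOCAL_MEM)
constexpr LocalMemPolicy kLocalMemPolicy = LocalMemPolicy::ForceOn;
#elif defined(NO_LOCAL_MEM) && !defined(LOCAL_MEM)
constexpr LocalMemPolicy kLocalMemPolicy = LocalMemPolicy::ForceOff;
#else
// Both macros defined, or neither: leave the choice to runtime.
constexpr LocalMemPolicy kLocalMemPolicy = LocalMemPolicy::DecideAtRuntime;
#endif

// Assumption: has_dedicated_local_mem comes from a device query made at
// runtime; it is only consulted when the policy is DecideAtRuntime.
constexpr bool use_local_memory(LocalMemPolicy policy,
                                bool has_dedicated_local_mem) {
  return policy == LocalMemPolicy::ForceOn
             ? true
             : (policy == LocalMemPolicy::ForceOff ? false
                                                   : has_dedicated_local_mem);
}
```

Building with only NO_LOCAL_MEM defined yields ForceOff, which matches the recommendation for Arm GPUs.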
5. Register Usage
If the device you are using has a limited number of registers, you can define the register sizes yourself. If there are no limitations, the launcher tries to choose the optimal number of registers.
#define REG_M [YOUR SIZE POWER of 2]
#define REG_N [YOUR SIZE POWER of 2]
#define REG_K [YOUR SIZE POWER of 2]
6. Asynchronous Kernel Execution
For hardware that supports it, asynchronous kernel execution can be enabled using the following macro.
#define ASYNC_EXECUTION 1
7. Device selection
By default, the tests run on all available SYCL devices. Compiling with the USE_DEFAULT_SELECTOR macro instead selects the preferred device using cl::sycl::default_selector, which can be controlled at runtime through the environment variable COMPUTECPP_TARGET. (Note: COMPUTECPP_TARGET is a ComputeCpp-specific environment variable and is not available in other SYCL implementations.)
#define USE_DEFAULT_SELECTOR 1
8. Eigen Vectorization
The following flags can be used to enable and disable Eigen vectorization. By default, Eigen vectorization is enabled. However, if the device does not support vectorization, or you are compiling the code for OpenCL CPU devices, you should pass both of the flags below to the compute++ and CXX compilers to disable vectorization, as ComputeAorta has its own vectorization system. Set either both flags or neither: if you set only one of them, the kernels generated for host and device may be inconsistent.
#define EIGEN_DONT_VECTORIZE_SYCL 1
#define EIGEN_DONT_VECTORIZE 1
Alternatively, if you are compiling the Eigen tests via CMake, you can simply pass the following flag to cmake, which will define the correct flags during compilation.
cmake -DEIGEN_DONT_VECTORIZE_SYCL=1 ..
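When the vectorization macros are set by hand rather than through CMake, a preprocessor guard can catch the case where only one of them is defined. This is a sketch that a project could place in its own code before including Eigen; it is not something Eigen is known to provide, and the constant kEigenVectorizationDisabled is a hypothetical name.

```cpp
// Sketch only: refuse to build when exactly one of the two macros is set.
// defined(X) evaluates to 1 or 0, so the comparison detects a mismatch.
#if defined(EIGEN_DONT_VECTORIZE_SYCL) != defined(EIGEN_DONT_VECTORIZE)
#error "Define both EIGEN_DONT_VECTORIZE_SYCL and EIGEN_DONT_VECTORIZE, or neither."
#endif

// Hypothetical constant recording the resolved setting. After the guard
// above, checking one macro is enough because both agree.
#if defined(EIGEN_DONT_VECTORIZE_SYCL)
constexpr bool kEigenVectorizationDisabled = true;
#else
constexpr bool kEigenVectorizationDisabled = false;
#endif
```

The guard turns a host and device mismatch, which could otherwise show up as inconsistent kernels, into an immediate build error.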