Speaker
Description
In this talk, we discuss the optimal strategy for a parallel matrix-matrix multiplication algorithm that minimizes the time-to-solution. The strategy consists of finding the algorithm parameters that best overlap the multiplication of separate tiles on each GPU with the data transfers between GPUs. We present the new algorithm developed for multi-GPU nodes [1] and analyze the correlation between its optimal parameters and the hardware specifications (e.g. the floating-point performance and the memory bandwidth). The results are illustrated by benchmarks on different NVIDIA GPUs connected via PCIe or NVLink.
[1] Choi Y. R., Nikolskiy V., Stegailov V. Matrix-Matrix Multiplication Using Multiple GPUs Connected by Nvlink // 2020 Global Smart Industry Conference (GloSIC). – IEEE, 2020. – pp. 354-361.
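The trade-off described in the abstract, choosing a tile size so that tile multiplications hide the data transfers, can be illustrated with a toy cost model. The sketch below is not the algorithm from [1]. It assumes a simple double-buffered pipeline in which each b x b tile of the result needs one row panel of A and one column panel of B, and it ignores the efficiency loss of small kernels. The names `Hardware`, `pipelined_time` and `best_tile`, as well as the performance and bandwidth figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    flops: float      # assumed sustained FLOP/s of one GPU
    bandwidth: float  # assumed bytes/s of the link feeding one GPU


def pipelined_time(n, tile, gpus, hw, elem_bytes=8):
    """Estimated time for an n x n product under the toy pipeline model."""
    total_tiles = (n // tile) ** 2
    per_gpu = (total_tiles + gpus - 1) // gpus  # ceiling division
    # One output tile: (tile x n) times (n x tile) -> 2 * tile^2 * n flops.
    t_compute = 2 * tile * tile * n / hw.flops
    # Inputs for one output tile: two panels of tile * n elements each.
    t_transfer = 2 * tile * n * elem_bytes / hw.bandwidth
    # The first transfer and the last multiplication cannot be overlapped;
    # every step in between costs the slower of the two stages.
    return t_transfer + (per_gpu - 1) * max(t_compute, t_transfer) + t_compute


def best_tile(n, gpus, hw, candidates):
    """Return the candidate tile size with the smallest modelled time."""
    return min(candidates, key=lambda b: pipelined_time(n, b, gpus, hw))


if __name__ == "__main__":
    sizes = [256, 512, 1024, 2048, 4096, 8192]
    # Illustrative numbers only, not measurements of any specific device.
    slow_link = Hardware(flops=7e12, bandwidth=16e9)
    fast_link = Hardware(flops=7e12, bandwidth=150e9)
    for name, hw in [("slow link", slow_link), ("fast link", fast_link)]:
        b = best_tile(16384, 4, hw, sizes)
        print(name, "-> tile", b, "time", pipelined_time(16384, b, 4, hw))
```

In this model the multiplication time per tile grows as b^2 while the transfer time grows as b, so the break-even tile size scales with the ratio of floating-point performance to link bandwidth. A faster interconnect therefore lets smaller tiles stay compute-bound, which is one way to read the correlation between optimal parameters and hardware specifications mentioned in the abstract.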