Research of improving the performance of explicit numerical methods on the x86 and ARM CPU

8 Jul 2021, 14:30
15m
403 or Online - https://jinr.webex.com/jinr/j.php?MTID=mf93df38c8fbed9d0bbaae27765fc1b0f

Sectional reports 5. High Performance Computing HPC

Speaker

Vladislav Furgailo

Description

Explicit numerical methods are used to solve a wide range of problems that arise from mathematical models of physical phenomena. However, simulations on large model spaces can require an enormous number of floating-point operations, and run times of several months or more are possible even on large HPC systems.
The vast majority of HPC systems in use today are powered by x86 and ARM CPUs [1]. Our aim is to investigate ways of speeding up simulations on CPUs and to compare the performance and energy efficiency of x86 and ARM CPUs. In our work we used a high-order finite-difference time-domain (FDTD) method to solve the 3D acoustic equation.
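
To make the kind of computation concrete, the following is a minimal sketch of one explicit time step for the 3D acoustic (wave) equation. It is not the scheme used in this work: it assumes a second-order stencil in space and time (the work uses a high-order scheme), a uniform grid, zero-valued boundaries, and a flat array layout. The names `idx`, `fdtd_step` and `courant` (the ratio c*dt/h) are hypothetical and chosen only for illustration.

```python
def idx(i, j, k, ny, nz):
    """Flat index of grid point (i, j, k); k is the fastest-varying axis."""
    return (i * ny + j) * nz + k


def fdtd_step(p_prev, p_curr, nx, ny, nz, courant):
    """Return the pressure field at the next time level.

    Assumed update (second order in space and time):
        p_next = 2*p_curr - p_prev + courant**2 * laplacian(p_curr)
    Boundary points are simply left at zero in this sketch.
    """
    c2 = courant * courant
    sx, sy = ny * nz, nz  # strides along the i and j axes
    p_next = [0.0] * (nx * ny * nz)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                c = idx(i, j, k, ny, nz)
                lap = (p_curr[c + sx] + p_curr[c - sx]
                       + p_curr[c + sy] + p_curr[c - sy]
                       + p_curr[c + 1] + p_curr[c - 1]
                       - 6.0 * p_curr[c])
                p_next[c] = 2.0 * p_curr[c] - p_prev[c] + c2 * lap
    return p_next
```

Each output point reads only values from the two previous time levels, so the points of one time step can be computed in any order. This independence is what makes SIMD vectorization and reordering of the traversal possible.
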
In conjunction with parallel computing, we used CPU features such as SIMD instructions (AVX on x86 and NEON on ARM) [2] and exploited the hierarchical structure of the CPU caches to improve data locality. To improve data locality we used loop tiling, a method that changes the order in which the iteration space is traversed [3]. Our work considers a number of tiling algorithms and presents test calculations for the x86 and ARM architectures. In particular, we considered recursive and non-recursive cube tiling [4] and the ZCube data locality optimization.
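
As an illustration of non-recursive cube tiling, the sketch below enumerates the points of a 3D iteration space tile by tile instead of in plain row-major order. It is only a model of the traversal order, written under the assumption of cubic tiles with a single edge length `tile`; the function name `tiled_indices` is hypothetical. Pure Python will not show the cache effect itself, which appears only in compiled code.

```python
def tiled_indices(nx, ny, nz, tile):
    """Yield every (i, j, k) of an nx*ny*nz grid exactly once, cube by cube.

    The three outer loops step over tile origins; the three inner loops
    visit the points inside one tile. min() clips tiles at the grid edge,
    so the grid size does not have to be a multiple of the tile size.
    """
    for ii in range(0, nx, tile):
        for jj in range(0, ny, tile):
            for kk in range(0, nz, tile):
                for i in range(ii, min(ii + tile, nx)):
                    for j in range(jj, min(jj + tile, ny)):
                        for k in range(kk, min(kk + tile, nz)):
                            yield (i, j, k)
```

For example, `list(tiled_indices(4, 4, 4, 2))` visits the eight points of the cube at the origin before moving on to the next cube. In a compiled implementation the tile size would typically be chosen so that the working set of one tile fits into a given cache level.
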
We found that ZCube increases the performance of SIMD computations on the ARM CPU [5] and speeds up computation with tiling on both CPU architectures. As expected, we also found that non-recursive tiling performs better than recursive tiling on both CPU architectures because it incurs fewer CPU cache misses. Finally, we found that the ARM CPU has a performance-to-energy efficiency factor 12 times higher than that of the x86 CPU.
In this respect, it would be of interest to extend our experiments to ARM cluster computing and to further improve the performance of non-recursive and recursive tiling.
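
The abstract does not define ZCube. Assuming it denotes a layout based on the Z-order (Morton) curve, which is an assumption and not a statement about the authors' implementation, the index computation could be sketched as follows. The function name `morton3` and the `bits` parameter are hypothetical.

```python
def morton3(i, j, k, bits=10):
    """Interleave the low `bits` bits of i, j, k into one Z-order index.

    Bit b of i goes to position 3*b, bit b of j to 3*b + 1 and bit b of k
    to 3*b + 2, so points that are close in 3D tend to be close in memory.
    """
    code = 0
    for b in range(bits):
        code |= ((i >> b) & 1) << (3 * b)
        code |= ((j >> b) & 1) << (3 * b + 1)
        code |= ((k >> b) & 1) << (3 * b + 2)
    return code
```

With this mapping the eight points of any aligned 2x2x2 cube occupy eight consecutive positions, and the same holds recursively for 4x4x4 cubes and larger ones.
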
References

  1. http://www.top500.org/
  2. S. M. et al., “Vector instructions to enable efficient
    synchronization and parallel reduction operations,” U.S. Patent
    WO2009120981A2, Oct. 2009.
  3. J. Xue, “On tiling as a loop transformation,” Parallel Processing
    Letters, vol. 7, no. 4, pp. 409–424, 1997.
  4. V. Furgailo, A. Ivanov, and N. Khokhlov, “Research of techniques to
    improve the performance of explicit numerical methods on the CPU,”
    pp. 79–85, Sep. 2019.
  5. J. Bakos, Embedded Systems: ARM Programming and Optimization.
    Elsevier Science, 2015.
