9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021)

Name: 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021)
Start: 2017-12-03T12:00:00+03:00
End: 2021-07-09T19:05:00+03:00
Location: No location set

5–9 Jul 2021

Europe/Moscow timezone

Support

grid2021@jinr.ru

Verifiable application-level checkpoint and restart framework for parallel computing

8 Jul 2021, 16:30

15m

403 or Online - https://jinr.webex.com/jinr/j.php?MTID=mf93df38c8fbed9d0bbaae27765fc1b0f

Sectional reports 5. High Performance Computing HPC

Mr Ivan Gankevich (Saint Petersburg State University)

Fault tolerance of parallel and distributed applications becomes a pressing concern as computer clusters and distributed systems grow in size. For a long time the common solution was checkpoint and restart implemented at the operating-system level; however, this approach is inefficient for large systems, and application-level checkpoint and restart is now considered a more efficient alternative. In this paper we manually implement application-level checkpoint and restart for well-known parallel computing benchmarks to evaluate this alternative approach. We measure the overhead of creating a checkpoint and of restarting from it, and the amount of effort needed to implement the mechanism and to verify the correctness of the resulting programme. Based on the results, we propose a generic framework for application-level checkpointing that simplifies the process and makes it possible to verify that the application gives correct output when restarted from any checkpoint.
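The idea of verifying restarts from every checkpoint can be illustrated with a minimal single-process sketch. It is not the authors' framework and does not use MPI: the function names `run` and `verify`, the checkpoint file naming, and the toy sum-of-squares computation are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def run(steps, checkpoint_dir, resume_from=None):
    """Toy computation (sum of squares) that saves its state after every step.

    Hypothetical example: the state layout and file names are illustrative only.
    """
    if resume_from is None:
        state = {"step": 0, "total": 0}
    else:
        # Restart: load the application state from a checkpoint file.
        with open(resume_from, "rb") as f:
            state = pickle.load(f)
    while state["step"] < steps:
        state["total"] += state["step"] ** 2
        state["step"] += 1
        # Write to a temporary file and rename it, so that a crash in the
        # middle of a write does not leave a truncated checkpoint behind.
        path = os.path.join(checkpoint_dir, f"ckpt-{state['step']:04d}.pkl")
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)
    return state["total"]


def verify(steps):
    """Return True if restarting from every checkpoint reproduces the
    output of an uninterrupted reference run."""
    with tempfile.TemporaryDirectory() as ref_dir:
        reference = run(steps, ref_dir)
        for name in sorted(os.listdir(ref_dir)):
            with tempfile.TemporaryDirectory() as restart_dir:
                result = run(steps, restart_dir,
                             resume_from=os.path.join(ref_dir, name))
                if result != reference:
                    return False
    return True
```

Calling `verify(10)` performs one reference run and then ten restarts, one per checkpoint, comparing each final result with the reference. A real parallel benchmark would additionally have to coordinate checkpoints across processes, which this sketch leaves out.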

Mr Ivan Gankevich (Saint Petersburg State University), Ivan Petriakov (Saint Petersburg State University), Anton Gavrikov (Saint Petersburg State University), Dmitrii Tereshchenko (Saint Petersburg State University), Gleb Mozhaiskii (Saint Petersburg State University)

Gankevich-MPI-Checkpoint.pdf
