Verifiable application-level checkpoint and restart framework for parallel computing

8 Jul 2021, 16:30
15m
403 or Online - https://jinr.webex.com/jinr/j.php?MTID=mf93df38c8fbed9d0bbaae27765fc1b0f

Sectional reports 5. High Performance Computing HPC

Speaker

Mr Ivan Gankevich (Saint Petersburg State University)

Description

Fault tolerance of parallel and distributed applications is a concern that becomes pressing for large computer clusters and large distributed systems. For a long time the common solution to this problem was checkpoint and restart implemented at the operating system level; however, this approach is inefficient for large systems, and application-level checkpoint and restart is now considered a more efficient alternative. In this paper we manually implement application-level checkpoint and restart for well-known parallel computing benchmarks to evaluate this alternative approach. We measure the overhead introduced by creating a checkpoint and restarting from it, and the amount of effort needed to implement the mechanism and verify the correctness of the resulting programme. Based on the results, we propose a generic framework for application-level checkpointing that simplifies the process and makes it possible to verify that the application gives correct output when restarted from any checkpoint.
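The idea of verifying restarts from every checkpoint can be illustrated with a minimal single-process sketch. It is not the framework proposed in the paper: the toy computation `step`, the file naming scheme, the use of `pickle`, and all function names are assumptions made for illustration only. The sketch runs a computation once while writing periodic checkpoints, then restarts from each checkpoint and compares the final state with the uninterrupted reference run.

```python
import os
import pickle
import tempfile


def step(state):
    """One iteration of a toy computation (hypothetical stand-in for a benchmark kernel)."""
    i, acc = state
    return (i + 1, acc + i * i)


def run(state, n_steps, checkpoint_dir=None, every=0):
    """Advance the computation until iteration n_steps, optionally writing checkpoints."""
    while state[0] < n_steps:
        state = step(state)
        if checkpoint_dir and every and state[0] % every == 0:
            path = os.path.join(checkpoint_dir, f"ckpt_{state[0]:06d}.pkl")
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            # Rename into place so a crash does not leave a partial checkpoint file.
            os.replace(tmp, path)
    return state


def restart(path, n_steps):
    """Load the application state from a checkpoint and continue to the end."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    return run(state, n_steps)


def verify_all_checkpoints(n_steps, every):
    """Compare the result of restarting from each checkpoint with the reference run."""
    with tempfile.TemporaryDirectory() as d:
        reference = run((0, 0), n_steps, checkpoint_dir=d, every=every)
        names = sorted(n for n in os.listdir(d) if n.endswith(".pkl"))
        results = [restart(os.path.join(d, n), n_steps) == reference for n in names]
    return reference, results
```

Calling `verify_all_checkpoints(10, 3)` writes checkpoints after iterations 3, 6 and 9 and returns the reference state together with one boolean per checkpoint. A real parallel application would also have to coordinate checkpoints across processes, which this sketch leaves out.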

Primary authors

Mr Ivan Gankevich (Saint Petersburg State University)
Ivan Petriakov (Saint Petersburg State University)
Anton Gavrikov (Saint Petersburg State University)
Dmitrii Tereshchenko (Saint Petersburg State University)
Gleb Mozhaiskii (Saint Petersburg State University)
