Speaker
Mr
Yury Tipikin
(Saint Petersburg University)
Description
The reliability and stability of parallel high-performance computing jobs
become more and more pressing as the number of cluster nodes increases.
Existing solutions rely mainly on the inefficient process of dumping RAM to
stable storage. For really large supercomputers, this approach (checkpointing)
may be completely unacceptable. In this study, I examined a model of
distributed computing, the Actor model, and on this basis developed an
algorithm for processing batch jobs on a cluster that restores the state of an
interrupted computation without checkpoints. The algorithm is part of a
computing model that I call the "computational kernels model", after its core
component, the computational kernel. This work describes all the components of
the new model, its internal processes, benefits, and potential problems.
Primary author
Mr
Yury Tipikin
(Saint Petersburg University)