Fusion of Cluster Management and Cluster Simulation systems

3 Jul 2023, 17:00
15m
Room 403

HPC

Speaker

Anton Katenev (Listware)

Description

Integrating Cluster Management and Cluster Simulation systems addresses the challenges of High-Performance Computing (HPC) cluster management by using simulation to improve decision-making when failures occur. Foliage's team has extensive experience in building and managing HPC clusters; however, uncertainties remain about how cluster management behaves during failures. Simulation is proposed as a way to improve cluster management by quantifying subsystem degradation and predicting the impact of actions. Foliage's architecture includes a shared environment for the management and simulation systems (the "Unified Configuration Space"), which allows the simulation model to be refined and updated continuously. Integrating various applications through adapters and using a functional graph space enable seamless interaction between services. The proposed approach is demonstrated through a simple example that calculates overall cluster reliability. Future developments include the integration of AI capabilities for enhanced prediction and automation.
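The abstract does not say how the reliability example is computed, so the following is only a minimal sketch of one common way to estimate overall cluster reliability. It assumes independent component failures and a classic series/parallel reliability model. The function names (`series`, `parallel`), the subsystem names, and all numeric values are hypothetical and are not taken from Foliage.

```python
from math import prod


def series(reliabilities):
    """Reliability of a chain in which every component must work.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(reliabilities)


def parallel(reliabilities):
    """Reliability of a redundant group in which one working component suffices.

    The group fails only if every member fails (independence assumed).
    """
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical subsystem reliabilities, chosen for illustration only.
power = parallel([0.99, 0.99])        # two redundant power feeds
network = parallel([0.995, 0.995])    # two redundant switches
storage = 0.999                       # a single storage subsystem

# The cluster is modelled as usable only if all three subsystems are up.
cluster = series([power, network, storage])

print(f"power={power:.6f} network={network:.6f} cluster={cluster:.6f}")
```

In a setting like the one described, such a model could be re-evaluated whenever a component's state changes, for example by setting a failed power feed's reliability to 0.0 and comparing the resulting cluster value before and after a proposed action.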

Primary author

Anton Katenev (Listware)
