Description
Integrating cluster management and cluster simulation systems addresses the challenges of managing High-Performance Computing (HPC) clusters by using simulation to support decision-making when failures occur. Foliage's team has extensive experience in building and managing HPC clusters; however, uncertainty remains about how cluster management behaves during failures. Simulation is proposed as a way to improve cluster management by quantifying subsystem degradation and predicting the impact of actions. Foliage's architecture, which includes a shared environment for the management and simulation systems (the "Unified Configuration Space"), allows the simulation model to be continuously refined and updated. Applications are integrated through adapters, and a functional graph space enables seamless interaction between services. The approach is demonstrated with a simple example that calculates overall cluster reliability. Future developments include the integration of AI capabilities for enhanced prediction and automation.
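To give a rough idea of what calculating overall cluster reliability can look like, here is a minimal sketch using the textbook series/parallel reliability model. It does not use Foliage's API and is not the example from the talk; the subsystem names, the reliability values, and the helper functions `series` and `parallel` are all hypothetical and chosen only for illustration.

```python
from math import prod


def series(reliabilities):
    """Reliability of components that must ALL work: R = product of R_i."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Reliability of redundant components where at least one must work:
    R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical subsystem reliabilities (assumed values, not measured data).
healthy = {
    "power": 0.99,
    "network": 0.98,
    "storage": parallel([0.90, 0.90]),  # two redundant storage nodes
}

# Same cluster after one storage node has failed: redundancy is lost.
degraded = dict(healthy, storage=parallel([0.90]))

cluster_reliability = series(healthy.values())
degraded_reliability = series(degraded.values())

print(f"healthy cluster:  {cluster_reliability:.4f}")
print(f"degraded cluster: {degraded_reliability:.4f}")
```

Comparing the two figures is one simple way to quantify how the loss of a single subsystem component degrades the cluster as a whole, which is the kind of question the simulation is meant to answer before an operator acts.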