Speaker
Description
Empirical studies have repeatedly shown that users' resource estimates in High-Performance Computing (HPC) systems are inaccurate [1]. If resources are underestimated, the job may be terminated at any stage of the computation, so the resources already allocated to it are wasted; overestimation wastes resources as well. SLURM, a widely used job scheduler, has a mechanism that predicts only the start time of a job, but this mechanism is primitive and the predicted time is frequently overestimated. There is also a software system, built on top of SLURM, for modeling the activity of computing cluster users: it collects statistics and uses them to simulate the load on a model of a computing cluster managed by SLURM, and it reports several metrics used by cluster system administrators. This approach has been tested on data from the computing clusters of the Faculty of Computational Mathematics and Cybernetics of Moscow State University and of NIKIET JSC. However, this solution is not suitable either, because it allows for neither analysis nor prediction. Other systems exist for analyzing how efficiently a cluster is used, but connecting such systems to SLURM consumes a large amount of resources, whereas our proposed method only requires developing a component into which an analysis system can easily be embedded. In this work, to utilize the HPC system effectively as a whole, we propose a new approach that predicts the resources required by a newly submitted job, such as the number of CPUs and the time slots. The study focused on predictive analytics, including regression and classification tasks. We also studied the possibility of designing and using a plugin that applies the proposed method in the real applications used by system users.
A supervised machine learning (ML) system comprising several ML models was trained on statistical data collected from the reference queue systems. Our dataset includes per-job and per-user features. To make the system applicable to a job scheduler, in particular SLURM, a dynamically loaded SLURM SPANK plugin was designed. The plugin collects statistics, analyzes them, and builds a model from that analysis; this model is then used for prediction. Plugins can not only complement but also modify the behavior of SLURM to optimize the system's performance. SLURM provides several user utilities; in this work we are mainly interested in two of them: srun, which starts a job, and sbatch, which places jobs in the queue, more precisely, which submits a batch script (instructions telling SLURM how to perform the job) to SLURM. The task consists of designing, developing, and testing the functionality of a component connected to the SLURM queuing system; the component collects statistical data and analyzes the flow of computational tasks. The proposed plugin, MLSP (Machine Learning SLURM Plugin), takes control when the srun and sbatch commands are executed. Its code splits into two large parts: the main part, which works with SLURM, and the auxiliary part, which works with the server.
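To make the component design more concrete, the following is a minimal sketch of what such a SPANK plugin skeleton could look like in C. The SPANK_PLUGIN macro, the slurm_spank_init entry point, spank_remote, spank_get_item, and slurm_info are the standard SLURM plugin API; the helper mlsp_send_to_server and the particular set of collected features are hypothetical placeholders for the auxiliary, server-facing part, which the abstract does not specify in detail.

/*
 * Minimal sketch of an MLSP-style SPANK plugin skeleton (not the actual
 * implementation). SPANK entry points and spank_get_item() are the standard
 * SLURM plugin API; mlsp_send_to_server() is a hypothetical placeholder for
 * the auxiliary part that talks to the analysis server.
 */
#include <stdint.h>
#include <slurm/spank.h>

SPANK_PLUGIN(mlsp, 1);          /* register the plugin with SLURM */

/* Hypothetical helper: forward collected job features to the analysis server. */
static void mlsp_send_to_server(uint32_t job_id, uid_t uid, uint16_t ncpus)
{
    /* e.g. serialize the features and hand them to the ML service */
    slurm_info("mlsp: job=%u uid=%u ncpus=%u sent for analysis",
               job_id, (unsigned) uid, (unsigned) ncpus);
}

/* "Main" part working with SLURM: called when a job step is initialized. */
int slurm_spank_init(spank_t sp, int ac, char **av)
{
    uint32_t job_id = 0;
    uid_t    uid    = 0;
    uint16_t ncpus  = 0;

    if (!spank_remote(sp))      /* job items are only valid in remote context */
        return ESPANK_SUCCESS;

    spank_get_item(sp, S_JOB_ID, &job_id);
    spank_get_item(sp, S_JOB_UID, &uid);
    spank_get_item(sp, S_JOB_NCPUS, &ncpus);

    mlsp_send_to_server(job_id, uid, ncpus);
    return ESPANK_SUCCESS;
}

A plugin of this kind is loaded by SLURM through a line in plugstack.conf (for example, "optional /path/to/mlsp.so"), which is what allows it to take control when srun and sbatch are executed without modifying SLURM itself.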
Our work leads us to conclude that adding new features to the dataset improves prediction accuracy. An innovative solution to the resource allocation problem was found. The possibility of writing a plugin that applies our machine learning system in practical applications was studied, and it was found that such a plugin makes the practical use of machine learning algorithms in decision making feasible, although the performance of the component still needs to be improved. In future work, we will use this component to evaluate our algorithms on a real cluster and find the best method for predicting the required resources.
1. Tsafrir, D., Etsion, Y., Feitelson, D. G. "Backfilling using runtime predictions rather than user estimates." Technical Report 2005-5, School of Computer Science and Engineering, The Hebrew University of Jerusalem, 2005.