Automated Analysis and Monitoring of Scientific HTC Jobs on Distributed Heterogeneous Computing Resources

6 Jul 2023, 16:30
15m
Room 310

Distributed Computing Systems

Speaker

Ms Anna Ilina (Joint Institute for Nuclear Research)

Description

Executing millions of scientific high-throughput computing (HTC) jobs on distributed heterogeneous computing resources makes it difficult to review their status and behavior after completion. To address this, an approach was developed to analyze jobs using scatter plots that show how job duration depends on the relative performance of the CPU cores the jobs were assigned to. Subsequently, a specialized system was created to automate this analysis. The system regularly collects data on finished jobs within the DIRAC infrastructure.
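The scatter-plot preparation described above can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`status`, `core_score`, `duration_s`) and the helper `scatter_points` are hypothetical and are not taken from DIRAC or from the presented system.

```python
from collections import defaultdict


def scatter_points(jobs):
    """Group (core_score, duration_s) pairs by final job status.

    Each job is a dict with the hypothetical keys 'status',
    'core_score' (relative CPU core performance) and 'duration_s'.
    Returns {status: [(core_score, duration_s), ...]}, one series per status.
    """
    series = defaultdict(list)
    for job in jobs:
        point = (float(job["core_score"]), float(job["duration_s"]))
        series[job["status"]].append(point)
    return dict(series)


# Made-up sample records, for illustration only.
sample_jobs = [
    {"status": "Done", "core_score": "12.5", "duration_s": "3600"},
    {"status": "Failed", "core_score": "9.0", "duration_s": "120"},
    {"status": "Done", "core_score": "9.0", "duration_s": "5100"},
]
points = scatter_points(sample_jobs)
```

Each resulting series can be drawn as one set of markers, so that jobs with a different final status are visually separated on the same duration versus core performance plot.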

A web application was developed using the Django web framework on the server side and an HTML, CSS, and JavaScript stack on the client side. It offers the tools and filters needed to highlight different aspects of job execution, such as final status, processors used, cluster names, and the submitting user. The Highcharts JavaScript library was used to visualize the results. After several approaches were investigated, it was decided to store the data in CSV files. The web application uses these datasets as its data source for analysis.
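A minimal sketch of using CSV data as a filterable source, in the spirit of the filters listed above. The abstract does not describe the CSV layout, so the column names (`status`, `site`, `owner`, `core_score`, `duration_s`), the sample rows, and the helpers `load_jobs` and `filter_jobs` are assumptions made for illustration.

```python
import csv
import io


def load_jobs(csv_text):
    """Parse CSV text into a list of dict rows, one per finished job."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def filter_jobs(jobs, **criteria):
    """Keep jobs whose columns equal every given criterion.

    Example: filter_jobs(jobs, status="Done", site="cluster-b")
    """
    return [
        job for job in jobs
        if all(job.get(column) == value for column, value in criteria.items())
    ]


# Made-up sample data with hypothetical column names.
SAMPLE = """status,site,owner,core_score,duration_s
Done,cluster-a,alice,12.5,3600
Failed,cluster-b,bob,9.0,120
Done,cluster-b,alice,9.0,5100
"""

jobs = load_jobs(SAMPLE)
done_on_b = filter_jobs(jobs, status="Done", site="cluster-b")
```

In a Django view, a filtered list like `done_on_b` could be serialized to JSON and passed to the client-side charting code; that wiring is omitted here because the abstract does not specify it.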

The developed system has proven to be invaluable, enabling the identification of issues on remote servers and demonstrating performance disparities among different computing resources. It facilitates efficient monitoring and analysis of HTC jobs, improving the overall understanding of their execution behavior.

Primary authors

Ms Anna Ilina (Joint Institute for Nuclear Research)
Igor Pelevanyuk (Joint Institute for Nuclear Research)
