Description
Executing millions of scientific high-throughput computing (HTC) jobs on distributed, heterogeneous computing resources makes it difficult to observe their status and behavior after completion. To address this, an approach was developed that analyzes jobs using scatter plots showing the dependency between job duration and the relative performance of the CPU cores the jobs were assigned to. A specialized system was then created to automate this analysis. It regularly collects relevant data about finished jobs within the DIRAC infrastructure.
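As a rough illustration of the data behind such a scatter plot, the sketch below turns finished-job records into (core performance, duration) pairs. The field names `start_time`, `end_time`, and `core_performance`, as well as the sample records, are hypothetical and are not taken from the DIRAC job schema.

```python
from datetime import datetime


def scatter_points(jobs):
    """Return (core_performance, duration_seconds) pairs, one per finished job.

    The record keys used here are illustrative assumptions, not DIRAC field names.
    """
    points = []
    for job in jobs:
        start = datetime.fromisoformat(job["start_time"])
        end = datetime.fromisoformat(job["end_time"])
        duration = (end - start).total_seconds()
        points.append((float(job["core_performance"]), duration))
    return points


# Made-up sample records: the second job ran on a core rated twice as fast.
jobs = [
    {"start_time": "2023-05-01T10:00:00", "end_time": "2023-05-01T11:00:00",
     "core_performance": "10.0"},
    {"start_time": "2023-05-01T10:00:00", "end_time": "2023-05-01T10:30:00",
     "core_performance": "20.0"},
]
points = scatter_points(jobs)
```

Plotting performance on one axis and duration on the other makes jobs that ran unusually long for their core speed stand out from the main cluster of points.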
A web application was developed using the Django web framework on the server side and the HTML+CSS+JavaScript stack on the client side. It offers the tools and filters needed to highlight different aspects of the operation, such as final status, processors used, cluster names, and the submitting user. The Highcharts JavaScript library was used to visualize the results. After investigating several approaches, it was decided to store the data in CSV files. The web application uses these datasets as its data source for analysis.
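The CSV-backed filtering described above might look roughly like the following sketch. The column names (`status`, `cluster`, `user`, `core_performance`, `duration_s`) and the sample rows are assumptions made for illustration; the actual file layout is not given here. The output shape, a list of named series with `[x, y]` data points, is the form Highcharts accepts for scatter series.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV layout; the real column names are not known.
SAMPLE_CSV = """job_id,status,cluster,user,core_performance,duration_s
1,Done,cluster-a,alice,10.0,3600
2,Failed,cluster-a,bob,10.5,120
3,Done,cluster-b,alice,20.0,1800
"""


def load_jobs(csv_file):
    """Read job records from an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))


def filter_jobs(jobs, **criteria):
    """Keep only jobs whose columns match every given value exactly."""
    return [job for job in jobs
            if all(job[key] == value for key, value in criteria.items())]


def to_scatter_series(jobs, group_by="status"):
    """Group jobs into scatter series of [core_performance, duration] points."""
    grouped = defaultdict(list)
    for job in jobs:
        point = [float(job["core_performance"]), float(job["duration_s"])]
        grouped[job[group_by]].append(point)
    return [{"name": name, "data": data} for name, data in grouped.items()]


jobs = load_jobs(io.StringIO(SAMPLE_CSV))
alice_series = to_scatter_series(filter_jobs(jobs, user="alice"))
```

In a Django view, a list like `alice_series` could be returned with `JsonResponse(alice_series, safe=False)` and passed to the chart's `series` option on the client side.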
The system has proven valuable in practice: it has helped identify issues on remote servers and has revealed performance disparities among different computing resources. It makes monitoring and analysis of HTC jobs more efficient and improves the overall understanding of their execution behavior.