ATLAS BigPanDA Monitoring and Its Evolution

7 Jul 2016, 09:40
20m
LIT Conference Hall

Sectional reports 2. Operation, monitoring, optimization in distributed computing systems Plenary reports

Speaker

Tatiana Korchuganova (National Research Tomsk Polytechnic University)

Description

BigPanDA is the latest generation of the monitoring system for the Production and Distributed Analysis (PanDA) system. The BigPanDA monitor is a core component of PanDA and also serves the monitoring needs of the new ATLAS Production System, Prodsys-2. BigPanDA has been developed to serve the growing computational needs of the ATLAS Experiment and the wider applications of PanDA beyond ATLAS. Through a system-wide job database, the BigPanDA monitor provides a comprehensive and coherent view of the tasks and jobs executed by the system, from high-level summaries to detailed drill-down job diagnostics. The system has been in production and under continuous development since mid-2014, and today it handles more than 2 million jobs per day distributed over 150 computing centers worldwide. BigPanDA also delivers web-based analytics and system state views to several groups of users, including distributed computing system operators, shifters, physicist end-users, computing managers and accounting services. Providing this information at different levels of abstraction and in real time has required solving several design problems, which are described in this work. We describe our approach, design, experience and future plans in developing and operating BigPanDA monitoring.

Primary authors

Siarhei Padolski (Brookhaven National Laboratory)
Torre Wenaus (Brookhaven National Laboratory)

Co-author

Tatiana Korchuganova (National Research Tomsk Polytechnic University)
