THE BIGPANDA MONITORING SYSTEM ARCHITECTURE

Sep 11, 2018, 3:30 PM
15m
406B

Sectional reports 2. Operation, monitoring, optimization in distributed computing systems

Speaker

Tatiana Korchuganova (National Research Tomsk Polytechnic University)

Description

Currently running large-scale scientific projects involve unprecedented amounts of data and computing power. For example, the ATLAS experiment at the Large Hadron Collider (LHC) collected 140 PB of data over the course of Run 1; during the ongoing Run 2 this volume has been growing at a rate of ~800 MB/s and has recently reached 350 PB. Processing and analyzing such amounts of data demands the development of complex operational workflow and payload systems along with the building of cutting-edge computing facilities. In the ATLAS experiment, a key element of workflow management is the Production and Distributed Analysis system (PanDA). It consists of several core components, one of which is the monitoring. The monitoring is responsible for providing a comprehensive and coherent view of the tasks and jobs executed by the system, from high-level summaries to detailed drill-down job diagnostics. The BigPanDA monitoring has been in production since the middle of 2014 and continuously evolves to satisfy increasing demands in functionality and growing payload scales. Today, in its largest instance, the ATLAS experiment, it keeps track of more than 2 million jobs per day distributed over 170 computing centers worldwide. In this paper we describe the monitoring architecture and its principal features.

Primary authors

Mr Aleksandr Alekseev (National Research Tomsk Polytechnic University); Dr Alexei Klimentov (Brookhaven National Lab); Siarhei Padolski (Brookhaven National Lab); Tatiana Korchuganova (National Research Tomsk Polytechnic University); Torre Wenaus (Brookhaven National Lab)
