The BigPanDA self-monitoring alarm system for ATLAS

11 Sept 2018, 15:45
15m
406B

Sectional reports 2. Operation, monitoring, optimization in distributed computing systems

Speaker

Mr Aleksandr Alekseev (National Research Tomsk Polytechnic University)

Description

The BigPanDA monitoring system is a Web application created to deliver real-time analytics covering many aspects of the ATLAS experiment's distributed computing. The system serves about 35,000 requests daily and provides critical information used as input for various decisions: from distribution of the payload among available resources to issue tracking related to any of the 350k jobs running simultaneously. It evolves intensively; in 2017 alone, the system received 933 commits, delivering new features and expanding the scope of the presented data. The experience of operating BigPanDA in 24/7 mode led to the development of a multilevel self-monitoring alarm system. This ELK-stack-based solution covers all critical components of BigPanDA: from user authentication to management of the number of connections to the DB backend. The developed solution provides intelligent error analysis, delivering to the operators only those notifications that need human intervention. We describe the architecture, principal features, and operational experience of the self-monitoring system, as well as the possibilities for adapting it.

Primary authors

Mr Aleksandr Alekseev (National Research Tomsk Polytechnic University)
Mr Siarhei Padolski (Brookhaven National Laboratory)
Tatiana Korchuganova (National Research Tomsk Polytechnic University)
