JINR Tier-1 service monitoring system: Ideas and Design

5 Jul 2016, 14:30
15m
406B

Sectional reports 2. Operation, monitoring, optimization in distributed computing systems

Speaker

Mr Igor Pelevanyuk (JINR)

Description

In 2015, a Tier-1 center for processing data from the LHC CMS detector was launched at JINR. After a year of operation it ranked third among CMS Tier-1 centers by number of completed jobs. The large and growing infrastructure, the pledged QoS, and the complex architecture all make support and maintenance very challenging. It is vital to detect signs of service failures as early as possible and to have enough information to react properly. Apart from the infrastructure monitoring, which is done on the JINR Tier-1 with Nagios, there is a need for consolidated service monitoring. The top-level services that accept jobs and data from the Grid depend on lower-level storage and processing facilities, which in turn rely on the underlying infrastructure. The sources of information about the state and activity of the Tier-1 services are diverse and isolated from each other. Several tools were examined for the service monitoring role, including HappyFace and Nagios, but the decision was made to develop a new system. The goals are to retrieve monitoring information from various sources, to process the data into events and statuses, and to react according to a set of rules, e.g. by notifying service administrators. Another important part of the system is an interface that visualizes the data and the state of the systems. A prototype has been developed and evaluated at JINR. This report presents the architecture of the system and its current and planned functionality.
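
The three goals named above (retrieve, process into events and statuses, react by rules) can be illustrated with a minimal sketch. This is not the JINR system's actual design: every name here (`Status`, `Event`, `collect`, `to_events`, `react`), the `failed_fraction` metric, and the 0.2 / 0.5 thresholds are hypothetical and chosen only to make the flow concrete.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict, List


class Status(IntEnum):
    OK = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Event:
    service: str
    status: Status
    message: str


# Assumption: a source is any callable returning {metric_name: value}.
Source = Callable[[], Dict[str, float]]


def collect(sources: Dict[str, Source]) -> Dict[str, Dict[str, float]]:
    """Retrieve raw readings from every registered source."""
    return {name: read() for name, read in sources.items()}


def to_events(readings: Dict[str, Dict[str, float]]) -> List[Event]:
    """Turn raw readings into one event per service (thresholds are invented)."""
    events = []
    for service, metrics in readings.items():
        failed = metrics.get("failed_fraction", 0.0)
        if failed >= 0.5:
            status = Status.CRITICAL
        elif failed >= 0.2:
            status = Status.WARNING
        else:
            status = Status.OK
        events.append(Event(service, status, f"failed_fraction={failed:.2f}"))
    return events


def react(events: List[Event], notify: Callable[[Event], None]) -> Status:
    """Apply a single rule: notify on CRITICAL. Return the worst status seen."""
    for event in events:
        if event.status == Status.CRITICAL:
            notify(event)
    return max((event.status for event in events), default=Status.OK)
```

For example, with two stand-in sources, `react(to_events(collect(sources)), sent.append)` would append only the events for services whose failed fraction is at least 0.5 to the list `sent`, and return the worst status for display in an overview.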
