Speaker
Description
The software infrastructure of the BM@N experiment contains a set of various information systems that are essential for the work with experimental or simulated data on all processing stages, including the collection, storage, intermediate processing and physics analysis. Some examples of the systems are the Electronic Logbook Platform, Condition Database and Event Metadata System. In case one of such systems stops functioning, the work with BM@N data by collaboration members gets either impossible or, at least, much less productive. Due to this fact, the timely detection of possible failures in the systems due to software or hardware failures is fairly important. The Monitoring Service described in the report is used to check availability and health status of information systems. This includes measuring, storing, visualizing and sending alert notifications on monitored parameters, such as CPU, memory and disk utilization, DBMS functioning parameters, response times of databases and API endpoints, ping round-trip times, and so on. The current implementation of the BM@N monitoring service is discussed in detail. A related task of building highly available information services is also briefly noted.