Speaker
Mr
Alexandr Mikula
(Institute of Physics of the Czech Academy of Sciences)
Description
A typical WLCG Tier-2 centre runs several hundred servers providing different
services. Checking all log files manually is impossible, so various smart
solutions for monitoring and log file analysis are used. We describe the
procedures used in the Computing Centre of the Institute of Physics in Prague,
which hosts a Tier-2 centre for the ALICE and ATLAS experiments and provides
resources for several other projects.
Nagios is used as the basic monitoring tool set. Our custom plug-in aggregates
warnings and standard error messages and sends them to administrators by email
in summarised form three times per day. Errors on critical components are sent
immediately by email and by SMS (Short Message Service) to predefined phone
numbers. Nagios is complemented by Munin and Ganglia, which give a better
status overview of each server and of the whole infrastructure.
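
The routing described above (immediate notification for errors on critical components, periodic summaries for everything else) might be sketched roughly as follows. This is only an illustration, not the plug-in used at the centre: the `Message` type, the severity labels, the host names and the function names are all hypothetical, and the actual delivery by email or SMS is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    host: str      # server that produced the message (hypothetical field)
    severity: str  # "warning" or "error" (hypothetical labels)
    text: str


def split_messages(messages, critical_hosts):
    """Separate messages that need immediate notification from those
    that can wait for the next periodic summary."""
    immediate, deferred = [], []
    for msg in messages:
        if msg.severity == "error" and msg.host in critical_hosts:
            immediate.append(msg)
        else:
            deferred.append(msg)
    return immediate, deferred


def summarise(deferred):
    """Collapse repeated messages into one line each, with a count."""
    counts = Counter((m.host, m.severity, m.text) for m in deferred)
    return [
        f"{count}x [{severity}] {host}: {text}"
        for (host, severity, text), count in sorted(counts.items())
    ]
```

In such a sketch, a scheduler (for example cron, assumed here) would call `summarise` three times per day on the accumulated messages, while the `immediate` list would be forwarded as soon as it is non-empty.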
The ELK stack is the most recent part of our monitoring setup. Log files from
all production servers are shipped to Logstash for processing and are then
stored in Elasticsearch. We will describe the hardware used, the role of each
machine in the ELK cluster, the technological challenges and obstacles we
encountered, and the cluster setup and its tuning. Typical examples of
searches and graphical outputs will be presented.
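
As a rough idea of what a search over the stored logs could look like, the sketch below builds an Elasticsearch Query DSL request body for recent error messages from one server. It is not one of the searches from the talk: the function name is hypothetical, and the field names `host`, `message` and `@timestamp` are assumptions that depend on how Logstash is configured at a given site.

```python
import json


def build_error_query(host, since="now-1h", size=20):
    """Build a Query DSL body: messages mentioning "error" from one host,
    newer than `since`, most recent first. Field names are assumptions."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                # Exact-match and time-range conditions go in the filter context.
                "filter": [
                    {"term": {"host": host}},
                    {"range": {"@timestamp": {"gte": since}}},
                ],
                # Full-text condition on the log line itself.
                "must": [{"match": {"message": "error"}}],
            }
        },
    }


if __name__ == "__main__":
    # "wn12" is a made-up host name used only for illustration.
    print(json.dumps(build_error_query("wn12"), indent=2))
```

The resulting JSON would be sent to the `_search` endpoint of the relevant index. Depending on the index mapping, the `term` filter may need to target a keyword sub-field instead of `host`.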
Primary author
Mr
Alexandr Mikula
(Institute of Physics of the Czech Academy of Sciences)