Speaker
Mr
Qingbao Hu
(IHEP)
Description
The online computer room of the LHAASO experiment is located at high altitude in a harsh natural environment, and there are no permanent on-site maintenance personnel, so an automatic operation and maintenance system is needed for remote management.
Based on the characteristics of LHAASO cluster management, we have designed a distributed monitoring framework to support site monitoring and management. In this framework, monitoring data is collected in real time at the remote site, then compressed and transferred back to IHEP. Servers at IHEP analyze the data and display the running status via web pages.
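As an illustration only, the collect, compress, and transfer step could be sketched as below. This is a minimal sketch, not the LHAASO implementation: the JSON encoding, the use of `zlib`, the sample field names, and the function names `pack_samples` and `unpack_samples` are all assumptions made for this example.

```python
import json
import zlib


def pack_samples(samples):
    """Serialize a batch of metric samples to compact JSON and compress it.

    Hypothetical helper for the remote site; the real wire format is not
    described in the abstract.
    """
    raw = json.dumps(samples, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)


def unpack_samples(payload):
    """Reverse of pack_samples, as the receiving side might run it."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# A batch of illustrative samples (field names are assumptions).
batch = [
    {"host": "node01", "metric": "cpu_load", "value": 0.42, "ts": 1700000000 + i}
    for i in range(50)
]
payload = pack_samples(batch)
restored = unpack_samples(payload)
```

Batching samples before compression is the design choice that matters here: repeated host and metric names compress well, which reduces traffic on the link between the remote site and the analysis servers.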
The monitoring system monitors and displays, in real time, the performance of both physical and virtual machines, the cluster's service status, job status, and equipment energy consumption. The system detects abnormal equipment and raises real-time alarms; creates and destroys virtual machines based on physical machine states to maintain adequate computing capacity; and provides an accurate cause of failure to temporary maintenance personnel at LHAASO.
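The detection and virtual machine management logic described above might be sketched as follows. This is a hedged illustration under stated assumptions: the host state fields (`alive`, `temp_c`), the temperature threshold, the target VM count, and the functions `find_abnormal` and `plan_vm_changes` are hypothetical and do not come from the abstract.

```python
def find_abnormal(hosts, max_temp_c=80.0):
    """Return names of physical hosts that are down or too hot.

    The threshold and the state fields are assumptions for illustration.
    """
    return [
        name
        for name, state in hosts.items()
        if not state["alive"] or state["temp_c"] > max_temp_c
    ]


def plan_vm_changes(hosts, vms, target_vm_count):
    """Decide which VMs to destroy and how many to create.

    VMs on abnormal hosts are marked for destruction; enough new VMs are
    requested to bring the total back to target_vm_count.
    """
    abnormal = set(find_abnormal(hosts))
    to_destroy = [vm for vm, host in vms.items() if host in abnormal]
    surviving = len(vms) - len(to_destroy)
    to_create = max(0, target_vm_count - surviving)
    return to_destroy, to_create


# Illustrative state: one healthy host, one dead host, one overheated host.
hosts = {
    "node01": {"alive": True, "temp_c": 55.0},
    "node02": {"alive": False, "temp_c": 0.0},
    "node03": {"alive": True, "temp_c": 91.5},
}
vms = {"vm-a": "node01", "vm-b": "node02", "vm-c": "node03"}

abnormal_hosts = find_abnormal(hosts)
to_destroy, to_create = plan_vm_changes(hosts, vms, target_vm_count=3)
```

In this sketch the list of abnormal hosts would feed the alarm, while the destroy and create plan would be handed to whatever virtualization layer the site uses.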
Primary author
Mr
Qingbao Hu
(IHEP)
Co-authors
Dr
Haibo Li
(Institute of High Energy Physics, Chinese Academy of Sciences)
Dr
Qiulan Huang
(Institute of High Energy Physics (IHEP), Chinese Academy of Sciences)
Mr
Wei Zheng
(IHEP)
Mr
Xiaowei Jiang
(IHEP)