Speaker
Dr
Jingyan Shi
(INSTITUTE OF HIGH ENERGY PHYSICS, Chinese Academy of Sciences)
Description
HTCondor, a scheduler focused on high-throughput computing, has become increasingly popular in high energy physics (HEP) computing. The HTCondor cluster at the computing center of the Institute of High Energy Physics in China runs more than 10,000 CPU cores and supports several HEP experiments, such as JUNO, BES, ATLAS, and CMS. The worker nodes owned by the experiments are managed by HTCondor. A sharing pool comprising the worker nodes contributed by all HEP experiments has been created to meet the peak computing demands of the different experiments, which occur at different times. To manage the sharing pool, a database stores the cluster's information, including node and group attributes. The cluster manager can adjust these attributes, which are published to both the scheduler servers and the worker nodes over HTTP. A monitoring watchdog has been developed to check the health status of the worker nodes and report it to the database. Both servers and worker nodes update their own configuration based on the attributes published by the database. The overall resource utilization of the cluster rose from 50% to more than 80% after the sharing pool was created.
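As an illustration of the attribute-driven configuration described above, the following is a minimal sketch of how a worker node might turn published attributes into a local HTCondor configuration fragment. It is not the authors' implementation. The JSON field names (`owner_group`, `shared`, `guest_groups`, `healthy`), the URL layout, and the custom attributes `NodeOwnerGroup` and `AcceptedGroups` are all hypothetical, and the `START` expression assumes jobs carry an `AcctGroup` attribute.

```python
import json
from urllib.request import urlopen


def fetch_node_attrs(url, timeout=10):
    """Download one node's attributes as JSON over HTTP.

    The URL layout and the JSON schema are assumptions for this sketch.
    """
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def render_condor_config(attrs):
    """Render a configuration fragment from published node attributes."""
    owner = attrs["owner_group"]
    groups = [owner]
    # A node in the sharing pool also accepts jobs from guest groups.
    if attrs.get("shared", False):
        groups += [g for g in attrs.get("guest_groups", []) if g != owner]
    group_list = ",".join(groups)

    lines = [
        f'NodeOwnerGroup = "{owner}"',
        f'AcceptedGroups = "{group_list}"',
        # Advertise the custom attributes in the machine ClassAd.
        "STARTD_ATTRS = $(STARTD_ATTRS) NodeOwnerGroup AcceptedGroups",
    ]
    if attrs.get("healthy", True):
        # Only start jobs whose accounting group is in the accepted list.
        lines.append(
            "START = stringListMember(TARGET.AcctGroup, MY.AcceptedGroups)"
        )
    else:
        # A node reported unhealthy by the watchdog accepts no new jobs.
        lines.append("START = False")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = {
        "owner_group": "juno",
        "shared": True,
        "guest_groups": ["bes", "atlas"],
        "healthy": True,
    }
    print(render_condor_config(example))
```

In such a setup the rendered fragment would be written to a local configuration file and the daemons asked to re-read it (for example with `condor_reconfig`); how the actual system applies updates is not stated in the abstract.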
Primary author
Dr
Jingyan Shi
(INSTITUTE OF HIGH ENERGY PHYSICS, Chinese Academy of Sciences)