Machine Learning Technologies to Predict the ATLAS Production System Behaviour

4 Jul 2016, 17:30
1h
Poster presentations Poster Session

Speaker

Maksim Gubin (Tomsk Polytechnic University)

Description

The second generation of the ATLAS Production System (ProdSys2) is an automated scheduling system responsible for data processing, data analysis and Monte-Carlo production on the Grid, supercomputers and clouds. The ProdSys2 project was started in 2014 and commissioned in 2015 (just before LHC Run 2). It now handles O(2M) tasks per year and O(2M) jobs per day, running on more than 250,000 cores; each task is transformed into many jobs. ProdSys2 is evolving to accommodate a growing number of users and new requirements from the ATLAS Collaboration, physics groups and individual users. ATLAS Distributed Computing in its current state is a large, heterogeneous facility running on the WLCG, academic and commercial clouds, and supercomputers. This cyber-infrastructure presents computing conditions in which contention for resources among high-priority data analyses happens routinely. Inevitably, over-utilized computing resources cause degradation of services or significant interruptions in workload and data handling. For these and other reasons, grid data management and processing must tolerate a continuous stream of failures, errors, and faults. This makes simulating ProdSys2 behavior a very challenging task that would require infeasibly large computing power. However, the behavior of the system appears to contain regularities that can be modeled using Machine Learning (ML) algorithms. We propose using an ML approach together with ProdSys2 job execution information to predict the behavior of the system, starting with estimating task completion times. The WLCG ML R&D project started in 2016; we will present our first results on how ProdSys2 behavior can be predicted and simulated. In the next phase, we will use ML algorithms to predict and detect anomalies in ProdSys2 behavior.
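
As a rough illustration of what estimating task completion times from job execution information could look like, the sketch below fits an ordinary least-squares line to a handful of past tasks. Everything in it is an assumption made for illustration: the single feature (number of jobs per task), the sample records, and the function names are hypothetical and are not taken from ProdSys2 or from the authors' method, which the abstract does not specify.

```python
# Hypothetical historical records: (number of jobs in a task, completion time in hours).
# These values are invented for illustration; they are not ProdSys2 data.
history = [(100, 5.0), (200, 9.0), (400, 17.0), (800, 33.0)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_completion_hours(model, n_jobs):
    """Estimate completion time (hours) for a task with n_jobs jobs."""
    slope, intercept = model
    return slope * n_jobs + intercept


model = fit_line(history)
estimate = predict_completion_hours(model, 600)
```

A realistic model would likely need many more features (for example site, priority, or input size) and a more expressive learner; a single-feature linear fit is used here only to keep the idea visible.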

Primary author

Maksim Gubin (Tomsk Polytechnic University)

Co-authors

Dr Alexei Klimentov (Brookhaven National Lab)
Mr Dmitry Golubkov (Institute for High Energy Physics)
Fernando Barreiro (University of Texas at Arlington)
Mr Mikhail Borodin (NRNU MEPHI, NRC KI)
Tadashi Maeno (Brookhaven National Laboratory)
