Data analysis platform for stream and batch data processing on hybrid computing resources

6 Jul 2021, 14:00
15m
407 or Online - https://jinr.webex.com/jinr/j.php?MTID=m573f9b30a298aa1fc397fb1a64a0fb4b

Sectional reports 9. Big data Analytics and Machine learning

Speaker

Ivan Kadochnikov (JINR, PRUE)

Description

The modern Big Data ecosystem provides tools to build a flexible platform for processing data streams and batch datasets. Both the operation of modern large-scale particle physics experiments and the services that support the work of many individual physics researchers generate and transfer large quantities of semi-structured data. It is therefore promising to apply cutting-edge technologies to study these data flows and make the provisioning of these services more effective.
In this work, we describe the structure and implementation of our data analysis platform, built around an Apache Spark cluster. With official support for GPU computing now available in Spark version 3, we propose a change in architecture to utilize these higher-performance resources while keeping the functionality that the platform gains from mainstream Big Data software. GPU support also necessitated moving the computing resource management infrastructure from Apache Mesos to Kubernetes. Finally, to demonstrate the features and operation of the system, we use network packet analysis for security monitoring and anomaly detection, in both batch and stream mode, as a test task.
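
As an illustration of the proposed architecture, the following minimal Python sketch assembles the kind of Spark 3 properties that request GPUs for executors running on Kubernetes. It is not the configuration of the platform described here. The property keys are standard Spark 3 settings, while the helper names (gpu_on_k8s_conf, to_submit_args), the API server URL, the container image name, and the discovery script path are assumptions chosen for the example.

```python
# Sketch: Spark 3 properties for GPU-aware scheduling on Kubernetes.
# Property keys are standard Spark 3 settings; all values are placeholders.


def gpu_on_k8s_conf(api_server, image, gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark properties that request GPUs for executors on Kubernetes."""
    # A fractional amount lets several tasks share one GPU.
    task_amount = "1" if tasks_per_gpu == 1 else str(1 / tasks_per_gpu)
    return {
        # The k8s:// prefix selects the Kubernetes scheduler backend.
        "spark.master": f"k8s://{api_server}",
        "spark.kubernetes.container.image": image,
        # Number of GPUs requested by each executor pod.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Share of a GPU required by each task.
        "spark.task.resource.gpu.amount": task_amount,
        # Vendor part of the Kubernetes resource name (nvidia.com/gpu).
        "spark.executor.resource.gpu.vendor": "nvidia.com",
        # Script reporting GPU addresses to the executor (path is an assumption).
        "spark.executor.resource.gpu.discoveryScript":
            "/opt/spark/examples/src/main/scripts/getGpusResources.sh",
    }


def to_submit_args(conf):
    """Flatten the properties into '--conf key=value' arguments for spark-submit."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical API server and image, for illustration only.
example_conf = gpu_on_k8s_conf("https://k8s.example.org:6443",
                               "example/spark-gpu:3.1.2")
print(" ".join(to_submit_args(example_conf)))
```

The same properties could equally be passed to a SparkSession builder. Under Apache Mesos, the corresponding resource request would be expressed differently, which is one reason a GPU-oriented setup may favour Kubernetes.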

Primary authors

Ivan Kadochnikov (JINR, PRUE)
Sergey Belov (Joint Institute for Nuclear Research, PRUE)
Vladimir Korenkov (JINR, PRUE)
Roman Semenov (JINR, PRUE)
Petr Zrelov (JINR, PRUE)
