Data analysis platform for stream and batch data processing on hybrid computing resources

Jul 6, 2021, 2:00 PM
407 or Online

Sectional reports: 9. Big data Analytics and Machine learning


Ivan Kadochnikov (JINR, PRUE)


The modern Big Data ecosystem provides tools to build a flexible platform for processing data streams and batch datasets. Both the operation of modern giant particle physics experiments and the services that support the work of many individual physics researchers generate and transfer large quantities of semi-structured data. It is therefore promising to apply cutting-edge technologies to study these data flows and make the provisioning of these services more effective.
In this work, we describe the structure and implementation of our data analysis platform, built around an Apache Spark cluster. With official support for GPU computing now available in Spark version 3, we propose a change in architecture to use these more performant resources while keeping the functionality that mainstream Big Data software provides. The need for GPU support, in turn, necessitated a change of the computing resource management infrastructure from Apache Mesos to Kubernetes. Finally, to demonstrate the features and operation of the system, we use network packet analysis for security monitoring and anomaly detection, in both batch and stream mode, as a test task.
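As an illustration of what GPU-aware scheduling in Spark 3 on Kubernetes involves, the following minimal sketch assembles a spark-submit command line. It is not the configuration used by the platform described here, which the abstract does not give. The property names are standard Spark 3 resource-scheduling settings; the helper name, API server URL, container image, and discovery script path are hypothetical placeholders.

```python
from typing import List


def spark_submit_args(k8s_api_url: str, image: str, app: str,
                      gpus_per_executor: int = 1,
                      tasks_per_gpu: int = 4) -> List[str]:
    """Build a spark-submit command that requests GPUs on Kubernetes.

    Hypothetical helper for illustration only; all argument values are
    supplied by the caller and are assumptions, not the authors' setup.
    """
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("GPU and task counts must be positive")

    conf = {
        # Container image that the executor pods run.
        "spark.kubernetes.container.image": image,
        # Number of GPUs each executor asks the cluster manager for.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Vendor prefix of the Kubernetes device plugin resource name.
        "spark.executor.resource.gpu.vendor": "nvidia.com",
        # Script the executor runs to report its GPU addresses.
        # The path is a placeholder and depends on the image.
        "spark.executor.resource.gpu.discoveryScript":
            "/opt/spark/getGpusResources.sh",
        # Fraction of a GPU per task, so several tasks share one device.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }

    args = ["spark-submit", "--master", f"k8s://{k8s_api_url}",
            "--deploy-mode", "cluster"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Placeholder values; substitute a real API server, image and job.
    print(" ".join(spark_submit_args(
        "https://k8s.example.org:6443",
        "registry.example.org/spark-gpu:3",
        "local:///opt/app/packet_analysis.py")))
```

Requesting GPUs as a named resource lets the cluster manager place executors only on nodes that expose such devices, while the per-task fraction controls how many concurrent tasks share each GPU.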

Primary authors

Ivan Kadochnikov (JINR, PRUE)
Sergey Belov (Joint Institute for Nuclear Research, PRUE)
Vladimir Korenkov (JINR, PRUE)
Roman Semenov (JINR, PRUE)
Petr Zrelov (JINR, PRUE)