The International Conference "Distributed Computing and Grid-technologies in Science and Education" will be held in a MIXED format at the Meshcheryakov Laboratory of Information Technologies (MLIT) of the Joint Institute for Nuclear Research (JINR).
Conference Topics:
Research Computer Infrastructure
1. Distributed Computing Systems – technologies, architectures, models, operation and optimization, middleware and services.
2. HPC – supercomputers, CPU architectures, GPU, FPGA.
3. Cloud Technologies – cloud computing, virtualization technologies, automation of the deployment of software infrastructure and applications.
4. Distributed Storage Systems
5. Distributed Computing and HPC Applications in science, education, industry and business, open data.
6. Computing for MegaScience Projects
Computing Science Trends
7. Quantum informatics and computing – information processing, machine learning, communication, program engineering and robotics, simulation of quantum algorithms.
8. Big Data, Machine Learning and Artificial Intelligence – big data analytics and platforms, neural networks and deep learning, intelligent control systems, decision intelligence tools and recommendation systems.
9. e-Learning – e-Learning tools, virtual labs, EdTech and HR Tech, human assets management and development systems.
The conference will include the workshop "Computing for radiobiology and medicine", the workshop "Modern approaches to the modeling of research reactors, creation of the "digital twins" of complex systems" (4-5 July), and the round table "RDIG-M - Russian distributed infrastructure for large-scale scientific projects in Russia".
Conference languages – Russian and English.
Contacts:
Address: 141980, Russia, Moscow region, Dubna, Joliot Curie Street, 6
Phone: (7 496 21) 64019, 64826
E-mail: grid2023@jinr.ru
URL: http://grid2023.jinr.ru/
The large research infrastructure project “Multifunctional Information and Computing Complex (MICC) of JINR” is an integral part of the Seven-Year Plan for the development of JINR for 2024-2030. The project is justified and timely, given the decisive importance of the continuous development of the information and computing infrastructure, which will allow JINR to stay at the forefront of scientific research in the different fields that the Institute is pursuing and will pursue in the coming years.
The main objective of this project is to ensure the computing needed to obtain physics results within the priorities of the JINR research programme. Among them are the experiments of the NICA megaproject and the JINR neutrino programme, the experiments at the LHC and other large-scale experiments, as well as other theoretical and experimental studies, in accordance with the Seven-Year Plan for the development of JINR. Constant support for users from the JINR Laboratories and the Member States is also a high priority.
The MICC is one of JINR’s basic facilities. The MICC computing infrastructure consists of four advanced software and hardware components: the Tier1 and Tier2 grid sites for distributed data processing, the HybriLIT heterogeneous platform with the hyperconverged “Govorun” supercomputer for high-performance hybrid computing, the cloud infrastructure, and the distributed multi-layer data storage system. This set of components makes the MICC unique on the world landscape and allows the scientific community of JINR and its Member States to use all progressive computing technologies within one computing complex.
This talk presents a survey of the JINR MICC and describes its potential for future development.
Quantum many-body control is among the most challenging problems in the field of quantum technologies, yet it is absolutely essential for the further development of this vast field. In this work, we propose a novel approach for solving control problems for many-body quantum systems. The key feature of our approach is the ability to run tens of thousands of iterations of a gradient-based optimization of a control signal within a reasonable time. This is achieved by a tensor-network-based reduced-order modeling scheme that allows one to build a low-dimensional reduced-order model of a many-body system whose numerical simulation is many orders of magnitude faster and more memory efficient than that of the original model; these reduced-order models can be seen as “digital twins” of many-body systems. The control protocols developed for the “digital twins” can then be directly applied to the quantum many-body system of interest.
We validate the proposed method by demonstrating solutions of control problems for a one-dimensional XYZ model, such as controllable information spreading/transmission over the system, and for a spin chain in the many-body localization phase, such as controllable dynamics inversion. Interestingly, our approach by design uses environmental effects (such as non-Markovianity) to make control protocols more efficient: instead of fighting against a potential loss caused by the interaction with the environment, the method uses the interaction as a communication protocol with the environment, which serves as a “memory” for the storage of quantum information. We expect that our results will find direct applications in the study of complex many-body systems, specifically in probing non-trivial quasiparticle excitations, as well as in the development of control tools for quantum computing devices.
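As a toy illustration of the optimization loop described above (our sketch, not the authors' code), the fragment below tunes a piecewise-constant control signal on a cheap surrogate model by plain gradient descent; the two-level Hamiltonian, pulse length and learning rate are illustrative assumptions:

import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)   # Pauli X: the driven term
sz = np.array([[1, 0], [0, -1]], dtype=complex)  # Pauli Z: the drift term

def evolve(pulse, dt=0.1):
    """Propagate |0> under H(t) = sz + u(t) sx with piecewise-constant u(t)."""
    psi = np.array([1.0, 0.0], dtype=complex)
    for u in pulse:
        evals, evecs = np.linalg.eigh(sz + u * sx)
        psi = evecs @ (np.exp(-1j * evals * dt) * (evecs.conj().T @ psi))
    return psi

def infidelity(pulse):
    """1 - population transferred to the target state |1>."""
    return 1.0 - abs(evolve(pulse)[1]) ** 2

# Plain gradient descent with finite-difference gradients; a production
# version would differentiate through the reduced-order model instead.
rng = np.random.default_rng(0)
pulse, eps, lr = 0.1 * rng.normal(size=20), 1e-6, 0.5
for _ in range(2000):
    base = infidelity(pulse)
    grad = np.array([(infidelity(pulse + eps * np.eye(20)[k]) - base) / eps
                     for k in range(20)])
    pulse -= lr * grad
print("final infidelity:", infidelity(pulse))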
We acknowledge the support of the RSF Grant No. 19-71-10092 and of the Russian Roadmap on Quantum Computing.
-
The CVE, CWE, and CAPEC databases and their relationships are briefly introduced. The focus of this paper is on formalization, more specifically on weakness formalization. Software weaknesses are described as formatted text, and there is no widely accepted formal notation for weakness specification. This paper shows how Z-notation can be used for the formal specification of CWE-119.
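For illustration only (this schema is ours, not reproduced from the paper), a bounds-checked buffer read, the operation whose violation CWE-119 describes, can be sketched in a plain-text rendering of a Z schema, with ? marking inputs and ! marking outputs:

SafeRead
  buf? : seq BYTE          -- the buffer being accessed
  i?   : NAT               -- requested zero-based index
  out! : BYTE
  ---------------------------------------------
  i? < #buf?               -- precondition: the access stays in bounds
  out! = buf?(i? + 1)      -- Z sequences are indexed from 1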
The state of queues in modern computing centers is such that a user's task can wait in the queue for weeks before starting. Moreover, even if the queuing system gives a forecast for the start of a task, the forecast often turns out to be incorrect. This happens because, during the week the task spends in the queue, events occur that change the moments at which tasks start. We know the amount of resources required for a task, but it is difficult to get an answer from the queuing system about when the task will start without actually placing the task in the queue.
The paper proposes an improvement of a previously created tool for predicting task characteristics. It is achieved through the use of more modern software interfaces to the SchedMD (Slurm) queuing system, as well as the use of various mathematical forecasting methods to predict the moment a task in the queue will start.
To compare and evaluate machine learning models, a Standard Workload Format (SWF) file was selected with task execution history data from a computing cluster of the University of Luxembourg containing 59715 tasks. The following methods have been explored: linear regression with L2 regularization, support vector machines, random forest, LightGBM, CatBoost, LightGBM with parameter optimization, and CatBoost with parameter optimization. To check the correctness of the prediction, a test bench was built based on the Slurm simulator (SUNY Center for Computational Research at the University at Buffalo, USA), which works on the basis of saved task execution and queuing logs.
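A minimal sketch of one of the explored approaches, assuming synthetic SWF-like features in place of the real Luxembourg trace (the feature names and target model below are illustrative; the study itself also evaluated SVM, random forest, LightGBM and CatBoost variants):

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.integers(1, 256, n),       # requested cores
    rng.uniform(60, 86400, n),     # requested wall time, s
    rng.integers(0, 500, n),       # jobs ahead in the queue
    rng.uniform(0.0, 1.0, n),      # cluster load at submission
])
# Synthetic target: queue wait time in seconds (stand-in for SWF data).
y = X[:, 2] * 30 + X[:, 3] * 3600 + rng.exponential(600, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor().fit(X_tr, y_tr)
print("MAE, s:", mean_absolute_error(y_te, model.predict(X_te)))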
The Data Knowledge Base (DKB) project, which consists of the DKB Python library and an API service, is aimed at knowledge acquisition and metadata integration. Thanks to its ETL workflow engine, it allows one to build a single system with integrated and pre-processed information on top of a distributed metadata infrastructure. Such a system provides fast and efficient responses for summary reports, monitoring tasks (aggregation queries), multi-system join queries, and more. In this work, the recent improvements and the current status of the project are shown, as well as its usage in the WLCG for operational metadata analysis.
Containerization technology for Linux systems appeared many years ago with the OpenVZ and LXC systems. However, wider usage came with the advent of the Docker system and the addition of new container support mechanisms to the Linux kernel itself. Originating as a technology for servers, containerization has gradually moved into the user space as a universal mechanism for distributing and running complex software, including performing calculations and analyzing data on computing clusters or cloud services. Together with the DevOps methodology as a glue, it can be used to organize a full cycle of computation on such systems.
The paper describes the usage of containers, with the help of the DevOps methodology, for organizing calculations on the IHEP central Linux cluster.
Modern computing centers include many engineering systems that provide working conditions for complex computing hardware. Some of these systems have no built-in monitoring components, or such components are very expensive, which makes the systems difficult to monitor. Meanwhile, many open-source software libraries for computer vision and image recognition are available. One of them is OpenCV (Open Source Computer Vision Library), which has been developed since 2006 and implements many computer vision algorithms with hardware acceleration support for recent GPUs and CPUs.
This work describes the development of a system for recognizing the indications of various sensors of engineering equipment: cooling system status, water pressure, and water meters. It is written in Python with OpenCV and tested in the computing center at NRC «Kurchatov Institute» - IHEP.
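A minimal sketch of one possible OpenCV recipe for an analog gauge (our illustration, not the deployed system; the file name and parameters are assumptions): locate the dial with a Hough circle transform, find the needle with a probabilistic Hough line transform, and convert the needle angle into a reading.

import math
import cv2
import numpy as np

img = cv2.imread("pressure_gauge.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)

# The dial is assumed to be the dominant circle in the frame.
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=100,
                           param1=100, param2=50)
cx, cy, r = np.round(circles[0, 0]).astype(int)

# The needle is taken as the longest straight edge near the dial centre.
edges = cv2.Canny(gray, 50, 150)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                        minLineLength=r // 2, maxLineGap=5)
x1, y1, x2, y2 = max(lines[:, 0],
                     key=lambda l: np.hypot(l[2] - l[0], l[3] - l[1]))

# The needle tip is the line end farther from the dial centre; converting
# the angle to a physical value requires a per-gauge calibration.
tip = (x1, y1) if np.hypot(x1 - cx, y1 - cy) > np.hypot(x2 - cx, y2 - cy) \
    else (x2, y2)
angle = math.degrees(math.atan2(cy - tip[1], tip[0] - cx))
print("needle angle:", angle, "deg")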
At the Meshcheryakov Laboratory of Information Technologies (MLIT) of the Joint Institute for Nuclear Research (JINR), the Slurm system is used for the batch processing of data coming from the CMS, ATLAS, ALICE, MPD, BM@N and other experiments. A Slurm job is executed in the operating system (OS) directly on a physical server (a worker node, WN). Before being included in the cluster, a node has to be prepared, i.e. the operating system and the basic software must be installed and configured. This operation is time-consuming, especially when many servers need to be set up at once. To reduce this operational effort, the author proposes to replace the execution environment of user jobs with a containerized one.
The report will present a prototype of batch processing performed in a containerized environment. The technical implementation of the prototype using open-source software will be considered: Slurm, Consul, consul-template, Terraform, Prometheus, Docker.
As part of the organization of simplified access to the computing resources of the central IHEP cluster based on Web technologies, a system architecture was developed based on the free software Apache Guacamole. Apache Guacamole is a clientless remote desktop gateway supporting protocols such as SSH, VNC and RDP via a web browser. Since VNC and RDP support is implemented on the server side using native libraries and only input and output information is transmitted through the browser, Guacamole provides good performance, close to that of standard VNC and RDP clients. This is a good solution because such a system does not require the installation of plugins or third-party software on the user side, such as a VNC client or an SSH server for Windows. In our case, this is an additional tool for working with the cluster that does not require any settings on the user side.
This work covers the installation of the Guacamole system in Docker containers, its configuration, and user interaction with it as part of working with the IHEP cluster.
Building a computer cluster for bioinformatics and biomedical research is a very complex task. Such a cluster has to seamlessly combine the different types of software stacks used for computations, and it should provide the easiest way to organize complex workflows for scientific research. On the other hand, it has to be as simple as possible to use, so that researchers with little or no background in information technology can perform their tasks.
This work presents one possible architecture for such a system and a cluster software stack that can be used to build and operate it, using a computer cluster of the Institute of Translational Medicine at Pirogov Russian National Research Medical University as an example.
Modern scientific research requires the utilization of advanced computing systems, software tools, and data analysis and visualization tools. Currently, there is significant interest in studying spintronic nanostructures such as superconductor/ferromagnet (S/F) hybrids and their potential for controlling magnetic properties using superconducting currents. This research direction promises a substantial reduction in the energy consumption of such devices and their potential application in quantum information processing and superconducting spintronics. Investigations into the topological properties of magnetic and phase dynamics, as well as hybrid nanostructures, offer alternative approaches to information processing by employing novel carriers of information such as magnetic skyrmions and Majorana bound states. Solving the equations that describe the dynamics of such structures requires the utilization of heterogeneous computing systems and specialized software tools.
To address these needs, a platform based on Jupyter Binder technologies has been developed. It is a cloud-based platform that allows running Jupyter notebooks in a web browser without the need to install any additional software on the local device. This modern, user-friendly, and continuously evolving tool is designed for researchers, developers, and data processing specialists who wish to collaborate on projects and share their work. Jupyter notebooks are interactive documents that contain program code, equations, visualizations, and descriptive text. They can be written in multiple programming languages, making them a versatile tool for data analysis and scientific research. Jupyter notebooks are highly popular among researchers and developers due to their simplicity and ease of use, as they enable running code and viewing results directly within the document. Additionally, users can easily share notebooks and collaborate with others by publishing their notebooks online.
The approach of integrating Cluster Management and Cluster Simulation systems addresses the challenges of High-Performance Computing (HPC) cluster management by leveraging simulation to enhance decision-making in case of failures. Foliage's team has extensive experience in building and managing HPC clusters; however, uncertainties regarding cluster management behaviour during failures remain. Simulation is proposed as a solution to improve cluster management by quantifying subsystem degradation and predicting the impact of actions. Foliage's architecture, including a shared environment for the management and simulation systems (“Unified Configuration Space”), enables constant refinement and updating of the simulation model. The integration of various applications through adapters and the use of a functional graph space enable seamless interactions between services. The proposed approach is demonstrated through a simple example showcasing the calculation of overall cluster reliability, as sketched below. Future developments include the integration of AI capabilities for enhanced prediction and automation.
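A toy version of such a reliability calculation (our reading of the example; the structure and the availability figures are assumptions): subsystems that must all be up combine in series, while replicated nodes combine in parallel.

def series(*avail):
    """All subsystems must be up: availabilities multiply."""
    p = 1.0
    for a in avail:
        p *= a
    return p

def parallel(a, n):
    """At least one of n identical replicas must be up."""
    return 1.0 - (1.0 - a) ** n

cooling, power, network = 0.999, 0.9995, 0.998   # illustrative figures
node = 0.97                                       # single-node availability
cluster = series(cooling, power, network, parallel(node, 3))
print(f"overall availability: {cluster:.6f}")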
UDC 519.6+004.42
HIGH-PERFORMANCE COMPUTING ON MATHEMATICAL COPROCESSORS AND GRAPHICS ACCELERATORS USING PYTHON
S. V. Borzunov, A. V. Romanov, S. D. Kurgalin, K. O. Petrishchev
Voronezh State University
The paper considers the use of Python programming language modules for solving resource-intensive tasks. Using the multiplication of real matrices as an example, high-performance computing methods based on mathematical coprocessors and graphics accelerators are demonstrated. It is shown that there is a convenient interface for organizing computations by calling executable code from a Python script. However, the full functionality of graphics accelerators is more difficult to exploit in this way, which is explained by the peculiarities of the operation of streaming cores. The use of the algorithmic language Python in high-performance computing in many cases significantly improves the convenience of creating, debugging and testing program code.
Keywords: high-performance computing, supercomputer, computer cluster, Python programming language, CUDA
Introduction
Many modern computing systems in the Top500 list of the world's leading supercomputers are built according to a hybrid scheme [1, 2]. They include not only general-purpose processors but also energy-efficient mathematical coprocessors such as the Intel Xeon Phi or Nvidia Tesla. Such machines are examples of devices created specifically for massive parallel computations. For example, the Intel Xeon Phi coprocessor makes it possible to carry out computations using up to 240 logical cores, and the Nvidia Tesla A100 mathematical coprocessor includes 6912 CUDA streaming cores. Such supercomputers have a rather complex architecture of interaction between the processor and the coprocessor, which greatly complicates the preparation and maintenance of program code. Programming such systems using the full functionality of parallelization tools has its own peculiarities, since modern GPUs, unlike central processors, are massively parallel computing devices with a relatively large number of computing cores and their own hierarchically organized memory.
The supercomputer center of Voronezh State University (VSU), established in 2002 [3, 4], includes a high-performance computing cluster consisting of 10 nodes, each with two 12-core processors, 128 GB of RAM and a 256 GB SSD. Seven nodes of the cluster contain 2 Intel Xeon Phi accelerators each, and the other three nodes contain 2 Nvidia Tesla accelerators each. The cluster nodes are interconnected by an InfiniBand network. The cluster is used both for scientific computations and in the educational process of the Faculty of Computer Sciences of VSU.
Since the cluster nodes contain accelerators of these two types, ensuring their efficient use is an urgent task.
High-performance computing methods using Python modules
Following the current trend of developing higher-level software tools to simplify the programming of complex computing systems, special Python modules have been created that simplify work with coprocessors. Examples of such extensions are the PyMIC and PyCUDA modules [5]. These modules operate on the same principle: they provide an interface to the basic processor/coprocessor operations.
Of course, an implementation of real matrix multiplication in pure Python cannot boast high computational speed, so mathematical operations of this kind are usually performed by means of third-party libraries, most often written in C or Fortran [6]. An example is the computations performed with the NumPy package, where routines implemented in Fortran are used as the third-party libraries. Accordingly, the first step towards porting computations to the coprocessor is to implement a user library in a compiled language and to integrate it into Python code.
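A small illustration of this point (our example, not from the paper): the same matrix product computed by a pure-Python triple loop and by NumPy, which delegates the work to a compiled BLAS library; on typical hardware the difference is several orders of magnitude.

import time
import numpy as np

n = 200
a = np.random.random((n, n))
b = np.random.random((n, n))

# Pure-Python triple loop.
t0 = time.perf_counter()
c = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
     for i in range(n)]
t_python = time.perf_counter() - t0

# NumPy: the product is computed by a compiled BLAS routine.
t0 = time.perf_counter()
c_np = a @ b
t_numpy = time.perf_counter() - t0

print(f"pure Python: {t_python:.3f} s, NumPy: {t_numpy:.5f} s")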
Computations using the Intel Xeon Phi mathematical coprocessor are the most accessible in terms of programming simplicity. This is explained by the similarity of the coprocessor's architecture to that of the general-purpose processors from Intel and AMD, which makes it possible to use already familiar programming techniques.
Let us consider working with such a coprocessor using the problem of multiplying two real matrices as an example.
Listing 1 shows a variant of the source code of a user library for integration with the PyMIC module. Note that in this case the code is almost identical to a standard implementation of matrix multiplication in C.
Listing 1
PYMIC_KERNEL
void multiplication(const double *A, const double *B, double *C,
                    const long *nrows, const long *ncols)
{
    /* C (nrows x ncols) += A (nrows x ncols) * B (ncols x ncols);
       PyMIC passes every argument as a pointer to a flat buffer. */
    for (long i = 0; i < *nrows; i++)
        for (long j = 0; j < *ncols; j++)
            for (long k = 0; k < *ncols; k++)
                C[i * (*ncols) + j] += A[i * (*ncols) + k] * B[k * (*ncols) + j];
}
The library performs the calculations on the Intel Xeon Phi coprocessor, receiving data from the Python program and returning the result of the computation.
The code of a program that calls the library presented above (libmult.so) and generates the matrices using the NumPy package is shown in Listing 2.
Listing 2
import pymic as mic
import numpy as np

# Select the first coprocessor device and load the user library.
device = mic.devices[0]
library = device.load_library("libmult.so")
stream = device.get_default_stream()

nrows = 1024
ncols = 1024
a = np.random.random(size=(nrows, ncols))
b = np.random.random(size=(nrows, ncols))
c = np.zeros((nrows, ncols))

# Bind the arrays to the device (the data are copied to the coprocessor).
offl_a = stream.bind(a)
offl_b = stream.bind(b)
offl_c = stream.bind(c)

stream.invoke(library.multiplication, offl_a, offl_b, offl_c, nrows, ncols)
stream.sync()
offl_c.update_host()   # copy the result back to the host
stream.sync()
Given the possibility of using up to 240 threads for computations (one physical core is reserved for the needs of the operating system), the Intel Xeon Phi coprocessor provides ample opportunities for studying parallel computing without requiring a deep understanding of its architecture or of special dialects of the C language.
Preparing program code targeting Nvidia GPUs is considerably more difficult. The main difficulty of writing code for the Tesla coprocessor stems from the peculiarities of the CUDA streaming cores.
Listing 3 shows the code of a program multiplying two matrices, written for the Nvidia Tesla coprocessor using the PyCUDA module.
Listing 3
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy

# Matrix dimensions: a is (n x m), b is (m x p), c is (n x p).
(n, m, p) = (3, 4, 5)
n = numpy.int32(n)
m = numpy.int32(m)
p = numpy.int32(p)

a = numpy.random.randint(2, size=(n, m)).astype(numpy.float32)
b = numpy.random.randint(2, size=(m, p)).astype(numpy.float32)
c = numpy.zeros((n, p), dtype=numpy.float32)

# Allocate device memory and copy the input matrices to the GPU.
a_gpu = cuda.mem_alloc(a.size * a.dtype.itemsize)
b_gpu = cuda.mem_alloc(b.size * b.dtype.itemsize)
c_gpu = cuda.mem_alloc(c.size * c.dtype.itemsize)
cuda.memcpy_htod(a_gpu, a)
cuda.memcpy_htod(b_gpu, b)

mod = SourceModule("""
__global__ void multiply(int n, int m, int p, float *a, float *b, float *c)
{
    /* One thread per element of c: thread (x, y) computes c[x][y]. */
    int idx = threadIdx.x * p + threadIdx.y;
    c[idx] = 0.0f;
    for (int k = 0; k < m; k++)
        c[idx] += a[m * threadIdx.x + k] * b[threadIdx.y + k * p];
}
""")

func = mod.get_function("multiply")
func(n, m, p, a_gpu, b_gpu, c_gpu,
     block=(int(n), int(p), 1), grid=(1, 1), shared=0)
cuda.memcpy_dtoh(c, c_gpu)
As in the case of the code for the Intel Xeon Phi, the program consists of data initialization, data transfer to the coprocessor and back, and a kernel that performs the computations. However, the computational kernel here differs greatly from its Intel counterpart because of the difference in architectures.
Conclusion
The use of the algorithmic language Python in high-performance computing in many cases significantly improves the convenience of creating, debugging and testing program code. Moreover, when training specialists to work on modern supercomputers, it is advisable to start with the Intel coprocessors, which are simpler to program, and then move on to the Nvidia coprocessors, which have become the recognized industry standard today. In summary, using the PyMIC and PyCUDA modules for top-level operations makes it possible to significantly simplify work with a supercomputer and to provide seamless integration with the tools of the popular Python programming language.
References
1. Korenkov V., Dolbilov A., Mitsyn V. et al. The JINR distributed computing environment // EPJ Web of Conferences (CHEP 2018). 2019. Vol. 214. P. 03009. DOI: 10.1051/epjconf/201921403009.
2. Kurgalin S. D., Borzunov S. V. Using the resources of the Supercomputer Center of Voronezh State University in learning processes and scientific researches // Proceedings of the International Conference "Russian Supercomputing Days", Moscow, September 24-25, 2018. Moscow: MSU Publishing House, 2018. P. 972-977.
3. Kurgalin S. D. Technologies of high-performance parallel computing for scientific research in nuclear physics and in distance learning // Distributed Computing and GRID-Technologies in Science and Education: Proc. of the Int. Conf., Dubna, June 29 - July 6, 2004. Dubna, 2004. P. 155-160.
4. Kurgalin S. D. Modeling of nuclear-physics processes on a parallel computer cluster // Distributed Computing and GRID-Technologies in Science and Education: Book of Abstr. of the Int. Conf., June 29 - July 2, 2004, Dubna. Dubna, 2004. P. 59.
5. PyCUDA 2022.2.2 documentation. 2022. URL: https://documen.tician.de/pycuda/ (accessed: 12.05.2023).
6. Borzunov S. V., Kurgalin S. D. Supercomputer Computing: A Practical Approach. St. Petersburg: BHV-Petersburg, 2019. 256 p.
About the authors
Sergey V. Borzunov, Cand. Sci. (Phys.-Math.), Associate Professor, Department of Digital Technologies, Faculty of Computer Sciences, Voronezh State University. 394018, Voronezh, Universitetskaya pl., 1.
E-mail: sborzunov@gmail.com. Phone: 8 (473) 220-83-84.
Alexander V. Romanov, Senior Lecturer, Department of Digital Technologies, Faculty of Computer Sciences, and Leading Engineer of the Supercomputer Center, Voronezh State University. 394018, Voronezh, Universitetskaya pl., 1.
E-mail: alphard.rm@gmail.com. Phone: 8 (473) 220-83-84.
Sergey D. Kurgalin, Dr. Sci. (Phys.-Math.), Professor, Head of the Department of Digital Technologies, Faculty of Computer Sciences, Voronezh State University. 394018, Voronezh, Universitetskaya pl., 1.
E-mail: kurgalin@bk.ru. Phone: 8 (473) 220-83-84.
Konstantin O. Petrishchev, student, Faculty of Computer Sciences, Voronezh State University. 394018, Voronezh, Universitetskaya pl., 1.
E-mail: vrn.kostyan.p@mail.ru. Phone: 8 (961) 616-97-93.
This work is devoted to the creation of a software system supporting population-based methods for solving complex high-dimensional discrete and continuous optimization problems. The most popular swarm and evolutionary optimization methods are reviewed. Existing software solutions in this area are analyzed with regard to the possibility of parallel execution of this class of algorithms. A generalized model of population-based algorithms is proposed, together with the structure of a software system that allows parametric and structural tuning of the algorithms already implemented in it, as well as the creation of new algorithms. Possible schemes of automatic parallel implementation of the algorithms included in the system are considered.
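As a minimal sketch of one population-based method from the reviewed family (our illustration; the coefficients are common textbook defaults, not the system's actual settings), particle swarm optimization of a continuous test function can be written as:

import numpy as np

def sphere(x):                      # test objective: sum of squares
    return np.sum(x * x, axis=-1)

dim, swarm, iters = 20, 40, 500
rng = np.random.default_rng(1)
x = rng.uniform(-5, 5, (swarm, dim))        # particle positions
v = np.zeros_like(x)                        # particle velocities
pbest, pbest_f = x.copy(), sphere(x)        # personal bests
gbest = pbest[np.argmin(pbest_f)]           # global best

w, c1, c2 = 0.72, 1.49, 1.49                # inertia, cognitive, social
for _ in range(iters):
    r1, r2 = rng.random((2, swarm, dim))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    f = sphere(x)
    improved = f < pbest_f
    pbest[improved], pbest_f[improved] = x[improved], f[improved]
    gbest = pbest[np.argmin(pbest_f)]

print("best value found:", pbest_f.min())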
Quasiprobability distributions associated with quantum states play the same role as probability distribution functions in classical statistical physics, but with the key difference that the quantum counterparts can take negative values for some states. Due to this fact, all states are divided into classes: the first comprises the "classical states", whose quasiprobability distributions are non-negative, while the second consists of the complementary states, the carriers of a certain "quantumness". One possible way to quantify the "quantumness-classicality" of states is based on evaluating their remoteness from a set of reference classical states. The paper studies this type of distance indicator of non-classicality in finite-dimensional quantum systems, supposing that the classical states are those whose Wigner function is non-negative. We prove a representation of the distance indicator of non-classicality as a piecewise function with support provided by the special Wigner quasiprobability non-negativity polytope in the simplex of a state's eigenvalues, discuss the indicator's properties, and exemplify the details of its evaluation for the qubit and qutrit cases.
Quantum computing performance depends on the properties of the underlying physical qubits. The depth of an algorithm is limited by the decoherence of the qubits. In this respect, the design of algorithms that quantify the decoherence of qubits is of particular interest. Qubit models are necessary in order to fit the data. We present the performance of our SU2 C++ package for polymorphically implemented SU(2) scalars, applied to spin-echo modelling, against measurements of the IBM armonk qubit.
The physical concept of elementary and composite systems forms the pillar on which our understanding of quantum phenomena stands. The present talk aims to discuss a complementary character of the description of elementary and composite finite-dimensional quantum systems within the modern phase-space formulation of quantum mechanics.
We will give a generic method of constructing the Wigner quasiprobability distributions of composite quantum states, analyse their properties, and discuss the features making them different from those distributions associated with elementary systems.
The paper simulates the process of transferring entangled states along a chain of tryptophans a) in a cell's microtubule, b) connected by dipole-dipole interaction. The work obtains the conditions under which the migration of entangled states in the microtubule is possible.
The results of the work allow us to speak of a signaling function of microtubule tryptophans, which work as a quantum repeater that transmits quantum entangled states by relaying them through intermediate tryptophans.
(see detail in the email)
Division of Computational Physics, MLIT, JINR
The status of the MPDROOT framework with respect to current and future tasks of the MPD experiment is considered. The experience of using the DIRAC interware for mass production and reconstruction of simulated data for the MPD experiment is also reviewed.
As in other large particle collision experiments, the topic of distributed event processing and computing is extremely relevant in the BM@N experiment, the first ongoing experiment of the NICA project, due to the heavy data flow, whose sequential processing would take hundreds of years. In the last run of the BM@N experiment alone, about half a petabyte of raw data was collected, and when the experiment reaches its design parameters, the amount of data will increase by an order of magnitude. To solve this problem, combine all distributed resources of the experiment into a single computing system with a single storage system, and automate job processing flows, a software-computing architecture has been developed and is being implemented; it includes a complex of software services for the distributed processing of the BM@N data flow and will be presented in the report. Furthermore, a set of online and offline software and information systems has been adapted for mass data production as part of the BM@N computing infrastructure. In addition, various auxiliary services that ensure the control and quality of the distributed processing of physics events will be shown.
The SPD (Spin Physics Detector) is a planned spin physics experiment at the second interaction point of the NICA collider that is under construction at JINR. The main goal of the experiment is to test the basics of QCD via the study of the polarized structure of the nucleon and spin-related phenomena in collisions of longitudinally and transversely polarized protons and deuterons at a center-of-mass energy of up to 27 GeV and a luminosity of up to 10^32 1/(cm^2 s). The raw data rate from the detector at design luminosity is expected to reach 20 GB/s, or 200 PB/year. The key challenge of SPD computing is the fact that no simple selection of physics events is possible at the hardware level, because a trigger decision would depend on the measurement of momentum and vertex position, which requires tracking. Moreover, the free-running DAQ provides a continuous data stream, which requires sophisticated unscrambling prior to building individual events. That is the reason why any reliable hardware-based trigger system turns out to be over-complicated, and the computing system will have to cope with the full amount of data supplied by the DAQ system. Therefore, a fast online data reconstruction and filtering system designed to reduce the data by a factor of 20-50 is the cornerstone of the data processing pipeline. The report presents the design of the planned online filter and the results of key component modeling.
For many years, the Worldwide LHC Computing Grid (WLCG) has provided the distributed computing infrastructure for the CERN Large Hadron Collider experiments. During that time, it has seen steady evolution in technologies as well as growth, to deal with ever increasing data rates. Those trends must be made to continue, to allow the WLCG to take on High-Luminosity LHC data volumes as of 2029. In this contribution, we describe improvements across a wide range of services and software, some of which already bring benefits today. Thanks to many partners and projects, not only do the LHC experiments profit, but many other communities as well.
-
The JINR grid infrastructure was created at the Meshcheryakov Laboratory of Information Technologies and has developed successfully from year to year in accordance with the rapid development of information technologies, computing equipment and computing technologies, satisfying user needs. Thus, the participation of JINR scientists in the experiments at the Large Hadron Collider (LHC) at CERN entailed the creation of computing clusters at JINR, which are integrated into a distributed computing environment for processing and storing hundreds of petabytes of data obtained from the LHC experimental facilities. In this regard, the Tier1 center for the CMS experiment, which is one of the best in the world, and the Tier2 center, which is the best in Russia and serves all experiments at the LHC and other studies with the participation of JINR scientists using grid technologies, should be particularly noted. At present, both of these centers serve the experiments at the NICA complex. The current state of the JINR grid infrastructure and its future development will be presented.
-
As part of the PIK nuclear reactor reconstruction project, the PIK Data Centre was commissioned in 2017. After more than five years of successful operation we would like to share our experience in one of the crucial parts of business continuity: monitoring. PIK Data Centre monitoring covers everything from engineering systems such as cooling machines to storage and computing nodes, jobs and user activities. Operational information gathered in several comprehensive user interfaces allows operators to have a bird's eye view of the entire facility and react quickly in the event of a failure.
The WLCG Tier-2 computing center at NRC "Kurchatov Institute" - IHEP has been participating in the Worldwide LHC Computing Grid from the very beginning, since 2003. Over a twenty-year period it has become one of the biggest WLCG Tier-2 centers in Russia. The Ru-Protvino-IHEP grid site provides computing resources for LHC experiments in high energy physics such as ATLAS, ALICE, CMS and LHCb, and for internal experiments at NRC "Kurchatov Institute" - IHEP such as OKA, BEC and others.
In this work, the current status of the computing capacities, networking and engineering infrastructure, and the software used will be shown, as well as the evolution of the computing center over the last 20 years toward stable and efficient operation.
We present our results on updating the middleware of Russian GRID sites so that they can continue processing ALICE data in the future, including the high-luminosity (HL) stage of Large Hadron Collider operation. We will share our experience with one of the GRID sites and discuss some practical cases of scaling the updated middleware to other Russian sites in 2022-2023.
This work is supported by SPbSU grant ID 94031112.
Every year the ATLAS experiment produces several billion event records in raw and other formats. The data are spread among hundreds of computing grid sites around the world. The EventIndex is the complete catalogue of all ATLAS real and simulated events, keeping references to all permanent files that contain a given event in any processing stage; its implementation has been substantially revised in advance of LHC Run 3 to be able to scale to the higher production rates. During physics analysis, it is often necessary to retrieve many events from different runs to inspect their properties in detail and check their reconstruction parameters; manual extraction of such data takes a lot of time. The Event Picking Server automates the procedure of finding the location of the events and extracting and collecting them into separate files. It supports different formats of events and has an elastic workflow for different input data. The convenient graphical interface of the Event Picking Server is integrated with the ATLAS SSO. The monitoring system controls the performance of all parts of the service.
The CREST project is a new realization of the Conditions DB for the ATLAS experiment, using a REST API and JSON support. This project simplifies the conditions data structure and optimizes data access.
CREST development requires not only a client C++ library (CrestApi) but also various tools for testing software and validating data. A command line client (crest_cmd) was written to get quick access to the stored data. A set of utilities was used to dump the data from CREST to the file system and to test the client library and the CREST server using dummy data. Now the CREST software is being tested using real conditions data converted to the CREST format with the COOL to CREST converter. The Athena code (the ATLAS event processing software framework) was modified to operate with the new conditions data source.
The P-BEAST is a highly scalable, highly available and durable system for archiving monitoring information of the trigger and data acquisition (TDAQ) system of the ATLAS experiment at CERN. The Grafana plugin communicates with P-BEAST via a REST API with JSON support. Grafana, a multi-platform open-source analytics and interactive visualization web application, is continuously developed and incorporates modern technologies. As a result, the plugin has to be rebuilt for almost every new Grafana version. The early versions of the plugin were written in JavaScript with the support of the Angular web framework. For proper support of the plugin, the Grafana server was also slightly modified, as Grafana did not support some required options. This summer a new 10th version should be released, and it will not support Angular any more. For this reason, the plugin was completely redesigned: it is now developed in TypeScript with the help of the React library and DataFrame support.
The BM@N 8th physics run using xenon ion beams was successfully completed in February 2023, resulting in the recording of approximately 550 million events. They were recorded in the form of 31306 files with a combined size exceeding 430 TB. However, the reconstruction of these files demands significant computing resources, which is why a distributed infrastructure unified by DIRAC was chosen for this task. The first objective was to transfer the raw files from EOS at LHEP to DIRAC storage, based on EOS at LIT. This was achieved through parallel transfer using multiple independent DIRAC jobs. Once the data were accessible by all the resources integrated in DIRAC, the profiling of digitization and reconstruction jobs was performed to determine the computing resource requirements. For the digitization step, three computing resources were selected: Tier1 and LHEP for 99% of the files, and Govorun for large files ranging from 16 to 250 gigabytes. Finally, the Tier1, Tier2 and LHEP clusters were utilized to reconstruct the files obtained after digitization. The BM@N 8th physics run was the first time DIRAC had been used for raw data reconstruction at JINR in production rather than just in test mode. As a result, a set of approaches, systems and methods was developed during this campaign, which will help reduce the effort required for future data reconstruction at JINR.
The Configuration Information System (CIS) has been developed for the BM@N experiment to store and provide data on the configuration of the experiment's hardware and software systems while collecting data from the detectors in the online mode. The CIS allows loading configuration information into the data acquisition and online processing systems, activating the hardware setups and launching all necessary software applications with the required parameters on specified distributed nodes. The architecture of the CIS mainly comprises the Web Interface, the Configuration Database and the Configuration Manager, where the Configuration Manager uses the API of the chosen Dynamic Deployment System (DDS), developed by the FAIR collaboration, for running and managing tasks, as well as providing their intercommunication. The SSH plugin of the DDS is employed to control online processing tasks in the BM@N experiment. The client-server architecture of the CIS will be presented in detail, where the client has been implemented as a Web service to manage configuration parameters by users and monitor active online tasks. Furthermore, log files of all running tasks controlled by the information system and logs of DDS sessions collected from distributed hosts are provided to users via the Web interface of the CIS.
The high-precision coordinate detectors of the tracking system in the BM@N experiment are based on microstrip readout. The complete tracking system designed for the latest xenon physics run (winter of 2023) consists of three parts: an ion-beam tracker and two trackers (inner and outer) for the registration of charged particles after primary interactions. The report reviews the features and implementation of the method for reconstructing spatial coordinates from two-coordinate microstrip readout planes with respect to the latest run configuration. This work also presents the development of a unified software model that implements the described data processing for the tracking detectors.
Machine learning methods are being proposed for more and more high energy physics tasks nowadays, in particular for charged particle identification (PID). This is because machine learning algorithms improve PID in the regions where conventional methods fail to provide good identification. This report gives the results of applying gradient boosted decision trees to particle identification in the MPD experiment.
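A minimal sketch of the approach (our illustration, not the MPD code): a gradient boosted classifier separating two particle species by momentum, a crude Bethe-like dE/dx and a TOF-like velocity; the synthetic features stand in for real detector measurements.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20000
label = rng.integers(0, 2, n)               # 0: pion, 1: proton
p = rng.uniform(0.2, 2.0, n)                # momentum, GeV/c
mass = np.where(label == 1, 0.938, 0.140)
dedx = 1.0 / (p / mass) ** 2 + rng.normal(0, 0.05, n)   # crude Bethe-like
beta = p / np.hypot(p, mass) + rng.normal(0, 0.01, n)   # TOF velocity

X = np.column_stack([p, dedx, beta])
X_tr, X_te, y_tr, y_te = train_test_split(X, label, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))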
In this work, we consider the planar three-body problem with zero angular momentum, a symmetric initial configuration, and bodies of equal masses. We are interested in special periodic orbits called choreographies. A choreography is a periodic orbit in which the three bodies move along one and the same trajectory with a time delay of T/3, where T is the period of the solution. Such an orbit is called trivial if it is a topological power of the famous figure-eight choreography; otherwise it is called nontrivial. A specialized numerical search for new nontrivial choreographies is performed. The search is based on a modification of Newton's method used with high-precision floating-point arithmetic. With only 3 nontrivial choreographies known so far, we found over 150 new ones. The linear stability of all found orbits is investigated by high-precision computation of the eigenvalues of the monodromy matrices. The extensive computations are performed on the "HybriLIT" platform.
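For illustration (our sketch, not the actual search code), the basic ingredient, Newton iteration in high-precision arithmetic, looks as follows with mpmath; the one-dimensional test equation stands in for the periodicity conditions actually solved:

from mpmath import mp, mpf, cos, diff

mp.dps = 60                      # work with 60 significant digits

f = lambda x: cos(x) - x         # toy equation f(x) = 0
x = mpf("0.7")                   # initial guess
for _ in range(10):
    x = x - f(x) / diff(f, x)    # Newton step with a numerical derivative
print(x)                         # converges to ~60 digits in a few steps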
Particle-in-Cell (PIC) simulation of high-beta plasmas in an axisymmetric mirror machine is of interest because of a new proposal for a plasma confinement regime with extremely high pressure, equal to the pressure of the magnetic field, the so-called diamagnetic confinement. The results of the simulations can be used for the development of aneutronic fusion.
In this work, we show our latest PIC code developed for the numerical simulation of fully ionized hydrogen plasma with the injection of ions and electrons inside a cylindrical trap. It is an MPI-based Fortran code with data alignment for efficient AVX2/AVX512 auto-vectorization. We will present the role of manual and automatic data alignment in the performance characteristics of the programs with different versions of Fortran compilers.
This work was supported by the Russian Science Foundation (project 19-71-20026).
The dynamics of the φ0 Josephson junction and the phenomenon of magnetization reversal under the influence of a current pulse are considered. The dynamics of the φ0 junction is described by a closed system of equations consisting of the Landau-Lifshitz-Gilbert equations for the magnetization and the equations of the resistive model for the phase difference of the junction, which constitutes a Cauchy problem for a system of ordinary nonlinear differential equations. The numerical solution of this system is based on the two-stage Gauss-Legendre method. A parallel implementation for a large number of computations over a wide range of parameters has been carried out using MPI and OpenMP technologies. Computer simulations were performed on the heterogeneous platform "HybriLIT" and on the "Govorun" supercomputer of the Multifunctional Information and Computing Complex of the Meshcheryakov Laboratory of Information Technologies, JINR (Dubna). The influence of the current pulse parameters on the periodicity of the appearance of magnetization reversal domains has been studied. The results of a numerical study of the effect of the model parameters on the realization of magnetization reversal are also presented. The results of test calculations assessing the effect of the parallel implementation on the "HybriLIT" platform and the "Govorun" supercomputer are demonstrated.
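A minimal sketch of the two-stage Gauss-Legendre scheme mentioned above (our illustration, not the production MPI/OpenMP code), applied to a generic ODE system y' = f(t, y), with the implicit stage equations solved by fixed-point iteration; the damped-pendulum right-hand side is a stand-in for the phase/magnetization equations:

import numpy as np

SQ3 = np.sqrt(3.0)
A = np.array([[0.25, 0.25 - SQ3 / 6], [0.25 + SQ3 / 6, 0.25]])
B = np.array([0.5, 0.5])
C = np.array([0.5 - SQ3 / 6, 0.5 + SQ3 / 6])

def gl2_step(f, t, y, h, sweeps=20):
    """One step of the 4th-order, two-stage Gauss-Legendre scheme."""
    k = np.array([f(t, y), f(t, y)])             # initial stage guesses
    for _ in range(sweeps):                      # fixed-point iteration
        k = np.array([f(t + C[i] * h, y + h * (A[i] @ k))
                      for i in range(2)])
    return y + h * (B @ k)

# Example: a damped pendulum as a stand-in right-hand side.
f = lambda t, y: np.array([y[1], -np.sin(y[0]) - 0.1 * y[1]])
t, y, h = 0.0, np.array([1.0, 0.0]), 0.01
for _ in range(1000):
    y = gl2_step(f, t, y, h)
    t += h
print(y)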
Spherically symmetric localized long-lived pulsating states (oscillons) in the three-dimensional φ^4 theory are numerically investigated in a ball of finite radius. These structures are of interest in a number of physical and mathematical applications, including several cosmological contexts. The numerical approach is based on the numerical continuation of solutions of a boundary value problem for the respective nonlinear PDE on the rectangle [0,T]×[0,R], where T is the period of oscillations and R is the finite radius. The stability analysis is based on Floquet theory and reduces to repeatedly solving the Cauchy problem and subsequently solving the eigenvalue problem for the matrix formed from these solutions. Details of the numerical approach are presented, including the parallel implementation of the respective MATLAB code. Numerical results on the spatio-temporal structure and bifurcations of the oscillons are presented, as well as the results of test calculations demonstrating the effect of the parallel implementation on the resources of the JINR Multifunctional Information and Computing Complex.
Under external radiation, a constant-voltage step, the so-called Shapiro step, appears on the current-voltage characteristic (IV curve) of a Josephson junction. The width of this step depends on the amplitude and frequency of the external radiation, as well as on the model parameters. When numerically simulating the dynamics of a Josephson junction and studying the influence of the model parameters on the steps, time-consuming computations have to be carried out for various values of the model parameters. Therefore, the development of efficient algorithms for computing the IV curve and the dependence of the Shapiro step width on the model and radiation parameters is a pressing problem for researchers.
In the present work, algorithms for computing the IV curve of a Josephson junction under external radiation and for finding the step width in the course of computing the IV curve have been developed in Python in the Jupyter Book environment. A parallel algorithm for calculating the dependence of the Shapiro step width on the amplitude of the external radiation has also been implemented, and the efficiency of the parallel computation is shown.
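A minimal sketch of such an IV-curve computation (our illustration, in normalized units with illustrative parameters), using the resistive model dφ/dt = i + A sin(ωt) - sin φ and taking the voltage as the time-averaged dφ/dt; the Shapiro step appears as a plateau at V = ω:

import math
import numpy as np

def voltage(i_dc, amp=0.5, w=0.8, dt=0.01, n_skip=20000, n_avg=100000):
    """Time-averaged dphi/dt for a given bias current (RSJ model)."""
    phi, v_sum = 0.0, 0.0
    for step in range(n_skip + n_avg):
        t = step * dt
        dphi = i_dc + amp * math.sin(w * t) - math.sin(phi)
        phi += dphi * dt                  # explicit Euler step
        if step >= n_skip:                # skip the transient
            v_sum += dphi
    return v_sum / n_avg

# Sweep the bias current; the step width can be read off as the range
# of currents for which V stays locked near w.
for i_dc in np.arange(0.0, 1.5, 0.05):
    print(f"i = {i_dc:.2f}  V = {voltage(i_dc):.3f}")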
Mathematical modeling and computational experiments serve as an important tool in studying charge transfer processes in biopolymers such as DNA. The relevance of research on charge transfer in DNA is connected, in particular, with the development of nanobioelectronics, which is a potential replacement for modern microelectronics based on semiconductor technologies.
From a mathematical point of view, the problem of modeling charge transfer in quasi-one-dimensional biomolecules reduces to the following. The biopolymer is modeled by a chain of sites (groups of strongly coupled atoms) whose motion is described by classical equations of motion. A quantum particle (an electron or a hole) propagates along the chain of sites, and its motion is described by the Schrödinger equation. The charge can deform the chain, and vice versa: displacements of the sites affect the probabilities of finding the charge.
When modeling the motion of the sites, various ways of setting the temperature can be applied (temperature is understood as the average kinetic energy of the atoms). We compared two variants: the Langevin thermostat (friction terms and a random force with a special distribution are added to the classical equations of the system) and a Hamiltonian system in which the temperature is set only by the initial distribution of site velocities and displacements. For the charge, various initial states were considered: a polaron, a uniform distribution, and creation at a single site.
According to the simulation results, the transition from the polaron regime to the delocalized state occurs in the same range of thermal energy for both variants; however, for the Hamiltonian system the temperature here is not set by the initial data but is determined after the computation from the average kinetic energy. For high temperatures, the results averaged over a set of trajectories in the system with a random force and the time-averaged results for the Hamiltonian system are close, which does not contradict the ergodic hypothesis. From a practical point of view, at biologically relevant temperatures T ≈ 300 K either variant of the thermostat can be used.
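A minimal sketch of the Langevin thermostat variant described above (our illustration; the chain model, units and parameters are assumptions, not those of the DNA model): friction plus a random force whose variance obeys the fluctuation-dissipation relation.

import numpy as np

n_sites, dt, gamma, kT, mass = 100, 0.01, 0.5, 1.0, 1.0
rng = np.random.default_rng(0)
u = np.zeros(n_sites)                              # site displacements
v = rng.normal(0, np.sqrt(kT / mass), n_sites)     # Maxwell initial speeds

def force(u):                                      # nearest-neighbour springs
    f = np.zeros_like(u)
    f[1:] += u[:-1] - u[1:]
    f[:-1] += u[1:] - u[:-1]
    return f

sigma = np.sqrt(2.0 * gamma * kT / (mass * dt))    # fluctuation-dissipation
for _ in range(10000):                             # Euler-Maruyama integration
    noise = sigma * rng.normal(size=n_sites)
    v += dt * (force(u) / mass - gamma * v + noise)
    u += dt * v
# At equilibrium the mean kinetic energy per site should approach kT/2.
print("mean kinetic energy per site:", 0.5 * mass * np.mean(v**2))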
We are grateful to the staff of the Shared Computing Facility of the Keldysh Institute of Applied Mathematics of the RAS (http://ckp.kiam.ru) for providing the K-100 and K-60 computing resources.
One of the modern technologies for obtaining new materials and coatings is the deposition of nanoparticles on a substrate. This process is relevant for many industries and the social sphere. Constantly increasing requirements for the quality and range of this type of product lead to the need for detailed theoretical and experimental studies of the deposition process under various conditions. One of the methods of such research is computer and supercomputer modeling. In this paper, a comprehensive methodology of such modeling is presented, covering all stages of the computational experiment. The basis of the methodology is direct atomic-molecular modeling of the deposition process on a microscopic scale. Parallel technologies are used for the computer implementation of the methodology, which allow obtaining results with a given level of resolution and accuracy. In the upcoming report, various aspects of the developed modeling technology are discussed using the example of the deposition of nickel nanoparticles on a substrate.
The work was carried out with the support of the Russian Science Foundation, project No. 21-71-20054.
This paper presents a mathematical and numerical model of the basal melt of Antarctic glaciers. At each point of the continent for which the heights above sea level of the lower and upper ice edges are known, a one-dimensional three-phase Stefan problem with moving phase boundaries is solved along the vertical direction. The model makes it possible to calculate the dynamics of the temperature distribution and the law of motion of the phase boundaries under real-life conditions, as well as the possibility of the appearance/degeneration of a liquid phase under the glacier and on its surface. The calculations were carried out using data from the international Bedmap2 project, which contains the topological characteristics of Antarctica on a uniform grid with a step size of 1 km.
The equations were discretized using the finite difference method with an implicit difference scheme of the first order of accuracy in time and space on an inhomogeneous grid which gets finer near the phase boundaries.
The model allows for full data parallelism. The simulation was carried out in the MATLAB environment in a parallel asynchronous mode. The software implementation showed very good scalability both in computing on an SMP node and on a cluster.
A number of optimizations were carried out with the help of the profiler built into MATLAB, which significantly reduced the simulation time. In particular, a tridiagonal matrix algorithm for solving systems of linear equations was implemented in C with the MEX API for integration with MATLAB, which made it possible to reduce the calculation time by a factor of five.
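For illustration, a Python rendering of the tridiagonal (Thomas) algorithm that the authors implemented in C via the MEX API (our sketch; the real code operates on the Stefan-problem grids):

import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system in O(n); a, b, c are the sub-, main-
    and super-diagonals, d the right-hand side."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                 # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Quick check against a dense solver on a small system.
n = 6
a = np.r_[0.0, np.full(n - 1, -1.0)]      # a[0] is unused
b = np.full(n, 2.0)
c = np.r_[np.full(n - 1, -1.0), 0.0]      # c[-1] is unused
d = np.arange(1.0, n + 1)
M = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
print(np.allclose(thomas(a, b, c, d), np.linalg.solve(M, d)))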
The calculation for a grid of 120,000 points on 28 cores took about 2.5 hours. A basal melt rate of 29 Gt/year was obtained for the Antarctic mainland.
Discrete Fourier methods suffer from the well-known Gibbs phenomenon due to their limited time window. A solution to this problem has been apodisation, a truncation of the time window that softens its edges. Still, due to discretisation, such methods are imperfect; the Fourier apodisation reported here alleviates this aspect. Although Fourier-space apodisation is known, no consistent approach exists to date that exactly eliminates spectrum leakage tails. Our FoxLima discrete Fourier transform package, fielding these methods, has been adapted to design a wavelet digital filter with this type of Fourier-space apodisation. We report the performance of this filter on simulated neutron noise data.
At the Frank Laboratory of Neutron Physics, the IBR-2M pulsed reactor of periodic operation, which replaced the IBR-2 reactor after its service life expired, has been in operation as a JINR basic facility since 2012. The reactor generates powerful neutron pulses 200 μs wide with a frequency of 5 Hz at an average power of 2 MW. During the operation of the IBR-2M, low-frequency power oscillations are amplified. Studies are regularly carried out to substantiate the safety of the reactor. The results of these studies have made it possible to understand the physics of periodic pulsed reactors more fully, while also indicating ways to improve safety in the future. A model of the dynamics of the pulsed reactor as a discrete (pulsed) automatic control system has been created. The parameters of the fast power feedback included in the dynamics model were determined experimentally. It turned out that these parameters determine the power oscillations of the reactor. It is shown that degradation changes in the IBR-2M core lead to a weakening of the power feedback and the appearance of oscillations. It is also shown that the power oscillations depend significantly on the energy output and the average power level.
Event reconstruction in the SPD (Spin Physics Detector) experiment in the NICA mega-science project presents a significant challenge of processing a high data flow with limited valuable events. To address this, we propose novel approaches for unraveling time slices. With a data rate of 20 GB/sec and a pileup of about 40 events per time slice, our methods focus on efficient event reconstruction and selection. We explore predictive vertex clustering, utilizing hits from reconstructed tracks to predict vertices using Gradient Boosting methods for subsequent clustering. Additionally, we develop a triplet siamese network that generates track feature vectors for effective clustering. Furthermore, we introduce a novel technique for evaluating event separation quality, enabling a comprehensive assessment of each unraveling approach. Our research contributes to the real-time analysis and event selection in the SPD experiment and provides insights into the challenges and opportunities of time-slice unraveling in high-intensity physics experiments.
In accordance with the technical design report, the SPD detector, which is being built at the NICA collider at JINR, will produce trillions of physical events per year, estimated at dozens of petabytes of data, which puts it on a par with experiments at the Large Hadron Collider. Although the physical facility is under construction, these figures must be taken into account already now, at the design stage of the offline data storage and processing system. Besides that, as the design of the subdetectors and subsystems evolves, the applied software of the experiment requires ever more computing resources to perform Monte Carlo simulations of the future facility. In modern physics research, even modelling a small subsystem may require significant computational resources and is organized as a set of sequential processing steps that form a data processing chain. Although the facility itself has not yet been built, the needs of the physics groups for computing power to carry out such calculations are constantly growing and already occupy a fairly significant amount of processor time on the computing resources of JINR. In fact, this form of processing is a user or group processing sequence within which tens of thousands of intermediate and final files can be produced, which means that both task and data management are required. This report presents the status of work on building the data storage and processing management system of the SPD experiment.
Particle tracking is critical in high-energy physics experiments, but traditional methods like the Kalman filter cannot handle the massive amounts of data generated by modern experiments. This is where deep learning comes in, providing a significant boost in efficiency and tracking accuracy.
A new experiment called the SPD is planned for the NICA collider, which is currently under construction at JINR. The SPD is expected to generate enormous amounts of data at 20 GB/s or 200 PB/year, so researchers are exploring the use of transformer-based architectures like the Perceiver for tracking. By leveraging the attention mechanism to incorporate particle interactions, the Perceiver shows tremendous potential for tracking in high-luminosity experiments like SPD.
The NICA megascience project sets a high bar for computing resources and for data storage and processing systems. When performing calculations, members of the MPD, BM@N and SPD collaborations actively use various JINR computing resources: the MICC Tier-2, the Govorun supercomputer, the JINR Cloud and the NCX computing cluster. Calculations follow the classical hierarchy of data storage and processing systems: the distributed file system EOS is used as the "cold" data storage system, while the NFS/ZFS and Lustre file systems attached to the computing resources serve as the "warm" and "hot" processing systems. The search for free computing resources forces collaboration members to constantly move their "hot" data from one computing resource to another.
Within this work, a solution based on the Lustre file system is proposed for the fast copying of "hot" data between the Govorun supercomputer and the NCX computing cluster. The developed architecture includes the following data processing segments: a local file system for the Govorun supercomputer, a local file system for the NCX computing cluster, and a segment with a mirrored file system available on both computing resources. This solution will substantially reduce the users' effort and the time needed to move "hot" data. The work also considers fault-tolerance modes based on Lustre file system components and on the Heartbeat and DRBD services.
Modern scientific research cannot exist without large computing systems capable of storing large volumes of data and processing them within relatively short times. Such systems include distributed centres for data acquisition, storage and processing (distributed data centres, DDCs).
Distributed systems have a complex structure and include many diverse components; therefore, designing, supporting and developing such data acquisition, storage and processing centres requires a tool that makes it possible to study their efficiency and reliability, test various scaling scenarios, and find the amount of resources needed to solve specific tasks. Various modelling tools could serve this purpose, but they have a number of shortcomings.
The report describes the possibilities of using digital twins (DTs) in building and upgrading DDCs. It demonstrates the method for building digital twins of DDCs developed by the authors, on the basis of which special software is being created at the Meshcheryakov Laboratory of Information Technologies of the Joint Institute for Nuclear Research. The core of the system is a program that allows modelling a DDC, taking into account the processes occurring in the system as well as the requirements for the flows of stored data and for the flows of tasks processing these data. In addition, the software complex includes a database and a web service.
Much attention in the report is given to a prototype of the software complex, which was successfully tried out in building a digital twin of the distributed computing infrastructure of the BM@N experiment of the NICA project.
The work is supported by the JINR grant for young scientists No. 23-602-03.
Extensive studies in the field of high-temperature plasma and controlled thermonuclear fusion started in the 1950s. Their main goal was the creation of a power source running on the relatively cheap hydrogen isotope deuterium, heated up to hundreds of millions of degrees under conditions in which a thermonuclear reaction can be obtained.
In the beginning the idea of a thermonuclear reactor looked rather simple, but it turned out to be so complicated that only in the 2010s did the construction of the International Thermonuclear Experimental Reactor (ITER) start in Cadarache (first plasma in 2028). Over the previous 70 years, physical and technical fusion databases were created, closely coupled with solid-state physics, magnetohydrodynamics, etc. This is the labor of thousands of physicists, engineers and inventors. At present, a huge amount of fusion knowledge is accumulated in Russian scientific research centres and universities.
The purpose of creating FusionSpace.ru in Russia is:
• delivery of instruments and services for joint research in fusion,
• access to accumulated knowledge,
• modern, reliable and comfortable access to scientific results in Russia and, through the ITER project, access to the international fusion community.
The report presents the results of the first stage of commissioning the FusionSpace.ru prototype for fusion research in Russia. It is shown that the FusionSpace.ru concept makes it possible to integrate Russian and international fusion research knowledge. Russian and international experience in building fusion IT infrastructure is described, including remote experiments on JET, WEST, DIII-D and ITER (through the Russian RPC and REC). The report also presents the results of remote participation in the ITER project within the framework of the Remote Participation Center.
The report is of interest to physicists and engineers working at physics facilities involving Big Data processing.
Work done under contract № Н.4к.241.09.23.1036 dated 22.03.2023 and contract № Н.4а.241.19.23.1014 dated 18.01.2023.
Keywords: Controlled Fusion, Tokamak, ITER, distributed research, digital platform.
In the BM@N experiment, a xenon heavy ion beam with an energy of 2.7 GeV/nucleon interacts with a cesium target, generating many secondary particles: π, μ, p, n, γ, e, d, α, K, etc. After computer processing of the data from the detectors used in the experiment, we obtain a series of images of the tracks of emerging particles. We processed four of them using the Gwyddion program and calculated the deviation δ of the set of tracks from a fractal. The values were δ1 = 0.01738, δ2 = 0.01574, δ3 = 0.01862, δ4 = 0.01574, i.e. less than 2%. Since all the values of δ turned out to be sufficiently small, the structure of the set of tracks of produced particles is close to fractal. For all four patterns, the fractal parameters Sf, D, Tf of the set of tracks of produced particles were determined. Based on these data, diagrams of fractal states Sf(Tf) were constructed. The indices of the fractal equations of state (FOS) of the studied sets were calculated in the mathematical model of fractal thermodynamics [1].
We present our polymorphic non-abelian package of 3D vectors and matrices for high-speed algorithms intended for trigger applications in particle physics. The package is part of our "Math-on-Paper" C++ concept of fielding solutions that are as close as possible in code to the actual on-paper scientific computations, given that it is often nearly impossible to bring paper equations into actual code. CPU performance and polymorphic type calculations, in an SFINAE context, are presented for a set of example applications in particle physics: tracking and vertexing.
The presentation is devoted to the creation and development of the computing centre of the SAPHIR institute (Millennium Institute for Subatomic Physics at the High Energy Frontier, Santiago, Chile).
This study is devoted to modelling the propagation of elastic waves in a heterogeneous medium with explicit account of inhomogeneities. To this end, an algorithm based on the grid-characteristic method with overset grids was implemented. The proposed algorithm was parallelized in a distributed cluster environment using MPI technology. The results show that the proposed approach makes it possible to significantly speed up the modelling of elastic wave propagation in inhomogeneous media.
The work was supported by the Russian Science Foundation (project No. 21-11-00139).
The task of fluid simulation is computationally difficult, both in terms of the required computational costs and in terms of representing the system with a large number of particles. This study considers various methods for solving this problem, such as the use of parallel computing, rendering optimization, and optimizing information transfer between the CPU and GPU. The work was conducted in the Unity environment. The study explores technologies such as GPU Instancing, Unity DOTS, C# Job System, Burst Compiler, Entity Component System, Shaders, Compute Shaders, and CUDA. A comparison and selection of features and areas of application for each technology are carried out. An example of the utilization of these tools in implementing the incompressible Schrödinger flow method is provided.
The numerical solution of seismic exploration problems plays an important role in the oil and gas industry, helping to detect oil- and gas-bearing strata and to optimize drilling and production processes.
Accounting for topography is an important aspect of seismic exploration, since the shape of the Earth's surface can significantly affect the propagation of seismic waves and, consequently, the acquired data. For example, mountain ridges or valleys in the survey area can cause wave reflections from these geological structures, complicating the interpretation of results. In addition, accounting for topography makes it possible to determine the depth of geological formations more accurately and to produce more precise maps of subsurface structures. Overall, accounting for topography improves the quality and accuracy of seismic exploration, which is of great importance for industrial geophysics and geology.
This work considers an approach to modelling the propagation of seismic disturbances with account of the surface relief by the grid-characteristic method using overset grids. The developed algorithm is parallelized using MPI and OpenMP technologies. The algorithms were tested and verified, and speedup tests of the parallel algorithm are presented.
The work was supported by the Russian Science Foundation (project No. 21-11-00139).
Detailed theoretical studies of deposition processes, including the interaction of nanoparticles with a substrate, are of particular practical interest. The technology under consideration is used in such critical areas as microelectronics, the creation of protective coatings and new medical materials. Mathematical modeling of each level of this process makes it possible to effectively select the operating modes of deposition installations, as well as significantly reduce the number of full-scale experiments needed to obtain coatings with the required physical properties. The applied mathematical models in this case often span several dimensional levels and involve the solution of related problems by various mesh and meshfree methods. This significantly increases the number of technical applications involved, which complicates their configuration and launch and raises the competencies required of the end user. Today, to simplify the conduct of complex computational experiments, the development and use of web laboratories is widespread, allowing the researcher to implement the entire computational cycle through a unified graphical user interface available on the Internet. The report proposes the implementation of such a digital platform based on a client-server architecture using a reactive approach to generating the graphical interface. The main feature of the developed web laboratory is the ability to dynamically add applied services and computing resources to the software environment for their subsequent use in computational experiments.
The work was carried out with the support of the Russian Science Foundation, project No. 21-71-20054.
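As a toy illustration of the dynamic-registration idea described in the abstract above (all names here are ours, not the project's API), an applied service can be added to the environment at run time and then invoked through the same unified interface:

from typing import Callable

class WebLab:
    """Minimal catalogue of applied services pluggable at run time."""
    def __init__(self):
        self.services: dict[str, Callable] = {}

    def register(self, name: str, runner: Callable) -> None:
        self.services[name] = runner   # e.g. a deposition-solver wrapper

    def run(self, name: str, **params):
        return self.services[name](**params)

lab = WebLab()
lab.register("deposition_md", lambda T, steps: f"MD at {T} K for {steps} steps")
print(lab.run("deposition_md", T=300, steps=10_000))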
We present a new conservative scheme for computation of the Boltzmann collision integral for binary and triple processes in relativistic plasma based on direct integration of exact quantum electrodynamical matrix elements. Parallel evaluation of collision integral is done within the framework of general-purpose computing on graphics processing units (GPGPU). This approach is important for kinetic and emission processes in high energy astrophysical environments.
A description is given of the technology for assembling modules in programming languages by means of the APROP system, created under a state contract with NICEVT (1975-1976) under the leadership of Academician V.M. Glushkov for the ES EVM operating system. APROP was deposited in the State Fund. From 1977 APROP was applied in the defence industry (MNIIPA, V.V. Lipaev) under a contract running from 1978 to 1985 for the implementation of the Prometey, Yauza and Ruza software complexes and of onboard instruments for aviation, space and the navy. The wide application of the module assembly programming method in our country and abroad is noted. A second variant of the method is defined for assembling information, technical, computer, operational and intellectual resources in a class of subject knowledge domains (physics, mathematics, medicine, biology) using e-science tools (GRID, ETICS, Semantic Web). Lectures and reports are presented on the assembly of systems and their families from objects, components, services and reuses in the W3C environment. New books and textbooks on aspects of assembly technology 2 are shown, along with examples of implementing subject knowledge domains with the participation of specialists from MIPT and ISP.
Bioinformatics is the area that develops methods and software tools for understanding biological data, which includes sequence analysis, gene and protein expression, analysis of cellular organization, structural bioinformatics, data centers, etc. A new and more general direction is to consider bioinformatics as informatics on the basis of nanobioelectronics and biocomputer technologies.
The DNA molecule is an important example of data storage and biocomputing. By performing millions of operations simultaneously, a DNA biocomputer allows the performance rate to increase exponentially. The limiting problem is that each stage of parallel operations requires time measured in hours or days. Nanobioelectronics can overcome this problem [1]-[3].
The central problem of nanobioelectronics is the realization of effective charge transfer in biomacromolecules. The most promising molecule for this goal is DNA. Computer simulation of charge transfer can substitute for a natural experiment in such a complex object as DNA. Such charge transport processes as Bloch oscillations, soliton evolution, polaron dynamics, breather creation and breather-inspired charge transfer are modeled. The supercomputer simulation of charge dynamics at finite temperatures is presented. Different molecular devices based on DNA are considered. These form the basis for the solution of informatics problems by biomolecular technologies.
References
[1] V.D. Lakhno, DNA Nanobioelectronics, Int. J. Quantum Chem, v.108, p. 1970-1981, 2008
[2] V.D. Lakhno, Theoretical basis of Nanobioelectronics, EPJ Web of Conferences, 226, 01008, 2020
[3] V.D. Lakhno, A.V. Vinnikov, Molecular devices based on DNA, MBB, v. 16, p. 115-135, 2021
In the formulation and design of a computational experiment, topical issues include the interactive control of resource-intensive algorithms, with the possibility of dynamically re-tuning the hydromechanics models under independent visual control of three-dimensional physical phenomena and processes in real time. A direct computational experiment makes it possible to reach practical engineering solutions [1] without traditional analytical restrictions, with only the size of the spatial meshes and the approximation accuracy of the studied engineering objects acting as optimization constraints.
The architecture of modern multiprocessor computing complexes provides a set of interval timers with a real-time clock, which make it possible to organize the parallel execution of mathematical modelling algorithms without involving any external infrastructure [2], apart from a timer that suspends the computations to synchronize them with real time. During such a pause, all visualization procedures are executed and requests from external devices, the graphics terminal, the keyboard and the cursor are serviced. If the arithmetic-logic core of the computer is overloaded, the computations fall out of sync with real time; however, the experiment itself and the visualization of results continue, supporting both the interactive interface and the logging of results in time counts of the physical process.
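A minimal Python sketch of the time-quantization scheme just described (the function names and the 20 ms quantum are ours): the model advances by a fixed physical-time quantum, and visualization plus peripheral servicing run in the slack before the next real-time tick.

import time

DT = 0.02  # physical-time quantum per step, seconds (illustrative)

def run(steps, advance_model, redraw):
    """Alternate model steps with visualization, synchronized to the
    real-time clock; on overload the loop falls behind real time but
    the experiment and its time-stamped logging continue."""
    next_tick = time.monotonic()
    for _ in range(steps):
        advance_model(DT)          # one quantum of mathematical modelling
        next_tick += DT
        slack = next_tick - time.monotonic()
        if slack > 0:
            redraw()               # visualization and I/O in the pause
            time.sleep(slack)

run(5, lambda dt: None, lambda: None)  # smoke test with stub callbacks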
Two variants of implementing the computational experiment are considered: one with full parallelization, achieving independence of the modelling processes proper from the interactive and graphical visualization [3]; in the other variant, the time of mathematical modelling and of the graphical visualization algorithms is quantized in turn, with control interrupts from the computer peripherals. As applied algorithmic problems, software complexes are being developed for modelling ship manoeuvring in storms, as well as continuum-corpuscular computational experiments for analysing processes and phenomena in the spatial interaction of polarized particles, in their group associations or in conventionally close interactions in the vicinity of spatial mesh nodes.
This work is partially supported by St. Petersburg State University, project ID: 94062114.
References.
1. Bogdanov A.V., Degtyarev A.B., Khramushin V.N. Three-dimensional tensor mathematics of computational experiments in hydromechanics // Computational technologies in natural sciences. Methods of supercomputer modelling. Series "Mechanics, Control, Informatics". Part 3. Proceedings of IKI RAN, 17-19 November 2015, Tarusa, Russia. pp. 34-48.
2. Khramushin V.N. "Context graphics" (Window-Place): a context-dependent environment for building three-dimensional OpenGL graphics using C++ virtual procedures and a multi-window Windows interface with stacked overlay of graphic and text fragments. Rospatent. SakhGU No. 2010615850, 2010-09-08.
3. Khramushin V.N. "Tensor": a program for building numerical objects and functions of three-dimensional tensor mathematics in computational experiments in hydromechanics. Rospatent. SPbGU No. 2013619727, 14 October 2013.
Recently, the branch of mathematics associated with functional integration has been developing rapidly. For a long time it served as a means of constructing perturbation theory and solving applied problems. Recently, however, it has become clear that it can be a very effective tool for creating high-performance algorithms. Moreover, it may be the only tool for developing algorithms for quantum computing emulators. The report discusses in detail solutions of partial differential equations, with particular emphasis on the representation at the so-called "intermediate point". It turns out that this kind of representation opens the way to absolutely parallel algorithms. The analysis of intermediate results based on catastrophe theory makes it possible to interpret the results obtained in the spirit of the qualitative theory of differential equations.
Interaction of scientific and educational organizations in training specialists to solve research problems
Topics for discussion
- Organization of networked educational programmes in partnership with JINR.
- Creation of networked schools in the field of information technologies, high energy physics and megascience-class projects.
Participants
- V.V. Korenkov, S.V. Shmatov, S.G. Arutyunyan, Yu.L. Kalinovsky, D.I. Pryakhina, O.I. Streltsova, O.Yu. Derenovskaya, M.I. Zuev (MLIT JINR)
- D.V. Kamanin, A.V. Verkheev (JINR University Centre)
- A.S. Denikin, E.N. Cheremisina, O.V. Anisimova, E.Yu. Kirpichyova, S.V. Potyomkina, O.I. Piskunova, M.A. Belov (Dubna University)
- Yu.V. Chemarina, A.N. Tsirulev, I.A. Shapovalova, V.P. Tsvetkov, I.V. Tsvetkov (TvSU)
- A.B. Degtyarev, A.V. Bogdanov, N.L. Shchegoleva (SPbSU)
- A.V. Taranenko (MEPhI, VBLHEP JINR)
- V.A. Sukhomlin (MSU)
- L.A. Sevastyanov, K.Yu. Malyshev (RUDN)
Everyone is welcome to take part in the Round Table!
Connection link - https://jinr.webinar.ru/28441625/1814740842
The article describes the concept and main characteristics of the Master's programme "Cybersecurity" developed by the Faculty of Computational Mathematics and Cybernetics of Lomonosov Moscow State University together with the Cybersecurity Department of Sberbank. The goals, main design principles, architecture of the programme's body of knowledge, its principal features, professional competencies as expected learning outcomes, and the set of disciplines are considered. The Master's programme "Cybersecurity" is intended for those who want to gain deep knowledge and skills in information security and in protecting information and data from cyberattacks. The programme is aimed at training Masters of Science in cybersecurity. It has been developed in accordance with modern international professional and educational standards and with account of current national standards and norms.
Artificial Neural Networks in High Energy Physics data processing (succinct survey) and probable future development
Abstract
The rising role of Artificial Neural Networks (ANN) as part of machine learning/deep learning (ML/DL) in High Energy Physics (HEP) and related areas has been evident over recent decades, and several reasons for this rise have been observed. A brief comparison of ANN results with those of known rules-based data analysis is presented. Importantly, ANN usage practice has many peculiarities, including the preparation of data for training the model implemented by the ANN, testing the model, finding and comparing several models, the choice of activation function, the choice of loss function, etc. The number of ANN models with their variants can be estimated in the hundreds. The most popular model architectures are briefly described. Connected topics arise in this context: the exchange of models between researchers, the use of already trained ANNs, the automation of the ANN development process, etc. Among the main ANN problems, the speed-up of training and the interpretability of ANN results stand out. Many theoretical aspects of ANNs have already been explained, but presumably a lot of theory has yet to be developed.
Funding agencies, mainly governmental and partly private, have supported the trend towards open access to experimental data in research laboratories over the past decade. This trend has been driven by several factors: the increasing importance of data sharing and collaboration in scientific research, rapid progress in ANN design, and the development of new technologies and platforms for data sharing and analysis. It has been recognized that data are easier to use with ANNs if they satisfy the Findable, Accessible, Interoperable, Reusable (FAIR) principles. A lot of new experimental data is expected, requiring ANN analysis, from already running experiments and/or from those to be launched in the coming years. Naturally, new experimental data will require new, larger ANN architectures. Known large-scale general-purpose ANNs, the so-called "foundation models", show both the benefits and the risks of using such models.
Finally, the idea of developing a large-scale ANN "foundation model" dedicated to HEP and related areas is suggested. Such an ANN can presumably be trained on scientific data distributed across a variety of physics experiments; it is assumed that those data have to satisfy the FAIR principles. The trained ANN can then be used for deep, extensive data analysis. The possible synergetic effects of the above ANN "foundation model" supported by advanced computing tools are briefly described.
Machine learning systems are today the main examples of the use of Artificial Intelligence in a wide variety of areas. From a practical point of view, machine learning is practically synonymous with the concept of Artificial Intelligence. Some works narrow this definition somewhat and speak only of artificial neural networks and deep learning in the context of artificial intelligence, but this does not change the essence of the matter. There is the concept of so-called strong artificial intelligence (Strong AI, full AI and AGI are all synonyms), but it is still far from practical use. Accordingly, in practice we must focus on the current architectures of machine learning systems and on existing machine learning models and schemes for their implementation. Today, artificial intelligence (machine learning) applications are used in a wide variety of fields, and the spread of machine learning technologies leads to the need to apply them in so-called critical areas: avionics, nuclear energy, automatic driving, etc. Traditional software, for example in avionics, undergoes special certification procedures. These ad hoc testing procedures cannot be directly transferred to machine learning models. The article discusses approaches to the certification of machine learning models.
This report investigates the problem of detecting objects of various sizes, using an open dataset and the Yolo v5 neural network model. The main attention is paid to studying the influence of image pre-filtering on object detection results and to developing a methodology for assessing this influence. In addition, the work assesses, by the proposed methodology, the influence of filtering rain and snow distortions on the object detection results. The results obtained can be useful for improving the accuracy of object detection in images with various types of distortions and can be applied in various areas, such as autonomous driving, transport monitoring, etc.
This report describes a method of image mixing based on the interpolation of latent-space features during the generation process of diffusion models. The main attention is paid to the implementation details and to studying the influence of the generation parameters on the creation of the final image. Alternative methods of solving the problem are also reviewed and compared with the proposed one. The results obtained can be used to study the internal features of image generation algorithms.
Recently, the landscape of computational infrastructure has been changing dramatically under the pressure of application requirements. The properties of modern applications can be summarized as follows: they are distributed, self-sufficient, work in real time, elastic, cross-platform, actively interact and synchronize, and are easy to update. The definitions of these terms are given in [1]. For further understanding, it is important to recognize that an application is made up of interrelated components, which we will refer to as application functions. The analysis of the requirements of modern applications to the computational infrastructure presented in [1] shows the trend of ubiquitous application deployment. We are moving into an era when data processing resources and data transmission resources form a single space for computing - the computational infrastructure. In other words, the time has come to implement the slogan "The Network is the Computer". Further on we will call such a computational infrastructure Network Powered by Computing (NPC).
Several versions of a functional architecture for such a new generation of computational infrastructure have been proposed. Briefly, the NPC functional architecture presented in [1] can be described as follows (see figure 1). It consists of the data processing (DP) plane, the data transmission (DT) plane, the data processing control (DPC) plane, the data transmission control (DTC) plane, and the administration, orchestration and management (AOM) plane. The DP plane covers all computational resources of the NPC. The DT plane is an overlay network over the underlying physical network; in fact, the data transmission plane is the data transmission network (DTN). The DPC plane is responsible for preparing the application for execution, planning the placement of application components, calculating the quality of service (QoS) requirements based on the service level agreement (SLA) specified by the user, and generating DTN control plane instructions for setting up overlay tunnels in accordance with the application function interaction topology. The DTC plane is responsible for the control and monitoring of the DTN. The AOM plane orchestrates interactions between application components in accordance with the application topology, collects NPC resource consumption statistics for every application component, and secures the management and administration of the NPC.
In [2] a new method was proposed for optimal traffic routing in the overlay DTN of an NPC based on decentralized multi-agent reinforcement learning (MARL) with hashing (MAROH). The cited paper considers three approaches to MARL optimization: centralized, decentralized with communication, and fully decentralized. In all these approaches it is assumed that agents construct their local state based on observations of the environment; the agents' behavior in these approaches differs.
The main problems of multi-agent methods for traffic control are poor scalability; the absence of mathematical models that guarantee convergence to the optimal solution; the difficulty of mathematically framing the optimization functional; and the unknown extent of deviation from the optimal solution. It is shown that the newly proposed method overcomes the problems listed above. The MAROH method can be used for optimal traffic engineering and is also applicable in traditional data networks.
The experiments (see figure 2) show that traditional load balancing approaches like Equal Cost Multi-Path (ECMP [3]) or Unequal Cost Multi-Path (UCMP [4]) are ineffective in the NPC environment. ECMP assigns the same weight to each possible path to the same destination and balances flows or packets among these paths evenly. As NPC overlay channels may go through different Internet Service Providers (ISPs), the assumption that these channels are similar is wrong. Only flow load balancing is considered here, because packet load balancing disturbs the operation of the congestion control algorithm, and multi-path transport protocols are out of the scope of this work. UCMP allows assigning to each path a weight according to its available resources (e.g. bandwidth) and balances flows in the same ratio. However, UCMP does not provide a coordinated choice of weights, which means that NPCRs may overload some channel by simultaneously forwarding flows to it. These shortcomings lead us to consider multi-agent machine learning approaches.
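For illustration, a minimal sketch of the flow-level hashing that ECMP/UCMP-style balancing performs (our simplification, not a router implementation): equal weights reproduce ECMP, unequal static weights give UCMP, and in neither case are the weights coordinated across nodes.

import hashlib

def pick_path(flow, paths, weights=None):
    """Hash a flow 5-tuple onto a weighted path table (flow-level only)."""
    weights = weights or [1] * len(paths)         # equal weights = ECMP
    table = [p for p, w in zip(paths, weights) for _ in range(w)]
    h = int(hashlib.sha256(repr(flow).encode()).hexdigest(), 16)
    return table[h % len(table)]

flow = ("10.0.0.1", "10.0.0.2", 51234, 443, "TCP")
print(pick_path(flow, ["isp_A", "isp_B"]))          # ECMP
print(pick_path(flow, ["isp_A", "isp_B"], [3, 1]))  # UCMP, 3:1 bandwidth ratio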
In this paper the MAROH method is briefly presented together with the NPC architecture. The main contribution of the paper is the comparison of ECMP, UCMP and MAROH.
[1] Smeliansky R. Network Powered by Computing: Next Generation of Computational Infrastructure. In Edge Computing - Technology, Management and Integration. IntechOpen, 2023. DOI: 10.5772/intechopen.110178
[2] Stepanov E.P., Smeliansky R.L., Plakunov A.V., Borisov A.V., Xia Zhu, Jianing Pei, Zhen Yao. On Fair Traffic Allocation and Efficient Utilization of Network Resources Based on MARL (preliminary on ResearchGate)
[3] ECMP Load Balancing https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/mp_l3_vpns/configuration/xe-3s/asr903/mp-l3-vpns-xe-3s-asr903-book/mp-l3-vpns-xe-3s-asr903-book_chapter_0100.pdf
[4] UCMP Load Balancing https://www.cisco.com/c/en/us/td/docs/ios-xml/ios/mp_l3_vpns/configuration/xe-3s/asr903/17-1-1/b-mpls-l3-vpns-xe-17-1-asr900/m-ucmp.pdf
We are investigating the quantum dynamics of a well-collimated electron beam transmitting through planar channels of the Si crystal. Electron states were represented by wave packets, while the electron beam was treated as an ensemble of noninteracting wave packets. The evolution of electron states was obtained using the method of Chebyshev global propagation, specifically modified to give complex wave functions at arbitrarily chosen time instances without compromising the accuracy of the time propagation. The evolution of the ensemble in the configuration and the phase space was obtained by numerical simulation. We have analyzed how electron dynamics depend on the initial mean position and angular divergence. We have also investigated the relationship between the classical caustic pattern and the shape of the electron Wigner function.
The obtained quantum probability densities have multiple maxima generated by the electron's self-interference. Their sum, which represents the ensemble's probability density, was found to depend strongly on the beam angular divergence. For small divergence, most peaks are aligned, causing the wavelike behavior of the ensemble. For moderate divergence, the maxima of some are aligned with the minima of others, resulting in the emergence of the classical caustic pattern.
Keywords: structural stability, planar channeling, rainbow scattering, classical-quantum correspondence
The work presents the results of applying soft and quantum computing technologies to the problems of learning, adaptation and self-organization of an intelligent control system for pressure stabilization in a nitrogen cryogenic installation at the magnet factory of VBLHEP JINR. The operation of the system is compared for different types of control models: a PID controller, a PID controller tuned with a genetic algorithm, and upper-level controllers based on fuzzy neural networks and quantum algorithms. The effectiveness of applying end-to-end quantum information technologies to the control of weakly structured and poorly formalized control objects is shown.
One of the application areas of artificial intelligence technologies is the stabilization problem in the control of technical systems, including industrial-class systems.
The work presents the results of a study on the application of evolutionary and adaptive algorithms in intelligent control systems for pressure stabilization in a nitrogen cryogenic installation at the magnet factory of VBLHEP JINR. Different types of control models are compared. A method for choosing the optimal trajectories of the PID controller gain coefficients is presented. The effectiveness of end-to-end information technologies based on soft computing in control problems is shown.
Quantum computers have the potential to solve problems that are computationally hard for some classical algorithms. However, creating physical quantum devices with a large number of qubits and high stability remains a difficult task at present. The development and debugging of quantum algorithms on simulators with a classical architecture can be used not only for quickly testing hypotheses before running them on quantum devices, but also for solving real problems. The report describes the advantages and limitations of simulators with a classical architecture [1] and reviews various tools for creating and debugging quantum algorithms on classical computers.
One variant of effective modelling is described for the problem of intelligent control of the nitrogen cooling of superconducting magnets for one of the systems of the NICA accelerator complex, based on scaling the solutions previously obtained at the magnet factory [2].
References
1. P.V. Zrelov, O.V. Ivantsova, V.V. Korenkov, N.V. Ryabov, S.V. Ulyanov. Evaluation of the capabilities of classical computers in implementing simulators of quantum algorithms // Software Products and Systems. 2022. No. 4. pp. 618-630. DOI: 10.15827/0236-235X.140.618-630.
In the first part of the report, we examine control systems with constant coefficients of a conventional PID controller (tuned by a genetic algorithm) and intelligent control systems based on soft computing technologies. For demonstration, MatLab/Simulink models and a test benchmark of a robot manipulator are presented, and the advantages and limitations of intelligent control systems based on soft computing technology are discussed. The main intelligent element of a soft computing control system is a fuzzy controller with an embedded knowledge base. Two ways to implement fuzzy controllers are shown. The first way applies one controller to all links of the manipulator and shows the best performance; however, such an implementation is not possible for complex control objects, such as a manipulator with seven degrees of freedom (7DOF). The second way uses separated control, where an independent fuzzy controller controls each link. This control decomposition, at the cost of a slight decrease in control quality, greatly simplifies the creation and placement of knowledge bases.
In the second part of the report, to eliminate the mismatch between the separate independent fuzzy controllers, methods are described for organizing coordination control based on quantum computing technologies in order to create robust intelligent control systems for robotic manipulators with 3DOF and 7DOF. The quantum supremacy of the developed end-to-end IT design of robust intelligent control systems is demonstrated in simulation [2].
Databases of scientific publications currently number millions of articles, and the methods of searching them are continuously evolving: from traditional text search to systems that take into account additional bibliometric information (citation indices), semantic search, neural network models, and others. In particular, popular search engines for scientific literature are Google Scholar and Scopus, whose ranking algorithms not only perform full-text search but also take into account data on the citation of some articles by others [1, 2, 3]. There are also systems capable of analysing co-citation frequency and presenting the results as a graph of articles close in meaning (CoCites, Connected Papers) [4, 5].
To increase the precision of scientific literature search, it is important to use additional factors that reflect the key essence of the publication sought.
Within the study of new methods of scientific literature search, a system for the semantic search of scientific publications was developed on the basis of external citation information, using neural network models over large databases of scientific publications.
The full-text archive of scientific publications in biomedicine PubMed Central (PMC), comprising 7.6 million articles (9.1 TB), was chosen as the data source [6].
To increase the search precision, two approaches were combined:
Sentences containing brief descriptions of the main results of other articles and references to them were selected from the article texts. Such "essence mentions" were collected into a single dataset for subsequent search. As a result, 350,000 articles of the PMC open access data bank in XML format were processed using the Python lxml library, and more than 550,000 mentions of works were collected into a single dataset. An additional dataset with article meta-information (identifier, title, authors, abstract) was also assembled.
While working on the search system, the BERT neural network model [7] was fine-tuned for a multiclass classification task on a set of 10,000 citations using the transformers, torch, scikit-learn and pandas libraries [8-11]. The fine-tuning objective was to bring the vector representations of different mentions of the same work closer to each other in the vector space [12].
As a result, a search service was implemented on the basis of the Python flask library [13]. The database with information about article mentions was tokenized and fed to the fine-tuned BERT model, after which a tree of numerical mention vectors was built using the scikit-learn library. The service accepts queries containing the user's keywords and searches for the nearest neighbours in the built tree; the articles corresponding to the found mentions are returned as results. A web interface based on the React library was implemented to display the found articles [14].
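A condensed sketch of the described pipeline (the model name and mean pooling are stand-ins; the real service uses a fine-tuned BERT): embed citation mentions with transformers and serve nearest-neighbour queries with scikit-learn.

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neighbors import NearestNeighbors

tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # stand-in model
enc = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pooled BERT vectors for a batch of citation mentions."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

mentions = ["... showed BERT improves biomedical NER ...",
            "... reported a faster tridiagonal solver ..."]
index = NearestNeighbors(n_neighbors=1).fit(embed(mentions))
_, idx = index.kneighbors(embed(["biomedical named entity recognition"]))
print(mentions[idx[0][0]])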
According to the Food and Agriculture Organization, the world's food production needs to increase by 60-70 percent by 2050 to feed the growing population. However, the EU agricultural workforce has declined by 35% over the last decade, and 54% of agriculture companies have cited a shortage of staff as their main challenge. This, among other factors, has led to an increased interest in advanced technologies in agriculture, such as IoT, sensors, robots, drones, digitalization, artificial intelligence, and many more. Artificial intelligence (AI) and machine learning have proven valuable for many agriculture tasks, including problem detection, crop health monitoring, yield prediction, price forecasting, yield mapping, optimization of pesticide and fertilizer usage, etc. In this article, we explore various AI applications in agriculture and share our experience in this field.
Ensuring the confidentiality and protection of personal information in big data is an important aspect of data processing. One of the effective methods to achieve a high level of protection is data depersonalization. The article presents an overview of modern methods of protecting personal data in various kinds of research, in business analytics, etc. The influence of quasi-identifiers on the probability of re-identification is estimated. To reduce the probability of data re-identification, a hashing method is proposed based on the Keccak-256 hash function with the addition of a dynamic random string for each dataset element. This method significantly increases the cracking time and the amount of resources required by an attacker. The approach can be used for secure data transmission, exchange and storage.
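A minimal sketch of the proposed salted-hash step, assuming the pycryptodome Keccak implementation (the function name and the salt length are our choices, not the paper's):

import secrets
from Crypto.Hash import keccak  # pycryptodome

def pseudonymize(value: str) -> tuple[str, str]:
    """Keccak-256 of the element plus a per-element dynamic random
    string, so equal values no longer map to equal digests."""
    salt = secrets.token_hex(16)
    h = keccak.new(digest_bits=256)
    h.update((value + salt).encode())
    return h.hexdigest(), salt  # the salt is stored separately if needed

digest, salt = pseudonymize("passport 4012 345678")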
In the study of diseases of the elderly, five different types of instruments are used, none of which alone allows a reliable diagnosis. In addition, tests and examinations are carried out by a doctor who draws his own conclusion, and often the doctor's conclusion contradicts the data of computer diagnostics. In this communication, an attempt is made to construct a computer diagnostics system that solves a number of problems of previous approaches. First of all, this problem requires significant computing resources, which are not available in a conventional medical institution. Therefore, the first task solved at this stage was the anonymization of patient data and their transfer to a powerful server for further processing. One of the obstacles to establishing more or less reliable data on diseases is the large number of gaps, that is, the lack of data on one or another measurement for an individual patient. Therefore, the second task solved in our study is the filling of these gaps by means of a reasonable procedure and the assessment of the influence of this procedure on subsequent diagnostics. The third task we solved was the use of several diagnostic devices to detect the disease. For this purpose, statistical analysis is applied to data obtained from a number of measurements, including fMRI, EEG and others.
The training of teams and crews for the control and operation of complex technical objects implies training a large number of people at once, who must jointly solve the assigned tasks. Unlike a simple training system, such simulators are built on the basis of a distributed computing environment consisting of the workplaces of the trained crew members and a control server. When organizing simulators for individual subsystems operating under normal conditions, there are no problems with the design of the computing environment: experience shows that simplified mathematical models are sufficient for training purposes, and the standard bandwidth of network channels copes with the amount of data transmitted between the server and workstations. The situation changes for complex simulators, where multifunctional modelling of technological processes takes place and both an increase in the number of workplaces and synchronization of the various subsystems are required. At the same time, the demand for complex simulators of this type is becoming real. Especially important is the coordinated work of the various subsystems and their teams in extreme situations, which require the full commitment of the entire crew. Traditional schemes for organizing training simulators are no longer effective; the reasons lie both in the increasing complexity and heterogeneity of the mathematical models used and in the growth of the data transmitted between the nodes that support a large number of mathematical models.
A possible method for solving this problem is to scale up the distributed computing environment on which the full-featured simulator is based: more powerful computing nodes, replacement of the network infrastructure, etc. However, such an extensive development path must be recognized as economically unprofitable, both due to a significant increase in the price of the computing infrastructure and due to the inefficient use of individual nodes idling in normal operation modes. The paper proposes using the concept of a virtual private supercomputer for the flexible connection of individual subsystems of a complex simulator. Virtualization of computing resources, memory, network and storage makes it possible to bring together only those resources that are required at the moment. In this case, idle resources are effectively utilized and, as a result, a virtual SMP system is organized to solve a specific problem.
This research paper explores methods for balancing privacy and performance in distributed systems, specifically within multilayered architectures. It proposes a potential solution for secure data exchange on a hybrid blockchain platform, leveraging cryptographic tools to protect sensitive data while maintaining system functionality. The paper emphasizes the importance of considering both privacy and performance in distributed system design and implementation.
Abstract—This paper addresses the decentralization of a task management algorithm in a distributed environment. The main criteria for task management are presented, and various approaches to designing this algorithm are described. The author considers the architecture of a blockchain-based task management system using the Parity Substrate framework of the Polkadot ecosystem. During the design of the system architecture, the advantages and disadvantages of the described approaches were identified.
Keywords—blockchain, decentralization, distributed environment, Polkadot ecosystem, Parity Substrate framework, off-chain worker
The network infrastructure is an integral part of the major research infrastructure project "Multifunctional Information and Computing Complex (MICC) of JINR". The main goal is to provide a guaranteed and reliable traffic transmission that will fully meet the needs of scientific experiments. This presentation provides an overview of the local and external network infrastructure at JINR.
Providing a reliable Internet connection is the key to the success of any network. This paper considers questions of a highly reliable network topology for data transfer between nodes at JINR. The big challenge for the network service is to integrate the two grid sites, the Tier 1 and Tier 2 data centers, together with the backbone JINR LAN, and to upscale data rates to 100G, and in some cases up to x*100G. The network fabric built in 2013 for the Tier 1 and Tier 2 data centers using TRILL technology now requires modernization. A decision was made to gradually integrate it with the new fabric, built on Cisco Application Centric Infrastructure technology, which already integrates the JINR backbone with all laboratories and departments.
Great importance is attached to the monitoring system of the sites involved in processing data from experiments at the LHC. A detailed description is given of the creation of the monitoring service for the Tier 1 and Tier 2 grid data centers. Attention is also paid to the service for monitoring the physical devices of the backbone network. Then problems of network vulnerabilities are considered, and a plan to improve network security, which is currently being implemented, is given. The main purpose of the article is to demonstrate the complexity and urgency of correctly designing a network topology based on new data transfer protocols, taking into account all possible aspects of vulnerabilities.
Developing SSO and using it in JINR applications
One of the most important components of the LITMon monitoring system of the MICC at LIT JINR is the data storage system. Initially it was based on the RRD database and a special pnp4nagios plugin, support for which ended in 2022; required features no longer work. The RRD database is obsolete, has ceased to meet performance requirements, and has begun to consume more computing resources of the monitoring server in comparison with its analogues. Migrating the data to a database based on InfluxDB software will solve these problems.
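For illustration, writing one Nagios-style sample through the influxdb-client Python library; the URL, token, bucket and measurement names below are placeholders, not the actual LITMon configuration.

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="...", org="mlit")
write_api = client.write_api(write_options=SYNCHRONOUS)

sample = (Point("host_load")       # one performance-data sample
          .tag("host", "wn001")
          .field("load1", 0.42))
write_api.write(bucket="litmon", record=sample)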
One of the strategically important infrastructure projects, from the point of view of JINR's long-term scientific plan, is the NICA complex for spin physics with polarized beams, namely the Spin Physics Detector (SPD).
Since a data selection criterion cannot be built at the hardware level, the SPD detector is conceived as trigger-less. At the maximum collider luminosity this leads to a data flow from the registering systems of up to 20 GB/s. Taking into account the time limits of accelerator complex operation for SPD, the annual volume of data produced by the facility alone can be estimated at 200 PB. To reduce the data flow severalfold, a specialized hardware-software complex, the SPD On-Line filter, is being developed to provide multi-stage high-throughput processing of the acquired data.
This report presents the microservice architecture and the first prototypes of the data management system and the processing management system that form part of the SPD On-Line filter middleware, Visor, developed on a stack of modern technologies such as Python 3.11, FastAPI, Docker, PostgreSQL, RabbitMQ and others.
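A minimal FastAPI sketch in the spirit of the described microservices (the endpoint, payload and in-memory registry are hypothetical; the real Visor services sit on PostgreSQL and RabbitMQ):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Visor-style data management service (sketch)")

class DataBlock(BaseModel):     # hypothetical aggregated-data descriptor
    run: int
    file_url: str
    n_events: int

REGISTRY: list[DataBlock] = []  # stands in for the PostgreSQL backend

@app.post("/blocks")
def register_block(block: DataBlock):
    """Register an aggregated block so a processing stage can pick it up."""
    REGISTRY.append(block)
    return {"registered": len(REGISTRY)}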
One of the ways to provide a transport connection with the required level of Quality of Service (QoS) is properly configured congestion control algorithms (CCA) [1]. A CCA controls the size of the congestion window, which determines the transport connection speed. In this paper a CCA method has been developed that takes into account the forecast of the QoS parameters: the probability of packet loss (Loss), the round-trip time (RTT), and the available bandwidth (R). It is assumed that the channel quality forecast is provided for the whole duration of the transport connection.
Existing CCAs do not rely on a forecast. This might lead to choosing a non-optimal congestion window and, as a result, to higher RTT or under-utilization of the available bandwidth. Therefore, the aim of this work is to develop an algorithm that takes into account the channel quality forecast to provide lower latency and a higher sending speed. A method to detect an invalid channel quality forecast is also developed. A CCA based on the channel quality forecast can be used in a Network Powered by Computing (NPC) [2], in which the computing resources are connected by overlay channels. According to the NPC architecture, the channel quality is monitored periodically; this information can be used to configure the developed CCA. The CCA suggested in this work can also be used in other overlay networks, for example CDN [3] and CPN [4].
The main idea of the proposed approach is to choose the congestion window size by supervised machine learning (ML) methods. This task can be described as a regression problem whose aim is to predict the target feature from the training ones. In our problem the target feature is the size of the congestion window, and the training features are Loss, RTT and R. To choose the most suitable model for the regression problem, we analysed the prediction error of tree ensembles [5] and of methods based on linear and polynomial regression [6]. Based on the comparative analysis, we chose the most accurate regressor (let us call it $R_{use}$) from the CatBoost library [7], with an error of 3.2% in the MAPE metric, for use in the developed CCA.
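A schematic of this regression step, assuming the CatBoost Python API; the training data below are synthetic placeholders, not the paper's traces.

import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 3))              # columns: Loss, RTT, R (toy values)
y = 10_000 * X[:, 2] / (1 + X[:, 1])   # stand-in for observed best cwnd

model = CatBoostRegressor(loss_function="MAPE", verbose=False)
model.fit(X, y)                           # the R_use regressor
cwnd = model.predict([[0.01, 0.05, 0.8]]) # forecasted Loss, RTT, R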
The proposed algorithm, which we call BBR FORECAST ML, is based on BBRv2 [8]. We add a FORECAST state to BBRv2, in which CCA sets the congestion window size by $R_{use}$ regressor. The CCA stays in this state while the channel QoS forecast is correct. In case of forecast violation, the proposed CCA falls back to BBRv2.
The forecast check is based on comparing the channel parameters Loss, RTT, R with benchmark values. The benchmark loss and RTT are defined as the corresponding forecast values. The benchmark speed is chosen as the speed $R_{BBR}(Loss, R, RTT)$ that the transport connection can reach on the channel under the control of BBRv2. To estimate $R_{BBR}$ we solve a similar regression problem using gradient boosting and random forest methods [5]. The error of the $R_{BBR}$ speed calculation in the MAPE metric is 4.9%.
The experimental study of the developed BBR FORECAST ML algorithm was conducted using the QUIC protocol implementation ngtcp2 [9]. BBR FORECAST ML showed an RTT 1.163 times higher than the forecasted one, versus 1.622 times for CUBIC [10] and 1.477 times for BBRv2 [8]. The sending speed of the developed algorithm is on average 1.804 and 1.209 times higher than the speed of the CUBIC [10] and BBRv2 algorithms, respectively.
In addition, based on the results of the experimental study, the area of applicability of the proposed CCA was derived in the space of the variables RTT, Loss, R. The possible reasons for forecast violation outside this area are analysed. A mechanism for recognizing forecast violations is developed and analysed.
References:
[1] M. Allman, V. Paxson. Request for Comments: 5681 - TCP Congestion Control. September 2009.
[2] Smeliansky R. Network Powered by Computing // 2022 International Conference on Modern Network Technologies (MoNeTec). IEEE, 2022. pp. 1-5.
[3] Peng G. CDN: Content distribution network // arXiv preprint cs/0411069. 2004.
[4] Sun Y. et al. Computing power network: A survey // arXiv preprint arXiv:2210.06080. 2022.
[5] Banfield R. E. et al. A comparison of decision tree ensemble creation techniques // IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006. Vol. 29, No. 1. pp. 173-180.
[6] Montgomery D. C., Peck E. A., Vining G. G. Introduction to linear regression analysis. John Wiley & Sons, 2021.
[7] Hancock J. T., Khoshgoftaar T. M. CatBoost for big data: an interdisciplinary review // Journal of Big Data. 2020. Vol. 7, No. 1. pp. 1-45.
[8] Cardwell N. et al. BBRv2: A model-based congestion control performance optimization // Proc. IETF 106th Meeting. 2019. pp. 1-32.
[9] ngtcp2 2022, ngtcp2 GitHub website, accessed 20 May 2023, https://github.com/ngtcp2/ngtcp2/
[10] Rhee I. et al. RFC 8312: CUBIC for Fast Long-Distance Networks. 2018.
One of the key technical features of the SPD (Spin Physics Detector) facility is triggerless data taking, dictated by the complexity of the physics processes under study. The data acquisition system (DAQ) aggregates data from the detector subsystems and organizes them into blocks for subsequent primary processing. The total data rate after aggregation can reach 20 GB/s, and the annual volume of collected data will be measured in hundreds of petabytes. To identify and filter events in the data stream, a specialized computing system, the "SPD OnLine filter", is being created.
The SPD On-Line filter will be a hardware-software complex for high-throughput data processing. The hardware component will include a set of multi-core computing nodes built on modern technologies, high-performance data storage systems and a number of control servers. The software component includes not only specialized application software, but also the "SPD On-Line filter" middleware suite, "Visor", whose task is to organize and carry out multi-stage data processing.
This talk presents an overview of the architectures and prototypes of the following system components: the workload management system, which can be conventionally divided into a server part responsible for controlling the processing of datasets by generating a sufficient number of jobs, and an agent application that executes the jobs on the computing nodes.
In the scientific community, long-term planning is essential for identifying the most promising research directions decades ahead. One example of this approach is the SPD experiment for studying spin physics at the NICA collider under construction in Dubna. Like most modern physics experiments, SPD implies the generation of a large data stream, and a model for storing and processing these data has to be designed and implemented. At present, the main tool for organizing the storage of scientific data is the Rucio package, developed at CERN. Rucio is a data management system designed for the efficient handling of large data volumes in distributed scientific infrastructures. Rucio provides access to scientific data over a global network, enables the use of storage and processing resources on remote clusters, and makes it possible to manage data automatically using a variety of criteria, such as authentication and authorization, geographic location, data type and data access. With Rucio, scientific data become manageable, accessible and, most importantly, reproducible. The talk will address the integration of Rucio to support the data storage and processing tasks of the SPD experiment.
The SPD experiment is expected to accumulate up to a trillion events (records of collision results), whose storage and processing will require hundreds of petabytes of data. A comparable amount of simulated data is expected for use in various physics analyses. This information will be distributed among several data storage systems in computing centers. To avoid data loss and to increase performance, records belonging to the same event will be duplicated. Efficient access to all instances of events requires an information system, namely the SPD Event Index under development: a catalog of all events, recorded by the detector or simulated, permanently stored in all formats and versions. It will also provide tools for collecting event information, importing it into the storage and serving client requests through user interfaces and APIs. The development of the Event Index starts with the storage back end, currently a PostgreSQL DBMS. Simple interfaces have been created: a command-line client and a graphical web interface for event selection. A messaging service is being developed for asynchronous processing of requests for large data volumes. Studies are currently under way to increase the speed of data loading into the storage; for this purpose, a methodology and a software platform implementing a load-testing testbed are being developed.
In this work we consider the problem of optimally forming user groups for multicasting when users are served by unicast and multicast connections using multibeam antennas. We formulated this problem as a subclass of the Bin Packing Problem (BPP) and proposed an exact algorithm for the optimal partitioning of users that minimizes the bandwidth used. We also took into account constraints on quality-of-service indicators such as the signal-to-noise ratio (SNR), service delays and transmitted signal power. Because of the exponential time complexity of the resulting algorithm, we apply machine learning methods to solve the problem for a large number of users, using the exact solutions produced by the constructed algorithm as input data. The numerical experiment shows that for users at small radii the optimal strategy is to serve them with a single beam; for users at medium radii the best results are obtained with the proposed algorithm; and for distant users it is most advantageous to serve each user with a separate unicast connection.
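For context, a classic baseline for packing problems of this kind is the first-fit-decreasing heuristic sketched below; this is a generic BPP illustration, not the authors' exact algorithm or their ML model.

    # Generic first-fit-decreasing (FFD) heuristic for bin packing, shown only
    # to illustrate the problem class; the paper uses an exact algorithm + ML.
    def first_fit_decreasing(demands, capacity):
        """Pack items with given bandwidth demands into as few bins (beams)
        as the heuristic manages, returning (bin count, assignment)."""
        bins = []        # remaining capacity of each open bin
        assignment = []  # (item, bin index)
        for item in sorted(demands, reverse=True):
            for i, free in enumerate(bins):
                if item <= free:
                    bins[i] -= item
                    assignment.append((item, i))
                    break
            else:
                bins.append(capacity - item)
                assignment.append((item, len(bins) - 1))
        return len(bins), assignment

    n_beams, plan = first_fit_decreasing([0.4, 0.7, 0.2, 0.5, 0.3], capacity=1.0)
    print(n_beams, plan)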
One of the promising areas in the field of high-performance computing is co-scheduling, which makes it possible to schedule computational tasks so that several of them can run on a single node. The common approach of running one task per node does not utilize the resources of the computing system to the full extent. With a co-scheduling mechanism it is possible to increase the efficiency of the HPC system as a whole and to reduce its energy consumption.
In this work several tasks are completed. First, several scheduling strategies for executing computational tasks on an arbitrary number of nodes are introduced. Second, a scheduler and the proposed strategies are implemented using the Docker containerization mechanism and the Scala programming language. Third, a computational experiment is performed to compare the efficiency of the strategies: their execution time is measured on different combinations of tasks from the well-known NAS Parallel Benchmarks (NPB) suite.
The results of the computational experiment show that, under some assumptions, one of the proposed strategies performs better than the trivial one. Further development of this strategy and the scheduler may make it better than the trivial one overall.
As part of its participation in various experiments, JINR provides computing resources in the form of a batch cluster based on HTCondor, deployed as virtual machines in the JINR cloud. Since a batch system is a complex multi-component system, one of the key aspects of ensuring its uninterrupted operation is the constant monitoring of the state of its main components. The talk presents the developed monitoring system for the HTCondor cluster based on the Node Exporter, Prometheus and Grafana technology stack. The overall architecture of the monitoring system, the interaction of its subsystems and the additionally developed components are considered: a collector with parameterizable launch and dynamically generated dashboards for visualizing the collected data. The processes taking place in the system are described, from data collection to visualization. The developed components are open and published, which allows their free integration into third-party infrastructures.
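As a hedged illustration of such a collector, the sketch below exposes a custom metric with the Python prometheus_client library; the metric name and the way its value is obtained are hypothetical, not the components described in the talk.

    # Hypothetical custom exporter sketch using prometheus_client; the metric
    # and the data source are illustrative placeholders.
    import random
    import time

    from prometheus_client import Gauge, start_http_server

    RUNNING_JOBS = Gauge("htcondor_running_jobs",
                         "Number of running jobs reported by the batch system")

    def poll_batch_system() -> int:
        # Placeholder: a real collector would query HTCondor (e.g. condor_q).
        return random.randint(0, 100)

    if __name__ == "__main__":
        start_http_server(8000)          # Prometheus scrapes this port
        while True:
            RUNNING_JOBS.set(poll_batch_system())
            time.sleep(15)               # typical scrape interval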
The talk will cover the current status of work on the development of the BIOHLIT information system (IS), which is being created within a joint project between MLIT and LRB JINR. The system is designed to create a convenient environment for storing, processing and automating the analysis of data from experiments aimed at studying the radiobiological effects of exposure to ionizing radiation at the organismal, tissue and cellular levels. The investigation of behavioral responses of small laboratory animals is based on video data analysis, for which separate IS modules are being developed. The system comprises new web services for automating the analysis of behavioral test data at the Open Field setup and forming a dataset for the Morris Water Maze setup. Algorithmic blocks of the system are based on computer vision methods and the neural network approach.
Computer vision methods were used to develop algorithms for analyzing video data obtained during the Morris Water Maze behavioral test, which is used to assess memory function and spatial learning in small laboratory animals. The work was carried out within a joint project of MLIT and LRB JINR. For convenience and for verifying the correctness of the obtained animal movement trajectories, a web service was developed that automates the analysis of video data: it allows uploading experimental video files, analyzing the resulting movement trajectories and forming a labeled dataset. A web service module is being developed for classifying the movement trajectories of laboratory rodents (search strategies) using a neural network approach. For this purpose, an annotated dataset is being prepared, including the labeling of the Morris Water Maze setup field and the construction of trajectories. The development is based on the ML/DL/HPC ecosystem of the HybriLIT heterogeneous computing platform.
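A minimal sketch of trajectory extraction of this kind is shown below, assuming background subtraction and largest-contour centroid tracking with OpenCV; the file name and parameters are hypothetical, not the project's actual pipeline.

    # Hypothetical trajectory-extraction sketch with OpenCV: background
    # subtraction plus per-frame centroid of the largest moving contour.
    import cv2

    cap = cv2.VideoCapture("experiment.mp4")   # hypothetical input file
    subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=32)
    trajectory = []  # (frame index, x, y) centroids of the tracked animal

    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if contours:
            largest = max(contours, key=cv2.contourArea)
            m = cv2.moments(largest)
            if m["m00"] > 0:
                trajectory.append((frame_idx, m["m10"] / m["m00"],
                                   m["m01"] / m["m00"]))
        frame_idx += 1
    cap.release()
    print(len(trajectory), "trajectory points")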
The research computing infrastructure for working with experimental MRI/fMRI data of the brain of a human or a laboratory animal is described:
- System "Neurovisualization" of the IAP "Digital Laboratory", with the involvement of the supercomputer of the Research Center KI as a computing resource;
- Additional software services based on the IAP "Digital Laboratory" that implement new methods and algorithms for working with data;
- User web interface of IAP "Digital Laboratory";
- A system based on the HybriLIT platform (JINR), with the ability to connect via the web interface of the IAP "Digital Laboratory".
Reconstruction of neutron spectra over a wide energy range from $10^{-8}$ to $10^{3}$ MeV is highly relevant for ensuring radiation safety behind biological shields at high-energy accelerators and reactors. A Bonner multi-sphere spectrometer is used for the measurements. However, to unfold the entire spectrum from the measurement data, it is necessary to solve a Fredholm integral equation of the first kind. From the mathematical point of view this is an inverse problem, and it belongs to the class of ill-posed problems. In this work, we propose a numerical method for reconstructing neutron spectra based on a modified approach using their expansion either in terms of detector sensitivity functions or in terms of shifted Legendre polynomials. The approach is based on Tikhonov's regularization method, which is usually applied to the system of linear equations obtained by discretizing the system of integral equations on a grid. Based on the proposed approach, a computer code was developed. Neutron spectra were unfolded for several locations at the JINR accelerators and reactor and compared with spectra unfolded by a statistical regularization code.
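For reference, a minimal sketch of the underlying equations is given below in standard textbook form, with notation chosen here for illustration: the detector readings $m_i$ are related to the spectrum $\varphi(E)$ through the sensitivity functions $K_i(E)$, and Tikhonov regularization replaces the ill-posed inversion by a penalized least-squares problem.

    % Fredholm integral equation of the first kind (one equation per sphere i):
    m_i = \int_{E_{\min}}^{E_{\max}} K_i(E)\,\varphi(E)\,dE, \qquad i = 1,\dots,N.
    % After discretization, m = K\varphi; Tikhonov regularization minimizes
    \Phi_\alpha(\varphi) = \| K\varphi - m \|^2 + \alpha \|\varphi\|^2,
    % whose minimizer solves (K^T K + \alpha I)\,\varphi_\alpha = K^T m.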
The article proposes algorithms for the automatic diagnosis of human lung diseases, namely pneumonia and cancer, from radiographic images. The algorithms make it possible to make decisions with the required reliability, that is, with the probabilities of possible errors bounded by a pre-assigned level. The proposed algorithms have been tested on statistical simulations and real data, which fully confirmed the correctness of the theoretical reasoning and the ability to make decisions with the required reliability using artificial intelligence.
This talk is devoted to the problem of determining hand position from keypoints, studied on an open dataset using machine learning methods. The main attention is paid to engineering key features that make it possible to build an accurate and compact machine learning model. In addition, the work examines the effectiveness of various machine learning models. The results obtained can be useful for studying labor processes with fast movements over short time intervals in algorithms for recognizing manual technological operations in video data.
The talk presents a method developed by the authors for determining general motor coordination and the state of alcohol intoxication from the vibrometric sensors of a smartphone located in the upper front thigh area (in a pocket). The vibration signal coming from the device is analyzed in the time domain, and a number of statistical features are used as features of the machine learning model. Requirements for datasets sufficient for classifying human gait types are considered. The developed method is part of a research project on a complex for the centralized remote monitoring of the main health indicators of employees using artificial intelligence technologies. Within the trials of this complex, results are demonstrated for a trained model that is an ensemble of nearest-neighbor classifiers built on subsets of the feature set.
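A hedged sketch of such an ensemble is given below: bagging nearest-neighbor classifiers over random feature subsets with scikit-learn; the features and data are synthetic placeholders, not the authors' dataset.

    # Hypothetical sketch: an ensemble of k-NN classifiers over random subsets
    # of statistical features, standing in for the gait-classification model.
    import numpy as np
    from sklearn.ensemble import BaggingClassifier
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    # Synthetic statistical features (mean, std, skewness, ...) per gait window.
    X = rng.normal(size=(500, 12))
    y = rng.integers(0, 2, size=500)           # 0 = normal gait, 1 = impaired

    model = BaggingClassifier(
        estimator=KNeighborsClassifier(n_neighbors=5),
        n_estimators=20,
        max_features=0.5,                      # each k-NN sees half the features
        random_state=0,
    )
    model.fit(X, y)
    print(model.predict(X[:3]))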
During remote sensing of the Earth, satellite equipment registers solar radiation reflected by the Earth's surface. This reflected radiation travels through the atmosphere, which distorts the spectral characteristics of the radiation reaching the satellite's sensors. The task of atmospheric correction is to eliminate the influence of these distortions. Currently, data from most satellites are in the public domain, but only a small part of them have undergone atmospheric correction. As a rule, users who have access to the data of a particular satellite can perform atmospheric correction with the appropriate application programs; however, this processing is carried out interactively for each image of a specific area.
The paper proposes a method for correcting satellite images using a neural network, which makes it possible to automate atmospheric correction for "raw" satellite images. The method is based on a fairly simple neural network with an encoder-decoder architecture. A pre-prepared dataset contains images without correction and the corresponding already corrected images, which are available directly in the satellite data storage. A neural network is trained on this dataset and is then used to perform atmospheric correction of images from this satellite.
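A minimal sketch of such an encoder-decoder network is shown below in PyTorch; the band count, layer sizes and depths are assumptions for illustration, not the paper's architecture.

    # Hypothetical encoder-decoder sketch for image-to-image correction.
    import torch
    import torch.nn as nn

    class CorrectionNet(nn.Module):
        def __init__(self, bands: int = 4):      # e.g. 4 spectral bands (assumed)
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(bands, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, bands, 4, stride=2, padding=1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.decoder(self.encoder(x))

    net = CorrectionNet()
    raw = torch.randn(1, 4, 256, 256)   # a "raw" image patch
    corrected = net(raw)                # trained with a pixel loss vs. reference
    print(corrected.shape)              # torch.Size([1, 4, 256, 256])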
Generative models have become widespread over the past few years, playing a valuable part in content creation. Generative adversarial networks (GANs) are one of the most popular types of generative model. However, the computational power required for training stable, large-scale, high-resolution models can be enormous, making training or even running such models expensive. Research on neural network optimization proposes various techniques for lowering the required GPU memory, shortening training time and creating more compact models without a noticeable loss in generated sample quality. In this research we apply quantization techniques to a GAN and evaluate the results on a custom dataset.
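As a hedged illustration, the sketch below applies PyTorch post-training dynamic quantization to a small generator-like module; this is a generic example of the technique, not the configuration studied in this research.

    # Generic post-training dynamic quantization sketch in PyTorch; the toy
    # model stands in for a generator and is not the model used in the study.
    import torch
    import torch.nn as nn

    model = nn.Sequential(              # toy "generator head"
        nn.Linear(128, 512), nn.ReLU(),
        nn.Linear(512, 1024),
    )
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8   # int8 weights for linear layers
    )
    z = torch.randn(1, 128)             # latent vector
    print(quantized(z).shape)           # same interface, smaller weights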
The study is devoted to developing an algorithm for extracting the names of organizations from poorly structured data. Bibliographic information about the publications from the abstract database Scopus was taken as the initial data.
The main problem in extracting organization names from affiliations, apart from typos, is that journals and conferences impose different requirements on how affiliations are written. As a result, affiliations to the same organization are written in different ways, which prevents statistical analysis by organization. The authors therefore analyzed 750 records with author affiliations, used them for a statistical analysis of affiliation-writing templates and compiled a list of the 10 most frequently used ones (out of 186 different templates in total). Based on the compiled templates, an algorithm was developed to identify organization names.
To analyze the effectiveness of this method, the authors conducted an experiment comparing the accuracy of organization name identification by two algorithms: one developed without templates and one based on templates. The results of the experiment confirm that the template-based method is the more promising direction for further development of the algorithm.
Natural language processing technologies are one of the key areas in the field of data analysis. Natural language processing covers a variety of tasks, including named-entity recognition, which makes it possible to extract valuable information from large amounts of data. The study is devoted to selecting the best software packages for named-entity recognition in Russian news texts.
To choose the packages, a corpus of 70 news articles was collected from different Internet resources. "Natasha", "SpaCy", "Stanza" and "DeepPavlov" models («ner rus bert probas», «ner rus bert», «ner ontonotes bert mult») were selected for the experiment. Named entities were extracted both manually and with the packages. After the packages were run, the data were processed and the precision, recall and F-measure metrics were calculated.
According to the results of the experiment, the packages "Natasha" and "SpaCy" were selected for their accurate recognition of entities. It was concluded that "Natasha" better recognizes entities such as "PER" (person) and "LOC" (location), while "SpaCy" is able to recognize entities such as "ORG" (organization) without breaking the semantics. The result can be used to create an algorithm for recognizing entities in Russian-language articles for further data analysis.
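A minimal sketch of this kind of extraction with spaCy is shown below; the model name and example sentence are assumptions, and a real evaluation would compare the output with the manual markup to obtain precision and recall.

    # Hypothetical NER sketch with spaCy's Russian pipeline; the model name
    # and text are illustrative (the model is assumed pre-installed).
    import spacy

    nlp = spacy.load("ru_core_news_sm")
    doc = nlp("Объединенный институт ядерных исследований находится в Дубне.")
    for ent in doc.ents:
        print(ent.text, ent.label_)      # e.g. ORG, LOC, PER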
The modern world is undergoing digitalization in various spheres of life, driven by the need for more efficient and accurate methods of handling data. This paper focuses on the development of a system that can track the publication activity of researchers in a scientific organization. The system is developed within a project to track the publication activity of the staff of the Joint Institute for Nuclear Research. Key components of the architecture were tested within this project, namely methods of programmatic data collection and processing, the creation of a data lake, and the construction of an analytical panel with typical visualizations. Data were collected on 36,785 scientific publications and 10,245 authors. Once collected, the data were processed and enriched with software tools, for example by identifying the geo-data of author affiliations or extracting text keywords. The result is an interactive dashboard displaying typical visualizations, from a pie chart to an author affiliation map, that help track the publication activity of the organization's researchers.
The research is devoted to the development of requirements for new software that will automate the collection and analysis of data on the processing of biomaterial samples from existing systems. The paper compares the data storage capabilities of information systems used in automated medical laboratories in Russia and abroad.
To improve the efficiency of a medical laboratory, it is necessary to analyze its data. Besides the lack of important information, this analysis is complicated by the large volume of test tube data (large laboratories receive 15,000 test tubes per day, each of which may have 10-15 records along its path) and by the storage of data on the tube's path in different information systems (LIS, middleware, sample tracking, temperature sensor monitoring). This leads to lengthy collection and analysis of data on the situation in the laboratory.
The authors compiled 3 schemes in BPMN notation reflecting the path of a test tube at 3 stages: the biomaterial collection point, logistics and production. Based on these schemes, a list of 23 laboratory problems was compiled, whose validity and relevance were confirmed by experts. To control these problems, a list of 20 timestamps (time points on the path of a medical sample) was compiled. Based on the compiled lists, 12 systems were described that make it possible to solve the 23 problems and collect the necessary 20 timestamps. As part of the study, a timestamp collection system was selected, for which functional requirements were described for transfer to the More Data development department and subsequent implementation in the laboratory.
The study of Big Data is important, since technologies in this area make it possible to use large amounts of information effectively. The authors studied data from scientific and news sources and present an analysis of the development of Big Data technologies. The analysis examines the development of the Big Data market both worldwide and in individual countries and focuses on the relationship between this process and scientific research in the Big Data field. Based on the collected data, the authors identified the global trend in the development of Big Data technologies as well as the trends of individual countries. The analysis was carried out using visual analytics, with various international and Russian information sources and services. The results of the study can be useful both for business and for scientific research aimed at improving the economic and social development of countries.
Executing millions of scientific high-throughput computing (HTC) jobs on distributed heterogeneous computing resources poses challenges in observing their status and behavior after their completion. To address this, an approach was developed to analyze jobs using scatter plots, showcasing the dependency between job durations and the relative performance of CPU cores they were assigned to. Subsequently, a specialized system was created to automate this analysis process. The system regularly collects relevant data regarding finished jobs within the DIRAC infrastructure.
A web application was developed using the Django web framework on the server side and the HTML+CSS+JavaScript stack on the client side, offering the necessary tools and filters to highlight different aspects of the operation, such as final status, processors used, cluster names and the submitting user. The Highcharts JavaScript library was used to visualize the results. After investigating several approaches, it was decided to store the data in CSV files, which the web application uses as the data source for analysis.
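A hedged sketch of the underlying analysis step is given below: reading such a dataset and plotting job duration against the relative CPU core performance; the column names and values are hypothetical, not the system's actual schema.

    # Hypothetical analysis sketch: job duration vs. relative CPU performance;
    # the inline DataFrame stands in for pd.read_csv("jobs.csv").
    import pandas as pd
    import matplotlib.pyplot as plt

    jobs = pd.DataFrame({
        "duration_s": [3600, 5400, 1800, 7200],
        "cpu_power": [10.5, 8.2, 15.1, 6.3],   # relative core performance
        "status": ["Done", "Done", "Failed", "Done"],
    })
    for status, group in jobs.groupby("status"):
        plt.scatter(group["cpu_power"], group["duration_s"], label=status)
    plt.xlabel("relative CPU core performance")
    plt.ylabel("job duration, s")
    plt.legend()
    plt.savefig("jobs_scatter.png")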
The developed system has proven to be invaluable, enabling the identification of issues on remote servers and demonstrating performance disparities among different computing resources. It facilitates efficient monitoring and analysis of HTC jobs, improving the overall understanding of their execution behavior.
Distributed systems require an efficient and reliable consensus mechanism to reach agreement between nodes. In recent years, two popular consensus algorithms, Practical Byzantine Fault Tolerance (PBFT) and Raft, have gained wide acceptance in the community due to their advantages: PBFT provides high speed and Byzantine fault tolerance, while Raft is simple and easy to understand. However, each of them has its limitations, especially in scalable systems. This paper proposes a combined approach that unites the advantages of PBFT and Raft to achieve scalable and fault-tolerant distributed consensus. The proposed method uses PBFT as the bottom layer, providing high speed and fault tolerance in scalable scenarios, while Raft is used as the top layer, providing simplicity and reliability in configuration management and leader selection.
Effective management of experimental facilities and accelerator complexes of various levels of complexity is of great importance for modern science. Engineering solutions in this area differ, but they all come down to using specialized object-oriented distributed systems for hardware control.
This talk presents the operation algorithm of the structural elements of the Tango Controls distributed control system, one of the most effective control systems today. The algorithm clearly demonstrates the main capabilities of Tango Controls for receiving and processing data through its structural elements: the device server and the Tango Controls client.
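For illustration, a minimal device server of this kind can be sketched with the PyTango high-level API as below; the device class and attribute are hypothetical, not the configuration discussed in the talk.

    # Hypothetical minimal Tango Controls device server using the PyTango
    # high-level API; the device class and attribute are illustrative only.
    from tango.server import Device, attribute

    class DemoDevice(Device):
        """A toy device exposing one read-only attribute to Tango clients."""

        @attribute(dtype=float)
        def temperature(self):
            # A real device server would read hardware here.
            return 21.5

    if __name__ == "__main__":
        DemoDevice.run_server()

A client could then read the attribute through a DeviceProxy for the registered device name (the name itself is an assumption of the deployment).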
The License Management System (LMS) was developed at the Meshcheryakov Laboratory of Information Technologies of JINR. The purpose of the LMS is to automate the management, acquisition, maintenance and use of licensed software products. The article presents the results of the system's development over the last year. The request coordination mechanism (workflow) was imported from "EDMS Dubna" and, based on its implementation, adapted for the LMS; various requests were created: a user request for a new license, a request for adding new software products to the supported software catalog, and an auditor's request for the purchase of additional licenses. Work has been carried out on filling the LMS database and on the convenient presentation of information about licenses and various statistics. The integration of the LMS with other services within the JINR Digital EcoSystem is also considered.
JINRLIB is a library of programs intended for solving a wide range of mathematical and physical problems. The JINRLIB programs are written in different systems and programming languages and belong to various areas of computational mathematics and computational physics. There is a section of programs written using parallel computing technologies, in particular MPI and OpenMP. The programs are combined into libraries or exist as self-contained application packages. The library is replenished mainly with programs by MLIT staff members.
Some of the programs written in Fortran are combined into libraries of object modules. The experience of using these Fortran program libraries from modern programming languages (Python) is described.
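One common way to call Fortran code from Python is NumPy's f2py, sketched below with a hypothetical routine; this is a generic illustration, not the specific JINRLIB workflow.

    # Generic f2py illustration (hypothetical routine, not a JINRLIB program).
    # Given a Fortran file addsub.f90:
    #
    #   subroutine add(a, b, c)
    #     real(8), intent(in)  :: a, b
    #     real(8), intent(out) :: c
    #     c = a + b
    #   end subroutine add
    #
    # it can be compiled into a Python extension module:
    #
    #   python -m numpy.f2py -c addsub.f90 -m addsub
    #
    # and then used from Python (intent(out) arguments become return values):
    import addsub
    print(addsub.add(2.0, 3.0))   # -> 5.0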
The report discusses a corporate geographic information system designed to optimize management decisions for the operation of the laboratory building. The purpose of the service is the competent management of the building and its technical communications, the monitoring of engineering networks, the visual representation of staff placement, keeping a log of ongoing construction and repair work, and the generation of various analytical reports.
The information and graphic service makes it possible to view floor plans of the building with the existing footage and building axes, and to display engineering networks in layers: power supply, water supply, heat supply, ventilation, fire alarm, telephone communication, and the location of departments and personnel in offices.
Abstract. The general formulation of the identification problem is considered. The change of the set of admissible solutions upon adding new hypotheses (or changing existing ones) is discussed. Balanced identification technology is used to connect mathematical models with data. Simple illustrative examples are used to study the dynamics of some statistical estimates of modeling accuracy when "correct and incorrect" hypotheses are added. Examples of applying the proposed methods to the modeling of real objects are given.
Conclusion. The presented results demonstrate the effectiveness of using the cross-validation root-mean-square error and the standard deviation for assessing the acceptability of additional hypotheses.
The research was supported by the Russian Science Foundation, grant No. 22-11-00317, https://rscf.ru/project/22-11-00317/
Scheduling tasks and allocating resources in a cloud (distributed) system differs significantly from resource management on a single computer. Cloud (global) schedulers view the system as a large pool of resources to which they have full access. At the same time, the most pressing task remains maintaining a balance: managing each individual computing task so that its associated constraints are met, while the total load of the system meets the requirements of its owner (full utilization, return on the provision of resources, etc.). As a rule, however, the execution of a single task is controlled independently of other tasks and of the state of the system as a whole, and priorities, scheduling options for individual tasks and their parameters are not taken into account. The absence of global coordinating mechanisms leads both to an increase in the execution time of individual tasks and to the underutilization of resources. On the other hand, end users have no information about when they will obtain a solution or about service capabilities, which leads to a loss of quality of service. Allowing the user to control performance expectations for each job improves the quality of service for that job, but may negatively affect the performance of other jobs.
Classical schedulers are built either to minimize response time (real time) or to maximize total resource utilization (time sharing). Since the purpose of global schedulers is to improve the state of the system as a whole, the requirements of individual consumers are hardly taken into account, and user tasks can run for hours. To minimize service time, the strategy of classical schedulers must be adjusted to take into account, on the one hand, the interests of users, task priorities and their execution times and, on the other hand, information about the state and distribution of system resources for the overall optimization of its operation.
The purpose of this work is to present a systematic approach to the construction of high-performance specialized computing systems (SCS) that perform resource-intensive tasks of a special class, whose execution methods can be defined as random enumeration with an unknown outcome [1, 2]. Here, obtaining a solution is based on enumeration algorithms and comes down to searching for a fragment with predetermined properties in a large array of initial data. Such an array, as a rule, consists of separate, indivisible, equally sized, meaningful fragments. A task is considered solved as soon as a unique element can be identified in some piece of data. The tasks are of different types, including searching for a graphic object in map fragments for recognition, searching the Internet for some text in a given fragment, and encryption and decryption tasks.
Based on previous works [3-6], the following conclusion can be drawn: the use of classical schedulers in SCS does not allow their capabilities to be fully realized and reduces productivity and efficiency.
To eliminate this, methods for managing such systems based on intelligent agents (IAs) have been developed. Such IAs, having no information about the initial state of the system, can significantly increase the productivity and efficiency of the SCS using statistical data collected on its operation.
IAs manage the passage of tasks through the SCS based on the parameters they assign to each task, without an analytical description of the entire system. For this reason, this type of control can be attributed to artificial intelligence systems.
The report discusses a systematic approach to the construction of various SCS control schemes based on IAs. Approaches to measuring the maximum performance will be shown, and the quality of operation of such systems will be analyzed.
References:
[1] Малашенко Ю.Е., Назарова И.А. Модель управления разнородными вычислительными заданиями на основе гарантированных оценок времени выполнения // Изв. РАН. ТиСУ. 2012. No. 4. С. 29-38.
[2] Купалов-Ярополк И.К., Малашенко Ю.Е., Назарова И.А. и др. Модели и программы для системы управления ресурсоемкими вычислениями. М.: ВЦ РАН, 2013. http://www.ccas.ru/depart/malashen/papper/ronzhin2012preprint.pdf
[3] Голосов П.Е., Гостев И.М. О некоторых имитационных моделях планировщиков операционных систем // Телекоммуникации. 2021. No. 6. С. 10-21.
[4] Голосов П.Е., Гостев И.М. Об имитационном моделировании функционирования операционной системы с вытесняющим планированием // Телекоммуникации. 2021. No. 8. С. 2-22.
[5] Голосов П.Е., Гостев И.М. Имитационное моделирование серверов с прерываниями в больших многопроцессорных системах // Известия вузов. Приборостроение. 2021. Т. 64. No. 11. С. 879-886. https://doi.org/10.17586/0021-3454-2021-64-11-879-886
[6] Golosov P.E., Gostev I.M. About one cloud computing simulation model // Systems of Signals Generating and Processing in the Field of on Board Communications, Conference Proceedings. 2021. P. 9416100. https://doi.org/10.1109/IEEECONF51389.2021.9416100
[7] Golosov P.E., Gostev I.M. Cloud computing simulation model with a sporadic mechanism of parallel problem solving control // Scientific and Technical Journal of Information Technologies, Mechanics and Optics. 2022. Vol. 22, No. 2. (in Russian)
The paper considers methods to improve the performance of a multi-agent system for knowledge representation and processing. An approach to the development of system and application software agents is described, and methods for distributing agents over the nodes of the computing system and for constructing the optimal logical structure of a distributed knowledge base are considered. A scheme for managing distributed information-computing resources is presented, including methods for determining the availability of microservices and for ensuring the reliable and coordinated work of computing nodes.
Keywords: distributed system, multi-agent system, knowledge base, software agents, reinforcement learning, optimization of knowledge base structure
This talk describes the development of two WordPress plugins for the website of the Joint Institute for Nuclear Research (JINR): "Media Mentions" and "Event Calendar". The plugins were developed to give site administrators new tools for more flexible site configuration. "Media Mentions" displays links to media materials related to the Institute's work, with configurable parameters such as the number of news items per page, the heading and the styles used. "Event Calendar", in turn, visualizes all upcoming events as an interactive calendar automatically filled with information from the JINR event plan. Both plugins are implemented so that their output can be placed anywhere on any page of the site as a separate block using a so-called "shortcode". Although the plugins were developed for the JINR website, both are published under an open license and can be used on other WordPress-based sites.
The main properties of mesons are considered within an effective quantum field theory with nonlocal interaction. A mathematical apparatus has been created for the self-consistent solution of a system of nonlinear integral equations. The meson mass spectrum and the meson coupling constants have been computed numerically.
The properties of the $\eta$ and $\eta'$ mesons at finite temperature of nuclear matter have been investigated. An algorithm has been created for the self-consistent solution of the Schwinger-Dyson and Bethe-Salpeter equations. Numerical calculations of the mass spectrum of pseudoscalar mesons have been carried out.
Solving hydro- and aerodynamics problems with CFD requires a great deal of time and computing power. In the design of various engineering solutions (for example, vehicles) this is a critical factor, since many different variants have to be considered, most of which will not be used later. To optimize the design process, a way is needed to quickly pre-screen the hydro- and aerodynamic properties of the parts being developed, which would make it possible to identify the most promising ones at early stages. In this work we consider a way to address this problem with neural networks, training them to find the pressure and velocity fields of a fluid flow around an obstacle.
Currently, task placement in clusters plays an important role, since good placement significantly reduces the execution time of parallel applications. To allocate tasks efficiently, the scheduler must consider both the topology of the cluster and that of the input task.
In this work, we study various cluster topologies and consider several task placement algorithms. In particular, we propose naive task placement algorithms that do not consider any topology and are based on either random placement or node enumeration. We also consider algorithms that only consider the topology of the cluster and algorithms that consider both the topology of the cluster and that of the task.
To compare these algorithms, we developed an application that implements these algorithms and simulates clusters with 2D and 3D torus topologies, as well as fat tree and thin tree topologies. Through the developed application, we conducted a series of experiments that studied the performance of abstract applications in different situations.
As a result of the study, it was established that the task placement algorithm that considers both the topology of the cluster and that of the task significantly outperforms other task placement algorithms.
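To make the trade-off concrete, the sketch below compares random placement with a simple contiguous placement on a 2D torus by total pairwise communication distance; the all-to-all cost model and sizes are illustrative assumptions, not the algorithms evaluated in the study.

    # Illustrative comparison of random vs. contiguous task placement on a 2D
    # torus; the cost model and sizes are assumptions for this sketch.
    import itertools
    import random

    def torus_distance(a, b, n):
        """Manhattan distance on an n x n torus between nodes a and b."""
        dx = abs(a[0] - b[0]); dy = abs(a[1] - b[1])
        return min(dx, n - dx) + min(dy, n - dy)

    def total_cost(placement, n):
        """Sum of pairwise distances, as if every task pair communicates."""
        return sum(torus_distance(p, q, n)
                   for p, q in itertools.combinations(placement, 2))

    n, tasks = 8, 16
    nodes = [(x, y) for x in range(n) for y in range(n)]
    random_placement = random.sample(nodes, tasks)
    contiguous_placement = [(x, y) for x in range(4) for y in range(4)]  # 4x4

    print("random    :", total_cost(random_placement, n))
    print("contiguous:", total_cost(contiguous_placement, n))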
An essential part of any system’s security architecture is an authentication mechanism – some algorithm or combination of algorithms making sure that only legitimate users can gain access to the system. Continuous authentication (CA) is a new approach to user authentication in distributed systems. Its main principle is that unlike a “traditional” approach, where a user is only authenticated once at the beginning of a session, in CA the user’s identity is re-verified throughout the entire session. This means that even if a user’s device becomes compromised after a successful log-in, unauthorized access to the system can still be prevented. CA is a part of a larger cybersecurity doctrine known as zero-trust architecture, or ZTA.
Internet of Things (IoT) systems are growing more common and more sophisticated by the day; consequently, the need to provide security for them, including reliable authentication systems, is also becoming more urgent. On the other hand, IoT devices also present unique challenges in regard to implementation of authentication mechanisms; in particular, they might lack computing power necessary for more complex algorithms, as well as conventional user interfaces such as keyboards or touchscreens.
In this paper, distributed system continuous authentication algorithms that can be used with IoT systems are investigated. They include methods using such technologies as blockchain, machine learning, and biometrics. Based on the results of the analysis, new approaches to the task of implementing CA in an IoT context are suggested.
In superconductor-ferromagnet-superconductor structures, the anomalous Josephson effect is observed, which consists in the appearance of a phase shift; such junctions are called $\varphi_0$ Josephson junctions.
This work presents the results of a study of the dynamics and resonance properties of the $\varphi_0$ junction. The magnetization dynamics in the ferromagnetic layer is described by the Landau-Lifshitz-Gilbert equation, and the dynamics of the phase difference between the superconducting layers by the equation of the resistive model. On the basis of a numerical solution of the system of equations, it is shown that ferromagnetic resonance is realized in the ferromagnetic layer under the influence of Josephson oscillations and that a resonance branch appears on the current-voltage characteristic of the $\varphi_0$ junction. Within certain ranges of the model parameters, approximate equations for the magnetization are obtained in the form of a harmonic oscillator equation and a nonlinear Duffing oscillator equation. Analytical solutions of the approximate equations are compared with the numerical solutions, and their agreement is demonstrated. The influence of the model parameters on the ferromagnetic resonance is also shown.
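For reference, the standard textbook forms of the two equations named above are sketched below, with notation chosen here for illustration: the Landau-Lifshitz-Gilbert equation for the magnetization $\mathbf{M}$ and the resistively shunted junction (RSJ) equation with an anomalous phase shift $\varphi_0$.

    % Landau-Lifshitz-Gilbert equation (standard form):
    \frac{d\mathbf{M}}{dt} = -\gamma\, \mathbf{M}\times\mathbf{H}_{\mathrm{eff}}
      + \frac{\alpha}{M_s}\, \mathbf{M}\times\frac{d\mathbf{M}}{dt}
    % RSJ model with the anomalous current-phase relation:
    I = I_c \sin(\varphi - \varphi_0) + \frac{\hbar}{2eR}\frac{d\varphi}{dt}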
To solve the problem of classifying diseases of the cardiovascular system based on the results of Holter monitoring, the following algorithm has been developed:
1. Data preprocessing based on the quantum phase space approach;
2. Binarization of numerical features;
3. Application of machine learning methods to solve the classification problem.
A set of programs has been developed in the Maple computer algebra system for constructing the instantaneous heart rhythm (IHR) function and creating slices of 3D histograms. A set of Python programs has been developed for loading, preprocessing and analyzing the 3D histogram slices. The support vector machine (SVM) method has been implemented for analyzing the slices of 3D histograms in order to classify the studied data into categories (normal, deviations from the norm) with an accuracy of 93%.
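A minimal sketch of the classification step is shown below with scikit-learn; the flattened histogram-slice features and labels are synthetic placeholders, not the study's IHR measurements.

    # Hypothetical SVM classification sketch for 3D-histogram slices.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))      # flattened histogram slices (assumed)
    y = rng.integers(0, 2, size=200)    # 0 = normal, 1 = deviation from norm

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))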
Keywords: quantization, quantization constant, phase space, instantaneous heart rhythm, visualisation, machine learning.
The advantages of cloud technologies have allowed the latter to occupy a certain niche in the field of scientific computing. Following that trend, a cloud infrastructure was deployed at JINR around a decade ago, and later in scientific organizations of its Member States. These cloud resources were integrated into a Distributed Information and Computing Environment (DICE) to combine computational power for solving common scientific tasks, as well as to distribute peak loads across participants. Over the last couple of years, activities on the JINR cloud infrastructure have mostly focused on increasing the quality of service. The main issues faced and the ways of solving them are given. Some statistics on resource utilization are provided, as well as changes in both the JINR cloud and the DICE infrastructure.
Modern systems and applications are increasingly distributed due to growing performance, scalability and availability requirements. Distributed computing makes it possible to flexibly aggregate the resources of individual machines into scalable computing infrastructures with the required characteristics. However, distributed systems are hard to build, test and operate because of their asynchronous and nondeterministic nature, the absence of a global clock, partial failures and large scale. There is therefore ongoing work both in academia and industry to advance the methods and technologies for solving the related problems, and a growing need to educate qualified specialists in this field.
Due to the large scale of the systems under consideration, it is generally infeasible, or too time-consuming and expensive, to conduct experiments and evaluate proposed solutions in a real system. Also, due to client behavior and the dynamicity and non-determinism of production environments, the experimental conditions are hard to control and the results are not reproducible, which is unsuitable for comparing several solutions. Building a copy of a real system, or even a new system solely for research purposes, is also economically infeasible. Similar observations can be made for education in distributed computing: while it is possible to build a small real lab environment for students, such an environment requires significant effort to operate and cannot expose students to all the problems that occur in modern large-scale systems.
Replacing a real system with simulation resolves these issues. Simulation significantly reduces the cost and time needed to run experiments, while requiring far fewer resources. For researchers, it enables the study of alternative system configurations and application scenarios, provides full control over the environment and ensures reproducibility. Simulation can also be used in education to provide students with a virtual environment for practical assignments that simulates common problems, such as node crashes and network failures, and makes it possible to deterministically execute and check student solutions.
In this report, a general-purpose software framework for the simulation of distributed systems, called DSLab, is presented. The main advantages of DSLab in comparison to other similar projects are its versatility and extensibility, convenient and flexible programming model, high performance and ability to simulate large-scale systems. DSLab is organized as a set of loosely coupled software modules, which allows users to flexibly assemble solutions for specific purposes. Current modules include a generic discrete-event simulation engine, models of basic system resources (compute, storage and network), reusable modeling primitives, a message-passing simulator and a set of domain-specific simulators for research areas such as task scheduling and cloud resource management. The functionality of these modules, their evaluation and their use in research and educational projects are discussed. DSLab is available as an open source project on GitHub: https://github.com/osukhoroslov/dslab.
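To illustrate the concept of a generic discrete-event simulation engine, a minimal heap-based sketch is given below; this is only an illustration of the idea, not DSLab's actual API (DSLab is a separate project with its own interfaces).

    # Generic discrete-event simulation engine sketch (concept illustration).
    import heapq

    class Simulation:
        def __init__(self):
            self.clock = 0.0
            self._queue = []   # min-heap of (time, seq, handler, payload)
            self._seq = 0      # tie-breaker for events at the same time

        def schedule(self, delay, handler, payload=None):
            heapq.heappush(self._queue, (self.clock + delay, self._seq,
                                         handler, payload))
            self._seq += 1

        def run(self):
            while self._queue:
                self.clock, _, handler, payload = heapq.heappop(self._queue)
                handler(self, payload)

    def on_message(sim, payload):
        print(f"t={sim.clock:.1f}: received {payload}")

    sim = Simulation()
    sim.schedule(1.5, on_message, "ping")
    sim.schedule(0.5, on_message, "hello")
    sim.run()   # events fire in timestamp order: hello at 0.5, ping at 1.5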
Abstract
The article explores the importance of horizontally scalable technologies for storing and processing digital footprints, a crucial component of IT professional training for accelerating digital transformation. It begins by defining digital footprints, subsequently addressing their increasing role in modern IT education and digital transformation. The discussion progresses to the pivotal role of horizontal scalability in digital footprint management and introduces the CAP theorem as a fundamental principle affecting the design of distributed systems. An overview of cutting-edge scalable storage and processing technologies follows, including a discussion of the trend towards relaxing ACID properties for scalability, as implied by the CAP theorem. A comparative analysis of NoSQL databases is presented, highlighting their suitability for storing digital footprints under CAP constraints. The unique capabilities of Intel DAOS for digital footprint management are also examined. The significance of distributed message brokers in the efficient stream processing of digital footprints is addressed, followed by a brief review of the most popular scalable brokers. The article underscores the role of the Virtual Computer Lab in the training process and its potential impact on digital transformation. It concludes by emphasizing the need for partnerships with leading data centers for integrating High Performance Computing (HPC) solutions into the educational process and outlines potential challenges and solutions in this domain.
Introduction
Digital transformation refers to the integration of digital technology into all aspects of a business or organization, fundamentally changing how it operates and delivers value to its customers. It is more than just a change in external business processes; it is a cultural shift that requires organizations to continually challenge the status quo, experiment, and be comfortable with failure. The transformation may involve changes to business models, ecosystems, and customer engagement, among others, with the end goal of improving operational efficiency and meeting changing customer needs.
The digital transformation journey involves the use of innovative technologies such as cloud computing, big data, artificial intelligence (AI), and the Internet of Things (IoT) to enhance business operations. It also includes the digitization of information, increased use of software and applications, and the use of data analytics to drive decisions. The drive for digital transformation is fueled by changing customer expectations, increased competition, and the need for businesses to stay relevant in a rapidly evolving digital landscape.
Horizontally scalable digital footprint storage and processing technologies refer to systems that can handle increased data load by adding more machines or nodes to the network, rather than upgrading the existing infrastructure. These technologies are designed to accommodate the rapid and often unpredictable growth of digital footprints, which represent the data created and left behind as a result of individuals' and organizations' digital activities.
In the context of digital footprints, storage refers to the technologies used to store the vast amounts of data that these footprints generate. This can include anything from traditional database systems to modern cloud storage solutions. The key is that these technologies need to be scalable, allowing for the addition of more storage capacity as the volume of digital footprints grows.
On the other hand, processing technologies are those that are used to analyze and extract valuable insights from these digital footprints. These technologies include tools and frameworks for big data analytics, machine learning, and other advanced data processing methods. Just like storage technologies, these processing technologies also need to be horizontally scalable to keep up with the increasing volume and complexity of digital footprints.
These horizontally scalable digital footprints storage and processing technologies are crucial in today's digital age, where the volume of data is growing at an unprecedented rate. They enable organizations to effectively manage and gain insights from their digital footprints, thereby driving innovation, improving decision-making, and ultimately accelerating digital transformation.
IT professional training plays a pivotal role in driving successful digital transformation initiatives. As businesses continue to evolve in response to technological advancements, the need for skilled IT professionals who are conversant with emerging technologies and methodologies is paramount.
Digital transformation often involves the implementation of new technologies and processes that may be unfamiliar to an organization's existing IT staff. Professional training helps bridge this skills gap, enabling IT teams to effectively manage, maintain, and optimize these new systems. Training provides IT professionals with the knowledge and skills to not only manage new technologies but also to identify opportunities for their application. This can drive innovation, as employees use their training to find new ways to solve problems and create value.
With the surge in digital activities, cybersecurity risks have also increased. IT professional training in the latest security practices and technologies is critical to safeguarding an organization's digital assets during and after the transformation process.
Training in areas such as data analytics, AI, and machine learning can equip IT professionals to better understand and respond to customer needs, leading to improved customer experiences – a key objective of many digital transformation initiatives. Digital transformation often requires organizations to be more agile and responsive. IT professional training in areas such as DevOps, agile methodologies, and cloud computing can foster this agility, enabling quicker responses to changing market dynamics.
Definition and overview of digital footprints
Digital footprints refer to the trail of data that individuals and organizations create and leave behind while using the internet and digital services. These footprints can be broadly categorized into two types: active and passive.
Active Digital Footprints: These are intentionally created and shared by individuals or organizations. For instance, social media posts, emails, online articles or blogs, and website content all form part of an active digital footprint. When an organization maintains a website or a social media presence, it's creating an active footprint. Similarly, when an individual posts a photo, updates their status, or writes a review online, they contribute to their active digital footprint.
Passive Digital Footprints: These are created without the direct intentional action of the user. They are usually generated when different digital services and platforms collect and store data about user activities. Examples include browsing history, location data, search logs, and other metadata that can be collected through cookies, tracking pixels, or other similar technologies.
Both types of digital footprints are valuable sources of data. For individuals, they represent their online identity and behavior, which can impact personal reputation, privacy, and even security. For businesses, digital footprints provide a wealth of information about customers, competitors, and market trends. This data can be analyzed to gain valuable insights, informing strategic decisions, improving products and services, and enhancing customer engagement.
As the volume of digital footprints grows with increased use of digital services, effective management, storage, and processing of this data become increasingly critical. This is where horizontally scalable digital footprints storage and processing technologies come into play, enabling organizations to effectively handle the growing data load and extract valuable insights.
The role of scalable digital footprints storage and processing in digital transformation
Scalable digital footprints storage and processing technologies play a crucial role in data management. As businesses generate and collect more data, managing this data effectively becomes increasingly challenging. Scalable technologies enable businesses to store and process larger volumes of data, thereby improving data management. They also ensure that as data volumes grow, businesses can continue to store, access, and analyze this data efficiently and effectively. This improved data management capability can help businesses make more informed decisions and gain a competitive advantage.
With the ability to handle larger volumes of data, scalable storage and processing technologies can significantly enhance data analytics capabilities. They enable businesses to analyze larger, more complex datasets, thereby generating more accurate and comprehensive insights. These insights can inform strategic decision-making, improve operational efficiency, and drive business growth. Additionally, scalable technologies can support real-time or near-real-time analytics, enabling businesses to respond quickly to changing conditions and opportunities.
Scalable digital footprints storage and processing technologies can also support the delivery of more customer-centric services. By enabling businesses to collect, store, and analyze large volumes of customer data, these technologies can provide a more detailed understanding of customer behaviors, preferences, and needs. This can inform the development of more personalized, relevant, and responsive services, thereby enhancing the customer experience and promoting customer loyalty.
Finally, scalable digital footprints storage and processing technologies can drive innovation and business growth. By providing the capacity to handle large volumes of data, these technologies enable businesses to explore new ways of using this data, potentially leading to the development of new products, services, or business models. They also support business growth by enabling businesses to manage increasing data volumes as they expand their operations. Furthermore, they can facilitate the identification of trends and opportunities that can drive business growth [1–37].
The importance of digital footprints in the modern IT-education
A digital footprint is becoming an essential aspect of digital citizenship. As our world becomes more digitally interconnected, understanding and managing digital footprints is becoming an increasingly important skill for students. Education plays a vital role in preparing IT professionals for this digital reality.
The importance of digital footprints in education is multi-faceted. Here are several ways they can be significant:
• Learning Opportunities: Students can use digital footprints to learn about the importance of online safety, privacy, and ethical behavior. The concept of a digital footprint can serve as a real-world example of the consequences of online activities. Educators can use this topic to teach students about these concepts and discuss their implications.
• Personal Branding: A digital footprint can be viewed as a personal brand. It's the accumulation of your online activities, including your social media posts, blog entries, comments, and more. This brand can be a positive reflection of a student's personality, skills, and accomplishments. Students can learn how to create a positive online presence that can be beneficial for college applications, scholarships, or job prospects.
• Critical Thinking and Media Literacy: Understanding and managing digital footprints can help students develop critical thinking skills. They learn to consider the potential long-term impacts of their online activities and make more informed decisions. This is also tied to media literacy – understanding how information is created, shared, and perceived online.
• Online Safety and Privacy: By learning about digital footprints, students can become more aware of their online safety and privacy. They can better understand how their personal information can be accessed, used, and potentially misused, leading to safer online practices.
• Cyberbullying Prevention: Understanding digital footprints can help prevent cyberbullying. Students learn that their online activities are traceable, potentially leading to consequences if they engage in harmful behaviors. It can also help victims of cyberbullying understand that there are ways to trace and report harmful actions.
• Future Opportunities: Today, colleges and employers often look at the digital footprints of applicants. A well-managed, positive digital footprint can open up opportunities, while a poorly managed one can close them.
By recognizing the significance of digital footprints in education, students can develop crucial skills and knowledge that will serve them well in navigating the digital landscape responsibly and effectively.
At present, IT-professional training typically covers a broad range of topics, including programming, systems analysis, cybersecurity, and database management. There is a strong emphasis on understanding the fundamentals of computing, problem-solving, and developing software applications. While these subjects are crucial, the rapid growth in data generation and digital transformation initiatives necessitates a shift in focus towards modern data management and processing techniques.
Given the proliferation of data and the increasing reliance on data-driven decision making, it's critical for IT professionals to understand how to manage, store, and process large volumes of data effectively. Businesses are looking for professionals who are familiar with modern, scalable technologies like distributed file systems, NoSQL databases, cloud storage, and distributed computing frameworks. Therefore, incorporating these subjects into IT professional training is crucial to prepare the workforce for the demands of the modern business environment.
Integrating scalable digital footprints storage and processing technologies into IT professional training offers several benefits. It provides IT professionals with the skills needed to manage and analyze large volumes of data, which are critical for driving digital transformation. This training can improve job prospects, as there's a high demand for professionals with these skills. It can also enable professionals to contribute more effectively to their organizations, supporting data-driven decision making and innovation.
The System Analysis and Control Department of Dubna State University has successfully integrated these technologies into its curriculum and offers master's programs covering scalable data storage and processing technologies, including courses on distributed computing and machine learning at scale. Graduates of these programs have gone on to work in a variety of data-intensive roles, including data scientist, data engineer, and machine learning engineer [38–40].
Similarly, many businesses are investing in internal training programs to upskill their existing staff. For instance, large retail companies such as X5 Retail Group have implemented training programs covering scalable storage and processing technologies as part of their digital transformation initiatives. These programs help to build teams capable of leveraging Big Data to improve customer insights, operational efficiency, and decision making.
The role of horizontal scalability in footprints management
Horizontal scalability, also known as "scaling out," is a method of adding more machines or nodes to a system to improve its performance and capacity as demand increases. This contrasts with vertical scalability, or "scaling up," which involves increasing the capacity of a single machine, such as adding more memory or a faster processor.
In the context of digital footprints storage and processing, horizontal scalability allows a system to handle larger volumes of data by spreading the load across multiple machines. When the system reaches its limit, more machines can be added to continue scaling its capacity. This is typically done in a distributed computing environment, where multiple machines work together to perform a task.
The advantage of horizontal scalability is that it can, theoretically, allow for infinite scaling, as you can continue adding machines as long as you have the resources to do so. It also offers better fault tolerance: if one machine fails, the system can continue to operate by relying on the remaining machines.
Horizontal scalability is a critical feature of modern storage and processing technologies. As the volume and velocity of data generation continue to grow, being able to scale systems horizontally ensures they can handle the increasing load while maintaining performance. This capability is especially important in the realm of big data and real-time processing, where systems must be able to process large volumes of data quickly and efficiently.
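To make the scaling-out idea concrete, the following minimal Python sketch implements a toy consistent-hash ring, a technique many distributed stores use to spread keys across nodes so that adding a machine remaps only a small share of the data. The node names and the number of virtual nodes here are arbitrary illustrations, not a prescribed configuration:

    import bisect
    import hashlib

    class HashRing:
        """Toy consistent-hash ring: keys map to nodes; adding a node
        moves only a small fraction of the keys (the scaling-out idea)."""

        def __init__(self, nodes=(), vnodes=64):
            self.vnodes = vnodes
            self._ring = []          # sorted list of (hash, node) entries
            for node in nodes:
                self.add_node(node)

        @staticmethod
        def _hash(key: str) -> int:
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def add_node(self, node: str) -> None:
            # Each physical node appears at several points on the ring.
            for i in range(self.vnodes):
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

        def node_for(self, key: str) -> str:
            # Route the key to the next node clockwise on the ring.
            h = self._hash(key)
            idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(["node-a", "node-b", "node-c"])
    print(ring.node_for("user:42"))   # the same key always routes to the same node
    ring.add_node("node-d")           # scaling out remaps only ~1/4 of the keys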
Horizontally scalable technologies, which add more machines or nodes to a system to increase capacity, offer several significant advantages. Here are some of the key benefits:
• Improved Performance: Horizontally scalable technologies can improve system performance by distributing workloads across multiple nodes or machines, reducing the load on any single node, and potentially speeding up processing times.
• Increased Capacity: By adding more machines or nodes, horizontally scalable systems can handle larger volumes of data or transactions. This is particularly valuable in the age of big data, where the volume and velocity of data generation can be massive and unpredictable.
• High Availability and Fault Tolerance: In a horizontally scalable system, if one node fails, the system can continue to operate by relying on the other nodes. This contributes to high availability and fault tolerance, ensuring that services remain up and running, and data loss is minimized.
• Cost-Effective Scaling: While the initial setup of a horizontally scalable system can be complex, it can be more cost-effective to scale over time. Rather than replacing existing hardware with more powerful (and often more expensive) machines, we can simply add relatively inexpensive machines or nodes as our needs grow.
• Flexibility: Horizontal scalability provides flexibility, allowing you to scale your systems based on demand. It becomes possible to add resources during peak times and reduce them when they're not needed, leading to more efficient use of resources.
• Better Load Balancing: Horizontal scalability improves load balancing, as requests can be distributed across multiple servers, reducing the chance of any single server becoming a bottleneck.
• Easier to Manage: While managing a distributed system can have its own complexities, in many ways, adding more similar machines can be easier than constantly upgrading a single machine to a more powerful version.
By leveraging these advantages, horizontally scalable technologies can help businesses effectively manage their digital footprints, improve system performance, and ensure high availability – all critical factors in today's fast-paced digital world.
Overview of current scalable storage technologies
Scalable storage technologies are designed to handle a growing amount of data while maintaining performance and reliability. These technologies allow for both horizontal and vertical scalability, but for the sake of this discussion, we will focus on those that are horizontally scalable. Here's an overview of some key scalable storage technologies:
• Distributed File Systems: Distributed file systems like Hadoop's HDFS are designed to store large volumes of data across multiple machines in a network, and cloud-scale stores such as Google Cloud Storage and Amazon S3 follow the same scale-out principle. They allow for horizontal scaling by simply adding more machines to the network, thereby increasing storage capacity. They also provide redundancy, ensuring data is not lost even if a machine fails.
• NoSQL Databases: Unlike traditional SQL databases, NoSQL databases like Cassandra, MongoDB, and Couchbase are designed to scale horizontally. They distribute data across multiple nodes, and as data volume grows, more nodes can be added to the network. NoSQL databases are particularly well-suited for handling large volumes of unstructured or semi-structured data.
• Object Storage: Object storage systems like Intel DAOS, Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage store data as objects rather than in a file hierarchy or block addresses. This makes them highly scalable and ideal for storing unstructured data like multimedia files, which can vary greatly in size (a small usage sketch follows this list).
• Distributed Block Storage: Distributed block storage systems like Ceph break data into blocks and distribute them across multiple nodes (scale-out file systems such as GlusterFS take a similar approach). They can scale horizontally by adding more nodes, and they offer high performance and reliability.
• Cloud Storage Services: Cloud storage services like Google Cloud Storage, Amazon S3, and Microsoft Azure Storage provide scalable, on-demand storage capacity. They allow businesses to easily scale their storage capacity up or down as needed, without having to invest in additional hardware.
• Software-Defined Storage (SDS): SDS solutions, such as Ceph or VMware vSAN, separate storage hardware from the software that manages the storage infrastructure. This allows for greater flexibility and scalability, as storage resources can be managed and allocated dynamically based on application needs.
• Hyper-converged Infrastructure (HCI): HCI combines storage, computing, and networking into a single system to reduce data center complexity and increase scalability. HCI systems use software and x86 servers to replace expensive, purpose-built hardware.
• Persistent Memory (PMEM): PMEM, such as Intel's Optane DC, blurs the line between memory (RAM) and storage. It can retain data even when powered off, like storage, but can be accessed at speeds comparable to memory. This can significantly improve the performance of data-intensive applications.
• Automated Storage Tiering: This technology automatically moves data between different types of storage media based on its usage, value, and performance requirements. It helps optimize storage resources and reduce costs.
• Flash Storage (SSDs): Flash storage devices, or solid-state drives (SSDs), store data on flash memory chips. They offer faster data access speeds and are more energy-efficient than traditional hard disk drives (HDDs), and they are widely used in data centers and for high-performance applications. They are not new, of course, but their storage capacity, durability, and speed keep growing while power consumption falls, which matters for the green economy and sustainable development.
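As a small illustration of the object-storage model mentioned above, the following hedged Python sketch stores a single footprint event in Amazon S3 via the boto3 SDK; the bucket name, key layout, and event fields are hypothetical choices for the example, not a prescribed scheme:

    import json
    import boto3  # AWS SDK for Python

    s3 = boto3.client("s3")  # credentials and region come from the environment

    event = {"user_id": "u42", "ts": "2023-07-03T10:15:00Z", "kind": "page_view"}

    # Store one footprint event as an immutable object; a key layout with
    # user and date prefixes makes later listing and batch processing easy.
    s3.put_object(
        Bucket="footprints-archive",
        Key="events/u42/2023-07-03/evt-000001.json",
        Body=json.dumps(event).encode("utf-8"),
    )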
These scalable and modern storage technologies are integral to managing the large and rapidly growing volumes of data generated in today's digital world. By providing the ability to easily scale storage capacity, they enable businesses to effectively manage their digital footprints and leverage this data to drive insights and innovation.
Overview of the most popular scalable processing technologies
Scalable processing technologies are designed to handle increasing amounts of data and computational tasks efficiently. As data volume grows, these technologies can distribute the load over more machines or resources, improving performance and ensuring tasks are completed in a timely manner. Here are some key scalable processing technologies:
• Distributed Computing Frameworks: Frameworks such as Apache Hadoop and Apache Spark allow for distributed processing of large data sets across clusters of computers. They're designed to scale up from a single server to thousands of machines, with a high degree of fault tolerance (a short Spark sketch follows this list).
• Stream Processing Engines: Technologies like Apache Kafka and Apache Flink are designed for processing high-volume, real-time data streams. They allow for horizontal scaling and provide capabilities to handle large influxes of data in real-time.
• NoSQL Databases: NoSQL databases such as MongoDB or Couchbase are not only built to manage large volumes of data across many servers with high performance and availability, but also integrate MapReduce and other processing functionality.
• In-Memory Databases: In-memory databases like Redis and SAP HANA store data in memory rather than on disk for faster processing and are often used to back real-time analytics and data marts. They can scale horizontally to handle larger data volumes.
• Container Orchestration Systems: Kubernetes, an open-source system for automating deployment, scaling, and management of containerized applications, allows for horizontal scaling based on the demand or load on the system.
• Serverless Computing: Serverless computing platforms like AWS Lambda, Oracle Functions, and Google Cloud Functions allow for automatic scaling of application functionality. They can run code in response to events and automatically manage the resources required by the code.
• GPU-Accelerated Computing: GPU-accelerated computing leverages the parallel processing capabilities of GPUs (Graphics Processing Units) for computational tasks. This can dramatically speed up workloads like machine learning, data analysis, and computational science.
• Machine Learning Frameworks: Machine learning frameworks like TensorFlow and PyTorch have capabilities to distribute computation across multiple GPUs, multiple machines, or large-scale cloud-based deployments, enabling scalable data processing and model training.
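As a brief illustration of the distributed-frameworks item above, the following Python (PySpark) sketch aggregates footprint events stored as JSON lines on HDFS; the paths and column names are assumptions for the example, not a prescribed layout:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("footprint-aggregates").getOrCreate()

    # Hypothetical input: one JSON object per line with fields ts, user_id, kind.
    events = spark.read.json("hdfs:///footprints/events/")

    # Count events per day and per kind; Spark distributes the work
    # across however many executors the cluster provides.
    daily = (events
             .withColumn("day", F.to_date("ts"))
             .groupBy("day", "kind")
             .count())

    daily.write.mode("overwrite").parquet("hdfs:///footprints/daily_counts/")
    spark.stop()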
These scalable processing technologies enable organizations to handle the growing volume and complexity of data, supporting data-driven decision-making, real-time insights, and advanced analytics. They play a critical role in managing and gaining value from digital footprints in the era of big data and digital transformation.
Relaxing ACID in favor of horizontal scalability
The decision not to use ACID (Atomicity, Consistency, Isolation, Durability) for digital footprints could be driven by the need for scalability, real-time processing, analytics efficiency, compatibility with distributed systems, and specific application requirements.
Digital footprints often generate a large volume of data. ACID transactions can introduce overhead and impact performance when processing and storing such high-volume data. By relaxing ACID properties, systems can achieve higher scalability and performance by prioritizing data ingestion and processing speed over transactional consistency.
Digital footprints often capture events and activities that occur in real-time. Achieving strong consistency in such scenarios, where data is continuously changing and distributed across multiple systems, can be challenging. By relaxing ACID properties, systems can adopt eventual consistency, where data consistency is guaranteed over time, but not necessarily at the exact moment of data ingestion.
Digital footprints are frequently used for analytics and reporting purposes, where complex queries and aggregations are performed on the data. ACID transactions may hinder the performance and efficiency of these analytical processes, as they often involve large-scale data processing. By loosening ACID guarantees, systems can optimize query performance and improve overall analytics capabilities.
In modern corporate or social environments, digital footprints are often generated and stored across distributed and decentralized systems, such as cloud-based platforms, consumer, banking, medical, transport or learning management systems, and mobile applications. Coordinating ACID transactions across these disparate systems can be complex and resource intensive. Embracing more relaxed consistency models, like eventual consistency, can simplify the integration and synchronization of data from multiple sources.
The requirements for data consistency and transactional guarantees vary across different applications and use cases. For some digital footprint scenarios, a certain level of inconsistency or data staleness may be tolerable without significantly impacting processes or decision-making. By tailoring the consistency requirements to specific use cases, systems can optimize performance and resource utilization.
Comparative analysis to validate the choice of NoSQL databases for storing digital footprints
Most NoSQL solutions are designed to handle large amounts of data, but they have different focuses, strengths, and weaknesses, and are designed for different types of workloads.
In our courses at the Institute of System Analysis and Control, we often prefer Apache Cassandra for its illustrative ring architecture and its gossip protocol for metadata exchange, as well as its ability to handle large volumes of data and thousands of concurrent users or operations per second. It allows more servers to be added easily to increase capacity, provides high availability with no single point of failure, and handles a high write load.
It can be a good choice for collecting and storing digital footprints when scalability, high availability, and performance are critical. Cassandra is a column-oriented database, which makes it well suited to storing and querying large amounts of structured, semi-structured, or unstructured data. This is possible because Cassandra relaxes ACID guarantees: it is not designed to support complex multi-operation transactions or joins like a relational database. In terms of the CAP (Consistency, Availability, Partition Tolerance) theorem, Cassandra prioritizes Availability and Partition Tolerance (AP), as do most NoSQL databases. It offers eventual consistency, meaning that if no new updates are made to a given data item, eventually all accesses to that item will return the latest updated value. This relaxes the consistency guarantee in favor of availability and partition tolerance. However, despite being primarily an AP database, Cassandra allows the consistency level to be tuned per operation; for example, it is possible to require that a write be acknowledged by two, three, or all nodes in a replica set. This provides some flexibility but still falls short of the full consistency guarantee that CP systems provide. In the event of a network partition, Cassandra chooses to remain available, accepting writes even if they cannot be immediately replicated to all nodes (the hinted handoff feature). Cassandra also offers a write-availability option: as long as a single replica for the data being written is up and reachable, the write can succeed.
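To illustrate the per-operation tuning described above, the sketch below uses the Python cassandra-driver to write footprint events at consistency level ONE (fast, eventually consistent) and read them back at QUORUM (stronger); the contact points, keyspace, and table schema are hypothetical:

    from cassandra.cluster import Cluster
    from cassandra import ConsistencyLevel
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1", "10.0.0.2"])   # contact points are illustrative
    session = cluster.connect("footprints")        # keyspace assumed to exist

    # Fast, eventually consistent ingestion: one replica acknowledgment suffices.
    ingest = SimpleStatement(
        "INSERT INTO events (user_id, ts, kind) VALUES (%s, toTimestamp(now()), %s)",
        consistency_level=ConsistencyLevel.ONE,
    )
    session.execute(ingest, ("u42", "quiz_completed"))

    # Stronger read for a report: a majority of replicas must answer.
    report = SimpleStatement(
        "SELECT kind, ts FROM events WHERE user_id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    rows = session.execute(report, ("u42",))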
However, other NoSQL databases can also be used to store digital traces, considering their advantages and disadvantages.
MongoDB:
MongoDB is a document-oriented NoSQL database, making it highly flexible and adaptable. It supports a rich and dynamic data model, which can be an advantage when dealing with unstructured or semi-structured data.
Pros: MongoDB supports a rich and flexible document-oriented data model. It's easy to scale horizontally and offers automatic sharding. It also provides robust support for developer productivity, with SDKs for many languages, as well as high performance for read and write operations, especially those involving large volumes of data. MongoDB supports multiple index types, including secondary indexes, which can greatly improve the speed of data retrieval.
Cons: MongoDB might not perform as well with transaction-heavy applications. Also, tuning MongoDB for performance can sometimes be complex. It can consume a lot of system memory, especially under heavy load, which might be a concern in resource-constrained environments. MongoDB's query language and indexing options are powerful, but they can also be complex to understand and use correctly, especially for complex queries and aggregations.
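A minimal pymongo sketch of this document model might look as follows; the database, collection, and field names are illustrative only:

    from datetime import datetime, timezone
    from pymongo import MongoClient, ASCENDING, DESCENDING

    client = MongoClient("mongodb://localhost:27017")  # connection string is illustrative
    events = client["footprints"]["events"]

    # A compound secondary index speeds up per-user, most-recent-first queries.
    events.create_index([("user_id", ASCENDING), ("ts", DESCENDING)])

    events.insert_one({
        "user_id": "u42",
        "ts": datetime.now(timezone.utc),
        "kind": "lesson_view",
        "meta": {"lesson": 17, "duration_s": 214},   # schema-free payload
    })

    latest = events.find({"user_id": "u42"}).sort("ts", DESCENDING).limit(10)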
Redis:
Redis, which stands for Remote Dictionary Server, is an in-memory data structure store that can be used as a database, cache, or message broker. It supports various types of data. Redis offers data persistence, so it’s possible to snapshot the in-memory database onto disk either by time or by the number of writes since the last snapshot.
Pros: Redis provides very fast data access as it's an in-memory data store, making it ideal for caching and real-time analytics. It supports various data structures like strings, hashes, lists, and sets. Redis has a built-in publish/subscribe messaging system, which is useful for real-time messaging use cases.
Cons: Being an in-memory database, Redis can be limited by memory size. For persistence, it requires periodic saving of the dataset to disk which might impact performance.
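A short redis-py sketch of the caching and real-time patterns mentioned above could look like this; the key names and pub/sub channel are made up for the example:

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Real-time counters: one INCR per page view, O(1) and in memory.
    r.incr("views:lesson:17")

    # A capped recent-activity feed per user, trimmed to the last 100 entries.
    r.lpush("recent:u42", "quiz_completed:17")
    r.ltrim("recent:u42", 0, 99)

    # Pub/sub fan-out of raw footprint events to interested consumers.
    r.publish("footprints", '{"user_id": "u42", "kind": "quiz_completed"}')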
CouchDB:
CouchDB is another NoSQL database developed by Apache, which focuses on ease of use and embracing the web. It works just as well as a single-node database on a shared server as it does in a large distributed system, thanks to multi-master replication, which maintains multiple copies of the data and thus ensures high availability and disaster recovery. CouchDB is not designed to handle complex relationships between documents; it is best for use cases where documents can stand alone. In certain use cases, such as large-scale writes or complex queries, CouchDB may not perform as well as other NoSQL databases.
Pros: CouchDB supports a multi-master replication system, making it a good choice for distributed systems. It also provides a RESTful interface for interaction.
Cons: CouchDB might not be the best option for applications that require complex querying or aggregations.
Couchbase:
Couchbase is a NoSQL document database with a distributed architecture built for performance, scalability, and availability. It enables developers to build applications more easily and quickly by combining the power of SQL with the flexibility of JSON. Couchbase has an in-memory-first architecture, offering high speed for read and write operations. It provides horizontal scalability with a distributed architecture in which data is automatically partitioned across all available nodes. Couchbase offers a SQL-like query language, making it easier for developers coming from a SQL background to create and manage data, and has built-in full-text search capabilities, making it easier to find relevant information in a large dataset. It stores data in flexible JSON documents, providing the flexibility to modify the schema on the fly.
Pros: Couchbase provides powerful indexing and querying capabilities. It's known for its high performance, scalability, and flexible JSON model.
Cons: Couchbase can be resource-intensive compared to some other databases, meaning it might require more powerful hardware to run effectively. Also, the learning curve can be a bit steep due to its unique architecture, and some features are commercially licensed, which can make it more expensive than other solutions.
HBase:
HBase is a distributed, scalable, big data store and a part of the Apache Hadoop ecosystem that provides random, real-time read/write capabilities on top of the Hadoop Distributed File System (HDFS). HBase is designed to scale linearly with the addition of more hardware. It can host large tables on top of clusters of commodity hardware.
Pros: HBase, built on Hadoop, is designed for large tables with billions of rows hosted on clusters of commodity hardware, and it scales linearly as more hardware is added. Unlike many other Hadoop tools, which are oriented towards batch processing, HBase provides real-time read and write access to big data. It guarantees strong consistency for reads and writes, which can be a critical requirement for certain types of applications. HBase also integrates well with other Hadoop ecosystem tools: it uses Hadoop's distributed file system to store its data and can be a source or destination for MapReduce jobs.
Cons: HBase is not suitable for low-latency applications due to its write-ahead log design. It also requires a fair amount of setup and maintenance.
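For illustration, a hedged sketch with the happybase Python client (which talks to HBase through its Thrift gateway) might store and scan per-user events as follows; the table name, column family, and row-key scheme are assumptions:

    import happybase  # Python client for HBase via the Thrift gateway

    conn = happybase.Connection("localhost")   # Thrift server address is illustrative
    table = conn.table("footprints")           # table with a 'd' column family assumed

    # Row key = user id + reversed timestamp, so a prefix scan returns
    # a user's most recent events first.
    table.put(b"u42#9999999999", {b"d:kind": b"quiz_completed", b"d:lesson": b"17"})

    for key, data in table.scan(row_prefix=b"u42#", limit=10):
        print(key, data)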
Kudu:
Apache Kudu is an open-source columnar storage engine for the Hadoop ecosystem, designed for fast analytics on rapidly changing data.
Pros: Kudu is excellent for fast scans thanks to its columnar storage design, which makes it ideal for analytical queries and real-time analytics. Unlike many other Hadoop-compatible storage options, Kudu supports real-time data insertion, updates, and deletes, making it suitable for scenarios requiring fast data modifications. Kudu integrates well with Hadoop ecosystem tools like MapReduce, Spark, and Impala, providing flexibility in the choice of processing framework, and it is designed with a distributed architecture meant to scale and handle failures.
Cons: Kudu is not the best choice for storing large objects or blobs and may not perform as well as some other data stores for heavy write workloads. Like other distributed systems, managing and configuring Kudu can be complex.
Greenplum:
Greenplum Database, owned by VMware, is an open-source, massively parallel processing (MPP) SQL database management system. It's designed to manage large-scale analytic data warehouses and business intelligence workloads.
Pros: The MPP architecture of Greenplum enables it to scale linearly, both in terms of data volume and query performance, by simply adding more nodes to the system. As a relational database management system, Greenplum fully supports SQL, including many advanced features, which makes it easy to use for those familiar with SQL. Greenplum also offers data compression techniques that allow it to store large amounts of data efficiently, and it integrates well with various data formats and sources, including CSV, Avro, and Parquet files, as well as external databases via JDBC or ODBC.
Cons: Greenplum is optimized for analytical workloads and large queries across massive amounts of data. It's not designed for transactional workloads (OLTP). Compared to some more widely adopted databases, there may be less community support and fewer readily available resources for troubleshooting and optimization. As with most distributed systems, Greenplum can be complex to set up and manage.
Neo4j:
Neo4j is a highly scalable, native graph database purpose-built to leverage not only data but also the connections between data. It's designed to handle high-complexity queries with ease.
Pros: Neo4j, as a graph database, is excellent for handling data where relationships are key. It supports ACID properties and provides a powerful query language, Cypher.
Cons: Neo4j might not scale horizontally as easily as some other NoSQL databases. Also, it can be more resource-intensive for storing and querying data compared to other types.
Elasticsearch:
Elasticsearch is a distributed, open-source search and analytics engine built on Apache Lucene. It's designed for horizontal scalability, reliability, and easy management, and is often used for log and event data analysis, as well as search functionality in applications. Elasticsearch can easily scale horizontally to handle large amounts of data while maintaining fast response times, and the underlying Apache Lucene library provides powerful full-text search capabilities with a comprehensive set of querying and filtering options.
Pros: Elasticsearch is excellent for searching and analyzing large amounts of data in near real-time. It is scalable, distributed, and can index many types of content.
Cons: Elasticsearch might be overkill for simple search use-cases. Also, managing and maintaining an Elasticsearch cluster can be complex.
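A minimal sketch with the official Python client (8.x API) indexing and searching footprints might look like this; the index name and document fields are illustrative:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # cluster address is illustrative

    # Index one footprint event; the mapping is created on the fly here.
    es.index(index="footprints", document={
        "user_id": "u42",
        "ts": "2023-07-03T10:15:00Z",
        "kind": "search",
        "query_text": "cassandra consistency levels",
    })

    # Near-real-time full-text search over the collected footprints.
    hits = es.search(index="footprints", query={
        "match": {"query_text": "consistency"},
    })
    print(hits["hits"]["total"])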
InfluxDB:
InfluxDB is an open-source database written in Go and developed by InfluxData. It's specifically designed for time-series data, i.e. timestamped data points, which makes it highly suitable for logging, sensor data, real-time analytics, and monitoring systems. InfluxDB can handle high write loads and still query effectively, making it a good choice for applications that need to write and read data rapidly. With InfluxDB 2.0, InfluxData introduced a new scripting and query language called Flux, which is more powerful and flexible than the InfluxQL used in InfluxDB 1.x. InfluxDB is part of the larger InfluxData stack, which also includes Telegraf for data collection, Chronograf for visualization, and Kapacitor for real-time streaming data processing and alerting.
Pros: InfluxDB is designed specifically for time-series data, making it a good fit for applications such as monitoring systems, IoT sensor data, real-time analytics, and metrics collection. It offers high write and query performance (for example, in some situations it can be 5x faster than Cassandra or 1.5x faster than MongoDB), and the database is optimized for fast, highly available storage and retrieval of time-series data. InfluxDB uses lossless data compression, which reduces the amount of storage necessary for large volumes of data. A built-in HTTP API allows direct interaction with the database without a separate server or middleware, making integration easier. The Flux language is designed specifically for time-series data and includes many built-in functions for time-series analysis. It follows a functional programming model, which can be more intuitive for certain types of data manipulation, particularly time-based and streaming data. Flux not only retrieves data but also offers extensive capabilities for transforming and processing it, and it allows joining data across different buckets (the equivalent of databases in the relational model), which is useful for complex queries in InfluxDB.
Cons: InfluxDB is a time-series database and is not designed to store complex, relational data, so it may not suit applications requiring complex joins or transactions. Additionally, the open-source version of InfluxDB does not support features such as SAML/SSO authentication, data replication and scaling, automated backups, high availability, disaster recovery, and data encryption.
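For InfluxDB 2.x, a hedged sketch with the influxdb-client Python package could write a footprint point and run a small Flux aggregation as follows; the URL, token, organization, and bucket names are placeholders:

    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS

    client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
    write_api = client.write_api(write_options=SYNCHRONOUS)

    # One timestamped footprint: tags are indexed, fields carry the values.
    point = (Point("footprint")
             .tag("user", "u42")
             .tag("kind", "lesson_view")
             .field("duration_s", 214.0))
    write_api.write(bucket="footprints", record=point)

    # A Flux query counting events per hour over the last day.
    flux = '''
    from(bucket: "footprints")
      |> range(start: -24h)
      |> filter(fn: (r) => r._measurement == "footprint")
      |> aggregateWindow(every: 1h, fn: count)
    '''
    tables = client.query_api().query(flux)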
Riak KV (key-value) and TS (time series):
Riak KV is a distributed NoSQL key-value database with advanced local and multi-cluster replication that guarantees reads and writes even in the event of hardware failures or network partitions. Riak TS is a key-value NoSQL database that has been optimized for time-series data.
Pros: Riak KV is known for its high availability, fault tolerance, and operational simplicity. It offers excellent scalability and easy data recovery. Riak TS supports linear, horizontal scalability, making it suitable for applications that need to grow over time or handle large spikes in traffic, as well as special features such as automated data co-location, which can improve the efficiency of range queries. Riak is designed to survive network partitions and server failures with no single point of failure, which makes it highly reliable and available.
Cons: Riak KV may not be as efficient for use cases that require complex queries or transactions: it doesn't support complex querying capabilities out of the box, and queries are limited to key-value operations and range queries on keys. As Riak is an AP (Available and Partition-tolerant) system per the CAP theorem, there can be temporary inconsistencies in data during network partitions, although it does offer eventual consistency. Compared to other NoSQL databases, the community and ecosystem around Riak and Riak TS are smaller, which can mean fewer resources for troubleshooting and learning.
Storing digital footprints, a task that encompasses the collection, storage, and analysis of varied and extensive sets of user behavior data, requires a database solution that's not only robust and scalable but also flexible enough to handle complex, semi-structured data.
NoSQL databases are particularly well-suited for this task, due to their ability to store non-relational data, horizontal scalability, and flexibility in terms of the schema. However, it's important to consider that while these databases are powerful, they each have their trade-offs. For example, while Redis can provide extremely quick access to data, its in-memory nature might not be best for persistent storage of large datasets. On the other hand, HBase could handle vast datasets but may not be suitable for low-latency applications.
In the end, the ideal database for storing digital footprints will depend on various factors, such as the volume, variety, and velocity of data being generated, the need for real-time processing and analytics, the complexity and type of queries you'll need to perform, and the resources available for database management and optimization. When choosing a database for your specific use case, consider conducting a comprehensive analysis that takes these factors into account to ensure the technology aligns well with your project's requirements.
Intel DAOS capabilities for digital footprint management
Distributed Asynchronous Object Storage (DAOS) is an open-source software-defined object store that provides high bandwidth, low latency, and high I/O operations per second (IOPS) storage containers to HPC applications and workflows. It's developed by Intel and primarily designed to leverage next-generation NVM (Non-Volatile Memory) technologies like Storage Class Memory (SCM), NVMe (Non-Volatile Memory express), Optane Persistent Memory (3D XPoint).
DAOS has strong self-healing capabilities powered by placement maps that are stored on each storage target and I/O node. In case of storage target failure, it can rebuild the target in the background to maintain data redundancy. It achieves fault tolerance using erasure coding and replication. It is designed for extreme-scale storage and supports an almost unlimited number of Pools (storage clusters), Containers (user-defined storage units), and Objects (data units) and allows flexible and efficient resource utilization. It follows a Software-Defined Storage (SDS) approach, separating the data path from the control path. This allows it to bypass the kernel in the data path and make full use of the capabilities of NVM Express SSDs and Optane Persistent Memory.
Also, DAOS is meant to be a part of a larger ecosystem. It can be used in combination with other components like middleware libraries (HDF5, MPI-IO), distributed file systems (like Lustre, NFS), and data management services (like Apache Hadoop and Spark).
In the context of digital footprints, the data structure would likely be event-based, where each event represents a person interaction with a digital tool or platform. Each event could be stored as an object in DAOS, with attributes such as the Identifier (ID), the timestamp of the event, the type of event, and any additional data associated with the event.
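A minimal sketch of such an event record, independent of any DAOS API, might be expressed in Python as follows; the field names are illustrative assumptions, not part of DAOS itself:

    from dataclasses import dataclass, field
    from uuid import UUID, uuid4

    @dataclass(frozen=True)
    class FootprintEvent:
        """One user interaction, conceptually stored as a single object.
        Field names are illustrative, not a DAOS API."""
        event_id: UUID
        user_id: str
        ts: float            # Unix timestamp of the interaction
        kind: str            # e.g. "login", "lesson_view", "quiz_completed"
        payload: dict = field(default_factory=dict)   # event-specific attributes

    evt = FootprintEvent(uuid4(), "u42", 1688378100.0, "quiz_completed", {"score": 0.93})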
The role of distributed message brokers in footprints stream processing
Each action users take (like logging in, viewing a lesson, completing a quiz) can be considered a digital footprint and can be sent as a message to a broker. Multiple consumers (like analytics systems, monitoring tools, recommendation engines) can then independently process these messages.
That’s why message brokers can be especially useful in handling digital footprints for several reasons:
• Decoupling: Message brokers allow different parts of a system to communicate without being directly connected. This can help decouple the system, making it easier to modify, scale, and maintain.
• Reliability: Message brokers often provide features like message persistence, delivery acknowledgments, and retry mechanisms, which help ensure that messages aren't lost even if some parts of the system fail.
• Scalability: Message brokers can help distribute work among multiple consumers. If the volume of digital footprints increases, additional consumers can be added to handle the load.
• Asynchronous Processing: The processing of digital footprints can be done asynchronously, which is especially useful if the processing is time-consuming. The system can continue to accept new digital footprints while processing others.
• Ordering and Timing: Some message brokers can ensure that messages are processed in the order they were sent, or schedule messages to be processed at a certain time.
• Buffering: In the case of spikes in data, message brokers can act as a buffer, holding onto messages until the consumers are ready to process them.
Popular message brokers include Apache Kafka, RabbitMQ, Amazon SQS, etc. Each has its own strengths and is suited to different types of tasks, so the choice of broker would depend on the specific requirements of the system handling digital footprints.
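As a hedged illustration of this pattern with Apache Kafka, the kafka-python sketch below publishes footprint events to a topic and consumes them in a separate analytics group; the topic name, broker address, and handler are assumptions for the example:

    import json
    from kafka import KafkaProducer, KafkaConsumer

    def process(event):
        print("processing footprint:", event)   # placeholder for real analytics

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",      # broker address is illustrative
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Each user action becomes one message on a 'footprints' topic.
    producer.send("footprints", {"user_id": "u42", "kind": "login"})
    producer.flush()

    # An analytics consumer in its own group; monitoring or recommendation
    # services can read the same stream independently in other groups.
    consumer = KafkaConsumer(
        "footprints",
        bootstrap_servers="localhost:9092",
        group_id="analytics",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        process(message.value)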
Brief review of the most popular scalable message brokers
Message brokers play a crucial role in modern distributed systems as well as in digital footprints processing. They enable applications to communicate with each other, often in a publish-subscribe model, making them essential for event-driven architectures and real-time data processing tasks.
Apache Kafka:
Pros: Kafka is designed to handle real-time, high-volume data streams. It can be scaled horizontally to handle more data by adding more machines to the network. Kafka stores streams of records in categories called topics. Each topic is replicated across a configurable number of Kafka brokers to ensure data is not lost if a broker fails. Also, Kafka can be used with real-time processing systems like Apache Storm or Apache Samza.
Cons: Kafka's distributed design, while powerful, brings complexity and can be challenging to set up and manage. Users may also encounter a lack of advanced message routing, since Kafka relies primarily on topic-based routing. In addition, Kafka traditionally relies on ZooKeeper for managing and coordinating brokers, which adds to its complexity (newer Kafka releases can replace ZooKeeper with the built-in KRaft consensus mechanism).
RabbitMQ:
Pros: RabbitMQ supports several messaging protocols, including AMQP, STOMP, MQTT, and HTTP. It allows advanced message routing and offers a variety of routing options through exchanges, including direct, topic, headers, and fanout. RabbitMQ is developer-friendly, has a large and active community, and offers excellent developer support with client libraries in many languages.
Cons: RabbitMQ has lower throughput and may not perform as well as Kafka under high volumes of data. Also, it keeps messages in memory, which can lead to high memory usage.
Apache Pulsar:
Pros: Pulsar is a unified messaging and streaming system that provides both messaging (comparable to RabbitMQ) and event streaming (comparable to Kafka) capabilities, making it versatile for different use cases. Pulsar's architecture separates the serving and storage layers, allowing independent scaling and potentially improving performance and stability. It supports multi-tenancy for cases where different teams or applications must be isolated within the same cluster, and it supports message replication across multiple datacenters out of the box, which is useful for building distributed and resilient applications.
Cons: Pulsar is not as mature as Kafka, RabbitMQ, or Amazon SQS, which means it may not have as large a community or as many resources available for troubleshooting and support. While Pulsar has more built-in features than some other systems, managing these features can add operational complexity.
Of course, it's possible to use fully managed cloud message queuing for microservices, distributed systems, and serverless applications, such as Amazon SQS, Azure Event Grid, Azure Notification Hubs, or Google Cloud Pub/Sub, but this is beyond the scope of our review.
Message brokers act as a central hub to collect, integrate, and route data from various sources and facilitate seamless data integration, ensuring that digital footprints are captured efficiently. Also, they enable real-time processing of digital footprints. As data is ingested into the message broker, it can be immediately processed, transformed, and analyzed in real time as well as routed or filtered based on specific criteria or rules. Message brokers can store digital footprints data for a certain period or until consumed by the consuming applications or systems. This provides a temporary storage mechanism that ensures data availability and fault tolerance. Additionally, it allows replaying or reprocessing of data in case of failures or the need for historical analysis.
Role of the Virtual Computer Lab in training process and its anticipated impact on Digital Transformation
The open educational cloud datacenter «Virtual Computer Lab» was created at the Institute of System Analysis and Control by Mikhail Belov (https://belov.global) in 2007. Nowadays it is actively developed by the institute's leading professionals and plays a crucial role in IT-professional training, particularly in the context of learning scalable digital footprints storage and processing technologies. The Virtual Computer Lab provides a virtual environment where students can learn, practice, and experiment with these technologies. Here are some ways in which the Virtual Computer Lab contributes:
• Practical Experience: Virtual Computer Lab allows students to gain hands-on experience with the technologies they are learning about. They can run experiments, troubleshoot issues, and see the effects of their actions in real-time, which can enhance their understanding and skills.
• Accessibility: With the Virtual Computer Lab, students can access the lab environment from anywhere, at any time. This makes learning more flexible and convenient, as students can practice and learn at their own pace, without being constrained by the physical availability of lab resources.
• Scalability and Flexibility: The Virtual Computer Lab can be easily scaled up or down to accommodate different numbers of students or different learning needs. It can also be easily updated or reconfigured to incorporate new technologies or tools, making it a flexible learning resource.
• Safe Environment for Learning: In the Virtual Computer Lab, students can experiment freely without the risk of causing damage to physical equipment. When students make a mistake, they can simply reset the virtual environment and start over. This encourages experimentation and learning from mistakes, which is critical for mastering new technologies.
• Real-world Simulation: Virtual Computer Lab is designed to mimic real-world scenarios, providing students with practical experience that is directly applicable to the workplace. For example, students can learn how to manage and analyze large volumes of data in a simulated business environment, preparing them for real-world data management tasks.
In the context of learning scalable digital footprints storage and processing technologies, the Virtual Computer Lab provides a powerful platform for developing practical skills and understanding. It helps prepare IT-professionals to drive digital transformation initiatives effectively and efficiently.
The Virtual Computer Lab contributes to a better understanding of scalable digital footprints storage and processing technologies among IT-professionals. As more professionals are equipped with the necessary skills to handle large volumes of data and leverage them for insights, businesses can more effectively and efficiently transition their operations to digital platforms, resulting in an overall acceleration in the pace of digital transformation.
When IT-professionals are trained to use scalable technologies effectively, it can lead to significant improvements in efficiency and productivity. These technologies enable businesses to manage and analyze large volumes of data more efficiently, leading to faster decision-making and improved operational efficiency. This can result in higher productivity and better business outcomes.
By leveraging scalable digital footprints storage and processing technologies, companies can gain a competitive edge in the market. With more professionals trained in these technologies, businesses can more effectively harness their data for insights, leading to innovations in products, services, and business models. This can help businesses differentiate themselves from competitors and gain a significant competitive advantage [41–64].
Fundamentals of strategy for digital footprints integration into the training of IT-professionals
Developing a comprehensive curriculum is the first step towards integrating scalable digital footprints storage and processing technologies into IT professional training. The curriculum should cover key topics such as distributed computing, NoSQL databases, cloud storage, and data analytics at scale. It should also include modules on emerging technologies and trends to keep students abreast of the latest developments. Moreover, the curriculum should be designed in a way that builds on foundational IT knowledge and progressively introduces more complex concepts and skills [40].
Hands-on practical training and simulations in the Virtual Computer Lab are critical for effective learning. They allow students to apply the theoretical knowledge they gain in a practical context, enhancing their understanding and skills. Training programs should include lab sessions, projects, and simulations where students can work with real-world data and use scalable storage and processing technologies. These practical experiences can help students understand the challenges of managing and processing large volumes of data and learn how to overcome them.
Collaborating with industry partners can enrich IT professional training. Industry partners can provide valuable insights into the real-world applications of scalable digital footprints storage and processing technologies, helping to ensure that the training is relevant and practical. They can also offer internships, projects, and guest lectures, providing students with practical experience and exposure to industry practices. Such collaborations can help bridge the gap between academia and industry and ensure that students are job-ready when they graduate.
Given the rapid pace of technological advancement, continual learning and upskilling are crucial. IT-professionals need to regularly update their knowledge and skills to stay relevant. Training programs should therefore provide opportunities for continual learning, such as advanced courses, workshops, and seminars on emerging technologies and trends. They should also encourage students to pursue industry certifications, which can enhance their skills and employability. Moreover, a culture of lifelong learning should be fostered, encouraging students to take responsibility for their own professional development.
The importance of partnership with leading data centers to introduce HPC solutions into the educational process
Partnerships with leading data centers such as JINR (Joint Institute for Nuclear Research) or CERN (European Organization for Nuclear Research) are of great importance for the implementation of HPC (High-Performance Computing) solutions in preparing IT-professionals. Here are some reasons why such partnerships are significant:
• Cutting-edge Infrastructure: Collaborating with renowned data centers like JINR or CERN provides access to state-of-the-art infrastructure and supercomputing resources. These institutions invest heavily in high-performance computing systems, enabling advanced computational capabilities that are essential for training IT professionals in complex and data-intensive tasks.
• Expertise and Knowledge Sharing: Partnering with these leading data centers allows for valuable knowledge sharing and collaboration. JINR and CERN are home to some of the brightest minds in scientific research and computational science. Working closely with their experts provides an opportunity to exchange ideas, best practices, and innovative techniques in HPC, thus enriching the training of IT professionals.
• Reputation and Credibility: Partnering with internationally recognized institutions like CERN and JINR enhances the credibility and reputation of an organization involved in IT professional training. It signifies a commitment to excellence and cutting-edge technologies, attracting talented individuals and establishing credibility among potential employers.
A notable example is Vladimir V. Korenkov, a renowned IT expert in the Russian Federation and the Scientific Director of the Meshcheryakov Laboratory of Information Technologies at JINR, who is responsible for setting the strategic direction for the integration of HPC in education, determining what resources are necessary and how they can best be deployed to benefit students and researchers. He leads project teams in the development and implementation of HPC solutions, provides significant technical insight and guidance, helping to solve problems and decide which technologies to use, and plays a major role in developing educational materials and courses that teach students and researchers how to use and benefit from these HPC resources.
This is especially important because the proliferation of digital technologies and increased internet penetration globally has led to an explosion in the quantity of digital footprints. Every click, like, comment, share, download, or upload that we perform online leaves a trace. These traces, known as digital footprints, are generated at an unprecedented volume, velocity, and variety. This includes not only social media interactions but also e-commerce transactions, web searches, and even sensor data from IoT devices.
Every day, billions of people around the world use the internet, each leaving their unique digital footprints. The sheer scale of this data is enormous and still growing. According to estimates, the global data sphere will grow to 175 zettabytes by 2025, up from 33 zettabytes in 2018. A significant portion of this data will be digital footprints. Processing and making sense of this vast ocean of data using traditional methods or standard computing systems is not feasible due to the size and complexity of the data.
HPC solutions are designed to process and analyze massive amounts of data efficiently. They use parallel processing to perform high-speed computation tasks, making them well-suited for handling the volume, velocity, and variety of digital footprints. HPC can help in the real-time processing of these data, spotting trends, patterns, and anomalies.
Furthermore, the use of HPC is not just about handling the sheer size of the data. It's also about the need for sophisticated, high-speed analytics. This might involve complex machine learning algorithms to predict future behavior based on digital footprints or advanced graph analytics to understand the relationships between different entities. These tasks can be computationally intensive, further justifying the need for HPC solutions.
In essence, the enormous and ever-growing number of digital footprints necessitates the use of HPC solutions. Not only can HPC manage the scale of the data, but it can also facilitate the type of high-speed, advanced analytics needed to extract meaningful insights from these footprints. In this way, HPC becomes not just desirable, but essential in the era of Big Data.
Challenges and Potential Solutions
Integrating scalable digital footprints storage and processing technologies into IT-professional training comes with its own set of challenges. Technologies evolve rapidly, and staying current can be a daunting task. Training providers must constantly update their curriculum and teaching methods to ensure relevance. Some educational institutions might struggle with limited resources, both in terms of finances and expertise. Investing in the necessary tools and technologies, as well as training the trainers, could be a challenge. Bridging the gap between theoretical knowledge and practical skills is a significant challenge. Without adequate practical exposure, learners might struggle to understand the real-world applications of the technologies.
Institutions must adopt a dynamic approach towards curriculum design, constantly updating their content to keep up with technological advancements. Collaborating with industry partners can help mitigate the resource constraint issue: industry partners can provide the necessary financial support, tools, and expertise. A blend of theoretical instruction and practical exposure can ensure that students acquire both knowledge and hands-on skills. The Virtual Computer Lab, project-based learning, and internships can help facilitate practical learning.
As AI and machine learning continue to advance, these technologies will play a crucial role in processing and analyzing digital footprints. The rise of online learning and Virtual Computer Labs will provide more flexible and accessible learning opportunities for IT professionals worldwide.
Conclusion
This article delved into the concept of digital transformation and how horizontally scalable digital footprints storage and processing technologies are integral to it. It discussed how IT-professional training is a key factor in accelerating global digital transformation efforts. The benefits of horizontal scalability were outlined, along with a deep dive into scalable storage and processing technologies.
The crucial role of these technologies in enhancing data management, fostering customer-centric services, influencing data analytics, and promoting business growth was highlighted. It was demonstrated how integrating these technologies into IT professional training could significantly enhance the readiness of IT professionals and thus contribute to digital transformation efforts.
The importance of strategies for effective integration, including curriculum development, hands-on practical training, industry partnerships, and continual learning opportunities, was also discussed. The anticipated impact of integrating these technologies into IT-professional training on global digital transformation was examined, pointing towards an accelerated pace of digital transformation, improved efficiency and productivity, and fostered innovation.
In conclusion, the importance of horizontally scalable digital footprints storage and processing technologies in IT-professional training cannot be overstated. The ability to effectively manage, store, and process ever-increasing volumes of data is a core skill set for the IT professionals of today and tomorrow.
As businesses increasingly turn to data for decision-making and innovation, having IT-professionals trained in these technologies is a critical element in accelerating global digital transformation. It provides businesses with the necessary technical expertise to leverage their data effectively and can drive significant improvements in efficiency, productivity, and innovation.
In an era characterized by rapid technological change, it is crucial for IT professional training programs to continually evolve and incorporate the latest technologies and practices. In doing so, they will equip the IT-professionals with the skills they need to navigate the digital landscape, drive digital transformation efforts, and ultimately contribute to the growth and success of their organizations.
Under the continued leadership of Professor Evgenia N. Cheremisina, the Institute of System Analysis and Control has emerged as a leading institution in the realm of information technology, systems analysis, and control systems. With a focus on cutting-edge research and innovative educational practices, the institute is making notable contributions to the scientific community and the wider world.
Evgenia Cheremisina’ s guidance has been pivotal in driving the institute towards excellence. Her vision and commitment to innovation have steered the institute's focus towards pivotal technologies like horizontally scalable digital footprints storage and processing technologies, positioning the institute at the forefront of the digital transformation era.
With a strong emphasis on high-quality education and industry-relevant training, Evgenia Cheremisina and her successor Elena Yu. Kirpicheva’ s leadership have played an instrumental role in preparing the next generation of IT-professionals. Under their guidance, the institute has designed comprehensive IT-professional training programs that integrate the latest technologies and practices, preparing students to drive digital transformation efforts in their future roles.
Overall, the Institute of System Analysis and Control has positioned itself as a pioneering institution in the field of IT, continually advancing knowledge, driving innovation, and shaping the future IT-professionals. Evgenia Cheremisina, Elena Kirpicheva, Nadezhda Tokareva, Snezhana Potemkina’ s unwavering commitment to excellence, innovation, and student success has been instrumental in this regard and promises an exciting future for the institute.
Linked open data is crucial for Semantic Web development due to its ability to provide both unambiguous computer interpretation and human understanding of information. Despite active growth, including a variety of standards, methods, and tools for preparing linked data (LD), there is a gap between the idea and its ubiquity. It is still not easy to discover LD, difficult to link them, and rather hard to use them for collaborative processing. The reuse of LD, as well as the general implementation of the FAIR principles designed to counteract semantic chaos and enable information sharing, remains the most challenging. In this paper, the authors highlight the insufficiency of basic semantic standards (e.g. the JSON-LD and schema.org stack) and consider the possibility of semantic enrichment of data through the creation of an open LD interpretation environment. Special attention is given to technical solutions aimed at improving LD exploration capabilities.
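To make the baseline being criticized concrete, the following minimal sketch (in Python, with invented names and URLs) shows a typical schema.org description in JSON-LD; markup of this kind is machine-readable, yet by itself it does not solve the discovery, linking, and collaborative processing problems outlined above.

```python
import json

# A minimal, hypothetical JSON-LD description of a dataset using schema.org
# terms. The name and URLs are placeholders for this example only.
dataset_ld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example linked open dataset",      # hypothetical name
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data.csv"  # placeholder URL
    }
}

print(json.dumps(dataset_ld, indent=2))
```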
Efficient management and retrieval of scientific information are crucial in the era of big data and machine learning. This study presents a prototype of a recommendation system that helps researchers select the most suitable journal for publishing their scientific articles. The system utilizes metadata and keyword filtering techniques to retrieve relevant information from open APIs. By analyzing factors such as citation counts, publication dates, and keywords, the system compiles a thematic list of significant scientific sources. This comprehensive list aids researchers in knowledge exploration and research direction identification.
Additionally, the system generates visualizations that provide researchers with insights into the distribution of scientific articles, popular keywords, and overall trends. These visualizations offer valuable information for exploring related papers, identifying influential authors, and discovering emerging trends in the field. While visualizations are not interactive, they improve understanding of the research area and facilitate informed decision-making.
Compared to existing systems, the prototype offers several advantages; in particular, it leverages advanced machine learning algorithms for accurate and personalized journal recommendations.
In summary, the prototype recommendation system improves scientific information retrieval. By considering various factors and utilizing advanced algorithms, it assists researchers in selecting suitable journals, maximizing the impact and visibility of their work within the scientific community.
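As an illustration of the kind of ranking such a system performs, the toy Python sketch below scores journals by keyword overlap and a citation signal; it is a simplification for exposition, not the actual algorithm of the prototype.

```python
from dataclasses import dataclass

@dataclass
class Journal:
    name: str
    keywords: set          # keywords harvested from the journal's articles
    mean_citations: float  # average citations per article (from an open API)

def rank_journals(article_keywords: set, journals: list, alpha: float = 0.7):
    """Toy relevance score: a weighted mix of keyword overlap (Jaccard)
    and a normalized citation signal. Assumes a non-empty journal list;
    the real prototype uses richer metadata and ML models."""
    max_cit = max(j.mean_citations for j in journals) or 1.0
    def score(j):
        overlap = (len(article_keywords & j.keywords)
                   / len(article_keywords | j.keywords))
        return alpha * overlap + (1 - alpha) * j.mean_citations / max_cit
    return sorted(journals, key=score, reverse=True)

js = [Journal("J. Data Sci.", {"grid", "ml"}, 12.0),
      Journal("Phys. Comp.", {"hpc", "grid"}, 30.0)]
print([j.name for j in rank_journals({"grid", "ml"}, js)])
```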
Determining the degree of semantic similarity of texts is a key stage in solving a whole range of tasks, including search engines, machine translation systems, and other areas related to natural language processing; it includes text preprocessing, vectorization, feature extraction, metric selection, model building, etc.
The paper [1] presented an analytical platform that implements automated monitoring and intelligent analysis of the labor market's personnel needs across a university's range of specialties, and determines the demand for programs, profiles and areas of training in the higher education system. The analytical core of the platform is a module for semantic analysis and comparison of the texts of job vacancy announcements on the labor market with the formulations of educational professional competencies. The matching is based on vector representations of words, expressions and texts. Various metrics and matching methods were compared. To determine the demand for educational programs among employers, an analysis was carried out based on the competencies described in them.
The talk presents results related to the further development of the methods and approaches for matching higher professional education programs with labor market needs developed earlier in [1-3]. The source data on the higher education side are selected at the competence level: the texts of two of the three levels of mastery are used, as well as the texts of competence achievement indicators. The use of a number of metrics and models with different architectures, trained on different text corpora, is investigated. The accuracy of the obtained results is compared, and the list of analyzed educational programs has been extended.
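A minimal sketch of the underlying matching step, assuming pre-trained word embeddings (the platform's actual models and metrics are those compared in the talk):

```python
import numpy as np

def text_vector(tokens, embeddings):
    """Average the word vectors of the tokens present in the embedding
    model (e.g. word2vec or fastText vectors)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Matching a competence description against a vacancy text (toy vectors):
emb = {"data": np.array([1.0, 0.0]), "analysis": np.array([0.8, 0.6]),
       "python": np.array([0.1, 1.0])}
competence = text_vector(["data", "analysis"], emb)
vacancy = text_vector(["python", "data"], emb)
print(cosine(competence, vacancy))
```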
References
1. S.D. Belov et al. Methods and algorithms of the analytical platform for analyzing the labor market and the compliance of the higher education system with market needs // Proceedings of Science. SISSA. ISSN 1824-8039. DOI: 10.22323/1.429.0028.
2. Зрелов П.В., Кореньков В.В., Кутовский Н.А., Петросян А.Ш., Румянцев Б.Д., Семенов Р.Н., Филозова И.А. Мониторинг потребностей рынка труда в выпускниках вузов на основе аналитики с интенсивным использованием данных // Аналитика и управление данными в областях с интенсивным использованием данных: DAMDID/RCDL'2016, XVIII международная конференция, 11-14 октября 2016. С. 124-131.
3. Валентей С.Д., Зрелов П.В., Кореньков В.В., Белов С.Д., Кадочников И.С. Мониторинг соответствия профессионального образования потребностям рынка труда // Общественные науки и современность. 2018. № 3. С. 5-16.
The work is devoted to building a recommendation system for analyzing the efficiency of algorithms for solving large-scale multidimensional optimization problems. Several existing systems for selecting efficient algorithms are reviewed. An approach to efficiency prediction based on statistical analysis with hybrid data filtering methods is proposed, and a comparison with existing systems is given.
Humans and other animals can understand concepts from only a few examples, while standard machine learning algorithms require a large number of examples to extract hidden features. Unsupervised learning is the procedure of revealing hidden features from unlabeled data.
In deep neural network training, unsupervised pre-training increases the final accuracy of the algorithm by shrinking the initial parameter space from which fine-tuning begins. However, there are few theoretical papers devoted to a detailed description of unsupervised learning. The crucial reason is that the unsupervised learning process in a deep neural network is usually complicated, which is why understanding its mechanism in elementary models plays an important role.
Boltzmann machines are the basic unit for developing deep-belief networks. Due to their ability to reveal hidden internal representations and solve complex combinatorial problems, they are used in machine learning and statistical inference of patterns. Boltzmann machines are neural networks with symmetrically connected layers, divided into two categories − visible and hidden. In this work we consider a Restricted Boltzmann Machine (RBM), with links between neurons of different layers but without intra-layer links.
To solve computational problems, the machine first undergoes training, during which its parameters − the neuron activation thresholds θ and the weights on the edges ξ − are stochastically changed according to the selected algorithms. After that, the visible layer is initialized with a given state, and the system evolves to a stationary distribution. Finally, the output layer represents the solution of the problem.
Dealing with deep networks often entails a loss of interpretability of the obtained features, i.e., a loss of their physical essence.
Despite success in practical applications, a rigorous mathematical description of Boltzmann machines remains a challenge. In such studies, the weight coefficients on the edges are considered fixed, and their distribution is extracted during training. RBMs can be studied with the tools of statistical mechanics, to whose development the famous Soviet scientist N.N. Bogoliubov contributed. The symmetry of the weight matrix and its vanishing main diagonal make the Boltzmann machine similar to the physical model of spin glasses. An RBM with binary bonds is equivalent to a bipartite spin glass with layer variables of different nature: the visible layer consists of binary Ising spins and the hidden layer consists of real Gaussian spins.
The purpose of this work is to give a physical description of the RBM and to study its operating modes by analytical and numerical methods.
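For illustration, a minimal numpy sketch of one contrastive-divergence (CD-1) training step for a binary-binary RBM is given below; in the bipartite spin-glass setting considered here the hidden units are Gaussian, which would change the sampling of the hidden layer accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary-binary RBM.
    W: inter-layer weights, b: visible biases, c: hidden biases.
    (With Gaussian hidden units, as in the abstract, the hidden layer
    would be sampled from a normal distribution instead.)"""
    ph0 = sigmoid(v0 @ W + c)                    # P(h=1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                  # reconstruction of v
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c
```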
This work was supported by RFBR grant No. 18-29-10014 and by the Interdisciplinary Scientific and Educational School of Moscow University “Photonic and Quantum Technologies. Digital Medicine”.
One of the important tasks of gamma-ray astronomy is the modeling of Extensive Air Showers (EAS) generated by cosmic rays, for which Monte Carlo generators are commonly used. One of the most popular programs for generating events in gamma-ray astronomy is the CORSIKA package. The problem with such generators is their extreme consumption of computing resources. One alternative approach is to use artificial neural networks.
In this report, we present the results of a study of two types of generators for modeling EAS images registered by a Cherenkov telescope. One of them is based on a generative adversarial network (GAN) and the other on a variational autoencoder. We also compare the obtained results with the traditional approach.
The work was supported by the Russian Science Foundation, grant No. 22-21-00442. The work was done using the data of the UNU “Astrophysical Complex of MSU-ISU” (agreement EB-075-15-2021-675).
Imaging Atmospheric Cherenkov Telescopes (IACTs) of the TAIGA gamma-ray observatory detect Extensive Air Showers (EASs) originating from the interactions of cosmic or gamma rays with the atmosphere. Thereby the telescopes obtain images of the EASs. The ability to separate gamma rays from the hadronic cosmic-ray background in these images is one of the main features of this type of detector. However, in actual IACT observations the background and the gamma-ray source are observed simultaneously.
In this work, the results of applying neural networks (NNs) to the image classification task on MC images of TAIGA-IACTs are presented. The wobbling observation mode is considered together with the image adaptation required for adequate analysis by the NN. Several neural network structures are considered that classify events either directly from the images or from the Hillas parameters extracted from them. Taking into account all the necessary image modifications, an estimate of the quality of NN-based selection of rare gamma events in the MC simulation is also given.
The work was supported by the Russian Science Foundation, grant No. 22-21-00442. The work was done using the data of the UNU “Astrophysical Complex of MSU-ISU” (agreement EB-075-15-2021-675).
Volunteer computing (VC) is a powerful way to harness distributed computing resources to perform large-scale scientific tasks. Its success directly depends on the number of participants, their PCs (and other devices), and the time of their work. Therefore, project managers and organizers, in search of a mechanism to encourage participation in VC projects, use a conditional points accrual mechanism. The number of these points (“credits”) depends on the capacities provided, the time of participation in projects, and other characteristics of the activity of volunteers and their teams. The availability of constant statistics for all projects, in addition to tracking various ratings, provokes various virtual competitions (“challenges”) between participants and teams. Many projects create an environment for competing on the volume of computations done, both individually and in team events. Thus, the modus operandi of VC is the spirit of competition.
If volunteers are members of a team, they are simultaneously competing with the other teams on the project, with the more immediate goal of racking up the most contributions and coming out on top of the table of statistics documenting contributions.
This form of cooperation and competition, individually and in combination (in teams), demonstrates a new type of online scientific collaborative network − co-opetition. The term was proposed by A.M. Brandenburger and B.J. Nalebuff in 1996 to describe a new phenomenon of cooperative competition of firms. Co-opetition is a kind of interaction in which companies cooperate despite only partial congruence of interests. As G. Dagnino and G. Padula noted, firms cooperate with each other to reach a higher value creation compared to the value created without interaction, while struggling to achieve competitive advantage. Co-opetition often takes place when companies in the same market work together in the exploration of knowledge and research of new products. A. Holohan and A. Garg used this term metaphorically in their paper to describe the collaboration of volunteers in VC projects and its effect on computing performance. To prove the possibility of describing the collaboration of volunteers with this term, they relied on a number of theoretical concepts and on the results of their sociological surveys. However, they describe neither the process of co-opetition in VC projects nor its results that enhance the computing.
I think that nowadays we can use this term not only in a metaphorical or theoretical sense, but also as an actual mechanism for managing volunteer computing. In this paper we investigate the Russian VC community and discuss the suggestion that the phenomenon of online co-opetition can capture people's motivation better than intrinsic motives alone.
An approach to building distributed computing networks is developed, based on distributing the main coordinating and managing functions of a single center among a set of elements that “cover” (dominate) all the remaining nodes of the communication system. The results obtained earlier make it possible to estimate the reachability, i.e. the length of the longest shortest path (the diameter) of the corresponding graph, and to propose a procedure for its construction. Based on this characteristic, an upper bound on the message transmission time in the network can be found. In particular, it turned out that the reachability (diameter) does not exceed three times the number of dominating nodes minus one. Some variants of constructing distributed networks with fixed reachability values were also presented.
Of independent interest is the question of organizing the structure of interactions between the elements of the central dominating set. A unified combined scheme for constructing configurations by introducing additional links on the dominating set is considered. Two types of added connections are proposed: linear and complete. It is shown that the reachability value and the form of the extremal path are determined by the sizes of this set and of the new formations, and that it is possible to use only one of them.
Keywords: distributed computing, communication networks, domination, combined structures, reachability.
The technology of balanced identification of mathematical models from experimental data [1, 2] has long and successfully been used in various areas of applied research [3]. Following the authors' recommendation, for brevity we will refer to it as SvF technology (Simplicity vs Fitting). The algorithmic foundations of this mathematical technology are Tikhonov regularization, cross-validation on sets of experimental data, and the solution of a bilevel optimization problem of a special form. It is in this problem, where the optimal regularization coefficients are determined, that sets of independent mathematical programming problems have to be solved repeatedly at the lower level. The practical application of SvF technology requires packages of numerical optimization methods and a sufficient amount of computing resources to complete the computational procedure in acceptable time. For this reason, the calculations were implemented in a distributed computing environment on the Everest platform [4].
The talk presents the SvF service (in fact, an Everest service) [5], which combines the capabilities of the long-developed SvF Python package [6, 7] and the Everest service SSOP [8], which allows solving mathematical programming problems on a pool of heterogeneous computing resources (from desktop computers to computing clusters) connected to the Everest platform. The distinctive features of the SvF service are: convenience for rather long calculations on the “submit and wait for an e-mail notification” principle; the ability to describe the balanced identification problem in symbolic form (including the expressions of the mathematical model under study); and the ability to transparently change the composition of the specified pool of heterogeneous resources even while a computational job is running.
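Schematically, the balanced identification problem has the following bilevel form (a simplified rendering for orientation; the exact SvF functionals are defined in [1, 2]):

```latex
% Lower level: fit the model for fixed regularization weights \alpha
\hat{\theta}(\alpha) = \arg\min_{\theta}\;
    \sum_{i} \bigl(f(x_i;\theta) - y_i\bigr)^2 + \alpha\,\Omega(\theta)
% Upper level: choose \alpha by cross-validation over data subsets k
\alpha^{*} = \arg\min_{\alpha}\;
    \sum_{k} \mathrm{CV}_k\bigl(\hat{\theta}_{-k}(\alpha)\bigr)
```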
The study was supported by the Russian Science Foundation, grant No. 22-11-00317, https://rscf.ru/project/22-11-00317/
References
The rapid growth of distributed computing systems has necessitated the development of efficient platforms for running optimization workflows. In this talk, we present a comprehensive approach to running optimization workflows using the Everest cloud platform, developed at IITP RAS. Everest enables the publication of code as web services and facilitates the allocation of computing resources, including standalone servers, computing clusters, grids, and clouds. Leveraging a user-friendly web interface and API, users can execute compute jobs for services on available resources.
We have extensively utilized the Everest platform in numerous optimization projects, benefiting from its ability to seamlessly integrate optimization models with external systems through an API. Furthermore, Everest provides a convenient debugging environment, allowing for fine-tuning of optimization models by rerunning them from the intuitive web interface.
Our approach encompasses several essential components. Firstly, an application within Everest is configured following our established guidelines. Secondly, a cloud-based computing resource, such as Google Cloud Platform or Yandex Cloud, is employed within Everest. Additionally, the optimization model's source code is hosted in a Software as a Service (SaaS) git repository, such as GitLab or GitHub, and adheres to our prescribed guidelines in terms of inputs and outputs.
Moreover, Everest offers the flexibility to set up computing resources that dynamically create and destroy virtual machines in the cloud as per demand. We have developed comprehensive guidelines to simplify the utilization of optimization models within Everest, ensuring a streamlined workflow.
To facilitate effective setup, we provide a concise checklist of necessary configurations. In Everest, the application should include essential input parameters, such as the model's version, model parameters in JSON or YAML format, and input data as an archive. The output structure should comprise a version identifier, standard output and error streams, and an archive containing the results. Furthermore, a bootstrap script is set in the application, facilitating the retrieval of the specified model version from the input parameters.
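The contract described by this checklist can be summarized as follows (the field names are illustrative, not the literal Everest schema):

```python
# Illustrative summary of the recommended application contract; the field
# names are ours, not the literal Everest schema.
app_contract = {
    "inputs": {
        "model_version": "v1.2.0",      # git tag/commit to fetch
        "parameters": "params.yaml",    # model parameters, JSON or YAML
        "input_data": "input.tar.gz",   # input data as an archive
    },
    "outputs": {
        "version": "v1.2.0",            # echo of the executed version
        "stdout": "stdout.txt",
        "stderr": "stderr.txt",
        "results": "results.tar.gz",    # archive containing the results
    },
    # The bootstrap script retrieves the requested model version from the
    # SaaS git repository (GitLab/GitHub) before running it.
    "bootstrap": "bootstrap.sh",
}
```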
Within the cloud environment, an image is configured with all the requisite Everest agent dependencies, alongside the installation of optimization model dependencies, including tools like git for retrieving code from the SaaS git provider.
Our approach represents a robust and versatile framework for running optimization workflows in the Everest platform. By leveraging its features and adhering to the prescribed guidelines, researchers and practitioners can effectively integrate and execute optimization models, ultimately contributing to the advancement of distributed computing systems.
This work was supported by the Russian Science Foundation under grant no. 22-11-00317, https://rscf.ru/project/22-11-00317/
Keywords: distributed computing systems, optimization workflows, Everest cloud platform, web services, computing resources, debugging, integration, guidelines.
The paper considers the adaptation of federated deep learning on a desktop grid system using the example of an image classification problem. Restrictions are imposed on data transfer between the nodes of the desktop grid only for a part of the dataset. The implementation of federated deep learning on a desktop grid system based on the BOINC platform is considered. Methods for generating local datasets for desktop grid nodes are discussed. The results of numerical experiments are presented.
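For orientation, a generic federated-averaging (FedAvg) aggregation step is sketched below in Python; the BOINC-based scheme studied in the paper may aggregate local updates differently.

```python
import numpy as np

def federated_average(local_weights, local_sizes):
    """Generic FedAvg: aggregate model weights trained on local datasets,
    weighting each node by its dataset size. The desktop-grid scheme in
    the paper may use a different aggregation rule."""
    total = float(sum(local_sizes))
    layers = len(local_weights[0])
    return [sum(w[l] * (n / total) for w, n in zip(local_weights, local_sizes))
            for l in range(layers)]

# Three desktop-grid nodes, each holding a two-layer model (toy shapes):
node_weights = [[np.ones((2, 2)) * k, np.ones(2) * k] for k in (1.0, 2.0, 3.0)]
global_model = federated_average(node_weights, local_sizes=[100, 200, 700])
print(global_model[0])  # weighted average of the first layer
```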
The report is devoted to the practical aspects of modeling the operation of a BOINC computing infrastructure. The practical expediency of preliminary modeling for solving optimization problems by means of an evolutionary algorithm is substantiated. Particular attention is paid to modeling the occurrence of abnormal situations in the work of a BOINC project. The main abnormal situations considered include the shutdown of a computing node after receiving a job and the lack of jobs on the server. The practice of applying general and special metrics to grid systems built from personal computers is considered. Discrete-event and probabilistic modeling methods are examined, and the strengths and weaknesses of each approach are given. The practice and prospects of implementing these methods will be discussed.
The simulation of a public desktop grid system, i.e. a volunteer distributed computing project, with the ComBoS software is considered. The features and limitations of a desktop grid system are determined using the example of a desktop grid on the BOINC platform. Scenarios with asynchronous computation of several computing applications in one desktop grid are considered, and the features of their implementation in the ComBoS simulator are discussed. The results of various desktop grid modeling scenarios are presented.
With the growing volume of applied computations in big data processing and artificial intelligence, as well as in traditional numerical modeling problems, there is a need for programs capable of being deployed and executed on hybrid environments consisting of an arbitrary collection of networked computing resources. Such resources include virtual machines of various cloud providers; company computer parks that stand idle outside working hours; the computers of “volunteers”, as in BOINC and similar projects; and free nodes of large computing clusters.
The use of hybrid environments makes it possible to reduce the cost of computations and to achieve high performance. However, the key problems that have to be solved when writing applications for a hybrid computing environment are fault tolerance and load balancing. Owing to the specifics of a hybrid environment, the application itself, rather than its computing environment (e.g., a cloud with its hardware virtualization), must solve these problems: an application component must quickly start computing on a newly connected resource, while a sudden or planned shutdown of a resource with an application component deployed on it must not lead to the failure of the whole application. The paper presents an experimental study of a method of organizing computations, and an application architecture possessing these properties, for the parallel processing of a set of independent tasks.
A distinctive feature of the proposed organization of computations is the launch of functionally identical copies of SPMD applications, which synchronize and share the load thanks to a random choice of the tasks to be solved and data exchange through an event log. In the studied architecture, the events are the results of task executions. Experiments on a simulation model estimated the volume of redundant computations caused by such an organization of applications and the overhead of managing the event log through the traditional task launch mechanism of the public Everest cloud platform of IITP RAS.
It is shown that the proposed organization of computations successfully solves the problems of fault tolerance and load balancing, providing speedup when a redundant volume of computations is acceptable (given the availability and low cost of computing resources). The proposed method can be adapted for applications with a dynamically formed set of dependent tasks and for implementations based on blockchain technologies. The source code and the results of its load testing are available in the Templet project repository.
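A toy Python sketch of the event-log coordination idea (a deliberate simplification of the studied architecture, with the log modeled as a shared dictionary):

```python
import random

def replica_step(log, tasks, compute):
    """One scheduling step of a replica: pick a random task that has no
    result in the shared event log yet, compute it, publish the result.
    Replicas never coordinate directly; the log is the only shared state,
    so duplicated work is the price of fault tolerance."""
    done = set(log)                      # task ids already published
    pending = [t for t in tasks if t not in done]
    if not pending:
        return False
    t = random.choice(pending)           # random choice spreads the load
    compute(t)                           # may be duplicated across replicas
    log[t] = True                        # publish the result as an event
    return True

log = {}
tasks = list(range(10))
# Three replicas race over the shared log until all tasks are done:
while any(replica_step(log, tasks, compute=lambda t: None) for _ in range(3)):
    pass
print(sorted(log))
```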
The status of the INP BSU Tier 3 grid site BY-NCPHEP is presented. The experience of operation, and the efficiency and flexibility of the cloud-based structure, are discussed.
The importance of developing computing infrastructures and services to support the accumulation, storage and processing of research data is permanently increasing. e-Infrastructures have become an ever more demanded and universal tool for supporting modern science: they provide a wide set of services for operating with research data, data collection, systematization and archiving, and computing resources for the development, porting and execution of complex data processing applications.
Work on the implementation of a distributed computing infrastructure in Moldova started in 2007, when the first Agreement on the creation of the MD-GRID Joint Research Unit Consortium and the accompanying Memorandums of Understanding were signed by seven universities and research institutes of Moldova. Since then, work has proceeded on the deployment of the national distributed computing infrastructure, which included the integration of computing clusters and servers deployed in the main national universities and research institutions. For the effective integration of different types of computing resources into the common distributed infrastructure, the high-capacity communication backbone provided by the NREN RENAM was used [1].
The common computing infrastructure continues to develop thanks to the support of various international and national projects. This distributed infrastructure now unites three main datacenters located at the State University of Moldova (SUM), the Vladimir Andrunachievich Institute of Mathematics and Computer Science (VA IMCS) and the RENAM Association, which are permanently developing; the common computing resources now comprise more than 320 CPU cores, 2 NVIDIA T4 Tensor Core GPU units and 54 TB of storage [2]. The elaborated concept of a heterogeneous computing infrastructure includes a multi-zone IaaS Cloud infrastructure (Fig. 1), a pool of virtualized servers used for permanent resource allocation to production services, and multiprocessor clusters and bare-metal servers used for running intensive data processing applications. The distributed infrastructure comprises dedicated storage sub-systems for archiving large amounts of data and for providing data backup resources for the whole distributed infrastructure.
In the first stage, it is planned to deploy a multi-zone IaaS Cloud infrastructure that combines the resources of VA IMCS, SUM and RENAM into a distributed computing network for processing scientific data, performing intensive scientific calculations, and storing and archiving research data and the results of computational experiments. Work on the deployment of an updated scientific multi-zone IaaS Cloud infrastructure based on OpenStack Ussuri began in 2021 and is progressing now, taking into account the continued upgrade of physical computing resources through the installation of new server equipment in all three main datacenters. As a result, VA IMCS, SUM and RENAM today operate in parallel the previously deployed resources based on outdated OpenStack versions, the updated cloud platform based on the OpenStack Ussuri release, and the OpenStack 2023.1 Antelope release now being deployed, which is currently the most recent stable release and will be actively maintained at least for the upcoming year, offering more features, more processing power and greater flexibility of operation [3].
Figure 1. Multi-zone Cloud: IMI – RENAM – SUM.
Two useful and important components are already implemented in the current distributed cloud infrastructure: block storage and Virtual eXtensible Local Area Network (VXLAN) traffic tagging. These tools will also be deployed and used in the extended cloud infrastructure under development. Block storage allows the creation of volumes for organizing persistent storage. In OpenStack, as in other modern cloud systems, several concepts exist for providing storage resources. When creating a virtual machine, you can choose a predefined flavor with a fixed number of CPUs, RAM and HDD space; but when you delete the virtual machine, all data stored on it instantly disappears. The block storage component used in the created multi-zone IaaS Cloud infrastructure is deployed on a separate storage sub-system and allows block storage devices to be created and mounted on a virtual machine through special drivers over the network. This is a kind of network flash drive that can be mounted to any virtual machine associated with the project, unmounted and remounted to another, and so on; most importantly, this type of volume is persistent storage that can be reused after the virtual machines are deleted. Thus, you no longer need to worry about data safety: you can easily move data from one virtual machine to another, or quickly scale up VM performance by creating a virtual machine with larger resources and simply mounting the volumes to it, with all scientific data available for further processing.
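A minimal sketch of this workflow with the openstacksdk Python client (assuming a configured clouds.yaml profile; the cloud profile and resource names are placeholders):

```python
import openstack

# Connect using a cloud profile defined in clouds.yaml (placeholder name).
conn = openstack.connect(cloud="renam-zone")

# Create a persistent 50 GB volume on the dedicated storage sub-system.
volume = conn.create_volume(size=50, name="research-data")

# Attach it to a running VM; the volume survives deletion of the VM
# and can later be re-attached to a larger instance.
server = conn.get_server("analysis-vm")   # placeholder VM name
conn.attach_volume(server, volume)
```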
VXLAN provides a more advanced and flexible model of interaction with the network. In the upgraded cloud infrastructure, in addition to the usual “provider network” model, which allocates one real IP address from the pool of provider network addresses to each virtual machine, a self-service network is also available. A self-service network allows each project to create its own local network with Internet access via NAT (Network Address Translation). For a self-service network, the user can create a virtual router for the project with its own address space for the local network. VXLAN traffic tagging is used to create such overlay networks and prevents address conflicts between projects in case several projects use network addresses from the same range. To ensure the functioning of NAT, one IP address from the provider network is allocated to the external interface of the virtual router, which serves as a gateway for the virtual machines within the project. When using the self-service model, floating IP technology also becomes available, which allows the user to temporarily bind an IP address from the provider network to any of the virtual machines in the project, and at any time detach it and reassign it to any other virtual machine of the project. Moreover, the replacement occurs seamlessly: the address inside the machine does not change and remains an address from the internal network of the project, while the changes occur at the level of the virtual router. Packets arriving at the external address are forwarded by the virtual router to the internal interface of the selected virtual machine. This allows IP addresses to be used efficiently, without allocating an external address to each virtual machine: the external IP address remains assigned to the project and can be reused by other machines within the project.
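The self-service scenario can be sketched with the same client (again with placeholder names; the exact network layout of the infrastructure may differ):

```python
import openstack

conn = openstack.connect(cloud="renam-zone")   # placeholder profile

# Self-service (VXLAN-backed) project network behind a virtual router:
net = conn.create_network("project-net")
subnet = conn.create_subnet(net.id, cidr="10.10.0.0/24",
                            subnet_name="project-subnet")
router = conn.create_router("project-router",
                            ext_gateway_net_id=conn.get_network("provider").id)
conn.add_router_interface(router, subnet_id=subnet.id)

# A floating IP from the provider pool can be bound to any project VM
# and later moved to another one without touching the VM itself:
server = conn.get_server("worker-1")           # placeholder VM name
conn.add_auto_ip(server)                       # allocate and attach a floating IP
```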
For the deployment of the new computing infrastructure, the transition to a 10G network has started according to the elaborated plan. A new Juniper switch has already been installed, and all storage servers with 10G cards on board have been connected to it. The procedure of switching the connections of all remaining servers to N×10G interfaces is now being carried out [4].
The distributed computing infrastructure provides the national research and educational community with the following production services, software platforms and tools:
• Jupyter Notebook – a web-based interactive computing platform. A notebook combines live code, equations, narrative text and visualizations, allowing users to compile all aspects of a data project in one place and making it easier to show the entire process of a project to the intended audience. Through the web application, users can create data visualizations and other components of a project to share with others via the platform.
• BigBlueButton – a purpose-built virtual classroom that empowers teachers to teach and learners to learn.
• TensorFlow 2 – an end-to-end open-source machine learning platform for everyone.
• Keras – a high-level deep learning API developed by Google for implementing neural networks. It is written in Python, makes the implementation of neural networks easy, and supports multiple backends for neural network computation.
• Anaconda Distribution – equips individuals to easily search and install thousands of Python/R packages and access a vast library of community content and support.
• Apache Tomcat® – an open-source implementation of the Java Servlet, JavaServer Pages, Java Expression Language and Java WebSocket technologies.
• Pandas – a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language.
• Nextcloud – a self-hosted, open-source file sharing and collaboration platform that allows users to store, access and share their data from any device or location.
Acknowledgment
This work was supported by the “EU4Digital: Connecting research and education communities (EaPConnect2)” project funded by the EU (grant contract ENI/2019/407-452) and by grants from the National Agency for Research and Development of Moldova (grant No. 20.80009.5007.22 and grant No. 21.70105.9ȘD).
References
1. P. Bogatencov, G. Secrieru, N. Degteariov, N. Iliuha. Scientific Computing Infrastructure and Services in Moldova // Physics of Particles and Nuclei Letters. 2016. DOI: 10.1134/S1547477116050125. http://link.springer.com/journal/11497/13/5/page/2
2. P. Bogatencov, G. Secrieru, B. Hîncu, N. Degteariov. Development of computing infrastructure for support of Open Science in Moldova // Proceedings of the Workshop on Intelligent Information Systems (WIIS2021). Chisinau: IMI, 2021. P. 34-45. ISBN 978-9975-68-415-6.
3. P. Bogatencov, G. Secrieru, R. Buzatu, N. Degteariov. Distributed computing infrastructure for complex applications development // Proceedings of the Workshop on Intelligent Information Systems (WIIS2022). Chisinau: VA IMCS, 2022. P. 55-65. ISBN 978-9975-68-461-3.
4. G. Secrieru, P. Bogatencov, N. Degteariov. Development of effective access to the distributed scientific and educational e-infrastructure // Proceedings of the 9th International Conference “Distributed Computing and Grid Technologies in Science and Education” (GRID'2021), Dubna, Russia, 5-9 July 2021. CEUR Workshop Proceedings, Vol. 3041. P. 503-507. DOI: 10.54546/MLIT.2021.21.73.001. ISSN 1613-0073.
Containers are gaining more and more traction both in industry and science. The HEP community is also showing significant interest in adopting container technologies for software distribution. Encapsulating software inside containers helps scientists create portable and reproducible research environments. However, running containerized workloads at scale via distributed computing infrastructures poses some challenges, one of which is efficient container delivery. In a typical scenario, thousands of copies of user containers need to be delivered to the worker nodes simultaneously, placing an excessive load on the container registry. The talk describes how a container registry able to cope with such high loads can be built, reviews the existing major public services based on CVMFS, and shows an example of such a service implemented at JINR using GitLab.
An integral part of the research conducted by scientific groups is collaborative work on various kinds of documents: papers, abstracts, meeting minutes, presentations, grant applications, reports, handbooks, etc. For this purpose, JINR uses the DocDB system, designed for the storage, version control, shared access and exchange of documents within groups of up to several hundred people. At the moment, several independent instances of DocDB are deployed for the SPD, BM@N and Baikal-GVD experiments, the LHEP accelerator department and the participants of the JINR neutrino programme. During operation, it was found that the system does not fully meet the functional requirements of modern experiments at JINR, nor the requirements of security and reliability. The system therefore needs rework and modification; however, since it is technologically outdated, making changes is a complex and laborious process. It was therefore decided to develop an in-house document management platform, SciDocCloud, on a more modern technology stack. On its basis, a digital document management service for JINR experiments and user groups is planned. The talk will present an analysis of the existing DocDB system and the current version of the platform under development.
The widespread use of web technologies is a current trend in software development. Ever more powerful computing resources with modern hybrid architectures become available to users located at any geographical point connected by a network to a supercomputer. Since modern high energy physics experiments are characterized by the complexity of the latest detectors and a very large number of readout channels, the amount of data to be analyzed becomes extremely large. Cloud technologies are successfully used to store and process such data.
Our work is devoted to the development of a web application for fitting experimental data and its deployment on the local cloud infrastructure of LIT JINR [1]. The application is implemented using the ROOT software package [2], which is a de facto standard in high energy physics and has a large set of tools for distributed data processing.
The web application for fitting experimental data is a continuation of the FITTER program [3], designed to fit the results of small-angle neutron scattering measurements with a selected theoretical multi-parameter function. The user can edit the program code of the theoretical model, carry out fitting in a given range, select the minimization method and a specific fitting algorithm, and change a number of other parameters. The web interface is built on the FitPanel component included in the ROOT distribution. Separate tabs of the web window contain the ROOT canvas, which displays the experimental data and the theoretical curve in graphical form, and the console, which controls the fitting process. The result of fitting is saved as an image and as text files containing the fitting parameters and theoretical function values.
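A minimal PyROOT analogue of this workflow is sketched below: data in a TGraphErrors, a multi-parameter TF1 fitted in a given range with a selected minimizer. The data and the model function are arbitrary placeholders, not the small-angle scattering model of FITTER.

```python
import ROOT
from array import array

# Toy measured points with errors (placeholders for real SANS data).
x  = array("d", [0.5, 1.0, 1.5, 2.0, 2.5])
y  = array("d", [4.1, 2.4, 1.6, 1.1, 0.7])
ex = array("d", [0.0] * 5)
ey = array("d", [0.2, 0.15, 0.12, 0.1, 0.1])
gr = ROOT.TGraphErrors(len(x), x, y, ex, ey)

# Select the minimization engine, as the FitPanel allows.
ROOT.Math.MinimizerOptions.SetDefaultMinimizer("Minuit2", "Migrad")

# A multi-parameter model; an arbitrary placeholder function.
f = ROOT.TF1("model", "[0]*exp(-x/[1])", 0.4, 2.6)
f.SetParameters(5.0, 1.0)

gr.Fit(f, "R")   # "R": fit only within the given range
print(f.GetParameter(0), f.GetParError(0), f.GetChisquare())
```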
References
Resource management in cloud computing, and especially in FaaS (Function-as-a-Service), is a very active area of research, given the rise in popularity of serverless computing. In serverless computing, users do not explicitly manage VMs or containers; instead, they upload their functions to the cloud, and the functions are executed when triggered by events such as HTTP requests. All work regarding resource management, server provisioning and execution is done by the cloud platform. Developing algorithms for FaaS scheduling and resource management requires thorough evaluation, as poor decisions may result in huge financial losses. However, running large-scale experiments on a real platform is very time-consuming and expensive. Therefore, researchers working in this area require a performant, accurate and customizable simulation framework. Although various tools exist for simulating older and more established cloud computing models (e.g. IaaS), for FaaS the choice is much narrower, and the existing simulation tools are far from perfect.
In this report, we present a modular FaaS simulation framework called DSLab FaaS. The framework allows for a detailed simulation of FaaS cloud platforms with customizable implementations of the main components: users can implement their own logic for the scheduler, invoker, CPU sharing, container lifetime management and container deployment. The framework supports the main workload trace formats, and users can conveniently integrate their own formats. It also lets users collect a variety of metrics related to resource management efficiency, both globally and for each app or function separately. The applicability of DSLab FaaS is demonstrated by reproducing several research works on resource management in FaaS platforms in much less time: several seconds of simulation instead of several hours of experiments on a real platform. The framework is implemented in Rust with a strong focus on performance; its runtime scales linearly with the data size, and it can run multi-threaded experiments. The performance and scalability of the simulator are verified by computational benchmarks against other FaaS simulators on a real workload trace from the Microsoft Azure Functions service.
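As a toy illustration of one effect such simulators capture (plain Python, unrelated to the DSLab FaaS API), the sketch below counts cold starts under a fixed keep-alive policy; the constants are illustrative only.

```python
COLD_START, KEEPALIVE = 0.5, 10.0   # seconds; illustrative constants

def simulate(invocations):
    """invocations: time-sorted (time, function_id) pairs. Returns the
    number of cold starts under a fixed keep-alive policy. A toy model
    only; DSLab FaaS models scheduling, CPU sharing and container
    deployment in far greater detail."""
    warm = {}                        # function_id -> container expiry time
    cold = 0
    for t, fn in invocations:
        if warm.get(fn, -1.0) < t:   # no warm container alive at time t
            cold += 1
            t += COLD_START          # pay the cold-start latency
        warm[fn] = t + KEEPALIVE     # container stays warm after the call
    return cold

trace = [(0.0, "f1"), (1.0, "f2"), (5.0, "f1"), (20.0, "f1")]
print(simulate(trace))               # -> 3 cold starts on this toy trace
```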