Description
Pilot systems are widely used in distributed computing as a flexible mechanism for dynamic workload management and resource allocation. Their scalability and adaptability have made them effective in large-scale experiments and high-performance computing environments. However, the absence of a common abstraction and shared best practices has produced many independent implementations, often with limited interoperability.
In this presentation, we will explore the architectural principles and operational models underlying pilot frameworks, with special attention to late binding: deferring the assignment of a task to a resource until that resource has actually been acquired. Late binding is a key feature for efficient resource utilization and adaptive task scheduling. We will then present our implementation for the SPD experiment, a two-layer solution combining a pilot process and a monitoring daemon. The system uses multiple threads to handle scheduling, supervision, and reporting concurrently. Finally, we will share practical lessons learned from deploying this framework in the SPD online filter system, emphasizing its impact on distributed workload execution.
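As an illustration only, the following minimal Python sketch shows late binding and a thread split between scheduling, supervision, and reporting. It is not the SPD implementation: the names `Pilot`, `Task`, `slots`, and `heartbeat_s` are hypothetical, tasks come from an in-process `queue.Queue` rather than a real workload server, and supervision is reduced to catching task exceptions so a failing task cannot bring down the pilot.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    task_id: int
    payload: Callable[[], object]  # hypothetical: a task is just a callable here


class Pilot:
    """Hypothetical pilot: slots are acquired first, work is bound to them late."""

    def __init__(self, task_source: queue.Queue, slots: int = 2, heartbeat_s: float = 0.05):
        self.task_source = task_source
        self.slots = slots
        self.heartbeat_s = heartbeat_s
        self.results: Dict[int, Tuple[str, object]] = {}
        self.reports: List[dict] = []
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._running = 0

    def _scheduler(self) -> None:
        # Late binding: a task is pulled only when this slot is idle and able to run it.
        while True:
            try:
                task = self.task_source.get_nowait()
            except queue.Empty:
                return  # no more work: the slot is released
            with self._lock:
                self._running += 1
            try:
                result, status = task.payload(), "done"
            except Exception as exc:  # supervision: record the failure, keep the pilot alive
                result, status = exc, "failed"
            with self._lock:
                self._running -= 1
                self.results[task.task_id] = (status, result)
            self.task_source.task_done()

    def _reporter(self) -> None:
        # Reporting: periodic heartbeat of the kind a monitoring daemon could consume.
        while not self._stop.wait(self.heartbeat_s):
            with self._lock:
                self.reports.append({"running": self._running, "finished": len(self.results)})

    def run(self) -> Dict[int, Tuple[str, object]]:
        reporter = threading.Thread(target=self._reporter, daemon=True)
        reporter.start()
        workers = [threading.Thread(target=self._scheduler) for _ in range(self.slots)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        self._stop.set()
        reporter.join()
        with self._lock:  # final report once all slots have drained the queue
            self.reports.append({"running": 0, "finished": len(self.results)})
        return self.results


if __name__ == "__main__":
    q: queue.Queue = queue.Queue()
    for i in range(5):
        q.put(Task(i, lambda i=i: i * i))
    q.put(Task(99, lambda: 1 / 0))  # a failing task, handled by supervision
    results = Pilot(q, slots=2).run()
    print(sorted(results))
```

Because each slot fetches work only when it is idle, no task is committed to a resource before that resource is available, which is the property late binding is meant to provide.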