The ALICE Grid Workflow for LHC Run 3
Storetvedt, Maxim; Betev, Latchezar; Helstrup, Håvard ...
EPJ Web of Conferences,
2024, Volume 295
Journal Article, Conference Proceeding
Peer reviewed
Open access
In preparation for LHC Runs 3 and 4, the ALICE Collaboration has moved to a new Grid middleware, JAliEn, and workflow management system. The migration was dictated by the substantially higher requirements on the Grid infrastructure in terms of payload complexity, increased number of jobs and managed data volume, all of which required a complete rewrite of the middleware using modern software languages and technologies. Through containerisation and self-contained binaries, managed by the JAliEn middleware, we provide a uniform execution environment across sites and various architectures, including accelerators. The model and implementation have proven their scalability and can be easily deployed across sites with minimal intervention. This contribution outlines the architecture of the new Grid workflow as deployed in production and the workflow process. Specifically, it shows how core components are moved and bootstrapped through CVMFS, enabling the middleware to run anywhere, fully independent of the host system. Furthermore, we examine how new middleware releases, containers and their runtimes are centrally maintained and easily deployed across the Grid, also by means of a common build system.
This contribution introduces the job optimizer service for the next-generation ALICE Grid middleware, JAliEn (Java ALICE Environment). It is a continuous service running on central machines and is essentially responsible for splitting jobs into subjobs, which are then distributed and executed on the ALICE Grid. There are several ways of creating subjobs, based on various strategies relevant to the aim of any particular Grid job. Therefore, a user has to explicitly declare that a job is to be split, and also define the strategy to be used. The new job optimizer service aims to retain the functionality of the old ALICE Grid middleware from the user’s point of view while increasing performance and throughput. One aspect of increasing performance is how the job optimizer interacts with the job queue database. A different way of describing subjobs in the database is presented, to minimize resource usage. There is also a focus on limiting communication with the database, as this is already a congested area. Furthermore, a new solution to splitting based on the locality of job input data is presented, aiming to split into subjobs more efficiently and thereby make better use of resources on the Grid to further increase throughput. Added options for the user regarding splitting by locality, such as setting a minimum limit for a subjob size, are also explored.
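The idea of splitting by input-data locality with a minimum subjob size can be illustrated with a minimal Python sketch. It is not the JAliEn job optimizer: the function name `split_by_locality`, the `replicas` layout, the rule of preferring the first listed replica, and the site names and file paths are all assumptions made for illustration.

```python
from collections import defaultdict


def split_by_locality(replicas, min_files=1):
    """Group input files into subjobs by a site that holds a replica.

    replicas:  dict mapping file name -> list of site names (hypothetical layout).
    min_files: groups smaller than this are merged into one catch-all subjob.
    Returns a list of (site_or_None, sorted_file_list) tuples.
    """
    by_site = defaultdict(list)
    for lfn, sites in replicas.items():
        # Simplifying assumption: the first listed replica is the preferred site.
        by_site[sites[0]].append(lfn)

    subjobs, leftovers = [], []
    for site in sorted(by_site):
        files = sorted(by_site[site])
        if len(files) >= min_files:
            subjobs.append((site, files))
        else:
            # Too small to justify its own subjob; collect for merging.
            leftovers.extend(files)
    if leftovers:
        # Undersized groups run as a single subjob with no site preference.
        subjobs.append((None, sorted(leftovers)))
    return subjobs


# Illustrative input only: paths and replica locations are made up.
example = {
    "/alice/data/f1.root": ["CERN", "GSI"],
    "/alice/data/f2.root": ["CERN"],
    "/alice/data/f3.root": ["GSI", "CERN"],
    "/alice/data/f4.root": ["KISTI"],
}
for site, files in split_by_locality(example, min_files=2):
    print(site, files)
```

Raising `min_files` trades locality for fewer, larger subjobs; a real optimizer would also have to weigh the remaining replicas of each file rather than only the first one.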
The new JAliEn (Java ALICE Environment) middleware is a Grid framework designed to satisfy the needs of the ALICE experiment for LHC Run 3, such as providing a high-performance and high-scalability service to cope with the increased volumes of collected data. This new framework also introduces a split, two-layered job pilot, creating a new approach to how jobs are handled and executed within the Grid. Each layer runs on a separate JVM, with a separate authentication token, allowing for finer control of permissions and improved isolation of the payload. Having these separate layers also allows job payloads to be executed within containers. This further strengthens isolation and creates a cohesive environment across computing sites, while avoiding the resource overhead associated with traditional virtualisation.
This contribution presents the architecture of the new split job pilot found in JAliEn, and the methods used to execute Grid jobs while maintaining reliable communication between layers. Specifically, we show how this is achieved even when a layer runs in a separate container, while retaining isolation and mitigating possible security risks. Furthermore, we discuss how the implementation remains agnostic to the choice of container platform, allowing it to run within popular platforms such as Singularity and Docker.
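One way to stay agnostic to the container platform is to keep the payload command fixed and generate only the platform-specific launcher prefix. The Python sketch below shows that idea under stated assumptions: `wrap_command`, the image names and the payload command are hypothetical and are not taken from JAliEn; only the `singularity exec --bind` and `docker run --rm -v` invocations are standard CLI usage.

```python
def wrap_command(platform, image, command, binds=()):
    """Return an argv list that runs `command` inside `image`.

    platform: "singularity" or "docker".
    image:    image path or name (hypothetical values in the example below).
    command:  payload argv, identical regardless of the platform.
    binds:    host paths to expose at the same location inside the container.
    """
    if platform == "singularity":
        argv = ["singularity", "exec"]
        for path in binds:
            argv += ["--bind", f"{path}:{path}"]
    elif platform == "docker":
        argv = ["docker", "run", "--rm"]
        for path in binds:
            argv += ["-v", f"{path}:{path}"]
    else:
        raise ValueError(f"unsupported container platform: {platform}")
    # The payload itself is appended unchanged for every platform.
    return argv + [image] + list(command)


# Hypothetical payload and images, for illustration only.
payload = ["java", "-jar", "payload.jar"]
print(wrap_command("singularity", "/cvmfs/example/image", payload, binds=["/cvmfs"]))
print(wrap_command("docker", "example/image:latest", payload, binds=["/cvmfs"]))
```

The resulting list can be passed to `subprocess.run` as-is; supporting another platform then only means adding one more branch for its launcher prefix.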
Virtualization and containers are established tools for providing simplified deployment, elasticity and workflow isolation. These benefits are especially advantageous in containers, which dispense with the resource overhead associated with virtual machines in cases where virtualization of the full hardware stack is not considered necessary. Containers are also simpler to set up and maintain in production environments, that is, deployed and currently operational systems serving end users, where service disruptions should be avoided. This contribution addresses container configuration and deployment to run central and site services on the ALICE Grid system, specifically to achieve containerized VO-boxes. We describe the methods through which we minimize manual interaction while retaining the simplicity and scalability associated with container deployment, the so-called "service in a box". Furthermore, we explore ways to increase fault tolerance, aimed at reducing the risk of service downtime, and identify possible performance bottlenecks.
First experimental results are presented on event-by-event net-proton fluctuation measurements in Pb–Pb collisions at $\sqrt{s_{\mathrm{NN}}} = 2.76$ TeV, recorded by the ALICE detector at the CERN LHC. The ALICE detector is well suited for such studies due to its excellent particle identification capabilities and large acceptance, which is crucial for fluctuation analysis. The studies are focussed on second order cumulants, but the analysis technique used is more general and will be applied, in the near future, also to higher order cumulants.
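For orientation, the first- and second-order cumulants of the net-proton number are its event-by-event mean and variance. The Python sketch below computes them from per-event counts using only these textbook definitions. It is not the analysis technique of the paper and applies no efficiency or particle-identification corrections; the function name and the sample counts are made up.

```python
from statistics import fmean


def net_proton_cumulants(n_protons, n_antiprotons):
    """Return (kappa1, kappa2) of the net-proton number over a set of events.

    n_protons, n_antiprotons: per-event counts, one entry per event.
    kappa1 is the mean of the net-proton number, kappa2 its variance
    (population definition, dividing by the number of events).
    """
    net = [p - a for p, a in zip(n_protons, n_antiprotons)]
    kappa1 = fmean(net)
    kappa2 = fmean([(x - kappa1) ** 2 for x in net])
    return kappa1, kappa2


# Made-up counts for four events, for illustration only.
k1, k2 = net_proton_cumulants([3, 5, 4, 6], [1, 2, 2, 1])
print(k1, k2)  # prints 3.0 1.5
```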