The second generation of the ATLAS Production System, ProdSys2, is a distributed workload manager that runs hundreds of thousands of jobs daily, covering dozens of ATLAS-specific workflows, across more than a hundred heterogeneous sites. It achieves high utilization by combining dynamic job definition, based on criteria such as input and output size, memory requirements and CPU consumption, with manageable scheduling policies, and by supporting different kinds of computational resources: Grid, clouds, supercomputers and volunteer computers. The system dynamically assigns a group of jobs (a task) to a group of geographically distributed computing resources. This dynamic assignment and utilization of resources is one of the system's major features; it did not exist in earlier versions of the production system, where the Grid resource topology was predefined along national and/or geographical lines. The Production System has a sophisticated job fault-recovery mechanism that allows multi-terabyte tasks to run efficiently without human intervention. We have implemented a "train" model and open-ended production, which allow tasks to be submitted automatically as soon as a new set of data is available, and which chain physics-group data processing and analysis with the experiment's central production. We present an overview of the ATLAS Production System and the features and architecture of its major components: task definition, the web user interface and monitoring. We describe the important design decisions and lessons learned from operational experience during the first year of LHC Run 2. We also report the performance of the system and how various workflows, such as data (re)processing, Monte Carlo and physics-group production, and user analysis, are scheduled and executed within one production system on heterogeneous computing resources.
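As a rough illustration of the dynamic job definition mentioned above, the following minimal Python sketch groups a task's input files into jobs under per-job input-size and memory budgets. All class and field names here are hypothetical, chosen for the example; this is not the ProdSys2 API.

```python
# Illustrative sketch only: a simplified job splitter in the spirit of
# ProdSys2's dynamic job definition. Names and limits are hypothetical.
from dataclasses import dataclass

@dataclass
class Task:
    input_files: list        # (name, size_in_GB) pairs
    events_per_file: int
    max_input_gb: float      # per-job input-size budget
    max_rss_gb: float        # per-job memory budget
    rss_per_event_gb: float  # measured memory footprint per event

def split_task(task: Task) -> list:
    """Group input files into jobs so that each job stays within the
    input-size and memory budgets derived from the task profile."""
    # The memory budget bounds how many events (hence files) one job may take.
    max_events = int(task.max_rss_gb / task.rss_per_event_gb)
    max_files_by_mem = max(1, max_events // task.events_per_file)
    jobs, current, current_gb = [], [], 0.0
    for name, size_gb in task.input_files:
        over_size = current_gb + size_gb > task.max_input_gb
        over_mem = len(current) >= max_files_by_mem
        if current and (over_size or over_mem):
            jobs.append(current)           # close the current job
            current, current_gb = [], 0.0
        current.append(name)
        current_gb += size_gb
    if current:
        jobs.append(current)
    return jobs
```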
The PanDA WMS, the Production and Distributed Analysis workload management system, has been developed and used by the ATLAS experiment at the LHC (Large Hadron Collider) for all of its data processing and analysis challenges. BigPanDA is an extension of the PanDA WMS that runs ATLAS and non-ATLAS applications on Leadership Class Facilities and supercomputers, as well as on traditional Grid and cloud resources. The success of the BigPanDA project has drawn attention from other compute-intensive sciences, such as biology. In 2017, a pilot project was started between BigPanDA and the Blue Brain Project (BBP) of the École Polytechnique Fédérale de Lausanne (EPFL) in Lausanne, Switzerland. This proof-of-concept project aims to demonstrate the efficient application of the BigPanDA system to support the complex scientific workflow of the BBP, which relies on a mix of desktop, cluster and supercomputer resources to reconstruct and simulate accurate models of brain tissue.
PanDA, the Production and Distributed Analysis workload management system, was developed to address the data processing and analysis challenges of the ATLAS experiment at the LHC. Recently, PanDA has been extended to run HEP scientific applications on Leadership Class Facilities and supercomputers. The success of projects using PanDA beyond HEP and the Grid has drawn attention from other compute-intensive sciences, such as bioinformatics. Recent advances in Next Generation Genome Sequencing (NGS) technology have led to growing streams of sequencing data that must be processed, analysed and made available to bioinformaticians worldwide. Analysing genome sequencing data with the popular PALEOMIX software pipeline can take a month, even on a powerful computing resource. In this paper we describe the adaptation of the PALEOMIX pipeline to run in a distributed computing environment powered by PanDA. To run the pipeline, we split the input files into chunks, which are processed separately on different nodes as independent PALEOMIX inputs, and finally merge the output files; this is very similar to how ATLAS processes and simulates its data. We dramatically decreased the total wall time thanks to automated job (re)submission and brokering within PanDA. Using software tools initially developed for HEP and the Grid reduced the payload execution time for mammoth DNA samples from weeks to days.
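The split-and-merge scheme described above is a classic scatter-gather pattern; the sketch below illustrates it in Python. The per-chunk processing step is a hypothetical wrapper script standing in for a real PALEOMIX run (which is driven by its own YAML makefile and, in the actual setup, brokered as a PanDA job to a remote node); the samtools merge is one plausible way to do the final combination.

```python
# Scatter-gather sketch of the distributed PALEOMIX approach.
# File names, chunk size, and the wrapper script are illustrative.
import subprocess
from pathlib import Path

def split_fastq(path: Path, reads_per_chunk: int) -> list[Path]:
    """Split a FASTQ file (4 lines per read) into fixed-size chunks.
    Reads the whole file into memory, which is fine for a sketch."""
    lines = path.read_text().splitlines(keepends=True)
    per_chunk = reads_per_chunk * 4
    chunks = []
    for i in range(0, len(lines), per_chunk):
        chunk = path.with_suffix(f".chunk{i // per_chunk}.fastq")
        chunk.write_text("".join(lines[i:i + per_chunk]))
        chunks.append(chunk)
    return chunks

def run_chunk(chunk: Path) -> Path:
    """Stand-in for one distributed job. In the real setup PanDA brokers
    a PALEOMIX run for this chunk to a node; the invocation is elided."""
    out = chunk.with_suffix(".bam")
    subprocess.run(["./run_paleomix_chunk.sh", str(chunk), str(out)],
                   check=True)  # hypothetical wrapper script
    return out

def merge(outputs: list[Path], merged: Path) -> None:
    """Merge per-chunk alignments (samtools merge is one way to gather)."""
    subprocess.run(["samtools", "merge", str(merged)]
                   + [str(o) for o in outputs], check=True)
```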
One of the most important problems to be solved for ATLAS physics analysis is the reconstruction of proton-proton events with a large number of interactions in the Transition Radiation Tracker. This paper presents Transition Radiation Tracker performance results obtained using the ATLAS Grid and the Kurchatov Institute's Data Processing Center, including its Tier-1 grid site and supercomputer, as well as an analysis of CPU efficiency during these studies.
The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. ATLAS, one of the largest collaborations ever assembled in the history of science, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment relies on a heterogeneous distributed computational infrastructure. The PanDA (Production and Distributed Analysis) workload management system is used to manage the workflow for all data processing on hundreds of data centers. An ambitious program to expand PanDA to all available computing resources, including the opportunistic use of commercial and academic clouds and Leadership Computing Facilities (LCF), is being realized within the BigPanDA and megaPanDA projects. These projects are now exploring how PanDA might be used for managing computing jobs that run on supercomputers, including OLCF's Titan and NRC-KI's HPC2. The main idea is to reuse, as much as possible, existing components of the PanDA system that are already deployed on the LHC Grid for the analysis of physics data. The next generation of PanDA will allow many data-intensive sciences employing a variety of computing platforms to benefit from ATLAS experience and proven tools in highly scalable processing.
The LHC experiments are preparing for the precision measurements and further discoveries that will be made possible by higher LHC energies from April 2015 (LHC Run 2). The needs for simulation, data processing and analysis would overwhelm the expected capacity of the grid infrastructure computing facilities deployed by the Worldwide LHC Computing Grid (WLCG). To meet this challenge, the integration of opportunistic resources into the LHC computing model is highly important. The Tier-1 facility at the Kurchatov Institute (NRC-KI) in Moscow is part of the WLCG and will process, simulate and store up to 10% of the total data obtained from the ALICE, ATLAS and LHCb experiments. In addition, the Kurchatov Institute has supercomputers with a peak performance of 0.12 PFLOPS. Delegating even a fraction of these supercomputing resources to LHC computing would notably increase the total capacity. In 2014, development began on a portal combining the Tier-1 and a supercomputer at the Kurchatov Institute to provide common interfaces and storage. The portal will be used not only for HENP experiments, but also by other data- and compute-intensive sciences, such as biology (genome sequencing analysis) and astrophysics (cosmic-ray analysis, antimatter and dark-matter searches).
The Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland, is leading Big Data driven scientific explorations. Experiments at the LHC explore the fundamental nature of matter and the basic forces that shape our universe, and were recently credited with the discovery of a Higgs boson. ATLAS, one of the largest collaborations ever assembled in the sciences, is at the forefront of research at the LHC. To address an unprecedented multi-petabyte data processing challenge, the ATLAS experiment relies on a heterogeneous distributed computational infrastructure. The ATLAS experiment uses the PanDA (Production and Distributed Analysis) workload management system to manage the workflow for all data processing on over 140 data centers. Through PanDA, ATLAS physicists see a single computing facility that enables rapid scientific breakthroughs for the experiment, even though the data centers are physically scattered all over the world. While PanDA currently uses more than 250,000 cores with a peak performance of 0.3+ petaFLOPS, the next LHC data-taking runs will require more resources than Grid computing can possibly provide. To alleviate these challenges, the LHC experiments are engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. We describe a project aimed at the integration of the PanDA WMS with supercomputers in the United States, Europe and Russia (in particular with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF), the supercomputer at the National Research Centre "Kurchatov Institute", IT4 in Ostrava, and others). The current approach utilizes a modified PanDA pilot framework for job submission to the supercomputers' batch queues and for local data management, with light-weight MPI wrappers to run single-threaded workloads in parallel on Titan's multi-core worker nodes. This implementation was tested with a variety of Monte Carlo workloads on several supercomputing platforms. We present our current accomplishments in running the PanDA WMS at supercomputers and demonstrate our ability to use PanDA as a portal, independent of the computing facility's infrastructure, for High Energy and Nuclear Physics as well as other data-intensive science applications, such as bioinformatics and astroparticle physics.
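The light-weight MPI wrapper idea can be pictured as follows: one MPI rank per core, each rank launching an independent single-threaded payload, so a serial workload fills a multi-core node through a single batch job. This is an illustration only, not the actual PanDA pilot code; the payload command and per-rank directory layout are assumptions.

```python
# Sketch of an MPI wrapper that fans a serial payload out across the
# cores of a worker node. Payload name and layout are hypothetical.
import os
import subprocess
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank works in its own directory to keep payload outputs separate.
workdir = f"rank_{rank:04d}"
os.makedirs(workdir, exist_ok=True)

# Hypothetical serial payload reading a rank-specific input file.
cmd = ["./run_payload.sh", f"--input=events_{rank}.txt"]
result = subprocess.run(cmd, cwd=workdir)

# Gather exit codes on rank 0 so the wrapper can report one job status.
codes = comm.gather(result.returncode, root=0)
if rank == 0:
    failed = [i for i, c in enumerate(codes) if c != 0]
    print(f"{len(codes) - len(failed)} payloads succeeded; "
          f"failed ranks: {failed}")
```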
The track reconstruction algorithms of the ATLAS experiment have demonstrated excellent performance in all of the data delivered so far by the LHC. The expected large increase in the number of interactions per bunch crossing introduces new challenges, both in the computational aspects and in the physics performance of the algorithms. A number of projects are being pursued with the aim of taking advantage of modern CPU design and optimizing memory and CPU usage in the reconstruction algorithms. These include rationalization of the event data model, vectorization of the core components of the algorithms, and removal of algorithmic bottlenecks identified with modern code analysis tools. Recent results from these ongoing projects indicate up to a three-fold speedup in the optimized modules, while in some modules the code size could be reduced by up to 97%, leading to higher readability, better maintainability and decreased interface complexity.
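The kind of gain vectorization brings can be illustrated with a toy example. The ATLAS work is done in C++ with SIMD-friendly code; the NumPy stand-in below only shows the general idea of replacing a per-element loop in a hot kernel with a whole-array operation over contiguous memory.

```python
# Toy illustration (not ATLAS code): computing transverse momentum for
# many track candidates element-by-element versus as one array operation.
import numpy as np

px = np.random.standard_normal(1_000_000)
py = np.random.standard_normal(1_000_000)

def pt_scalar(px, py):
    # One element at a time: interpreter overhead on every iteration.
    return [np.hypot(x, y) for x, y in zip(px, py)]

def pt_vector(px, py):
    # Whole-array operation: the loop runs in optimized native code.
    return np.hypot(px, py)
```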
The creation of a global e-Infrastructure involves the integration of isolated local resources into a common heterogeneous computing environment. In 2014, pioneering work to develop a large-scale data- and task-management system for federated heterogeneous resources was started at the National Research Centre "Kurchatov Institute" (NRC KI, Moscow). As part of this work, we have designed, developed and deployed a portal for submitting payloads to a heterogeneous computing infrastructure combining the Tier-1, a cloud infrastructure and a supercomputer at the Kurchatov Institute. The portal is intended to provide a common interface for submitting tasks to Grid sites, commercial and academic clouds, and supercomputers. Integration of the Tier-1 and the supercomputer has notably increased the total CPU capacity available for Large Hadron Collider (LHC) experiments. The portal can be used not only for High Energy Physics (HEP) applications, but also for other compute-intensive sciences, such as bioinformatics (genome and sequence analysis) and astrophysics (cosmic-ray analysis, antimatter and dark-matter searches).
This article describes the developed portal as a top layer over the computing facilities' infrastructure for High Energy Physics and other compute-intensive science applications, and presents the results of using PanDA on the NRC KI supercomputer and cloud as the underlying technology for the portal.
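The "common interface" idea behind such a portal can be sketched as a thin facade that routes a task description to one of several backends. Everything below is invented for illustration (class names, the routing policy, the eight-core threshold); the actual portal is built on PanDA.

```python
# Hypothetical sketch of a unified submission interface over
# heterogeneous backends (Grid site, cloud, supercomputer).
from dataclasses import dataclass

@dataclass
class TaskSpec:
    executable: str
    input_dataset: str
    cores: int
    memory_gb: float

class GridBackend:
    def submit(self, task: TaskSpec) -> str:
        return f"grid:{task.input_dataset}"   # hand off to a Grid site

class HPCBackend:
    def submit(self, task: TaskSpec) -> str:
        return f"hpc:{task.input_dataset}"    # hand off to the batch queue

class Portal:
    """Route tasks by their resource profile: wide parallel tasks go to
    the supercomputer, the rest to the Grid (a deliberately crude policy)."""
    def __init__(self):
        self.grid, self.hpc = GridBackend(), HPCBackend()

    def submit(self, task: TaskSpec) -> str:
        backend = self.hpc if task.cores > 8 else self.grid
        return backend.submit(task)
```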