HEP software today is a rich and diverse domain in itself and exists within the mushrooming world of open source software. As HEP software developers and users, we can be more productive and effective if our work and our choices are informed by a good knowledge of what others in our community have created or found useful. The HEP Software and Computing Knowledge Base, hepsoftware.org, was created to facilitate this by serving as a collection point and information exchange on software projects and products, services, training and computing facilities, relating them to the projects, experiments, organizations and science domains that offer or use them. It was developed as a contribution to the HEP Software Foundation, for which a HEP S&C knowledge base was a much-requested early deliverable. This contribution will motivate and describe the system, what it offers, its content and contributions both existing and needed, and its implementation (a Node.js-based web service and a JavaScript client app), which emphasizes ease of use for both users and contributors.
The ATLAS collaboration has started a process to understand the computing needs for the High Luminosity LHC era. Based on our best understanding of the computing model input parameters for the HL-LHC data-taking conditions, the results indicate the need for substantially more computational and storage resources than projected under a constant yearly computing budget in 2026. Filling the gap between this projection and the needs will be one of the challenges in preparation for the HL-LHC. While gains from improvements in offline software will play a crucial role in this process, a different model for data processing, management, access and bookkeeping should also be envisaged to optimise resource usage. In this contribution we will describe a straw-man version of this model, founded on basic principles such as single-event-level granularity for data processing and virtual data. We will explain how the current architecture will evolve adiabatically into the future distributed computing system, through the prototyping of building blocks to be integrated into the production infrastructure as early as possible, so that specific use cases can be covered well before the HL-LHC time scale. We will also discuss how such a system would adapt to, and drive, the evolution of the WLCG infrastructure in terms of facilities and services.
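To illustrate what single-event-level granularity means in practice, the following minimal Python sketch splits input files into independently processable event ranges; all names, the range size and the file identifiers are hypothetical and not part of any ATLAS production code.

```python
# Illustrative sketch only: split input files into event ranges so that work
# can be assigned, bookkept and retried at (near) single-event granularity.
# All identifiers and numbers here are hypothetical.

def make_event_ranges(files, events_per_range=100):
    """Yield event ranges covering every event of every (file_id, n_events) input once."""
    for file_id, n_events in files:
        for first in range(0, n_events, events_per_range):
            last = min(first + events_per_range, n_events) - 1
            yield {"file": file_id, "first": first, "last": last}

if __name__ == "__main__":
    inputs = [("EVNT._000001", 250), ("EVNT._000002", 80)]   # hypothetical inputs
    for event_range in make_event_ranges(inputs):
        print(event_range)   # each range can be dispatched and retried independently
```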
BigPanDA monitoring is a web application that processes and presents the states of Production and Distributed Analysis (PanDA) system objects. Analysing hundreds of millions of computation entities, such as events or jobs, BigPanDA monitoring builds reports at different scales and levels of abstraction in real time. The information provided allows users to drill down into the reason for a particular event failure or to observe the broad picture, such as tracking the performance of computation nucleus and satellite sites or the progress of a whole production campaign. The PanDA system was originally developed for the ATLAS experiment. It currently manages the execution of more than 2 million jobs per day, distributed over 170 computing centers worldwide. BigPanDA is its core component, commissioned in mid-2014, and is now the primary source of information for ATLAS users about the state of their computations, as well as the source of decision-support information for shifters, operators and managers. In this work, we describe the evolution of the architecture, the current status and the plans for the development of the BigPanDA monitoring.
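As a rough sketch of how such a monitoring service could be queried programmatically, the Python snippet below fetches recent jobs from a JSON endpoint and aggregates them by state; the URL, query parameters and response layout are assumptions for illustration, not a documented BigPanDA API contract.

```python
# Illustrative only: aggregate job states from a monitoring JSON endpoint.
# The URL, parameters and response fields below are assumptions, not a
# documented BigPanDA API contract.
import collections
import requests

def job_state_summary(base_url="https://bigpanda.cern.ch/jobs/", hours=12):
    resp = requests.get(base_url,
                        params={"hours": hours, "json": 1},
                        headers={"Accept": "application/json"},
                        timeout=60)
    resp.raise_for_status()
    jobs = resp.json().get("jobs", [])               # assumed response structure
    return collections.Counter(j.get("jobstatus", "unknown") for j in jobs)

if __name__ == "__main__":
    for state, count in job_state_summary().most_common():
        print(f"{state:12s} {count}")
```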
The Production and Distributed Analysis (PanDA) system has been developed to meet ATLAS production and analysis requirements for a data-driven workload management system capable of operating at the Large Hadron Collider (LHC) data processing scale. The heterogeneous resources used by the ATLAS experiment are distributed worldwide at hundreds of sites, thousands of physicists analyse the data remotely, the volume of processed data is beyond the exabyte scale, dozens of scientific applications are supported, and data processing requires more than a few billion hours of computing usage per year. PanDA performed very well over the last decade, including the LHC Run 1 data-taking period. However, it was decided to upgrade the whole system concurrently with the LHC's first long shutdown in order to cope with the rapidly changing computing infrastructure. After two years of reengineering efforts, PanDA has embedded capabilities for fully dynamic and flexible workload management. The static batch job paradigm was discarded in favor of a more automated and scalable model. Workloads are dynamically tailored for optimal usage of resources, with the brokerage taking network traffic and forecasts into account. Computing resources are partitioned based on dynamic knowledge of their status and characteristics. The pilot has been refactored around a plugin structure for easier development and deployment. Bookkeeping is handled with both coarse and fine granularity for efficient utilization of pledged or opportunistic resources. An in-house security mechanism authenticates the pilot and data management services in off-grid environments such as volunteer computing and private local clusters. The PanDA monitor has been extensively optimized for performance and extended with analytics to provide aggregated summaries of the system as well as drill-down to operational details. Many other features are planned or have recently been implemented, and the system has been adopted by non-LHC experiments, such as bioinformatics groups successfully running Paleomix (microbial genome and metagenome) payloads on supercomputers. In this paper we will focus on the new and planned features that are most important to the next decade of distributed computing workload management.
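The following toy Python sketch conveys the idea of network-aware brokerage: candidate sites are scored from free CPU slots and an observed or forecast transfer rate from the site holding the input data. The weighting and field names are illustrative assumptions, not the actual PanDA brokerage algorithm.

```python
# Toy illustration of network-aware brokerage, not the actual PanDA algorithm.
# Weights and field names are assumptions for the sketch.

def broker_site(sites, weight_cpu=1.0, weight_net=2.0):
    """Pick the candidate site with the best combined CPU/network score."""
    def score(site):
        return (weight_cpu * site["free_slots"]
                + weight_net * site["mbps_from_source"])
    return max(sites, key=score)

if __name__ == "__main__":
    candidates = [
        {"name": "SITE_A", "free_slots": 500, "mbps_from_source": 80},
        {"name": "SITE_B", "free_slots": 200, "mbps_from_source": 900},
    ]
    print(broker_site(candidates)["name"])   # SITE_B wins on network locality
```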
The Production and Distributed Analysis System (PanDA) plays a key role in the ATLAS distributed computing infrastructure. All ATLAS Monte Carlo simulation and data reprocessing jobs pass through the PanDA system. We will describe how PanDA manages job execution on the grid using dynamic resource estimation and data replication, together with intelligent brokerage, in order to meet the scaling and automation requirements of ATLAS distributed computing. PanDA is also the primary ATLAS system for processing user and group analysis jobs, which brings further requirements for quick, flexible adaptation to the rapidly evolving analysis use cases of the early data-taking phase, in addition to the high reliability, robustness and usability needed to provide efficient and transparent utilization of the grid for analysis users. We will describe how PanDA meets ATLAS requirements, the evolution of the system in light of operational experience, how the system has performed during the first LHC data-taking phase, and plans for the future.
The second generation of the ATLAS Production System, called ProdSys2, is a distributed workload manager that runs hundreds of thousands of jobs daily, from dozens of different ATLAS-specific workflows, across more than a hundred heterogeneous sites. It achieves high utilization by combining dynamic job definition based on many criteria, such as input and output size, memory requirements and CPU consumption, with manageable scheduling policies, and by supporting different kinds of computational resources, such as grids, clouds, supercomputers and volunteer computing. The system dynamically assigns a group of jobs (a task) to a group of geographically distributed computing resources. Dynamic assignment and resource utilization is one of the major features of the system; it did not exist in the earliest versions of the production system, where the Grid resource topology was predefined along national and/or geographical patterns. The Production System has a sophisticated job fault-recovery mechanism, which allows multi-terabyte tasks to run efficiently without human intervention. We have implemented a "train" model and open-ended production, which allow tasks to be submitted automatically as soon as a new set of data is available, and physics-group data processing and analysis to be chained with the experiment's central production. We present an overview of the ATLAS Production System and the features and architecture of its major components: task definition, web user interface and monitoring. We describe the important design decisions and lessons learned from operational experience during the first year of LHC Run 2. We also report on the performance of the system and how various workflows, such as data (re)processing, Monte Carlo and physics-group production, and user analysis, are scheduled and executed within one production system on heterogeneous computing resources.
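A simplified Python sketch of dynamic job definition follows: a task's input files are grouped into jobs so that each job stays under per-job limits on input size and event count. The thresholds, field names and file names are assumptions for illustration, not ProdSys2 internals.

```python
# Simplified illustration of dynamic job definition within a task: group input
# files into jobs subject to per-job limits. Thresholds and names are assumed.

def split_task(input_files, max_input_gb=10.0, max_events=50000):
    """Group (name, size_gb, n_events) inputs into jobs under the given limits."""
    jobs, current, size, events = [], [], 0.0, 0
    for name, file_gb, n_events in input_files:
        if current and (size + file_gb > max_input_gb or events + n_events > max_events):
            jobs.append(current)                     # close the current job
            current, size, events = [], 0.0, 0
        current.append(name)
        size += file_gb
        events += n_events
    if current:
        jobs.append(current)
    return jobs

if __name__ == "__main__":
    files = [(f"AOD._{i:05d}", 3.2, 12000) for i in range(10)]   # hypothetical inputs
    for i, job in enumerate(split_task(files)):
        print(f"job {i}: {job}")
```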
Contemporary scientific experiments produce a significant amount of data, as well as scientific publications based on these data. Since the volumes of both are constantly increasing, it becomes more and more problematic to establish a connection between a given paper and the underlying data. However, such an association is one of the crucial pieces of information for tasks such as validating the scientific results presented in a paper, comparing different approaches to a problem, or simply understanding the state of some area of science. The authors of this paper are working on the Data Knowledge Base (DKB) R&D project, initiated in 2016 to address this issue for the ATLAS experiment at CERN. The project aims to develop a software environment providing storage and a coherent representation of the basic information objects. In this paper the authors present a metadata model developed for the ATLAS experiment, the architecture of the DKB system and its main components. Special attention is paid to the implementation of the Kafka-based ETL subsystem and to the mechanism for extracting meta-information from the texts of ATLAS publications.
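As a minimal sketch of one stage in a Kafka-based ETL pipeline, the Python snippet below (assuming the kafka-python client) consumes publication records, extracts dataset-like identifiers with a regular expression, and republishes the enriched records. The topic names, broker address and the identifier pattern are illustrative assumptions, not the actual DKB configuration.

```python
# Minimal sketch of one ETL stage in a Kafka pipeline, assuming the kafka-python
# client. Topic names, broker address and the dataset-name pattern are
# illustrative assumptions, not the actual DKB configuration.
import json
import re
from kafka import KafkaConsumer, KafkaProducer

DATASET_RE = re.compile(r"\bmc\d{2}_13TeV\.\d{6}\.\S+")   # assumed, simplified pattern

def run_stage(broker="localhost:9092",
              in_topic="publications.raw",
              out_topic="publications.enriched"):
    consumer = KafkaConsumer(in_topic, bootstrap_servers=[broker],
                             value_deserializer=lambda m: json.loads(m.decode("utf-8")))
    producer = KafkaProducer(bootstrap_servers=[broker],
                             value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    for record in consumer:                      # each record: one publication's text + metadata
        doc = record.value
        doc["datasets"] = sorted(set(DATASET_RE.findall(doc.get("text", ""))))
        producer.send(out_topic, doc)            # downstream stages link papers to data

if __name__ == "__main__":
    run_stage()
```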
Experiments at the Large Hadron Collider (LHC) face unprecedented computing challenges. Heterogeneous resources are distributed worldwide at hundreds of sites, thousands of physicists analyse the data remotely, the volume of processed data is beyond the exabyte scale, and data processing requires more than a few billion hours of computing usage per year. The PanDA (Production and Distributed Analysis) system was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. In the process, the old batch job paradigm of locally managed computing in HEP was discarded in favour of a far more automated, flexible and scalable model. The success of PanDA in ATLAS is leading to widespread adoption and testing by other experiments. PanDA is the first exascale workload management system in HEP, already operating at more than a million computing jobs per day and having processed over an exabyte of data in 2013. There are many new challenges that PanDA will face in the near future, in addition to those of scale, heterogeneity and an increasing user base. PanDA will need to handle rapidly changing computing infrastructure, require factorization of its code for easier deployment, incorporate additional information sources (including network metrics) into decision making, control network circuits, handle dynamically sized workload processing, provide improved visualization, and face many other challenges. In this talk we will focus on the new features, planned or recently implemented, that are relevant to the next decade of distributed computing workload management using PanDA.
The PanDA (Production and Distributed Analysis) workload management system was developed to meet the scale and complexity of distributed computing for the ATLAS experiment. PanDA-managed resources are distributed worldwide across hundreds of computing sites, with thousands of physicists accessing hundreds of petabytes of data, and the rate of data processing already exceeds an exabyte per year. While PanDA currently uses more than 200,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide, so additional computing and storage resources are required. Therefore ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources such as the opportunistic use of supercomputers. In this paper we will describe a project aimed at integrating the ATLAS Production System with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach utilizes a modified PanDA Pilot framework for job submission to Titan's batch queues and local data management, with lightweight MPI wrappers to run single-node workloads in parallel on Titan's multi-core worker nodes. This allows standard ATLAS production jobs to run on unused (backfill) resources on Titan. The system has already allowed ATLAS to collect millions of core-hours per month on Titan and to execute hundreds of thousands of jobs, while simultaneously improving Titan's utilization efficiency. We will discuss the details of the implementation, our current experience with running the system, and future plans aimed at improvements in scalability and efficiency. Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
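The Python sketch below, assuming mpi4py, conveys the kind of lightweight MPI wrapper described above: each MPI rank launches one independent single-node payload in its own working directory, and rank 0 collects the exit codes. The payload command and directory layout are hypothetical, not the actual PanDA Pilot implementation.

```python
# Minimal sketch of a lightweight MPI wrapper: each rank runs one independent
# single-node payload. Assumes mpi4py; the payload command and directory layout
# are hypothetical, not the actual PanDA Pilot implementation.
import os
import subprocess
from mpi4py import MPI

def main():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()                      # one payload per MPI rank
    workdir = f"rank_{rank:04d}"
    os.makedirs(workdir, exist_ok=True)
    # Hypothetical payload command; in practice this would be the job's transform.
    result = subprocess.run(["./run_payload.sh", f"--seed={rank}"], cwd=workdir)
    # Gather per-rank exit codes on rank 0 so the wrapper can report overall status.
    codes = comm.gather(result.returncode, root=0)
    if rank == 0:
        print("failed ranks:", [i for i, c in enumerate(codes) if c != 0])

if __name__ == "__main__":
    main()
```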