The ATLAS experiment has just concluded its first running period, which commenced in 2010. After two years of remarkable performance from the LHC and ATLAS, the experiment has accumulated more than 25 fb⁻¹ of data. The total volume of beam and simulated data products exceeds 100 PB, distributed across more than 150 computing centres around the world and managed by the experiment's distributed data management system. These sites have provided up to 150,000 computing cores to ATLAS's global production and analysis processing system, enabling a rich physics programme that includes the discovery of the Higgs-like boson in 2012. The wealth of experience accumulated in global data-intensive computing at this massive scale, together with the considerably more challenging requirements of LHC computing from 2015, when the LHC resumes operation, is driving a comprehensive design and development cycle to prepare a revised computing model and data processing and management systems able to meet the demands of higher trigger rates, energies and event complexities. An essential requirement will be the efficient utilisation of current and future processor technologies as well as a broad range of computing platforms, including supercomputing and cloud resources. We report on the experience gained thus far and our progress in preparing ATLAS computing for the future.
Rucio is the next-generation Distributed Data Management (DDM) system, benefiting from recent advances in cloud and "Big Data" computing to address the scaling requirements of HEP experiments. Rucio is an evolution of the ATLAS DDM system Don Quijote 2 (DQ2), which has demonstrated very large scale data management capabilities, with more than 140 petabytes spread worldwide across 130 sites and accesses from 1,000 active users. However, DQ2 is reaching its limits in terms of scalability: it requires a large support staff to operate and is hard to extend with new technologies. Rucio will deal with these issues by relying on a conceptual data model and new technology to ensure system scalability, address new user requirements and employ a new automation framework to reduce operational overheads. We present the key concepts of Rucio, including its data organization and representation and a model for managing central group and user activities. The Rucio design, and the technology it employs, is described, looking specifically at its RESTful architecture and the various software components it uses. We also show the performance of the system.
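To illustrate the RESTful style of architecture described above, the following is a minimal sketch of how a client might register a dataset and attach files to it through plain HTTP calls. The server URL, endpoint paths, authentication headers and JSON payloads are illustrative assumptions made for this sketch, not the actual Rucio REST API.

```python
# Minimal sketch of a RESTful DDM client interaction. Endpoints,
# headers and payloads are hypothetical illustrations, not the real
# Rucio REST API.
import json
import requests

BASE = "https://rucio.example.cern.ch"          # assumed server URL
HEADERS = {
    "X-Rucio-Account": "jdoe",                  # assumed auth headers
    "X-Rucio-Auth-Token": "token-from-login",
    "Content-Type": "application/json",
}

# Register a new dataset in the 'user.jdoe' scope.
resp = requests.post(
    f"{BASE}/dids/user.jdoe/my.dataset",
    headers=HEADERS,
    data=json.dumps({"type": "DATASET"}),
)
resp.raise_for_status()

# Attach two files (data identifiers) to the dataset.
files = [{"scope": "user.jdoe", "name": f"file.{i}.root"} for i in range(2)]
resp = requests.post(
    f"{BASE}/dids/user.jdoe/my.dataset/dids",
    headers=HEADERS,
    data=json.dumps({"dids": files}),
)
resp.raise_for_status()
print("dataset registered and files attached")
```

Because every operation is a stateless HTTP request against a resource path, clients in any language can drive the system, and the server side can be scaled horizontally behind a load balancer.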
The ATLAS experiment is commissioning its computing system in preparation for LHC data. Part of this activity consists in testing the data flow from the online data acquisition to the offline processing system, and the distribution of raw and processed data to the external computing centres. A series of functional and rate tests was performed in 2006 and 2007, allowing the optimisation of the hardware and software components of this system; the last phase of commissioning, the so-called Final Dress Rehearsal, consisting of an integration test of all components, will take place later in 2007. This paper describes the tests performed, the problems we encountered, and the solutions we found.
Rucio is the next-generation Distributed Data Management (DDM) system, benefiting from recent advances in cloud and "Big Data" computing to address the scaling requirements of HEP experiments. Rucio is an evolution of the ATLAS DDM system Don Quijote 2 (DQ2), which has demonstrated very large scale data management capabilities, with more than 160 petabytes spread worldwide across 130 sites and accesses from 1,000 active users. However, DQ2 is reaching its limits in terms of scalability: it requires a large support staff to operate and is hard to extend with new technologies. Rucio addresses these issues by relying on new technologies to ensure system scalability, cover new user requirements and employ a new automation framework to reduce operational overheads. This paper presents the key concepts of Rucio, details the Rucio design and the technology it employs, describes the tests conducted to validate it, and finally describes the migration steps taken to move from DQ2 to Rucio.
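One conceivable step in such a migration is converting flat DQ2-era dataset names into the scope-qualified data identifiers (DIDs) that Rucio uses. The sketch below shows a scope-derivation rule of that kind; the naming convention and the rule itself are assumptions made for illustration, not the documented migration procedure.

```python
# Illustrative sketch of one conceivable migration step: mapping flat
# dataset names onto scope-qualified data identifiers (DIDs). The
# scope-derivation rule is an assumption for this example, not the
# documented DQ2-to-Rucio migration procedure.

def derive_scope(dataset_name):
    """Derive a scope from a dotted dataset name: user datasets map to
    'user.<name>', everything else to its leading project field."""
    fields = dataset_name.split(".")
    if fields[0] == "user":
        return ".".join(fields[:2])      # e.g. 'user.jdoe'
    return fields[0]                     # e.g. 'data12_8TeV'

legacy_datasets = [
    "data12_8TeV.00204955.physics_Muons.merge.AOD.f437_m716",
    "user.jdoe.test.analysis.v1",
]

for name in legacy_datasets:
    print(f"{derive_scope(name)}:{name}")
```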
ATLAS fast physics monitoring: TADA
Sabato, G; Elsing, M; Gumpert, C; et al.
Journal of Physics: Conference Series, vol. 898, no. 9, 10/2017. Journal article, peer reviewed, open access.
The ATLAS experiment at the LHC has been recording data from proton-proton collisions at 13 TeV centre-of-mass energy since spring 2015. The collaboration uses a fast physics monitoring framework (TADA) to automatically perform a broad range of fast searches for early signs of new physics and to monitor data quality across the year, with the full analysis-level calibrations applied to the rapidly growing dataset. TADA is designed to provide fast feedback directly after the collected data has been fully calibrated and processed at the Tier-0. The system can monitor a large range of physics channels, offline data quality and physics performance quantities. TADA output is available on a website accessible to the whole collaboration and is updated twice a day with data from newly processed runs. Hints of potentially interesting physics signals or performance issues identified in this way are reported and followed up by physics or combined performance groups. This note also reports on the technical aspects of TADA: the software structure used to obtain the input TAG files, the framework workflow and structure, and the webpage and its implementation.
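As an illustration of the kind of automated monitoring loop described above, the sketch below scans newly processed runs and compares a channel's observable distribution against a reference, flagging deviations for follow-up. The run records, the histogram comparison and the threshold are simplified assumptions and do not reproduce TADA's actual internals.

```python
# Illustrative sketch of a fast-monitoring pass over newly processed
# runs. Run records, the comparison statistic and the threshold are
# simplified assumptions, not TADA's actual implementation.

def compare_to_reference(hist, ref):
    """Return an average chi-square-like score between an observed and
    a reference histogram (lists of bin contents)."""
    chi2 = 0.0
    for obs, exp in zip(hist, ref):
        if exp > 0:
            chi2 += (obs - exp) ** 2 / exp
    return chi2 / max(len(hist), 1)

def monitor_new_runs(new_runs, reference_hists, threshold=2.0):
    """Flag (run, channel) pairs whose distributions deviate from the
    reference beyond a threshold, for follow-up by the relevant group."""
    alerts = []
    for run, channels in new_runs.items():
        for channel, hist in channels.items():
            score = compare_to_reference(hist, reference_hists[channel])
            if score > threshold:
                alerts.append((run, channel, score))
    return alerts

# Toy usage: one new run, one dimuon-mass channel, against a reference.
reference = {"dimuon_mass": [100, 80, 60, 40, 20]}
runs = {302347: {"dimuon_mass": [98, 83, 130, 38, 21]}}  # bump in bin 3
for run, channel, score in monitor_new_runs(runs, reference):
    print(f"run {run}, channel {channel}: deviation score {score:.1f}")
```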
ATLAS data preparation in Run 2
Laycock, PJ; Chelstowska, MA; Donszelmann, TC; et al.
Journal of Physics: Conference Series, vol. 898, no. 4, 10/2017. Journal article, peer reviewed, open access.
In this contribution, the data preparation workflows for Run 2 are presented. The challenges posed by the excellent performance and high live-time fraction of the LHC are discussed, and the solutions implemented by ATLAS are described. The prompt calibration loop procedures are described and examples are given. Several levels of data quality assessment are used both to spot problems quickly in the control room and prevent data loss, and to provide the final selection used for physics analysis. Finally, the data quality efficiency for physics analysis is shown.
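The data quality efficiency mentioned above is essentially the fraction of recorded integrated luminosity that passes all quality flags. The following sketch computes such an efficiency from per-luminosity-block good/bad flags; the block structure, flag names and numbers are invented for illustration.

```python
# Illustrative calculation of a data quality efficiency: the fraction of
# integrated luminosity in which every subsystem flag is good. The
# luminosity-block records and flag names are invented for this example.

# Each record: (integrated luminosity in pb^-1, {subsystem: flag}).
lumi_blocks = [
    (1.2, {"pixel": "good", "calo": "good", "muon": "good"}),
    (1.1, {"pixel": "good", "calo": "bad",  "muon": "good"}),
    (1.3, {"pixel": "good", "calo": "good", "muon": "good"}),
]

total = sum(lumi for lumi, _ in lumi_blocks)
good = sum(lumi for lumi, flags in lumi_blocks
           if all(flag == "good" for flag in flags.values()))

print(f"data quality efficiency: {100.0 * good / total:.1f}%")  # 69.4%
```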
The ATLAS Distributed Data Management system stores more than 150 PB of physics data across 120 sites globally. To cope with the anticipated ATLAS workload of the coming decade, Rucio, the next-generation data management system, has been developed. Replica management, one of the key aspects of the system, has to satisfy critical performance requirements in order to keep pace with the experiment's high rate of continual data generation. The challenge lies in meeting these performance objectives while still giving the system's users and applications a powerful toolkit to control their data workflows. In this work we present the concept, design and implementation of replica management in Rucio. We specifically introduce the workflows behind replication rules, their formal language definition, and the weighting and site selection they employ. Furthermore, we present the subscription component, which lets users declare interest in data that has not been created yet. This contribution describes the concept and architecture behind these components and shows the benefits brought by the system.
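To make the weighting and site-selection step concrete, here is a minimal sketch of how a rule engine might pick storage endpoints matching an expression, weighted by free space. The endpoint records, the tag-matching scheme and the weighting are simplified assumptions, not Rucio's actual rule evaluation.

```python
# Minimal sketch of weighted site selection for a replication rule:
# filter candidate storage endpoints by a tag expression, then choose
# the requested number of copies with probability proportional to a
# weight (free space here). Data model and matching are simplified
# assumptions, not Rucio's actual rule engine.
import random

sites = [
    {"name": "CERN-PROD", "tags": {"tier=1", "cloud=CERN"}, "free_tb": 800},
    {"name": "BNL-OSG2",  "tags": {"tier=1", "cloud=US"},   "free_tb": 500},
    {"name": "DESY-HH",   "tags": {"tier=2", "cloud=DE"},   "free_tb": 200},
    {"name": "MWT2",      "tags": {"tier=2", "cloud=US"},   "free_tb": 300},
]

def select_sites(expression, copies, candidates):
    """Pick `copies` distinct sites whose tags contain `expression`,
    sampling without replacement, weighted by free space."""
    pool = [s for s in candidates if expression in s["tags"]]
    chosen = []
    for _ in range(min(copies, len(pool))):
        weights = [s["free_tb"] for s in pool]
        pick = random.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick["name"])
        pool.remove(pick)
    return chosen

# Rule: two replicas on Tier-2 sites.
print(select_sites("tier=2", copies=2, candidates=sites))
```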
Rucio is the successor of the current Don Quijote 2 (DQ2) system as the distributed data management (DDM) system of the ATLAS experiment. The reasons for replacing DQ2 are manifold, but besides high maintenance costs and architectural limitations, scalability concerns top the list. Current expectations are that the amount of data will be three to four times what it is today by the end of 2014. Furthermore, the availability of more powerful computing resources puts additional pressure on the DDM system, as it increases the demands on data provisioning. Although DQ2 is capable of handling the current workload, it is already at its limits. To ensure that Rucio will be up to the expected workload, a way to emulate that workload is needed. To do so, the current workload observed in DQ2 must first be understood, so that it can be scaled up to future expectations. The paper discusses how selected core concepts are applied to the workload of the experiment and how knowledge about the current workload is derived from various sources (e.g. by analysing the central file catalogue logs). Finally, a description of the implemented emulation framework, used for stress-testing Rucio, is given.
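One simple way to realise such an emulation is to derive per-operation request rates from the log analysis, scale them by the expected growth factor, and replay synthetic requests as a Poisson process at the scaled rate. The sketch below does exactly that; the operation mix and the scale factor are illustrative assumptions, not the framework's actual configuration.

```python
# Illustrative workload emulation: derive per-operation rates from an
# observed log summary, scale them up, and generate a synthetic request
# stream with exponentially distributed inter-arrival times. The
# operation mix and scale factor are assumptions for this example.
import random

observed_per_hour = {"register_file": 9000, "list_replicas": 40000,
                     "transfer_request": 12000}
SCALE = 3.5  # assumed growth factor ("three to four times")

def emulate(duration_s=1.0):
    """Return sorted (timestamp, operation) pairs covering `duration_s`
    seconds of scaled-up synthetic workload."""
    events = []
    for op, per_hour in observed_per_hour.items():
        rate = per_hour / 3600.0 * SCALE          # requests per second
        t = random.expovariate(rate)
        while t < duration_s:
            events.append((t, op))
            t += random.expovariate(rate)
    return sorted(events)

stream = emulate(duration_s=1.0)
print(f"{len(stream)} synthetic requests in 1 s of emulated time")
print(stream[:3])
```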
This paper describes a popularity prediction tool for data-intensive data management systems, such as ATLAS distributed data management (DDM). It is fed by the DDM popularity system, which produces historical reports about ATLAS data usage, providing information about the files, datasets, users and sites where data was accessed. The tool described in this contribution uses this historical information to predict the future popularity of data. It finds trends in data usage using a set of neural networks and a set of input parameters, and predicts the number of accesses in the near-term future. This information can then be used in a second step to improve the distribution of replicas at sites, weighing the cost of creating new replicas (bandwidth and load on the storage system) against the gain of having them (faster access to data for analysis). To evaluate the benefit of the redistribution, a grid simulator is introduced that is able to replay real workloads on different data distributions. This article describes the popularity prediction method and the simulator used to evaluate the redistribution.
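The prediction step can be illustrated with a small neural network trained on a sliding window of past weekly access counts, regressing the next week's count. The window length, network size and toy access series below are assumptions for the sake of the example and do not reproduce the actual tool's configuration.

```python
# Minimal sketch of popularity prediction: train a small neural network
# to map a sliding window of past weekly access counts to the next
# week's count. Window length, network size and the toy series are
# illustrative; they do not reproduce the actual DDM tool.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Toy history: a dataset whose popularity decays with noise.
weekly = 1000 * np.exp(-0.1 * np.arange(60)) + rng.normal(0, 20, 60)

WINDOW = 4
X = np.array([weekly[i:i + WINDOW] for i in range(len(weekly) - WINDOW)])
y = weekly[WINDOW:]

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
model.fit(X[:-1], y[:-1])                       # hold out the last point

predicted = model.predict(X[-1:])[0]
print(f"predicted next-week accesses: {predicted:.0f} (actual: {y[-1]:.0f})")
```

A redistribution policy could then compare this forecast against the current replica count at each site, trading transfer cost against the predicted access gain.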
ATLAS maintains a rich corpus of event-by-event information that provides a global view of the billions of events the collaboration has measured or simulated, along with sufficient auxiliary information to navigate to and retrieve data for any event at any production processing stage. This unique resource has been employed for a range of purposes, from monitoring, statistics, anomaly detection, and integrity checking, to event picking, subset selection, and sample extraction. Recent years of data-taking provide a foundation for assessment of how this resource has and has not been used in practice, of the uses for which it should be optimized, of how it should be deployed and provisioned for scalability to future data volumes, and of the areas in which enhancements to functionality would be most valuable. This paper describes how ATLAS event-level information repositories and selection infrastructure are evolving in light of this experience, and in view of their expected roles both in wide-area event delivery services and in an evolving ATLAS analysis model in which the importance of efficient selective access to data can only grow.
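Event picking, as described above, amounts to resolving an event-level key to the file and location holding that event at a given processing stage. Below is a minimal sketch of such a lookup; the key fields, record schema and values are assumptions for illustration, not the actual ATLAS event-level index.

```python
# Minimal sketch of event-level picking: an index maps (run, event)
# keys to the file and internal entry where that event is stored, per
# processing stage. Schema and values are illustrative assumptions,
# not the actual ATLAS event index.

# index[(run_number, event_number)] -> {stage: (file_guid, entry)}
index = {
    (205071, 1234567): {
        "RAW": ("guid-raw-00af", 412),
        "AOD": ("guid-aod-7c31", 98),
    },
    (205071, 7654321): {
        "RAW": ("guid-raw-00af", 87801),
        "AOD": ("guid-aod-7c31", 14),
    },
}

def pick_events(keys, stage):
    """Resolve (run, event) keys to (file GUID, entry) pairs at the
    requested processing stage, skipping unknown events."""
    out = {}
    for key in keys:
        record = index.get(key, {})
        if stage in record:
            out[key] = record[stage]
    return out

wanted = [(205071, 1234567), (205071, 9999999)]
print(pick_events(wanted, stage="AOD"))   # only the first key resolves
```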