With the exponential growth of LHC (Large Hadron Collider) data in the years 2010-2012, distributed computing has become the established way to analyse collider data. The ATLAS experiment Grid infrastructure includes more than 130 sites worldwide, ranging from large national computing centres to smaller university clusters. So far, the storage technologies and access protocols of the clusters that host this tremendous amount of data vary from site to site. HTTP WebDAV offers the possibility of using a unified industry standard to access the storage. We present the deployment and testing of HTTP WebDAV for local and remote data access in the ATLAS experiment for the new data management system Rucio and the PanDA workload management system. Deployment and large-scale tests have been performed using the Grid testing system HammerCloud and the ROOT HTTP plugin Davix.
The monitoring and controlling interfaces of the previous data management system DQ2 followed the evolutionary requirements and needs of the ATLAS collaboration. The new data management system, Rucio, has put in place a redesigned web-based interface based upon the lessons learnt from DQ2 and the increased volume of managed information. This interface encompasses both a monitoring and a controlling component, and allows easy integration of user-generated views. The interface follows three design principles. First, the collection and storage of data from internal and external systems is asynchronous to reduce latency. This includes the use of technologies like ActiveMQ or Nagios. Second, because of its volume, the analysis of the data into information is done in a massively parallel fashion, using a combined approach with an Oracle database and Hadoop MapReduce. Third, sharing of the information does not distinguish between human and programmatic access, making it easy to access selected parts of the information both in constrained frontends like web browsers and in remote services. This contribution will detail the reasons for these principles and the design choices taken. Additionally, the implementation, the interactions with external systems, and an evaluation of the system in production, both from a technological and a user perspective, conclude this contribution.
The ATLAS Distributed Data Management system stores more than 150PB of physics data across 120 sites globally. To cope with the anticipated ATLAS workload of the coming decade, Rucio, the next-generation data management system, has been developed. Replica management, as one of the key aspects of the system, has to satisfy critical performance requirements in order to keep pace with the experiment's high rate of continual data generation. The challenge lies in meeting these performance objectives while still giving the system users and applications a powerful toolkit to control their data workflows. In this work we present the concept, design and implementation of the replica management in Rucio. We will specifically introduce the workflows behind replication rules, their formal language definition, weighting and site selection. Furthermore, we will present the subscription component, which lets users declare interest in data that has not been created yet. This contribution describes the concept and the architecture behind those components and will show the benefits provided by this system.
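The weighting and site selection mentioned in the abstract above can be illustrated with a minimal sketch. This is not Rucio's implementation: the function name `select_sites`, the site names, and the weight values are all hypothetical, and the sketch only assumes that a rule asks for a number of copies and that each candidate site carries a non-negative weight.

```python
import random


def select_sites(site_weights, copies, rng=None):
    """Pick `copies` distinct sites; each draw is proportional to the
    weights of the sites that are still available.

    site_weights: dict mapping a (hypothetical) site name to a weight >= 0.
    Sites with weight 0 are never selected.
    """
    rng = rng or random.Random()
    # Only sites with a positive weight are eligible.
    candidates = {site: w for site, w in site_weights.items() if w > 0}
    if copies > len(candidates):
        raise ValueError("not enough eligible sites for the requested copies")

    chosen = []
    for _ in range(copies):
        sites = list(candidates)
        weights = [candidates[s] for s in sites]
        pick = rng.choices(sites, weights=weights, k=1)[0]
        chosen.append(pick)
        del candidates[pick]  # a site holds at most one copy per rule
    return chosen


if __name__ == "__main__":
    # Hypothetical weights: SITE_C is three times as likely as SITE_A
    # to be drawn first; SITE_B is excluded by its zero weight.
    print(select_sites({"SITE_A": 1, "SITE_B": 0, "SITE_C": 3}, copies=2))
```

Drawing without replacement keeps the copies on distinct sites, which is one plausible reading of site selection under a rule; other policies (quotas, network distance) would change the weights rather than the loop.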
Rucio is the successor of the current Don Quijote 2 (DQ2) system for the distributed data management (DDM) system of the ATLAS experiment. The reasons for replacing DQ2 are manifold, but besides high maintenance costs and architectural limitations, scalability concerns are at the top of the list. Current expectations are that by the end of 2014 the amount of data will be three to four times what it is today. Furthermore, the availability of more powerful computing resources puts additional pressure on the DDM system, as it increases the demands on data provisioning. Although DQ2 is capable of handling the current workload, it is already at its limits. To ensure that Rucio will be up to the expected workload, a way to emulate it is needed. To do so, the current workload observed in DQ2 must first be understood in order to scale it up to future expectations. The paper discusses how selected core concepts are applied to the workload of the experiment and how knowledge about the current workload is derived from various sources (e.g. analysing the central file catalogue logs). Finally, a description of the implemented emulation framework, used for stress-testing Rucio, is given.
This paper describes a popularity prediction tool for data-intensive data management systems, such as ATLAS distributed data management (DDM). It is fed by the DDM popularity system, which produces historical reports about ATLAS data usage, providing information about files, datasets, users and sites where data was accessed. The tool described in this contribution uses this historical information to make a prediction about the future popularity of data. It finds trends in the usage of data using a set of neural networks and a set of input parameters and predicts the number of accesses in the near-term future. This information can then be used in a second step to improve the distribution of replicas at sites, taking into account the cost of creating new replicas (bandwidth and load on the storage system) compared to the gain of having new ones (faster access to data for analysis). To evaluate the benefit of the redistribution, a grid simulator is introduced that is able to replay a real workload on different data distributions. This article describes the popularity prediction method and the simulator that is used to evaluate the redistribution.
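The cost-versus-gain trade-off described in the abstract above can be made concrete with a deliberately simple sketch. The paper's actual cost model is not given in the abstract, so everything here is an assumption: cost is approximated by transfer time alone, gain by predicted accesses times an assumed time saving per access, and the function name `should_replicate` and all parameters are hypothetical.

```python
def should_replicate(predicted_accesses, seconds_saved_per_access,
                     dataset_bytes, bandwidth_bytes_per_s):
    """Return True when the expected gain of a new replica outweighs
    the cost of creating it (illustrative model only).

    Cost: time needed to transfer the dataset to the new site.
    Gain: predicted number of accesses times the time each access saves.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    transfer_cost_s = dataset_bytes / bandwidth_bytes_per_s
    expected_gain_s = predicted_accesses * seconds_saved_per_access
    return expected_gain_s > transfer_cost_s


if __name__ == "__main__":
    # Hypothetical numbers: a 1 TB dataset over a 1 GB/s link costs 1000 s.
    print(should_replicate(100, 60, 10**12, 10**9))  # gain 6000 s
    print(should_replicate(5, 60, 10**12, 10**9))    # gain 300 s
```

A fuller model would also account for the load on the storage system, which the abstract names as part of the cost but which this sketch leaves out.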
Rucio is the next-generation data management system of the ATLAS experiment. The software engineering process to develop Rucio is fundamentally different to existing software development approaches in the ATLAS distributed computing community. Based on a conceptual design document, development takes place using peer-reviewed code in a test-driven environment. The main objectives are to ensure that every engineer understands the details of the full project, even components usually not touched by them, that the design and architecture are coherent, that temporary contributors can be productive without delay, that programming mistakes are prevented before being committed to the source code, and that the source is always in a fully functioning state. This contribution will illustrate the workflows and products used, and demonstrate the typical development cycle of a component from inception to deployment within this software engineering process. Next to the technological advantages, this contribution will also highlight the social aspects of an environment where every action is subject to detailed scrutiny.
ATLAS DQ2 to Rucio renaming infrastructure Serfon, C; Barisits, M; Beermann, T ...
Journal of physics. Conference series, 01/2014, Volume 513, Issue 4
Journal Article, peer reviewed, open access
To prepare the migration to the new ATLAS Data Management system called Rucio, a renaming campaign of all the physical files produced by ATLAS is needed. It represents around 300 million files split between ~120 sites with 6 different storage technologies. It must be done in a transparent way in order not to disrupt the ongoing computing activities. An infrastructure to perform this renaming has been developed and is presented in this paper as well as its performance.
All major experiments at the Large Hadron Collider (LHC) need to measure real storage usage at the Grid sites. This information is equally important for resource management, planning, and operations. To verify the consistency of central catalogs, experiments are asking sites to provide a full list of the files they have on storage, including size, checksum, and other file attributes. Such storage dumps, provided at regular intervals, give a realistic view of the storage resource usage by the experiments. Regular monitoring of the space usage and data verification serve as additional internal checks of the system integrity and performance. Both the importance and the complexity of these tasks increase with the constant growth of the total data volumes during the active data taking period at the LHC. The use of common solutions helps to reduce the maintenance costs, both at the large Tier1 facilities supporting multiple virtual organizations and at the small sites that often lack manpower. We discuss requirements and solutions to the common tasks of data storage accounting and verification, and present experiment-specific strategies and implementations used within the LHC experiments according to their computing models.
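The verification of a central catalog against a site's storage dump, as described in the abstract above, amounts to a set comparison. The following is a minimal sketch under stated assumptions: both inputs are taken to be already parsed into dictionaries mapping a file path to a `(size, checksum)` pair, and the function name and the labels `missing_on_storage`, `unknown_to_catalog` and `mismatched` are illustrative, not taken from any experiment's tooling.

```python
def compare_dump_to_catalog(catalog, dump):
    """Compare a catalog view with a site storage dump.

    catalog, dump: dict mapping file path -> (size_in_bytes, checksum).
    Returns the files the catalog expects but the site lacks, the files
    on storage the catalog does not know, the files whose size or
    checksum differ, and the space actually used according to the dump.
    """
    missing_on_storage = sorted(p for p in catalog if p not in dump)
    unknown_to_catalog = sorted(p for p in dump if p not in catalog)
    mismatched = sorted(
        p for p in catalog if p in dump and catalog[p] != dump[p]
    )
    used_bytes = sum(size for size, _checksum in dump.values())
    return {
        "missing_on_storage": missing_on_storage,
        "unknown_to_catalog": unknown_to_catalog,
        "mismatched": mismatched,
        "used_bytes": used_bytes,
    }


if __name__ == "__main__":
    # Hypothetical paths and checksums, for illustration only.
    catalog = {"/data/a": (100, "ad:01"), "/data/b": (200, "ad:02"),
               "/data/c": (300, "ad:03")}
    dump = {"/data/a": (100, "ad:01"), "/data/b": (250, "ad:02"),
            "/data/d": (50, "ad:04")}
    print(compare_dump_to_catalog(catalog, dump))
```

Summing the sizes from the dump rather than from the catalog gives the accounting figure the abstract calls the realistic view of storage usage, since it reflects what is physically present.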
Evolving ATLAS Computing For Today's Networks Campana, S; Barreiro Megino, F; Jezequel, S ...
Journal of physics. Conference series, 01/2012, Volume 396, Issue 3
Journal Article, peer reviewed, open access
The ATLAS computing infrastructure was designed many years ago based on the assumption of rather limited network connectivity between computing centres. ATLAS sites have been organized in a hierarchical model, where only a static subset of all possible network links can be exploited and a static subset of well connected sites (CERN and the Tier-1s) can cover important functional roles such as hosting master copies of the data.
Effective distributed user analysis requires a system which meets the demands of running arbitrary user applications on sites with varied configurations and availabilities. The challenge of tracking such a system requires a tool to monitor not only the functional statuses of each grid site, but also to perform large-scale analysis challenges on the ATLAS grids. This work presents one such tool, the ATLAS GangaRobot, and the results of its use in tests and challenges. For functional testing, the GangaRobot performs daily tests of all sites; specifically, a set of exemplary applications is submitted to all sites and then monitored for success and failure conditions. These results are fed back into Ganga to improve job placements by avoiding currently problematic sites. For analysis challenges, a cloud is first prepared by replicating a number of desired DQ2 datasets across all the sites. Next, the GangaRobot is used to submit and manage a large number of jobs targeting these datasets. The high loads resulting from multiple parallel instances of the GangaRobot expose shortcomings in storage and network configurations. The results from a series of cloud-by-cloud analysis challenges starting in fall 2008 are presented.