Optimizing a computing infrastructure on the scale of LHC requires a quantitative understanding of a complex network of many different resources and services. For this purpose the CERN IT department ...and the LHC experiments are collecting a large multitude of logs and performance probes, which are already successfully used for short-term analysis (e.g. operational dashboards) within each group. The IT analytics working group has been created with the goal to bring data sources from different services and on different abstraction levels together and to implement a suitable infrastructure for mid- to long-term statistical analysis. It further provides a forum for joint optimization across single service boundaries and the exchange of analysis methods and tools. To simplify access to the collected data, we implemented an automated repository for cleaned and aggregated data sources based on the Hadoop ecosystem. This contribution describes some of the challenges encountered, such as dealing with heterogeneous data formats, selecting an efficient storage format for map reduce and external access, and will describe the repository user interface. Using this infrastructure we were able to quantitatively analyze the relationship between CPU/wall fraction, latency/throughput constraints of network and disk and the effective job throughput. In this contribution we will first describe the design of the shared analysis infrastructure and then present a summary of first analysis results from the combined data sources.
The ATLAS Distributed Data Management system manages more than 160PB of physics data across more than 130 sites globally. Rucio, the next generation Distributed Data Management system of the ATLAS ...experiment, replaced DQ2 in December 2014 and will manage the experiment's data throughout Run 2 of the LHC and beyond. The previous data management system pursued a rather simplistic approach for resource management, but with the increased data volume and more dynamic handling of data workflows required by the experiment, a more elaborate approach is needed. Rucio was delivered with an initial quota system, but during the first months of operation it turned out to not fully satisfy the collaboration's resource management needs. We consequently introduce a new concept of declaring quota policies (limits) for accounts in Rucio. This new quota concept is based on accounts and RSE (Rucio storage element) expressions, which allows the definition of hierarchical quotas in a dynamic way. This concept enables the operators of the data management system to implement very specific policies for users, physics groups and production systems while, at the same time, lowering the operational burden. This contribution describes the concept, architecture and workflow of the system and includes an evaluation measuring the performance of the system.
The monitoring and controlling interfaces of the previous data management system DQ2 followed the evolutionary requirements and needs of the ATLAS collaboration. The new data management system, ...Rucio, has put in place a redesigned web-based interface based upon the lessons learnt from DQ2, and the increased volume of managed information. This interface encompasses both a monitoring and controlling component, and allows easy integration for usergenerated views. The interface follows three design principles. First, the collection and storage of data from internal and external systems is asynchronous to reduce latency. This includes the use of technologies like ActiveMQ or Nagios. Second, analysis of the data into information is done massively parallel due to its volume, using a combined approach with an Oracle database and Hadoop MapReduce. Third, sharing of the information does not distinguish between human or programmatic access, making it easy to access selective parts of the information both in constrained frontends like web-browsers as well as remote services. This contribution will detail the reasons for these principles and the design choices taken. Additionally, the implementation, the interactions with external systems, and an evaluation of the system in production, both from a technological and user perspective, conclude this contribution.
The ATLAS Distributed Data Management system stores more than 150PB of physics data across 120 sites globally. To cope with the anticipated ATLAS workload of the coming decade, Rucio, the ...next-generation data management system has been developed. Replica management, as one of the key aspects of the system, has to satisfy critical performance requirements in order to keep pace with the experiment's high rate of continual data generation. The challenge lies in meeting these performance objectives while still giving the system users and applications a powerful toolkit to control their data workflows. In this work we present the concept, design and implementation of the replica management in Rucio. We will specifically introduce the workflows behind replication rules, their formal language definition, weighting and site selection. Furthermore we will present the subscription component, which offers functionality for users to proclaim interest in data that has not been created yet. This contribution describes the concept and the architecture behind those components and will show the benefits made by this system.
Rucio is the successor of the current Don Quijote 2 (DQ2) system for the distributed data management (DDM) system of the ATLAS experiment. The reasons for replacing DQ2 are manifold, but besides high ...maintenance costs and architectural limitations, scalability concerns are on top of the list. Current expectations are that the amount of data will be three to four times as it is today by the end of 2014. Further is the availability of more powerful computing resources pushing additional pressure on the DDM system as it increases the demands on data provisioning. Although DQ2 is capable of handling the current workload, it is already at its limits. To ensure that Rucio will be up to the expected workload, a way to emulate it is needed. To do so, first the current workload, observed in DQ2, must be understood in order to scale it up to future expectations. The paper discusses how selected core concepts are applied to the workload of the experiment and how knowledge about the current workload is derived from various sources (e.g. analysing the central file catalogue logs). Finally a description of the implemented emulation framework, used for stress-testing Rucio, is given.
This paper describes a popularity prediction tool for data-intensive data management systems, such as ATLAS distributed data management (DDM). It is fed by the DDM popularity system, which produces ...historical reports about ATLAS data usage, providing information about files, datasets, users and sites where data was accessed. The tool described in this contribution uses this historical information to make a prediction about the future popularity of data. It finds trends in the usage of data using a set of neural networks and a set of input parameters and predicts the number of accesses in the near term future. This information can then be used in a second step to improve the distribution of replicas at sites, taking into account the cost of creating new replicas (bandwidth and load on the storage system) compared to gain of having new ones (faster access of data for analysis). To evaluate the benefit of the redistribution a grid simulator is introduced that is able replay real workload on different data distributions. This article describes the popularity prediction method and the simulator that is used to evaluate the redistribution.
Rucio is the next-generation data management system of the ATLAS experiment. The software engineering process to develop Rucio is fundamentally different to existing software development approaches ...in the ATLAS distributed computing community. Based on a conceptual design document, development takes place using peer-reviewed code in a test-driven environment. The main objectives are to ensure that every engineer understands the details of the full project, even components usually not touched by them, that the design and architecture are coherent, that temporary contributors can be productive without delay, that programming mistakes are prevented before being committed to the source code, and that the source is always in a fully functioning state. This contribution will illustrate the workflows and products used, and demonstrate the typical development cycle of a component from inception to deployment within this software engineering process. Next to the technological advantages, this contribution will also highlight the social aspects of an environment where every action is subject to detailed scrutiny.
ATLAS DQ2 to Rucio renaming infrastructure Serfon, C; Barisits, M; Beermann, T ...
Journal of physics. Conference series,
01/2014, Letnik:
513, Številka:
4
Journal Article
Recenzirano
Odprti dostop
To prepare the migration to the new ATLAS Data Management system called Rucio, a renaming campaign of all the physical files produced by ATLAS is needed. It represents around 300 million files split ...between ~120 sites with 6 different storage technologies. It must be done in a transparent way in order not to disrupt the ongoing computing activities. An infrastructure to perform this renaming has been developed and is presented in this paper as well as its performance.
Rucio is the next-generation data management system of the ATLAS experiment. Historically, clients interacted with the data management system via specialised tools, but in Rucio additional methods ...are provided. To support filesystem-like interaction with all ATLAS data, a plugin to the DMLite software stack has been developed. It is possible to mount Rucio as a filesystem, and execute regular filesystem operations in a POSIX fashion. This is exposed via various protocols, for example, WebDAV or NFS, which then removes any dependency on Rucio for client software. The main challenge for this work is the mapping of the set-like ATLAS namespace into a hierarchical filesystem, whilst preserving the high performance features of the former. This includes listing and searching for data, creation of files, datasets and containers, and the aggregation of existing data – all within directories with potentially millions of entries. This contribution details the design and implementation of the plugin. Furthermore, an evaluation of the performance characteristics is given, to show that this approach can scale to the requirements of ATLAS physics analysis.
Managing the data of scientific projects is an increasingly complicated challenge, which was historically met by developing experiment-specific solutions. However, the ever-growing data rates and ...requirements of even small experiments make this approach very difficult, if not prohibitive. In recent years, the scientific data management system Rucio has evolved into a successful open-source project that is now being used by many scientific communities and organisations. Rucio is incorporating the contributions and expertise of many scientific projects and is offering common features useful to a diverse research community. This article describes the recent experiences in operating Rucio, as well as contributions to the project, by ATLAS, Belle II, CMS, ESCAPE, IGWN, LDMX, Folding@Home, and the UK’s Science and Technology Facilities Council (STFC).