This paper reports on the activities aimed at improving the architecture and performance of the ATLAS EventIndex implementation in Hadoop. The EventIndex contains tens of billions of event records, each of which consists of ∼100 bytes, all equally likely to be searched or counted. Data formats are one important area for optimizing the performance and storage footprint of applications based on Hadoop. This work reports on the production usage of, and on tests with, several data formats, including MapFiles, Apache Parquet and Avro, and various compression algorithms. The query engine also plays a critical role in the architecture. We also report on the use of HBase for the EventIndex, focusing on the optimizations performed in production and on the scalability tests. Additional engines that have been tested include Cloudera Impala, in particular for its SQL interface and its optimizations for data warehouse workloads and reports.
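As an illustration of the kind of format and compression comparison described above, the following sketch writes a batch of synthetic ∼100-byte event records to Parquet with several codecs and compares the resulting file sizes. The schema, field names and values are illustrative assumptions, not the production EventIndex layout.

```python
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Synthetic event records; field names are illustrative, not the real schema.
n = 100_000
events = pa.table({
    "run_number":   pa.array([358031] * n, pa.uint32()),
    "event_number": pa.array(list(range(n)), pa.uint64()),
    "lumi_block":   pa.array([i % 2000 for i in range(n)], pa.uint16()),
    "guid":         pa.array(["0A1B2C3D-%08d" % i for i in range(n)]),
})

# Compare the on-disk footprint of several compression codecs.
for codec in ("none", "snappy", "gzip", "zstd"):
    path = f"events_{codec}.parquet"
    pq.write_table(events, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```

The same table could be serialized with fastavro or written as Hadoop MapFiles (a Java API) to complete such a comparison.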
The ATLAS experiment has produced hundreds of petabytes of data and expects to have one order of magnitude more in the future. These data are spread among hundreds of computing Grid sites around the world. The EventIndex is the complete catalogue of all ATLAS events, real and simulated, keeping the references to all permanent files that contain a given event in any processing stage. It provides the means to select and access event data in the ATLAS distributed storage system, and supports completeness and consistency checks as well as trigger and offline selection overlap studies. The EventIndex employs various data handling technologies, such as Hadoop and Oracle databases, and is integrated with other parts of the ATLAS distributed computing infrastructure, including the systems for data, metadata, and production management. The project has been in operation since the start of LHC Run 2 in 2015, and is under continuous development in order to satisfy production and analysis demands and to follow technology evolution. The main data store in Hadoop, based on MapFiles and HBase, has worked well during Run 2, but new solutions are being explored for the future. This paper reports on the current system performance and on studies of a new data storage prototype that can carry the EventIndex through Run 3.
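The abstract above names HBase as part of the main Hadoop data store. The sketch below, using the happybase client, shows one plausible way to fetch an event record by a composite row key; the host, table name, column family and key layout are assumptions for illustration only.

```python
import happybase

# Hypothetical connection and table; names are illustrative assumptions.
connection = happybase.Connection("hadoop-gateway.example.cern.ch")
table = connection.table("atlas_eventindex")

# Composite row key: zero-padded run and event numbers keep rows sorted,
# so a lookup by (run, event) is a single-row point read.
row_key = b"%010d.%012d" % (358031, 1234567890)
row = table.row(row_key, columns=[b"ei:guid", b"ei:trigger"])
for column, value in row.items():
    print(column.decode(), value)
connection.close()
```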
The EventIndex is the complete catalogue of all ATLAS events, keeping the references to all files that contain a given event in any processing stage. It replaces the TAG database, which had been in use during LHC Run 1. For each event it contains the event identifiers, the trigger pattern and the GUIDs of the files containing it. Major use cases are event picking, feeding the Event Service used on some production sites, and technical checks of the completeness and consistency of processing campaigns. The system design is highly modular, so that its components (the data collection system, the storage system based on Hadoop, the query web service and the interfaces to other ATLAS systems) could be developed separately and in parallel during LS1. The EventIndex went into operation in time for the start of LHC Run 2. This paper describes the high-level system architecture, the technical design choices and the deployment process and issues. The performance of the data collection and storage systems, as well as of the query services, is also reported.
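Since the abstract lists a query web service among the components, the following sketch shows how a client-side event-picking request might look over HTTP. The endpoint URL, parameters and response fields are hypothetical, used only to illustrate the modular interface.

```python
import requests

# Hypothetical endpoint and parameters; the real service interface may differ.
resp = requests.get(
    "https://eventindex.example.cern.ch/api/lookup",
    params={"run": 358031, "event": 1234567890},
    timeout=30,
)
resp.raise_for_status()
for hit in resp.json():
    # Each hit is assumed to carry the processing stage and the file GUID.
    print(hit["stage"], hit["guid"])
```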
The EventIndex project consists of the development and deployment of a complete catalogue of events for experiments with large amounts of data, such as the ATLAS experiment at the LHC accelerator at CERN. The data to be stored in the EventIndex are produced by all production jobs that run at CERN or on the Grid; for every permanent output file, a snippet of information, containing the file unique identifier and the relevant attributes for each event, is sent to the central catalogue. The estimated insertion rate during LHC Run 2 is about 80 Hz of file records, containing ∼15 kHz of event records. This contribution describes the system design, the initial performance tests of the full data collection and cataloguing chain, and the project evolution towards full deployment and operation by the end of 2014.
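To make the per-file "snippet" concrete, the sketch below sends one such message to a broker with the stomp.py client (a later abstract in this collection confirms that a messaging system is used for collection). The broker address, credentials, queue name and payload fields are assumptions for illustration.

```python
import json
import stomp

# One snippet per permanent output file: the file GUID plus per-event attributes.
snippet = {
    "guid": "0A1B2C3D-4E5F-6789-ABCD-EF0123456789",  # illustrative GUID
    "events": [
        {"run": 358031, "event": 1234567890, "lumi_block": 250},
        {"run": 358031, "event": 1234567891, "lumi_block": 250},
    ],
}

# Hypothetical broker endpoint and destination queue.
conn = stomp.Connection([("mb.example.cern.ch", 61613)])
conn.connect("producer", "secret", wait=True)
conn.send(destination="/queue/atlas.eventindex", body=json.dumps(snippet))
conn.disconnect()
```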
Modern scientific experiments collect vast amounts of data that must be catalogued to meet multiple use cases and search criteria. In particular, high-energy physics experiments currently in operation produce several billion events per year. A database with references to the files containing each event at every processing stage is necessary in order to retrieve the selected events from data storage systems. The ATLAS EventIndex project is studying the best way to store the necessary information using modern data storage technologies (Hadoop, HBase, etc.) that store key-value pairs, and to select the best tools to support this application in terms of performance, robustness and ease of use. This paper describes the initial design and performance tests and the project evolution towards deployment and operation during 2014.
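One detail that matters when storing event records as key-value pairs is the key encoding. The sketch below packs run and event numbers into a fixed-width, big-endian binary key so that records sort naturally by (run, event); this is a generic technique for ordered key-value stores, not necessarily the encoding chosen by the project.

```python
import struct

def event_key(run_number: int, event_number: int) -> bytes:
    """Fixed-width big-endian key: byte order equals (run, event) order."""
    return struct.pack(">IQ", run_number, event_number)

def decode_key(key: bytes) -> tuple[int, int]:
    """Recover the (run, event) pair from a packed key."""
    return struct.unpack(">IQ", key)

key = event_key(358031, 1234567890)
assert decode_key(key) == (358031, 1234567890)
print(key.hex())
```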
The ATLAS EventIndex is a data catalogue system that stores event-related metadata for all (real and simulated) ATLAS events, at all processing stages. As it consists of different components that depend on other applications (such as distributed storage and various sources of information), we need to monitor the conditions of many heterogeneous subsystems to make sure everything is working correctly. This paper describes how we gather information about the EventIndex components and related subsystems: the producer-consumer architecture for data collection, health parameters from the servers that run the EventIndex components, the status of the EventIndex web interface, and the Hadoop infrastructure that stores EventIndex data. This information is collected, processed, and then displayed using CERN service monitoring software based on the Kibana analytics and visualization package, provided by the CERN IT Department. EventIndex monitoring is used both by the EventIndex team and by the ATLAS Distributed Computing shift crew.
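As an illustration of the monitoring flow described above, the sketch below posts one component-health document to an Elasticsearch index so that it can be charted in Kibana. The endpoint, index name and document fields are assumptions, not the actual CERN monitoring schema.

```python
import json
import time
import requests

# One heartbeat-style health document per component; fields are illustrative.
doc = {
    "component": "eventindex-producer-01",
    "status": "ok",
    "queue_backlog": 42,
    "timestamp": int(time.time() * 1000),  # epoch milliseconds
}

# Hypothetical Elasticsearch endpoint behind the Kibana dashboards.
requests.post(
    "https://monit-es.example.cern.ch/eventindex-health/_doc",
    data=json.dumps(doc),
    headers={"Content-Type": "application/json"},
    timeout=10,
)
```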
The ATLAS EventIndex is the catalogue of the event-related metadata for the information collected from the ATLAS detector. The basic unit of this information is the event record, containing the event identification parameters, pointers to the files containing the event, and trigger decision information. The main use case for the EventIndex is event picking, along with data consistency checks for large production campaigns. The EventIndex employs the Hadoop platform for data storage and handling, and a messaging system for the collection of information. The information for the EventIndex is collected both at Tier-0, when the data are first produced, and from the Grid, when various types of derived data are produced. The EventIndex uses various types of auxiliary information from other ATLAS sources for data collection and processing: trigger tables from the conditions metadata database (COMA), dataset information from the AMI data catalogue and the Rucio data management system, and information on production jobs from the ATLAS production system. The ATLAS production system is also used for the collection of event information from Grid jobs. EventIndex development started in 2012; in mid-2015 the system was commissioned and started collecting event metadata as part of ATLAS Distributed Computing operations.
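To make the "basic unit" of the catalogue concrete, the following sketch models an event record (identification parameters, trigger decision, and per-stage file pointers) as a small Python dataclass. The field names and types are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class EventRecord:
    # Event identification parameters.
    run_number: int
    event_number: int
    lumi_block: int
    # Trigger decision information, e.g. a packed bitmask of fired chains.
    trigger_mask: bytes = b""
    # Pointers to the files containing this event, keyed by processing stage
    # (e.g. "RAW", "AOD"); values are file GUIDs. Names are illustrative.
    file_guids: dict[str, str] = field(default_factory=dict)

rec = EventRecord(358031, 1234567890, 250)
rec.file_guids["AOD"] = "0A1B2C3D-4E5F-6789-ABCD-EF0123456789"
print(rec)
```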
ATLAS maintains a rich corpus of event-by-event information that provides a global view of the billions of events the collaboration has measured or simulated, along with sufficient auxiliary information to navigate to and retrieve data for any event at any production processing stage. This unique resource has been employed for a range of purposes, from monitoring, statistics, anomaly detection, and integrity checking, to event picking, subset selection, and sample extraction. Recent years of data-taking provide a foundation for assessment of how this resource has and has not been used in practice, of the uses for which it should be optimized, of how it should be deployed and provisioned for scalability to future data volumes, and of the areas in which enhancements to functionality would be most valuable. This paper describes how ATLAS event-level information repositories and selection infrastructure are evolving in light of this experience, and in view of their expected roles both in wide-area event delivery services and in an evolving ATLAS analysis model in which the importance of efficient selective access to data can only grow.
TAG Based Skimming in ATLAS. Doherty, T; Cranshaw, J; Hrivnac, J. Journal of Physics: Conference Series, vol. 396, no. 5 (2012).
The ATLAS detector at the LHC takes data at 200–500 Hz for several months per year, accumulating billions of events for hundreds of physics analyses. TAGs are event-level metadata allowing a quick search for interesting events based on selection criteria defined by the user. They are stored in a file-based format as well as in relational databases. The overall TAG system architecture encompasses a range of interconnected services that provide functionality for the required use cases, such as event selection, display, extraction and skimming. Skimming can be used to navigate to any of the pre-TAG data products. The services described in this paper address use cases that range in scale from selecting a handful of interesting events for an analysis-specific study to creating physics working group samples on the ATLAS production system. This paper will focus on the workflow aspects involved in creating pre- and post-TAG data products from a TAG selection using the Grid, in the context of the overall TAG system architecture. The emphasis will be on the range of demands that the implemented use cases place on these workflows and on the infrastructure. The trade-offs of various workflow strategies will be discussed, including scalability issues and other concerns that occur when integrating with data management and production systems.
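As a concrete picture of a TAG selection feeding the skimming workflows above, the sketch below runs a cut on a relational TAG table and collects the matching event identifiers. SQLite stands in for the real relational back end, and the table and column names are illustrative assumptions.

```python
import sqlite3

# In-memory stand-in for the relational TAG store; schema is illustrative.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE tag (
    run_number INTEGER, event_number INTEGER, guid TEXT,
    n_loose_electrons INTEGER, missing_et REAL)""")
con.executemany(
    "INSERT INTO tag VALUES (?, ?, ?, ?, ?)",
    [
        (358031, 1234567890, "GUID-A", 2, 72.5),  # passes the cut
        (358031, 1234567891, "GUID-A", 0, 12.0),  # fails the cut
    ],
)

# Event-level preselection: cut on TAG quantities, not on full event data.
cur = con.execute(
    "SELECT run_number, event_number, guid FROM tag "
    "WHERE n_loose_electrons >= 2 AND missing_et > 50.0"
)
for run, event, guid in cur:
    print(f"selected run={run} event={event} in file {guid}")
con.close()
```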
In the ATLAS experiment, Tag Data, or TAG for short, are event-level metadata: thumbnail information about events to support efficient identification and selection of events of interest to a given analysis. TAG quantities range from detector status and trigger information to basic physics quantities, e.g. the number of loose electron candidates and kinematic information for a limited number of these candidates, sorted by their transverse momentum. The average TAG size per event is around 1 kB, a factor 100 smaller than the Analysis Object Data (AOD) used for physics analysis. TAGs are primarily produced from AODs and stored in ROOT files. For easier access and usability, TAGs are also stored in a database. Queries to the database can in turn produce TAG files again. In a standard ATLAS analysis job, TAGs can be used to preselect events based on the TAG quantities before accessing the full AOD content. This allows for a significant speed-up of the processing time. This paper will discuss the different analysis workflows using TAGs and compare them with other analysis workflows within ATLAS. Further, the performance of preselecting events using either AODs directly or TAG files is measured and compared. Peak performance is estimated on a single machine with local disk access, while more realistic performance is estimated using Grid-like data access.
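The speed-up described above comes from rejecting events on their TAG quantities before the full AOD content is read. The minimal sketch below shows that pattern with a set of preselected (run, event) identifiers; the names and the event-iteration interface are illustrative assumptions, not the ATLAS analysis framework.

```python
# Preselected (run, event) pairs from a TAG query (illustrative values).
selected = {(358031, 1234567890), (358031, 1234567893)}

# Stand-in for iterating over full AOD events; reading one is expensive.
aod_events = [(358031, e) for e in range(1234567890, 1234567900)]

def read_full_event(run, event):
    # Placeholder for the costly read of the full AOD event content.
    return {"run": run, "event": event}

processed = 0
for run, event in aod_events:
    if (run, event) not in selected:
        continue  # skip without touching the full event data
    full_event = read_full_event(run, event)
    processed += 1

print(f"processed {processed} of {len(aod_events)} events")
```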