Abstract
The performance of I/O-intensive applications is largely determined by the organization of data and the associated insertion/extraction techniques. In this paper we present the design and implementation of an application targeted at managing data received (up to ~150 Gb/s payload throughput) into host DRAM, buffering data for several seconds, a duration matched to the DRAM size, before it is dropped. All data are validated, processed and indexed. The features extracted from the processing are streamed out to subscribers over the network; in addition, while data reside in the buffer, about 0.1 ‰ of them are served to remote clients upon request. Last but not least, the application must be able to locally persist data at full input speed when instructed to do so. The characteristics of the incoming data stream (fixed or variable rate, fixed or variable payload size) heavily influence the choice of implementation of the buffer management system. The application design promotes the separation of interfaces (concepts) from application-oriented specializations (models), which makes it possible to generalize most of the workflows and requires only minimal effort to integrate new data sources. After the description of the application design, we present the hardware platform used for validation and benchmarking of the software, and the performance results obtained.
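As a rough illustration of the concept/model separation mentioned above, the following C++ sketch (all names are invented for illustration and are not taken from the actual code base) shows a minimal data-format interface that generic buffering, validation and indexing workflows could program against, together with one possible specialization for a fixed-size payload source:

    #include <cstddef>
    #include <cstdint>

    // "Concept": what any input format must expose so that validation, indexing
    // and buffer slicing can be written once, independently of the data source.
    struct DataFormat {
        virtual ~DataFormat() = default;
        virtual bool        validate(const uint8_t* payload, std::size_t size) const = 0;
        virtual uint64_t    key(const uint8_t* payload) const = 0;   // index key extracted from the payload
        virtual std::size_t payloadSize(const uint8_t* header) const = 0;
    };

    // "Model": a fixed-size payload whose first 8 bytes carry the index key.
    class FixedSizeFormat : public DataFormat {
    public:
        explicit FixedSizeFormat(std::size_t size) : m_size(size) {}
        bool validate(const uint8_t*, std::size_t size) const override { return size == m_size; }
        uint64_t key(const uint8_t* payload) const override {
            uint64_t k = 0;
            for (int i = 0; i < 8; ++i) k = (k << 8) | payload[i];
            return k;
        }
        std::size_t payloadSize(const uint8_t*) const override { return m_size; }
    private:
        std::size_t m_size;
    };

A new data source would then only need to supply its own model of the interface, leaving the generic buffering, indexing and serving workflows untouched.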
The ATLAS experiment at the Large Hadron Collider (LHC) operated very successfully in the years 2008 to 2013, a period identified as Run 1. ATLAS achieved an overall data-taking efficiency of 94%, largely constrained by the irreducible dead-time introduced to accommodate the limitations of the detector read-out electronics. Out of the 6% dead-time, only about 15% could be attributed to the central trigger and DAQ system, and of these, a negligible fraction was due to the Control and Configuration sub-system. Despite these achievements, and in order to improve the efficiency of the whole DAQ system in Run 2 (2015-2018), the first long LHC shutdown (2013-2014) was used to carry out a complete revision of the control and configuration software. The goals were three-fold: properly accommodate additional requirements that could not be seamlessly included during steady operation of the system; re-factor software that had been repeatedly modified to include new features and had thus become less maintainable; and seize the opportunity to modernize software written even before Run 1, thus profiting from the rapid evolution of IT technologies. This upgrade was carried out under the important constraint of minimally impacting the mode of operation of the system and its public APIs, in order to maximize the acceptance of the changes by the large user community. This paper presents, using a few selected examples, how the work was approached, which new technologies were introduced into the ATLAS DAQ system, and how they performed in the course of Run 2. Although these are specific to this system, many of the solutions can be considered and adapted for different distributed DAQ systems.
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment is a large, distributed and heterogeneous system: it consists of thousands of interconnected computers and electronics devices that operate coherently to read out and select relevant physics data. Advanced testing and diagnostics capabilities of the TDAQ control system are a crucial feature which contributes significantly to smooth operation, to fast recovery in case of problems and, ultimately, to the high efficiency of the whole experiment. The base layer of the verification and diagnostic functionality is a test management framework. We have developed a flexible test management system that allows experts to define and configure tests for different components, indicate follow-up actions to test failures and describe inter-dependencies between TDAQ or detector elements; a sketch of these notions is given below. This development is based on the experience gained with the previous test system, which was used during the first three years of data taking. We discovered that more emphasis needed to be put on making the verification and diagnostics functionality flexible and configurable by the many people who are each knowledgeable about, and expert on, individual components of the experiment. In this paper we describe the design and implementation of the test management system as well as some aspects of its exploitation during ATLAS data taking in LHC Run 2.
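To make the notions of configurable tests, follow-up actions and inter-dependencies concrete, here is a minimal C++ sketch; the types and names are hypothetical and do not reflect the framework's actual API:

    #include <functional>
    #include <string>
    #include <vector>

    enum class TestResult { Passed, Failed, Untested };

    // An expert-supplied test description for one component.
    struct TestDescription {
        std::string                 component;   // TDAQ or detector element under test
        std::vector<std::string>    dependsOn;   // components that must be healthy first
        std::function<TestResult()> run;         // the test body, supplied by the expert
        std::function<void()>       onFailure;   // follow-up/recovery action
    };

    // Run a test only once all of its dependencies have already passed,
    // and trigger the configured follow-up action on failure.
    TestResult runIfReady(const TestDescription& t,
                          const std::function<TestResult(const std::string&)>& statusOf) {
        for (const auto& dep : t.dependsOn)
            if (statusOf(dep) != TestResult::Passed)
                return TestResult::Untested;
        TestResult r = t.run();
        if (r == TestResult::Failed && t.onFailure) t.onFailure();
        return r;
    }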
The ATLAS experiment at CERN is planning full deployment of a new unified optical link technology for connecting detector front-end electronics on the timescale of LHC Run 4 (2025). It is estimated that roughly 8000 GBT (GigaBit Transceiver) links, with transfer rates up to 10.24 Gbps, will replace existing links used for readout, detector control and distribution of timing and trigger information. A new class of devices will be needed to interface many GBT links to the rest of the trigger, data-acquisition and detector-control systems. In this paper FELIX (Front End LInk eXchange) is presented: a PC-based device to route data from and to multiple GBT links via a high-performance general-purpose network capable of a total throughput of up to O(20 Tbps). FELIX implies architectural changes to the ATLAS data acquisition system, such as the use of industry-standard COTS components early in the DAQ chain. Additionally, the design and implementation of a FELIX demonstration platform are presented, and hardware and software aspects are discussed.
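Purely as an illustration of the routing role described above (this is not FELIX code, and all names are invented), such a device can be pictured as keeping a table that maps a GBT link and a logical channel within it to a configured network peer:

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>

    struct NetworkEndpoint { std::string host; uint16_t port; };

    // Maps (GBT link id, logical channel id) to the network destination
    // that should receive the corresponding data.
    class RoutingTable {
    public:
        void add(uint32_t gbtLink, uint32_t channel, NetworkEndpoint dest) {
            m_routes[{gbtLink, channel}] = std::move(dest);
        }
        const NetworkEndpoint* lookup(uint32_t gbtLink, uint32_t channel) const {
            auto it = m_routes.find({gbtLink, channel});
            return it != m_routes.end() ? &it->second : nullptr;
        }
    private:
        std::map<std::pair<uint32_t, uint32_t>, NetworkEndpoint> m_routes;
    };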
The ATLAS experiment at the Large Hadron Collider at CERN relies on a complex and highly distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data obtained at unprecedented energy and rates. The Run Control (RC) system is the component steering the data acquisition by starting and stopping processes and by carrying all data-taking elements through well-defined states in a coherent way. Taking into account all the lessons learnt during LHC Run 1, the RC has been completely re-designed and re-implemented during the LHC Long Shutdown 1 (LS1) phase. As a result of the new design, the RC is assisted by the Central Hint and Information Processor (CHIP) service, which can truly be considered its "brain". CHIP is an intelligent system able to supervise ATLAS data taking, take operational decisions and handle abnormal conditions. In this paper, the design, implementation and performance of the RC and CHIP will be described. Particular emphasis will be put on the way the RC and CHIP cooperate and on the huge benefits brought by the Complex Event Processing engine. Additionally, some error recovery scenarios will be analysed for which the intervention of human experts is now rendered unnecessary.
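The flavour of condition expressed by such event-processing rules can be illustrated with a toy C++ example; the real engine and its rule language are not shown here, and the class, threshold and time window below are invented. The idea is simply "if the same application dies twice within a short window, trigger a recovery action without involving a human expert":

    #include <chrono>
    #include <map>
    #include <string>

    using Clock = std::chrono::steady_clock;

    // Toy stand-in for a rule: fire recovery when an application crashes
    // twice within 60 seconds.
    class CrashRule {
    public:
        // Returns true when the recovery action should fire for this application.
        bool onCrashEvent(const std::string& application) {
            auto now = Clock::now();
            auto it = m_lastCrash.find(application);
            bool fire = (it != m_lastCrash.end()) &&
                        (now - it->second) < std::chrono::seconds(60);
            m_lastCrash[application] = now;
            return fire;
        }
    private:
        std::map<std::string, Clock::time_point> m_lastCrash;
    };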
The Resource Manager is one of the core components of the Data Acquisition system of the ATLAS experiment at the LHC. The Resource Manager marshals the right of applications to access resources which may exist in multiple but limited copies, in order to avoid conflicts due to program faults or operator errors. Access to resources is managed in a manner similar to what a lock manager would do in other software systems. All the available resources and their association to software processes are described in the Data Acquisition configuration database. The Resource Manager is queried about the availability of resources every time an application needs to be started. The Resource Manager's design is based on a client-server model, hence it consists of two components: the Resource Manager "server" application and the "client" shared library. The Resource Manager server implements all the needed functionality, while the Resource Manager client library provides remote access to the "server" (i.e., to allocate and free resources, and to query the status of resources). During the LHC's Long Shutdown period, the Resource Manager's requirements were reviewed in the light of the experience gained during the LHC's Run 1. As a consequence, the Resource Manager underwent a full re-design and re-implementation cycle, with the result of a reduction of the code base by 40% with respect to the previous implementation. This contribution will focus on the way the design and implementation of the Resource Manager could leverage the new features available in the C++11 standard, and on how the introduction of external libraries (like Boost multi-index containers) led to a more maintainable system. Additionally, particular attention will be given to the technical solutions adopted to ensure the Resource Manager could sustain the typical request rates of the Data Acquisition system, which is about 30000 requests in a time window of a few seconds coming from more than 1000 clients.
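As a minimal, hypothetical sketch of how a Boost multi-index container fits this use case (the names below are invented and not the Resource Manager's actual types), the same set of granted allocations can be queried both by resource name, to enforce the limited number of copies, and by owning process, to release everything held by a client that disappears:

    #include <cstddef>
    #include <string>
    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/ordered_index.hpp>
    #include <boost/multi_index/member.hpp>

    namespace bmi = boost::multi_index;

    // One granted token: which resource is held, and by which client process.
    struct Allocation {
        std::string resource;
        int         ownerPid;
    };

    // Index 0: by resource name (count copies in use); index 1: by owner pid.
    using AllocationTable = boost::multi_index_container<
        Allocation,
        bmi::indexed_by<
            bmi::ordered_non_unique<bmi::member<Allocation, std::string, &Allocation::resource>>,
            bmi::ordered_non_unique<bmi::member<Allocation, int, &Allocation::ownerPid>>
        >
    >;

    // Grant a request only if fewer than 'maxCopies' tokens are already out.
    bool tryAllocate(AllocationTable& table, const std::string& resource,
                     int pid, std::size_t maxCopies) {
        if (table.get<0>().count(resource) >= maxCopies) return false;
        table.insert({resource, pid});
        return true;
    }

    // Release all resources held by a given client process.
    void freeAll(AllocationTable& table, int pid) {
        table.get<1>().erase(pid);
    }

Keeping a single container with two views avoids maintaining two hand-synchronized maps, which is one way such a library can make the code base smaller and more maintainable.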
Complex Event Processing (CEP) is a methodology that combines data from many sources in order to identify events or patterns that need particular attention. It has gained a lot of momentum in the computing world in the past few years and is used in ATLAS to continuously monitor the behaviour of the data acquisition system, to trigger corrective actions and to guide the experiment's operators. This technology is very powerful, provided experts regularly insert and update their knowledge about the system's behaviour in the CEP engine. Nevertheless, writing or modifying CEP rules is not trivial, since the programming paradigm used is quite different from what developers are normally familiar with. In order to help experts verify that the rules work as expected, we have thus developed a complete testing and validation environment. This system consists of three main parts: the first is a reader that retrieves, from existing storage, all relevant data streams produced during data taking; the second is a playback tool that allows data from specific past data-taking sessions to be re-injected into the CEP engine; and the third is a reporting tool that shows the output that the rules loaded into the engine would have produced in the live system. In this paper we describe the design and implementation of this validation system, highlight its strengths and shortcomings, and indicate how such a system could be reused in similar projects.
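A minimal sketch of the playback idea, with invented C++ interfaces standing in for the real archive reader and CEP engine, would re-inject archived events while preserving their original relative timing (optionally sped up):

    #include <chrono>
    #include <cstdint>
    #include <string>
    #include <thread>
    #include <vector>

    struct ArchivedEvent {
        uint64_t    timestampUs;   // original production time, microseconds
        std::string payload;       // serialized operational data
    };

    struct RuleEngine {                       // stand-in for the real CEP engine
        void inject(const ArchivedEvent& e);  // defined elsewhere
    };

    // Replay an archived session into the engine, keeping the original
    // inter-event spacing scaled by 'speedup'.
    void playback(const std::vector<ArchivedEvent>& events, RuleEngine& engine,
                  double speedup = 1.0) {
        if (events.empty()) return;
        uint64_t t0 = events.front().timestampUs;
        auto start = std::chrono::steady_clock::now();
        for (const auto& e : events) {
            auto offset = std::chrono::microseconds(
                static_cast<uint64_t>((e.timestampUs - t0) / speedup));
            std::this_thread::sleep_until(start + offset);
            engine.inject(e);
        }
    }

The reporting part can then compare what the rules emitted during the replay with what the live system actually did during that session.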
The Trigger and Data Acquisition (TDAQ) system of the ATLAS detector at the Large Hadron Collider at CERN is composed of a large number of distributed hardware and software components (about 3000 computers and more than 25000 applications) which, in a coordinated manner, provide the data-taking functionality of the overall system. During data-taking runs, a huge flow of operational data is produced in order to constantly monitor the system and allow proper detection of anomalies or misbehaviours. In the ATLAS trigger and data acquisition system, operational data are archived and made available to applications by the P-BEAST (Persistent Back-End for the ATLAS Information System of TDAQ) service, which implements a custom time-series database. The possibility to efficiently visualize both real-time and historical operational data is a great asset facilitating both online identification of problems and post-mortem analysis. This paper will present a web-based solution developed to achieve this goal: the solution leverages the flexibility of the P-BEAST archiver to retrieve data, and exploits the versatility of the Grafana dashboard builder to offer a very rich user experience. Additionally, particular attention will be given to the way some technical challenges (like the efficient visualization of a huge amount of data and the integration of the P-BEAST data source in Grafana) were addressed and solved.
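One common way to keep the visualization of very large time series responsive is to downsample on the server before shipping points to the dashboard. The following C++ sketch is purely illustrative (it is not the P-BEAST API) and reduces a series to at most a fixed number of bucket averages:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Point { uint64_t timestamp; double value; };

    // Reduce 'raw' to at most 'maxPoints' points by averaging fixed-size buckets;
    // each output point keeps the timestamp of the first sample in its bucket.
    std::vector<Point> downsample(const std::vector<Point>& raw, std::size_t maxPoints) {
        if (maxPoints == 0 || raw.size() <= maxPoints) return raw;
        std::vector<Point> out;
        out.reserve(maxPoints);
        std::size_t bucket = (raw.size() + maxPoints - 1) / maxPoints;  // samples per bucket
        for (std::size_t i = 0; i < raw.size(); i += bucket) {
            std::size_t end = std::min(i + bucket, raw.size());
            double sum = 0.0;
            for (std::size_t j = i; j < end; ++j) sum += raw[j].value;
            out.push_back({raw[i].timestamp, sum / static_cast<double>(end - i)});
        }
        return out;
    }

The number of points requested would typically be driven by the width of the dashboard panel, so that the amount of data transferred stays bounded regardless of the time range queried.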