The ATLAS experiment at the Large Hadron Collider (LHC) operated very successfully in the years 2008 to 2013, a period identified as Run 1. ATLAS achieved an overall data-taking efficiency of 94%, largely constrained by the irreducible dead-time introduced to accommodate the limitations of the detector read-out electronics. Of the 6% dead-time, only about 15% could be attributed to the central trigger and DAQ system, and of that, a negligible fraction was due to the Control and Configuration sub-system. Despite these achievements, and in order to improve the efficiency of the whole DAQ system in Run 2 (2015-2018), the first long LHC shutdown (2013-2014) was used to carry out a complete revision of the control and configuration software. The goals were three-fold: to properly accommodate additional requirements that could not be seamlessly included during steady operation of the system; to re-factor software that had been repeatedly modified to include new features and had thus become less maintainable; and to seize the opportunity of modernizing software written even before Run 1, profiting from the rapid evolution of IT technologies. The upgrade was carried out under the important constraint of minimally impacting the mode of operation of the system and its public APIs, in order to maximize the acceptance of the changes by the large user community. This paper presents, using a few selected examples, how the work was approached, which new technologies were introduced into the ATLAS DAQ system, and how they performed in the course of Run 2. Although these examples are specific to this system, many of the solutions can be adapted to other distributed DAQ systems.
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment is a large, distributed and heterogeneous system: it consists of thousands of interconnected computers and electronics devices that operate coherently to read out and select relevant physics data. The advanced testing and diagnostics capabilities of the TDAQ control system are a crucial feature which contributes significantly to smooth operation, to fast recovery in case of problems and, ultimately, to the high efficiency of the whole experiment. The base layer of the verification and diagnostics functionality is a test management framework. We have developed a flexible test management system that allows experts to define and configure tests for different components, specify follow-up actions for test failures and describe inter-dependencies between TDAQ or detector elements. This development is based on the experience gained with the previous test system, which was used during the first three years of data taking. We discovered that more emphasis needed to be put on the flexibility and configurability of the verification and diagnostics functionality, so that the many people who are each expert on individual components of the experiment can exploit it. In this paper we describe the design and implementation of the test management system, as well as aspects of its exploitation during ATLAS data taking in LHC Run 2.
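As a purely illustrative sketch of the kind of test description such a framework supports, the following Java fragment models a configurable test with inter-dependencies and a follow-up action. All names here (TestDescriptor, FollowUpAction, the example component identifiers) are hypothetical and are not the actual TDAQ API.

    import java.util.List;

    // Hypothetical descriptor for one configurable test; not the TDAQ API.
    public record TestDescriptor(
            String componentId,            // component under test
            String command,                // executable implementing the test
            int timeoutSeconds,            // fail if the test does not return in time
            List<String> dependsOn,        // components that must be verified first
            FollowUpAction onFailure) {    // what the framework does on failure

        public enum FollowUpAction { IGNORE, RETRY, RESTART_COMPONENT, ESCALATE_TO_EXPERT }
    }

    // Example: verify a read-out component only after its power supply,
    // restarting it automatically if the check fails.
    // new TestDescriptor("ROS-23", "/tdaq/tests/check_ros", 10,
    //         List.of("ROS-PSU-23"), TestDescriptor.FollowUpAction.RESTART_COMPONENT);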
A large experiment like ATLAS at the LHC (CERN), with over three thousand members and a shift crew of 15 people running the experiment 24/7, needs an easy and reliable tool to gather all the information concerning the experiment's development, installation, deployment and exploitation over its lifetime. With the increasing number of users and the accumulation of stored information since the experiment start-up, the electronic logbook in use until now, ATLOG, started to show its limitations in terms of speed and usability. Its monolithic architecture made maintenance and the implementation of new functionality difficult, at times almost impossible. A new tool, ELisA, has been developed to replace ATLOG. It is based on modern web technologies: the Spring framework with a Model-View-Controller architecture was chosen, which helps build flexible and easy-to-maintain applications. The new tool implements all features of the old electronic logbook with increased performance and better graphics; it uses the same database back-end for portability reasons. In addition, several new requirements have been accommodated which could not be implemented in ATLOG. This paper describes the architecture, implementation and performance of ELisA, with particular emphasis on the choices that allowed a scalable and very fast system to be built, and on the aspects that could be re-used in different contexts to build a similar application.
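To make the Model-View-Controller split concrete, here is a minimal, hypothetical Spring controller for a logbook-style REST resource. Only the Spring annotations are real API; LogbookEntry, LogbookRepository and the /entries route are invented for illustration and do not reflect the actual ELisA code.

    import org.springframework.web.bind.annotation.*;

    // Hypothetical model and data-access layer; ELisA's real entities differ.
    record LogbookEntry(long id, String author, String text) {}

    interface LogbookRepository {
        LogbookEntry findById(long id);
        LogbookEntry save(LogbookEntry entry);
    }

    // Controller: maps HTTP requests onto the model; the "view" is the
    // JSON serialization Spring applies to the returned objects.
    @RestController
    @RequestMapping("/entries")
    class LogbookController {
        private final LogbookRepository repository;

        LogbookController(LogbookRepository repository) {
            this.repository = repository;   // injected by Spring
        }

        @GetMapping("/{id}")
        LogbookEntry get(@PathVariable long id) {
            return repository.findById(id);
        }

        @PostMapping
        LogbookEntry create(@RequestBody LogbookEntry entry) {
            return repository.save(entry);
        }
    }

Keeping the controller free of business logic, as in this sketch, is what makes such an application easy to extend: new views or storage back-ends can be swapped in without touching the request-handling layer.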
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment is a very complex distributed computing system, composed of more than 20000 applications running on more than 2000 computers. The TDAQ Controls system has to guarantee the smooth and synchronous operation of all the TDAQ components and has to provide the means to minimize the downtime of the system caused by runtime failures. During data-taking runs, the streams of information messages sent or published by running applications are the main source of knowledge about the correctness of ongoing operations. The huge flow of operational monitoring data produced is constantly watched by experts in order to detect problems or misbehaviour. Given the scale of the system and the rates of data to be analyzed, the automation of the system's functionality in the areas of operational monitoring, system verification, error detection and recovery is a strong requirement. To accomplish this objective, the Controls system includes high-level components based on advanced software technologies, namely a rule-based expert system and Complex Event Processing (CEP) engines. The chosen techniques allow the knowledge of experts to be formalized, stored and reused, thus assisting the shifters in the ATLAS control room during data-taking activities.
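As an illustration of the Complex Event Processing approach, the following sketch uses the open-source Esper engine (5.x Java API) to express a simple sliding-window rule over a message stream. The TdaqMessage event type, the application names and the particular threshold are hypothetical examples, not the actual ATLAS rule base.

    import com.espertech.esper.client.*;

    public class MessageBurstRule {
        // Hypothetical event as published by a running application.
        public static class TdaqMessage {
            private final String application;
            private final String severity;
            public TdaqMessage(String application, String severity) {
                this.application = application;
                this.severity = severity;
            }
            public String getApplication() { return application; }
            public String getSeverity() { return severity; }
        }

        public static void main(String[] args) {
            Configuration cfg = new Configuration();
            cfg.addEventType("TdaqMessage", TdaqMessage.class);
            EPServiceProvider ep = EPServiceProviderManager.getDefaultProvider(cfg);

            // Declarative rule: flag any application emitting more than 10
            // ERROR messages within a sliding 30-second window.
            EPStatement stmt = ep.getEPAdministrator().createEPL(
                "select application, count(*) as n "
                + "from TdaqMessage(severity='ERROR').win:time(30 sec) "
                + "group by application having count(*) > 10");

            stmt.addListener((newEvents, oldEvents) -> {
                if (newEvents == null) return;
                for (EventBean e : newEvents) {
                    System.out.println("Alert: " + e.get("application")
                            + " raised " + e.get("n") + " errors in 30 s");
                }
            });

            // Events would normally arrive from the message system.
            ep.getEPRuntime().sendEvent(new TdaqMessage("HLT-PU-042", "ERROR"));
        }
    }

The appeal of this style is visible even in a toy rule: the expert's knowledge lives in a short declarative statement that can be stored, reviewed and reloaded, rather than being buried in procedural monitoring code.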
ATLAS is one of the four experiments at the Large Hadron Collider (LHC) at CERN, which was put into operation this year. The challenging experimental environment and the extreme detector complexity required the development of a highly scalable distributed monitoring framework, which is currently being used to monitor the quality of the data being taken as well as the operational conditions of the hardware and software elements of the detector, trigger and data acquisition systems. At the moment the ATLAS Trigger/DAQ system is distributed over more than 1000 computers, about one third of the final ATLAS size. During every minute of an ATLAS data-taking session, the monitoring framework serves several thousand physics events to monitoring data analysis applications, handles more than 4 million histogram updates coming from more than 4 thousand applications, executes 10 thousand advanced data quality checks for a subset of those histograms, and displays histograms and the results of these checks on several dozen monitors installed in the main and satellite ATLAS control rooms. This note presents an overview of the online monitoring software framework and describes the experience gained during an extensive commissioning period as well as during the first phase of LHC beam in September 2008. Performance results obtained on the current ATLAS DAQ system will also be presented, showing that the performance of the framework is adequate for the final ATLAS system.
Data quality monitoring (DQM) is an integral part of the data-taking process of HEP experiments. DQM involves the automated analysis of monitoring data through user-defined algorithms and the relaying of a summary of the analysis results to the shift personnel while data is being processed. In the online environment, DQM provides the shifter with current run information that can be used to identify and overcome problems early on. During offline reconstruction, more complex analysis of physics quantities is performed by DQM, and the results are used to assess the quality of the reconstructed data. The ATLAS data quality monitoring framework (DQMF) is a distributed software system providing DQM functionality in the online environment. The DQMF has a scalable architecture, achieved by distributing the execution of the analysis algorithms over a configurable number of DQMF agents running on different nodes connected over the network. The core part of the DQMF is designed to depend only on software that is common between the online and offline environments (such as ROOT), so that the same framework can be used in both. This paper describes the main requirements, the architectural design, and the implementation of the DQMF.
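The agent/algorithm split can be pictured with the following purely illustrative sketch. The real DQMF is C++ built on ROOT; the Java types below (DqAlgorithm, DqResult, Histogram, the threshold names) are hypothetical stand-ins used only to show the structure of a user-defined check that agents execute against incoming histograms.

    import java.util.Map;

    enum DqResult { GREEN, YELLOW, RED, UNDEFINED }

    interface Histogram {
        int bins();
        double content(int bin);
    }

    // User-defined analysis algorithm, selected by name in the configuration
    // and executed by a DQMF agent whenever the histogram is updated.
    interface DqAlgorithm {
        DqResult check(Histogram h, Map<String, Double> thresholds);
    }

    // Example algorithm: flag histograms whose "error" bin holds too large
    // a fraction of the total entries (bin 0 by convention in this sketch).
    class ErrorBinFraction implements DqAlgorithm {
        @Override
        public DqResult check(Histogram h, Map<String, Double> thresholds) {
            double total = 0;
            for (int i = 0; i < h.bins(); i++) total += h.content(i);
            double fraction = total > 0 ? h.content(0) / total : 0;
            if (fraction > thresholds.get("red")) return DqResult.RED;
            if (fraction > thresholds.get("yellow")) return DqResult.YELLOW;
            return DqResult.GREEN;
        }
    }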
This first year of data taking has been of great interest, not only for the physics outcome, but also for operating the system in the environment it was designed for. The online data quality monitoring framework (DQMF) is a highly scalable distributed framework which is used to assess the operational conditions of the detector and the quality of the data. DQMF provides quick feedback to the user about the functioning and performance of the sub-detectors by performing over 75,000 advanced data quality checks, at rates that vary with the histogram update frequency. The DQM display (DQMD) is the visualisation tool through which histograms and their data quality assessments can be accessed. It allows great flexibility in displaying histograms, their references where applicable, the configurations used for the automatic checks, data quality flags and much more. The DQM configuration is stored in a database that can be easily created and edited with the DQM Configurator tool (DQMC). This paper describes the design and implementation of the DQMF and its display, as well as the data quality performance achieved during this first year of data taking.
The start of collisions at the LHC brings a new era of particle physics and much improved potential to observe signatures of new physics, some of which may be evident from the very beginning of collisions. It is essential at this point in the experiment to be prepared to quickly and efficiently determine the quality of the incoming data. Easy visualization of the data for the shift crew and experts is one of the key factors in the data quality assessment process. This paper describes the design and implementation of the Data Quality Monitoring Display and discusses the experience from its usage and performance during ATLAS commissioning with cosmic ray and single beam data.
In order to meet the requirements of ATLAS data taking, the Trigger-DAQ (TDAQ) system is composed of O(10000) applications running on more than 2600 computers in a network. At such a system size, software and hardware failures are quite frequent. To minimize system downtime, the Trigger-DAQ control system includes advanced verification and diagnostics facilities. The operator uses tests encoding the expertise of the TDAQ and detector developers in order to diagnose and recover from errors, automatically where possible. The TDAQ control system is built as a distributed tree of controllers, where the behavior of each controller is defined in a rule-based language allowing easy customization. The control system also includes a verification framework which allows users to develop and configure tests for any component in the system, with different levels of complexity. It can be used as a stand-alone test facility for a small detector installation, as part of the general TDAQ initialization procedure, and for diagnosing problems which may occur at run time. The system is currently being used in the TDAQ commissioning at the ATLAS experimental zone and by sub-detectors for stand-alone verification of the detector hardware before final installation.
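A minimal sketch of the "tree of controllers" idea follows, assuming a simplified model in which each controller supervises its children and reacts to their state changes according to a configurable rule. In the real system this behaviour is expressed in a dedicated rule-based language; the Java names below (Controller, State, Action) are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    class Controller {
        enum State { RUNNING, FAILED, RECOVERING }
        enum Action { IGNORE, RESTART_CHILD, PROPAGATE_ERROR }

        private final String name;
        private final List<Controller> children = new ArrayList<>();
        // A "rule": maps an observed child state to the action to take.
        private final Function<State, Action> rule;

        Controller(String name, Function<State, Action> rule) {
            this.name = name;
            this.rule = rule;
        }

        void addChild(Controller child) { children.add(child); }

        void onChildStateChange(Controller child, State s) {
            switch (rule.apply(s)) {
                case RESTART_CHILD -> System.out.println(name + ": restarting " + child.name);
                case PROPAGATE_ERROR -> System.out.println(name + ": escalating to parent");
                case IGNORE -> { /* tolerated failure, e.g. a non-critical monitor */ }
            }
        }
    }

The value of keeping the rules out of the controller code, as hinted at here, is that each detector group can customize the recovery behaviour of its own branch of the tree without rebuilding the control system.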
ATLAS is one of the four experiments under construction along the Large Hadron Collider (LHC) ring at CERN. The LHC will produce interactions at a center-of-mass energy of √s = 14 TeV with a frequency of 40 MHz. The detector consists of more than 140 million electronic channels. The challenging experimental environment and the extreme detector complexity impose the necessity of a common, scalable, distributed monitoring framework, which can be tuned for optimal use by different ATLAS sub-detectors at the various levels of the ATLAS data flow. This paper presents the architecture of this monitoring software framework and describes its current implementation, which has already been used at the ATLAS beam test activity in 2004. Preliminary performance results, obtained on a computer cluster consisting of 700 nodes, will also be presented, showing that the performance of the current implementation is within the range of the final ATLAS requirements.
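The common framework described above is essentially a publish/subscribe service for monitoring objects. The following sketch shows that idea under simplified assumptions: producers publish named histograms and consumers subscribe to updates. The MonitoringService and Subscriber names are hypothetical and do not correspond to the actual ATLAS interfaces.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArrayList;

    class MonitoringService {
        interface Subscriber { void onUpdate(String name, double[] histogram); }

        private final Map<String, double[]> objects = new ConcurrentHashMap<>();
        private final Map<String, List<Subscriber>> subscribers = new ConcurrentHashMap<>();

        // Called by producers (read-out, trigger or data-flow applications).
        void publish(String name, double[] histogram) {
            objects.put(name, histogram);
            subscribers.getOrDefault(name, List.of())
                       .forEach(s -> s.onUpdate(name, histogram));
        }

        // Called by consumers (displays, data quality checkers).
        void subscribe(String name, Subscriber s) {
            subscribers.computeIfAbsent(name, k -> new CopyOnWriteArrayList<>()).add(s);
        }
    }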