Abstract
The performance of I/O-intensive applications is largely determined by the organization of data and the associated insertion/extraction techniques. In this paper we present the design and implementation of an application that manages data received into host DRAM (up to ~150 Gb/s payload throughput), buffering it for several seconds, a duration matched to the DRAM size, before dropping it. All data are validated, processed and indexed. The features extracted from the processing are streamed out to subscribers over the network; in addition, while data reside in the buffer, about 0.1‰ of them are served to remote clients upon request. Last but not least, the application must be able to persist data locally at full input speed when instructed to do so. The characteristics of the incoming data stream (fixed or variable rate, fixed or variable payload size) heavily influence the choice of implementation of the buffer management system. The application design promotes the separation of interfaces (concepts) and application-oriented specializations (models), which makes it possible to generalize most of the workflows and requires only minimal effort to integrate new data sources. After describing the application design, we present the hardware platform used for validation and benchmarking of the software, and the performance results obtained.
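As a rough illustration of the buffering scheme this abstract describes, the sketch below keeps incoming fragments in an insertion-ordered, indexed buffer bounded by a byte budget, drops the oldest fragments once the budget is exceeded, and serves fragments by identifier while they are still resident. The names `IndexedBuffer` and `retention_seconds`, the integer fragment identifiers, and the 128 GB budget in the usage note are assumptions made for illustration; the abstract does not describe the actual data structures or the DRAM size.

```python
from collections import OrderedDict
from typing import Optional


def retention_seconds(buffer_bytes: float, rate_bits_per_s: float) -> float:
    """Time a buffer of the given size can hold data arriving at a fixed rate."""
    return buffer_bytes / (rate_bits_per_s / 8.0)


class IndexedBuffer:
    """Byte-bounded FIFO buffer with lookup by fragment identifier (hypothetical)."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._fragments = OrderedDict()  # fragment id -> payload, oldest first

    def insert(self, fragment_id: int, payload: bytes) -> None:
        """Store a fragment, then drop the oldest ones until the budget is met."""
        if fragment_id in self._fragments:
            # Replacing an existing fragment: release its bytes first.
            self.used_bytes -= len(self._fragments.pop(fragment_id))
        self._fragments[fragment_id] = payload
        self.used_bytes += len(payload)
        while self.used_bytes > self.capacity_bytes:
            _, dropped = self._fragments.popitem(last=False)
            self.used_bytes -= len(dropped)

    def get(self, fragment_id: int) -> Optional[bytes]:
        """Return the payload if it is still resident, otherwise None."""
        return self._fragments.get(fragment_id)
```

With these assumptions, `retention_seconds(128e9, 150e9)` gives about 6.8 s, consistent with the "several seconds" quoted above. A buffer created with `IndexedBuffer(10)` that receives three 4-byte fragments keeps only the last two.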
The ATLAS experiment at the Large Hadron Collider (LHC) operated very successfully in the years 2008 to 2013, a period identified as Run 1. ATLAS achieved an overall data-taking efficiency of 94%, largely constrained by the irreducible dead-time introduced to accommodate the limitations of the detector read-out electronics. Out of the 6% dead-time only about 15% could be attributed to the central trigger and DAQ system, and out of these, a negligible fraction was due to the Control and Configuration sub-system. Despite these achievements, and in order to improve the efficiency of the whole DAQ system in Run 2 (2015-2018), the first long LHC shutdown (2013-2014) was used to carry out a complete revision of the control and configuration software. The goals were three-fold: properly accommodate additional requirements that could not be seamlessly included during steady operation of the system; re-factor software that had been repeatedly modified to include new features, thus becoming less maintainable; and seize the opportunity to modernize software written even before Run 1, thus profiting from the rapid evolution in IT technologies. This upgrade was carried out under the important constraint of minimally impacting the mode of operation of the system and its public APIs, in order to maximize the acceptance of the changes by the large user community. This paper presents, using a few selected examples, how the work was approached, which new technologies were introduced into the ATLAS DAQ system, and how they performed in the course of Run 2. Although these are specific to this system, many of the solutions can be considered for and adapted to other distributed DAQ systems.
The ATLAS experiment at CERN is planning full deployment of a new unified optical link technology for connecting detector front-end electronics on the timescale of the LHC Run 4 (2025). It is estimated that roughly 8000 GBT (GigaBit Transceiver) links, with transfer rates up to 10.24 Gbps, will replace existing links used for readout, detector control and distribution of timing and trigger information. A new class of devices will be needed to interface many GBT links to the rest of the trigger, data-acquisition and detector control systems. In this paper FELIX (Front End LInk eXchange) is presented, a PC-based device to route data from and to multiple GBT links via a high-performance general-purpose network capable of a total throughput up to O(20 Tbps). FELIX implies architectural changes to the ATLAS data acquisition system, such as the use of industry-standard COTS components early in the DAQ chain. Additionally, the design and implementation of a FELIX demonstration platform are presented, and hardware and software aspects are discussed.
The ATLAS experiment at the Large Hadron Collider at CERN relies on a complex and highly distributed Trigger and Data Acquisition (TDAQ) system to gather and select particle collision data obtained at unprecedented energy and rates. The Run Control (RC) system is the component steering the data acquisition by starting and stopping processes and by carrying all data-taking elements through well-defined states in a coherent way. Taking into account the lessons learnt during LHC's Run 1, the RC was completely re-designed and re-implemented during the LHC Long Shutdown 1 (LS1) phase. As a result of the new design, the RC is assisted by the Central Hint and Information Processor (CHIP) service, which can truly be considered its "brain". CHIP is an intelligent system able to supervise the ATLAS data taking, take operational decisions and handle abnormal conditions. In this paper, the design, implementation and performance of the RC and CHIP systems are described. Particular emphasis is put on the way the RC and CHIP cooperate and on the benefits brought by the Complex Event Processing engine. Additionally, some error recovery scenarios are analysed for which the intervention of human experts is no longer necessary.
The ATLAS Phase-I upgrade (2019) requires a Trigger and Data Acquisition (TDAQ) system able to trigger and record data from up to three times the nominal LHC instantaneous luminosity. The Front-End LInk eXchange (FELIX) system provides an infrastructure to achieve this in a scalable, detector-agnostic and easily upgradeable way. It is a PC-based gateway, interfacing custom radiation-tolerant optical links from front-end electronics, via PCIe Gen3 cards, to a commodity switched Ethernet or InfiniBand network. FELIX makes it possible to reduce custom electronics in favour of software running on commercial servers. Here, the FELIX system, the design of the PCIe prototype card and the integration test results are presented.
The Resource Manager is one of the core components of the Data Acquisition system of the ATLAS experiment at the LHC. The Resource Manager marshals the right of applications to access resources which may exist in multiple but limited copies, in order to avoid conflicts due to program faults or operator errors. Access to resources is managed in a manner similar to what a lock manager would do in other software systems. All the available resources and their association to software processes are described in the Data Acquisition configuration database. The Resource Manager is queried about the availability of resources every time an application needs to be started. The Resource Manager's design is based on a client-server model, hence it consists of two components: the Resource Manager "server" application and the "client" shared library. The Resource Manager server implements all the needed functionality, while the Resource Manager client library provides remote access to the "server" (i.e., to allocate and free resources, and to query the status of resources). During the LHC's Long Shutdown period, the Resource Manager's requirements were reviewed in light of the experience gained during the LHC's Run 1. As a consequence, the Resource Manager has undergone a full re-design and re-implementation cycle, resulting in a reduction of the code base by 40% with respect to the previous implementation. This contribution will focus on the way the design and the implementation of the Resource Manager could leverage the new features available in the C++11 standard, and how the introduction of external libraries (like Boost multi-container) led to a more maintainable system. Additionally, particular attention will be given to the technical solutions adopted to ensure the Resource Manager could sustain the typical request rates of the Data Acquisition system, which are about 30000 requests in a time window of a few seconds coming from more than 1000 clients.
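The lock-manager-like behaviour described in this abstract (resources available in a limited number of copies, with allocate, free and status-query operations) can be sketched as follows. This is a minimal in-process illustration written in Python rather than the C++11 of the real system; the class name, method names and the `readout_link` resource are hypothetical, and the client-server transport and configuration database are left out.

```python
import threading
from typing import Dict, Tuple


class ResourceManager:
    """Toy allocator for resources that exist in a limited number of copies."""

    def __init__(self, copies: Dict[str, int]) -> None:
        self._free: Dict[str, int] = dict(copies)      # resource -> free copies
        self._held: Dict[Tuple[str, str], int] = {}    # (client, resource) -> held copies
        self._lock = threading.Lock()                  # serializes concurrent requests

    def allocate(self, client: str, resource: str) -> bool:
        """Grant one copy to the client, or refuse if none is left."""
        with self._lock:
            if self._free.get(resource, 0) == 0:
                return False
            self._free[resource] -= 1
            key = (client, resource)
            self._held[key] = self._held.get(key, 0) + 1
            return True

    def free(self, client: str, resource: str) -> bool:
        """Return one copy held by the client; refuse if it holds none."""
        with self._lock:
            key = (client, resource)
            if self._held.get(key, 0) == 0:
                return False
            self._held[key] -= 1
            self._free[resource] += 1
            return True

    def available(self, resource: str) -> int:
        """Number of copies currently free."""
        with self._lock:
            return self._free.get(resource, 0)
```

With two copies of `readout_link`, a third `allocate` call is refused until one of the holders calls `free`. Refusing a `free` from a client that holds nothing is one way to guard against the program faults and operator errors the abstract mentions.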
Complex Event Processing (CEP) is a methodology that combines data from many sources in order to identify events or patterns that need particular attention. It has gained a lot of momentum in the computing world in the past few years and is used in ATLAS to continuously monitor the behaviour of the data acquisition system, to trigger corrective actions and to guide the experiment's operators. This technology is very powerful, provided that experts regularly insert and update their knowledge about the system's behaviour into the CEP engine. Nevertheless, writing or modifying CEP rules is not trivial, since the programming paradigm used is quite different from what developers are normally familiar with. In order to help experts verify that the rules work as expected, we have developed a complete testing and validation environment. This system consists of three main parts: the first is a reader that retrieves, from existing storage, all relevant data streams produced during data taking; the second is a playback tool that re-injects data from specific past data-taking sessions into the CEP engine; and the third is a reporting tool that shows the output that the rules loaded into the engine would have produced in the live system. In this paper we describe the design and implementation of this validation system, highlight its strengths and shortcomings and indicate how such a system could be reused in similar projects.
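The read, playback and report cycle described in this abstract can be illustrated with a small sketch. No real CEP engine is used here: a plain Python function stands in for a rule, whereas an actual engine would express it in its own rule language. The event layout, the rule (a burst of errors from one application) and all names are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, application, severity)
Rule = Callable[[Event], List[str]]


def make_burst_rule(threshold: int, window_s: float) -> Rule:
    """Build a rule that fires when one application logs `threshold` errors within `window_s`."""
    recent: Dict[str, deque] = {}  # application -> timestamps of recent errors

    def rule(event: Event) -> List[str]:
        ts, app, severity = event
        if severity != "ERROR":
            return []
        stamps = recent.setdefault(app, deque())
        stamps.append(ts)
        # Forget errors that fell out of the sliding time window.
        while stamps and ts - stamps[0] > window_s:
            stamps.popleft()
        if len(stamps) >= threshold:
            stamps.clear()  # avoid firing again on the same burst
            return [f"{ts:.1f}s: {app} produced {threshold} errors within {window_s:.0f}s"]
        return []

    return rule


def replay(events: Iterable[Event], rules: List[Rule]) -> List[str]:
    """Feed recorded events to each rule in timestamp order and collect what would have fired."""
    report: List[str] = []
    for event in sorted(events, key=lambda e: e[0]):
        for rule in rules:
            report.extend(rule(event))
    return report


recorded = [
    (0.0, "app_a", "ERROR"), (1.0, "app_b", "ERROR"), (2.0, "app_a", "ERROR"),
    (3.0, "app_a", "INFO"), (4.0, "app_a", "ERROR"), (30.0, "app_a", "ERROR"),
]
report = replay(recorded, [make_burst_rule(threshold=3, window_s=10.0)])
```

Replaying the same recorded session after editing a rule and comparing the two reports is the kind of check such a validation environment makes possible.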
A large experiment like ATLAS at the LHC (CERN), with over three thousand members and a shift crew of 15 people running the experiment 24/7, needs an easy and reliable tool to gather all the information concerning the experiment's development, installation, deployment and exploitation over its lifetime. With the increasing number of users and the accumulation of stored information since the experiment start-up, the electronic logbook currently in use, ATLOG, started to show its limitations in terms of speed and usability. Its monolithic architecture makes maintenance and the implementation of new functionality difficult, at times almost impossible. A new tool, ELisA, has been developed to replace the existing ATLOG. It is based on modern web technologies: the Spring framework with a Model-View-Controller architecture was chosen, which helps in building flexible and easy-to-maintain applications. The new tool implements all features of the old electronic logbook with increased performance and better graphics; it uses the same database back-end for portability reasons. In addition, several new requirements have been accommodated which could not be implemented in ATLOG. This paper describes the architecture, implementation and performance of ELisA, with particular emphasis on the choices that made a scalable and very fast system possible and on the aspects that could be re-used in different contexts to build a similar application.
The Trigger and Data Acquisition (TDAQ) system of the ATLAS experiment is a very complex distributed computing system, composed of more than 20000 applications running on more than 2000 computers. The TDAQ Controls system has to guarantee the smooth and synchronous operation of all the TDAQ components and has to provide the means to minimize the downtime of the system caused by runtime failures. During data-taking runs, streams of information messages sent or published by running applications are the main sources of knowledge about the correctness of running operations. The huge flow of operational monitoring data produced is constantly monitored by experts in order to detect problems or misbehaviours. Given the scale of the system and the rates of data to be analyzed, the automation of the system functionality in the areas of operational monitoring, system verification, error detection and recovery is a strong requirement. To accomplish its objective, the Controls system includes some high-level components which are based on advanced software technologies, namely the rule-based Expert System and the Complex Event Processing engines. The chosen techniques make it possible to formalize, store and reuse the knowledge of experts and thus to assist the shifters in the ATLAS control room during the data-taking activities.
Abstract
The NA62 experiment reports an investigation of the $K^{+}\to\pi^{+}\nu\overline{\nu}$ mode from a sample of $K^{+}$ decays collected in 2017 at the CERN SPS. The experiment has achieved a single event sensitivity of $(0.389 \pm 0.024)\times 10^{-10}$, corresponding to 2.2 events assuming the Standard Model branching ratio of $(8.4 \pm 1.0)\times 10^{-11}$. Two signal candidates are observed with an expected background of 1.5 events. Combined with the result of a similar analysis conducted by NA62 on a smaller data set recorded in 2016, the collaboration now reports an upper limit of $1.78\times 10^{-10}$ for the $K^{+}\to\pi^{+}\nu\overline{\nu}$ branching ratio at 90% CL. This, together with the corresponding 68% CL measurement of $\left(0.48^{+0.72}_{-0.48}\right)\times 10^{-10}$, are currently the most precise results worldwide, and are able to constrain some New Physics models that predict large enhancements still allowed by previous measurements.
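The expected signal yield quoted in this abstract follows from dividing the assumed Standard Model branching ratio by the single event sensitivity. A quick check with the central values given above, ignoring the uncertainties (the symbols $N_{\mathrm{exp}}$, $\mathrm{BR}_{\mathrm{SM}}$ and $\mathrm{SES}$ are notation introduced here, not taken from the abstract):

$$ N_{\mathrm{exp}} = \frac{\mathrm{BR}_{\mathrm{SM}}}{\mathrm{SES}} = \frac{8.4\times 10^{-11}}{0.389\times 10^{-10}} \approx 2.2 $$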