The 2020 upgrade of the LHCb detector will vastly increase the rate of collisions that the Online system needs to process in software in order to filter events in real time. 30 million collisions per second will pass through a selection chain, where each step is executed conditionally on acceptance by the previous one. The Kalman Filter is a fit applied to all reconstructed tracks which, due to its time characteristics and early execution in the selection chain, consumes 40% of the whole reconstruction time in the current trigger software. This makes the Kalman Filter a time-critical component as the LHCb trigger evolves into a full software trigger in the Upgrade. I present a new Kalman Filter algorithm for LHCb that can efficiently make use of any kind of SIMD processor, and explain its design in depth. Performance benchmarks are compared across a variety of hardware architectures, including x86_64, Power8 and the Intel Xeon Phi accelerator, and the suitability of these architectures for efficiently performing the LHCb reconstruction is assessed.
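For reference, the track fit in question is, in its standard textbook form, an alternation of prediction and update steps over the track state; the concrete LHCb transport and projection matrices are not reproduced here.

    \hat{x}_{k|k-1} = F_k \, \hat{x}_{k-1|k-1}, \qquad
    P_{k|k-1} = F_k P_{k-1|k-1} F_k^{\top} + Q_k

    K_k = P_{k|k-1} H_k^{\top} \left( H_k P_{k|k-1} H_k^{\top} + R_k \right)^{-1}

    \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H_k \, \hat{x}_{k|k-1} \right), \qquad
    P_{k|k} = \left( I - K_k H_k \right) P_{k|k-1}

Here x is the track state, P its covariance, F_k the transport matrix of the prediction, Q_k the process noise, H_k the projection to the measurement space, z_k the measurement, R_k its covariance and K_k the Kalman gain.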
Summary
The Kalman filter is a fundamental process in the reconstruction of particle collisions in high-energy physics detectors. At the LHCb detector in the Large Hadron Collider, this reconstruction happens at an average rate of 30 million times per second. Due to iterative enhancements in the detector's technology, together with the projected removal of the hardware filter, the rate of particles that will need to be processed in software in real time is expected to increase in the coming years by a factor of 40. In order to cope with the projected data rate, processing and filtering software must be adapted to take advantage of cutting-edge hardware technologies. We present Cross Kalman, a cross-architecture Kalman filter optimized for low-rank problems and SIMD architectures. We explore multi- and many-core architectures and compare their performance in single- and double-precision configurations. We show that, under the constraints of our mathematical formulation, we saturate the architectures under study. We validate our results and integrate our filter into the LHCb framework. Our work will allow better use of the available resources at the LHCb experiment and enables us to evaluate other computing platforms for future hardware upgrades. Finally, we expect that the presented algorithm and data structures can be easily adapted to other applications of low-rank Kalman filters.
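As an illustration of the kind of data layout such a SIMD filter relies on, the sketch below groups track states in a structure-of-arrays so that one update step is applied to several tracks per instruction. All type and field names are hypothetical, the state is simplified to one dimension, and this is not the Cross Kalman implementation itself, only a minimal example of the technique.

    #include <immintrin.h>  // AVX intrinsics; any SIMD instruction set could be substituted
    #include <vector>

    // Hypothetical structure-of-arrays layout: one entry per track, 1-D state for brevity.
    struct TrackStatesSoA {
        std::vector<float> x;   // state estimate
        std::vector<float> P;   // state covariance
    };

    // One scalar-measurement Kalman update applied to 8 tracks at a time.
    void update_batch(TrackStatesSoA& s, const float* z, float R) {
        const __m256 vR = _mm256_set1_ps(R);
        for (std::size_t i = 0; i + 8 <= s.x.size(); i += 8) {
            __m256 x = _mm256_loadu_ps(&s.x[i]);
            __m256 P = _mm256_loadu_ps(&s.P[i]);
            __m256 m = _mm256_loadu_ps(&z[i]);
            __m256 K = _mm256_div_ps(P, _mm256_add_ps(P, vR));              // K = P / (P + R)
            x = _mm256_add_ps(x, _mm256_mul_ps(K, _mm256_sub_ps(m, x)));    // x += K (z - x)
            P = _mm256_mul_ps(_mm256_sub_ps(_mm256_set1_ps(1.f), K), P);    // P = (1 - K) P
            _mm256_storeu_ps(&s.x[i], x);
            _mm256_storeu_ps(&s.P[i], P);
        }
    }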
The Data Acquisition (DAQ) system of LHCb is a complex real-time system. It will be upgraded to provide LHCb with an all-software, trigger-free readout starting from 2020. Consequently, more CPU power in the form of servers will be needed and the DAQ network will grow to a capacity of 40 Tbps. A PC-based readout system would receive data incoming from the detector, which would then be scattered across builder nodes and further distributed to a computing farm for data filtering. The design bandwidth of such a DAQ system requires rates as high as 400 Gbps single-duplex per node. These builder nodes will be connected with cost-effective, high-bandwidth data-centre switches in order to minimize the system cost. The behaviour of such an Event Building network can of course be studied in simulation, but experience tells us that it is crucial to test in practice, in particular to find the limitations of the switches themselves and to determine to what extent various Event Building protocols can mitigate them. We present a protocol-, topology- and transport-independent emulation software named DAQ Protocol-Independent Performance Evaluator (DAQPIPE). It allows us to test different communication architectures, such as push or pull, with regard to the initiator of the communication. Different topologies and transport protocols can also be tested. We present throughput and stress tests on an InfiniBand FDR multi-rail LAN setup, with a focus on network performance. Large-scale tests on the current LHCb DAQ system are shown to demonstrate the scalability of DAQPIPE itself and its capability to be deployed on any kind of large, tightly interconnected network to assess that network's suitability for Event Building applications.
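To make the push-style exchange mentioned above concrete, the sketch below forwards each detector fragment to the builder node responsible for its event. The fragment type, the round-robin assignment policy and the printf transport stub are illustrative assumptions; DAQPIPE hides the transport behind pluggable layers and its assignment policy is configurable.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical fragment descriptor; the real transport (MPI, IB verbs, TCP) is stubbed.
    struct Fragment { std::uint64_t event_id; std::size_t size; };

    static void send_to(int builder_rank, const Fragment& f) {
        std::printf("event %llu (%zu bytes) -> builder %d\n",
                    static_cast<unsigned long long>(f.event_id), f.size, builder_rank);
    }

    // Push protocol, simplified: each readout unit forwards every fragment to the
    // builder node in charge of that event, chosen here by a static assignment.
    void push_fragments(const std::vector<Fragment>& fragments, int n_builders) {
        for (const Fragment& f : fragments) {
            send_to(static_cast<int>(f.event_id % n_builders), f);
        }
    }

    int main() {
        push_fragments({{0, 1024}, {1, 980}, {2, 1120}, {3, 1008}}, 2);
    }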
• Local track reconstruction algorithms can be efficiently designed for parallel architectures.
• Search by triplet is an efficient local track reconstruction algorithm optimized for CPU and GPU parallel architectures.
• Search by triplet performs track reconstruction of the LHCb VELO detector at a rate of up to 592 kHz on a single GPU.
• Search by triplet achieves an average physics reconstruction efficiency of 98.52% for the LHCb VELO detector.
• Search by triplet is one of the main track reconstruction algorithms of the first software trigger stage of LHCb.
Millions of particles are collided every second at the LHCb detector, placed inside the Large Hadron Collider at CERN. The particles produced as a result of these collisions pass through various detecting devices, which will produce a combined raw data rate of up to 40 Tbps by 2021. These data will be fed through a data acquisition system which reconstructs individual particles and filters the collision events in real time. This process will occur in a heterogeneous farm employing exclusively off-the-shelf CPU and GPU hardware, in a two-stage process known as the High Level Trigger.
The reconstruction of charged-particle trajectories in physics detectors, also referred to as track reconstruction or tracking, determines the position, charge and momentum of particles as they pass through detectors. The Vertex Locator (VELO) is the subdetector closest to the beamline, placed outside the region where the LHCb magnet produces a sizable magnetic field. It is used to reconstruct straight particle trajectories, which serve as seeds for the reconstruction of other subdetectors and to locate collision vertices. The VELO subdetector will detect up to 10^9 particles every second, which need to be reconstructed in real time in the High Level Trigger.
We present Search by triplet, an efficient track reconstruction algorithm designed to run efficiently across parallel architectures. We extend previous work and explain the evolution of the algorithm since its inception. We show the scaling of our algorithm under various conditions, analyse the amortized complexity of each of its constituent parts, and profile its performance. Our algorithm is the current state-of-the-art in VELO track reconstruction on SIMT architectures, and we qualify its improvements over previous results.
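By way of illustration only, the sketch below shows the core triplet-seeding idea: for every hit pair in two consecutive VELO layers, keep the hit in a third layer that best continues the straight line. The Hit, Layer and Triplet types are hypothetical, the search is brute force, and the real algorithm restricts search windows on sorted structure-of-arrays containers and parallelises over candidates.

    #include <vector>

    // Hypothetical hit and layer types.
    struct Hit { float x, y, z; };
    using Layer = std::vector<Hit>;
    struct Triplet { int h0, h1, h2; };

    // Alignment criterion: extrapolate the h0->h1 segment to the z of h2 and measure
    // how far h2 is from the extrapolation (VELO tracks are straight lines).
    static float alignment(const Hit& a, const Hit& b, const Hit& c) {
        const float t  = (c.z - a.z) / (b.z - a.z);
        const float dx = a.x + t * (b.x - a.x) - c.x;
        const float dy = a.y + t * (b.y - a.y) - c.y;
        return dx * dx + dy * dy;
    }

    // For every hit pair in two consecutive layers, keep the best-aligned third-layer hit.
    std::vector<Triplet> seed_triplets(const Layer& l0, const Layer& l1, const Layer& l2,
                                       float max_chi2) {
        std::vector<Triplet> seeds;
        for (int i0 = 0; i0 < static_cast<int>(l0.size()); ++i0)
            for (int i1 = 0; i1 < static_cast<int>(l1.size()); ++i1) {
                int best = -1;
                float best_chi2 = max_chi2;
                for (int i2 = 0; i2 < static_cast<int>(l2.size()); ++i2) {
                    const float chi2 = alignment(l0[i0], l1[i1], l2[i2]);
                    if (chi2 < best_chi2) { best_chi2 = chi2; best = i2; }
                }
                if (best >= 0) seeds.push_back({i0, i1, best});
            }
        return seeds;
    }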
During data taking at the LHC at CERN, millions of collisions are recorded every second by the LHCb detector. The LHCb Online computing farm, counting around 15000 cores, is dedicated to the reconstruction of the events in real time, in order to filter those containing interesting physics. The events kept are later analysed offline in a more precise fashion on the Grid. This imposes very stringent requirements on the reconstruction software, which has to be as efficient as possible. Modern CPUs support so-called vector extensions, which extend their instruction sets to allow concurrent execution across functional units. Several libraries expose the Single Instruction Multiple Data programming paradigm to issue these instructions. The use of vectorisation in our codebase can provide performance boosts, ultimately leading to improvements in the physics reconstruction. In this paper, we present vectorisation studies of significant reconstruction algorithms. A variety of vectorisation libraries are analysed and compared in terms of design, maintainability and performance. We also present the steps taken to systematically measure the performance of the released software, to ensure consistency in the run-time of the vectorised software.
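For illustration only (none of this code is from the paper, and the libraries under comparison are not named here), the fragment below shows the same loop in scalar form and in explicit eight-wide AVX form; SIMD wrapper libraries essentially let the second version be written with the readability of the first.

    #include <immintrin.h>
    #include <cstddef>

    // Scalar reference: weighted residual accumulation over n hits.
    float chi2_scalar(const float* res, const float* w, std::size_t n) {
        float sum = 0.f;
        for (std::size_t i = 0; i < n; ++i) sum += w[i] * res[i] * res[i];
        return sum;
    }

    // Explicitly vectorised version (AVX, 8 floats per instruction); a SIMD wrapper
    // library would generate equivalent code from near-scalar syntax.
    float chi2_avx(const float* res, const float* w, std::size_t n) {
        __m256 acc = _mm256_setzero_ps();
        std::size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            const __m256 r = _mm256_loadu_ps(res + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(_mm256_loadu_ps(w + i),
                                                   _mm256_mul_ps(r, r)));
        }
        alignas(32) float lanes[8];
        _mm256_store_ps(lanes, acc);
        float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3]
                  + lanes[4] + lanes[5] + lanes[6] + lanes[7];
        for (; i < n; ++i) sum += w[i] * res[i] * res[i];  // scalar tail
        return sum;
    }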
Real-time data processing is one of the central processes of particle physics experiments, which require large computing resources. The LHCb (Large Hadron Collider beauty) experiment will be upgraded to cope with a particle bunch collision rate of 30 million times per second, producing 10^9 particles/s. 40 Tbit/s need to be processed in real time to make filtering decisions about which data to store. This poses a computing challenge that requires exploration of modern hardware and software solutions. We present Compass, a particle tracking algorithm and a parallel raw input decoding optimized for GPUs. It is designed for highly parallel architectures, is data-oriented and is optimized for fast and localized data access. Our algorithm is configurable, and we explore the trade-off in computing and physics performance of various configurations. A CPU implementation that delivers the same physics performance as our GPU implementation is also presented. We discuss the achieved physics performance and validate it with Monte Carlo simulated data. We show a computing performance analysis comparing consumer and server-grade GPUs, and a CPU. We show the feasibility of using a full GPU decoding and particle tracking algorithm for high-throughput reconstruction of particle trajectories, where our algorithm improves the throughput by up to 7.4× compared to the LHCb baseline.
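Parallel, data-oriented raw decoding of the kind mentioned above typically follows a count / prefix-sum / fill pattern, so that every worker knows where to write without synchronisation. The sketch below shows that general pattern in serial C++ with hypothetical bank structures; it is an assumption about the approach, not the actual Compass code.

    #include <cstdint>
    #include <numeric>
    #include <vector>

    // Hypothetical raw bank: an opaque payload that decodes to a known number of hits.
    struct RawBank { std::uint32_t n_hits; /* payload omitted */ };
    struct DecodedHit { float x, y, z; };

    // Count / prefix-sum / fill. On a GPU, steps 1 and 3 run with one thread (or block)
    // per bank and step 2 is a parallel exclusive scan; here everything is serial.
    std::vector<DecodedHit> decode(const std::vector<RawBank>& banks) {
        // 1. Count hits per bank.
        std::vector<std::uint32_t> offsets(banks.size() + 1, 0);
        for (std::size_t b = 0; b < banks.size(); ++b) offsets[b + 1] = banks[b].n_hits;
        // 2. Prefix sum gives every bank its write offset into the output.
        std::partial_sum(offsets.begin(), offsets.end(), offsets.begin());
        // 3. Fill: each bank writes its hits into its own, non-overlapping slice.
        std::vector<DecodedHit> hits(offsets.back());
        for (std::size_t b = 0; b < banks.size(); ++b)
            for (std::uint32_t i = 0; i < banks[b].n_hits; ++i)
                hits[offsets[b] + i] = DecodedHit{/* decoded from payload */ 0.f, 0.f, 0.f};
        return hits;
    }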
The LHCb Data Acquisition (DAQ) will be upgraded in 2020 to a trigger-free readout. In order to achieve this goal we will need to connect around 500 nodes with a total network capacity of 32 Tb/s. To reach such a high network capacity, we are testing zero-copy technology in order to get as close as possible to the theoretical link throughput without adding excessive CPU and memory bandwidth overhead, leaving resources free for data processing and resulting in less power, space and money used for the same result. We develop a modular test application which can be used with different transport layers. For the zero-copy implementation we choose the OFED IBVerbs API, because it provides low-level access and high throughput. We present throughput and CPU usage measurements of 40 GbE solutions using Remote Direct Memory Access (RDMA), for several network configurations, to test the scalability of the system.
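For context, posting a zero-copy transfer with the IB verbs API looks like the sketch below: the buffer is handed to the NIC by reference rather than copied. Queue-pair creation, connection setup and the exchange of the remote address and rkey with the peer are omitted, and the function name and parameters are placeholders, not code from the test application.

    #include <infiniband/verbs.h>
    #include <cstdint>

    // Post one zero-copy RDMA WRITE. The local buffer is assumed to have been registered
    // once at startup with ibv_reg_mr(), so the NIC reads it directly (no memcpy).
    int post_rdma_write(ibv_qp* qp, ibv_mr* mr, void* buf, std::uint32_t len,
                        std::uint64_t remote_addr, std::uint32_t rkey) {
        ibv_sge sge{};
        sge.addr   = reinterpret_cast<std::uintptr_t>(buf);
        sge.length = len;
        sge.lkey   = mr->lkey;

        ibv_send_wr wr{};
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   // request a completion queue entry
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        ibv_send_wr* bad = nullptr;
        return ibv_post_send(qp, &wr, &bad);          // completion is polled on the CQ later
    }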
The LHCb experiment is preparing a major upgrade, during Long Shutdown 2 in 2018, of both the detector and the data acquisition system. A system capable of transporting up to 50 Tbps of data will be required. This can only be achieved in a manageable way using 100 Gbps links. Such links have recently become available in servers as well, while they have been available between switches for a while. We present first measurements with such links using standard benchmarks and a prototype event-building application. We analyse the effect on CPU load of using Remote DMA technologies, and we also show a comparison with previous tests on 40G equipment.
The Large Hadron Collider beauty (LHCb) experiment is preparing a major upgrade resulting in the need for a high-end network for the data acquisition system. Its capacity will grow up to a target speed of 40 Tb/s, aggregated by 500 nodes. This can only be achieved reasonably by using links that are capable of coping with 100-Gb/s line rates. The constantly increasing need for more bandwidth has initiated the development of commercial 100-Gb/s networks. There are three candidates on the horizon that need to be considered: Intel Omni-Path, 100-G Ethernet, and EDR InfiniBand. We present test results with such links using both standard benchmarks (e.g., iperf) and a custom benchmark called the DAQ Protocol-Independent Performance Evaluator (DAQPIPE). With DAQPIPE, we mainly evaluate the ability to exploit the targeted network for a kind of all-to-all communication pattern. The key benefit of these measurements is that they help us tune our benchmark and improve our understanding of the relevant parameters. This will now permit us to prepare and motivate upcoming tests at scale on existing supercomputers offering the targeted hardware.
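A common way to schedule such an all-to-all exchange without oversubscribing any destination is a linear shift, in which at step s every node sends to (rank + s) mod N. The sketch below merely illustrates that pattern; DAQPIPE's actual traffic scheduling is configurable and is not reproduced here.

    #include <cstdio>

    // Linear-shift schedule for an N-node all-to-all exchange: at step s, node r sends to
    // (r + s) mod N and receives from (r - s + N) mod N, so no destination is hit twice
    // in the same step.
    void print_schedule(int rank, int n_nodes) {
        for (int step = 1; step < n_nodes; ++step) {
            const int dst = (rank + step) % n_nodes;
            const int src = (rank - step + n_nodes) % n_nodes;
            std::printf("step %d: node %d sends to %d, receives from %d\n",
                        step, rank, dst, src);
        }
    }

    int main() { print_schedule(/*rank=*/0, /*n_nodes=*/4); }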
A GPU offloading mechanism for LHCb. Badalov, Alexey; Campora Perez, Daniel Hugo; Zvyagin, Alexander ... Journal of Physics: Conference Series, vol. 513, no. 5, 01/2014. Peer-reviewed journal article, open access.
The current computational infrastructure at LHCb is designed for sequential execution. It is possible to make use of modern multi-core machines by using multi-threaded algorithms and running multiple instances in parallel, but there is no way to make efficient use of specialized massively parallel hardware, such as graphics processing units and the Intel Xeon Phi. We extend the current infrastructure with an out-of-process computational server able to gather data from multiple instances and process them in large batches.
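A minimal sketch of the gather-and-batch idea follows: client requests accumulate in a queue and a worker drains whatever has accumulated and submits it to the accelerator as one batch. The class and type names are hypothetical, and the inter-process transport (sockets or shared memory) that the real server would use is omitted.

    #include <condition_variable>
    #include <mutex>
    #include <vector>

    // Hypothetical work item submitted by a client instance.
    struct Task { std::vector<float> input; };

    // Gather-and-batch core of an offloading server: clients enqueue tasks, a worker
    // drains the queue and hands everything to the accelerator in one large batch.
    class BatchServer {
    public:
        void submit(Task t) {
            { std::lock_guard<std::mutex> lk(m_); queue_.push_back(std::move(t)); }
            cv_.notify_one();
        }
        void run() {
            for (;;) {
                std::vector<Task> batch;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return !queue_.empty(); });
                    batch.swap(queue_);               // take everything accumulated so far
                }
                process_on_accelerator(batch);        // one large launch per batch
            }
        }
    private:
        void process_on_accelerator(const std::vector<Task>&) { /* GPU / Xeon Phi work here */ }
        std::mutex m_;
        std::condition_variable cv_;
        std::vector<Task> queue_;
    };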