The High-Luminosity upgrade of the Large Hadron Collider (LHC) will see the accelerator reach an instantaneous luminosity of 7 × 10³⁴ cm⁻² s⁻¹ with an average pileup of 200 proton-proton collisions. These conditions will pose an unprecedented challenge to the online and offline reconstruction software developed by the experiments. The computational complexity will exceed by far the expected increase in processing power for conventional CPUs, demanding an alternative approach. Industry and High-Performance Computing (HPC) centers are successfully using heterogeneous computing platforms to achieve higher throughput and better energy efficiency by matching each job to the most appropriate architecture. In this paper we describe the results of a heterogeneous implementation of the pixel track and vertex reconstruction chain on Graphics Processing Units (GPUs). The framework has been designed and developed to be integrated into the CMS reconstruction software, CMSSW. The speedup achieved by leveraging GPUs allows more complex algorithms to be executed, obtaining better physics output and a higher throughput.
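As a purely illustrative sketch of the pattern such a heterogeneous framework relies on (hypothetical names, not the actual CMSSW or Patatrack code), the C++ fragment below dispatches the same reconstruction step to a CPU or a GPU backend selected at run time:

    #include <iostream>
    #include <memory>
    #include <vector>

    // Hypothetical common interface: each backend reconstructs pixel tracks
    // from the same input hits; the framework picks the backend at run time.
    struct PixelTrackReconstructor {
        virtual ~PixelTrackReconstructor() = default;
        virtual std::vector<int> reconstruct(const std::vector<float>& hits) = 0;
    };

    struct CpuReconstructor : PixelTrackReconstructor {
        std::vector<int> reconstruct(const std::vector<float>& hits) override {
            // Serial CPU implementation (placeholder logic).
            return std::vector<int>(hits.size(), 0);
        }
    };

    struct GpuReconstructor : PixelTrackReconstructor {
        std::vector<int> reconstruct(const std::vector<float>& hits) override {
            // A real implementation would copy the hits to the device,
            // launch kernels and copy the tracks back asynchronously.
            return std::vector<int>(hits.size(), 1);
        }
    };

    std::unique_ptr<PixelTrackReconstructor> makeReconstructor(bool gpuAvailable) {
        if (gpuAvailable) return std::make_unique<GpuReconstructor>();
        return std::make_unique<CpuReconstructor>();
    }

    int main() {
        auto reco = makeReconstructor(/*gpuAvailable=*/false);
        auto tracks = reco->reconstruct({0.1f, 0.2f, 0.3f});
        std::cout << "reconstructed " << tracks.size() << " candidates\n";
    }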
Transcendental mathematical functions are one of the main hot spots of scientific applications. The usage of highly optimised, general-purpose mathematical libraries can mitigate this issue. On the other hand, a more comprehensive solution is the replacement of the generic mathematical functions with specific implementations targeting particular subdomains only. CptnHook is a tool that helps achieve this goal by monitoring the input values of the mathematical functions used in a given application, categorised according to the stack traces leading to their invocation. In this contribution we describe the design of CptnHook, the data format of its profiles and how measurements can be performed without instrumenting the user's code or requiring recompilation. We demonstrate that this approach scales on production workflows of LHC experiments and characterise a set of real-life measurements, showing where opportunities for improvement lie and how the tool can be used for advanced debugging. We also illustrate how elegant summaries of the measurements can be produced and how ROOT-based analysis of the profiles can be performed.
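The measurement-without-recompilation idea rests on symbol interposition: a shared library preloaded into the process intercepts the libm entry points. A minimal sketch of that mechanism is shown below; it is not CptnHook itself, and the real tool additionally records the calling stack trace and writes a structured profile.

    #include <cstdio>
    #include <dlfcn.h>

    // Build as a shared library (g++ -shared -fPIC -o libhook.so hook.cc -ldl)
    // and preload it with LD_PRELOAD: every call to exp() in the target
    // application is then observed without recompiling or instrumenting it.
    extern "C" double exp(double x) noexcept {
        using exp_t = double (*)(double);
        static exp_t real_exp =
            reinterpret_cast<exp_t>(dlsym(RTLD_NEXT, "exp"));  // next definition in link order, i.e. libm
        std::fprintf(stderr, "exp(%g)\n", x);  // a real profiler would bin the value instead of printing it
        return real_exp(x);
    }

The application is then run unchanged, e.g. LD_PRELOAD=./libhook.so ./application.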
As Moore's Law drives the silicon industry towards higher transistor counts, processor designs are becoming more and more complex. The areas of development include core count, execution ports, vector units, uncore architecture and, finally, instruction sets. This increasing complexity has brought us to a point where access to shared memory is the major limiting factor, making it a real challenge to feed the cores with data. On the other hand, the strong focus on power efficiency paves the way for power-aware computing and brings less complex architectures into data centers. In this paper we examine these trends and present the results of our experiments with the Haswell-EP processor family and highly scalable HEP workloads.
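The memory-bandwidth limitation can be illustrated with a STREAM-style triad kernel (a generic microbenchmark sketch, not one of the workloads studied in the paper): for arrays much larger than the caches, throughput is set by how fast data can be streamed from memory rather than by the arithmetic capability of the cores.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // STREAM-like triad: a[i] = b[i] + s * c[i]. For large arrays the cores
    // spend most of their time waiting on memory, so the measured bandwidth,
    // not the arithmetic throughput, limits performance.
    int main() {
        const std::size_t n = 1 << 24;            // ~17M doubles per array, well beyond any cache
        std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
        const double s = 3.0;

        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i) a[i] = b[i] + s * c[i];
        auto t1 = std::chrono::steady_clock::now();

        double seconds = std::chrono::duration<double>(t1 - t0).count();
        double bytes = 3.0 * n * sizeof(double);  // read b and c, write a
        std::printf("a[0]=%g, triad bandwidth: %.1f GB/s\n", a[0], bytes / seconds / 1e9);
    }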
The Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is a general-purpose particle detector and comprises the largest silicon-based tracking system built to date with 75 million individual readout channels. The precise reconstruction of particle tracks from this tremendous amount of input channels is a compute-intensive task. The foreseen LHC beam parameters for the next data taking period, starting in 2015, will result in an increase in the number of simultaneous proton-proton interactions and hence the number of particle tracks per event. Due to the stagnating clock frequencies of individual CPU cores, new approaches to particle track reconstruction need to be evaluated in order to cope with this computational challenge. Track finding methods that are based on cellular automata (CA) offer a fast and parallelizable alternative to the well-established Kalman filter-based algorithms. We present a new cellular automaton based track reconstruction, which copes with the complex detector geometry of CMS. We detail the specific design choices made to allow for a high-performance computation on GPU and CPU devices using the OpenCL framework. We conclude by evaluating the physics performance, as well as the computational properties of our implementation on various hardware platforms and show that a significant speedup can be attained by using GPU architectures while achieving a reasonable physics performance at the same time.
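To make the idea concrete, the following toy C++ fragment evolves a heavily simplified cellular automaton of the kind used for track finding (hypothetical structures, not the CMS implementation): cells are hit pairs on adjacent layers, neighbouring cells share a hit, and every cell raises its state when an inner neighbour has the same state, so the cells terminating long compatible chains end up with the highest states. Because all cells are updated independently in each iteration, the scheme maps naturally onto parallel hardware.

    #include <cstdio>
    #include <vector>

    // Toy cellular automaton in the spirit of CA-based track finding.
    struct Cell {
        int innerHit, outerHit;            // indices of the two hits forming the cell
        int state = 0;
        std::vector<int> innerNeighbours;  // cells whose outer hit is our inner hit
    };

    int main() {
        // Four hits on four layers forming one chain of cells: (0-1), (1-2), (2-3).
        std::vector<Cell> cells = {{0, 1}, {1, 2}, {2, 3}};
        cells[1].innerNeighbours = {0};
        cells[2].innerNeighbours = {1};

        bool changed = true;
        while (changed) {                  // CA evolution: all cells updated in lock-step
            changed = false;
            std::vector<int> next(cells.size());
            for (std::size_t i = 0; i < cells.size(); ++i) {
                next[i] = cells[i].state;
                for (int n : cells[i].innerNeighbours)
                    if (cells[n].state == cells[i].state) { ++next[i]; break; }
            }
            for (std::size_t i = 0; i < cells.size(); ++i) {
                if (next[i] != cells[i].state) changed = true;
                cells[i].state = next[i];
            }
        }
        for (std::size_t i = 0; i < cells.size(); ++i)
            std::printf("cell %zu state %d\n", i, cells[i].state);  // the outermost cell of the chain has the highest state
    }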
The processing of data acquired by the CMS detector at the LHC is carried out with an object-oriented C++ software framework: CMSSW. With the increasing luminosity delivered by the LHC, the treatment of recorded data requires extraordinarily large computing resources, also in terms of CPU usage. A possible solution to cope with this task is the exploitation of the features offered by the latest microprocessor architectures. Modern CPUs present several vector units, the capacity of which is growing steadily with the introduction of new processor generations. Moreover, an increasing number of cores per die is offered by the main vendors, even on consumer hardware. Most recent C++ compilers provide facilities to take advantage of such innovations, either through explicit statements in the program sources or by automatically adapting the generated machine instructions to the available hardware, without the need to modify the existing code base. Programming techniques to implement reconstruction algorithms and optimised data structures are presented that aim at scalable vectorization and parallelization of the calculations, making use of new language features of the C++11 standard. Portions of the CMSSW framework are illustrated which have been found to be especially profitable for the application of vectorization and multi-threading techniques. Specific utility components have been developed to help vectorization and parallelization; they can easily become part of a larger common library. To conclude, careful measurements are described which show the execution speedups achieved via vectorised and multi-threaded code in the context of CMSSW.
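One recurring ingredient of such optimised data structures is the combination of a structure-of-arrays layout with loops that the compiler can auto-vectorise. The fragment below is a minimal, hypothetical illustration of that pattern rather than actual CMSSW code:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Structure-of-arrays layout: keeping each coordinate in its own contiguous
    // array lets the compiler auto-vectorise simple loops over many hits,
    // whereas an array of structs would interleave the members in memory.
    struct HitsSoA {
        std::vector<float> x, y, z;
    };

    // Squared transverse radius for every hit; with -O2/-O3 compilers typically
    // emit SIMD instructions for this loop without any change to the source.
    void radii2(const HitsSoA& hits, std::vector<float>& r2) {
        const std::size_t n = hits.x.size();
        r2.resize(n);
        for (std::size_t i = 0; i < n; ++i)
            r2[i] = hits.x[i] * hits.x[i] + hits.y[i] * hits.y[i];
    }

    int main() {
        HitsSoA hits{{1.f, 2.f}, {3.f, 4.f}, {0.f, 0.f}};
        std::vector<float> r2;
        radii2(hits, r2);
        std::printf("r2[0]=%g r2[1]=%g\n", r2[0], r2[1]);
    }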
Non-event data describing detector conditions change with time and come from different data sources. They are accessible to physicists within the offline event-processing applications for precise calibration of reconstructed data as well as for data-quality control purposes. Over the past years CMS has developed and deployed a software system managing such data. Object-relational mapping and the relational abstraction layer of the LHC persistency framework are its foundation; the offline condition framework updates and delivers C++ data objects according to their validity. A high-level tag versioning system allows production managers to organise the data in a hierarchical view. A Python scripting API, command-line tools and a web service serve physicists in their daily work. A mini-framework is available for handling data coming from external sources. Efficient data distribution over the worldwide network is guaranteed by a system of hierarchical web caches. The system has been tested and used in all major productions, test beams and cosmic runs.
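The central mechanism, delivering the payload whose interval of validity (IOV) covers the event being processed, can be sketched as a simple ordered lookup. The following is a toy illustration with made-up payload names, not the actual CMS condition framework:

    #include <cstdio>
    #include <map>
    #include <string>

    // Each payload is keyed by the first run for which it is valid ("since");
    // the payload delivered for a given run is the one with the largest
    // "since" value not exceeding that run. We assume the run is covered
    // by at least one IOV.
    int main() {
        std::map<unsigned, std::string> iovs = {
            {1,     "pixel gains v1"},
            {5000,  "pixel gains v2"},
            {12000, "pixel gains v3"},
        };

        unsigned run = 8000;
        auto it = iovs.upper_bound(run);   // first IOV starting after the run...
        --it;                              // ...so the previous one covers it
        std::printf("run %u -> %s\n", run, it->second.c_str());
    }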
The CMS experiment at the LHC has a very large body of software of its own and makes extensive use of software from outside the experiment. Understanding the performance of such a complex system is a very challenging task, not least because there are extremely few developer tools capable of profiling software systems of this scale or producing useful reports. CMS has mainly used IgProf, valgrind, callgrind and OProfile for analysing the performance and memory usage patterns of its software. We describe the challenges, at times rather extreme ones, faced as we have analysed the performance of our software and how we have developed an understanding of its performance features. We outline the key lessons learnt so far and the actions taken to make improvements. We describe why an in-house general profiler tool still ends up besting a number of renowned open-source tools, and the improvements we have made to it over the past year.
CMS has had an ongoing and dedicated effort to optimize software performance for several years. Initially this effort focused primarily on the cleanup of many issues coming from basic C++ errors, namely reducing dynamic memory churn and unnecessary copies/temporaries, and on tools to routinely monitor these issues. Over the past 1.5 years, however, the transition to 64-bit, newer versions of the gcc compiler, newer tools and the enabling of techniques such as vectorization have made more sophisticated improvements to the software performance possible. This presentation will cover this evolution and describe the current avenues being pursued for software performance, as well as the corresponding gains.
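For illustration, the kind of basic C++ issue targeted in that first phase can be as simple as the following (a generic example, not code taken from CMSSW): the second function avoids the repeated reallocations and temporary copies incurred by the first.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Avoidable reallocations ("memory churn") and unnecessary copies,
    // shown schematically.
    std::vector<std::string> labelsChurny(std::size_t n) {
        std::vector<std::string> out;
        for (std::size_t i = 0; i < n; ++i)
            out.push_back(std::to_string(i));    // repeated reallocations as the vector grows
        return out;
    }

    std::vector<std::string> labelsTidy(std::size_t n) {
        std::vector<std::string> out;
        out.reserve(n);                           // one allocation up front
        for (std::size_t i = 0; i < n; ++i)
            out.emplace_back(std::to_string(i));  // the temporary is moved into place, not copied
        return out;                               // moved (or elided), not copied
    }

    int main() {
        auto a = labelsChurny(1000);
        auto b = labelsTidy(1000);
        return a.size() == b.size() ? 0 : 1;
    }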
In the context of the LHC Computing Grid (LCG) project, the Applications Area develops and maintains that part of the physics applications software and associated infrastructure that is shared among the LHC experiments. The "Physicist Interface" (PI) project of the LCG Applications Area encompasses the interfaces and tools through which physicists will directly use the software, providing implementations based on agreed standards such as the AIDA (Abstract Interfaces for Data Analysis) interfaces for data analysis. In collaboration with users from the experiments, work has started on implementing the AIDA interfaces for (binned and unbinned) histogramming, fitting and minimization, as well as the manipulation of tuples. These implementations have been developed by re-using existing packages, either directly or through a (thin) layer of wrappers. In addition, bindings of these interfaces to the Python interpreted language have been created using the dictionary subsystem of the LCG Applications Area SEAL project. The current status and the future planning of the project will be presented.
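As a schematic illustration of what a binned-histogramming interface of this kind provides (hypothetical classes, not the actual AIDA API), a minimal C++ histogram with a fill method could look as follows:

    #include <cstdio>
    #include <vector>

    // Toy 1D binned histogram: fixed equidistant binning and fill() with an
    // optional weight, in the spirit of an AIDA-style interface.
    class Histogram1D {
    public:
        Histogram1D(int nBins, double low, double high)
            : bins_(nBins, 0.0), low_(low), high_(high) {}

        void fill(double x, double weight = 1.0) {
            if (x < low_ || x >= high_) return;  // ignore out-of-range entries
            int bin = static_cast<int>((x - low_) / (high_ - low_) * bins_.size());
            bins_[bin] += weight;
        }

        double binHeight(int i) const { return bins_[i]; }

    private:
        std::vector<double> bins_;
        double low_, high_;
    };

    int main() {
        Histogram1D h(10, 0.0, 1.0);
        h.fill(0.15);
        h.fill(0.17, 2.0);
        std::printf("bin 1 content: %g\n", h.binHeight(1));  // prints 3
    }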