HEP applications perform an excessive number of allocations and deallocations within short time intervals, which results in memory churn, poor locality and performance degradation. These issues have been known for a decade, but due to the complexity of software frameworks and the billions of allocations in a single job, until recently no efficient mechanism was available to correlate these issues with source code lines. However, with the advent of the Big Data era, many tools and platforms are now available to do large-scale memory profiling. This paper presents a prototype program developed to track and identify every single (de-)allocation. The CERN IT Hadoop cluster is used to compute key memory metrics, like locality, variation, lifetime and density of allocations. The prototype further provides a web-based visualization back-end that allows the user to explore the results generated on the Hadoop cluster. Plotting these metrics for every single allocation over time gives new insight into an application's memory handling. For instance, it shows which algorithms cause which kinds of memory allocation patterns, which function flows create how many short-lived objects, which allocation sizes are most common, etc. The paper will give an insight into the prototype and will show profiling examples for LHC reconstruction, digitization and simulation jobs.
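As a rough illustration of the per-allocation metrics mentioned above (lifetime per call site, most common sizes), the Python sketch below aggregates them from a hypothetical (de-)allocation trace. The trace format and field names are assumptions chosen for illustration, not the prototype's actual data model, and at production scale this aggregation would run on the Hadoop cluster rather than in a single process.

```python
# Minimal sketch (not the paper's actual tool): per-callsite allocation
# lifetimes and a size histogram from a hypothetical allocation trace.
# Trace records (timestamp, event, address, size, callsite) are assumed.
from collections import defaultdict, Counter

def allocation_metrics(trace):
    """trace: iterable of (timestamp, event, address, size, callsite),
    where event is 'alloc' or 'free'."""
    live = {}                      # address -> (t_alloc, size, callsite)
    lifetimes = defaultdict(list)  # callsite -> list of allocation lifetimes
    size_hist = Counter()          # allocated size -> number of allocations
    for t, event, addr, size, callsite in trace:
        if event == "alloc":
            live[addr] = (t, size, callsite)
            size_hist[size] += 1
        elif event == "free" and addr in live:
            t0, _, cs = live.pop(addr)
            lifetimes[cs].append(t - t0)
    return lifetimes, size_hist

# Toy usage: two short-lived allocations from the same call site.
trace = [
    (0.0, "alloc", 0x1000, 32, "AlgoA::execute"),
    (0.1, "free",  0x1000, 32, "AlgoA::execute"),
    (0.2, "alloc", 0x2000, 32, "AlgoA::execute"),
    (0.9, "free",  0x2000, 32, "AlgoA::execute"),
]
lifetimes, size_hist = allocation_metrics(trace)
print(lifetimes)   # approximately {'AlgoA::execute': [0.1, 0.7]}
print(size_hist)   # Counter({32: 2})
```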
HEP applications need to adapt to the continuously increasing number of cores on modern CPUs. This must be done at different levels: the software must support parallelization, and the scheduling has to distinguish between multi-core and single-core jobs. The LHCb software framework (GAUDI) provides a parallel prototype (GaudiMP) based on the multiprocessing approach. It allows a reduction of the overall memory footprint and coordinated access to data via separate reader and writer processes. A comparison between the parallel prototype and multiple independent Gaudi jobs with respect to CPU time and memory consumption will be shown. Furthermore, the speedup must be predicted in order to find the limit beyond which the parallel prototype (GaudiMP) does not bring further scaling. This number must be known, as it indicates the point where new technologies must be introduced into the software framework. In order to reach further improvements in the overall throughput, scheduling strategies for mixing parallel jobs can be applied. This allows overcoming limitations in the speedup of the parallel prototype. Those changes require modifications at the level of the Workload Management System (DIRAC).
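One simple way to make such a scaling limit concrete is an Amdahl-style model: with a serial fraction s, the speedup on n cores flattens out, and the core count where the marginal gain becomes negligible marks the point beyond which the prototype no longer scales. The sketch below is only an illustration under that assumed model (serial fraction and gain threshold are hypothetical), not the prediction method used for GaudiMP.

```python
# Illustrative sketch: estimating the core count beyond which a
# multiprocessing prototype stops scaling, assuming an Amdahl-style model.
def amdahl_speedup(n_cores, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

def scaling_limit(serial_fraction, min_gain=0.05, max_cores=256):
    """Smallest core count where adding one more core improves the
    predicted speedup by less than min_gain."""
    for n in range(1, max_cores):
        gain = (amdahl_speedup(n + 1, serial_fraction)
                - amdahl_speedup(n, serial_fraction))
        if gain < min_gain:
            return n
    return max_cores

# Example: with an assumed 5% serial fraction the gains flatten quickly.
print(scaling_limit(0.05))
```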
In the past few years the increased luminosity of the LHC, changes in the Linux kernel, and the move to a 64-bit architecture have affected the memory usage of ATLAS jobs, and the ATLAS workload management system had to be adapted to become more flexible and to pass memory parameters to the batch systems, which was not necessary in the past. This paper describes the steps required to add the capability to better handle memory requirements, including a review of how each component's definition and parametrization of memory maps to the other components, and what changes had to be applied to make the submission chain work. These changes range from the definition of tasks and the way task memory requirements are set using scout jobs, through the new memory tool developed for this purpose, to how these values are used by the submission component of the system and how the jobs are treated by the sites through the CEs, batch systems and, ultimately, the kernel.
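A minimal sketch of how a task-level memory requirement could be derived from scout-job measurements before being handed to the submission chain is given below; the function name, safety margin and fallback value are hypothetical and only illustrate the idea, not the actual ATLAS memory tool or its parameters.

```python
# Hedged sketch: turning scout-job memory measurements into a task memory
# requirement. Margin and minimum are illustrative assumptions.
def task_memory_requirement(scout_rss_mb, safety_margin=1.2, minimum_mb=2000):
    """scout_rss_mb: maximum resident set sizes (MB) observed in scout jobs."""
    if not scout_rss_mb:
        return minimum_mb                    # no scouts finished yet: use a default
    estimate = max(scout_rss_mb) * safety_margin
    return max(int(round(estimate)), minimum_mb)

# Example: three scout jobs peaked at 3.1, 3.4 and 3.3 GB of RSS.
print(task_memory_requirement([3100, 3400, 3300]))  # 4080 MB requested from the batch system
```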
The LHCb experiment at CERN processes its datasets at over a hundred different grid sites within the Worldwide LHC Computing Grid (WLCG). Nowadays all those grid sites provide worker nodes with multi-core CPUs, and the number of cores per worker node will increase further in the near future. Using such worker nodes more efficiently requires parallelization of the software as well as modifications at the level of scheduling. This paper will evaluate a moldable job model for LHCb grid jobs, where the main challenge is the definition of the best degree of parallelism. Choosing an appropriate degree of parallelism depends on the parameters to be optimized; commonly used criteria are, for example, scalability, workload and turnaround time. Prediction of run time is another major problem, and it will be discussed how it can be handled using historical information. Furthermore, the advantages and disadvantages of a moldable job model will be discussed, as well as how it must be extended to meet the requirements of LHCb jobs.
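To make the choice of the degree of parallelism concrete, the sketch below picks the core count that minimizes a predicted turnaround time, combining a run-time prediction from a historical single-core run time with assumed parallel efficiencies and queue delays per slot size. All numbers and function names are illustrative assumptions, not LHCb's actual scheduling logic.

```python
# Illustrative moldable-job decision: choose the core count minimizing
# predicted turnaround = queue delay for an n-core slot + predicted run time.
def predicted_runtime(single_core_runtime, n_cores, efficiency):
    """efficiency: dict mapping core count -> parallel efficiency in (0, 1]."""
    return single_core_runtime / (n_cores * efficiency[n_cores])

def best_degree_of_parallelism(single_core_runtime, queue_delay, efficiency):
    return min(
        efficiency,
        key=lambda n: queue_delay[n]
        + predicted_runtime(single_core_runtime, n, efficiency),
    )

# Assumed historical data: efficiency drops with core count,
# wider slots take longer to obtain from the batch system.
efficiency = {1: 1.0, 2: 0.95, 4: 0.85, 8: 0.70}
queue_delay = {1: 600, 2: 900, 4: 1800, 8: 3600}   # seconds
print(best_degree_of_parallelism(36000, queue_delay, efficiency))  # 8 with these numbers
```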