Abstract
The GeoModel toolkit is an open-source suite of lightweight, standalone tools to describe, visualize, test, and debug detector descriptions and geometries for standalone HEP studies and experiments. GeoModel has been designed with independence and responsiveness in mind: it offers a development environment free of other large HEP tools and frameworks, with a very quick development cycle. With very few, lightweight dependencies, GeoModel is easy to install on all systems in a modular way, and pre-compiled binaries are provided for the major platforms for a quick and easy installation. Coded entirely in C++, GeoModel lets the user describe geometries in C++ code or in external XML files, create persistent representations with a low disk footprint, and interactively visualize and inspect the geometry in a 3D view. It also offers a plugin mechanism and an optional Geant4 application to simulate the described geometry in a standalone environment. GeoModel was developed as part of the software for the ATLAS experiment at the LHC and has evolved into an experiment-independent toolkit. In this contribution, we describe all the available tools, with a focus on the latest additions, which provide users with more visualization, debugging, and simulation tools.
The CREST project is a new realization of the conditions DB with a REST API and JSON support for the ATLAS experiment at the LHC. The project simplifies the conditions data structure and optimizes data access. CREST development requires not only the client C++ library (CrestApi) but also various tools for testing the software and validating the data. A command-line client enables quick access to the stored data. A set of utilities was used to dump data from CREST to the file system and to test the client library and the CREST server with dummy data. The CREST software is now being tested with real conditions data converted by the COOL-to-CREST converter. The Athena code (the ATLAS event processing software framework) was modified to operate with the new conditions data source.
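To make the simplified conditions data structure concrete, the following is a toy sketch of interval-of-validity (IOV) resolution over JSON-like payloads. The tag name, field names, and the tag-to-IOV-to-payload layout are purely illustrative, not the real CREST schema:

```python
import bisect

# Illustrative conditions store: a tag maps IOV start times to payload
# hashes, and payloads are stored separately as JSON-like objects.
IOVS = {"PixelAlign-RUN2-v1": [(0, "h0"), (1000, "h1"), (5000, "h2")]}
PAYLOADS = {"h0": {"dx": 0.0}, "h1": {"dx": 0.12}, "h2": {"dx": 0.15}}

def resolve(tag, since):
    """Return the payload valid at time 'since': the entry with the
    largest IOV start not exceeding 'since' (standard IOV semantics)."""
    iovs = IOVS[tag]
    starts = [s for s, _ in iovs]
    idx = bisect.bisect_right(starts, since) - 1
    if idx < 0:
        raise KeyError("no IOV covers this time")
    return PAYLOADS[iovs[idx][1]]

print(resolve("PixelAlign-RUN2-v1", 1234))  # -> {'dx': 0.12}
```

In a REST realization, each of these lookups would map naturally onto an HTTP GET against the server, with the payload delivered as JSON.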
The ATLAS Event Service (ES) implements a new fine-grained approach to HEP event processing, designed to be agile and efficient in exploiting transient, short-lived resources such as HPC hole-filling, spot-market commercial clouds, and volunteer computing. Input and output control and data flows, bookkeeping, monitoring, and data storage are all managed at the event level in an implementation capable of supporting ATLAS-scale distributed processing throughputs (about 4M CPU-hours per day). Input data flows utilize remote data repositories with no data locality or pre-staging requirements, minimizing the use of costly storage in favor of strongly leveraging powerful networks. Object stores provide a highly scalable means of remotely storing the quasi-continuous, fine-grained outputs that give ES-based applications a very light data footprint on a processing resource, and ensure negligible losses should the resource suddenly vanish. We describe the motivations for the ES system, its unique features and capabilities, its architecture and the highly scalable tools and technologies employed in its implementation, and its applications in ATLAS processing on HPCs, commercial cloud resources, volunteer computing, and grid resources. Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher by accepting the manuscript for publication acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
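The event-level dispatch-and-stream pattern described above can be sketched in a few lines. This is a conceptual illustration, not ATLAS code: the class names, range size, and dictionary standing in for the object store are all assumptions made for the example:

```python
from collections import deque

class Dispatcher:
    """Hands out small event ranges [first, last) to workers on demand."""
    def __init__(self, n_events, range_size):
        self.todo = deque((i, min(i + range_size, n_events))
                          for i in range(0, n_events, range_size))
    def next_range(self):
        return self.todo.popleft() if self.todo else None

object_store = {}  # stands in for a real remote object store

def worker(dispatcher, worker_id):
    # Each finished range is streamed out immediately, so a preempted
    # worker loses at most the single range it was processing.
    while (rng := dispatcher.next_range()) is not None:
        first, last = rng
        result = [e * e for e in range(first, last)]  # dummy "processing"
        object_store[f"out/{worker_id}/{first}-{last}"] = result

d = Dispatcher(n_events=10, range_size=4)
worker(d, "w0")
print(sorted(object_store))  # three small range outputs, none > 4 events
```

The key design point is that bookkeeping lives with the ranges, not with whole jobs: any range not acknowledged as stored can simply be re-dispatched elsewhere.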
With ever-greater computing needs and fixed budgets, big scientific experiments are turning to opportunistic resources as a means to add much-needed extra computing power. These resources can be very different in design from those that comprise the Grid computing of most experiments; exploiting them therefore requires a change in strategy for the experiment. They may be highly restrictive in what can be run or in connections to the outside world, or may tolerate opportunistic usage only on condition that tasks can be terminated without warning. The Advanced Resource Connector Computing Element (ARC CE), with its nonintrusive architecture, is designed to integrate resources such as High Performance Computing (HPC) systems into a computing Grid. The ATLAS experiment developed the ATLAS Event Service (AES) primarily to address the issue of jobs that can be terminated at any point when opportunistic computing capacity is needed by someone else. This paper describes the integration of these two systems in order to exploit opportunistic resources for ATLAS in a restrictive environment. In addition to the technical details, results from the deployment of this solution at the SuperMUC HPC centre in Munich are shown.
Data processing applications of the ATLAS experiment, such as event simulation and reconstruction, spend a considerable amount of time in the initialization phase. This phase includes loading a large number of shared libraries, reading detector geometry and conditions data from external databases, building a transient representation of the detector geometry, and initializing various algorithms and services. In some cases the initialization step can take as long as 10-15 minutes. Such slow initialization has a significant negative impact on the overall CPU efficiency of a production job, especially when the job is executed on opportunistic, often short-lived, resources such as commercial clouds or volunteer computing. To improve this situation, we can take advantage of the fact that ATLAS runs large numbers of production jobs with similar configuration parameters (e.g. jobs within the same production task). This allows us to checkpoint one job at the end of its configuration step and then use the generated checkpoint image for the rapid startup of thousands of production jobs. By applying this technique we can bring the initialization time of a job down from tens of minutes to just a few seconds. In addition, we can leverage container technology for restarting checkpointed applications on a variety of computing platforms, in particular platforms different from the one on which the checkpoint image was created. We describe the mechanism of creating checkpoint images of Geant4 simulation jobs with AthenaMP (the multi-process version of the ATLAS data simulation, reconstruction and analysis framework Athena) and the usage of these images for running ATLAS simulation production jobs on volunteer computing resources (ATLAS@Home) and on supercomputers.
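The "pay initialization once, start many jobs from the saved state" idea can be illustrated with `os.fork` standing in for a real checkpoint image (a checkpoint generalizes this by persisting the state to disk, so restarts can happen later and on other machines). The timings and state contents here are placeholders:

```python
import os
import time

def slow_init():
    time.sleep(0.2)  # stands in for minutes of library/geometry/DB loading
    return {"geometry": "loaded"}

state = slow_init()                # expensive step, performed exactly once

pids = []
for job in range(3):
    pid = os.fork()
    if pid == 0:                   # child: inherits the initialized state
        assert state["geometry"] == "loaded"  # no re-initialization needed
        os._exit(job)              # exit code stands in for the job result
    pids.append(pid)

# Parent reaps the children; each started in milliseconds, not minutes.
exit_codes = [os.waitpid(p, 0)[1] >> 8 for p in pids]
print(exit_codes)  # -> [0, 1, 2]
```

(Requires a POSIX system for `os.fork`.)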
These proceedings give a summary of the many software upgrade projects undertaken to prepare ATLAS for the challenges of Run 2 of the LHC. Those projects include a significant reduction of the CPU time required for reconstruction of real data with high average pile-up compared to 2012, needed to meet the challenges of the expected increase in pile-up and the higher data-taking rate of up to 1 kHz. By far the most ambitious project is the implementation of a completely new analysis model, based on a new ROOT-readable reconstruction format, xAOD. The new model also includes a reduction framework, based on a train model, to centrally produce skimmed data samples, as well as an analysis framework. These proceedings close with a brief overview of future software projects and plans leading up to the coming Long Shutdown 2 as the next major ATLAS software upgrade phase.
ATLAS's current software framework, Gaudi/Athena, has been very successful for the experiment in LHC Runs 1 and 2. However, its single-threaded design has been recognised for some time to be increasingly problematic as CPUs have increased core counts and decreased available memory per core. Even the multi-process version of Athena, AthenaMP, will not scale to the range of architectures we expect to use beyond Run 2. ATLAS examined the requirements on an updated multi-threaded framework and laid out plans for a new framework, including better support for High Level Trigger use cases, in 2014. In this paper we report on our progress in developing the new multi-threaded, task-parallel extension of Athena, AthenaMT. Implementing AthenaMT has required many significant code changes. Progress has been made in updating key concepts of the framework, allowing different levels of thread safety in algorithmic code. Substantial advances have also been made in implementing a data-flow-centric design, as well as in the development of the new 'event views' infrastructure. These event views support partial event processing and are an essential component for supporting the High Level Trigger's processing of certain regions of interest. A major effort has also been invested to have an early version of AthenaMT that can run simulation on many-core architectures, which has augmented the understanding gained from work on earlier ATLAS demonstrators.
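The data-flow-centric design mentioned above can be sketched as a toy scheduler: each "algorithm" declares what it reads and writes, and execution order is derived from data availability rather than from a fixed sequence. This is a conceptual illustration, not AthenaMT code; the algorithm names and data keys are invented for the example:

```python
# Each algorithm declares its data dependencies (reads) and products (writes).
ALGS = {
    "Digitization": {"reads": {"hits"},     "writes": {"digits"}},
    "Clustering":   {"reads": {"digits"},   "writes": {"clusters"}},
    "Tracking":     {"reads": {"clusters"}, "writes": {"tracks"}},
    "Vertexing":    {"reads": {"tracks"},   "writes": {"vertices"}},
}

def schedule(store):
    """Repeatedly run every algorithm whose inputs are all in the store.
    All algorithms 'ready' in the same pass could run concurrently."""
    order, pending = [], dict(ALGS)
    while pending:
        ready = [a for a, io in pending.items() if io["reads"] <= store]
        if not ready:
            raise RuntimeError("unsatisfiable data dependencies")
        for a in ready:
            store |= pending.pop(a)["writes"]
            order.append(a)
    return order

print(schedule({"hits"}))
# -> ['Digitization', 'Clustering', 'Tracking', 'Vertexing']
```

In a real multi-threaded framework the `ready` set in each pass would be dispatched to a thread pool, and the declared reads/writes are also what allows the scheduler to enforce the different levels of thread safety in algorithmic code.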
The ATLAS Event Service (AES) has been designed and implemented for efficient running of ATLAS production workflows on a variety of computing platforms, ranging from conventional Grid sites to opportunistic, often short-lived resources such as spot-market commercial clouds, supercomputers and volunteer computing. The Event Service architecture allows real-time delivery of fine-grained workloads to running payload applications, which process dispatched events or event ranges and immediately stream the outputs to highly scalable object stores. Thanks to its agile and flexible architecture, the AES is currently being used by grid sites for assigning low-priority workloads to otherwise idle computing resources; for harvesting HPC resources in an efficient back-fill mode; and for massively scaling out to the 50-100k concurrent-core level on the Amazon spot market to efficiently utilize those transient resources for peak production needs. Platform ports in development include ATLAS@Home (BOINC), the Google Compute Engine, and a growing number of HPC platforms. After briefly reviewing the concept and the architecture of the Event Service, we report the status of and experience gained in AES commissioning and production operations on supercomputers, and our plans for extending ES applications beyond Geant4 simulation to other workflows, such as reconstruction and data analysis.
Continued growth in public cloud and HPC resources is on track to exceed the dedicated resources available for ATLAS on the WLCG. Examples of such platforms are Amazon AWS EC2 Spot Instances, the Edison Cray XC30 supercomputer, backfill at Tier-2 and Tier-3 sites, opportunistic resources on the Open Science Grid (OSG), and the ATLAS High Level Trigger farm between data-taking periods. Because of specific aspects of opportunistic resources, such as preemptive job scheduling and data I/O, their efficient usage requires the workflow innovations provided by the ATLAS Event Service. Thanks to the finer granularity of the Event Service data processing workflow, opportunistic resources are used more efficiently. We report on our progress in scaling opportunistic resource usage to double-digit levels in ATLAS production.
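Why finer granularity helps on preemptible resources can be seen with a back-of-the-envelope model (the numbers below are illustrative, not measured ATLAS figures): on preemption, a worker loses everything since its last completed unit of output, on average half a unit.

```python
def expected_loss_fraction(unit_seconds, mean_lifetime_seconds):
    """Average work lost per preemption (half a unit of processing),
    as a fraction of the resource's mean lifetime."""
    return (unit_seconds / 2) / mean_lifetime_seconds

lifetime = 2 * 3600      # e.g. a 2-hour preemptible spot-market slot
event_range = 5 * 60     # e.g. a 5-minute event range

# With whole-job granularity, a 6-hour job on a 2-hour slot never
# finishes, so essentially all of its CPU time is wasted. With
# event-range granularity, only the in-flight range is lost:
loss = expected_loss_fraction(event_range, lifetime)
print(f"{loss:.1%}")  # -> 2.1%
```

The smaller the unit of output relative to the resource lifetime, the closer the effective efficiency gets to that of a dedicated resource.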
In this paper we review the ATLAS Monte Carlo production setup, including the different production steps involved in full and fast detector simulation. A report on the Monte Carlo production campaigns during Run 1 and Long Shutdown 1 is presented, including details on various performance aspects. Important improvements in the workflow and software are highlighted. Besides standard Monte Carlo production for data analyses at 7 and 8 TeV, the production accommodates various specialised activities. These range from extended Monte Carlo and Geant4 validation to pile-up simulation using zero-bias data and production for various upgrade studies. The challenges of these activities are discussed.