The CMS experiment at the LHC accelerator at CERN relies on its computing infrastructure to stay at the frontier of High Energy Physics, searching for new phenomena and making discoveries. Even though computing plays a significant role in physics analysis, we rarely use its data to predict the behaviour of the system itself. Basic information about computing resources, user activities and site utilization can be very useful for improving the throughput of the system and its management. In this paper, we discuss a first CMS analysis of dataset popularity based on CMS meta-data, which can be used as a model for dynamic data placement and provides the foundation of a data-driven approach for the CMS computing infrastructure.
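As an illustration of the kind of popularity metric such an analysis can produce, the following minimal sketch ranks datasets by access count and by the number of distinct users. The column names (`dataset`, `user`, `timestamp`) and the CSV input are assumptions for illustration, not the actual CMS meta-data schema or implementation.

```python
# Minimal sketch of a dataset-popularity ranking built from access-log records.
# Column names (dataset, user, timestamp) are hypothetical placeholders.
import pandas as pd

def rank_popularity(log_csv: str) -> pd.DataFrame:
    logs = pd.read_csv(log_csv, parse_dates=["timestamp"])
    pop = (
        logs.groupby("dataset")
            .agg(accesses=("timestamp", "size"),       # number of accesses
                 users=("user", "nunique"),            # distinct users
                 last_access=("timestamp", "max"))     # recency of last access
            .sort_values("accesses", ascending=False)
    )
    return pop

if __name__ == "__main__":
    print(rank_popularity("access_logs.csv").head(20))
```

A ranking of this kind is the kind of input a dynamic data placement policy could consume, e.g. to decide which datasets deserve additional replicas.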
CERN IT provides a set of Hadoop clusters featuring more than 5 PB of raw storage, with different open-source, user-level tools available for analytical purposes. The CMS experiment has been collecting a large set of computing meta-data, e.g. dataset and file-access logs, since 2015. These records represent a valuable, yet scarcely investigated, set of information that needs to be cleaned, categorized and analyzed. CMS can use this information to discover useful patterns, enhance the overall efficiency of its distributed data management, and improve CPU and site utilization as well as task completion times. Here we present an evaluation of the Apache Spark platform for CMS needs. We discuss two main use cases, CMS analytics and machine-learning studies, where efficiently processing billions of records stored on HDFS plays an important role. We demonstrate that both the Scala and Python (PySpark) APIs can be successfully used to execute extremely I/O-intensive queries and provide valuable insight from the collected meta-data.
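A minimal PySpark sketch of the kind of I/O-intensive query discussed here is shown below. The HDFS path and field names (`site`, `dataset`, `read_bytes`) are assumptions for illustration, not the actual CMS record layout.

```python
# Minimal PySpark sketch of an aggregation over file-access records on HDFS.
# Path and field names (site, dataset, read_bytes) are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cms-access-aggregation").getOrCreate()

# Read billions of JSON records in parallel from HDFS.
records = spark.read.json("hdfs:///project/cms/access-logs/2015/*")

# Aggregate accesses and bytes read per (site, dataset) pair.
summary = (
    records.groupBy("site", "dataset")
           .agg(F.count("*").alias("n_accesses"),
                F.sum("read_bytes").alias("bytes_read"))
           .orderBy(F.desc("n_accesses"))
)

# Persist the compact summary for downstream analytics or ML studies.
summary.write.mode("overwrite").parquet("hdfs:///user/analytics/access_summary")
```

The same query can be expressed almost verbatim through the Scala DataFrame API, which is one reason both bindings were evaluated.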
DODAS stands for Dynamic On Demand Analysis Service and is a Platform as a Service toolkit built around several EOSC-hub services, designed to instantiate and configure on-demand container-based clusters over public or private Cloud resources. It automates the whole workflow, from service provisioning to the configuration and setup of software applications. Such a solution therefore allows using "any cloud provider" with almost zero effort. In this paper, we demonstrate how DODAS can be adopted as a deployment manager to set up and manage the compute resources and services required to develop an AI solution for smart data caching. A smart caching layer may reduce the operational cost and increase flexibility with respect to the regular centrally managed storage of the current CMS computing model. The cache space should be dynamically populated with the most requested data. In addition, clustering such caching systems will allow them to be operated as a Content Delivery System between data providers and end users. Moreover, a geographically distributed caching layer will also be functional to a data-lake-based model, where many satellite computing centres might appear and disappear dynamically. In this context, our strategy is to develop a flexible and automated AI environment for smart management of the content of such a clustered cache system. In this contribution, we will describe the computational phases identified as required for the AI environment implementation, as well as the related DODAS integration. We will start with an overview of the architecture of the pre-processing step, based on Spark, whose role is to prepare the data for a machine-learning technique. A focus will be given to the automation implemented through DODAS. Then, we will show how to train an AI-based smart cache and how we implemented a training facility managed through DODAS. Finally, we provide an overview of the inference system, based on the CMS TensorFlow as a Service, also deployed as a DODAS service.
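To make the training phase of such a pipeline concrete, the sketch below trains a tiny classifier that predicts whether a requested file is worth caching (i.e. likely to be re-requested soon). The feature set, the stand-in labels and the network are hypothetical placeholders, not the DODAS production model; the only point is the shape of the workflow (pre-processed features in, exportable model out).

```python
# Illustrative sketch of training a simple "cache-worthiness" classifier.
# Features, labels and architecture are placeholders for the sketch only.
import numpy as np
import tensorflow as tf

# Hypothetical features per request: recent access count, file size,
# hours since last access, number of distinct users (here random stand-ins).
X = np.random.rand(10_000, 4).astype("float32")
y = (X[:, 0] > 0.5).astype("float32")           # stand-in labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=256, validation_split=0.2)

# The trained model can then be exported and served behind a
# TensorFlow-as-a-Service style inference endpoint.
model.save("smart_cache_model.h5")
```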
After a successful first run at the LHC, and during the Long Shutdown (LS1) of the accelerator, the workload and data management sectors of the CMS Computing Model are entering an operational review phase in order to concretely assess areas of possible improvement and paths to exploit new promising technology trends. In particular, since the preparation activities for the LHC start-up, networks have constantly been of paramount importance for the execution of CMS workflows, exceeding the original expectations, as set in the MONARC model, in terms of performance, stability and reliability. The low-latency transfer of petabytes of CMS data among dozens of WLCG Tiers worldwide using the PhEDEx dataset replication system is one example of the importance of reliable networks. Another example is the exploitation of WAN data access over data federations in CMS. A new emerging area of work is the exploitation of Intelligent Network Services, including bandwidth-on-demand concepts. In this paper, we review the work done in CMS in this area and the next steps.
The International Conference on Computing in High Energy and Nuclear Physics (CHEP) is a major series of international conferences intended to attract physicists and computing professionals to discuss recent developments and trends in software and computing for their research communities. Experts from the high-energy and nuclear physics, computer science, and information technology communities attend CHEP events. The conference series provides an international forum to exchange experiences and the needs of a wide community, and to present and discuss recent, ongoing, and future activities. At the beginning of the successful series of CHEP conferences, in 1985 in Amsterdam, the latest developments in embedded systems, networking, and vector and parallel processing were presented. The software and computing ecosystem has evolved massively since then, and along this path each CHEP event has marked a step further. A vibrant community of experts on a wide range of high-energy and nuclear physics experiments, as well as technology explorers and industry contacts, attend and discuss the present and future challenges, and shape the future of an entire community. In such a rapidly evolving area, aiming to capture the state of the art in software and computing through a collection of proceedings papers in a journal is a big challenge. Due to the large attendance, the final papers appear in the journal a few months after the conference is over. Additionally, the contributions often report on studies at very heterogeneous stages: completed, just started, or yet to be done. It is not uncommon that, by the time a specific paper appears in the journal, some of the work is over a year old, or that the investigation actually proceeded in different directions and with different methodologies than originally presented at the conference just a few months before. And by the time the proceedings appear in journal form, new ideas and explorations have quickly formed, have already started, and presumably have also followed previously unpredictable directions. In this scenario, it is normal and healthy for the entire community to ask itself whether a set of proceedings is the best way to document and communicate to peers (present and future) the work done at a precise time and the vivid, live ideas of a precise moment in the evolution of the discipline. Pointing the attention to a specific CHEP event alone does not give the right answer: in fact, the heritage value lies in the quality and continuity of the documentation work, despite the changes of times, trends and actors. The CHEP proceedings, in their variety and thanks to the condensed form of knowledge they offer, are what will most likely be most easily preserved for future generations, thanks to the outstanding efforts on digital libraries for all kinds of cultural heritage. Since 1985, this long-standing tradition has continued, most recently with the 21st CHEP edition in Okinawa. The successful model that brings together high-energy and nuclear physicists and computer scientists was repeated in the Okinawa prefecture, an outstanding location consisting of a few dozen small islands in the southern half of the Nansei Shoto, the island chain which stretches over about one thousand kilometres from Kyushu to Taiwan. The OIST (Okinawa Institute of Science and Technology) centre hosted the conference and offered an outstanding venue and efficient facilities.
As throughout CHEP history, contributions from 'general purpose' physics experiments mixed with highly specialized work at the precision and intensity frontiers. The year 2015 was marked by the LHC restart for Run 2. Experimental groups at the LHC reviewed and presented their Run 1 experience in detail, and reported the work done in acquiring the latest computing and software technologies, as well as in evolving their computing models in preparation for Run 2 (and beyond). On the intensity frontier, 2015 also marked the start of Super-KEKB commissioning. Fixed-target experiments at CERN, Fermilab and J-PARC are growing bigger in size. In the field of nuclear physics, FAIR is under construction and RHIC is well engaged in its Phase-II research program, facing increased datasets and new challenges in precision physics. For the future, developments are progressing towards the construction of the ILC. In all these projects, computing and software will be even more important than before. Beyond those examples, non-accelerator experiments reported on their search for novel computing models as their apparatus and operations become larger and more distributed. The CHEP edition in Okinawa explored the synergy of HEP experimental physicists and computer scientists with data engineers and data scientists even further. Many areas of research were covered, and the techniques developed and adopted were presented with a richness and diversity never seen before. In numbers, CHEP 2015 attracted a very high number of oral and poster contributions, 535 in total, and hosted 450 participants from 28 countries. For the first time in the conference history, a system of 'keywords' at abstract submission time was set up and exploited to produce conference tracks depending on the topics covered in the proposed contributions. Authors were asked to select some 'application keywords' and/or 'technology keywords' to specify the content of their contribution. This bottom-up approach, tried at CHEP 2015 in Okinawa for the first time in the history of the conference series, met with broad satisfaction both in the International Advisory Committee and among the conference attendees. The process created 8 topical tracks, well balanced in content, manageable in terms of number of contributions, and able to create adequate discussion space for trending topics (e.g. cloud computing and virtualization). CHEP 2015 hosted contributions on online computing; offline software; data store and access; middleware, software development and tools, experiment frameworks, tools for distributed computing; computing activities and computing models; facilities, infrastructure, network; clouds and virtualization; and performance increase and optimization exploiting hardware features. Throughout the entire process, we were blessed with a forward-looking group of competent colleagues in our International Advisory Committee, whom we warmly thank. All the individuals in the Program Committee team, who put together the technical tracks of the conference and reviewed all papers to prepare the sections of this proceedings journal, have to be credited for their outstanding work. And of course our gratitude goes to all the people who submitted a contribution, presented it, and took the time to prepare a careful paper documenting their work. These people are, first and foremost, the main authors of the big success that CHEP continues to be.
After almost 30 years and 21 CHEP editions, this conference series continues to stay strong and to evolve in rapidly changing times towards a challenging future, covering new ground and intercepting new trends as our field of research evolves. The next stop in this journey will be the 22nd CHEP Conference, on October 12th-14th in San Francisco, hosted by SLAC and LBNL.
The CMS experiment has collected an enormous volume of metadata about its computing operations in its monitoring systems, describing its experience in operating all of the CMS workflows on all of the Worldwide LHC Computing Grid Tiers. Data-mining efforts on all this information have rarely been undertaken, but they are of crucial importance for a better understanding of how CMS achieved successful operations, and for reaching an adequate and adaptive model of CMS operations that allows detailed optimizations and eventually a prediction of system behaviour. These data are now streamed into the CERN Hadoop data cluster for further analysis. Specific sets of information (e.g. data on how many replicas of datasets CMS wrote on disk at WLCG Tiers, data on which datasets were primarily requested for analysis, etc.) were collected on Hadoop and processed with MapReduce applications profiting from the parallelization on the Hadoop cluster. We present the implementation of new monitoring applications on Hadoop, and discuss the new possibilities introduced in CMS computing monitoring by the ability to quickly process big data sets from multiple sources, looking forward to a predictive modelling of the system.
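For readers unfamiliar with the MapReduce pattern used here, the sketch below shows a Hadoop Streaming style mapper and reducer counting replicas per dataset. The tab-separated input format, with the dataset name in the first column, is an assumption for illustration only; the actual CMS records and applications are considerably richer.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming style count of replicas per dataset: the mapper
# emits one (dataset, 1) pair per input line, the reducer (which receives
# lines sorted by key, as Hadoop guarantees) sums them per dataset.
# The input layout is hypothetical.
import sys

def mapper(lines):
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer(lines):
    current, total = None, 0
    for line in lines:
        dataset, count = line.rstrip("\n").split("\t")
        if dataset != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = dataset, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Select the phase via a command-line argument, e.g.
    #   -mapper "replicas.py map" -reducer "replicas.py reduce"
    mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)
```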
After the CMS Data Challenge (DC04), which was devised to test several key aspects of the CMS computing model, a deeper insight was achieved into the issues most crucial for successful Tier-1 operation with real data within the overall CMS computing infrastructure. In particular, at the Italian Tier-1 centre located at CNAF, several improvements were implemented in the year following DC04, concerning data management, the data distribution system using the CMS PhEDEx tool, the coexistence of traditional local farm operations and official Grid CMS Monte Carlo production, the development and use of tools to grant distributed users efficient access to CMS data via Grid tools, the long-term local archiving and custodial responsibility (e.g. MSS with a Castor back-end), the daily CMS operations on Tier-1 resources shared among LHC (and other) experiments, and so on. The outcome of CMS DC04, as well as the CMS use of INFN-CNAF Tier-1 resources, is briefly reviewed and discussed, yielding indications for a roadmap towards the operation of the regional centre when real data from the LHC become available.
Clouds and virtualization offer typical answers to the needs of large-scale computing centers to satisfy diverse sets of user communities in terms of architecture, OS, etc. On the other hand, solutions like Docker seem to emerge as a way to rely on Linux kernel capabilities to package only the applications and the development environment needed by the users, thus solving several resource-management issues related to cloud-like solutions. In this paper, we present an exploratory (though well advanced) test done at a major Italian Tier-2, INFN-Pisa, where a considerable fraction of the resources and services has been moved to Docker. The results obtained are definitely encouraging, and Pisa is transitioning all of its worker nodes and services to Docker containers. Work is currently being expanded into the preparation of suitable images for a completely virtualized Tier-2, with no dependency on local configurations.
During the first LHC run, the CMS experiment collected tens of petabytes of collision and simulated data, which need to be distributed among dozens of computing centres with low latency in order to make efficient use of the resources. While the desired level of throughput has been successfully achieved, it is still common to observe transfer workflows that cannot reach full completion in a timely manner due to a small fraction of stuck files which require operator intervention. For this reason, in 2012 the CMS transfer management system, PhEDEx, was instrumented with a monitoring system to measure file transfer latencies and to predict the completion time for the transfer of a data set. Operators can detect abnormal patterns in transfer latencies while the transfer is still in progress, and monitor the long-term performance of the transfer infrastructure to plan the data placement strategy. Based on the data collected for one year with the latency monitoring system, we present a study of the different factors that contribute to transfer completion time. As case studies, we analyze several typical CMS transfer workflows, such as the distribution of collision event data from CERN or the upload of simulated event data from the Tier-2 centres to the archival Tier-1 centres. For each workflow, we present the typical patterns of transfer latencies that have been identified with the latency monitor. We identify the areas in PhEDEx where a development effort can reduce the latency, and we show how we are able to detect stuck transfers which need operator intervention. We propose a set of metrics to alert on stuck subscriptions and prompt manual intervention, with the aim of improving transfer completion times.
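To illustrate what a "stuck subscription" metric of this kind might look like, the sketch below flags a subscription when most of its files are complete but the remaining ones have shown no progress for a configurable period. The per-file record layout and the thresholds are assumptions for illustration, not the actual PhEDEx metrics.

```python
# Illustrative "stuck subscription" check: nearly complete, but the pending
# files have made no progress for longer than a stall threshold.
# Field names and thresholds are hypothetical.
from datetime import datetime, timedelta

def is_stuck(files, now=None, done_fraction=0.95, stall=timedelta(hours=12)):
    """files: list of dicts with keys 'done' (bool) and 'last_progress' (datetime)."""
    now = now or datetime.utcnow()
    if not files:
        return False
    done = sum(1 for f in files if f["done"])
    pending = [f for f in files if not f["done"]]
    nearly_complete = done / len(files) >= done_fraction
    stalled = pending and all(now - f["last_progress"] > stall for f in pending)
    return nearly_complete and bool(stalled)
```

A metric like this can feed an alerting dashboard so that only the small fraction of transfers genuinely needing manual intervention reaches the operators.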
The computing infrastructures serving the LHC experiments have been designed to cope, at most, with the average amount of data recorded. Usage peaks, as already observed in Run-I, may however generate large backlogs, thus delaying the completion of the data reconstruction and ultimately the data availability for physics analysis. In order to cope with production peaks, the LHC experiments are exploring the opportunity to access Cloud resources provided by external partners or commercial providers. In this work we present a proof of concept of the elastic extension of a local analysis facility, specifically the Bologna Tier-3 Grid site, for the LHC experiments hosted at the site, onto an external OpenStack infrastructure. We focus on the cloud bursting of the Grid site using DynFarm, a newly designed tool that allows the dynamic registration of new worker nodes to LSF. In this approach, the dynamically added worker nodes instantiated on the OpenStack infrastructure are transparently accessed by the LHC Grid tools and at the same time serve as an extension of the farm for local usage.
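As a sketch of the cloud-bursting idea (not the DynFarm implementation itself), the snippet below uses the openstacksdk cloud layer to boot a worker-node instance whose cloud-init user data would configure it to join the batch system. The cloud name, image, flavor, network and the join script are placeholders introduced here for illustration.

```python
# Hypothetical sketch of elastically adding a worker node on OpenStack.
# This is not DynFarm: image, flavor, network and the cloud-init script that
# would register the node to LSF are placeholders for illustration only.
import openstack

# cloud-init user data: on first boot, run a (hypothetical) script that
# configures the node and registers it with the local batch system.
USER_DATA = """#cloud-config
runcmd:
  - [ /usr/local/bin/join_batch_farm.sh ]
"""

conn = openstack.connect(cloud="tier3-openstack")   # named cloud in clouds.yaml

server = conn.create_server(
    name="dyn-wn-001",
    image="wn-base-image",      # placeholder image name
    flavor="m1.large",          # placeholder flavor
    network="farm-net",         # placeholder network
    userdata=USER_DATA,
    wait=True,
)
print(f"Worker node {server.name} is {server.status}")
```

Once the node has registered with the batch system, Grid jobs and local users see it as just another worker in the farm, which is the transparency property the proof of concept aims for.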