CMS computing needs reliable, stable and fast connections among multi-tiered computing infrastructures. For data distribution, the CMS experiment relies on a data placement and transfer system, PhEDEx, which manages replication operations at each site in the distribution network. PhEDEx uses the File Transfer Service (FTS), a low-level data movement service responsible for moving sets of files from one site to another while allowing participating sites to control the network resource usage. FTS servers are provided by Tier-0 and Tier-1 centres and are used by all computing sites in CMS, according to the established policy. FTS needs to be set up according to the Grid site's policies, and properly configured to satisfy the requirements of all Virtual Organizations making use of the Grid resources at the site. Managing the service efficiently requires good knowledge of the CMS needs for all kinds of transfer workflows. This contribution presents a revision of the FTS servers used by CMS, collecting statistics on their usage, customizing the topologies and improving their setup in order to keep CMS transferring data at the desired levels in a reliable and robust way.
The CMS experiment has adopted a computing system where resources are distributed worldwide in more than 100 sites. The operation of the system requires a stable and reliable behaviour of the underlying infrastructure. CMS has established procedures to extensively test all relevant aspects of a site and its capability to sustain the various CMS computing workflows at the required scale. The Site Readiness monitoring infrastructure has been instrumental in understanding how the system as a whole was improving towards LHC operations, measuring the reliability of sites when running CMS activities, and providing sites with the information they need to solve any problems that arise. This paper reviews the complete automation of the Site Readiness program, with a description of the monitoring tools, the impact on improving the overall reliability of the Grid from the point of view of the CMS computing system, as well as the resource utilization and performance seen at the sites during the first year of LHC running.
The CMS experiment has adopted a computing system where resources are distributed worldwide in more than 50 sites. The operation of the system requires a stable and reliable behaviour of the underlying infrastructure. CMS has established procedures to extensively test all relevant aspects of a site and its capability to sustain the various CMS computing workflows at the required scale. The Site Readiness monitoring infrastructure has been instrumental in understanding how the system as a whole was improving towards LHC operations, measuring the reliability of sites when running CMS activities, and providing sites with the information they need to troubleshoot any problems. This contribution reviews the complete automation of the Site Readiness program, with a description of the monitoring tools and their inclusion into the Site Status Board (SSB), the performance checks, the use of tools like HammerCloud, and the impact on improving the overall reliability of the Grid from the point of view of the CMS computing system. These results are used by CMS to select good sites on which to conduct workflows, in order to maximize workflow efficiency. The performance of the sites against these tests during the first years of LHC running is also reviewed.
The Large Hadron Collider (LHC) experiments will collect unprecedented data volumes in the next physics run, with high pile-up collisions resulting in events that require complex processing. Hence, the collaborations have been required to update their Computing Models to optimize the use of the available resources and contain resource growth, in the midst of widespread funding restrictions, without penalizing any of the physics objectives. The changes in computing for Run 2 represent significant efforts for the collaborations, as well as significant repercussions on how the WLCG sites are built and operated. This paper focuses on these changes, and how they have been implemented and integrated in the Spanish WLCG Tier-1 centre at Port d'Informació Científica (PIC), which serves the ATLAS, CMS and LHCb experiments. The approach taken to adapt a multi-VO site to the new requirements, while maintaining top reliability levels for all the experiments, is also presented. Additionally, a description of the work done to reduce the operational and maintenance costs of the Spanish Tier-1 centre, in agreement with the expectations from WLCG, is provided.
The WLCG infrastructure moved from a very rigid network topology, based on the MONARC model, to a more relaxed system, where data movement between regions or countries does not necessarily need to involve T1 centres. While this evolution brought obvious advantages, especially in terms of flexibility for the LHC experiments' data management systems, it also opened the question of how to monitor the increasing number of possible network paths, in order to provide a globally reliable network service. The perfSONAR network monitoring system has been evaluated and agreed upon as a proper solution to cover the WLCG network monitoring use cases: it allows WLCG to plan and execute latency and bandwidth tests between any instrumented endpoints through a central scheduling configuration; it allows archiving of the metrics in a local database; it provides a programmatic and a web-based interface exposing the test results; and it provides a graphical interface for remote management operations. In this contribution we will present our activity to deploy a perfSONAR-based network monitoring infrastructure, in the scope of the WLCG Operations Coordination initiative: we will motivate the main choices we agreed on in terms of configuration and management, describe the additional tools we developed to complement the standard packages, and present the status of the deployment, together with the possible future evolution.
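The archived metrics described above are exposed through perfSONAR's programmatic interface as JSON time series. As a minimal sketch of how such results could be consumed downstream, the following summarizes throughput data points in the `{"ts": ..., "val": ...}` shape used by perfSONAR measurement archives; the sample values and the helper name are invented for illustration, not taken from any real archive.

```python
import json
from statistics import mean

# Hypothetical sample of throughput data points as a measurement archive
# might return them: "ts" is a Unix timestamp, "val" is throughput in bit/s.
SAMPLE = json.loads("""
[
  {"ts": 1420070400, "val": 912345678.0},
  {"ts": 1420074000, "val": 887654321.0},
  {"ts": 1420077600, "val": 450000000.0}
]
""")

def mean_throughput_gbps(records):
    """Average throughput in Gbit/s over a list of {"ts", "val"} points."""
    return mean(r["val"] for r in records) / 1e9

print(round(mean_throughput_gbps(SAMPLE), 3))  # -> 0.75
```

In a real deployment the JSON would be fetched over HTTP from the archive host rather than embedded inline; the aggregation step stays the same.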
CMS computing needs reliable, stable and fast connections among multi-tiered distributed infrastructures. For data distribution, the CMS experiment relies on the File Transfer Service (FTS), a low-level data movement service responsible for moving sets of files from one site to another while allowing participating sites to control the network resource usage. FTS servers are provided by Tier-0 and Tier-1 centers and used by all the computing sites in CMS, according to established CMS and site setup policies; each server must accommodate all the Virtual Organizations making use of the Grid resources at the site and be properly dimensioned to satisfy their requirements. Managing the service efficiently requires good knowledge of the CMS needs for all kinds of transfer routes, as well as of the sharing and interference with other VOs using the same FTS transfer managers. This contribution deals with a complete revision of all FTS servers used by CMS, customizing the topologies and improving their setup in order to keep CMS transferring data at the desired levels, as well as performance studies for all kinds of transfer routes, including measurements of the overheads introduced by SRM servers and storage systems, FTS server misconfigurations, identification of congested channels, historical transfer throughputs per stream, and file-latency studies, among others. This information is retrieved directly from the FTS servers through the FTS Monitor webpages and conveniently archived for further analysis. The project provides an interface to all these values, to ease the analysis of the data.
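One of the analyses mentioned above is the identification of congested channels from archived per-channel statistics. A minimal sketch of such a check is shown below; the record fields and thresholds are illustrative assumptions, not the actual FTS Monitor schema.

```python
from collections import namedtuple

# Hypothetical per-channel summary, as could be archived from an FTS Monitor
# page: number of active transfers, queued files, and achieved throughput.
Channel = namedtuple("Channel", "name active_files queued_files throughput_mbps")

channels = [
    Channel("CERN-PIC",  active_files=30, queued_files=1200, throughput_mbps=80.0),
    Channel("CERN-FNAL", active_files=25, queued_files=40,   throughput_mbps=900.0),
    Channel("PIC-IN2P3", active_files=5,  queued_files=600,  throughput_mbps=20.0),
]

def congested(ch, queue_ratio=10.0, min_mbps=100.0):
    """Flag a channel whose queue greatly exceeds the active transfers
    while the achieved throughput stays low (thresholds are illustrative)."""
    if ch.active_files == 0:
        return ch.queued_files > 0
    return (ch.queued_files / ch.active_files > queue_ratio
            and ch.throughput_mbps < min_mbps)

print([ch.name for ch in channels if congested(ch)])  # -> ['CERN-PIC', 'PIC-IN2P3']
```

A production version would feed the same check from the archived monitor data and tune the thresholds per route, but the shape of the decision is the same.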
Several scientific fields, including Astrophysics, Astroparticle Physics, Cosmology, Nuclear and Particle Physics, and Research with Photons, estimate that by the 2020 decade they will require data handling systems with data volumes approaching the Zettabyte, distributed amongst as many as 10^18 individually addressable data objects (Zettabyte-Exascale systems). It may be convenient or necessary to deploy such systems using multiple physical sites. This paper describes the findings of a working group composed of experts from several
For a few years, OSG has been operating a glideinWMS factory at UCSD for several scientific communities, including CMS analysis, HCC and GLOW. This setup worked fine, but it had become a single point of failure. OSG thus recently added another instance at Indiana University, serving the same user communities. Similarly, CMS has been operating a glidein factory dedicated to reprocessing activities at Fermilab, with similar results. Recently, CMS decided to host another glidein factory at CERN, to increase the availability of the system for analysis, MC and reprocessing jobs. Given the large overlap between this new factory and the three factories in the US, and given that CMS represents a significant fraction of the glideins going through the OSG factories, CMS and OSG formed a common operations team that operates all of the above factories. The reasoning behind this arrangement is that most operational issues stem from Grid-related problems, and are very similar for all the factory instances. Solving a problem in one instance thus very often solves the problem for all of them. This paper presents the operational experience of how we address both the social and technical issues of running multiple instances of a glideinWMS factory with operations staff spanning multiple time zones on two continents.
The CMS experiment utilizes a distributed computing infrastructure, and its performance heavily depends on the fast and smooth distribution of data between different CMS sites. Data must be transferred from the Tier-0 (CERN) to the Tier-1s for processing, storing and archiving, and timeliness and good transfer quality are vital to avoid overflowing the CERN storage buffers. At the same time, processed data have to be distributed from Tier-1 sites to all Tier-2 sites for physics analysis, while Monte Carlo simulations are sent back to Tier-1 sites for further archival. At the core of all the transfer machinery is the PhEDEx (Physics Experiment Data Export) data transfer system. It is very important to ensure reliable operation of the system, and the operational tasks comprise monitoring and debugging all transfer issues. Based on transfer quality information, the Site Readiness tool is used to create plans for future resource utilization. We review the operational procedures created to enforce reliable data delivery to CMS distributed sites all over the world. Additionally, we need to keep data and metadata consistent at all sites, both on disk and on tape. In this presentation, we describe the principles and actions taken to keep data on the sites' storage systems consistent with the central CMS Data Replication Database (TMDB/DBS), while ensuring fast and reliable delivery of data samples of hundreds of terabytes to the entire CMS physics community.
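The consistency checks mentioned above reduce, at their core, to comparing the set of file replicas the central catalogue expects at a site with the set actually found on the site's storage. A minimal sketch under that assumption (the file names and variable names are invented for illustration):

```python
# Replicas the central catalogue (TMDB/DBS) expects at the site,
# versus files actually listed on the site's storage system.
catalogue = {"/store/data/run1/f1.root", "/store/data/run1/f2.root",
             "/store/mc/s1/f3.root"}
storage   = {"/store/data/run1/f1.root", "/store/mc/s1/f3.root",
             "/store/tmp/orphan.root"}

missing = sorted(catalogue - storage)   # registered but not found on storage
orphans = sorted(storage - catalogue)   # on storage but unknown to the catalogue

print(missing)  # -> ['/store/data/run1/f2.root']
print(orphans)  # -> ['/store/tmp/orphan.root']
```

Missing files would trigger re-transfer or invalidation of the replica, while orphans are candidates for cleanup; the operational procedures decide which action applies.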
This paper summarizes the operational experience of the Tier-1 computer center at Port d'Informació Científica (PIC) supporting the commissioning and first run (Run 1) of the Large Hadron Collider (LHC). The evolution of the experiment computing models resulting from the higher amounts of data expected after the restart of the LHC is also described.