DODAS stands for Dynamic On Demand Analysis Service and is a Platform as a Service toolkit built around several EOSC-hub services, designed to instantiate and configure on-demand container-based clusters over public or private Cloud resources. It automates the whole workflow from service provisioning to the configuration and setup of software applications, allowing virtually any cloud provider to be used with almost zero effort. In this paper, we demonstrate how DODAS can be adopted as a deployment manager to set up and manage the compute resources and services required to develop an AI solution for smart data caching. The smart caching layer may reduce the operational cost and increase flexibility with respect to the regular, centrally managed storage of the current CMS computing model. The cache space should be dynamically populated with the most requested data. In addition, clustering such caching systems will allow them to be operated as a Content Delivery System between data providers and end users. Moreover, a geographically distributed caching layer will also be functional to a data-lake-based model, where many satellite computing centers might appear and disappear dynamically. In this context, our strategy is to develop a flexible and automated AI environment for smart management of the content of such a clustered cache system. In this contribution, we describe the computational phases identified as required for the AI environment implementation, as well as the related DODAS integration. We start with an overview of the architecture of the pre-processing step, based on Spark, whose role is to prepare data for the Machine Learning technique. A focus is given to the automation implemented through DODAS. We then show how to train an AI-based smart cache and how we implemented a training facility managed through DODAS. Finally, we provide an overview of the inference system, based on the CMS TensorFlow as a Service solution and also deployed as a DODAS service.
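As an illustration of the pre-processing phase, the following sketch shows how Spark could be used to aggregate cache access logs into per-file popularity features for the subsequent Machine Learning step; the input path and column names are assumptions made for the example, not the actual DODAS/CMS schema.

```python
# Illustrative Spark pre-processing for a smart-cache popularity model.
# Paths and column names (file_name, site, read_bytes, timestamp) are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-log-preprocessing").getOrCreate()

# Read the raw data-access records produced by the caching layer.
logs = spark.read.parquet("hdfs:///cache/access-logs/")

# Aggregate per file and per day: request count, bytes read and number of
# distinct sites, which serve as input features for the smart-cache model.
features = (
    logs.withColumn("day", F.to_date("timestamp"))
        .groupBy("file_name", "day")
        .agg(
            F.count("*").alias("n_requests"),
            F.sum("read_bytes").alias("bytes_read"),
            F.countDistinct("site").alias("n_sites"),
        )
)

# Persist the feature table for the DODAS-managed training facility.
features.write.mode("overwrite").parquet("hdfs:///cache/features/")
```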
The INFN Cloud project was launched at the beginning of 2020 with the aim of building a distributed Cloud infrastructure and providing advanced services for the INFN scientific communities. A Platform as a Service (PaaS) was created inside INFN Cloud that allows the experiments to develop and access resources as Software as a Service (SaaS), and CYGNO is the beta tester of this system. The aim of the CYGNO experiment is to realize a large gaseous Time Projection Chamber based on the optical readout of the photons produced in the avalanche multiplication of ionization electrons in a GEM stack. To this end, CYGNO exploits the progress in commercial scientific Active Pixel Sensors based on scientific CMOS technology for Dark Matter searches and Solar Neutrino studies. CYGNO, like many other astroparticle experiments, requires a computing model to acquire, store, simulate, and analyze data that is typically quite different from that of High Energy Physics (HEP) experiments. Indeed, astroparticle experiments are typically less demanding of computing resources than HEP ones, but they have to deal with unique and unrepeatable data, sometimes collected in extreme conditions, make extensive use of templates and Monte Carlo simulations, and often re-calibrate and reconstruct a given data set many times. Moreover, the variety and scale of computing models and requirements are extremely large. In this scenario, a Cloud infrastructure with standardized and optimized services offered to the scientific community could be a useful solution able to match the requirements of many small- and medium-size experiments. In this work, we present the CYGNO computing model based on the INFN Cloud infrastructure, in which the experiment software, easily extendable to similar applications in other experiments, provides tools as a service to store, archive, analyze, and simulate data.
AsyncStageOut (ASO) is the component of the CMS distributed data analysis system (CRAB) that manages user transfers in a centrally controlled way using the File Transfer Service (FTS3) at CERN. It addresses a major weakness of the previous, decentralized model, namely that the transfer of the user's output data to a single remote site was part of the job execution, resulting in inefficient use of job slots and an unacceptable failure rate. Currently ASO manages up to 600k files of various sizes per day from more than 500 users per month, spread over more than 100 sites. ASO uses a NoSQL database (CouchDB) for internal bookkeeping and as a way to communicate with other CRAB components. Since ASO/CRAB were put in production in 2014, the number of transfers increased steadily, up to the point where the pressure on the central CouchDB instance became critical, creating new challenges for the system's scalability, performance, and monitoring. This forced a re-engineering of the ASO application to increase its scalability and lower its operational effort. In this contribution we present a comparison of the performance of the current NoSQL implementation and a new SQL implementation, and discuss how their different strengths and features influenced the design choices and operational experience. We also discuss other architectural changes introduced in the system to handle the increasing load and latency in delivering output to the user.
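To make the NoSQL/SQL comparison concrete, the sketch below stores the same hypothetical transfer bookkeeping record first as a CouchDB document through its HTTP API and then as a row in a relational table; the field names and table layout are illustrative only, not the actual ASO schema.

```python
# Hypothetical ASO-like transfer bookkeeping record, stored two ways.
import json
import sqlite3
import requests

record = {
    "_id": "task123:output_1.root",   # document id / primary key (illustrative)
    "user": "jdoe",
    "source": "T2_IT_Bari",
    "destination": "T2_US_Nebraska",
    "state": "new",                   # e.g. new -> acquired -> done / failed
    "retries": 0,
}

# NoSQL variant: one document per file in CouchDB (PUT creates or updates it).
requests.put(
    "http://localhost:5984/asodb/" + record["_id"],
    data=json.dumps(record),
    headers={"Content-Type": "application/json"},
)

# SQL variant: one row per file in a relational table.
db = sqlite3.connect("aso.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS transfers (
           id TEXT PRIMARY KEY, user TEXT, source TEXT,
           destination TEXT, state TEXT, retries INTEGER)"""
)
db.execute(
    "INSERT OR REPLACE INTO transfers VALUES (?, ?, ?, ?, ?, ?)",
    (record["_id"], record["user"], record["source"],
     record["destination"], record["state"], record["retries"]),
)
db.commit()
```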
Efficient monitoring of CRAB jobs at CMS
Silva, J M D; Balcas, J; Belforte, S ...
Journal of Physics: Conference Series, 10/2017, Volume 898, Issue 9
Journal Article; Peer reviewed; Open access
CRAB is a tool used for distributed analysis of CMS data. Users can submit sets of jobs with similar requirements (tasks) with a single request. CRAB uses a client-server architecture, in which a lightweight client, a server, and ancillary services work together and are maintained by CMS operators at CERN. As with most complex software, good monitoring tools are crucial for efficient use and long-term maintainability. This work gives an overview of the monitoring tools developed to ensure that the CRAB server and infrastructure are functional, to help operators debug user problems, and to minimize overhead and operating cost. It also illustrates the design choices and reports on our experience with the tools we developed and the external ones we used.
CMS distributed data analysis with CRAB3
Mascheroni, M; Balcas, J; Belforte, S ...
Journal of Physics: Conference Series, 12/2015, Volume 664, Issue 6
Journal Article; Peer reviewed; Open access
The CMS Remote Analysis Builder (CRAB) is a distributed workflow management tool which facilitates analysis tasks by isolating users from the technical details of the Grid infrastructure. Throughout LHC Run 1, CRAB was successfully employed by an average of 350 distinct users each week, executing about 200,000 jobs per day. CRAB has been significantly upgraded in order to face the new challenges posed by LHC Run 2. Components of the new system include 1) a lightweight client, 2) a central primary server which communicates with the clients through a REST interface, 3) secondary servers which manage user analysis tasks and submit jobs to the CMS resource provisioning system, and 4) a central service to asynchronously move user data from temporary storage at the execution site to the desired storage location. The new system improves the robustness, scalability, and sustainability of the service. Here we provide an overview of the new system, its operation, and user support, report on its current status, and identify lessons learned from the commissioning phase and production roll-out.
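As a rough illustration of the client/server split described above, the sketch below shows how a lightweight client might submit a task description to the central server through a REST interface using a grid proxy for authentication; the endpoint URL, resource path, and payload fields are placeholders, not the real CRAB3 API.

```python
# Illustrative lightweight client: POST a task description to a central REST
# server and poll its status. URL, path and payload fields are hypothetical.
import requests

SERVER = "https://crab-server.example.cern.ch/api"   # placeholder endpoint
PROXY = "/tmp/x509up_u1000"                          # user grid proxy (cert and key)

task = {
    "workflow": "analysis_run2_test",
    "dataset": "/SomePrimaryDataset/SomeEra-PromptReco/AOD",
    "units_per_job": 10,
}

# Submit the task; the central server queues it for the secondary (task worker) servers.
resp = requests.post(SERVER + "/task", json=task, cert=(PROXY, PROXY), verify=True)
resp.raise_for_status()
task_id = resp.json()["task_id"]

# Later: ask the central server for the task status.
status = requests.get(SERVER + "/task", params={"id": task_id}, cert=(PROXY, PROXY)).json()
print(status)
```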
AsyncStageOut (ASO) is a new component of CRAB, the distributed data analysis system of CMS, designed for managing users' data. It addresses a major weakness of the previous model, namely that the stage-out of output data to mass storage was part of the job execution, resulting in inefficient use of job slots and an unacceptable failure rate at the end of the jobs. ASO foresees the management of up to 400k files per day of various sizes, spread worldwide across more than 60 sites. It must handle up to 1000 individual users per month and work with minimal delay. This creates challenging requirements for system scalability, performance, and monitoring. ASO uses FTS to schedule and execute the transfers between the storage elements of the source and destination sites. It has evolved from a limited prototype into a highly adaptable service which manages and monitors the user file placement and bookkeeping. To ensure system scalability and data monitoring, it employs new technologies such as a NoSQL database and re-uses existing components of PhEDEx and the FTS Dashboard. We present the asynchronous stage-out strategy and the architecture of the solution we implemented to deal with these issues and challenges. The deployment model for the high availability and scalability of the service is discussed. The performance of the system during commissioning and the first phase of production is also shown, along with results from simulations designed to explore the limits of scalability.
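For concreteness, the sketch below shows how a single user-output transfer might be scheduled through FTS, assuming the FTS3 Python "easy" bindings (fts3.rest.client.easy) are available; the endpoint and source/destination URLs are placeholders.

```python
# Sketch: schedule one output-file transfer through FTS3 using the "easy"
# Python bindings. Endpoint and storage URLs are placeholders.
import fts3.rest.client.easy as fts3

context = fts3.Context("https://fts3.cern.ch:8446")  # FTS REST endpoint (placeholder)

source = "srm://se.source.example/store/temp/user/output_1.root"
destination = "srm://se.dest.example/store/user/output_1.root"

transfer = fts3.new_transfer(source, destination)
job = fts3.new_job(transfers=[transfer], verify_checksum=True)

job_id = fts3.submit(context, job)
print(fts3.get_job_status(context, job_id)["job_state"])
```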
The distributed data analysis workflow in CMS assumes that jobs run in a different location to where their results are finally stored. Typically the user outputs must be transferred from one site to another by a dedicated CMS service, AsyncStageOut. This service was originally developed to address the inefficient use of CMS computing resources caused by transferring analysis job outputs synchronously from the job execution node to the remote site as soon as they are produced. AsyncStageOut is designed as a thin application relying only on a NoSQL database (CouchDB) for input and data storage. It has progressed from a limited prototype to a highly adaptable service which manages and monitors all the user file handling steps, namely file transfer and publication. AsyncStageOut is integrated with the Common CMS/ATLAS Analysis Framework. It foresees the management of nearly 200k user files per day from close to 1000 individual users per month with minimal delays, providing real-time monitoring and reports to users and service operators, while being highly available. The associated data volume represents a new set of challenges in the areas of database scalability and service performance and efficiency. In this paper, we present an overview of the AsyncStageOut model and the integration strategy with the Common Analysis Framework. The motivations for using NoSQL technology are also presented, as well as the data design and the techniques used for efficient indexing and monitoring of the data. We describe the deployment model for the high availability and scalability of the service. We also discuss the hardware requirements and the results achieved, as determined by testing with actual data and realistic loads during commissioning and the initial production phase with the Common Analysis Framework.
In 2012, 14 Italian institutions participating in LHC experiments won a grant from the Italian Ministry of Research (MIUR), with the aim of optimising analysis activities and, in general, the Tier2/Tier3 infrastructure. We report on the research activities carried out and on the considerable improvement in the ease of access to resources for physicists, including those with no specific computing expertise. We focused on items such as distributed storage federations, access to batch-like facilities, provisioning of user interfaces on demand, and cloud systems. R&D on next-generation databases, distributed analysis interfaces, and new computing architectures was also carried out. The project, ending in the first months of 2016, will produce a white paper with recommendations on best practices for data-analysis support by computing centers.
Abstract
We present the first Fermi Large Area Telescope (LAT) catalog of long-term γ-ray transient sources (1FLT). This comprises sources that were detected on monthly time intervals during the first decade of Fermi-LAT operations. The monthly timescale allows us to identify transient and variable sources that were not yet reported in other Fermi-LAT catalogs. The monthly data sets were analyzed using a wavelet-based source detection algorithm that provided the candidate new transient sources. The search was limited to the extragalactic regions of the sky to avoid the dominance of the Galactic diffuse emission at low Galactic latitudes. The transient candidates were then analyzed using the standard Fermi-LAT maximum likelihood analysis method. All sources detected with a statistical significance above 4σ in at least one monthly bin were listed in the final catalog. The 1FLT catalog contains 142 transient γ-ray sources that are not included in the 4FGL-DR2 catalog. Many of these sources (102) have been confidently associated with active galactic nuclei (AGNs): 24 are associated with flat-spectrum radio quasars, 1 with a BL Lac object, 70 with blazars of uncertain type, 3 with radio galaxies, 1 with a compact steep-spectrum radio source, 1 with a steep-spectrum radio quasar, and 2 with AGNs of other types. The remaining 40 sources have no candidate counterparts at other wavelengths. The median γ-ray spectral index of the 1FLT-AGN sources is softer than that reported in the latest Fermi-LAT AGN general catalog. This result is consistent with the hypothesis that detection of the softest γ-ray emitters is less efficient when the data are integrated over year-long intervals.
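As a side note on the quoted significance: in the standard Fermi-LAT maximum likelihood analysis the detection significance is commonly estimated from the likelihood-ratio test statistic, roughly as sketched below (assuming Wilks' theorem with one additional degree of freedom).

```latex
% Approximate relation between the likelihood-ratio test statistic (TS)
% and the detection significance, for one additional degree of freedom.
\[
  \mathrm{TS} = 2\,\bigl[\ln \mathcal{L}_{\mathrm{src+bkg}} - \ln \mathcal{L}_{\mathrm{bkg}}\bigr],
  \qquad
  \text{significance} \approx \sqrt{\mathrm{TS}},
\]
% so a 4-sigma detection corresponds roughly to TS >~ 16.
```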