This paper describes the adoption and extension of the TOSCA standard by the INDIGO-DataCloud project for the definition and deployment of complex computing clusters, together with the required support in both OpenStack and OpenNebula, carried out in close collaboration with industry partners such as IBM. Two examples of these clusters are described in this paper: an elastic computing cluster supporting the Galaxy bioinformatics application, where nodes are dynamically added to and removed from the cluster to adapt to the workload, and a scalable Apache Mesos cluster for the execution of batch jobs and the support of long-running services. The coupling of TOSCA with Ansible roles to perform automated installation has resulted in the definition of high-level, deterministic templates to provision complex computing clusters across different Cloud sites.
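To make the idea concrete, the following is a minimal sketch, not taken from the paper, of how such a TOSCA-style topology could be composed programmatically and serialized to YAML; the node type, capability values and Ansible role name (tosca.nodes.indigo.Compute, indigo-dc.galaxycloud) are illustrative placeholders rather than the project's actual definitions.

# Minimal sketch (placeholders, not the INDIGO type hierarchy): composing a
# TOSCA-like node template in Python and serializing it to YAML.
import yaml  # pip install pyyaml

template = {
    "tosca_definitions_version": "tosca_simple_yaml_1_0",
    "topology_template": {
        "node_templates": {
            "galaxy_front_end": {
                "type": "tosca.nodes.indigo.Compute",  # assumed type name
                "capabilities": {
                    "host": {"properties": {"num_cpus": 2, "mem_size": "4 GB"}},
                },
                "artifacts": {
                    # An Ansible role attached as a deployment artifact, so the
                    # orchestrator can run it after provisioning the node.
                    "galaxy_role": {
                        "file": "indigo-dc.galaxycloud",  # hypothetical role name
                        "type": "tosca.artifacts.AnsibleGalaxy.role",
                    },
                },
            },
        },
    },
}

print(yaml.safe_dump(template, sort_keys=False))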
This paper describes the achievements of the H2020 project INDIGO-DataCloud. The project has provided e-infrastructures with tools, applications and cloud framework enhancements to manage the demanding requirements of scientific communities, either locally or through enhanced interfaces. The middleware developed makes it possible to federate hybrid resources and to easily write, port and run scientific applications in the cloud. In particular, we have extended existing PaaS (Platform as a Service) solutions, allowing public and private e-infrastructures, including those provided by EGI, EUDAT, and Helix Nebula, to integrate their existing services and make them available through AAI services compliant with GEANT interfederation policies, thus guaranteeing transparency and trust in the provisioning of such services. Our middleware facilitates the execution of applications using containers on Cloud and Grid based infrastructures, as well as on HPC clusters. Our developments are freely downloadable as open source components, and are already being integrated into many scientific applications.
DODAS stands for Dynamic On Demand Analysis Service and is a Platform as a Service toolkit built around several EOSC-hub services, designed to instantiate and configure on-demand container-based clusters over public or private Cloud resources. It automates the whole workflow from service provisioning to the configuration and setup of software applications. Such a solution therefore allows using "any cloud provider" with almost zero effort. In this paper, we demonstrate how DODAS can be adopted as a deployment manager to set up and manage the compute resources and services required to develop an AI solution for smart data caching. The smart caching layer may reduce the operational cost and increase flexibility with respect to the regular, centrally managed storage of the current CMS computing model. The cache space should be dynamically populated with the most requested data. In addition, clustering such caching systems will make it possible to operate them as a Content Delivery System between data providers and end-users. Moreover, a geographically distributed caching layer will also be functional to a data-lake based model, where many satellite computing centres might appear and disappear dynamically. In this context, our strategy is to develop a flexible and automated AI environment for smart management of the content of such a clustered cache system. In this contribution, we describe the computational phases identified for the AI environment implementation, as well as the related DODAS integration. We start with an overview of the architecture for the pre-processing step, based on Spark, whose role is to prepare the data for a Machine Learning technique. Particular attention is given to the automation implemented through DODAS. We then show how to train an AI-based smart cache and how we implemented a training facility managed through DODAS. Finally, we provide an overview of the inference system, based on the CMS TensorFlow as a Service and also deployed as a DODAS service.
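As an illustration of the pre-processing phase, the sketch below shows how a Spark job could aggregate cache access logs into per-file features for a later Machine Learning step; the input path, column names and feature set are assumptions made for the example, not the actual CMS data layout.

# Minimal sketch (assumed log schema and paths, not the CMS ones): aggregate
# cache access logs into per-file features with Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cache-log-preprocessing").getOrCreate()

logs = spark.read.csv(
    "hdfs:///data/cache_access_logs/*.csv", header=True, inferSchema=True
)

# Per-file access statistics that a learning algorithm could use to decide
# which data are worth keeping in the cache.
features = (
    logs.groupBy("file_name")
        .agg(
            F.count("*").alias("n_accesses"),
            F.countDistinct("user").alias("n_users"),
            F.avg("read_bytes").alias("avg_read_bytes"),
        )
)

features.write.mode("overwrite").parquet("hdfs:///data/cache_features/")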
A dashboard devoted to the computing in the Italian sites of the ALICE experiment at the LHC has been deployed. A combination of different complementary monitoring tools is typically used in most of the Tier-2 sites: this makes it somewhat difficult to assess the status of a site at a glance and to compare information extracted from different sources for debugging purposes. To overcome these limitations, a dedicated ALICE dashboard has been designed and implemented in each of the ALICE Tier-2 sites in Italy: in particular, it provides a single, interactive and easily customizable graphical interface where heterogeneous data are presented. The dashboard is based on two main ingredients: an open source time-series database and a dashboard builder tool for visualizing time-series metrics. Various sensors, able to collect data from the multiple data sources, have also been written. A first version of a national computing dashboard has been implemented using a dedicated instance of the builder that gathers data from all the local databases.
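As an example of how such a sensor could feed the time-series database, the sketch below assumes an InfluxDB back-end and its Python client; the paper only specifies "an open source time-series database", so the host, database and measurement names are purely illustrative.

# Illustrative sensor sketch (assumes InfluxDB; names are placeholders):
# push one sample of batch-system activity to the time-series database.
from datetime import datetime
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="dashboard.example.it", port=8086, database="alice_t2")

def publish_sample(site, running_jobs, queued_jobs):
    """Write one point describing the current batch-farm activity."""
    point = {
        "measurement": "batch_jobs",      # illustrative measurement name
        "tags": {"site": site},
        "fields": {"running": running_jobs, "queued": queued_jobs},
        "time": datetime.utcnow().isoformat() + "Z",
    }
    client.write_points([point])

publish_sample("Bari", running_jobs=1234, queued_jobs=87)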
A cloud-based Virtual Analysis Facility (VAF) for the ALICE experiment at the LHC has been deployed in Bari. Similar facilities are currently running in other Italian sites with the aim of creating a federation of interoperating farms able to provide their computing resources for interactive distributed analysis. The use of cloud technology, along with elastic provisioning of computing resources as an alternative to the Grid for running data-intensive analyses, is the main challenge of these facilities. One of the crucial aspects of user-driven analysis execution is data access. A local storage facility has the disadvantage that the stored data can be accessed only locally, i.e. from within the single VAF. To overcome this limitation, a federated infrastructure has been set up which provides full access to all the data belonging to the federation, independently of the site where they are stored. The federation architecture exploits both cloud computing and XRootD technologies in order to provide a dynamic, easy-to-use and well-performing solution for data handling. It should allow users to store files and efficiently retrieve data, since it implements a dynamic distributed cache among many data centres in Italy connected to one another through the high-bandwidth national network. Details of the preliminary architecture implementation and performance studies are discussed.
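A minimal sketch of how a VAF user could read data through the federation follows, using the XRootD Python bindings; the redirector hostname and file path are placeholders, since the actual endpoints of the Italian federation are not given here.

# Minimal sketch (placeholder redirector and path): reading a file through the
# XRootD federation. The redirector resolves the request to whichever site in
# the federation actually holds (or caches) the data.
from XRootD import client
from XRootD.client.flags import OpenFlags

f = client.File()
status, _ = f.open(
    "root://redirector.example.infn.it:1094//alice/data/some/dataset/file.root",
    OpenFlags.READ,
)
if not status.ok:
    raise RuntimeError(f"open failed: {status.message}")

status, data = f.read(offset=0, size=1024)  # read the first kilobyte
print(f"read {len(data)} bytes through the federation")
f.close()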
One of the challenges a scientific computing centre has to face is to keep delivering well-consolidated computational frameworks (i.e. the batch computing farm) while conforming to modern computing paradigms. The aim is to ease system administration at all levels (from hardware to applications) and to provide a smooth end-user experience. Within the INDIGO-DataCloud project, we adopt two different approaches to implement a PaaS-level, on-demand Batch Farm Service based on HTCondor and Mesos. In the first approach, described in this paper, the various HTCondor daemons are packaged inside pre-configured Docker images and deployed as Long Running Services through Marathon, profiting from its health checks and failover capabilities. In the second approach, we are going to implement an ad-hoc HTCondor framework for Mesos. Container-to-container communication and isolation have been addressed by exploring a solution based on overlay networks (the Calico Project). Finally, we have studied the possibility of deploying an HTCondor cluster that spans different sites, exploiting the Condor Connection Broker component, which allows communication across a private network boundary or firewall, as in the case of multi-site deployments. In this paper, we describe and motivate our implementation choices and show the results of the first tests performed.
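As an illustration of the first approach, the sketch below registers an HTCondor daemon as a Marathon Long Running Service through Marathon's REST API; the Marathon endpoint, Docker image name and health-check command are placeholders, not the project's actual configuration.

# Illustrative sketch: deploying a containerised HTCondor central manager as a
# Marathon application. Endpoint, image and health check are placeholders.
import requests

MARATHON_URL = "http://marathon.example.infn.it:8080"

app = {
    "id": "/htcondor/central-manager",
    "cpus": 1.0,
    "mem": 2048,
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/htcondor-cm:latest"},
    },
    # Marathon restarts the container when the health check fails,
    # providing the failover behaviour mentioned above.
    "healthChecks": [
        {
            "protocol": "COMMAND",
            "command": {"value": "condor_status -limit 1"},
            "gracePeriodSeconds": 120,
            "intervalSeconds": 60,
        }
    ],
}

resp = requests.post(f"{MARATHON_URL}/v2/apps", json=app)
resp.raise_for_status()
print("Marathon deployment id(s):", resp.json().get("deployments"))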
Solid-state drives are currently being introduced in storage facilities for High Energy Physics (HEP) experiments, and their performance is measured and compared to that of standard magnetic disks. For this paper, a typical HEP data analysis is performed and used as a test to measure computing performance. The tests exploit the features provided by PROOF-Lite, which allows a huge number of events to be distributed among different CPU cores, thus reducing the overall time needed to complete the analysis task. These tests are carried out on a few computational devices typically hosted at a current Tier-2/Tier-3 facility. The performance results are provided in terms of figures of merit, with the main focus on scalability, described in terms of speed-up factor and event processing rate. The obtained results can be used as a guideline for both the typical HEP analyst and the Tier-2/Tier-3 manager: the former in configuring their own analysis task while dealing with increasing data sizes, the latter in implementing an interactive data analysis facility for HEP experiments while facing choices that concern both technological and economic aspects.
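The figures of merit mentioned above can be computed as in the short sketch below; the event count and wall-clock times are placeholder numbers, not measurements from the tests.

# Speed-up factor and event processing rate versus number of PROOF-Lite
# workers. All timing values below are placeholders, not measured results.
n_events = 10_000_000                                      # events in the analysed dataset
wall_time = {1: 5400.0, 4: 1450.0, 8: 780.0, 16: 460.0}   # seconds, placeholder timings
serial_time = wall_time[1]                                 # single-worker reference

for workers, t in wall_time.items():
    speedup = serial_time / t        # speed-up factor S(n) = T(1) / T(n)
    event_rate = n_events / t        # processed events per second
    print(f"{workers:2d} workers: speed-up = {speedup:4.1f}, "
          f"event rate = {event_rate:,.0f} ev/s")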
A Consortium between four LHC Computing Centres (Bari, Milano, Pisa and Trieste) was formed in 2010 to prototype analysis-oriented facilities for CMS data analysis, profiting from a grant from the Italian Ministry of Research. The Consortium aims to realize an ad-hoc infrastructure to ease the analysis activities on the huge data set collected at the LHC Collider. While "Tier-2" Computing Centres, specialized in organized processing tasks like Monte Carlo simulation, are nowadays a well-established concept with years of running experience, sites specialized in end-user chaotic analysis activities do not yet have a de facto standard implementation. In our effort, we focus on all the aspects that can make analysis tasks easier for a physics user who is not a computing expert. On the storage side, we are experimenting with storage techniques allowing for remote data access and with storage optimization for the typical analysis access patterns. On the networking side, we are studying the differences between flat and tiered LAN architectures, also using virtual partitioning of the same physical network for the different use patterns. Finally, on the user side, we are developing tools and instruments to allow for exhaustive monitoring of user processes at the site, and for an efficient support system in case of problems. We report the results of the tests executed on the different subsystems and give a description of the layout of the infrastructure in place at the sites participating in the Consortium.