Nowadays Machine Learning (ML) techniques are successfully used in many areas of High-Energy Physics (HEP) and will also play a significant role in the upcoming High-Luminosity LHC upgrade foreseen at CERN, when a huge amount of data will be produced by the LHC and collected by the experiments, facing challenges at the exascale. To favor the usage of ML in HEP analyses, it would be useful to have a service that performs the entire ML pipeline (reading the data, processing the data, training an ML model, and serving predictions) directly on ROOT files of arbitrary size from local or remote distributed data sources. The Machine Learning as a Service for HEP (MLaaS4HEP) solution we have already proposed aims to provide this kind of service and to be HEP experiment agnostic. To provide users with a real service and to integrate it into the INFN Cloud, we started working on the cloudification of MLaaS4HEP. This would allow users to exploit cloud resources and to work in a distributed environment. In this work, we provide updates on this topic and discuss a working prototype of the service running on INFN Cloud. It includes an OAuth2 proxy server as the authentication/authorization layer, an MLaaS4HEP server, an XRootD proxy server for enabling access to remote ROOT data, and the TensorFlow as a Service (TFaaS) service in charge of the inference phase. With this architecture, an authenticated and authorized HEP user can submit ML pipelines on local or remote ROOT files simply through HTTP calls.
Machine learning (ML) and deep learning (DL) techniques are increasingly influential in High Energy Physics, necessitating effective computing infrastructures and training opportunities for users and developers, particularly concerning programmable hardware like FPGAs. A gap exists in accessible tutorials on ML/DL on FPGAs that cater to diverse hardware specifications. To bridge this gap, a collaboration between INFN-Bologna, the University of Bologna, and INFN-CNAF produced a pilot course using virtual machines, in-house cloud platforms, and AWS instances, with Docker containers for the interactive exercises. Additionally, the Bond Machine software ecosystem, capable of generating FPGA-synthesizable computer architectures, is explored as a simplified approach for teaching FPGA programming.
Extension of the INFN Tier-1 on a HPC system Boccali, Tommaso; Dal Pra, Stefano; Spiga, Daniele ...
EPJ Web of Conferences, 2020, Volume: 245
Journal Article, Conference Proceeding
Peer reviewed
Open access
The INFN Tier-1 located at CNAF in Bologna (Italy) is a center of the WLCG e-Infrastructure, supporting the 4 major LHC collaborations and more than 30 other INFN-related experiments.
After multiple tests towards elastic expansion of CNAF compute power via Cloud resources (provided by Azure, Aruba and in the framework of the HNSciCloud project), and building on the experience gained with the production-quality extension of the Tier-1 farm on remote owned sites, the CNAF team, in collaboration with experts from the ALICE, ATLAS, CMS, and LHCb experiments, has been working to put into production an integrated HTC+HPC system with the PRACE CINECA center, located near Bologna. This extension will be implemented on the Marconi A2 partition, equipped with Intel Knights Landing (KNL) processors. A number of technical challenges were faced and solved in order to run successfully on low-RAM nodes, as well as to overcome the closed environment (network, access, software distribution, …) that HPC systems deploy with respect to standard GRID sites. We show preliminary results from a large-scale integration effort, using resources secured via the successful PRACE grant N. 2018194658, for 30 million KNL core hours.
As a joint effort from various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims at increasing the level of automation in computing operations and reducing human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals, by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to efficiently manage such heterogeneous infrastructures. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on "smart" solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.
In modern data centres an effective and efficient monitoring system is a critical asset, yet a continuous concern for administrators. Since its birth, the INFN Tier-1 data centre, hosted at CNAF, has used various monitoring tools, all of which were replaced a few years ago by a system common to all CNAF departments (based on Sensu, InfluxDB and Grafana). Given the complexity of the inter-dependencies among the services running at the data centre and the foreseen large increase of resources in the near future, a more powerful and versatile monitoring system is needed. This new monitoring system should be able to automatically correlate log files and metrics coming from heterogeneous sources and devices (including services, hardware and infrastructure), thus providing us with a suitable framework to implement a solution for the predictive analysis of the status of the whole environment. In particular, the possibility to correlate IT infrastructure monitoring information with the logs of running applications is of great relevance for quickly finding the root cause of application failures. At the same time, a modern, flexible and user-friendly analytics solution is needed to enable users, IT engineers and IT managers to extract valuable information from the different sources of collected data in a timely fashion. In this paper, a prototype of such a system, installed at the INFN Tier-1, is described, with an assessment of its state and an evaluation of the resources needed for a full production system. The technologies adopted, the amount of foreseen data, the target KPIs and the production design are illustrated.
In the near future, large scientific collaborations will face unprecedented computing challenges. Processing and storing exabyte datasets requires a federated infrastructure of distributed computing resources. The current systems have proven to be mature and capable of meeting the experiment goals, by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters and operational teams are needed to efficiently manage such heterogeneous infrastructures. A wealth of operational data can be exploited to increase the level of automation in computing operations by using adequate techniques, such as machine learning (ML), tailored to solve specific problems. The Operational Intelligence project is a joint effort from various WLCG communities aimed at increasing the level of automation in computing operations. We discuss how state-of-the-art technologies can be used to build general solutions to common problems and to reduce the operational cost of the experiment computing infrastructure.
The increase in the scale of LHC computing during Run 3 and Run 4 (HL-LHC) will certainly require radical changes to the computing models and the data processing of the LHC experiments. The working group established by WLCG and the HEP Software Foundation to investigate all aspects of the cost of computing and how to optimise them has continued producing results and improving our understanding of this process. In particular, experiments have developed more sophisticated ways to calculate their resource needs, and we now have a much more detailed process to calculate infrastructure costs. This includes studies on the impact of HPC and GPU-based resources on meeting the computing demands. We have also developed and refined tools to quantitatively study the performance of experiment workloads, and we are actively collaborating with other activities related to data access, benchmarking and technology cost evolution. In this contribution we present our recent developments and results and outline the directions of future work.
The increase in the scale of LHC computing expected for Run 3 and even more so for Run 4 (HL-LHC) over the next ten years will certainly require radical changes to the computing models and the data processing of the LHC experiments. Translating the requirements of the physics programmes into computing resource needs is a complicated process and subject to significant uncertainties. For this reason, WLCG has established a working group to develop methodologies and tools intended to characterise the LHC workloads, better understand their interaction with the computing infrastructure, calculate their cost in terms of resources and expenditure, and assist experiments, sites and the WLCG project in the evaluation of their future choices. This working group started in November 2017 and has about 30 active participants representing experiments and sites. In this contribution we present the activities, the results achieved and the future directions.
We introduce a method called evolving Log Parsing (eLP) to extract information granules and an interval rule-based classification model from streams of words in unstructured log files. Logs are elementary expressions of language that are used by computational systems to communicate with humans unidirectionally. The logs tell stories based on event occurrences. Any software expresses itself through a log language. In particular, the eLP approach identifies templates (patterns in textual data) in an unsupervised and incremental way. Online pattern classification is achieved with an effectiveness of (96.05 ± 1.04)% using 6 datasets, with eLP models exhibiting an interpretability level of about 0.04. We present a recursive model-interpretability index to evaluate rule-based classifiers, and discuss the effectiveness-interpretability tradeoff in an actual scenario, namely, the StorM Service of a computing center.
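To make the notion of a log "template" concrete, the following is a minimal Python sketch of unsupervised, line-by-line template extraction. It is not the eLP algorithm, which the abstract does not specify in detail; it swaps in a much simpler technique (masking number-like tokens with a placeholder and counting the resulting patterns). The names `to_template` and `count_templates` and the `<*>` placeholder are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: tokens that look like hex literals or (dotted) numbers are the
# variable parts of a log line; everything else is part of the template.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def to_template(line: str) -> str:
    """Return the line with number-like tokens replaced by the <*> placeholder."""
    return _VARIABLE.sub("<*>", line.strip())


def count_templates(lines):
    """Process lines one at a time and count how often each template occurs."""
    counts = Counter()
    for line in lines:
        counts[to_template(line)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Request 42 took 15 ms",
        "Request 7 took 230 ms",
        "Disk sda1 full",
    ]
    for template, n in count_templates(sample).items():
        print(n, template)
```

Because the counter is updated one line at a time, the same loop could consume a live log stream. A real parser would also need to handle variable parts that are not numeric, such as user names or file paths.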