DIRAC universal pilots Stagni, F; McNab, A; Luzzi, C ...
Journal of Physics: Conference Series, 10/2017, Volume 898, Issue 9
Journal Article
Peer reviewed
Open access
In the last few years, new types of computing models, such as IaaS (Infrastructure as a Service) and IaaC (Infrastructure as a Client), gained popularity. New resources may come as part of pledged resources, while others are in the form of opportunistic ones. Most but not all of these new infrastructures are based on virtualization techniques. In addition, some of them present opportunities for multi-processor computing slots to the users. Virtual Organizations are therefore facing heterogeneity of the available resources, and the use of an interware software like DIRAC to provide a transparent, uniform interface has become essential. The transparent access to the underlying resources is realized by implementing the pilot model. DIRAC's newest generation of generic pilots (the so-called Pilots 2.0) are the "pilots for all the skies", and were successfully released in production more than a year ago. They use a plugin mechanism that makes them easily adaptable. Pilots 2.0 have been used for fetching and running jobs on every type of resource, be it a Worker Node (WN) behind a CREAM/ARC/HTCondor/DIRAC Computing Element, a Virtual Machine running on IaaC infrastructures like Vac or BOINC, an IaaS cloud resource managed by Vcycle, an LHCb High Level Trigger farm node, or any type of opportunistic computing resource. Make a machine a "Pilot Machine", and all diversities between them disappear. This contribution describes how pilots are made suitable for different resources, and the recent steps taken towards a fully unified framework, including monitoring. The case of multi-processor computing slots, either on real or virtual machines, with the whole node or a partition of it, is also discussed.
In order to ensure optimal performance of LHCb Distributed Computing, based on LHCbDIRAC, it is necessary to be able to inspect the behaviour over time of many components: firstly the agents and services on which the infrastructure is built, but also all the computing tasks and data transfers that are managed by this infrastructure. This consists of recording and then analyzing time series of a large number of observables, for which the usage of SQL relational databases is far from optimal. Therefore, within DIRAC we have been studying novel possibilities based on NoSQL databases (ElasticSearch, OpenTSDB and InfluxDB), and as a result of this study we developed a new monitoring system based on ElasticSearch. It has been deployed on the LHCb Distributed Computing infrastructure, for which it collects data from all the components (agents, services, jobs) and allows reports to be created through Kibana and a web user interface based on the DIRAC web framework. In this paper we describe this new implementation of the DIRAC monitoring system. We give details on the ElasticSearch implementation within the DIRAC general framework, as well as an overview of the advantages of the pipeline aggregation used for creating a dynamic bucketing of the time series. We present the advantages of using the ElasticSearch DSL high-level library for creating and running queries. Finally, we present the performance of this system.
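The dynamic bucketing mentioned above can be illustrated without an ElasticSearch cluster: a date-histogram aggregation groups raw time-series samples into fixed-width buckets, and a pipeline aggregation then computes a derived metric (here a moving average) over those buckets rather than over the raw samples. A minimal pure-Python sketch of the idea; the 1-hour interval and the job-count field are illustrative assumptions, not the actual DIRAC schema:

```python
from collections import OrderedDict

def date_histogram(samples, interval):
    """Group (timestamp, value) samples into fixed-width buckets,
    averaging the values in each bucket -- what an ElasticSearch
    date_histogram with an avg sub-aggregation produces."""
    buckets = OrderedDict()
    for ts, value in sorted(samples):
        key = ts - (ts % interval)          # left edge of the bucket
        buckets.setdefault(key, []).append(value)
    return OrderedDict((k, sum(v) / len(v)) for k, v in buckets.items())

def moving_avg(bucketed, window):
    """Pipeline aggregation: a sliding-window average computed over
    the *output* of the date_histogram, not over the raw samples."""
    keys, vals = list(bucketed), list(bucketed.values())
    out = {}
    for i in range(len(vals)):
        lo = max(0, i - window + 1)
        out[keys[i]] = sum(vals[lo:i + 1]) / (i + 1 - lo)
    return out

# Illustrative samples: (unix timestamp, running-job count)
samples = [(0, 10.0), (1800, 20.0), (3600, 30.0), (7200, 50.0)]
hourly = date_histogram(samples, interval=3600)  # buckets at 0, 3600, 7200
smooth = moving_avg(hourly, window=2)
```

The point of doing the second pass over buckets rather than raw documents is exactly what makes pipeline aggregations cheap: the smoothing cost scales with the number of buckets, not the number of stored samples.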
Pilots 2.0: DIRAC pilots for all the skies Stagni, F; Tsaregorodtsev, A; McNab, A ...
Journal of Physics: Conference Series, 12/2015, Volume 664, Issue 6
Journal Article
Peer reviewed
Open access
In the last few years, new types of computing infrastructures, such as IaaS (Infrastructure as a Service) and IaaC (Infrastructure as a Client), gained popularity. New resources may come as part of pledged resources, while others are opportunistic. Most of these new infrastructures are based on virtualization techniques. Meanwhile, some concepts, such as distributed queues, lost appeal, while still supporting a vast amount of resources. Virtual Organizations are therefore facing heterogeneity of the available resources, and the use of an interware software like DIRAC to hide the diversity of underlying resources has become essential. The DIRAC WMS is based on the concept of pilot jobs, introduced back in 2004. A pilot is what creates the possibility to run jobs on a worker node. Within DIRAC, we developed a new generation of pilot jobs, which we dubbed Pilots 2.0. Pilots 2.0 are not tied to a specific infrastructure; rather, they are generic, fully configurable and extendible pilots. A Pilot 2.0 can be sent as a script to be run, or it can be fetched from a remote location. A Pilot 2.0 can run on every computing resource, e.g. on CREAM Computing Elements, on DIRAC Computing Elements, on Virtual Machines as part of the contextualization script, or on IaaC resources, provided that these machines are properly configured, hiding all the details of the Worker Node (WN) infrastructure. Pilots 2.0 can be generated server-side or client-side. Pilots 2.0 are the "pilots to fly in all the skies", aiming at easy use of computing power, in whatever form it is presented. Another aim is the unification and simplification of the monitoring infrastructure for all kinds of computing resources, by using pilots as a network of distributed sensors coordinated by a central resource monitoring system. Pilots 2.0 have been developed using the command pattern. VOs using DIRAC can tune Pilots 2.0 as they need, and extend or replace each and every pilot command in an easy way.
In this paper we describe how Pilots 2.0 work with distributed and heterogeneous resources, providing the necessary abstraction to deal with different kinds of computing resources.
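As a rough illustration of the command pattern mentioned in the abstract, the sketch below assembles a pilot from an ordered list of interchangeable command objects, and shows a VO replacing one standard command with its own subclass. The class and command names are invented for the example and do not reproduce the actual DIRAC pilot code:

```python
class PilotCommand:
    """Base class: each pilot step is an isolated, replaceable command."""
    def execute(self, state):
        raise NotImplementedError

class CheckEnvironment(PilotCommand):
    def execute(self, state):
        state["environment_ok"] = True      # e.g. probe the WN, CPU count, disk

class InstallDIRAC(PilotCommand):
    def execute(self, state):
        state["installed"] = "vanilla DIRAC"

class InstallVOExtension(InstallDIRAC):
    """A VO-specific override of a standard command."""
    def execute(self, state):
        state["installed"] = "DIRAC + VO extension"

class LaunchAgent(PilotCommand):
    def execute(self, state):
        # The job agent only starts if the environment check succeeded.
        state["agent_started"] = state.get("environment_ok", False)

def run_pilot(commands):
    """The pilot is just the sequential execution of its command list."""
    state = {}
    for command in commands:
        command.execute(state)
    return state

# A generic pilot, and a VO-tuned pilot that swaps exactly one command:
generic = run_pilot([CheckEnvironment(), InstallDIRAC(), LaunchAgent()])
vo_tuned = run_pilot([CheckEnvironment(), InstallVOExtension(), LaunchAgent()])
```

Because the command list itself is configuration, "extending or replacing each and every pilot command" amounts to editing that list, without touching the pilot driver.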
DIRAC in Large Particle Physics Experiments Stagni, F; Tsaregorodtsev, A; Arrabito, L ...
Journal of Physics: Conference Series, 10/2017, Volume 898, Issue 9
Journal Article
Peer reviewed
Open access
The DIRAC project is developing interware to build and operate distributed computing systems. It provides a development framework and a rich set of services for both Workload and Data Management tasks of large scientific communities. A number of High Energy Physics and Astrophysics collaborations have adopted DIRAC as the base for their computing models. DIRAC was initially developed for the LHCb experiment at the LHC, CERN. Later, the Belle II, BES III and CTA experiments, as well as the linear collider detector collaborations, started using DIRAC for their computing systems. Some of the experiments built their DIRAC-based systems from scratch; others migrated from previous solutions, ad hoc or based on different middleware. Adaptation of DIRAC for a particular experiment was enabled through the creation of extensions to meet their specific requirements. Each experiment has a heterogeneous set of computing and storage resources at its disposal, which are aggregated through DIRAC into a coherent pool. Users from different experiments can interact with the system in different ways depending on their specific tasks, expertise level and previous experience, using command-line tools, Python APIs or Web Portals. In this contribution we summarize the experience of using DIRAC in particle physics collaborations. The problems of migration to DIRAC from previous systems, and their solutions, are presented. An overview of specific DIRAC extensions is given. We hope that this review will be useful for experiments considering an update, or for those designing their computing models.
Running Jobs in the Vacuum McNab, A; Stagni, F; Garcia, M Ubeda
Journal of Physics: Conference Series, 01/2014, Volume 513, Issue 3
Journal Article
Peer reviewed
Open access
We present a model for the operation of computing nodes at a site using Virtual Machines (VMs), in which VMs are created and contextualized for experiments by the site itself. For the experiment, these VMs appear to be produced spontaneously "in the vacuum", rather than the experiment having to ask the site to create each one. This model takes advantage of the existing pilot job frameworks adopted by many experiments. In the Vacuum model, the contextualization process starts a job agent within the VM, and real jobs are fetched from the central task queue as normal. An implementation of the Vacuum scheme, Vac, is presented, in which a VM factory runs on each physical worker node to create and contextualize its set of VMs. With this system, each node's VM factory can decide which experiments' VMs to run, based on site-wide target shares and on a peer-to-peer protocol in which the site's VM factories query each other to discover which VM types they are running. A property of this system is that there is no gatekeeper service, head node, or batch system accepting and then directing jobs to particular worker nodes, avoiding several central points of failure. Finally, we describe tests of the Vac system using jobs from the central LHCb task queue, using the same contextualization procedure for VMs developed by LHCb for clouds.
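The target-share decision described above can be sketched in a few lines: each factory compares the site-wide target shares against the VM counts its peers report, and starts a VM for the experiment furthest below its share. The function name, the share values and the exact tie-breaking are assumptions for illustration, not the real Vac policy:

```python
def choose_vm_type(target_shares, running_counts):
    """Pick which experiment's VM a factory should start next: the
    experiment furthest below its site-wide target share, judged from
    the per-experiment VM counts learnt via the peer-to-peer protocol."""
    total = sum(running_counts.values())
    if total == 0:
        # Nothing running anywhere yet: start the largest share first.
        return max(target_shares, key=target_shares.get)

    def deficit(experiment):
        actual_share = running_counts.get(experiment, 0) / total
        return target_shares[experiment] - actual_share

    return max(target_shares, key=deficit)

# Illustrative site policy: LHCb should get 60% of slots, GridPP 40%.
shares = {"lhcb": 0.6, "gridpp": 0.4}
# Peers report 6 LHCb VMs and 2 GridPP VMs currently running:
next_vm = choose_vm_type(shares, {"lhcb": 6, "gridpp": 2})
```

Since every factory runs the same rule against the same peer-reported state, the site converges towards its target shares with no head node making the allocation centrally.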
The LHCb computing model was designed to support the LHCb physics program, taking into account LHCb specificities (event sizes, processing times, etc.). Within this model several key activities are defined, the most important of which are real data processing (reconstruction, stripping and streaming, group and user analysis), Monte Carlo simulation and data replication. In this contribution we detail how these activities are managed by the LHCbDIRAC Data Transformation System. The LHCbDIRAC Data Transformation System leverages the workload and data management capabilities provided by DIRAC, a generic community grid solution, to support data-driven workflows (or DAGs). The ability to combine workload and data tasks within a single DAG allows the creation of highly sophisticated workflows, with the individual steps linked by the availability of data. This approach also provides the advantage of a single point at which all activities can be monitored and controlled. While several interfaces are currently supported (including a Python API and a CLI), we present the ability to create LHCb workflows through a secure web interface and to control their state, in addition to creating and submitting jobs. To highlight the versatility of the system, we present in more detail experience with real data from the 2010 and 2011 LHC runs.
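"Steps linked by the availability of data" can be made concrete with a small scheduler sketch: each step declares the files it consumes and produces, and a step runs only once all of its inputs exist, so finishing one step unlocks the next. The step names and the file-based trigger below are illustrative, not the LHCbDIRAC transformation schema:

```python
def run_data_driven(dag, available_files):
    """Execute workflow steps as their input data becomes available.
    `dag` maps step name -> (input files, output files); a step runs
    only once every input it declares exists."""
    files = set(available_files)
    done, progressed = [], True
    while progressed:
        progressed = False
        for name, (inputs, outputs) in dag.items():
            if name not in done and set(inputs) <= files:
                files |= set(outputs)   # data produced by this step...
                done.append(name)       # ...unlocks downstream steps
                progressed = True
    return done

# A toy three-step chain in the spirit of the activities listed above:
workflow = {
    "reconstruction": (["raw.dst"], ["reco.dst"]),
    "stripping":      (["reco.dst"], ["stripped.dst"]),
    "replication":    (["stripped.dst"], []),
}
order = run_data_driven(workflow, ["raw.dst"])
```

The same loop is also the "single point" at which progress is visible: the `done` list and the growing `files` set are exactly what a monitoring view would report.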
Nowadays, many database systems are available, but they may not be optimised for storing time series data. Monitoring DIRAC jobs would be better done using a database optimised for storing time series data. So far it was done using a MySQL database, which is not well suited for such an application; therefore alternatives have been investigated. Choosing an appropriate database for storing huge amounts of time series data is not trivial, as one must take into account different aspects such as manageability, scalability and extensibility. We compared the performance of the Elasticsearch, OpenTSDB (based on HBase) and InfluxDB NoSQL databases, using the same set of machines and the same data. We also evaluated the effort required for maintaining them. Using the LHCb Workload Management System (WMS), based on DIRAC, as a use case, we set up a new monitoring system in parallel with the current MySQL system, and stored the same data into the databases under test. We evaluated the Grafana (for OpenTSDB) and Kibana (for Elasticsearch) metrics and graph editors for creating dashboards, in order to have a clear picture of the usability of each candidate. In this paper we present the results of this study and the performance of the selected technology. We also give an outlook of other potential applications of NoSQL databases within the DIRAC project.
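A like-for-like comparison of the kind described above needs the same ingest loop run against each backend. One way to arrange that is a harness that only assumes a `bulk_insert(batch)` method, so the same timing code can wrap an Elasticsearch, OpenTSDB or InfluxDB client; that common interface, and the in-memory stand-in used to exercise it offline, are assumptions of this sketch rather than the paper's actual benchmark code:

```python
import time

def insert_throughput(client, points, batch_size=1000):
    """Measure bulk-insert throughput (points/second) for one backend.
    `client` only needs a bulk_insert(batch) method, so each candidate
    database can be tested with identical batching and identical data."""
    start = time.perf_counter()
    for i in range(0, len(points), batch_size):
        client.bulk_insert(points[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(points) / elapsed if elapsed > 0 else float("inf")

class InMemoryClient:
    """Stand-in backend so the harness can be exercised without a server."""
    def __init__(self):
        self.stored = []
    def bulk_insert(self, batch):
        self.stored.extend(batch)

# Synthetic WMS-style samples: one point per timestamp with a job count.
points = [{"ts": i, "jobs": i % 50} for i in range(10_000)]
client = InMemoryClient()
rate = insert_throughput(client, points)  # points per second for this backend
```

Holding the batching and the dataset fixed across backends is what makes the resulting rates comparable; only the client wrapper changes per candidate.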
The LHCb experiment has been running production jobs in virtual machines since 2013 as part of its DIRAC-based infrastructure. We describe the architecture of these virtual machines and the steps taken to replicate the WLCG worker node environment expected by user and production jobs. This relies on the uCernVM system for providing root images for virtual machines. We use the CernVM-FS distributed filesystem to supply the root partition files, the LHCb software stack, and the bootstrapping scripts necessary to configure the virtual machines for us. Using this approach, we have been able to minimise the amount of contextualisation which must be provided by the virtual machine managers. We explain the process by which the virtual machine is able to receive payload jobs submitted to DIRAC by users and production managers, and how this differs from payloads executed within conventional DIRAC pilot jobs on batch-queue-based sites. We describe our operational experience in running production on VM-based sites managed using Vcycle/OpenStack, Vac, and HTCondor Vacuum. Finally, we show how our use of these resources is monitored using Ganglia and DIRAC.
The DIRAC Web Portal 2.0 Mathe, Z; Ramo, A Casajus; Lazovsky, N. ...
Journal of Physics: Conference Series, 01/2015, Volume 664, Issue 6
Journal Article
Peer reviewed
Open access
For many years the DIRAC interware (Distributed Infrastructure with Remote Agent Control) has had a web interface, allowing users to monitor DIRAC activities and also interact with the system. Since then many new web technologies have emerged; therefore a redesign and a new implementation of the DIRAC Web portal were necessary, taking into account the lessons learnt using the old portal. These new technologies allowed us to build a more compact, robust and responsive web interface that enables users to have better control over the whole system while keeping a simple interface. The web framework provides a large set of "applications", each of which can be used for interacting with various parts of the system. Communities can also create their own set of personalised web applications, and can easily extend existing ones with minimal effort. Each user can configure and personalise the view for each application and save it using the DIRAC User Profile service as a RESTful state provider, instead of using cookies. The owner of a view can share it with other users or within a user community. Compatibility between different browsers is assured, as well as with mobile versions. In this paper we present the new DIRAC Web framework as well as the LHCb extension of the DIRAC Web portal.
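Storing view state server-side instead of in cookies, and letting an owner share a saved view with another user, can be modelled in a few lines. This is a toy in-memory model of the idea only; the class, method names and keying scheme are invented for the example and are not the DIRAC UserProfile service API:

```python
class UserProfileStore:
    """Server-side storage of per-user, per-application view state,
    replacing cookies: state survives browser and machine changes."""
    def __init__(self):
        self._views = {}   # (user, app, view_name) -> state dict

    def save_view(self, user, app, name, state):
        self._views[(user, app, name)] = dict(state)

    def load_view(self, user, app, name):
        return self._views[(user, app, name)]

    def share_view(self, owner, app, name, recipient):
        """The owner grants another user a copy of a saved view."""
        self.save_view(recipient, app, name, self.load_view(owner, app, name))

store = UserProfileStore()
# A user personalises a monitoring application and saves the view:
store.save_view("alice", "JobMonitor", "default",
                {"columns": ["Site", "Status"], "page_size": 50})
# ...and shares that saved view with a colleague:
store.share_view("alice", "JobMonitor", "default", "bob")
```

Sharing by copy (rather than by reference) keeps the recipient's subsequent personalisations from silently changing the owner's view, which is one plausible reading of the sharing behaviour described above.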