In any distributed computing infrastructure, a job is normally forbidden to run for an indefinite amount of time. This limitation is implemented using different technologies, the most common one being the CPU time limit imposed by batch queues. It is therefore important to have a good estimate of how much CPU work a job will require: otherwise, it may be killed by the batch system, or by whatever system controls the job's execution. In many modern interware systems, jobs are actually executed by pilot jobs, which can use the whole available time to run multiple consecutive jobs. If at some point the time remaining in a pilot is too short for the execution of any job, the pilot has to be released, even though a shorter job could have used that time efficiently. Within LHCbDIRAC, the LHCb extension of the DIRAC interware, we developed a simple way to fully exploit the computing capabilities available to a pilot, even on resources with limited time capabilities, by adding elasticity to production Monte Carlo (MC) simulation jobs. With our approach, independently of the time available, LHCbDIRAC can always execute an MC job whose length is adapted to the available amount of time: the same job, running on different computing resources with different time limits, will therefore produce different numbers of events. The decision on the number of events to produce is made just in time at the start of the job, when the capabilities of the resource are known. To determine how many events an MC job will be instructed to produce, LHCbDIRAC requires just three values: the CPU work per event for that type of job, the power of the machine it is running on, and the time left for the job before being killed. Knowing these values, we can estimate the number of events the job will be able to simulate within the available CPU time. This paper demonstrates that, using this simple but effective solution, LHCb makes more efficient use of the available resources, and can easily use new types of resources. One example is resources provided by batch queues, where low-priority MC jobs can be used as "masonry" jobs in multi-job pilots. A second example is opportunistic resources with limited available time.
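As a rough illustration of this estimate, a minimal sketch follows; the function name, the safety margin and the unit conventions (CPU work in HS06.s, machine power in HS06) are assumptions for the example, not the actual LHCbDIRAC implementation:

```python
def events_to_produce(cpu_work_per_event, machine_power, time_left, margin=0.8):
    """Estimate how many MC events fit in the remaining CPU time.

    cpu_work_per_event : CPU work needed per event (e.g. HS06.s)
    machine_power      : benchmarked power of the worker node (e.g. HS06)
    time_left          : seconds left before the job is killed
    margin             : fraction of the budget kept for setup and upload
    """
    available_work = machine_power * time_left * margin  # total CPU-work budget
    return max(int(available_work / cpu_work_per_event), 0)

# Example: 25 HS06.s per event, a 10 HS06 slot, 2 hours left:
# events_to_produce(25, 10, 7200) -> 2304 events
```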
In order to ensure optimal performance of the LHCb Distributed Computing system, based on LHCbDIRAC, it is necessary to be able to inspect the behavior over time of many components: first of all the agents and services on which the infrastructure is built, but also all the computing tasks and data transfers that it manages. This consists of recording and then analyzing time series of a large number of observables, for which the use of SQL relational databases is far from optimal. Within DIRAC we have therefore been studying novel possibilities based on NoSQL databases (ElasticSearch, OpenTSDB and InfluxDB); as a result of this study we developed a new monitoring system based on ElasticSearch. It has been deployed on the LHCb Distributed Computing infrastructure, where it collects data from all the components (agents, services, jobs) and allows reports to be created through Kibana and through a web user interface based on the DIRAC web framework. In this paper we describe this new implementation of the DIRAC monitoring system. We give details of the ElasticSearch implementation within the general DIRAC framework, as well as an overview of the advantages of the pipeline aggregation used to create a dynamic bucketing of the time series. We present the advantages of using the ElasticSearch DSL high-level library for creating and running queries. Finally, we present the performance of the system.
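To give a flavor of the ElasticSearch DSL library mentioned above, here is a minimal sketch of a bucketed time-series query; the endpoint, index name and field names are assumptions, not the actual DIRAC monitoring schema:

```python
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch(["https://monitoring-host:9200"])  # hypothetical endpoint

# Count running jobs per site, bucketed into one-hour intervals.
s = Search(using=client, index="lhcb-jobs-*").filter("term", Status="Running")
s.aggs.bucket("per_hour", "date_histogram",
              field="timestamp", fixed_interval="1h") \
      .bucket("per_site", "terms", field="Site")

response = s.execute()
for interval in response.aggregations.per_hour.buckets:
    for site in interval.per_site.buckets:
        print(interval.key_as_string, site.key, site.doc_count)
```

The nested bucketing is what makes the time series dynamic: the histogram interval can be chosen per query rather than being fixed when the data is written.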
DIRAC universal pilots. Stagni, F; McNab, A; Luzzi, C; et al.
Journal of Physics: Conference Series, 10/2017, Volume 898, Issue 9
Journal Article
Peer-reviewed
Open access
In the last few years, new types of computing models, such as IaaS (Infrastructure as a Service) and IaaC (Infrastructure as a Client), have gained popularity. New resources may come as part of pledged resources, while others are opportunistic. Most, but not all, of these new infrastructures are based on virtualization techniques. In addition, some of them present multi-processor computing slots to the users. Virtual Organizations are therefore facing heterogeneity of the available resources, and the use of an interware like DIRAC to provide a transparent, uniform interface has become essential. Transparent access to the underlying resources is realized by implementing the pilot model. DIRAC's newest generation of generic pilots (the so-called Pilots 2.0) are the "pilots for all the skies", and were successfully released into production more than a year ago. They use a plugin mechanism that makes them easily adaptable. Pilots 2.0 have been used to fetch and run jobs on every type of resource, be it a Worker Node (WN) behind a CREAM/ARC/HTCondor/DIRAC Computing Element, a Virtual Machine running on IaaC infrastructures like Vac or BOINC, IaaS cloud resources managed by Vcycle, the LHCb High Level Trigger farm nodes, or any type of opportunistic computing resource: make a machine a "pilot machine", and all differences between them disappear. This contribution describes how pilots are made suitable for different resources, and the recent steps taken towards a fully unified framework, including monitoring. The case of multi-processor computing slots, on real or virtual machines, using the whole node or a partition of it, is also discussed.
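As a rough illustration of the plugin mechanism, the sketch below shows the same pilot selecting a different sequence of steps per resource type; the step names, registry and configuration shape are assumptions in the style of the DIRAC pilot, not its actual code:

```python
# Hypothetical plugin registry: each named step is a callable.
def check_worker_node():  print("inspecting the worker node")
def install_dirac():      print("installing the DIRAC client")
def configure_basics():   print("writing the local configuration")
def launch_agent():       print("starting the job agent")

REGISTRY = {
    "CheckWorkerNode": check_worker_node,
    "InstallDIRAC": install_dirac,
    "ConfigureBasics": configure_basics,
    "LaunchAgent": launch_agent,
}

# The same generic pilot adapts to the resource it lands on simply by
# executing a different list of plugins (the lists here are invented).
COMMANDS_PER_RESOURCE = {
    "grid_wn": ["CheckWorkerNode", "InstallDIRAC", "ConfigureBasics", "LaunchAgent"],
    "vm": ["InstallDIRAC", "ConfigureBasics", "LaunchAgent"],
}

def pilot_main(resource_type):
    for name in COMMANDS_PER_RESOURCE[resource_type]:
        REGISTRY[name]()

pilot_main("vm")
```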
Pilots 2.0: DIRAC pilots for all the skies. Stagni, F; Tsaregorodtsev, A; McNab, A; et al.
Journal of Physics: Conference Series, 12/2015, Volume 664, Issue 6
Journal Article
Peer-reviewed
Open access
In the last few years, new types of computing infrastructures, such as IaaS (Infrastructure as a Service) and IaaC (Infrastructure as a Client), have gained popularity. New resources may come as part of pledged resources, while others are opportunistic. Most of these new infrastructures are based on virtualization techniques. Meanwhile, some concepts, such as distributed queues, have lost appeal, even though they still serve a vast amount of resources. Virtual Organizations are therefore facing heterogeneity of the available resources, and the use of an interware like DIRAC to hide the diversity of the underlying resources has become essential. The DIRAC WMS is based on the concept of pilot jobs, introduced back in 2004. A pilot is what creates the possibility to run jobs on a worker node. Within DIRAC, we developed a new generation of pilot jobs, which we dubbed Pilots 2.0. Pilots 2.0 are not tied to a specific infrastructure; rather, they are generic, fully configurable and extensible pilots. A Pilot 2.0 can be sent as a script to be run, or it can be fetched from a remote location. A Pilot 2.0 can run on every computing resource, e.g. on CREAM Computing Elements, on DIRAC Computing Elements, on Virtual Machines as part of the contextualization script, or on IaaC resources, provided that these machines are properly configured, hiding all the details of the Worker Node (WN) infrastructure. Pilots 2.0 can be generated server-side and client-side. Pilots 2.0 are the "pilots to fly in all the skies", aiming at easy use of computing power in whatever form it is presented. Another aim is the unification and simplification of the monitoring infrastructure for all kinds of computing resources, by using pilots as a network of distributed sensors coordinated by a central resource monitoring system. Pilots 2.0 have been developed using the command pattern. VOs using DIRAC can tune Pilots 2.0 as they need, and can extend or replace each and every pilot command in an easy way. In this paper we describe how Pilots 2.0 work with distributed and heterogeneous resources, providing the necessary abstraction to deal with different kinds of computing resources.
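A minimal sketch of the command pattern as applied here, showing how a VO might replace one pilot command with its own; the class names and the substituted step are illustrative assumptions, not the actual DIRAC pilot code:

```python
class PilotCommand:
    """One self-contained step of the pilot workflow (command pattern)."""
    def __init__(self, pilot_params):
        self.pilot_params = pilot_params

    def execute(self):
        raise NotImplementedError

class InstallDIRAC(PilotCommand):
    def execute(self):
        print("installing DIRAC release", self.pilot_params.get("release"))

class LHCbInstallDIRAC(InstallDIRAC):
    """A hypothetical VO extension replacing the generic installation step."""
    def execute(self):
        print("installing LHCbDIRAC instead of vanilla DIRAC")

def run_pilot(command_classes, pilot_params):
    # The pilot is just an ordered list of commands; a VO tunes the list
    # and may substitute any entry with its own subclass.
    for command_class in command_classes:
        command_class(pilot_params).execute()

run_pilot([LHCbInstallDIRAC], {"release": "v8r2"})
```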
DIRAC in Large Particle Physics Experiments. Stagni, F; Tsaregorodtsev, A; Arrabito, L; et al.
Journal of Physics: Conference Series, 10/2017, Volume 898, Issue 9
Journal Article
Peer-reviewed
Open access
The DIRAC project develops interware to build and operate distributed computing systems. It provides a development framework and a rich set of services for both the Workload and Data Management tasks of large scientific communities. A number of High Energy Physics and Astrophysics collaborations have adopted DIRAC as the basis of their computing models. DIRAC was initially developed for the LHCb experiment at the LHC, CERN. Later, the Belle II, BES III and CTA experiments, as well as the linear collider detector collaborations, started using DIRAC for their computing systems. Some of the experiments built their DIRAC-based systems from scratch; others migrated from previous solutions, either ad hoc or based on different middleware. Adaptation of DIRAC to a particular experiment is enabled through the creation of extensions that meet its specific requirements. Each experiment has a heterogeneous set of computing and storage resources at its disposal, which are aggregated through DIRAC into a coherent pool. Users from different experiments can interact with the system in different ways, depending on their specific tasks, expertise level and previous experience, using command-line tools, Python APIs or Web Portals. In this contribution we summarize the experience of using DIRAC in particle physics collaborations. The problems of migration to DIRAC from previous systems, and their solutions, are presented, together with an overview of specific DIRAC extensions. We hope that this review will be useful for experiments considering an update, and for those designing their computing models.
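As an illustration of the Python API mentioned above, a minimal job submission sketch follows, along the lines of standard DIRAC examples; the exact bootstrap idiom varies between DIRAC versions and per-experiment extensions:

```python
# Standard DIRAC scripts initialize the framework before importing the API.
from DIRAC.Core.Base import Script
Script.parseCommandLine()

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("hello-dirac")
job.setExecutable("/bin/echo", arguments="Hello from DIRAC")

result = Dirac().submitJob(job)  # returns an S_OK/S_ERROR dictionary
print(result)
```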
Running Jobs in the Vacuum. McNab, A; Stagni, F; Ubeda Garcia, M
Journal of Physics: Conference Series, 01/2014, Volume 513, Issue 3
Journal Article
Peer-reviewed
Open access
We present a model for the operation of computing nodes at a site using Virtual Machines (VMs), in which VMs are created and contextualized for experiments by the site itself. To the experiment, these VMs appear to be produced spontaneously "in the vacuum", rather than the experiment having to ask the site to create each one. This model takes advantage of the existing pilot job frameworks adopted by many experiments. In the Vacuum model, the contextualization process starts a job agent within the VM, and real jobs are fetched from the central task queue as normal. An implementation of the Vacuum scheme, Vac, is presented, in which a VM factory runs on each physical worker node to create and contextualize its set of VMs. With this system, each node's VM factory can decide which experiments' VMs to run, based on site-wide target shares and on a peer-to-peer protocol in which the site's VM factories query each other to discover which VM types they are running. A property of this system is that there is no gatekeeper service, head node, or batch system accepting and then directing jobs to particular worker nodes, which avoids several central points of failure. Finally, we describe tests of the Vac system using jobs from the central LHCb task queue, using the same contextualization procedure for VMs that LHCb developed for clouds.
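A rough sketch of the target-share decision a VM factory might make, assuming it has already collected the VM types running across the site via the peer-to-peer protocol; the data structures and the tie-breaking rule are assumptions, not Vac's actual algorithm:

```python
from collections import Counter

def choose_vm_type(running_vm_types, target_shares):
    """Pick the experiment whose running fraction lags furthest behind
    its site-wide target share.

    running_vm_types : list of VM type names reported by all factories
    target_shares    : dict mapping VM type -> desired fraction (sums to 1)
    """
    counts = Counter(running_vm_types)
    total = sum(counts.values()) or 1
    # Deficit = target share minus current share; the largest deficit wins.
    return max(target_shares,
               key=lambda vm: target_shares[vm] - counts[vm] / total)

# Example: "lhcb" is entitled to 60% but holds only 1 of 3 running VMs.
print(choose_vm_type(["lhcb", "atlas", "atlas"], {"lhcb": 0.6, "atlas": 0.4}))
# -> "lhcb"
```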
The LHCb computing model was designed to support the LHCb physics program, taking into account LHCb specificities (event sizes, processing times, etc.). Within this model several key activities are defined, the most important of which are real data processing (reconstruction, stripping and streaming, group and user analysis), Monte Carlo simulation and data replication. In this contribution we detail how these activities are managed by the LHCbDIRAC Data Transformation System, which leverages the workload and data management capabilities provided by DIRAC, a generic community grid solution, to support data-driven workflows (DAGs). The ability to combine workload and data tasks within a single DAG makes it possible to create highly sophisticated workflows, with the individual steps linked by the availability of data. This approach also provides a single point at which all activities can be monitored and controlled. While several interfaces are currently supported (including a Python API and a CLI), we present the ability to create LHCb workflows through a secure web interface, to control their state, and to create and submit jobs. To highlight the versatility of the system, we present in more detail the experience with real data from the 2010 and 2011 LHC runs.
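To illustrate the idea of a data-driven workflow, here is a deliberately simplified sketch in which each step fires only when matching input data become available; the class, fields and file-matching rules are hypothetical, not the LHCbDIRAC Transformation System API:

```python
class Step:
    """A hypothetical workflow step triggered by data availability."""
    def __init__(self, name, input_query, action):
        self.name = name
        self.input_query = input_query  # predicate over file metadata
        self.action = action            # callable run on each new file

def run_transformation(steps, new_files):
    # Data-driven linking: a file entering the catalogue is routed to every
    # step whose input query it satisfies; each step may in turn produce
    # files that feed later steps of the DAG.
    for f in new_files:
        for step in steps:
            if step.input_query(f):
                step.action(f)

# Example: raw files go to reconstruction, reconstructed ones to stripping.
steps = [
    Step("reconstruction", lambda f: f.endswith(".raw"),
         lambda f: print("reconstructing", f)),
    Step("stripping", lambda f: f.endswith(".dst"),
         lambda f: print("stripping", f)),
]
run_transformation(steps, ["run1.raw", "run1.dst"])
```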
Nowadays many database systems are available, but they may not be optimized for storing time series data. Monitoring DIRAC jobs would be better done using a database optimized for storing time series data. So far this was done using a MySQL database, which is not well suited for such an application; therefore alternatives have been investigated. Choosing an appropriate database for storing huge amounts of time series data is not trivial, as one must take into account different aspects such as manageability, scalability and extensibility. We compared the performance of the Elasticsearch, OpenTSDB (based on HBase) and InfluxDB NoSQL databases, using the same set of machines and the same data. We also evaluated the effort required for maintaining them. Using the LHCb Workload Management System (WMS), based on DIRAC, as a use case, we set up a new monitoring system in parallel with the current MySQL system and stored the same data in the databases under test. We evaluated the Grafana (for OpenTSDB) and Kibana (for Elasticsearch) metrics and graph editors for creating dashboards, in order to have a clear picture of the usability of each candidate. In this paper we present the results of this study and the performance of the selected technology. We also give an outlook on other potential applications of NoSQL databases within the DIRAC project.
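A minimal sketch of the kind of write-throughput comparison described here, feeding identical records to each backend and timing the inserts; the writer callables are placeholders one would implement with each database's client library:

```python
import time

def benchmark_writes(writers, records):
    """Time how long each backend takes to ingest the same records.

    writers : dict mapping backend name -> callable(batch_of_records)
    records : list of dicts, e.g. {"timestamp": ..., "site": ..., "jobs": ...}
    """
    results = {}
    for name, write in writers.items():
        start = time.perf_counter()
        write(records)                      # same data for every candidate
        results[name] = time.perf_counter() - start
    return results

# Placeholder writers; real ones would use e.g. the Elasticsearch bulk API,
# the OpenTSDB HTTP put endpoint and the InfluxDB line protocol.
writers = {
    "elasticsearch": lambda recs: None,
    "opentsdb": lambda recs: None,
    "influxdb": lambda recs: None,
}
print(benchmark_writes(writers, [{"timestamp": 0, "site": "CERN", "jobs": 42}]))
```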
The LHCb experiment has been running production jobs in virtual machines since 2013 as part of its DIRAC-based infrastructure. We describe the architecture of these virtual machines and the steps taken to replicate the WLCG worker node environment expected by user and production jobs. This relies on the uCernVM system to provide root images for the virtual machines. We use the CernVM-FS distributed filesystem to supply the root partition files, the LHCb software stack, and the bootstrapping scripts necessary to configure the virtual machines for us. Using this approach, we have been able to minimize the amount of contextualization which must be provided by the virtual machine managers. We explain the process by which the virtual machine receives payload jobs submitted to DIRAC by users and production managers, and how this differs from payloads executed within conventional DIRAC pilot jobs on batch-queue-based sites. We describe our operational experience of running production on VM-based sites managed using Vcycle/OpenStack, Vac, and HTCondor Vacuum. Finally, we show how our use of these resources is monitored using Ganglia and DIRAC.