In 2012, 14 Italian institutions participating in LHC experiments won a grant from the Italian Ministry of Research (MIUR), with the aim of optimising analysis activities and, in general, the Tier-2 and Tier-3 infrastructure. We report on the research activities undertaken and on the considerable improvement in ease of access to resources for physicists, including those with no specific computing interests. We focused on items such as distributed storage federations, access to batch-like facilities, provisioning of user interfaces on demand, and cloud systems. R&D on next-generation databases, distributed analysis interfaces, and new computing architectures was also carried out. The project, ending in the first months of 2016, will produce a white paper with recommendations on best practices for data-analysis support by computing centers.
In the ATLAS computing model, Grid resources are managed by PanDA, the system designed for production and distributed analysis, and data are stored in various formats in ROOT files. End-user physicists can choose to use either the ATHENA framework or ROOT directly; the latter gives users the possibility to use PROOF to exploit the computing power of multi-core machines or to dynamically manage analysis facilities. Since analysis facilities are, in general, not dedicated to PROOF only, PROOF-on-Demand (PoD) is used to enable PROOF on top of an existing resource management system. In a previous work we investigated the use of PoD to enable PROOF-based analysis on Tier-2 facilities through the PoD/gLite plug-in interface. In this paper we present the status of our investigations using the recently developed PoD/PanDA plug-in to enable PROOF, with a real end-user ATLAS physics analysis as payload. Data were accessed using two different protocols: XRootD, at the site where the SRM interface is the Disk Pool Manager (DPM), and the file protocol, where the SRM interface is StoRM on top of the GPFS file system. We first describe the results of benchmark tests run at the ATLAS Italian Tier-1 and Tier-2 sites and at CERN. We then compare the results of different types of analysis, contrasting the performance of data access through the different SRM interfaces and of XRootD access over the LAN and the WAN using the ATLAS XRootD storage federation infrastructure.
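As an illustration of the user-side workflow described above, the following sketch opens a PROOF session on a PoD-managed cluster from PyROOT and processes a chain of files accessed via XRootD and via the file protocol. The dataset paths, tree name and selector are hypothetical placeholders; it is assumed that the PoD server and workers are already running and that `pod-info -c` returns the PROOF connection string.

```python
# Minimal sketch of a PoD/PROOF analysis session (assumptions noted above).
import subprocess
import ROOT

# Ask PoD for the connection string of the running PROOF master
# (assumption: `pod-info -c` prints something like user@host:port).
conn = subprocess.check_output(["pod-info", "-c"]).decode().strip()
proof = ROOT.TProof.Open(conn)          # attach to the PoD-provided PROOF cluster

chain = ROOT.TChain("physics")          # hypothetical tree name
# Files read remotely over XRootD, e.g. through the ATLAS storage federation ...
chain.Add("root://some-redirector.example//atlas/data/sample_1.root")
# ... or locally with the file protocol on a GPFS-mounted StoRM site.
chain.Add("file:///gpfs/atlas/data/sample_2.root")

chain.SetProof()                        # route Process() through the PROOF cluster
chain.Process("MyAnalysisSelector.C+")  # hypothetical TSelector holding the analysis
```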
The CREAM CE implements a Grid job management service available to end users and to other, higher-level Grid job submission services. It allows the submission, management and monitoring of computational jobs on local resource management systems. CREAM, which is part of the gLite Grid middleware, is available in the EGI production Grid, where it is used by several user communities in different job submission scenarios. In this paper, after a brief description of the CREAM CE architecture and functionality, we report on the status of this Grid service, focusing on the results, feedback and issues that had to be addressed. We also discuss its integration with other job submission services, in particular the gLite Workload Management System. The planned future activities concerning the maintenance and evolution of the CREAM CE are reported as well.
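To make the user-facing side of this service concrete, the sketch below submits a trivial job directly to a CREAM CE with the standard glite-ce-job-* command-line tools, driven from Python. The CE endpoint and queue are hypothetical, only commonly used options are shown, and a valid VOMS proxy is assumed to be in place.

```python
# Sketch: direct submission of a simple job to a CREAM CE (assumptions above).
import subprocess

JDL = """[
  Executable               = "/bin/hostname";
  StdOutput                = "out.txt";
  StdError                 = "err.txt";
  OutputSandbox            = { "out.txt", "err.txt" };
  OutputSandboxBaseDestURI = "gsiftp://localhost";  // keep the output on the CE
]"""

with open("hostname.jdl", "w") as f:
    f.write(JDL)

# Submit to a hypothetical CREAM endpoint (host:port/cream-<lrms>-<queue>);
# -a requests automatic proxy delegation.
out = subprocess.check_output([
    "glite-ce-job-submit", "-a",
    "-r", "ce01.example.infn.it:8443/cream-pbs-atlas",
    "hostname.jdl",
]).decode()
job_id = out.strip().splitlines()[-1]  # the job ID is printed on the last line

# Poll the job status through the same CE.
print(subprocess.check_output(["glite-ce-job-status", job_id]).decode())
```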
The gLite workload management system. Andreetto, P; Andreozzi, S; Avellino, G; et al. Journal of Physics: Conference Series, 07/2008, Volume 119, Issue 6. Journal article, peer reviewed, open access.
The gLite Workload Management System (WMS) is a collection of components that provide the service responsible for distributing and managing tasks across the computing and storage resources available on a Grid. The WMS receives job execution requests from a client, finds the appropriate resources, then dispatches and follows the jobs until completion, handling failures whenever possible. Besides single batch-like jobs, the compound job types handled by the WMS are Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs), Parametric Jobs (multiple jobs with one parametrized description), and Collections (multiple jobs with a common description). Jobs are described via a flexible, high-level Job Definition Language (JDL). New functionality was recently added to the system (use of Service Discovery for obtaining new service endpoints to be contacted, automatic sandbox file archiving/compression and sharing, and support for bulk submission and bulk matchmaking). Intensive testing and troubleshooting made it possible to dramatically increase both the job submission rate and service stability. Future developments of the gLite WMS will focus on reducing external software dependencies and improving portability, robustness and usability.
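As a concrete example of the JDL and of one of the compound job types mentioned above, the sketch below builds a small parametric job description and submits it through the standard glite-wms-job-* tools; attribute values and file names are illustrative only, and a valid VOMS proxy is assumed.

```python
# Sketch: a parametric JDL that the WMS expands into several jobs.
import subprocess

JDL = """
JobType        = "Parametric";
Executable     = "/bin/echo";
Arguments      = "running step _PARAM_";
StdOutput      = "out_PARAM_.txt";
StdError       = "err_PARAM_.txt";
OutputSandbox  = { "out_PARAM_.txt", "err_PARAM_.txt" };
Parameters     = 10;    // expanded into jobs with _PARAM_ = 0, 1, ..., 9
ParameterStart = 0;
ParameterStep  = 1;
"""

with open("parametric.jdl", "w") as f:
    f.write(JDL)

# -a: automatic proxy delegation; -o: store the returned job ID in a file.
subprocess.check_call(["glite-wms-job-submit", "-a", "-o", "jobids.txt",
                       "parametric.jdl"])
# Follow all jobs of the parametric request using the stored ID.
subprocess.check_call(["glite-wms-job-status", "-i", "jobids.txt"])
```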
With the advent of the recent European Union (EU) funded projects aimed at achieving an open, coordinated and proactive collaboration among the European communities that provide distributed computing services, stricter requirements and quality standards will be demanded of middleware providers. Such a highly competitive and dynamic environment, organized to comply with a business-oriented model, has already started pursuing quality criteria, which requires rigorous procedures, interfaces and roles to be formally defined for each step of the software life-cycle. This will ensure quality-certified releases and updates of the Grid middleware. In the European Middleware Initiative (EMI), the release management for one or more components will be organized into Product Team (PT) units, fully responsible for delivering production-ready, quality-certified software and for coordinating with each other to contribute to the EMI release as a whole. This paper presents the certification process, with respect to integration, installation, configuration and testing, adopted at INFN by the Product Team responsible for the gLite Web-Service-based Computing Element (CREAM CE) and for the Workload Management System (WMS). The resources used, the testbed layout, the integration and deployment methods, and the certification steps taken to provide feedback to developers and to guarantee quality results are described.
The High Throughput Computing paradigm typically involves a scenario whereby a given, estimated processing power is made available and sustained by the computing environment over a medium-to-long period of time. As a consequence, the performance goals are in general targeted at maximizing resource utilization to obtain the expected throughput, rather than at minimizing the run time of individual jobs. This does not mean that optimal resource selection through adequate workload management is not desired or effective; nonetheless, relatively small, pre-assessed percentages of suboptimal choices or unexpected events can be tolerated. However, there are use cases within the HEP community that the described model does not immediately fit. This paper deals with the workload needs primarily driven by the Collider Detector at Fermilab (CDF) experimental collaboration. In particular, the CDF analysis facility (CAF) typically operates by splitting its computations into so-called sections, which can be seen as sets of uniform and independent jobs. Processing a section cannot be considered complete until all its jobs have been successfully executed, thus requiring a Minimum Completion Time (MCT) dynamic scheduling policy where not even a single job should remain in a non-terminal Grid state. A significant part of the CDF analysis is processed on the European Grid infrastructure through the gLite Workload Management System (WMS) [2]. This paper describes the design enhancements and ranking algorithms the WMS has been provided with to implement an adaptive scheduling policy that minimises MCT. The case study, the outlined approach and first results are presented.
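The ranking mechanism referred to above is driven by JDL expressions evaluated by the WMS match-maker against the information published by each CE. The sketch below only illustrates that mechanism, using standard GLUE attributes and a hypothetical payload; it is not the adaptive MCT policy described in the paper, which is implemented inside the WMS itself.

```python
# Sketch: rank/requirements expressions in a JDL handled by the WMS
# (illustrative only; not the CDF/MCT algorithm of the paper).
JDL = """
Executable   = "run_section.sh";          // hypothetical per-section payload
InputSandbox = { "run_section.sh" };
// Only consider CEs that are in production and currently have free slots.
Requirements = other.GlueCEStateStatus == "Production"
               && other.GlueCEStateFreeCPUs > 0;
// Prefer CEs with the shortest estimated wait before the job starts
// (the WMS picks the CE with the highest Rank value).
Rank         = -other.GlueCEStateEstimatedResponseTime;
"""

with open("section.jdl", "w") as f:
    f.write(JDL)   # submitted with glite-wms-job-submit as in the sketch above
```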
The large amount of data produced by the ATLAS experiment requires new computing paradigms for data processing and analysis, involving many computing centres spread around the world. The computing workload is managed by regional federations, called “clouds”. The Italian cloud consists of a main (Tier-1) centre, located in Bologna, four secondary (Tier-2) centres, and a few smaller (Tier-3) sites. In this contribution we describe the Italian cloud facilities and the activities of data processing, analysis, simulation and software development performed within the cloud, and we discuss the tests of the new computing technologies contributing to the evolution of the ATLAS Computing Model.
ATLAS data are distributed centrally to Tier-1 and Tier-2 sites. The first stages of data selection and analysis take place mainly at Tier-2 centres, with the final, iterative and interactive stages taking place mostly at Tier-3 clusters. The Italian ATLAS cloud consists of a Tier-1, four Tier-2s, and Tier-3 sites at each institute. Tier-3s that are grid-enabled are used to test code that will then be run on a larger scale at Tier-2s. All Tier-3s offer interactive data access to their users and the possibility to run PROOF. This paper describes the hardware and software infrastructure choices made, reports on the operational experience after 10 months of LHC data taking, and discusses site performance.
The ATLAS experiment has been running continuous simulated event production for more than two years. A considerable fraction of the jobs is submitted daily and handled via the gLite Workload Management System, which overcomes several limitations of the previous LCG Resource Broker. The gLite WMS has been tested very intensively for the LHC experiments' use cases for more than six months, both in terms of performance and reliability. The tests were carried out by the LCG Experiment Integration Support team (in close contact with the experiments) together with the EGEE integration and certification team and the gLite middleware developers. A pragmatic, iterative and interactive approach allowed a very quick rollout of fixes and their rapid deployment, together with new functionality, for the ATLAS production activities. The same approach is being adopted for other middleware components such as the gLite and CREAM Computing Elements. In this contribution we summarize the lessons learned from the gLite WMS testing activity, pointing out the most important achievements and the open issues. In addition, we present the current status of the ATLAS simulated event production activity on the EGEE infrastructure based on the gLite WMS, showing the main improvements and benefits brought by the new middleware. Finally, the gLite WMS is being used by many other VOs, including the other LHC experiments; in particular, some statistics will be shown on the CMS experience running user analysis via the WMS.
Grid middleware stacks, including gLite, have matured to the point of being able to process millions of jobs per day. Logging and Bookkeeping, the gLite job-tracking service, keeps pace with this rate; however, it is not designed to provide a long-term archive of information on executed jobs. ATLAS, representative of a large user community, addresses this issue with its own job catalogue (ProdDB). Developing such a customized, not easily reusable service took considerable effort, which smaller communities cannot afford. By contrast, Job Provenance (JP), a generic gLite service designed for long-term archiving of information on executed jobs with a focus on scalability, extensibility, a uniform data view, and configurability, allows more specialized catalogues to be built easily. We present the first results of an experimental JP deployment for the ATLAS production infrastructure, in which a JP installation was fed a part of the ATLAS jobs and also stress-tested with real production data. The main outcome of this work is a demonstration that JP can complement large-scale application-specific job catalogue services, while serving a similar purpose where none are available.