Predicting the performance of infrastructure design options for complex federated infrastructures, such as the Worldwide LHC Computing Grid (WLCG), whose computing sites are distributed over a wide area network and support a multitude of users and workflows, is not trivial. Due to the complexity and size of these infrastructures, it is not feasible to deploy experimental test-beds at large scales merely to compare and evaluate alternate designs. An alternative is to study the behaviour of these systems using simulation. This approach has been used successfully in the past to identify efficient and practical infrastructure designs for High Energy Physics (HEP). A prominent example is the MONARC simulation framework, which was used to study the initial structure of the WLCG. New simulation capabilities are needed to model large-scale heterogeneous computing systems with complex networks, data access and caching patterns. We outline a modern tool for simulating HEP workloads that execute on distributed computing infrastructures, based on the SimGrid and WRENCH simulation frameworks. Studies of its accuracy and scalability are presented using HEP as a case study. Hypothetical adjustments to prevailing computing architectures in HEP are studied, providing insights into the dynamics of a part of the WLCG and identifying candidates for improvement.
The CMS collaboration is successfully using glideinWMS for managing grid resources within the WLCG project. The glidein mechanism, with HTCondor underneath, provides a clear separation of responsibilities between administrators operating the service and users utilizing the computational resources. German CMS collaborators (dCMS) have explored modern capabilities of glideinWMS, aiming to merge national grid resources, institutional CPU power and cloud resources into a set of pools with a common sign-in interface presented to HEP analysis users. The key goals of the service development include ease of use, uniform access, load balancing and automated selection among different resource technologies. The approach integrates the experience of dCMS during the development and integration phases as well as in production operations, and strongly encourages other countries to follow. First experience with the production system and an outlook on ongoing development will be presented.
The specific requirements concerning the software environment within the HEP community constrain the choice of resource providers for the outsourcing of computing infrastructure. The use of virtualization in HPC clusters and in the context of cloud resources is therefore a subject of recent developments in scientific computing. The dynamic virtualization of worker nodes in common batch systems provided by ViBatch serves each user with a dynamically virtualized subset of worker nodes on a local cluster. It can now be transparently extended through common open-source cloud interfaces such as OpenNebula or Eucalyptus, launching a subset of the virtual worker nodes within the cloud. This paper demonstrates how a dynamically virtualized computing cluster is combined with cloud resources by attaching remotely started virtual worker nodes to the local batch system.
The HappyFace Project. Mauch, Viktor; Ay, Cano; Birkholz, Stefan, et al. Journal of Physics: Conference Series, vol. 331, no. 8, December 2011.
An efficient administration of computing centres requires sophisticated tools for monitoring the local computing infrastructure. The enormous flood of information from different monitoring sources delays the identification of problems and unnecessarily complicates local administration. The meta-monitoring system HappyFace offers elegant mechanisms to collect, process and evaluate all relevant information and to condense it into a simple rating visualisation reflecting the current status of a computing centre. In this paper, we give an overview of the HappyFace architecture and selected modules.
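To make the rating idea concrete, here is a minimal sketch of how such a meta-monitoring module could condense raw monitoring data into a single site status. The module name, thresholds and aggregation rule are illustrative assumptions, not the actual HappyFace API:

```python
# Illustrative HappyFace-style meta-monitoring: each module pulls from one
# monitoring source and maps its findings to a normalized rating in [0, 1];
# the site status is an aggregate over all module ratings.
from dataclasses import dataclass

@dataclass
class ModuleResult:
    name: str
    rating: float  # 1.0 = healthy, 0.0 = critical

def rate_batch_queue(idle_jobs: int, running_jobs: int) -> ModuleResult:
    # Rate queue health by the idle/running ratio (thresholds are made up).
    ratio = idle_jobs / max(running_jobs, 1)
    rating = 1.0 if ratio < 2 else 0.5 if ratio < 10 else 0.0
    return ModuleResult("batch_queue", rating)

def site_status(results: list[ModuleResult]) -> float:
    # Condense all module ratings into one number: the worst module dominates,
    # so a single critical subsystem flags the whole centre.
    return min(r.rating for r in results)

print(site_status([rate_batch_queue(idle_jobs=50, running_jobs=400)]))  # 1.0
```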
Computing resource needs are expected to increase drastically in the future. The HEP experiments ATLAS and CMS foresee an increase by a factor of 5-10 in the volume of recorded data in the upcoming years. The current infrastructure, namely the WLCG, is not sufficient to meet these demands in terms of computing and storage resources.
Using non-HEP-specific resources is one way to reduce this shortage. However, doing so comes at a cost: first, with multiple such resources at hand, it becomes increasingly difficult for individual users, as each resource normally requires its own authentication and has its own access methods. Second, as these resources are not specifically designed for HEP workflows, they might lack dedicated software or other necessary services.
Allocating the resources at the different providers can be done by COBalD/TARDIS, developed at KIT. The resource manager integrates resources on demand into one overlay batch system, providing the user with a single point of entry. The software and services needed for the community's workflows are served transparently through containers.
With this, an HPC cluster at RWTH Aachen University is dynamically and transparently integrated into a Tier 2 WLCG resource, effectively doubling its computing capacity.
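The on-demand integration pattern described above can be sketched as follows: a resource manager boots placeholder jobs ("drones") at an external provider, and each drone registers with one overlay batch system so users see a single pool. All class and node names below are hypothetical; the real COBalD/TARDIS interfaces differ:

```python
# Minimal sketch of on-demand resource integration into an overlay batch
# system, under invented interfaces (not the COBalD/TARDIS API).
class HPCProvider:
    def __init__(self) -> None:
        self._count = 0

    def request_node(self) -> str:
        # In reality: submit a pilot job (e.g. via SLURM) that starts a
        # container with the experiment software and a batch-system daemon.
        self._count += 1
        return f"drone-aachen-{self._count:03d}"

class OverlayBatchSystem:
    def __init__(self) -> None:
        self.nodes: set[str] = set()

    def register(self, node: str) -> None:
        # The node now accepts user jobs like any local worker.
        self.nodes.add(node)

def integrate_on_demand(provider: HPCProvider, pool: OverlayBatchSystem,
                        pending_jobs: int, slots_per_node: int = 8) -> None:
    # Scale out only while demand exceeds the already-integrated capacity.
    while pending_jobs > len(pool.nodes) * slots_per_node:
        pool.register(provider.request_node())

pool = OverlayBatchSystem()
integrate_on_demand(HPCProvider(), pool, pending_jobs=16)
print(sorted(pool.nodes))  # ['drone-aachen-001', 'drone-aachen-002']
```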
Increased operational effectiveness and the dynamic integration of only temporarily available compute resources (opportunistic resources) will become more and more important in the next decade, due to the scarcity of resources for future high energy physics experiments as well as the desired integration of cloud and high-performance computing resources. This results in a more heterogeneous compute environment, which poses considerable challenges for the computing operation teams of the experiments.
At the Karlsruhe Institute of Technology (KIT) we design solutions to tackle these challenges. To ensure an efficient utilization of opportunistic resources and unified access to the entire infrastructure, we developed the Transparent Adaptive Resource Dynamic Integration System (TARDIS), a scalable multi-agent resource manager that provides interfaces to provision resources of various providers and to integrate them dynamically and transparently into one common overlay batch system. Operational effectiveness is guaranteed by relying on COBalD, the Opportunistic Balancing Daemon, and its simple approach of taking into account the utilization and allocation of the different resource types, so that each workflow runs on the best-suited resource.
In this contribution we present the current status of integrating various HPC centers and cloud providers into the compute infrastructure at the Karlsruhe Institute of Technology, as well as the experience gained in a production environment.
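The balancing idea attributed to COBalD above can be illustrated with a toy feedback rule: track how well each resource type is used (utilization) versus how much of it is claimed (allocation), and adjust the demand for that type accordingly. The update rule, thresholds and field names are illustrative, not COBalD's actual implementation:

```python
# Toy demand controller in the spirit of utilization/allocation balancing.
from dataclasses import dataclass

@dataclass
class ResourcePool:
    name: str
    demand: float       # how many machines we ask the provider for
    utilization: float  # fraction of claimed resources doing useful work
    allocation: float   # fraction of provisioned resources that are claimed

def rebalance(pool: ResourcePool, step: float = 1.0,
              low: float = 0.5, high: float = 0.9) -> None:
    # Well-used pool: ask for more. Poorly used pool: shrink, so capacity
    # can flow to a better-suited resource type instead.
    if pool.utilization >= high and pool.allocation >= high:
        pool.demand += step
    elif pool.allocation < low or pool.utilization < low:
        pool.demand = max(0.0, pool.demand - step)

hpc = ResourcePool("hpc-cloud", demand=10, utilization=0.95, allocation=0.92)
rebalance(hpc)
print(hpc.demand)  # 11.0 -> scale out the well-utilized pool
```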
The amount of data to be processed by experiments in high energy physics (HEP) will increase tremendously in the coming years. To cope with this increasing load, the most efficient use of resources is mandatory. Furthermore, the computing resources for user jobs in HEP will be increasingly distributed and heterogeneous, making scheduling more difficult due to the growing complexity of the system. We aim to create a simulation of the WLCG that helps the HEP community solve both challenges: utilizing the grid more efficiently and coping with the rising complexity of the system. No existing simulation helps grid operators make the correct decisions when optimizing a load-balancing strategy. This paper presents a proof of concept in which the computing jobs at the Tier 1 centre GridKa are modeled and simulated. To model the computing jobs, we extended the Palladio simulator with a mechanism to simulate load-balancing strategies. Furthermore, we implemented an automated model parameter analysis and model creation.
Finally, the simulation results are validated against real-world performance data. Our results suggest that simulating larger parts of the grid is feasible and can help to optimize the utilization of the grid.
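As a flavour of what such a simulation compares, here is a self-contained toy (written from scratch in plain Python, not using Palladio): jobs with known service times are dispatched to compute nodes under a pluggable strategy, and the resulting makespan lets strategies be ranked. All numbers are invented:

```python
# Toy load-balancing simulation: dispatch jobs to the least-loaded node and
# compare the makespan of two dispatch orders ("strategies").
import heapq

def simulate(job_durations: list[float], n_nodes: int, strategy: str) -> float:
    # Each heap entry is the time at which a node becomes free.
    nodes = [0.0] * n_nodes
    heapq.heapify(nodes)
    if strategy == "longest-first":  # a simple greedy improvement over FIFO
        job_durations = sorted(job_durations, reverse=True)
    makespan = 0.0
    for d in job_durations:
        free_at = heapq.heappop(nodes)  # least-loaded node
        finish = free_at + d
        heapq.heappush(nodes, finish)
        makespan = max(makespan, finish)
    return makespan

jobs = [5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 2.0]
for s in ("fifo", "longest-first"):
    print(s, simulate(jobs, n_nodes=2, strategy=s))  # 15.0 vs 14.0
```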
Current and future end-user analyses and workflows in High Energy Physics demand the processing of growing amounts of data. This plays a major role in the context of the High-Luminosity LHC. To keep processing times and turnaround cycles as low as possible, analysis clusters optimized for these demands can be used. Since hyper-converged servers offer a good combination of compute power and local storage, they form an ideal basis for such clusters. In this contribution we report on the setup and commissioning of a dedicated analysis cluster at Karlsruhe Institute of Technology. This cluster was designed for use cases demanding high data throughput. Based on hyper-converged servers, the cluster offers 500 job slots and 1 PB of local storage. Combined with the 100 Gbit/s network connection between the servers and a 200 Gbit/s uplink to the Tier-1 storage, the cluster can sustain a data throughput of 1 PB per day. In addition, the local storage provided by the hyper-converged worker nodes can be used as cache space. This allows caching approaches to be employed on the cluster, enabling a more efficient usage of the disk space. In previous contributions this concept has been shown to lead to an expected speedup of 2 to 4 compared to conventional setups.
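As a quick sanity check of the quoted figures (my arithmetic, assuming the "Gb" rates are Gbit/s), the 1 PB per day corresponds almost exactly to saturating the 100 Gbit/s interconnect:

```latex
\[
  100\,\tfrac{\mathrm{Gbit}}{\mathrm{s}} \times 86\,400\,\mathrm{s}
  = 8.64 \times 10^{15}\,\mathrm{bit}
  = 1.08 \times 10^{15}\,\mathrm{B}
  \approx 1.08\,\mathrm{PB\ per\ day}.
\]
```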
High throughput and short turnaround cycles are core requirements for the efficient processing of data-intensive end-user analyses in High Energy Physics (HEP). Together with the tremendously increasing amount of data to be processed, this leads to enormous challenges for HEP storage systems, networks and the distribution of data to computing resources for end-user analyses. Bringing data close to the computing resource is a very promising approach to overcome throughput limitations and improve the overall performance. However, achieving data locality by placing multiple conventional caches inside a distributed computing infrastructure leads to redundant data placement and inefficient usage of the limited cache volume. The solution is a coordinated placement of critical data on computing resources, which enables matching each process of an analysis workflow to its most suitable worker node in terms of data locality and thus reduces the overall processing time. This coordinated distributed caching concept was realized at KIT by developing the coordination service NaviX, which connects an XRootD cache proxy infrastructure with an HTCondor batch system. We give an overview of the coordinated distributed caching concept and of the experience collected on a prototype system based on NaviX.
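The matching step at the heart of this idea can be sketched as follows: score each worker node by how much of a job's input data it already caches, and send the job to the best node. The data structures and file names are hypothetical; the real NaviX service coordinates XRootD caches with HTCondor rather than using these interfaces:

```python
# Toy locality-aware matching: prefer the node holding the largest share of
# a job's input files, so reads hit the local cache instead of remote storage.
def cached_fraction(job_inputs: set[str], node_cache: set[str]) -> float:
    if not job_inputs:
        return 0.0
    return len(job_inputs & node_cache) / len(job_inputs)

def match_job(job_inputs: set[str], caches: dict[str, set[str]]) -> str:
    return max(caches, key=lambda node: cached_fraction(job_inputs, caches[node]))

caches = {
    "wn01": {"/store/a.root", "/store/b.root"},
    "wn02": {"/store/c.root"},
}
print(match_job({"/store/a.root", "/store/b.root"}, caches))  # wn01
```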
Applications of neural networks to data analyses in the natural sciences are complicated by the fact that many inputs are subject to systematic uncertainties. To control the dependence of the neural network function on variations of the input space within these systematic uncertainties, several methods have been proposed. In this work, we propose a new approach of training the neural network by introducing penalties on the variation of the neural network output directly in the loss function. This is achieved at the cost of only a small number of additional hyperparameters. It can also be pursued by treating all systematic variations in the form of statistical weights. The proposed method is demonstrated on a simple example based on pseudo-experiments, and on a more complex example from high-energy particle physics.
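A minimal sketch of this training idea, under my own assumptions about the setup (binary classification, PyTorch, a single penalty weight): besides the nominal task loss, penalize the shift of the network output when the inputs are varied within their systematic uncertainties. The paper's exact construction may differ:

```python
# Loss with a penalty on output variation under systematic input shifts.
import torch

def loss_with_syst_penalty(model, x_nominal, x_up, x_down, y, lam=1.0):
    # Nominal task loss (binary classification as an example).
    out_nom = model(x_nominal)
    task = torch.nn.functional.binary_cross_entropy(out_nom, y)
    # Penalty: the output should barely move under +/- 1 sigma input shifts.
    penalty = ((model(x_up) - out_nom) ** 2
               + (model(x_down) - out_nom) ** 2).mean()
    return task + lam * penalty  # lam is one of the few extra hyperparameters

# Usage with a toy model and fabricated shifted inputs:
model = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1), torch.nn.Sigmoid())
x = torch.randn(32, 4)
y = torch.rand(32, 1).round()
loss = loss_with_syst_penalty(model, x, x + 0.1, x - 0.1, y)
loss.backward()  # gradients now trade accuracy against systematic robustness
```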