The Open Science Grid (OSG) is a large, robust computing grid that started primarily as a collection of sites associated with large HEP experiments such as ATLAS, CDF, CMS, and DZero, but has evolved in recent years into a much larger user and resource platform. In addition to meeting the US LHC community's computational needs, the OSG continues to be one of the largest providers of distributed high-throughput computing (DHTC) to researchers from a wide variety of disciplines via the OSG Open Facility. The Open Facility consists of OSG resources that are available opportunistically to users other than the resource owners and their collaborators. In the past two years, the Open Facility has doubled its annual throughput to over 200 million wall hours. More than half of these resources are used by over 100 individual researchers from over 60 institutions in fields such as biology, medicine, math, economics, and many others. Over 10% of these individual users each utilized in excess of 1 million computational hours in the past year. The largest source of these cycles is temporary unused capacity at institutions affiliated with US LHC computational sites; an increasing fraction, however, comes from university HPC clusters and large national supercomputers offering unused capacity. Such expansions have allowed the OSG to provide ample computational resources to individual researchers and small groups as well as to sizable international science collaborations such as LIGO, AMS, IceCube, and sPHENIX. Opening up access to the Fermilab FabrIc for Frontier Experiments (FIFE) project has also allowed experiments such as mu2e and NOvA to make substantial use of Open Facility resources, the former with over 40 million wall hours in a year. We present how this expansion was accomplished, as well as future plans for keeping the OSG Open Facility at the forefront of enabling scientific research by way of DHTC.
Computing plays a significant role in all areas of high energy physics. The Snowmass 2021 CompF4 topical group's scope is facilities R&D, where we consider "facilities" to be the computing hardware and software infrastructure inside the data centers plus the networking between data centers, irrespective of who owns them and what policies are applied for using them. In other words, it includes commercial clouds, federally funded High Performance Computing (HPC) systems for all of science, and systems funded explicitly for a given experimental or theoretical program. This topical group report summarizes the findings and recommendations for the storage, processing, networking, and associated software service infrastructures for future high energy physics research, based on the discussions organized through the Snowmass 2021 community study.
In April of 2014, the UCSD T2 Center deployed hdfs-xrootd-fallback, a UCSD-developed software system that interfaces Hadoop with XRootD to increase reliability of the Hadoop file system. The hdfs-xrootd-fallback system allows a site to depend less on local file replication and more on global replication provided by the XRootD federation to ensure data redundancy. Deploying the software has allowed us to reduce Hadoop replication on a significant subset of files in our cluster, freeing hundreds of terabytes in our local storage, and to recover HDFS blocks lost due to storage degradation. An overview of the architecture of the hdfs-xrootd-fallback system will be presented, as well as details of our experience operating the service over the past year.
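A minimal sketch of the fallback logic described above, assuming a hypothetical FUSE mount of the local HDFS and an assumed federation redirector; in the deployed system this logic lives inside the HDFS client/block layer rather than in a standalone script.

```python
#!/usr/bin/env python
"""Conceptual sketch of the hdfs-xrootd-fallback idea: if a local HDFS read
fails (e.g. because of a lost or corrupt block), retrieve the file from the
global XRootD federation instead.  Hostnames and paths are hypothetical."""

import subprocess
import sys

FEDERATION = "root://cmsxrootd.fnal.gov/"   # assumed federation redirector
HDFS_MOUNT = "/mnt/hadoop"                  # assumed FUSE mount of the local HDFS


def read_with_fallback(lfn, dest):
    """Try the local HDFS copy first; fall back to the federation on error."""
    try:
        # Local read through the HDFS FUSE mount.
        with open(HDFS_MOUNT + lfn, "rb") as src, open(dest, "wb") as out:
            out.write(src.read())
        return "local"
    except IOError:
        # Local block missing or unreadable: pull the file over the WAN.
        subprocess.check_call(["xrdcp", "-f", FEDERATION + lfn, dest])
        return "federation"


if __name__ == "__main__":
    print(read_with_fallback(sys.argv[1], sys.argv[2]))
```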
The Pacific Research Platform is an initiative to interconnect Science DMZs between campuses across the West Coast of the United States over a 100 Gbps network. The LHC @ UC is a proof-of-concept pilot project that focuses on interconnecting six University of California campuses. It is spearheaded by computing specialists from the UCSD Tier 2 Center in collaboration with the San Diego Supercomputer Center. A machine has been shipped to each campus, extending the concept of the Data Transfer Node (DTN) to a "cluster in a box" that is fully integrated into the local compute, storage, and networking infrastructure. The node contains a full HTCondor batch system as well as an XRootD proxy cache. User jobs routed to the DTN can run on 40 additional slots provided by the machine, and can also flock to a common GlideinWMS pilot pool, which sends jobs out to any of the participating UCs as well as to Comet, the new supercomputer at SDSC. In addition, a common XRootD federation has been created to interconnect the UCs and provide the ability to export data arbitrarily from the home university, making it available wherever the jobs run. The UC-level federation also statically redirects to either the ATLAS FAX or the CMS AAA federation, depending on the end user's VO membership credentials, to make globally published datasets available. XRootD read operations from the federation transfer through the nearest DTN proxy cache, located at the site where the jobs run. This reduces wide area network overhead for subsequent accesses and improves overall read performance. Details on the technical implementation, the challenges faced and overcome in setting up the infrastructure, and an analysis of usage patterns and system scalability will be presented.
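As a rough illustration of the cache-aware read path, the sketch below builds a file URL that routes through a (hypothetical) local DTN proxy cache when one is known for the execution site, and otherwise falls back to an assumed UC federation redirector; the site-to-cache map and the GLIDEIN_Site environment variable are assumptions for the purpose of the example.

```python
"""Sketch of reading federation data through the nearest DTN proxy cache.
The per-site cache hosts are hypothetical; a real deployment would publish
them to jobs via the site environment or glidein attributes."""

import os
import subprocess

# Hypothetical map from UC site name to its DTN XRootD proxy cache.
SITE_CACHES = {
    "UCSD": "root://dtn-cache.ucsd.edu:1094/",
    "UCI":  "root://dtn-cache.uci.edu:1094/",
    "UCR":  "root://dtn-cache.ucr.edu:1094/",
}


def cached_url(lfn):
    """Prefix the logical file name with the local cache if one is known,
    otherwise read directly from the (assumed) UC federation redirector."""
    site = os.environ.get("GLIDEIN_Site", "")
    cache = SITE_CACHES.get(site)
    if cache:
        return cache + lfn                       # read via the proxy cache
    return "root://uc-redirector.example.edu/" + lfn


if __name__ == "__main__":
    url = cached_url("/store/user/example/dataset.root")
    subprocess.check_call(["xrdcp", url, "/tmp/dataset.root"])
```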
In late 2008 a new experimental facility, the Large Hadron Collider, is expected to start data taking. Two large experiments, ATLAS and CMS, are getting ready to explore the high-energy frontier of elementary particle physics at this facility. We briefly explain why this new facility is a once-in-a-lifetime opportunity for the present generation of particle physicists. Using the example of the CMS experiment, we explain the computing challenge of these large experiments, and how this challenge is addressed. In this context, we explain why Grid computing is an essential enabling technology to fully exploit the science potential of these experiments. We use data transfer and data analysis as examples to give a flavor of the present state of readiness of the globally distributed computing system for data taking, and what we had to do to get there.
Pilot infrastructures are becoming prominent players in the Grid environment. One of their major advantages is the reduced effort required from the user communities (also known as Virtual Organizations, or VOs), due to the outsourcing of the Grid interfacing services, i.e. the pilot factory, to Grid experts. One such pilot factory, based on the glideinWMS pilot infrastructure, is being operated by the Open Science Grid at the University of California San Diego (UCSD). This pilot factory serves multiple VOs from several scientific domains. Currently the three major clients are the analysis operations of the HEP experiment CMS; the community VO HCC, which serves mostly math, biology, and computer science users; and the structural biology VO NEBioGrid. The UCSD glidein factory allows the served VOs to use Grid resources distributed over 150 sites in North and South America, Europe, and Asia. This paper presents the steps taken to create a production-quality pilot factory, together with the challenges encountered along the road.
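The sketch below illustrates the pressure-driven submission cycle at the heart of a pilot factory: estimate per-entry demand for each served VO and submit a proportional number of glideins. The demand source, entry names, and scaling heuristic are placeholders; glideinWMS itself exchanges this information as ClassAds between the VO frontends and the factory.

```python
"""Conceptual sketch of a pilot factory's submission cycle.  Everything here
(demand figures, entry names, the one-pilot-per-few-idle-jobs heuristic) is a
placeholder, not glideinWMS internals."""

MAX_PILOTS_PER_CYCLE = 100


def idle_jobs_per_entry(vo):
    """Placeholder: in glideinWMS the frontend derives demand from the VO's
    schedd queues and advertises it to the factory as request ClassAds."""
    return {"UCSD-T2": 250, "Nebraska-HCC": 40}


def submit_pilots(vo, entry, count):
    """Placeholder for a condor_submit of 'count' glideins to a grid entry."""
    print("VO %s: submitting %d pilots to %s" % (vo, count, entry))


def factory_cycle(vos):
    for vo in vos:
        for entry, idle in idle_jobs_per_entry(vo).items():
            # Roughly one pilot per few idle jobs, capped per cycle.
            submit_pilots(vo, entry, min(idle // 4 + 1, MAX_PILOTS_PER_CYCLE))


if __name__ == "__main__":
    factory_cycle(["CMS", "HCC", "NEBioGrid"])
```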
The Open Science Grid relies on several network-facing services to deliver resources to its users. The major services are the Compute Elements, Storage Elements, Workload Management Systems, and Information Systems. All of these services are exposed to traffic coming from all over the world in an unmanaged way, so it is very important to know how they behave under different levels of load. In this paper we present the methodology and the results of scalability and reliability tests performed by OSG on some of the above services. The major services tested are the Condor batch system, the GT2, GRAM5, and CREAM CEs, and the BeStMan SRM SE.
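A simplified load-ramp harness in the spirit of these tests could look like the following; submit_probe() is a placeholder for the real client of the service under test (a GRAM, CREAM, or SRM command), and the endpoint name is hypothetical.

```python
"""Sketch of a load-ramp test: issue probe requests at increasing concurrency
and record wall time and failure counts.  submit_probe() is a stand-in for
the actual submission or transfer client."""

import subprocess
import time
from concurrent.futures import ThreadPoolExecutor


def submit_probe(endpoint):
    """Placeholder probe: replace with the real client call for the service
    under test; /bin/true keeps the sketch self-contained and always succeeds."""
    subprocess.check_call(["/bin/true", endpoint])


def ramp_test(endpoint, levels=(10, 50, 100, 200)):
    for concurrency in levels:
        start, failures = time.time(), 0
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = [pool.submit(submit_probe, endpoint) for _ in range(concurrency)]
            for f in futures:
                if f.exception() is not None:
                    failures += 1
        print("load=%3d  wall=%6.1fs  failures=%d"
              % (concurrency, time.time() - start, failures))


if __name__ == "__main__":
    ramp_test("ce.example.edu")
```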
glideinWMS experience with glexec. Sfiligoi, I.; Bradley, D. C.; Miller, Z.; et al. Journal of Physics: Conference Series, Vol. 396, No. 3, 2012.
Multi-user pilot infrastructures provide significant advantages for the communities using them, but also create new security challenges. With Grid authorization and mapping happening with the pilot credential only, the final user identity is not properly addressed in the classic Grid paradigm. In order to solve this problem, OSG and EGI have deployed glexec, a privileged executable on the worker nodes that allows for final user authorization and mapping from inside the pilot itself. The glideinWMS instances deployed on OSG have now been using glexec on OSG sites for several years, and have started using it on EGI resources in the past year. The user experience of using glexec has been mostly positive, although there are still some edge cases where things could be improved. This paper provides both usage statistics and a description of the remaining problems and their expected solutions.
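For illustration, a pilot-side hand-off to glexec could look roughly like the sketch below; the glexec install path and the GLEXEC_CLIENT_CERT convention are assumptions here and depend on the site's deployment.

```python
"""Sketch of a pilot running a user payload through glexec so that the
payload executes under the mapped end-user identity rather than the pilot's.
The glexec path and environment contract are assumptions for illustration."""

import os
import subprocess

GLEXEC = "/usr/sbin/glexec"   # assumed install location on the worker node


def run_as_user(user_proxy, payload_cmd):
    """Point glexec at the end user's delegated proxy and run the payload."""
    env = dict(os.environ)
    # The pilot presents the user's proxy; glexec authorizes the user and
    # switches to the corresponding local account before running the payload.
    env["GLEXEC_CLIENT_CERT"] = user_proxy
    return subprocess.call([GLEXEC] + payload_cmd, env=env)


if __name__ == "__main__":
    rc = run_as_user("/tmp/x509up_user123", ["/bin/sh", "-c", "id && ./user_job.sh"])
    print("payload exit code:", rc)
```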
The Worldwide LHC Computing Grid (WLCG) is the largest grid computing infrastructure in the world, pooling the resources of 170 computing centers (sites). One advantage of grid computing is that multiple copies of data can be distributed across different sites, allowing user access that is independent of geographic location or software. Each site communicates using software stacks collectively referred to as "middleware". One key middleware piece is the storage element (SE), which provides remote POSIX-like access to a site's storage. The middleware stack managed by the Open Science Grid (OSG) used a storage resource manager (SRM) protocol implementation that, among other things, allowed sites to load-balance the servers providing the Grid File Transfer Protocol (GridFTP) interface. OSG is eliminating the use of an SRM entirely and is transitioning to a solution based solely on GridFTP, load-balanced at the network level with Linux Virtual Server (LVS). LVS is a core component of the Linux kernel, so this change both improves maintainability and reduces the complexity of the site. In this document, we outline our methodologies and results from the large-scale testing of an LVS+GridFTP cluster for data reads. Additionally, we discuss potential optimizations to the cluster to maximize total throughput.
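A read-scaling driver of the kind described here can be sketched as follows; the virtual service endpoint, test file, and file size are placeholders, and globus-url-copy stands in for whatever GridFTP client a site prefers.

```python
"""Sketch of a data-read scaling test against an LVS-balanced GridFTP cluster:
launch N concurrent reads against the virtual IP and report aggregate
throughput.  Endpoint, test file, and size are placeholders."""

import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

VIP = "gsiftp://gridftp.example.edu:2811"   # assumed LVS virtual service
TEST_FILE = "/store/test/1GB.bin"           # assumed test object
FILE_GB = 1.0                               # size of the test object in GB


def one_read(_):
    # Discard the payload locally; we only care about exercising the read path.
    subprocess.check_call(["globus-url-copy", VIP + TEST_FILE, "file:///dev/null"])


def scale_test(n_streams):
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        list(pool.map(one_read, range(n_streams)))
    elapsed = time.time() - start
    print("%2d streams: %.2f Gbit/s aggregate"
          % (n_streams, n_streams * FILE_GB * 8 / elapsed))


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        scale_test(n)
```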