Workflows have emerged as a paradigm for representing and managing complex distributed computations and are used to accelerate the pace of scientific progress. A recent National Science Foundation workshop brought together domain, computer, and social scientists to discuss the requirements of future scientific applications and the challenges they present to current workflow technologies.
Scientists using the high-throughput computing (HTC) paradigm for scientific discovery rely on complex software systems and heterogeneous architectures that must deliver robust science (i.e., ensuring performance scalability in space and time; trust in technology, people, and infrastructures; and reproducible or confirmable research). Developers must overcome a variety of obstacles to pursue workflow interoperability, identify tools and libraries for robust science, port codes across different architectures, and establish trust in non-deterministic results. This poster presents recommendations to build a roadmap to overcome these challenges and enable robust science for HTC applications and workflows. The findings were collected from an international community of software developers during a Virtual World Cafe in May 2021.
Today's scientific applications have huge data requirements, which continue to increase drastically every year. These data are generally accessed by many users from all across the globe. This implies a major necessity to move huge amounts of data around wide area networks to complete the computation cycle, which brings with it the problem of efficient and reliable data placement. The current approach to this problem is either to handle data placement manually or to employ simple scripts, neither of which has any automation or fault tolerance capabilities. Our goal is to make data placement activities first-class citizens in the Grid, just like computational jobs. They will be queued, scheduled, monitored, managed, and even checkpointed; more importantly, the system will ensure that they complete successfully and without any human interaction. We also believe that data placement jobs should be treated differently from computational jobs, since they may have different semantics and different characteristics. For this purpose, we have developed Stork, a scheduler for data placement activities in the Grid.
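To make the idea concrete, the sketch below shows what treating a transfer as a schedulable, fault-tolerant job (rather than an ad hoc script) might look like. This is an illustrative Python sketch, not Stork's actual implementation; all names in it (DataPlacementJob, transfer, run_with_retries) are assumptions.

```python
# A minimal sketch of a data placement job that is queued, retried on
# failure, and completes without human interaction. Illustrative only.
import time
from dataclasses import dataclass

@dataclass
class DataPlacementJob:
    src_url: str            # e.g. "gsiftp://remote.host/data/input.dat"
    dest_url: str           # e.g. "file:///scratch/input.dat"
    max_retries: int = 5    # fault tolerance: retry transient failures
    retry_delay_s: float = 30.0

def transfer(job: DataPlacementJob) -> None:
    """Placeholder for a protocol-specific transfer (GridFTP, HTTP, ...)."""
    raise NotImplementedError

def run_with_retries(job: DataPlacementJob) -> bool:
    """Run a placement job to completion, like a supervised batch job."""
    for attempt in range(1, job.max_retries + 1):
        try:
            transfer(job)
            return True                       # completed successfully
        except Exception as exc:              # log and retry automatically
            print(f"attempt {attempt} failed: {exc}")
            time.sleep(job.retry_delay_s)
    return False                              # escalate after all retries
```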
Conventional resource management systems use a system model to describe resources and a centralized scheduler to control their allocation. We argue that this paradigm does not adapt well to distributed systems, particularly those built to support high-throughput computing. Obstacles include heterogeneity of resources, which makes uniform allocation algorithms difficult to formulate, and distributed ownership, which leads to widely varying allocation policies. Faced with these problems, we developed and implemented the classified advertisement (classad) matchmaking framework, a flexible and general approach to resource management in distributed environments with decentralized ownership of resources. Novel aspects of the framework include a semi-structured data model that combines schema, data, and query in a simple but powerful specification language, and a clean separation of the matching and claiming phases of resource allocation. The representation and protocols result in a robust, scalable, and flexible framework that can evolve with changing resources. The framework was designed to solve real problems encountered in the deployment of Condor, a high-throughput computing system developed at the University of Wisconsin-Madison. Condor is heavily used by scientists at numerous sites around the world. It derives much of its robustness and efficiency from the matchmaking architecture.
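The two-sided nature of matchmaking is easy to illustrate. The toy Python sketch below (not the real ClassAd language) models each advertisement as attributes plus a "requirements" predicate over the other party's ad; a match succeeds only if both sides accept, and claiming is deliberately left as a separate, later step.

```python
# Toy two-sided matchmaking: each ad constrains the other, and matching
# is decoupled from claiming. All attribute names are illustrative.
job_ad = {
    "Owner": "alice",
    "ImageSize": 2_000,   # MB
    "requirements": lambda m: m["Arch"] == "X86_64" and m["Memory"] >= 2_000,
    "rank": lambda m: m["Mips"],   # preference: faster machines first
}
machine_ad = {
    "Arch": "X86_64",
    "Memory": 4_000,
    "Mips": 1_500,
    "requirements": lambda j: j["ImageSize"] <= 4_000,  # owner's policy
}

def symmetric_match(a, b) -> bool:
    """Both ads must accept each other, mirroring two-sided matchmaking."""
    return a["requirements"](b) and b["requirements"](a)

if symmetric_match(job_ad, machine_ad):
    # Matching only pairs the ads; the agent must still contact the
    # resource and *claim* it, which can fail and trigger a rematch.
    print("match found, rank =", job_ad["rank"](machine_ad))
```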
Condor-G: a computation management agent for multi-institutional grids. Frey, J.; Tannenbaum, T.; Livny, M. In: Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, 7-9 August 2001.
In recent years, there has been a dramatic increase in the amount of available computing and storage resources, yet few have been able to exploit these resources in an aggregated form. We present the Condor-G system, which leverages software from Globus and Condor to allow users to harness multi-domain resources as if they all belong to one personal domain. We describe the structure of Condor-G and how it handles job management, resource selection, security and fault tolerance.
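As a hedged illustration of the concept, the snippet below submits a grid-universe job through today's HTCondor Python bindings, which postdate the original Condor-G paper; the gatekeeper hostname and file names are made up for the example.

```python
# Sketch: delegating a job to a remote Globus-managed resource while the
# local agent persists and supervises it. Hostname and files are assumed.
import htcondor

submit = htcondor.Submit({
    "universe": "grid",
    # GRAM gatekeeper of the remote site (hypothetical host):
    "grid_resource": "gt2 gatekeeper.example.edu/jobmanager-fork",
    "executable": "analyze.sh",
    "output": "analyze.out",
    "error": "analyze.err",
    "log": "analyze.log",     # local event log used for fault recovery
})

schedd = htcondor.Schedd()        # local agent that manages the job
result = schedd.submit(submit)    # job is queued and supervised locally
print("submitted cluster", result.cluster())
```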
The BioMagResBank (BMRB: www.bmrb.wisc.edu) is a repository for experimental and derived data gathered from nuclear magnetic resonance (NMR) spectroscopic studies of biological molecules. BMRB is a partner in the Worldwide Protein Data Bank (wwPDB). The BMRB archive consists of four main data depositories: (i) quantitative NMR spectral parameters for proteins, peptides, nucleic acids, carbohydrates and ligands or cofactors (assigned chemical shifts, coupling constants and peak lists) and derived data (relaxation parameters, residual dipolar couplings, hydrogen exchange rates, pKa values, etc.), (ii) databases for NMR restraints processed from original author depositions available from the Protein Data Bank, (iii) time-domain (raw) spectral data from NMR experiments used to assign spectral resonances and determine the structures of biological macromolecules and (iv) a database of one- and two-dimensional 1H and 13C NMR spectra for over 250 metabolites. The BMRB website provides free access to all of these data. BMRB has tools for querying the archive and retrieving information and an ftp site (ftp.bmrb.wisc.edu) where data in the archive can be downloaded in bulk. Two BMRB mirror sites exist: one at PDBj, Protein Research Institute, Osaka University, Osaka, Japan (bmrb.protein.osaka-u.ac.jp) and the other at CERM, University of Florence, Florence, Italy (bmrb.postgenomicnmr.net/). The site at Osaka also accepts and processes data depositions.
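For bulk retrieval, the abstract points to the ftp site (ftp.bmrb.wisc.edu). The short Python sketch below shows one way such a download could be scripted with the standard library; the directory and file names are assumptions for illustration, so consult the site for the actual layout.

```python
# Sketch of anonymous bulk retrieval from ftp.bmrb.wisc.edu (named in the
# abstract). Paths and file names below are hypothetical placeholders.
from ftplib import FTP

with FTP("ftp.bmrb.wisc.edu") as ftp:
    ftp.login()                      # anonymous access
    ftp.cwd("/pub/bmrb")             # hypothetical top-level data directory
    print(ftp.nlst()[:10])           # list a few entries to explore
    # Download one (hypothetical) entry file in binary mode:
    with open("bmr15000_3.str", "wb") as fh:
        ftp.retrbinary("RETR entry_directories/bmr15000/bmr15000_3.str",
                       fh.write)
```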
Condor is being used extensively in the HEP environment. It is the batch system of choice for many compute farms, including several WLCG Tier 1s, Tier 2s, and Tier 3s. It is also the building block of one of the Grid pilot infrastructures, namely glideinWMS. As with any software, Condor does not scale indefinitely with the number of users and/or the number of resources being handled. In this paper we present the currently observed scalability limits of both the latest production and the latest development release of Condor, and compare them with the limits reported in previous publications. We also describe the changes that were introduced to remove the previous scalability limits.
Distributed computing, and in particular Grid computing, enables physicists to use thousands of CPU days' worth of computing every day by submitting thousands of compute jobs. Unfortunately, a small fraction of such jobs regularly fail; the reasons vary from disk and network problems to bugs in the user code. A subset of these failures result in jobs being stuck for long periods of time. In order to debug such failures, interactive monitoring is highly desirable; users need to browse through the job log files and check the status of the running processes. Batch systems typically don't provide such services; at best, users get job logs at job termination, and even this may not be possible if the job is stuck in an infinite loop. In this paper we present a novel approach that uses the regular batch system capabilities of Condor to give users access to the logs and processes of any running job. This does not provide true interactive access, so commands like vi are not viable, but it does allow operations like ls, cat, top, ps, lsof, netstat, and dumping the stack of any process owned by the user; we call this pseudo-interactive monitoring. It is worth noting that the same method can be used to monitor Grid jobs in a glidein-based environment. We further believe that the same mechanism could be applied to many other batch systems.
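To illustrate the idea (not the paper's exact mechanism), the Python sketch below is the kind of one-shot diagnostic payload that could be dispatched through the batch system to run in the stuck job's environment, returning command output through the probe's stdout; the user name and log file name are assumptions.

```python
# Sketch of a pseudo-interactive probe: run non-interactive commands in
# the job's environment and ship their output back via stdout.
import subprocess

PROBES = [
    ["ls", "-l", "."],           # inspect the job's sandbox
    ["ps", "-u", "alice"],       # processes owned by the user (name assumed)
    ["cat", "job.log"],          # a job log file (name assumed)
]

for cmd in PROBES:
    print(f"$ {' '.join(cmd)}")
    # One-shot, non-interactive commands only; tools like vi need a tty.
    proc = subprocess.run(cmd, capture_output=True, text=True)
    print(proc.stdout or proc.stderr)
```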
We describe the RD-OPT algorithm for DCT quantization optimization, which can be used as an efficient tool for near-optimal rate control in DCT-based compression techniques such as JPEG and MPEG. RD-OPT measures DCT coefficient statistics for the given image data to construct rate/distortion-specific quantization tables with nearly optimal tradeoffs.
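The core rate/distortion idea can be sketched as follows: for each DCT frequency, coefficient statistics yield an estimated distortion (squared quantization error) and rate (entropy of the quantized symbols) for each candidate step size, from which a good step can be chosen. The published algorithm uses dynamic programming to meet a target rate; the per-band Lagrangian selection below, with an illustrative lambda, is a simplified stand-in.

```python
# Simplified sketch of R/D-based quantizer selection from DCT coefficient
# statistics. Not the full RD-OPT dynamic program; illustrative only.
import math
from collections import Counter

def cost(coeffs, q, lam):
    """Distortion + lam * rate for one frequency band at step size q."""
    quantized = [round(c / q) for c in coeffs]
    dist = sum((c - qv * q) ** 2 for c, qv in zip(coeffs, quantized))
    hist = Counter(quantized)
    n = len(coeffs)
    rate = -sum(k / n * math.log2(k / n) for k in hist.values())  # bits/coeff
    return dist / n + lam * rate

def pick_step(coeffs, lam=50.0, steps=range(1, 256)):
    """Choose the quantizer step with the best Lagrangian cost."""
    return min(steps, key=lambda q: cost(coeffs, q, lam))

# Toy usage: per-band coefficients would come from 8x8 DCTs of the image.
band = [12.0, -3.5, 40.2, 0.4, -18.9, 7.7]
print("chosen step:", pick_step(band))
```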