Reproducibility is widely considered to be an essential requirement of the scientific process. However, a number of serious concerns have been raised recently, questioning whether today's computational work is adequately reproducible. In principle, it should be possible to specify a computation in sufficient detail that anyone can reproduce it exactly. But in practice, there are fundamental, technical, and social barriers to doing so. The many objectives and meanings of reproducibility are discussed within the context of scientific computing. Technical barriers to reproducibility are described, extant approaches are surveyed, and open areas of research are identified.
Modern science often requires the execution of large-scale, multi-stage simulation and data analysis pipelines to enable the study of complex systems. The amount of computation and data involved in these pipelines requires scalable workflow management systems that are able to reliably and efficiently coordinate and automate data movement and task execution on distributed computational resources: campus clusters, national cyberinfrastructures, and commercial and academic clouds. This paper describes the design, development, and evolution of the Pegasus Workflow Management System, which maps abstract workflow descriptions onto distributed computing infrastructures. Pegasus has been used for more than twelve years by scientists in a wide variety of domains, including astronomy, seismology, bioinformatics, physics, and others. This paper provides an integrated view of the Pegasus system, showing its capabilities that have been developed over time in response to application needs and to the evolution of the scientific computing platforms. The paper describes how Pegasus achieves reliable, scalable workflow execution across a wide variety of computing infrastructures.
• Comprehensive description of the Pegasus Workflow Management System.
• Detailed explanation of Pegasus workflow transformations.
• Data management in Pegasus.
• Earthquake science application example.
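To give a sense of what an abstract workflow description looks like on the user side, the sketch below builds a small two-step DAG of jobs and files. It assumes the Pegasus 5.x Python API (`Pegasus.api`); the file names, job arguments, and transformation names are illustrative, and the exact method names should be verified against the installed release.

```python
# A minimal sketch of an abstract (resource-independent) workflow,
# assuming the Pegasus 5.x Python API; names here are illustrative.
from Pegasus.api import Workflow, Job, File

raw = File("raw.dat")          # input, expected in the replica catalog
cleaned = File("cleaned.dat")  # intermediate product
result = File("result.dat")    # final output to be staged out

wf = Workflow("example-pipeline")

preprocess = (Job("preprocess")
              .add_args("-i", raw, "-o", cleaned)
              .add_inputs(raw)
              .add_outputs(cleaned))

analyze = (Job("analyze")
           .add_args("-i", cleaned, "-o", result)
           .add_inputs(cleaned)
           .add_outputs(result))

# Dependencies are inferred from file usage; Pegasus later maps this
# abstract DAG onto concrete execution sites and storage.
wf.add_jobs(preprocess, analyze)
wf.write("workflow.yml")
```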
Scientific workflows consisting of a high number of interdependent tasks represent an important class of complex scientific applications. Recently, a new type of serverless infrastructure has emerged, represented by services such as Google Cloud Functions and AWS Lambda and also referred to as the Function-as-a-Service model. In this paper we take a look at such serverless infrastructures, which are designed mainly for processing background tasks of Web and Internet of Things applications, or for event-driven stream processing. We evaluate their applicability to more compute- and data-intensive scientific workflows and discuss possible ways to repurpose serverless architectures for the execution of scientific workflows. We have developed prototype workflow executor functions using AWS Lambda and Google Cloud Functions, coupled with the HyperFlow workflow engine. These functions can run workflow tasks in the AWS and Google infrastructures, and feature capabilities such as data staging to/from S3 or Google Cloud Storage and execution of custom application binaries. We have successfully deployed and executed the Montage astronomy workflow, often used as a benchmark, and we report initial results of its performance evaluation. Our findings indicate that the simple mode of operation makes this approach easy to use, although there are costs involved in preparing portable application binaries for execution in a remote environment.
While our solution is an early prototype, we find the presented approach highly promising. We also discuss possible future steps related to execution of scientific workflows in serverless infrastructures. Finally, we perform a cost analysis and discuss implications with regard to resource management for scientific applications in general.
• We evaluate the applicability of cloud functions and serverless architectures for the execution of scientific workflows.
• We developed a prototype which combines the HyperFlow engine with AWS Lambda and Google Cloud Functions.
• Experiments with the Montage application show the good scalability and low overheads of the serverless approach.
• We discuss the implications of serverless architectures for resource management.
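To make the executor-function idea concrete, here is a minimal sketch of such a function for AWS Lambda in Python with boto3: stage inputs from S3, run a portable application binary, and stage outputs back. The event schema (`bucket`, `executable`, `inputs`, `outputs`, `args`) is an assumption for illustration, not the authors' actual interface.

```python
# Hedged sketch of a serverless workflow-executor function in the
# spirit of the prototype described above. The event layout is an
# illustrative assumption, not the authors' schema.
import os
import subprocess
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    bucket = event["bucket"]

    # Stage the portable application binary into Lambda's writable
    # /tmp scratch space and make it executable.
    exe = f"/tmp/{os.path.basename(event['executable'])}"
    s3.download_file(bucket, event["executable"], exe)
    os.chmod(exe, 0o755)

    # Stage input files from object storage.
    for key in event["inputs"]:
        s3.download_file(bucket, key, f"/tmp/{os.path.basename(key)}")

    # Run the task; the binary must be self-contained, since the
    # function environment ships no scientific software.
    subprocess.run([exe] + event.get("args", []), cwd="/tmp", check=True)

    # Stage outputs back so downstream workflow tasks can consume them.
    for key in event["outputs"]:
        s3.upload_file(f"/tmp/{os.path.basename(key)}", bucket, key)
    return {"status": "ok"}
```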
Researchers working on the planning, scheduling, and execution of scientific workflows need access to a wide variety of scientific workflows to evaluate the performance of their implementations. This paper provides a characterization of workflows from six diverse scientific applications, including astronomy, bioinformatics, earthquake science, and gravitational-wave physics. The characterization is based on novel workflow profiling tools that provide detailed information about the various computational tasks present in the workflow, including their I/O, memory, and computational characteristics. Although the workflows are diverse, there is evidence that each workflow has a single job type that accounts for most of its runtime. The study also uncovered an inefficiency in a workflow component implementation, where the component was re-reading the same data multiple times.
► The workflows of six diverse scientific applications are characterized.
► The characterization includes workflow structure as well as I/O, memory, and CPU usage.
► We describe new techniques that were developed to profile scientific workflows.
► The information provided can be used to create realistic synthetic workflows for use in simulation studies of workflow systems.
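A sketch of the kind of per-task measurement such profiling tools perform is shown below: sample a child process for CPU time, peak resident memory, and I/O volume. It uses the third-party psutil package (`io_counters()` is platform-specific, e.g. Linux) and a fixed sampling interval, so very short tasks may be undercounted; this is an illustration of the idea, not the authors' tooling.

```python
# Hedged sketch of per-task workflow profiling: sample CPU, memory,
# and I/O of a child process via psutil (third-party package).
import sys
import time
import psutil

def profile_task(cmd):
    proc = psutil.Popen(cmd)
    peak_rss, cpu_seconds, read_b, write_b = 0, 0.0, 0, 0
    while proc.poll() is None:
        try:
            peak_rss = max(peak_rss, proc.memory_info().rss)
            t = proc.cpu_times()
            cpu_seconds = t.user + t.system
            io = proc.io_counters()          # Linux-specific counters
            read_b, write_b = io.read_bytes, io.write_bytes
        except psutil.NoSuchProcess:
            break  # process exited between poll() and the sample
        time.sleep(0.1)  # sampling interval trades accuracy for overhead
    return {
        "cpu_seconds": cpu_seconds,
        "peak_rss_bytes": peak_rss,
        "read_bytes": read_b,
        "write_bytes": write_b,
    }

if __name__ == "__main__":
    print(profile_task(sys.argv[1:]))  # e.g. profile_task(["gzip", "big.dat"])
```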
Cloud platforms allow users to execute tasks directly from their web browser and are a key enabling technology not only for commerce but also for computational science. Research software is often developed by scientists with limited experience in (and time for) user interface design, which can make it difficult for novices to install and use. When combined with the increasing complexity of scientific workflows (involving many steps and software packages), setting up a computational research environment becomes a major entry barrier. AiiDAlab is a web platform that enables computational scientists to package scientific workflows and computational environments and share them with their collaborators and peers. By leveraging the AiiDA workflow manager and its plugin ecosystem, developers gain access to a growing range of simulation codes through a Python API, coupled with automatic provenance tracking of simulations for full reproducibility. Computational workflows can be bundled together with user-friendly graphical interfaces and made available through the AiiDAlab app store. Being fully compatible with open-science principles, AiiDAlab provides a complete infrastructure for automated workflows and provenance tracking, where incorporating new capabilities becomes intuitive, requiring only Python knowledge.
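As a small illustration of the provenance tracking the abstract refers to, the sketch below uses AiiDA's `@calcfunction` decorator, which records each call, together with its input and output data nodes, in the provenance graph. It assumes an already-configured AiiDA profile and database; the function itself is a made-up example.

```python
# Hedged sketch of AiiDA-style provenance tracking via its Python API.
# Requires a configured AiiDA profile; the rescale() function is a
# hypothetical example, not part of AiiDA or AiiDAlab.
from aiida import load_profile, orm
from aiida.engine import calcfunction

load_profile()  # connect to the configured AiiDA database

@calcfunction
def rescale(volume: orm.Float, factor: orm.Float) -> orm.Float:
    # Each call creates a calculation node linked to its input and
    # output data nodes, making the result fully traceable.
    return orm.Float(volume.value * factor.value)

result = rescale(orm.Float(100.0), orm.Float(1.05))
print(result.value, result.pk)  # pk identifies the node in the graph
```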
Automation of the execution of computational tasks is at the heart of improving scientific productivity. Over recent years, scientific workflows have been established as an important abstraction that captures data processing and computation of large and complex scientific applications. By allowing scientists to model and express entire data processing steps and their dependencies, workflow management systems relieve scientists from the details of an application and manage its execution on a computational infrastructure. As the resource requirements of today's computational and data science applications that process vast amounts of data keep increasing, there is a compelling case for a new generation of advances in high-performance computing, commonly termed extreme-scale computing, which will bring forth multiple challenges for the design of workflow applications and management systems. This paper presents a novel characterization of workflow management systems using features commonly associated with extreme-scale computing applications. We classify 15 popular workflow management systems in terms of workflow execution models, heterogeneous computing environments, and data access methods. The paper also surveys workflow applications and identifies gaps for future research on the road to extreme-scale workflows and management systems.
• Design requirements of workflow applications and systems at the extreme scale.
• Survey and classification of 15 popular workflow engines.
• Research gaps between existing WMSs and the desired extreme-scale WMS.
Large-scale applications expressed as scientific workflows are often grouped into ensembles of inter-related workflows. In this paper, we address a new and important problem concerning the efficient management of such ensembles under budget and deadline constraints on Infrastructure as a Service (IaaS) clouds. IaaS clouds are characterized by on-demand resource provisioning capabilities and a pay-per-use model. We discuss, develop, and assess novel algorithms based on static and dynamic strategies for both task scheduling and resource provisioning. We perform the evaluation via simulation using a set of scientific workflow ensembles with a broad range of budget and deadline parameters, taking into account task granularity, uncertainties in task runtime estimations, provisioning delays, and failures. We find that the key factor determining the performance of an algorithm is its ability to decide which workflows in an ensemble to admit or reject for execution. Our results show that an admission procedure based on workflow structure and estimates of task runtimes can significantly improve the quality of solutions.
• We define the problem of scheduling prioritized workflow ensembles on IaaS clouds.
• We analyze and develop several dynamic (online) and static (offline) algorithms.
• We address both problems of task scheduling and resource provisioning.
• Our simulator models uncertainties in estimates, provisioning delays, and failures.
• We use synthetic workflow ensembles based on important, real scientific applications.
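A toy version of such an admission test, in the spirit of the structure-and-runtime-based procedure the abstract describes, is sketched below: admit a workflow only if its critical path fits within the remaining deadline and its estimated cost fits within the remaining budget. The DAG encoding, runtimes, and price are illustrative assumptions, not the paper's actual algorithms.

```python
# Hedged sketch of a budget/deadline admission test for a workflow.
# Encoding and numbers are illustrative, not the paper's algorithms.
def critical_path(runtimes, deps):
    """Longest path (in estimated runtime) through a DAG.
    runtimes: {task: seconds}; deps: {task: [predecessor tasks]}."""
    memo = {}
    def finish(t):
        if t not in memo:
            memo[t] = runtimes[t] + max(
                (finish(p) for p in deps.get(t, [])), default=0.0)
        return memo[t]
    return max(finish(t) for t in runtimes)

def admit(runtimes, deps, budget_left, deadline_left, price_per_sec):
    # Total serial work as a simple proxy for monetary cost.
    est_cost = sum(runtimes.values()) * price_per_sec
    return (critical_path(runtimes, deps) <= deadline_left
            and est_cost <= budget_left)

# Example: task "a" feeds two parallel tasks "b" and "c".
rt = {"a": 60, "b": 120, "c": 30}
dg = {"b": ["a"], "c": ["a"]}
print(admit(rt, dg, budget_left=1.0, deadline_left=200,
            price_per_sec=0.004))  # -> True (path 180s, cost $0.84)
```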
The concept of “extreme data” is a recent reincarnation of the “big data” problem, distinguished by the massive amounts of information that must be analyzed under strict time requirements. In the past decade, Cloud data centers have been envisioned as the essential computing architectures for enabling extreme data workflows. However, Cloud data centers are often geographically distributed. Such geographical distribution increases offloading latency, making them unsuitable for processing workflows with strict latency requirements, as the data transfer times can be very high. Fog computing has emerged as a promising solution to this issue, as it allows partial workflow processing in lower network layers. Performing data processing in the Fog significantly reduces data transfer latency, making it possible to meet the workflows’ strict latency requirements. However, the Fog layer is highly heterogeneous and loosely connected, which affects the reliability and response time of task offloading. In this work, we investigate the potential of the Fog for scheduling extreme data workflows with strict response time requirements. Moreover, we propose a novel Pareto-based approach for task offloading in the Fog, called Multi-objective Workflow Offloading (MOWO). MOWO considers three optimization objectives, namely response time, reliability, and financial cost. We evaluate the MOWO workflow scheduler on a set of real-world biomedical, meteorological, and astronomy workflows representing examples of extreme data applications with strict latency requirements.
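The Pareto-dominance filtering underlying such a multi-objective decision can be sketched in a few lines: minimize response time and cost, maximize reliability, and keep only non-dominated offloading options. The candidate tuples below are illustrative; MOWO's actual model is richer.

```python
# Hedged sketch of Pareto filtering over offloading options with
# three objectives: minimize response time and cost, maximize
# reliability. Candidate values are illustrative.
def dominates(a, b):
    """a, b = (response_time, cost, reliability)."""
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly_better

def pareto_front(candidates):
    # Keep every option that no other option dominates.
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Offloading options for one task: (seconds, dollars, reliability).
options = [(2.0, 0.10, 0.99),   # cloud: higher latency, cheap, reliable
           (0.5, 0.30, 0.90),   # fog: fast, pricier, less reliable
           (2.5, 0.35, 0.85)]   # dominated by both of the above
print(pareto_front(options))    # -> the first two options
```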
With the development of new experimental technologies, biologists are faced with an avalanche of data to be computationally analyzed for scientific advancements and discoveries to emerge. Given the complexity of analysis pipelines, the large number of computational tools, and the enormous amount of data to manage, there is compelling evidence that many, if not most, scientific discoveries will not stand the test of time: increasing the reproducibility of computed results is of paramount importance.
The objective of this paper is to place scientific workflows in the context of reproducibility. To do so, we define several kinds of reproducibility that can be achieved when scientific workflows are used to perform experiments. We characterize and define the criteria that reproducibility-friendly scientific workflow systems need to meet, and use these criteria to place several representative and widely used workflow systems and companion tools within this framework. We also discuss the remaining challenges posed by reproducible scientific workflows in the life sciences. Our study was guided by three use cases from the life science domain involving in silico experiments.
• Use cases from the Life Sciences highlighting reproducibility and reuse needs.
• Terminology to describe reproducibility levels in scientific workflows.
• Criteria defining reproducibility-friendly workflow systems, and an evaluation of systems against them.
• Challenges and opportunities in scientific workflow reproducibility.