Researchers working on the planning, scheduling, and execution of scientific workflows need access to a wide variety of scientific workflows to evaluate the performance of their implementations. This paper provides a characterization of workflows from six diverse scientific applications, including astronomy, bioinformatics, earthquake science, and gravitational-wave physics. The characterization is based on novel workflow profiling tools that provide detailed information about the various computational tasks that are present in the workflow. This information includes I/O, memory and computational characteristics. Although the workflows are diverse, there is evidence that each workflow has a job type that consumes the majority of the runtime. The study also uncovered an inefficiency in the implementation of one workflow component, which was re-reading the same data multiple times.
► The workflows of six diverse scientific applications are characterized. ► The characterization includes workflow structure as well as I/O, memory and CPU usage. ► We describe new techniques that were developed to profile scientific workflows. ► The information provided can be used to create realistic synthetic workflows for use in simulation studies of workflow systems.
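The task-level characteristics mentioned above (runtime, CPU time, peak memory, and I/O volume) can be approximated with ordinary operating-system accounting. The sketch below is purely illustrative and is not the profiling tooling from the study: it wraps a single task invocation in Python and reports coarse figures from getrusage; the example command and the output field names are hypothetical.

```python
import resource
import subprocess
import time

def profile_task(cmd):
    """Run one workflow task and report coarse runtime, memory, and I/O figures.

    Illustrative only: a real workflow profiler traces individual system calls
    or instruments the application to attribute I/O to specific files.
    """
    start = time.time()
    subprocess.run(cmd, check=True)
    elapsed = time.time() - start

    # Note: RUSAGE_CHILDREN accumulates over all terminated child processes
    # of this interpreter, so call this once per fresh process for clean numbers.
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "runtime_s": elapsed,
        "cpu_time_s": usage.ru_utime + usage.ru_stime,
        "peak_rss_kb": usage.ru_maxrss,     # kilobytes on Linux
        "blocks_read": usage.ru_inblock,    # 512-byte blocks on Linux
        "blocks_written": usage.ru_oublock,
    }

if __name__ == "__main__":
    # Hypothetical task: compress a file, standing in for a workflow job.
    print(profile_task(["gzip", "-k", "input.dat"]))
```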
CyberShake, as part of the Southern California Earthquake Center’s (SCEC) Community Modeling Environment, is developing a methodology that explicitly incorporates deterministic source and wave propagation effects within seismic hazard calculations through the use of physics-based 3D ground motion simulations. To calculate a waveform-based seismic hazard estimate for a site of interest, we begin with Uniform California Earthquake Rupture Forecast, Version 2.0 (UCERF2.0) and identify all ruptures within 200 km of the site of interest. We convert the UCERF2.0 rupture definition into multiple rupture variations with differing hypocenter locations and slip distributions, resulting in about 415,000 rupture variations per site. Strain Green Tensors are calculated for the site of interest using the SCEC Community Velocity Model, Version 4 (CVM4), and then, using reciprocity, we calculate synthetic seismograms for each rupture variation. Peak intensity measures are then extracted from these synthetics and combined with the original rupture probabilities to produce probabilistic seismic hazard curves for the site. Being explicitly site-based, CyberShake directly samples the ground motion variability at that site over many earthquake cycles (i.e., rupture scenarios) and alleviates the need for the ergodic assumption that is implicitly included in traditional empirically based calculations. Thus far, we have simulated ruptures at over 200 sites in the Los Angeles region for ground shaking periods of 2 s and longer, providing the basis for the first generation CyberShake hazard maps. Our results indicate that the combination of rupture directivity and basin response effects can lead to an increase in the hazard level for some sites, relative to that given by a conventional Ground Motion Prediction Equation (GMPE). Additionally, and perhaps more importantly, we find that the physics-based hazard results are much more sensitive to the assumed magnitude-area relations and magnitude uncertainty estimates used in the definition of the ruptures than is found in the traditional GMPE approach. This reinforces the need for continued development of a better understanding of earthquake source characterization and the constitutive relations that govern the earthquake rupture process.
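The last computational step described above, combining peak intensity measures from the synthetic seismograms with the original rupture probabilities, amounts to building a probability-of-exceedance (hazard) curve. The following sketch shows that aggregation with made-up numbers, under the simplifying assumptions that rupture variations are equally likely within a rupture and that ruptures occur independently; it is not CyberShake code.

```python
import numpy as np

# Hypothetical inputs: for each rupture, a probability of occurrence in the
# forecast time window, and the peak intensity (e.g., spectral acceleration)
# produced by each of its rupture variations at the site.
rupture_probs = [0.02, 0.005, 0.01]
variation_peaks = [
    np.array([0.11, 0.25, 0.09, 0.30]),  # rupture 0
    np.array([0.40, 0.55]),              # rupture 1
    np.array([0.05, 0.07, 0.06]),        # rupture 2
]

intensity_levels = np.linspace(0.0, 0.6, 25)

def hazard_curve(probs, peaks_per_rupture, levels):
    """Probability that at least one rupture exceeds each intensity level."""
    curve = []
    for x in levels:
        p_no_exceed = 1.0
        for p_rup, peaks in zip(probs, peaks_per_rupture):
            frac_exceed = np.mean(peaks > x)       # fraction of variations exceeding x
            p_no_exceed *= 1.0 - p_rup * frac_exceed
        curve.append(1.0 - p_no_exceed)
    return np.array(curve)

print(hazard_curve(rupture_probs, variation_peaks, intensity_levels))
```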
Modern science often requires the execution of large-scale, multi-stage simulation and data analysis pipelines to enable the study of complex systems. The amount of computation and data involved in these pipelines requires scalable workflow management systems that are able to reliably and efficiently coordinate and automate data movement and task execution on distributed computational resources: campus clusters, national cyberinfrastructures, and commercial and academic clouds. This paper describes the design, development and evolution of the Pegasus Workflow Management System, which maps abstract workflow descriptions onto distributed computing infrastructures. Pegasus has been used for more than twelve years by scientists in a wide variety of domains, including astronomy, seismology, bioinformatics, physics and others. This paper provides an integrated view of the Pegasus system, showing its capabilities that have been developed over time in response to application needs and to the evolution of the scientific computing platforms. The paper describes how Pegasus achieves reliable, scalable workflow execution across a wide variety of computing infrastructures.
• Comprehensive description of the Pegasus Workflow Management System. • Detailed explanation of Pegasus workflow transformations. • Data management in Pegasus. • Earthquake science application example.
Since 2001, the Pegasus Workflow Management System has evolved into a robust and scalable system that automates the execution of a number of complex applications running on a variety of heterogeneous, distributed high-throughput, and high-performance computing environments. Pegasus was built on the principle of separation between the workflow description and workflow execution, providing the ability to port and adapt the workflow based on the target execution environment. Through its user-driven research and development, it has adapted to the needs of a number of scientific communities, utilizing and developing novel algorithms and software engineering solutions. This paper describes the evolution of Pegasus over time and provides motivations behind the design decisions. It concludes with selected lessons learned.
Genome-wide association studies (GWAS) have laid the foundation for investigations into the biology of complex traits, drug development and clinical guidelines. However, the majority of discovery efforts are based on data from populations of European ancestry. In light of the differential genetic architecture that is known to exist between populations, bias in representation can exacerbate existing disease and healthcare disparities. Critical variants may be missed if they have a low frequency or are completely absent in European populations, especially as the field shifts its attention towards rare variants, which are more likely to be population-specific. Additionally, effect sizes and risk prediction scores derived in one population may not accurately extrapolate to other populations. Here we demonstrate the value of diverse, multi-ethnic participants in large-scale genomic studies. The Population Architecture using Genomics and Epidemiology (PAGE) study conducted a GWAS of 26 clinical and behavioural phenotypes in 49,839 non-European individuals. Using strategies tailored for the analysis of multi-ethnic and admixed populations, we describe a framework for analysing diverse populations, identify 27 novel loci and 38 secondary signals at known loci, and replicate 1,444 GWAS catalogue associations across these traits. Our data show evidence of effect-size heterogeneity across ancestries for published GWAS associations, substantial benefits of fine-mapping using diverse cohorts, and insights into clinical implications. In the United States, where minority populations have a disproportionately higher burden of chronic conditions, the lack of representation of diverse populations in genetic research will result in inequitable access to precision medicine for those with the highest burden of disease. We strongly advocate for continued, large genome-wide efforts in diverse populations to maximize genetic discovery and reduce health disparities.
The Pegasus Workflow Management System maps abstract, resource-independent workflow descriptions onto distributed computing resources. As a result of this planning process, Pegasus workflows are portable across different infrastructures, optimizable for performance and efficiency, and automatically map to many different storage systems and data flows. This approach makes Pegasus a powerful solution for executing scientific workflows in the cloud.
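To make the notion of an abstract, resource-independent workflow description concrete, here is a minimal sketch written against the Pegasus 5.x Python API (Pegasus.api). The transformation names, file names, and arguments are placeholders, and the site, replica, and transformation catalogs are intentionally omitted because the Pegasus planner binds those details at mapping time.

```python
from Pegasus.api import Workflow, Job, File

# Abstract workflow: no hostnames, physical paths, or schedulers appear here.
wf = Workflow("diamond-sketch")

raw = File("raw.dat")          # assumed to be resolvable via a replica catalog
cleaned = File("cleaned.dat")
report = File("report.txt")

preprocess = (
    Job("preprocess")          # logical transformation name (placeholder)
    .add_args("-i", raw, "-o", cleaned)
    .add_inputs(raw)
    .add_outputs(cleaned)
)

analyze = (
    Job("analyze")
    .add_args("-i", cleaned, "-o", report)
    .add_inputs(cleaned)
    .add_outputs(report)
)

wf.add_jobs(preprocess, analyze)  # job dependencies are inferred from data flow
wf.write("workflow.yml")          # serialized description handed to the planner
```

Because the description names only logical files and transformations, the same workflow can in principle be planned for a campus cluster, a national HPC resource, or a cloud without changing the code above.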
In 2016, LIGO and Virgo announced the first observation of gravitational waves from a binary black hole merger, known as GW150914. To establish the confidence of this detection, large-scale scientific workflows were used to measure the event's statistical significance. These workflows used code written by the LIGO/Virgo collaboration and were executed on the LIGO Data Grid. The codes are publicly available, but there has not yet been an attempt to directly reproduce the results, although several analyses have replicated it, confirming the detection. We attempt to reproduce the result presented in the GW150914 discovery paper using publicly available code on the Open Science Grid. We show that we can reproduce the main result, but we cannot exactly reproduce the LIGO analysis because the original dataset used is not public. We discuss the challenges we encountered and make recommendations for scientists who wish to make their work reproducible.
With the increased prevalence of employing workflows for scientific computing and a push towards exascale computing, it has become paramount that we are able to analyze characteristics of scientific applications to better understand their impact on the underlying infrastructure and vice versa. Such analysis can help drive the design, development, and optimization of these next-generation systems and solutions. In this paper, we present an architecture, integrated with existing well-established and newly developed tools, that collects online performance statistics of workflow executions from various heterogeneous sources and publishes them in a distributed database (Elasticsearch). Using this architecture, we are able to correlate online workflow performance data with data from the underlying infrastructure and present them in a useful and intuitive way via an online dashboard. We have validated our approach by executing two classes of real-world workflows, both under normal and anomalous conditions. The first is an I/O-intensive genome analysis workflow; the second, a CPU- and memory-intensive material science workflow. Based on the data collected in Elasticsearch, we are able to demonstrate that we can correctly identify the anomalies that we injected. The resulting end-to-end collection of workflow performance data is an important source of training data for automated machine learning analysis.
• An architecture to collect end-to-end workflow statistics in near real time • The architecture is based on Pegasus WMS and well-established tools • A well-defined way to deploy the architecture is offered via Docker • Experimental validation done using two real-world applications • The experiments were conducted on ExoGENI and the Cori supercomputer (NERSC) • Open access data repository containing experimental traces
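As a rough illustration of the data flow described above, the sketch below publishes one per-job performance record as a document in an Elasticsearch index using the official Python client (assuming an 8.x elasticsearch-py and an unsecured local endpoint). The index name, field names, and values are hypothetical; in the actual architecture, such records are produced by Pegasus and infrastructure monitors rather than written by hand.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch  # official Elasticsearch Python client

# Hypothetical endpoint and index; a real deployment would secure both.
es = Elasticsearch("http://localhost:9200")

record = {
    "workflow_id": "genome-analysis-42",   # placeholder identifiers
    "job_id": "alignment_0007",
    "host": "worker-node-03",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "runtime_s": 184.2,
    "peak_rss_mb": 2310,
    "bytes_read": 5_368_709_120,
    "bytes_written": 734_003_200,
}

# Each job event becomes one document, so a dashboard can aggregate job
# metrics and correlate them with infrastructure metrics stored alongside.
es.index(index="workflow-performance", document=record)
```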
NASA’s Neutron Star Interior Composition Explorer (NICER) observed X-ray emission from the pulsar PSR J0030+0451 in 2018. Riley et al. reported Bayesian measurements of the star’s mass and radius using pulse-profile modeling of the X-ray data. This article reproduces their result, within the expected statistical errors, using the open-source X-ray Pulse Simulation and Inference software and publicly available data. We note the challenges we faced in reproducing the results, and we demonstrate that the analysis can be reproduced and reused in future work by changing the prior distribution for the radius and the sampler configuration. We find no significant change in the measurement of the mass and radius, demonstrating that the original result is robust to these changes. Finally, we provide a containerized working environment that facilitates third-party reproduction of the measurements of the mass and radius of PSR J0030+0451 using the NICER observations.