Researchers working on the planning, scheduling, and execution of scientific workflows need access to a wide variety of scientific workflows to evaluate the performance of their implementations. This ...paper provides a characterization of workflows from six diverse scientific applications, including astronomy, bioinformatics, earthquake science, and gravitational-wave physics. The characterization is based on novel workflow profiling tools that provide detailed information about the various computational tasks that are present in the workflow. This information includes I/O, memory and computational characteristics. Although the workflows are diverse, there is evidence that each workflow has a job type that consumes the most amount of runtime. The study also uncovered inefficiency in a workflow component implementation, where the component was re-reading the same data multiple times.
► The workflows of six diverse scientific applications are characterized. ► The characterization includes workflow structure as well as I/O, memory and CPU usage. ► We describe new techniques that were developed to profile scientific workflows. ► The information provided can be used to create realistic synthetic workflows for use in simulation studies of workflow systems.
CyberShake, as part of the Southern California Earthquake Center’s (SCEC) Community Modeling Environment, is developing a methodology that explicitly incorporates deterministic source and wave ...propagation effects within seismic hazard calculations through the use of physics-based 3D ground motion simulations. To calculate a waveform-based seismic hazard estimate for a site of interest, we begin with Uniform California Earthquake Rupture Forecast, Version 2.0 (UCERF2.0) and identify all ruptures within 200 km of the site of interest. We convert the UCERF2.0 rupture definition into multiple rupture variations with differing hypocenter locations and slip distributions, resulting in about 415,000 rupture variations per site. Strain Green Tensors are calculated for the site of interest using the SCEC Community Velocity Model, Version 4 (CVM4), and then, using reciprocity, we calculate synthetic seismograms for each rupture variation. Peak intensity measures are then extracted from these synthetics and combined with the original rupture probabilities to produce probabilistic seismic hazard curves for the site. Being explicitly site-based, CyberShake directly samples the ground motion variability at that site over many earthquake cycles (i.e., rupture scenarios) and alleviates the need for the ergodic assumption that is implicitly included in traditional empirically based calculations. Thus far, we have simulated ruptures at over 200 sites in the Los Angeles region for ground shaking periods of 2 s and longer, providing the basis for the first generation CyberShake hazard maps. Our results indicate that the combination of rupture directivity and basin response effects can lead to an increase in the hazard level for some sites, relative to that given by a conventional Ground Motion Prediction Equation (GMPE). Additionally, and perhaps more importantly, we find that the physics-based hazard results are much more sensitive to the assumed magnitude-area relations and magnitude uncertainty estimates used in the definition of the ruptures than is found in the traditional GMPE approach. This reinforces the need for continued development of a better understanding of earthquake source characterization and the constitutive relations that govern the earthquake rupture process.
We have developed an RNA-Seq analysis workflow for single-ended Illumina reads, termed RseqFlow. This workflow includes a set of analytic functions, such as quality control for sequencing data, ...signal tracks of mapped reads, calculation of expression levels, identification of differentially expressed genes and coding SNPs calling. This workflow is formalized and managed by the Pegasus Workflow Management System, which maps the analysis modules onto available computational resources, automatically executes the steps in the appropriate order and supervises the whole running process. RseqFlow is available as a Virtual Machine with all the necessary software, which eliminates any complex configuration and installation steps.
http://genomics.isi.edu/rnaseq
wangying@xmu.edu.cn; knowles@med.usc.edu; deelman@isi.edu; tingchen@usc.edu
Supplementary data are available at Bioinformatics online.
Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an ...extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale.
This paper describes the Pegasus framework that can be used to map complex scientific workflows onto distributed resources. Pegasus enables users to represent the workflows at an abstract level ...without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities. A real-life astronomy application is used as the basis for the study.
In this paper we address the problem of automatically generating job workflows for the Grid. These workflows describe the execution of a complex application built from individual application ...components. In our work we have developed two workflow generators: the first (the Concrete Workflow Generator CWG) maps an abstract workflow defined in terms of application-level components to the set of available Grid resources. The second generator (Abstract and Concrete Workflow Generator, ACWG) takes a wider perspective and not only performs the abstract to concrete mapping but also enables the construction of the abstract workflow based on the available components. This system operates in the application domain and chooses application components based on the application metadata attributes. We describe our current ACWG based on AI planning technologies and outline how these technologies can play a crucial role in developing complex application workflows in Grid environments. Although our work is preliminary, CWG has already been used to map high energy physics applications onto the Grid. In one particular experiment, a set of production runs lasted 7 days and resulted in the generation of 167,500 events by 678 jobs. Additionally, ACWG was used to map gravitational physics workflows, with hundreds of nodes onto the available resources, resulting in 975 tasks, 1365 data transfers and 975 output files produced.
SPOT (http://spot.cgsmd.isi.edu), the SNP prioritization online tool, is a web site for integrating biological databases into the prioritization of single nucleotide polymorphisms (SNPs) for further ...study after a genome-wide association study (GWAS). Typically, the next step after a GWAS is to genotype the top signals in an independent replication sample. Investigators will often incorporate information from biological databases so that biologically relevant SNPs, such as those in genes related to the phenotype or with potentially non-neutral effects on gene expression such as a splice sites, are given higher priority. We recently introduced the genomic information network (GIN) method for systematically implementing this kind of strategy. The SPOT web site allows users to upload a list of SNPs and GWAS P-values and returns a prioritized list of SNPs using the GIN method. Users can specify candidate genes or genomic regions with custom levels of prioritization. The results can be downloaded or viewed in the browser where users can interactively explore the details of each SNP, including graphical representations of the GIN method. For investigators interested in incorporating biological databases into a post-GWAS SNP selection strategy, the SPOT web tool is an easily implemented and flexible solution.