Abstract
Motivation
Read alignment is an essential first step in the characterization of DNA sequence variation. The accuracy of variant-calling results depends not only on the quality of read alignment and variant-calling software but also on the interaction between these complex software tools.
Results
In this review, we evaluate short-read aligner performance with the goal of optimizing germline variant-calling accuracy. We examine the performance of three general-purpose short-read aligners—BWA-MEM, Bowtie 2, and Arioc—in conjunction with three germline variant callers: DeepVariant, FreeBayes, and GATK HaplotypeCaller. We discuss the behavior of the read aligners with regard to the data elements on which the variant callers rely, and illustrate how the runtime configurations of these software tools combine to affect variant-calling performance.
Abstract
Summary
Over the past decade, short-read sequence alignment has become a mature technology. Optimized algorithms, careful software engineering and high-speed hardware have contributed to greatly increased throughput and accuracy. With these improvements, many opportunities for performance optimization have emerged. In this review, we examine three general-purpose short-read alignment tools (BWA-MEM, Bowtie 2 and Arioc) with a focus on performance optimization. We analyze the performance-related behavior of the algorithms and heuristics each tool implements, with the goal of arriving at practical methods of improving processing speed and accuracy. We indicate where an aligner's default behavior may result in suboptimal performance, explore the effects of computational constraints such as end-to-end mapping and alignment scoring threshold, and discuss sources of imprecision in the computation of alignment scores and mapping quality. With this perspective, we describe an approach to tuning short-read aligner performance to meet specific data-analysis and throughput requirements while avoiding potential inaccuracies in subsequent analysis of alignment results. Finally, we illustrate how this approach avoids easily overlooked pitfalls and leads to verifiable improvements in alignment speed and accuracy.
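As background for the mapping-quality discussion above: the SAM format defines MAPQ as -10 log10 of the probability that the reported mapping position is wrong, rounded to the nearest integer, with 255 reserved for "not available". The sketch below shows only that conversion. The function names are illustrative assumptions and are not taken from BWA-MEM, Bowtie 2 or Arioc, each of which estimates the underlying probability with its own heuristics.

```python
import math


def mapq_from_error_prob(p_wrong: float, cap: int = 254) -> int:
    """Phred-scale a mapping-error probability, following the SAM definition of MAPQ.

    The cap of 254 is an assumption made here because 255 means "unavailable" in SAM.
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be a probability in [0, 1]")
    if p_wrong == 0.0:
        return cap  # log10(0) is undefined, so saturate at the cap
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))


def error_prob_from_mapq(mapq: int) -> float:
    """Invert the Phred scaling: the error probability implied by a MAPQ value."""
    return 10.0 ** (-mapq / 10.0)
```

Because the result is rounded to an integer, and because aligners often cap MAPQ well below 254, a reported MAPQ is only a coarse summary of the aligner's confidence.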
Contact
richard.wilton@jhu.edu
Supplementary information
Appendices referenced in this article are available at Bioinformatics online.
In large DNA sequence repositories, archival data storage is often coupled with computers that provide 40 or more CPU threads and multiple GPU (general-purpose graphics processing unit) devices. This presents an opportunity for DNA sequence alignment software to exploit high-concurrency hardware to generate short-read alignments at high speed. Arioc, a GPU-accelerated short-read aligner, can compute WGS (whole-genome sequencing) alignments ten times faster than comparable CPU-only alignment software. When two or more GPUs are available, Arioc's speed increases proportionately because the software executes concurrently on each available GPU device. We have adapted Arioc to recent multi-GPU hardware architectures that support high-bandwidth peer-to-peer memory accesses among multiple GPUs. By modifying Arioc's implementation to exploit this GPU memory architecture, we obtained a further 1.8x to 2.9x increase in overall alignment speeds. With this additional acceleration, Arioc computes two million short-read alignments per second in a four-GPU system; it can align the reads from a human WGS sequencer run (over 500 million 150 nt paired-end reads) in less than 15 minutes. As WGS data accumulates exponentially and high-concurrency computational resources become widespread, Arioc addresses a growing need for timely computation in the short-read data analysis toolchain.
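The throughput figures quoted above can be checked with simple arithmetic. The sketch below is not a benchmark: it takes the stated rate of two million alignments per second at face value and tries both readings of "500 million paired-end reads" (500 million individual reads, or 500 million pairs). The function name is an illustrative assumption.

```python
def alignment_minutes(n_reads: float, reads_per_second: float) -> float:
    """Minutes needed to align n_reads at a constant throughput."""
    return n_reads / reads_per_second / 60.0


RATE = 2.0e6  # alignments per second, as stated for the four-GPU system

# Reading 1: 500 million individual reads.
minutes_if_reads = alignment_minutes(500e6, RATE)

# Reading 2: 500 million pairs, i.e. one billion individual reads.
minutes_if_pairs = alignment_minutes(2 * 500e6, RATE)

print(f"{minutes_if_reads:.1f} min or {minutes_if_pairs:.1f} min")
```

Under either reading the pure alignment time falls below the quoted 15 minutes, which leaves room for a run larger than 500 million reads and for overhead outside the alignment kernels.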
Photometric redshifts for the SDSS Data Release 12
Beck, Robert; Dobos, Laszlo; Budavari, Tamas ...
Monthly Notices of the Royal Astronomical Society, 08/2016, Volume 460, Issue 2
Journal Article, Peer reviewed, Open access
We present the methodology and data behind the photometric redshift data base of the Sloan Digital Sky Survey (SDSS) Data Release 12. We adopt a hybrid technique, empirically estimating the redshift via local regression on a spectroscopic training set, then fitting a spectrum template to obtain K-corrections and absolute magnitudes. The SDSS spectroscopic catalogue was augmented with data from other, publicly available spectroscopic surveys to mitigate target selection effects. The training set comprises 1 976 978 galaxies, and extends up to redshift z ... 0.8, with a useful coverage of up to z ... 0.6. We provide photometric redshifts and realistic error estimates for the 208 474 076 galaxies of the SDSS primary photometric catalogue. We achieve an average bias of ... = 5.84 × 10^-5, a standard deviation of ... = 0.0205, and a 3... outlier rate of P_o = 4.11 per cent when cross-validating on our training set. The published redshift error estimates and photometric error classes enable the selection of galaxies with high-quality photometric redshifts. We also provide a supplementary error map that allows additional, sophisticated filtering of the data. (ProQuest: ... denotes formulae/symbols omitted.)
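The empirical step described above, local regression on a spectroscopic training set, can be illustrated with a minimal sketch. It assumes a brute-force k-nearest-neighbour search and a first-order (linear) fit. The function name, the choice of k and the feature vector are illustrative assumptions, not the pipeline's actual configuration.

```python
import numpy as np


def local_linear_photoz(train_x, train_z, query_x, k=100):
    """Estimate a redshift from a linear fit to the k nearest training galaxies.

    train_x : (n, d) array of photometric features (e.g. magnitudes and colours)
    train_z : (n,) array of spectroscopic redshifts
    query_x : (d,) feature vector of the galaxy to estimate
    Returns (z_estimate, rms_scatter_of_neighbours_about_the_fit).
    """
    train_x = np.asarray(train_x, dtype=float)
    train_z = np.asarray(train_z, dtype=float)
    query_x = np.asarray(query_x, dtype=float)

    # Brute-force Euclidean distances from the query to every training point.
    dist = np.linalg.norm(train_x - query_x, axis=1)
    nearest = np.argsort(dist)[:k]

    # Design matrix with an intercept column: z ~ c0 + c . x
    design = np.hstack([np.ones((nearest.size, 1)), train_x[nearest]])
    coeffs, *_ = np.linalg.lstsq(design, train_z[nearest], rcond=None)

    # Scatter of the neighbours about the local fit, as a rough error estimate.
    residuals = train_z[nearest] - design @ coeffs
    dof = max(nearest.size - design.shape[1], 1)
    sigma = float(np.sqrt(np.sum(residuals ** 2) / dof))

    z_query = float(np.concatenate([[1.0], query_x]) @ coeffs)
    return z_query, sigma
```

A production implementation would use a spatial index instead of the brute-force search and would reject outliers before fitting; both are omitted here for brevity.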
We present a general probabilistic formalism for cross-identifying astronomical point sources in multiple observations. Our Bayesian approach, symmetric in all observations, is the foundation of a unified framework for object matching, where not only spatial information, but also physical properties, such as colors, redshift, and luminosity, can be considered in a natural way. We provide a practical recipe to implement an efficient recursive algorithm to evaluate the Bayes factor over a set of catalogs with known circular errors in positions. This new methodology is crucial for studies leveraging the synergy of today's multiwavelength observations and for entering the time-domain science of the upcoming survey telescopes.
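For intuition, the simplest case of such a Bayes factor is a match between two catalogs with circular Gaussian positional errors. In the small-angle limit it is commonly written as B = 2 / (σ1² + σ2²) · exp(-ψ² / (2(σ1² + σ2²))), with the separation ψ and the errors σ in radians. The sketch below evaluates that expression. The formula is stated here as an assumption, since it is not given in the abstract, and the function name is illustrative. The recursive multi-catalog algorithm is not reproduced.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # radians per arcsecond


def bayes_factor_two_catalogs(sep_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Two-catalog positional Bayes factor (small-angle, Gaussian-error form).

    Values much greater than 1 favour "same source"; values much less than 1
    favour "two unrelated sources".
    """
    psi = sep_arcsec * ARCSEC
    var = (sigma1_arcsec * ARCSEC) ** 2 + (sigma2_arcsec * ARCSEC) ** 2
    return 2.0 / var * math.exp(-psi ** 2 / (2.0 * var))


# Two detections with 0.1 arcsec errors, at increasing separations.
for sep in (0.0, 0.2, 1.0):
    print(sep, bayes_factor_two_catalogs(sep, 0.1, 0.1))
```

The factor is very large for small separations because the prior probability of a chance alignment within a fraction of an arcsecond is tiny, and it decays rapidly once the separation exceeds a few times the combined error.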
Accurate modelling of non-linearities in the galaxy bispectrum, the Fourier transform of the galaxy three-point correlation function, is essential to fully exploit it as a cosmological probe. In this paper, we present numerical and theoretical challenges in modelling the non-linear bispectrum. First, we test the robustness of the matter bispectrum measured from N-body simulations using different initial conditions generators. We run a suite of N-body simulations using the Zel'dovich approximation and second-order Lagrangian perturbation theory (2LPT) at different starting redshifts, and find that transients from initial decaying modes systematically reduce the non-linearities in the matter bispectrum. To achieve 1 per cent accuracy in the matter bispectrum at z ≤ 3 on scales k < 1 h Mpc^-1, a 2LPT initial conditions generator with initial redshift z ... 100 is required. We then compare various analytical formulas and empirical fitting functions for modelling the non-linear matter bispectrum, and discuss the regimes for which each is valid. We find that the next-to-leading order (one-loop) correction from standard perturbation theory matches N-body results on quasi-linear scales for z ≥ 1. We find that the fitting formula in Gil-Marin et al. accurately predicts the matter bispectrum for z ≤ 1 on a wide range of scales, but at higher redshifts, the fitting formula given in Scoccimarro & Couchman gives the best agreement with measurements from N-body simulations. (ProQuest: ... denotes formulae/symbols omitted.)
We introduce a method to constrain general cosmological models using Baryon Acoustic Oscillation (BAO) distance measurements from galaxy samples covering different redshift ranges, and apply this method to analyse samples drawn from the Sloan Digital Sky Survey (SDSS) and 2dF Galaxy Redshift Survey (2dFGRS). BAOs are detected in the clustering of the combined 2dFGRS and SDSS main galaxy samples, and measure the distance–redshift relation at z = 0.2. BAOs in the clustering of the SDSS luminous red galaxies measure the distance–redshift relation at z = 0.35. The observed scales of the BAOs calculated from these samples and from the combined sample are jointly analysed using estimates of the correlated errors, to constrain the form of the distance measure D_V(z) ≡ [(1+z)^2 D_A^2 cz/H(z)]^{1/3}. Here D_A is the angular diameter distance, and H(z) is the Hubble parameter. This gives r_s/D_V(0.2) = 0.1980 ± 0.0058 and r_s/D_V(0.35) = 0.1094 ± 0.0033 (1σ errors), with a correlation coefficient of 0.39, where r_s is the comoving sound horizon scale at recombination. Matching the BAOs to have the same measured scale at all redshifts then gives D_V(0.35)/D_V(0.2) = 1.812 ± 0.060. The recovered ratio is roughly consistent with that predicted by the higher redshift Supernova Legacy Survey (SNLS) supernova data for Λ cold dark matter cosmologies, but does require slightly stronger cosmological acceleration at low redshift. If we force the cosmological model to be flat with constant w, then we find Ω_m = 0.249 ± 0.018 and w = −1.004 ± 0.089 after combining with the SNLS data, and including the WMAP measurement of the apparent acoustic horizon angle in the cosmic microwave background.
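To make the distance measure concrete, the sketch below evaluates D_V(z) = [(1+z)^2 D_A^2 cz/H(z)]^{1/3} in a flat Λ cold dark matter model and forms the ratio D_V(0.35)/D_V(0.2). The default values H0 = 70 km/s/Mpc and Ω_m = 0.25 are illustrative assumptions, not fitted parameters. The ratio does not depend on H0. The function names are hypothetical.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def hubble(z, h0=70.0, omega_m=0.25):
    """H(z) in km/s/Mpc for flat LCDM (radiation neglected)."""
    return h0 * math.sqrt(omega_m * (1.0 + z) ** 3 + (1.0 - omega_m))


def comoving_distance(z, h0=70.0, omega_m=0.25, steps=200):
    """Line-of-sight comoving distance in Mpc, by composite Simpson's rule."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    dz = z / steps
    total = 1.0 / hubble(0.0, h0, omega_m) + 1.0 / hubble(z, h0, omega_m)
    for i in range(1, steps):
        weight = 4.0 if i % 2 else 2.0
        total += weight / hubble(i * dz, h0, omega_m)
    return C_KM_S * total * dz / 3.0


def d_v(z, h0=70.0, omega_m=0.25):
    """Dilation-scale distance D_V(z) in Mpc."""
    d_a = comoving_distance(z, h0, omega_m) / (1.0 + z)  # flat universe
    return ((1.0 + z) ** 2 * d_a ** 2 * C_KM_S * z / hubble(z, h0, omega_m)) ** (1.0 / 3.0)


model_ratio = d_v(0.35) / d_v(0.2)
naive_measured_ratio = 0.1980 / 0.1094  # ratio of the two quoted r_s/D_V values
print(model_ratio, naive_measured_ratio)
```

Dividing the two quoted r_s/D_V values gives a number close to, but not identical with, the quoted 1.812, because that figure comes from a separate fit that matches the BAO scale across redshifts.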
From SkyServer to SciServer
Szalay, Alexander S.
The Annals of the American Academy of Political and Social Science, 01/2018, Volume 675, Issue 1
Journal Article, Peer reviewed, Open access
Twenty years ago, work commenced on the Sloan Digital Sky Survey. The project aimed to collect a statistically complete dataset over a large fraction of the sky and turn it into an open data resource for the world’s astronomy community. There were few examples to learn from, and those of us who worked on it had to invent much of the system ourselves. The project has made fundamental changes to astronomy, and we are now faced with the problem of ensuring that the data will be preserved and kept in active use for another 20 years. In redesigning this very large, open archive of data, we made a system that is able to serve a much broader set of communities. In this article, I discuss what we have learned by rebuilding a massive dataset that is available to an increasingly sophisticated set of users, and how we have been challenged and motivated to incorporate more of the patterns of data analytics required by contemporary science.
Astronomy was among the first disciplines to embrace Big Data and use it to characterize spatial relationships between stars and galaxies. Today, medicine, in particular pathology, has similar needs with regard to characterizing the spatial relationships between cells, with an emphasis on understanding the organization of the tumor microenvironment. In this article, we chronicle the emergence of data-intensive science through the development of the Sloan Digital Sky Survey and describe how analysis patterns and approaches similarly apply to multiplex immunofluorescence (mIF) pathology image exploration. The lessons learned from astronomy are detailed, and the new AstroPath platform that capitalizes on these learnings is described. AstroPath is being used to generate and display tumor-immune maps that can be used for mIF immuno-oncology biomarker development. The development of AstroPath as an open resource for visualizing and analyzing large-scale spatially resolved mIF datasets is underway, akin to how publicly available maps of the sky have been used by astronomers and citizen scientists alike. Associated technical, academic, and funding considerations, as well as extended future development for inclusion of spatial transcriptomics and application of artificial intelligence, are also addressed.