The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination ...events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of chromosomes conditional on an ARG of chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps.
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
The prominent role of Horizontal Gene Transfer (HGT) in the evolution of bacteria is now well documented, but few studies have differentiated between evolutionary events that predominantly cause ...genes in one lineage to be replaced by homologs from another lineage ("replacing HGT") and events that result in the addition of substantial new genomic material ("additive HGT"). Here in, we make use of the distinct phylogenetic signatures of replacing and additive HGTs in a genome-wide study of the important human pathogen Streptococcus pyogenes (SPY) and its close relatives S. dysgalactiae subspecies equisimilis (SDE) and S. dysgalactiae subspecies dysgalactiae (SDD). Using recently developed statistical models and computational methods, we find evidence for abundant gene flow of both kinds within each of the SPY and SDE clades and of reduced levels of exchange between SPY and SDD. In addition, our analysis strongly supports a pronounced asymmetry in SPY-SDE gene flow, favoring the SPY-to-SDE direction. This finding is of particular interest in light of the recent increase in virulence of pathogenic SDE. We find much stronger evidence for SPY-SDE gene flow among replacing than among additive transfers, suggesting a primary influence from homologous recombination between co-occurring SPY and SDE cells in human hosts. Putative virulence genes are correlated with transfer events, but this correlation is found to be driven by additive, not replacing, HGTs. The genes affected by additive HGTs are enriched for functions having to do with transposition, recombination, and DNA integration, consistent with previous findings, whereas replacing HGTs seen to influence a more diverse set of genes. Additive transfers are also found to be associated with evidence of positive selection. These findings shed new light on the manner in which HGT has shaped pathogenic bacterial genomes.
Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major ...obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.
Distance based reconstruction methods of phylogenetic trees consist of two independent parts: first, inter-species distances are inferred assuming some stochastic model of sequence evolution; then ...the inferred distances are used to construct a tree. In this paper we concentrate on the task of inter-species distance estimation. Specifically, we characterize the family of valid distance functions for the assumed substitution model and show that deliberate selection of distance function significantly improves the accuracy of distance estimates and, consequently, also improves the accuracy of the reconstructed tree.
Our contribution consists of three parts: first, we present a general framework for constructing families of additive distance functions for stochastic evolutionary models. Then, we present a method for selecting (near) optimal distance functions, and we conclude by presenting simulation results which support our theoretical analysis.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UL, UM, UPCLJ, UPUK
The propensity of animal miRNAs to regulate targets bearing modest complementarity, most notably via pairing with miRNA positions ~2-8 (the "seed"), is believed to drive major aspects of miRNA ...evolution. First, minimal targeting requirements have allowed most conserved miRNAs to acquire large target cohorts, thus imposing strong selection on miRNAs to maintain their seed sequences. Second, the modest pairing needed for repression suggests that evolutionarily nascent miRNAs may generally induce net detrimental, rather than beneficial, regulatory effects. Hence, levels and activities of newly emerged miRNAs are expected to be limited to preserve the status quo of gene expression. In this study, we unexpectedly show that Drosophila testes specifically express a substantial miRNA population that contravenes these tenets. We find that multiple genomic clusters of testis-restricted miRNAs harbor recently evolved miRNAs, whose experimentally verified orthologs exhibit divergent sequences, even within seed regions. Moreover, this class of miRNAs exhibits higher expression and greater phenotypic capacities in transgenic misexpression assays than do non-testis-restricted miRNAs of similar evolutionary age. These observations suggest that these testis-restricted miRNAs may be evolving adaptively, and several methods of evolutionary analysis provide strong support for this notion. Consistent with this, proof-of-principle tests show that orthologous miRNAs with divergent seeds can distinguish target sensors in a species-cognate manner. Finally, we observe that testis-restricted miRNA clusters exhibit extraordinary dynamics of miRNA gene flux in other Drosophila species. Altogether, our findings reveal a surprising tissue-directed influence of miRNA evolution, involving a distinct mode of miRNA function connected to adaptive gene regulation in the testis.
Artificial synthesis of DNA molecules is an essential part of the study of biological mechanisms. The design of a synthetic DNA molecule usually involves many objectives. One of the important ...objectives is to eliminate short sequence patterns that correspond to binding sites of restriction enzymes or transcription factors. While many design tools address this problem, no adequate formal solution exists for the pattern elimination problem. In this work, we present a formal description of the elimination problem and suggest efficient algorithms that eliminate unwanted patterns and allow optimization of other objectives with minimal interference to the desired DNA functionality. Our approach is flexible, efficient, and straightforward, and therefore can be easily incorporated in existing DNA design tools, making them considerably more powerful.
Making faultless complex objects from potentially faulty building blocks is a fundamental challenge in computer engineering, nanotechnology and synthetic biology. Here, we show for the first time how ...recursion can be used to address this challenge and demonstrate a recursive procedure that constructs error‐free DNA molecules and their libraries from error‐prone oligonucleotides. Divide and Conquer (D&C), the quintessential recursive problem‐solving technique, is applied in silico to divide the target DNA sequence into overlapping oligonucleotides short enough to be synthesized directly, albeit with errors; error‐prone oligonucleotides are recursively combined in vitro, forming error‐prone DNA molecules; error‐free fragments of these molecules are then identified, extracted and used as new, typically longer and more accurate, inputs to another iteration of the recursive construction procedure; the entire process repeats until an error‐free target molecule is formed. Our recursive construction procedure surpasses existing methods for de novo DNA synthesis in speed, precision, amenability to automation, ease of combining synthetic and natural DNA fragments, and ability to construct designer DNA libraries. It thus provides a novel and robust foundation for the design and construction of synthetic biological molecules and organisms.
Full text
Available for:
FZAB, GIS, IJS, IZUM, KILJ, NLZOH, NUK, OILJ, PILJ, PNG, SAZU, SBCE, SBMB, UL, UM, UPUK
Distance-based phylogenetic reconstruction methods use the evolutionary distances between species in order to reconstruct the tree spanning them. The evolutionary distance between two species, which ...is computed from their DNA (or protein) sequences, is typically considered as a fixed function of these sequences, predetermined by the assumed model of evolution. This article continues the line of research that attempts to adjust to each given set of input sequences a distance function which maximizes the expected accuracy of the reconstructed tree. Specifically, we present methods for selecting distance functions that considerably improve the accuracy of quartets constructed by the four-point method in Kimura's 2-parameter model, where special emphasis is given to the case of non-homogenous quartets.
Making error-free, custom DNA assemblies from potentially faulty building blocks is a fundamental challenge in synthetic biology. Here, we show how recursion can be used to address this challenge ...using a recursive procedure that constructs error-free DNA molecules and their libraries from error-prone synthetic oligonucleotides and naturally existing DNA. Specifically, we describe how divide and conquer (D&C), the quintessential recursive problem-solving technique, is applied in silico to divide target DNA sequences into overlapping, albeit error prone, oligonucleotides, and how recursive construction is applied in vitro to combine them to form error-prone DNA molecules. To correct DNA sequence errors, error-free fragments of these molecules are then identified, extracted, and used as new, typically longer and more accurate, inputs to another iteration of the recursive construction procedure; the entire process repeats until an error-free target molecule is formed. The method allows combining synthetic and natural DNA fragments into error-free designer DNA libraries, thus providing a foundation for the design and construction of complex synthetic DNA assemblies.