Abstract
Trans-species polymorphism has been widely used as a key sign of long-term balancing selection across multiple species. However, such sites are often rare in the genome and could result from mutational processes or technical artifacts. Few methods are yet available to specifically detect footprints of trans-species balancing selection without using trans-species polymorphic sites. In this study, we develop summary- and model-based approaches that are each specifically tailored to uncover regions of long-term balancing selection shared by a set of species by using genomic patterns of intraspecific polymorphism and interspecific fixed differences. We demonstrate that our trans-species statistics have substantially higher power than single-species approaches to detect footprints of trans-species balancing selection, and are robust to those that do not affect all tested species. We further apply our model-based methods to human and chimpanzee whole-genome sequencing data. In addition to the previously established major histocompatibility complex and malaria resistance-associated FREM3/GYPE regions, we also find outstanding genomic regions involved in barrier integrity and innate immunity, such as the GRIK1/CLDN17 intergenic region, and the SLC35F1 and ABCA13 genes. Our findings not only echo the significance of pathogen defense but also reveal novel candidates in maintaining balanced polymorphisms across human and chimpanzee lineages. Finally, we show that these trans-species statistics can be applied to and work well for an arbitrary number of species, and integrate them into open-source software packages for ease of use by the scientific community.
Abstract
Learning about the roles that duplicate genes play in the origins of novel phenotypes requires an understanding of how their functions evolve. A previous method for achieving this goal, CDROM, employs gene expression distances as proxies for functional divergence and then classifies the evolutionary mechanisms retaining duplicate genes from comparisons of these distances in a decision tree framework. However, CDROM does not account for stochastic shifts in gene expression or leverage advances in contemporary statistical learning for performing classification, nor is it capable of predicting the parameters driving duplicate gene evolution. Thus, here we develop CLOUD, a multi-layer neural network built on a model of gene expression evolution that can both classify duplicate gene retention mechanisms and predict their underlying evolutionary parameters. We show that not only is the CLOUD classifier substantially more powerful and accurate than CDROM, but that it also yields accurate parameter predictions, enabling a better understanding of the specific forces driving the evolution and long-term retention of duplicate genes. Further, application of the CLOUD classifier and predictor to empirical data from Drosophila recapitulates many previous findings about gene duplication in this lineage, showing that new functions often emerge rapidly and asymmetrically in younger duplicate gene copies, and that functional divergence is driven by strong natural selection. Hence, CLOUD represents a major advancement in classifying retention mechanisms and predicting evolutionary parameters of duplicate genes, thereby highlighting the utility of incorporating sophisticated statistical learning techniques to address long-standing questions about evolution after gene duplication.
Summary
The growing availability of genomewide polymorphism data has fueled interest in detecting diverse selective processes affecting population diversity. However, no model-based approaches exist to jointly detect and distinguish the two complementary processes of balancing and positive selection. We extend the BalLeRMix B-statistic framework described in Cheng and DeGiorgio (2020) for detecting balancing selection and present BalLeRMix+, which implements five B statistic extensions based on mixture models to robustly identify both types of selection. BalLeRMix+ is implemented in Python and computes the composite likelihood ratios and associated model parameters for each genomic test position.
Availability and implementation
BalLeRMix+ is freely available at https://github.com/bioXiaoheng/BallerMixPlus.
Supplementary information
Supplementary data are available at Bioinformatics online.
We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples (typical of ancient DNA data), particularly when only low amounts of data are available for those samples.
The software package is released under the GNU General Public License v3 and is freely available, together with test datasets, at https://savannah.nongnu.org/projects/bammds/. It uses R (http://www.r-project.org/), parallel (http://www.gnu.org/software/parallel/), and samtools (https://github.com/samtools/samtools).
bammds-users@nongnu.org
Supplementary data are available at Bioinformatics online.
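The multidimensional scaling step mentioned in the bammds description can be illustrated with a minimal sketch of classical (Torgerson) MDS. This is not the bammds implementation: the function name `classical_mds`, the toy genotype matrix, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS from a symmetric distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Keep the eigenvectors with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

# Hypothetical toy data: 4 individuals x 5 sites, genotypes coded 0/1/2
genotypes = np.array([
    [0, 0, 1, 2, 0],
    [0, 1, 1, 2, 0],
    [2, 2, 0, 0, 1],
    [2, 1, 0, 0, 2],
], dtype=float)

# Pairwise Euclidean distances between individuals (an assumed metric)
diff = genotypes[:, None, :] - genotypes[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

coords = classical_mds(dist, n_components=2)   # one 2D point per individual
coords_full = classical_mds(dist, n_components=3)
```

Plotting the two columns of `coords` would place genetically similar individuals close together, which is the kind of picture used to judge the ancestry of an unknown sample against a reference panel.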
Most studies of human high-altitude adaptation in the Andean highlands have thus far focused on Peruvian populations. We present high-coverage whole genomes from Indigenous people living in the Ecuadorian highlands and perform multi-method scans to detect positive natural selection. We identified regions of the genome that show signals of strong selection in both cardiovascular and hypoxia pathways, which are distinct from those uncovered in Peruvian populations. However, the strongest signals of selection were in regions of the genome that are involved in immune function related to tuberculosis. Given our estimated timing of this selection event, the Indigenous people of Ecuador may have adapted to Mycobacterium tuberculosis thousands of years before the arrival of Europeans. Furthermore, we detect a population collapse that coincides with the arrival of Europeans and is more severe than in other regions of the Andes, suggesting differing effects of contact across high-altitude populations.
•Indigenous people of Ecuador may have adapted to tuberculosis over 3,000 years ago
•Selection in cardiovascular and hypoxia pathways distinct from Peruvian populations
•Population collapse at arrival of Europeans more severe than other areas of the Andes
Human genetics; Evolutionary biology; Anthropology
Abstract
Selective sweeps are frequent and varied signatures in the genomes of natural populations, and detecting them is consequently important in understanding mechanisms of adaptation by natural selection. Following a selective sweep, haplotypic diversity surrounding the site under selection decreases, and this deviation from the background pattern of variation can be applied to identify sweeps. Multiple methods exist to locate selective sweeps in the genome from haplotype data, but none leverages the power of a model-based approach to make its inference. Here, we propose a likelihood ratio test statistic T to probe whole-genome polymorphism data sets for selective sweep signatures. Our framework uses a simple but powerful model of haplotype frequency spectrum distortion to find sweeps and additionally make an inference on the number of presently sweeping haplotypes in a population. We found that the T statistic is suitable for detecting both hard and soft sweeps across a variety of demographic models, selection strengths, and ages of the beneficial allele. Accordingly, we applied the T statistic to variant calls from European and sub-Saharan African human populations, yielding primarily literature-supported candidates, including LCT, RSPH3, and ZNF211 in CEU, SYT1, RGS18, and NNT in YRI, and HLA genes in both populations. We also searched for sweep signatures in Drosophila melanogaster, finding expected candidates at Ace, Uhg1, and Pimet. Finally, we provide open-source software to compute the T statistic and the inferred number of presently sweeping haplotypes from whole-genome data.
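The haplotype frequency spectrum that the abstract above refers to can be tabulated for a single genomic window as in the sketch below. This only illustrates the input quantity, not the T statistic or its likelihood model; the function name `haplotype_frequency_spectrum` and the toy window are hypothetical.

```python
from collections import Counter

import numpy as np

def haplotype_frequency_spectrum(haplotypes):
    """Frequencies of distinct haplotypes, sorted from most to least common."""
    # Each row is one sampled haplotype; identical rows are the same haplotype
    counts = Counter(tuple(row) for row in np.asarray(haplotypes))
    n = sum(counts.values())
    return sorted((c / n for c in counts.values()), reverse=True)

# Hypothetical window: 6 sampled haplotypes x 4 biallelic sites (0/1 alleles)
window = np.array([
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])

spectrum = haplotype_frequency_spectrum(window)
```

In this toy window one haplotype accounts for half of the sample. A spectrum dominated by one or a few high-frequency haplotypes is the kind of distortion a sweep is expected to produce relative to the genomic background.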
There are few methods tailored for detecting signals of positive selection in populations directly ancestral to multiple descendant populations. We introduce the ancestral branch statistic (ABS), a four‐population summary statistic for identifying selective sweeps occurring in the direct ancestor of a pair of populations. Simulations show that ABS performs at least as well as, and under model violations often better than, the complementary likelihood approach of 3P‐CLR across diverse selection scenarios and parameter values. We first applied ABS to contemporary human genomic data to identify genes that may have been adaptive in ancestral East Asian populations, uncovering the well‐established candidate EDAR, as well as a novel candidate SLC35F3, which encodes a putative thiamine transporter that may have been involved in adaptation to eating polished grains. Next, we performed scans with ancient European genomic data to reexamine evidence of recent positive selection in ancestral Europeans. The MCM6/LCT cluster and the SLC45A2 and HERC2 genes are strong outliers, agreeing with previous studies. Novel candidates, such as SLC30A9 and CYP1A2, may have been involved in adaptation to local nutrient sufficiency and lifestyle changes. Finally, we provide open‐source software, CalcABS, which can perform genomic scans of ancestral sweeps with ABS from population allele frequency data.
Abstract
Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data.
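The conversion of a one-dimensional summary statistic array into a two-dimensional image, as described in the abstract above, can be sketched with a simple continuous wavelet transform. The Ricker wavelet, the width range, and the synthetic statistic below are assumptions chosen for illustration; the abstract does not specify which wavelet or settings were used.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican hat) wavelet with width `a`, sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_image(signal, widths):
    """Convolve `signal` with a wavelet of each width; rows = scales, cols = positions."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for a in widths:
        # Wavelet support grows with width but never exceeds the signal length
        n_points = min(10 * int(a), len(signal))
        rows.append(np.convolve(signal, ricker(n_points, a), mode="same"))
    return np.vstack(rows)

# Hypothetical summary statistic over 128 genomic windows with a central dip,
# loosely mimicking reduced diversity near a selected locus
positions = np.arange(128)
stat = 1.0 - 0.8 * np.exp(-0.5 * ((positions - 64) / 6.0) ** 2)

image = cwt_image(stat, widths=np.arange(1, 17))
```

The resulting `image` has one row per wavelet width and one column per genomic window, so it can be treated like a single-channel picture and passed to a convolutional network.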
SweepFinder is a widely used program that implements a powerful likelihood-based method for detecting recent positive selection, or selective sweeps. Here, we present SweepFinder2, an extension of SweepFinder with increased sensitivity and robustness to the confounding effects of mutation rate variation and background selection. Moreover, SweepFinder2 has increased flexibility that enables the user to specify test sites, set the distance between test sites and utilize a recombination map.
SweepFinder2 is a freely available (www.personal.psu.edu/mxd60/sf2.html) software package that is written in C and can be run from a Unix command line.
mxd60@psu.edu.
Abstract
Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
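The idea of extracting features from a stack of haplotype images by tensor decomposition can be sketched with a truncated higher-order SVD along the haplotype and site modes. This is one simple decomposition chosen for illustration and is not claimed to be the one T-REx uses; the helper names, ranks, and random toy tensor are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize `tensor` so that `mode` indexes the rows."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_factor(tensor, mode, rank):
    """Leading `rank` left singular vectors of the mode unfolding."""
    u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
    return u[:, :rank]

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along `mode` by `matrix` (shape: new_dim x old_dim)."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(out, 0, mode)

# Hypothetical data: 20 replicates x 8 haplotypes x 12 sites, alleles coded 0/1
rng = np.random.default_rng(0)
images = rng.integers(0, 2, size=(20, 8, 12)).astype(float)

# Shared low-rank bases for the haplotype and site modes
u_hap = mode_factor(images, mode=1, rank=3)
u_site = mode_factor(images, mode=2, rank=4)

# Project every image onto the bases, keeping the replicate mode intact
core = mode_product(mode_product(images, u_hap.T, 1), u_site.T, 2)
features = core.reshape(core.shape[0], -1)  # one feature vector per replicate
```

Each row of `features` is a compact description of one image that keeps its two-way structure in the shared bases, and such rows could then be passed to a classical classifier.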