Abstract
pcadapt is a user-friendly R package for performing genome scans for local adaptation. Here, we present version 4 of pcadapt, which substantially improves computational efficiency while providing similar results. This improvement is made possible by using a different format for storing genotypes and a different algorithm for computing principal components of the genotype matrix, which is the most computationally demanding step in the pcadapt method. These changes are seamlessly integrated into the existing pcadapt package, and users will experience a large reduction in computation time (by a factor of 20–60 in our analyses) as compared with previous versions.
The emergence of farming during the Neolithic transition, including the domestication of livestock, was a critical point in the evolution of humankind. The goat (Capra hircus) was one of the first domesticated ungulates. In this study, we compared the genetic diversity of domestic goats to that of the modern representatives of their wild ancestor, the bezoar, by analyzing 473 samples collected over the whole distribution range of the latter species. This partly confirms and significantly clarifies the goat domestication scenario already proposed by archaeological evidence. All of the mitochondrial DNA haplogroups found in current domestic goats have also been found in the bezoar. The geographic distribution of these haplogroups in the wild ancestor allowed the localization of the main domestication centers. We found no haplotype that could have been domesticated in the eastern half of the Iranian Plateau, nor further to the east. A signature of population expansion in bezoars of the C haplogroup suggests an early domestication center on the Central Iranian Plateau (Yazd and Kerman Provinces) and in the Southern Zagros (Fars Province), possibly corresponding to the management of wild flocks. However, the contribution of this center to the current domestic goat population is rather low (1.4%). We also found a second domestication center covering a large area in Eastern Anatolia, and possibly in Northern and Central Zagros. This last domestication center is the likely origin of almost all domestic goats today. This finding is consistent with archaeological data identifying Eastern Anatolia as an important domestication center.
To characterize natural selection, various analytical methods for detecting candidate genomic regions have been developed. We propose to perform genome-wide scans of natural selection using principal component analysis (PCA). We show that the common FST index of genetic differentiation between populations can be viewed as the proportion of variance explained by the principal components. Considering the correlations between genetic variants and each principal component provides a conceptual framework to detect genetic variants involved in local adaptation without any prior definition of populations. To validate the PCA-based approach, we analyze the 1000 Genomes data (phase 1), comprising 850 individuals from Africa, Asia, and Europe. The number of genetic variants is of the order of 36 million, obtained with a low-coverage sequencing depth (3×). The correlations between genetic variation and each principal component provide well-known targets for positive selection (EDAR, SLC24A5, SLC45A2, DARC), and also new candidate genes (APPBPP2, TP1A1, RTTN, KCNMA, MYO5C) and noncoding RNAs. In addition to identifying genes involved in biological adaptation, we identify two biological pathways involved in polygenic adaptation that are related to the innate immune system (beta defensins) and to lipid metabolism (fatty acid omega oxidation). An additional analysis of European data shows that a genome scan based on PCA retrieves classical examples of local adaptation even when there are no well-defined populations. PCA-based statistics, implemented in the PCAdapt R package and the PCAdapt fast open-source software, retrieve well-known signals of human adaptation, which is encouraging for future whole-genome sequencing projects, especially when defining populations is difficult.
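The idea of scanning for variants that correlate unusually strongly with a principal component can be illustrated with a toy sketch. The example below is not the PCAdapt implementation: the simulated allele frequencies, the sample sizes, the single "selected" SNP, and the use of power iteration to obtain the first principal component are all assumptions made for illustration. It simulates two weakly differentiated populations, computes the first principal component without using population labels, and ranks SNPs by their squared correlation with it.

```python
import math
import random

rng = random.Random(42)
n_per_pop, n_neutral = 50, 100

def draw_genotype(p):
    """Diploid genotype: count of alternate alleles (0, 1 or 2) at frequency p."""
    return (rng.random() < p) + (rng.random() < p)

# Hypothetical allele frequencies (population A, population B):
# neutral SNPs differ mildly, the last SNP differs strongly.
freqs = []
for _ in range(n_neutral):
    p0 = rng.uniform(0.2, 0.8)
    freqs.append((p0 + 0.1, p0 - 0.1))
freqs.append((0.95, 0.05))  # hypothetical locally adapted SNP
selected_index = n_neutral

# Genotype matrix: rows are individuals, columns are SNPs.
genotypes = [[draw_genotype(pa) for pa, _ in freqs] for _ in range(n_per_pop)]
genotypes += [[draw_genotype(pb) for _, pb in freqs] for _ in range(n_per_pop)]
n = len(genotypes)

def standardize(column):
    """Center a SNP column and scale it to unit (population) variance."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd if sd > 0 else 0.0 for x in column]

columns = [standardize([row[j] for row in genotypes]) for j in range(len(freqs))]

# Power iteration on Z Z^T (Z = standardized genotypes) gives the scores of
# the first principal component; population labels are never used.
u = [rng.gauss(0.0, 1.0) for _ in range(n)]
for _ in range(100):
    w = [sum(c * x for c, x in zip(col, u)) for col in columns]            # Z^T u
    u = [sum(col[i] * wj for col, wj in zip(columns, w)) for i in range(n)]  # Z w
    norm = math.sqrt(sum(x * x for x in u))
    u = [x / norm for x in u]

# Squared correlation of each SNP with PC1. Each column has mean 0 and sum of
# squares n, and u has mean 0 and unit norm, so r = (col . u) / sqrt(n).
r2 = [(sum(c * x for c, x in zip(col, u)) / math.sqrt(n)) ** 2 for col in columns]
top_snp = max(range(len(r2)), key=r2.__getitem__)
print("SNP most correlated with PC1:", top_snp)
```

In this simulation the strongly differentiated SNP should rank first, whereas a real analysis would retain several components and convert the statistics into calibrated p-values.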
Actions taken to control the coronavirus disease 2019 (COVID-19) pandemic have conspicuously reduced motor vehicle traffic, potentially alleviating auditory pressures on animals that rely on sound for survival and reproduction. Here, by comparing soundscapes and songs across the San Francisco Bay Area before and during the recent statewide shutdown, we evaluated whether a common songbird responsively exploited newly emptied acoustic space. We show that noise levels in urban areas were substantially lower during the shutdown, characteristic of traffic in the mid-1950s. We also show that birds responded by producing higher performance songs at lower amplitudes, effectively maximizing communication distance and salience. These findings illustrate that behavioral traits can change rapidly in response to newly favorable conditions, indicating an inherent resilience to long-standing anthropogenic pressures such as noise pollution.
Mediation analysis is used in epidemiology to identify pathways through which exposures influence health. The advent of high-throughput (omics) technologies gives opportunities to perform mediation analysis with a high-dimension pool of covariates.
We aimed to highlight some biostatistical issues of this expanding field of high-dimension mediation.
The mediation techniques used for a single mediator cannot be generalized in a straightforward manner to high-dimension mediation. Causal knowledge on the relation between covariates is required for mediation analysis, and it is expected to be more limited as dimension and system complexity increase. The methods developed in high dimension can be distinguished according to whether mediators are considered separately or as a whole. Methods considering each potential mediator separately do not allow efficient identification of the indirect effects when mutual influences exist among the mediators, which is expected for many biological (e.g., epigenetic) parameters. In this context, methods considering all potential mediators simultaneously, based, for example, on data reduction techniques, are better suited to the causal inference framework. Their cost is a possible lack of ability to single out the causal mediators. Moreover, the ability of the mediators to predict the outcome can be overestimated, in particular because many machine-learning algorithms are optimized to increase predictive ability rather than their aptitude to make causal inference. Given the lack of an overarching validated framework and the generally complex causal structure of high-dimension data, analysis of high-dimension mediation currently requires great caution and effort to incorporate biological knowledge. https://doi.org/10.1289/EHP6240.
Adaptation to environmental conditions within the native range of exotic species can condition the invasion success of these species outside their range. The striking success of the Asian tiger mosquito, Aedes albopictus, in invading temperate regions has been attributed to the winter survival of diapause eggs in cold environments. In this study, we evaluate genetic polymorphisms (SNPs) and wing morphometric variation among three biogeographical regions of the native range of A. albopictus. Reconstructed demographic histories of populations show an initial expansion in Southeast Asia and suggest that marine regression during late Pleistocene and climate warming after the last glacial period favored expansion of populations in southern and northern regions, respectively. Searching for genomic signatures of selection, we identified significantly differentiated SNPs, several of which are located in or within 20 kb of candidate genes for cold adaptation. These genes involve cellular and metabolic processes and several of them have been shown to be differentially expressed under diapausing conditions. The three biogeographical regions also differ in wing size and shape, and wing size increases with latitude, supporting Bergmann’s rule. Adaptive genetic and morphometric variation observed along the climatic gradient of A. albopictus native range suggests that colonization of northern latitudes promoted adaptation to cold environments prior to its worldwide invasion.
The model plant species Arabidopsis thaliana is successful at colonizing land that has recently undergone human-mediated disturbance. To investigate the prehistoric spread of A. thaliana, we applied approximate Bayesian computation and explicit spatial modeling to 76 European accessions sequenced at 876 nuclear loci. We find evidence that a major migration wave occurred from east to west, affecting most of the sampled individuals. The longitudinal gradient appears to result from the plant having spread in Europe from the east approximately 10,000 years ago, with a rate of westward spread of approximately 0.9 km/year. This wave-of-advance model is consistent with a natural colonization from an eastern glacial refugium that overwhelmed ancient western lineages. However, the speed and time frame of the model also suggest that the migration of A. thaliana into Europe may have accompanied the spread of agriculture during the Neolithic transition.
There is a considerable impetus in population genomics to pinpoint loci involved in local adaptation. A powerful approach to find genomic regions subject to local adaptation is to genotype numerous molecular markers and look for outlier loci. One of the most common approaches for selection scans is based on statistics that measure population differentiation such as FST. However, there are important caveats with approaches related to FST because they require grouping individuals into populations and they additionally assume a particular model of population structure. Here, we implement a more flexible individual-based approach based on Bayesian factor models. Factor models capture population structure with latent variables called factors, which can describe clustering of individuals into populations or isolation-by-distance patterns. Using hierarchical Bayesian modeling, we both infer population structure and identify outlier loci that are candidates for local adaptation. In order to identify outlier loci, the hierarchical factor model searches for loci that are atypically related to population structure as measured by the latent factors. In a model of population divergence, we show that it can achieve a 2-fold or more reduction of false discovery rate compared with the software BayeScan or with an FST approach. We show that our software can handle large data sets by analyzing the single nucleotide polymorphisms of the Human Genome Diversity Project. The Bayesian factor model is implemented in the open-source PCAdapt software.
ABSTRACT
Motivation
Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to them. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.
Results
For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.
Availability and implementation
R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.
Supplementary information
Supplementary data are available at Bioinformatics online.
Approximate Bayesian inference on the basis of summary statistics is well-suited to complex problems for which the likelihood is either mathematically or computationally intractable. However, the methods that use rejection suffer from the curse of dimensionality when the number of summary statistics is increased. Here we propose a machine-learning approach to the estimation of the posterior density by introducing two innovations. The new method fits a nonlinear conditional heteroscedastic regression of the parameter on the summary statistics, and then adaptively improves estimation using importance sampling. The new algorithm is compared to the state-of-the-art approximate Bayesian methods, and achieves considerable reduction of the computational burden in two examples of inference in statistical genetics and in a queueing model.
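For orientation, the baseline rejection scheme that this abstract contrasts itself with can be sketched on a toy problem. The example below is plain rejection sampling only, not the regression and importance-sampling method proposed above. The Gaussian model, the uniform prior, the observed value, and the acceptance count are all assumptions chosen for illustration.

```python
import random

rng = random.Random(1)

n_obs = 20              # size of each simulated data set
observed_summary = 2.0  # hypothetical observed sample mean
n_sim = 20000           # number of prior draws
n_accept = 200          # keep the 1% of simulations closest to the observation

def simulate_summary(theta):
    """Summary statistic: sample mean of n_obs draws from Normal(theta, 1)."""
    return sum(rng.gauss(theta, 1.0) for _ in range(n_obs)) / n_obs

simulations = []
for _ in range(n_sim):
    theta = rng.uniform(-10.0, 10.0)  # draw the parameter from a flat prior
    distance = abs(simulate_summary(theta) - observed_summary)
    simulations.append((distance, theta))

# Rejection step: retain the parameters whose simulated summaries fall
# closest to the observed summary.
simulations.sort()
accepted = [theta for _, theta in simulations[:n_accept]]
posterior_mean = sum(accepted) / len(accepted)
print("approximate posterior mean:", round(posterior_mean, 2))
```

With a single summary statistic, a 1% acceptance rate still yields a tight neighbourhood around the observation. With many summaries the same rate would accept simulations that are far from the data, which is the dimensionality problem the abstract refers to.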