The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. In contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved in terms of both statistical approach and software implementation. We present results obtained with the robust Mahalanobis distance, a new statistic for genome scans available in versions 2.0 and later of the package. When hierarchical population structure occurs, the Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other computer programs for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is close to the nominal false discovery rate set at 10%, with the exception of BayeScan, which generates 40% false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals, whereas pcadapt is not. Lastly, we find that pcadapt and hapflk are the most powerful in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.
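As an illustration of the workflow described above, here is a minimal sketch using the pcadapt package on a simulated genotype matrix. The simulated data, the choice of K, the FDR level, and the assumption that type = "lfmm" expects individuals in rows and SNPs in columns are illustrative only and should be checked against the package documentation.

```r
# Minimal pcadapt genome-scan sketch on simulated genotypes (illustrative only).
library(pcadapt)

set.seed(1)
n_ind <- 100   # individuals
n_snp <- 500   # markers
# Random 0/1/2 genotypes; a real analysis would read a bed/vcf/lfmm file instead.
geno <- matrix(rbinom(n_ind * n_snp, size = 2, prob = 0.3), nrow = n_ind)

# Assumption: type = "lfmm" means individuals in rows and SNPs in columns.
x <- read.pcadapt(geno, type = "lfmm")

# Genome scan based on K principal components; the test statistic is the
# robust Mahalanobis distance described in the abstract.
scan <- pcadapt(x, K = 5)

# Convert p-values to adjusted p-values to control the FDR at 10%.
qvals <- p.adjust(scan$pvalues, method = "BH")
outliers <- which(qvals < 0.10)
```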
Approximate Bayesian Computation (ABC) in practice. Csilléry, Katalin; Blum, Michael G.B.; Gaggiotti, Oscar E. Trends in Ecology & Evolution, 07/2010, Volume 25, Issue 7. Journal article, peer-reviewed.
Understanding the forces that influence natural variation within and among populations has been a major objective of evolutionary biologists for decades. Motivated by the growth in computational power and data complexity, modern approaches to this question make intensive use of simulation methods. Approximate Bayesian Computation (ABC) is one of these methods. Here we review the foundations of ABC, its recent algorithmic developments, and its applications in evolutionary biology and ecology. We argue that the use of ABC should incorporate all aspects of Bayesian data analysis: formulation, fitting, and improvement of a model. ABC can be a powerful tool to make inferences with complex models if these principles are carefully applied.
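To make the basic ABC idea concrete, here is a toy rejection-sampling sketch in base R. The binomial example, prior, and tolerance are invented for illustration and are not taken from the paper.

```r
# Toy ABC rejection sampler: infer a binomial success probability theta
# from an observed count, using the count itself as the summary statistic.
set.seed(42)
n_trials <- 100
obs_successes <- 37            # observed summary statistic

n_sim <- 1e5
theta <- runif(n_sim)          # candidate parameters drawn from a Uniform(0, 1) prior
sim_successes <- rbinom(n_sim, size = n_trials, prob = theta)  # simulated data

# Rejection step: keep parameters whose simulated summary is close to the observed one.
tolerance <- 2
accepted <- theta[abs(sim_successes - obs_successes) <= tolerance]

# The accepted values approximate the posterior distribution of theta.
mean(accepted)
quantile(accepted, c(0.025, 0.975))
```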
The history of click-speaking Khoe-San, and African populations in general, remains poorly understood. We genotyped ~2.3 million single-nucleotide polymorphisms in 220 southern Africans and found that the Khoe-San diverged from other populations ≥100,000 years ago, but population structure within the Khoe-San dated back to about 35,000 years ago. Genetic variation in various sub-Saharan populations did not localize the origin of modern humans to a single geographic region within Africa; instead, it indicated a history of admixture and stratification. We found evidence of adaptation targeting muscle function and immune response; potential adaptive introgression of protection from ultraviolet light; and selection predating modern human diversification, involving skeletal and neurological development. These new findings illustrate the importance of African genomic diversity in understanding human evolutionary history.
Polygenic Risk Scores (PRS) combine genotype information across many single-nucleotide polymorphisms (SNPs) to give a score reflecting the genetic risk of developing a disease. PRS might have a major impact on public health, possibly allowing for screening campaigns to identify high-genetic-risk individuals for a given disease. The "Clumping+Thresholding" (C+T) approach is the most common method to derive PRS. C+T uses only univariate genome-wide association studies (GWAS) summary statistics, which makes it fast and easy to use. However, previous work showed that jointly estimating SNP effects for computing PRS has the potential to significantly improve the predictive performance of PRS as compared to C+T. In this paper, we present an efficient method for the joint estimation of SNP effects using individual-level data, allowing for practical application of penalized logistic regression (PLR) on modern datasets including hundreds of thousands of individuals. Moreover, our implementation of PLR directly includes automatic choices for hyper-parameters. We also provide an implementation of penalized linear regression for quantitative traits. We compare the performance of PLR, C+T and a derivation of random forests using both real and simulated data. Overall, we find that PLR achieves equal or higher predictive performance than C+T in most scenarios considered, while being scalable to biobank data. In particular, we find that improvement in predictive performance is more pronounced when there are few effects located in nearby genomic regions with correlated SNPs; for instance, in simulations, AUC values increase from 83% with the best prediction of C+T to 92.5% with PLR. We confirm these results in a data analysis of a case-control study for celiac disease, where PLR and the standard C+T method achieve AUC values of 89% and 82.5%, respectively. Applying penalized linear regression to 350,000 individuals of the UK Biobank, we predict height with a larger correlation than with the best prediction of C+T (∼65% instead of ∼55%), further demonstrating its scalability and strong predictive power, even for highly polygenic traits. Moreover, using 150,000 individuals of the UK Biobank, we are able to predict breast cancer better than C+T, fitting PLR in only a few minutes. In conclusion, this paper demonstrates the feasibility and relevance of using penalized regression for PRS computation when large individual-level datasets are available, thanks to the efficient implementation available in our R package bigstatsr.
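The paper's own implementation lives in the bigstatsr package; the sketch below instead uses glmnet on a small in-memory simulated dataset, purely to illustrate the idea of deriving a polygenic score from penalized logistic regression with jointly estimated SNP effects. The data, effect sizes, and penalty settings are invented for illustration.

```r
# Illustrative penalized logistic regression on simulated SNP data.
# Conceptual stand-in using glmnet, not the bigstatsr implementation
# described in the abstract.
library(glmnet)

set.seed(1)
n <- 1000; p <- 2000
G <- matrix(rbinom(n * p, 2, 0.3), nrow = n)          # genotypes coded 0/1/2
beta <- c(rnorm(20, sd = 0.5), rep(0, p - 20))        # 20 causal SNPs, rest null
lp <- G %*% beta
y <- rbinom(n, 1, plogis(lp - mean(lp)))              # simulated case-control status

train <- sample(n, 800)
# Elastic-net-penalized logistic regression with cross-validated penalty strength.
fit <- cv.glmnet(G[train, ], y[train], family = "binomial", alpha = 0.5)

# Polygenic score for held-out individuals = predicted log-odds.
prs <- predict(fit, G[-train, ], s = "lambda.min")
```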
Approximate Bayesian Computation is a family of likelihood-free inference techniques that are well suited to models defined in terms of a stochastic generating mechanism. In a nutshell, Approximate Bayesian Computation proceeds by computing summary statistics s_obs from the data and simulating summary statistics for different values of the parameter Θ. The posterior distribution is then approximated by an estimator of the conditional density g(Θ | s_obs). In this paper, we derive the asymptotic bias and variance of the standard estimators of the posterior distribution, which are based on rejection sampling and linear adjustment. Additionally, we introduce an original estimator of the posterior distribution based on quadratic adjustment and show that its bias contains fewer terms than that of the estimator with linear adjustment. Although we find that the estimators with adjustment are not universally superior to the estimator based on rejection sampling, they can achieve better performance when there is a nearly homoscedastic relationship between the summary statistics and the parameter of interest. To make this relationship as homoscedastic as possible, we propose to use transformations of the summary statistics. In different examples borrowed from the population genetics and epidemiological literature, we show the potential of the methods with adjustment and of the transformations of the summary statistics. Supplemental materials containing the details of the proofs are available online.
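For concreteness, the three estimators discussed above can be written as follows. This is a sketch in the abstract's notation; the kernel weights W_i and the regression symbols α, β, γ are my own shorthand rather than the paper's exact notation.

```latex
% Simulated pairs (Theta_i, s_i) close to s_obs receive kernel weights
% W_i = K_delta(||s_i - s_obs||); the weighted accepted Theta_i give the
% rejection estimator of g(Theta | s_obs).
% Linear adjustment: fit Theta_i ~ alpha + beta^T (s_i - s_obs) by weighted
% least squares, then correct each accepted draw:
\[
  \Theta_i^{*} = \Theta_i - \hat{\beta}^{\top} \left( s_i - s_{\mathrm{obs}} \right)
\]
% Quadratic adjustment adds a second-order term to the regression and to the correction:
\[
  \Theta_i^{**} = \Theta_i - \hat{\beta}^{\top} \left( s_i - s_{\mathrm{obs}} \right)
  - \tfrac{1}{2} \left( s_i - s_{\mathrm{obs}} \right)^{\top} \hat{\gamma} \left( s_i - s_{\mathrm{obs}} \right)
\]
```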
Abstract
Motivation
Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge severely slowing down genomic analyses, leading to some software becoming obsolete and researchers having limited access to diverse analysis tools.
Results
Here we present two R packages, bigstatsr and bigsnpr, allowing the analysis of large-scale genomic data to be performed within R. To address large data size, the packages use memory-mapping for accessing data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementations of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove single nucleotide polymorphisms in linkage disequilibrium and algorithms to learn polygenic risk scores on millions of single nucleotide polymorphisms. We illustrate applications of the two R packages by analyzing a case-control genomic dataset for celiac disease, performing an association study and computing polygenic risk scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500 000 individuals and 1 million markers on a single desktop computer. A minimal usage sketch follows this abstract.
Availability and implementation
https://privefl.github.io/bigstatsr/ and https://privefl.github.io/bigsnpr/.
Supplementary information
Supplementary data are available at Bioinformatics online.
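As an illustration of the memory-mapped workflow the abstract above describes, here is a minimal bigstatsr sketch on a small simulated file-backed matrix. The function names and arguments (FBM, big_scale, big_randomSVD, big_univLogReg) are written from memory and should be checked against the package documentation; the data and dimensions are illustrative only.

```r
# Sketch of a bigstatsr workflow on a simulated file-backed matrix.
library(bigstatsr)

set.seed(1)
n <- 500; p <- 5000
# File-backed matrix: data are memory-mapped on disk rather than held in RAM.
X <- FBM(n, p, init = rbinom(n * p, 2, 0.3))
y <- rbinom(n, 1, 0.5)                      # toy binary phenotype

# Randomized partial SVD, i.e. the PCA used to capture population structure.
svd <- big_randomSVD(X, fun.scaling = big_scale(), k = 10)

# Univariate logistic regression of the phenotype on each column (a GWAS),
# adjusting for the principal component scores as covariates.
gwas <- big_univLogReg(X, y01.train = y, covar.train = svd$u)
```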
Ordination is a common tool in ecology that aims at representing complex biological information in a reduced space. In landscape genetics, ordination methods such as principal component analysis (PCA) have been used to detect adaptive variation based on genomic data. Taking advantage of environmental data in addition to genotype data, redundancy analysis (RDA) is another ordination approach that is useful to detect adaptive variation. This study proposes a test statistic based on RDA to search for loci under selection. We compare redundancy analysis to pcadapt, which is a nonconstrained ordination method, and to a latent factor mixed model (LFMM), which is a univariate genotype-environment association method. Individual-based simulations identify evolutionary scenarios where RDA genome scans have greater statistical power than genome scans based on PCA. By constraining the analysis with environmental variables, RDA performs better than PCA in identifying adaptive variation when selection gradients are weakly correlated with population structure. In addition, we show that while RDA and LFMM have similar power to identify genetic markers associated with environmental variables, the RDA-based procedure has the advantage of identifying the main selective gradients as a combination of environmental variables. To give a concrete illustration of RDA in population genomics, we apply this method to the detection of outliers and selective gradients on an SNP data set of Populus trichocarpa (Geraldes et al.). The RDA-based approach identifies the main selective gradient contrasting southern and coastal populations to northern and continental populations on the north-western American coast.
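Below is a conceptual sketch of an RDA-based genome scan using the vegan package. The simulated genotypes and environmental variables are invented, and the outlier statistic (a Mahalanobis distance on locus loadings converted to chi-square p-values) is a simplified stand-in for the idea described in the abstract, not the paper's exact procedure.

```r
# Conceptual RDA genome-scan sketch with vegan (illustrative only).
library(vegan)

set.seed(1)
n_ind <- 120; n_snp <- 1000
geno <- matrix(rbinom(n_ind * n_snp, 2, 0.3), nrow = n_ind)
env  <- data.frame(temperature = rnorm(n_ind), precipitation = rnorm(n_ind))

# Redundancy analysis: genotypes constrained by environmental predictors.
fit <- rda(geno ~ temperature + precipitation, data = env)

# Locus loadings on the two constrained axes.
load <- scores(fit, choices = 1:2, display = "species")

# Mahalanobis distance of the loadings, turned into chi-square p-values.
d2 <- mahalanobis(load, center = colMeans(load), cov = cov(load))
pval <- pchisq(d2, df = 2, lower.tail = FALSE)
outliers <- which(p.adjust(pval, "BH") < 0.1)
```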
Polygenic prediction has the potential to contribute to precision medicine. Clumping and thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, several p-value thresholds are tested to maximize the predictive ability of the derived polygenic scores. Along with this p-value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123K different C+T scores for 300K individuals and 1M variants using 16 physical cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p-value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: 0.544–0.569) when tuning only the p-value threshold to an AUC of 0.592 (95% CI: 0.580–0.604) when tuning all four hyper-parameters we propose for C+T. We further propose stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to eight different case-control diseases in the UK Biobank data and find that SCT substantially improves prediction accuracy, with an average AUC increase of 0.035 over standard C+T.
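The following toy sketch illustrates how a single C+T score is computed from GWAS summary statistics: clump correlated SNPs, keep those passing a p-value threshold, and sum the effect-weighted genotypes. The greedy window-based clumping, the simulated genotypes, and the summary statistics below are deliberately naive stand-ins; in practice LD-aware clumping tools are used, and the paper's method tunes a grid of such scores.

```r
# Toy Clumping + Thresholding (C+T) score from summary statistics (illustrative only).
set.seed(1)
n <- 500; p <- 800
G <- matrix(rbinom(n * p, 2, 0.3), nrow = n)   # genotypes of the target sample
beta_hat <- rnorm(p, sd = 0.05)                # toy GWAS effect sizes
pval     <- runif(p)                           # toy GWAS p-values

# Naive clumping: visit SNPs by increasing p-value, keep one and drop neighbours
# (within a 50-SNP window) correlated with it above r2 = 0.2.
keep <- rep(TRUE, p)
for (j in order(pval)) {
  if (!keep[j]) next
  window <- setdiff(max(1, j - 50):min(p, j + 50), j)
  r2 <- as.vector(cor(G[, j], G[, window]))^2
  keep[window[r2 > 0.2 & keep[window]]] <- FALSE
}

# Thresholding: keep clumped SNPs passing the p-value threshold, then score.
thr <- 0.05
use <- keep & (pval < thr)
prs <- G[, use, drop = FALSE] %*% beta_hat[use]
```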
The emergence of farming during the Neolithic transition, including the domestication of livestock, was a critical point in the evolution of humankind. The goat (Capra hircus) was one of the first domesticated ungulates. In this study, we compared the genetic diversity of domestic goats to that of the modern representatives of their wild ancestor, the bezoar, by analyzing 473 samples collected over the whole distribution range of the latter species. This partly confirms and significantly clarifies the goat domestication scenario already proposed by archaeological evidence. All of the mitochondrial DNA haplogroups found in current domestic goats have also been found in the bezoar. The geographic distribution of these haplogroups in the wild ancestor allowed the localization of the main domestication centers. We found no haplotype that could have been domesticated in the eastern half of the Iranian Plateau, nor further to the east. A signature of population expansion in bezoars of the C haplogroup suggests an early domestication center on the Central Iranian Plateau (Yazd and Kerman Provinces) and in the Southern Zagros (Fars Province), possibly corresponding to the management of wild flocks. However, the contribution of this center to the current domestic goat population is rather low (1.4%). We also found a second domestication center covering a large area in Eastern Anatolia, and possibly in Northern and Central Zagros. This last domestication center is the likely origin of almost all domestic goats today. This finding is consistent with archaeological data identifying Eastern Anatolia as an important domestication center.
Approximate Bayesian computation (ABC) methods make use of comparisons between simulated and observed summary statistics to overcome the problem of computationally intractable likelihood functions. As the practical implementation of ABC requires computations based on vectors of summary statistics, rather than full data sets, a central question is how to derive low-dimensional summary statistics from the observed data with minimal loss of information. In this article we provide a comprehensive review and comparison of the performance of the principal methods of dimension reduction proposed in the ABC literature. The methods are split into three non-mutually exclusive classes consisting of best subset selection methods, projection techniques and regularization. In addition, we introduce two new methods of dimension reduction. The first is a best subset selection method based on Akaike and Bayesian information criteria, and the second uses ridge regression as a regularization procedure. We illustrate the performance of these dimension reduction techniques through the analysis of three challenging models and data sets.
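To illustrate the regularization idea mentioned above, here is a small sketch in which ridge regression of the parameter on a reference table of summary statistics defines a one-dimensional projected summary statistic that could replace the original vector in an ABC rejection step. The simulated model, the use of MASS::lm.ridge, and the fixed ridge penalty are illustrative assumptions, not the paper's exact procedure.

```r
# Ridge-regression-based dimension reduction of ABC summary statistics (sketch).
library(MASS)

set.seed(1)
n_sim <- 5000
theta <- runif(n_sim, 0, 10)                   # parameter drawn from the prior
# Ten noisy summary statistics, only the first three being informative about theta.
S <- sapply(1:10, function(k) theta * (k <= 3) / k + rnorm(n_sim))
colnames(S) <- paste0("s", 1:10)

# Ridge regression of the parameter on the summary statistics.
df  <- data.frame(theta = theta, S)
fit <- lm.ridge(theta ~ ., data = df, lambda = 1)

# The fitted linear combination defines a one-dimensional projected summary,
# usable in place of the ten original statistics during ABC rejection.
b <- drop(coef(fit))                 # intercept and coefficients on the original scale
proj <- b[1] + S %*% b[-1]
```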