We present a genotype imputation method that scales to millions of reference samples. The imputation method, based on the Li and Stephens model and implemented in Beagle v.4.1, is parallelized and ...memory efficient, making it well suited to multi-core computer processors. It achieves fast, accurate, and memory-efficient genotype imputation by restricting the probability model to markers that are genotyped in the target samples and by performing linear interpolation to impute ungenotyped variants. We compare Beagle v.4.1 with Impute2 and Minimac3 by using 1000 Genomes Project data, UK10K Project data, and simulated data. All three methods have similar accuracy but different memory requirements and different computation times. When imputing 10 Mb of sequence data from 50,000 reference samples, Beagle’s throughput was more than 100× greater than Impute2’s throughput on our computer servers. When imputing 10 Mb of sequence data from 200,000 reference samples in VCF format, Minimac3 consumed 26× more memory per computational thread and 15× more CPU time than Beagle. We demonstrate that Beagle v.4.1 scales to much larger reference panels by performing imputation from a simulated reference panel having 5 million samples and a mean marker density of one marker per four base pairs.
Existing methods for estimating historical effective population size from genetic data have been unable to accurately estimate effective population size during the most recent past. We present a ...non-parametric method for accurately estimating recent effective population size by using inferred long segments of identity by descent (IBD). We found that inferred segments of IBD contain information about effective population size from around 4 generations to around 50 generations ago for SNP array data and to over 200 generations ago for sequence data. In human populations that we examined, the estimates of effective size were approximately one-third of the census size. We estimate the effective population size of European-ancestry individuals in the UK four generations ago to be eight million and the effective population size of Finland four generations ago to be 0.7 million. Our method is implemented in the open-source IBDNe software package.
Segments of indentity-by-descent (IBD) detected from high-density genetic data are useful for many applications, including long-range phase determination, phasing family data, imputation, IBD ...mapping, and heritability analysis in founder populations. We present Refined IBD, a new method for IBD segment detection. Refined IBD achieves both computational efficiency and highly accurate IBD segment reporting by searching for IBD in two steps. The first step (identification) uses the GERMLINE algorithm to find shared haplotypes exceeding a length threshold. The second step (refinement) evaluates candidate segments with a probabilistic approach to assess the evidence for IBD. Like GERMLINE, Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and allows for haplotype-based downstream analyses. To investigate the properties of Refined IBD, we simulate SNP data from a model with recent superexponential population growth that is designed to match United Kingdom data. The simulation results show that Refined IBD achieves a better power/accuracy profile than fastIBD or GERMLINE. We find that a single run of Refined IBD achieves greater power than 10 runs of fastIBD. We also apply Refined IBD to SNP data for samples from the United Kingdom and from Northern Finland and describe the IBD sharing in these data sets. Refined IBD is powerful, highly accurate, and easy to use and is implemented in Beagle version 4.
The American College of Medical Genetics and Genomics (ACMG) and Association of Molecular Pathology (AMP) recently published important new guidelines aiming to improve and standardize the ...pathogenicity classification of genomic variants. The Clinical Sequencing Exploratory Research (CSER) consortium evaluated the use of these guidelines across nine laboratories. One identified obstacle to consistent usage of the ACMG-AMP guidelines is the lack of a definition of cosegregation as criteria for pathogenicity classification. Cosegregation data differ from many other types of pathogenicity data in being quantitative. However, the ACMG-AMP guidelines do not define quantitative criteria for use of these data. Here, such quantitative criteria, in an easily implementable form, are proposed.
Whole-genome association studies present many new statistical and computational challenges due to the large quantity of data obtained. One of these challenges is haplotype inference; methods for ...haplotype inference designed for small data sets from candidate-gene studies do not scale well to the large number of individuals genotyped in whole-genome association studies. We present a new method and software for inference of haplotype phase and missing data that can accurately phase data from whole-genome association studies, and we present the first comparison of haplotype-inference methods for real and simulated data sets with thousands of genotyped individuals. We find that our method outperforms existing methods in terms of both speed and accuracy for large data sets with thousands of individuals and densely spaced genetic markers, and we use our method to phase a real data set of 3,002 individuals genotyped for 490,032 markers in 3.1 days of computing time, with 99% of masked alleles imputed correctly. Our method is implemented in the Beagle software package, which is freely available.
We present methods for imputing data for ungenotyped markers and for inferring haplotype phase in large data sets of unrelated individuals and parent-offspring trios. Our methods make use of known ...haplotype phase when it is available, and our methods are computationally efficient so that the full information in large reference panels with thousands of individuals is utilized. We demonstrate that substantial gains in imputation accuracy accrue with increasingly large reference panel sizes, particularly when imputing low-frequency variants, and that unphased reference panels can provide highly accurate genotype imputation. We place our methodology in a unified framework that enables the simultaneous use of unphased and phased data from trios and unrelated individuals in a single analysis. For unrelated individuals, our imputation methods produce well-calibrated posterior genotype probabilities and highly accurate allele-frequency estimates. For trios, our haplotype-inference method is four orders of magnitude faster than the gold-standard PHASE program and has excellent accuracy. Our methods enable genotype imputation to be performed with unphased trio or unrelated reference panels, thus accounting for haplotype-phase uncertainty in the reference panel. We present a useful measure of imputation accuracy, allelic R2, and show that this measure can be estimated accurately from posterior genotype probabilities. Our methods are implemented in version 3.0 of the BEAGLE software package.
Existing methods for identity by descent (IBD) segment detection were designed for SNP array data, not sequence data. Sequence data have a much higher density of genetic variants and a different ...allele frequency distribution, and can have higher genotype error rates. Consequently, best practices for IBD detection in SNP array data do not necessarily carry over to sequence data. We present a method, IBDseq, for detecting IBD segments in sequence data and a method, SEQERR, for estimating genotype error rates at low-frequency variants by using detected IBD. The IBDseq method estimates probabilities of genotypes observed with error for each pair of individuals under IBD and non-IBD models. The ratio of estimated probabilities under the two models gives a LOD score for IBD. We evaluate several IBD detection methods that are fast enough for application to sequence data (IBDseq, Beagle Refined IBD, PLINK, and GERMLINE) under multiple parameter settings, and we show that IBDseq achieves high power and accuracy for IBD detection in sequence data. The SEQERR method estimates genotype error rates by comparing observed and expected rates of pairs of homozygote and heterozygote genotypes at low-frequency variants in IBD segments. We demonstrate the accuracy of SEQERR in simulated data, and we apply the method to estimate genotype error rates in sequence data from the UK10K and 1000 Genomes projects.
Anatomically modern humans interbred with Neanderthals and with a related archaic population known as Denisovans. Genomes of several Neanderthals and one Denisovan have been sequenced, and these ...reference genomes have been used to detect introgressed genetic material in present-day human genomes. Segments of introgression also can be detected without use of reference genomes, and doing so can be advantageous for finding introgressed segments that are less closely related to the sequenced archaic genomes. We apply a new reference-free method for detecting archaic introgression to 5,639 whole-genome sequences from Eurasia and Oceania. We find Denisovan ancestry in populations from East and South Asia and Papuans. Denisovan ancestry comprises two components with differing similarity to the sequenced Altai Denisovan individual. This indicates that at least two distinct instances of Denisovan admixture into modern humans occurred, involving Denisovan populations that had different levels of relatedness to the sequenced Altai Denisovan.
Display omitted
Display omitted
•Asian genomes carry introgressed DNA from Denisovans and Neanderthals•East Asians show evidence of introgression from two distinct Denisovan populations•South Asians and Oceanians carry introgression from one Denisovan population
Two waves of Denisovan ancestry have shaped present-day humans.
We present a method, fastIBD, for finding tracts of identity by descent (IBD) between pairs of individuals. FastIBD can be applied to thousands of samples across genome-wide SNP data and is ...significantly more powerful for finding short tracts of IBD than existing methods for finding IBD tracts in such data. We show that fastIBD can detect facets of population structure that are not revealed by other methods. In the Wellcome Trust Case Control Consortium bipolar disorder case-control data, we find a genome-wide excess of IBD in case-case pairs of individuals compared to control-control pairs. We show that this excess can be explained by the geographical clustering of cases. We also show that it is possible to use fastIBD to generate highly accurate estimates of genome-wide IBD sharing between pairs of distant relatives. This is useful for estimation of relationship and for adjusting for relatedness in association studies. FastIBD is incorporated in the freely available Beagle software package.
Genotype Imputation from Large Reference Panels Das, Sayantan; Abecasis, Gonçalo R; Browning, Brian L
Annual review of genomics and human genetics,
08/2018, Letnik:
19, Številka:
1
Journal Article
Recenzirano
Genotype imputation has become a standard tool in genome-wide association studies because it enables researchers to inexpensively approximate whole-genome sequence data from genome-wide ...single-nucleotide polymorphism array data. Genotype imputation increases statistical power, facilitates fine mapping of causal variants, and plays a key role in meta-analyses of genome-wide association studies. Only variants that were previously observed in a reference panel of sequenced individuals can be imputed. However, the rapid increase in the number of deeply sequenced individuals will soon make it possible to assemble enormous reference panels that greatly increase the number of imputable variants. In this review, we present an overview of genotype imputation and describe the computational techniques that make it possible to impute genotypes from reference panels with millions of individuals.