Abstract
Summary
We present FATHMM-XF, a method for predicting pathogenic point mutations in the human genome. Drawing on an extensive feature set, FATHMM-XF outperforms competitors on benchmark tests, particularly in non-coding regions where the majority of pathogenic mutations are likely to be found.
Availability and implementation
The FATHMM-XF web server is available at http://fathmm.biocompute.org.uk/fathmm-xf/, and as tracks on the Genome Tolerance Browser: http://gtb.biocompute.org.uk. Predictions are provided for human genome version GRCh37/hg19. The data used for this project can be downloaded from: http://fathmm.biocompute.org.uk/fathmm-xf/
Supplementary information
Supplementary data are available at Bioinformatics online.
Technological advances have enabled the identification of an increasingly large spectrum of single nucleotide variants within the human genome, many of which may be associated with monogenic disease or complex traits. Here, we propose an integrative approach, named FATHMM-MKL, to predict the functional consequences of both coding and non-coding sequence variants. Our method utilizes various genomic annotations, which have recently become available, and learns to weight the significance of each component annotation source.
We show that our method outperforms current state-of-the-art algorithms, CADD and GWAVA, when predicting the functional consequences of non-coding variants. In addition, FATHMM-MKL is comparable to the best of these algorithms when predicting the impact of coding variants. The method includes a confidence measure to rank order predictions.
Results from genome-wide association studies (GWAS) can be used to infer causal relationships between phenotypes, using a strategy known as 2-sample Mendelian randomization (2SMR) and bypassing the need for individual-level data. However, 2SMR methods are evolving rapidly and GWAS results are often insufficiently curated, undermining efficient implementation of the approach. We therefore developed MR-Base (http://www.mrbase.org): a platform that integrates a curated database of complete GWAS results (no restrictions according to statistical significance) with an application programming interface, web app and R packages that automate 2SMR. The software includes several sensitivity analyses for assessing the impact of horizontal pleiotropy and other violations of assumptions. The database currently comprises 11 billion single nucleotide polymorphism-trait associations from 1673 GWAS and is updated on a regular basis. Integrating data with software ensures more rigorous application of hypothesis-driven analyses and allows millions of potential causal relationships to be efficiently evaluated in phenome-wide association studies.
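As an illustration of the kind of calculation that 2SMR automates, the following is a minimal sketch of a fixed-effect inverse-variance weighted (IVW) estimate computed from summary statistics alone. The function name `ivw_estimate` and the input numbers are hypothetical and are not taken from MR-Base or its R packages; IVW is shown here only as one commonly used 2SMR estimator, and the abstract does not state which estimators the platform implements.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-SNP summary statistics.

    beta_exposure: SNP-exposure effect sizes (one per instrument)
    beta_outcome:  SNP-outcome effect sizes for the same SNPs
    se_outcome:    standard errors of the SNP-outcome effects
    Returns (estimate, standard_error).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    # Weighted regression of outcome effects on exposure effects through
    # the origin, with weights 1 / se_outcome^2.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("exposure effects must not all be zero")
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for three instruments (illustrative only).
estimate, std_error = ivw_estimate(
    beta_exposure=[0.1, 0.2, 0.4],
    beta_outcome=[0.05, 0.1, 0.2],
    se_outcome=[0.01, 0.02, 0.02],
)
print(f"IVW estimate: {estimate:.3f} (SE {std_error:.3f})")
```

Because only per-SNP effect sizes and standard errors are needed, an estimate of this kind can be computed without individual-level data, which is the property the abstract highlights.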
The influence of genetic variation on complex diseases is potentially mediated through a range of highly dynamic epigenetic processes exhibiting temporal variation during development and later life. Here we present a catalogue of the genetic influences on DNA methylation (methylation quantitative trait loci (mQTL)) at five different life stages in human blood: children at birth, childhood, adolescence and their mothers during pregnancy and middle age.
We show that genetic effects on methylation are highly stable across the life course and that developmental change in the genetic contribution to variation in methylation occurs primarily through increases in environmental or stochastic effects. Though we map a large proportion of the cis-acting genetic variation, a much larger component of the genetic effects influencing methylation acts in trans. However, only 7% of discovered mQTL are trans-effects, suggesting that the trans component is highly polygenic. Finally, we estimate the contribution of mQTL to variation in complex traits and infer that methylation may have a causal role consistent with an infinitesimal model in which many methylation sites each have a small influence, amounting to a large overall contribution.
DNA methylation contains a significant heritable component that remains consistent across the lifespan. Our results suggest that the genetic component of methylation may have a causal role in complex traits. The database of mQTL presented here provides a rich resource for those interested in investigating the role of methylation in disease.
ABSTRACT
The rate at which nonsynonymous single nucleotide polymorphisms (nsSNPs) are being identified in the human genome is increasing dramatically owing to advances in whole‐genome/whole‐exome sequencing technologies. Automated methods capable of accurately and reliably distinguishing between pathogenic and functionally neutral nsSNPs are therefore assuming ever‐increasing importance. Here, we describe the Functional Analysis Through Hidden Markov Models (FATHMM) software and server: a species‐independent method with optional species‐specific weightings for the prediction of the functional effects of protein missense variants. Using a model weighted for human mutations, we obtained performance accuracies that outperformed traditional prediction methods (i.e., SIFT, PolyPhen, and PANTHER) on two separate benchmarks. Furthermore, in one benchmark, we achieve performance accuracies that outperform current state‐of‐the‐art prediction methods (i.e., SNPs&GO and MutPred). We demonstrate that FATHMM can be efficiently applied to high‐throughput/large‐scale human and nonhuman genome sequencing projects with the added benefit of phenotypic outcome associations. To illustrate this, we evaluated nsSNPs in wheat (Triticum spp.) to identify some of the important genetic variants responsible for the phenotypic differences introduced by intense selection during domestication. A Web‐based implementation of FATHMM, including a high‐throughput batch facility and a downloadable standalone package, is available at http://fathmm.biocompute.org.uk.
Evidence suggests that in utero exposure to undernutrition and overnutrition might affect adiposity in later life. Epigenetic modification is suggested as a plausible mediating mechanism.
We used multivariable linear regression and a negative control design to examine offspring epigenome-wide DNA methylation in relation to maternal and offspring adiposity in 1018 participants.
Compared with neonatal offspring of normal weight mothers, 28 and 1621 CpG sites were differentially methylated in offspring of obese and underweight mothers, respectively (false discovery rate (FDR)-corrected P-value < 0.05), with no overlap between the sites related to maternal obesity and those related to maternal underweight. A positive association, where higher methylation is associated with a body mass index (BMI) outside the normal range, was seen at 78.6% of the sites associated with obesity and 87.9% of the sites associated with underweight. Associations of maternal obesity with offspring methylation were stronger than associations of paternal obesity, supporting an intrauterine mechanism. There were no consistent associations of gestational weight gain with offspring DNA methylation. In general, sites that were hypermethylated in association with maternal obesity or hypomethylated in association with maternal underweight tended to be positively associated with offspring adiposity, and sites hypomethylated in association with maternal obesity or hypermethylated in association with maternal underweight tended to be inversely associated with offspring adiposity.
Our data suggest that both maternal obesity and, to a larger degree, underweight affect the neonatal epigenome via an intrauterine mechanism, but weight gain during pregnancy has little effect. We found some evidence that associations of maternal underweight with lower offspring adiposity and maternal obesity with greater offspring adiposity may be mediated via increased DNA methylation.
The number of missense mutations being identified in cancer genomes has greatly increased as a consequence of technological advances and the reduced cost of whole-genome/whole-exome sequencing methods. However, a high proportion of the amino acid substitutions detected in cancer genomes have little or no effect on tumour progression (passenger mutations). Therefore, accurate automated methods capable of discriminating between driver (cancer-promoting) and passenger mutations are becoming increasingly important. In our previous work, we developed the Functional Analysis through Hidden Markov Models (FATHMM) software and, using a model weighted for inherited disease mutations, observed improved performances over alternative computational prediction algorithms. Here, we describe an adaptation of our original algorithm that incorporates a cancer-specific model to potentiate the functional analysis of driver mutations.
The performance of our algorithm was evaluated using two separate benchmarks. In our analysis, we observed improved performances when distinguishing between driver mutations and other germ line variants (both disease-causing and putatively neutral mutations). In addition, when discriminating between somatic driver and passenger mutations, we observed performances comparable with the leading computational prediction algorithms: SPF-Cancer and TransFIC.
A web-based implementation of our cancer-specific model, including a downloadable stand-alone package, is available at http://fathmm.biocompute.org.uk.
For somatic point mutations in coding and non-coding regions of the genome, we propose CScape, an integrative classifier for predicting the likelihood that mutations are cancer drivers. Tested on somatic mutations, CScape tends to outperform alternative methods, reaching 91% balanced accuracy in coding regions and 70% in non-coding regions, while even higher accuracy may be achieved using thresholds to isolate high-confidence predictions. Positive predictions tend to cluster in genomic regions, so we apply a statistical approach to isolate coding and non-coding regions of the cancer genome that appear enriched for high-confidence predicted disease-drivers. Predictions and software are available at http://CScape.biocompute.org.uk/ .
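For readers unfamiliar with the metric, balanced accuracy is conventionally defined as the mean of sensitivity and specificity, which makes it robust to unequal numbers of positive and negative examples. The sketch below shows that standard definition; the function name `balanced_accuracy` and the toy labels are hypothetical and do not reproduce the CScape evaluation.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = driver, 0 = neutral)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # fraction of true positives recovered
    specificity = tn / (tn + fp)  # fraction of true negatives recovered
    return (sensitivity + specificity) / 2


# Toy example (illustrative labels only): 4 positives, 2 negatives.
labels = [1, 1, 1, 1, 0, 0]
predictions = [1, 1, 1, 0, 0, 1]
print(balanced_accuracy(labels, predictions))  # sensitivity 0.75, specificity 0.5
```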
The association of copy number variations (CNVs), differing numbers of copies of genetic sequence at locations in the genome, with phenotypes such as intellectual disability has been almost exclusively evaluated using clinically ascertained cohorts. The contribution of these genetic variants to cognitive phenotypes in the general population remains unclear.
To investigate the clinical features conferred by CNVs associated with known syndromes in adult carriers without clinical preselection and to assess the genome-wide consequences of rare CNVs (frequency ≤0.05%; size ≥250 kilobase pairs [kb]) on carriers' educational attainment and intellectual disability prevalence in the general population.
The population biobank of Estonia contains 52,000 participants enrolled from 2002 through 2010. General practitioners examined participants and filled out a questionnaire of health- and lifestyle-related questions, as well as reported diagnoses. Copy number variant analysis was conducted on a random sample of 7877 individuals and genotype-phenotype associations with education and disease traits were evaluated. Our results were replicated on a high-functioning group of 993 Estonians and 3 geographically distinct populations in the United Kingdom, the United States, and Italy.
Phenotypes of genomic disorders in the general population, prevalence of autosomal CNVs, and association of these variants with educational attainment (from less than primary school through scientific degree) and prevalence of intellectual disability.
Of the 7877 individuals in the Estonian cohort, we identified 56 carriers of CNVs associated with known syndromes. Their phenotypes, including cognitive and psychiatric problems, epilepsy, neuropathies, obesity, and congenital malformations, are similar to those described for carriers of identical rearrangements ascertained in clinical cohorts. A genome-wide evaluation of rare autosomal CNVs (frequency, ≤0.05%; ≥250 kb) identified 831 carriers (10.5%) of the screened general population. Eleven of 216 (5.1%) carriers of a deletion of at least 250 kb (odds ratio [OR], 3.16; 95% CI, 1.51-5.98; P = 1.5e-03) and 6 of 102 (5.9%) carriers of a duplication of at least 1 Mb (OR, 3.67; 95% CI, 1.29-8.54; P = .008) had an intellectual disability compared with 114 of 6819 (1.7%) in the Estonian cohort. The mean education attainment was 3.81 (P = 1.06e-04) among 248 (≥250 kb) deletion carriers and 3.69 (P = 5.024e-05) among 115 duplication carriers (≥1 Mb). Of the deletion carriers, 33.5% did not graduate from high school (OR, 1.48; 95% CI, 1.12-1.95; P = .005) and 39.1% of duplication carriers did not graduate from high school (OR, 1.89; 95% CI, 1.27-2.8; P = 1.6e-03). Evidence for an association between rare CNVs and lower educational attainment was supported by analyses of cohorts of adults from Italy and the United States and adolescents from the United Kingdom.
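The reported odds ratios for intellectual disability can be approximately recovered from the counts given above. The following sketch computes the simple cross-product (sample) odds ratio, treating the 114 of 6819 as the comparison group; the function name `odds_ratio` is hypothetical, and the estimator used by the authors is not stated in the abstract, so small differences from the published values are to be expected.

```python
def odds_ratio(cases_carriers, total_carriers, cases_reference, total_reference):
    """Cross-product odds ratio from a 2x2 table of carriers vs. comparison group."""
    a = cases_carriers                      # carriers with the outcome
    b = total_carriers - cases_carriers     # carriers without the outcome
    c = cases_reference                     # comparison group with the outcome
    d = total_reference - cases_reference   # comparison group without the outcome
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Counts taken from the results above.
or_deletion = odds_ratio(11, 216, 114, 6819)     # deletions of at least 250 kb
or_duplication = odds_ratio(6, 102, 114, 6819)   # duplications of at least 1 Mb
print(f"deletion OR ~ {or_deletion:.2f}, duplication OR ~ {or_duplication:.2f}")
```

Under these assumptions the cross-product values come out at roughly 3.16 and 3.68, close to the reported 3.16 and 3.67; the residual difference for duplications may reflect a different estimator (for example, a conditional maximum-likelihood estimate), which is an assumption rather than something the abstract confirms.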
Known pathogenic CNVs in unselected, but assumed to be healthy, adult populations may be associated with unrecognized clinical sequelae. Additionally, individually rare but collectively common intermediate-size CNVs may be negatively associated with educational attainment. Replication of these findings in additional population groups is warranted given the potential implications of this observation for genomics research, clinical care, and public health.
A major cause of autosomal dominant disease is haploinsufficiency, whereby a single copy of a gene is not sufficient to maintain the normal function of the gene. A large proportion of existing methods for predicting haploinsufficiency incorporate biological networks, e.g. protein-protein interaction networks, that have recently been shown to introduce study bias. As a result, these methods tend to perform best on well-studied genes, but underperform on less studied genes. The advent of large genome sequencing consortia, such as the 1000 Genomes Project, NHLBI Exome Sequencing Project and the Exome Aggregation Consortium, creates an urgent need for unbiased haploinsufficiency prediction methods.
Here, we describe a machine learning approach, called HIPred, that integrates genomic and evolutionary information from ENSEMBL, with functional annotations from the Encyclopaedia of DNA Elements consortium and the NIH Roadmap Epigenomics Project to predict haploinsufficiency, without the study bias described earlier. We benchmark HIPred using several datasets and show that our unbiased method performs as well as, and in most cases, outperforms existing biased algorithms.
HIPred scores for all gene identifiers are available at: https://github.com/HAShihab/HIPred .
h.shihab@bristol.ac.uk or tom.gaunt@bristol.ac.uk.
Supplementary data are available at Bioinformatics online.