Human genetics has been haunted by the mystery of "missing heritability" of common traits. Although studies have discovered >1,200 variants associated with common diseases and traits, these variants typically appear to explain only a minority of the heritability. The proportion of heritability explained by a set of variants is the ratio of (i) the heritability due to these variants (numerator), estimated directly from their observed effects, to (ii) the total heritability (denominator), inferred indirectly from population data. The prevailing view has been that the explanation for missing heritability lies in the numerator, that is, in as-yet undiscovered variants. While many variants surely remain to be found, we show here that a substantial portion of missing heritability could arise from overestimation of the denominator, creating "phantom heritability." Specifically, (i) estimates of total heritability implicitly assume the trait involves no genetic interactions (epistasis) among loci; (ii) this assumption is not justified, because models with interactions are also consistent with observable data; and (iii) under such models, the total heritability may be much smaller and thus the proportion of heritability explained much larger. For example, 80% of the currently missing heritability for Crohn's disease could be due to genetic interactions, if the disease involves interaction among three pathways. In short, missing heritability need not directly correspond to missing variants, because current estimates of total heritability may be significantly inflated by genetic interactions. Finally, we describe a method for estimating heritability from isolated populations that is not inflated by genetic interactions.
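The ratio described in this abstract can be illustrated with a minimal sketch. The function names and the numeric values below are hypothetical and chosen only for illustration; they are not estimates from the paper, and the sketch does not reproduce the authors' interaction models. It only shows how an inflated denominator lowers the apparent proportion of heritability explained.

```python
def proportion_explained(h2_known: float, h2_total: float) -> float:
    """Ratio of heritability due to known variants to total heritability."""
    if not 0.0 < h2_total <= 1.0:
        raise ValueError("h2_total must lie in (0, 1]")
    if not 0.0 <= h2_known <= h2_total:
        raise ValueError("h2_known must lie in [0, h2_total]")
    return h2_known / h2_total


def phantom_fraction(h2_known: float, h2_apparent: float, h2_true: float) -> float:
    """Share of the apparent missing heritability caused by an inflated denominator.

    h2_apparent is the (possibly inflated) estimate of total heritability;
    h2_true is the total heritability under a model with interactions.
    """
    if not h2_known <= h2_true <= h2_apparent:
        raise ValueError("expected h2_known <= h2_true <= h2_apparent")
    missing_apparent = h2_apparent - h2_known  # what looks missing
    if missing_apparent == 0.0:
        raise ValueError("no apparent missing heritability")
    phantom = h2_apparent - h2_true            # part that never existed
    return phantom / missing_apparent


# Hypothetical values, for illustration only.
apparent = proportion_explained(0.10, 0.50)   # denominator inflated
corrected = proportion_explained(0.10, 0.25)  # denominator under interactions
phantom = phantom_fraction(0.10, 0.50, 0.25)
```

With these made-up inputs, the same known variants explain a larger share of a smaller denominator, and part of the gap is "phantom" rather than due to undiscovered variants.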
Despite years of active research into the role of DNA repair and replication in mutagenesis, surprisingly little is known about the origin of spontaneous human mutation in the germ line. With the advent of high-throughput sequencing, genome-scale data have revealed statistical properties of mutagenesis in humans. These properties include variation of the mutation rate and spectrum along the genome at different scales in relation to epigenomic features and dependency on parental age. Moreover, mutations originating in mothers are less frequent than mutations originating in fathers and have a distinct genomic distribution. Statistical analyses that interpret these patterns in the context of known biochemistry can provide mechanistic models of mutagenesis in humans.
Cancer genomics efforts have identified genes and regulatory elements driving cancer development and neoplastic progression. From a microevolution standpoint, these are subject to positive selection. Although elusive in current studies, genes whose wild-type coding sequences are needed for tumor growth are also of key interest. They are expected to experience negative selection and stay intact under pressure of incessant mutation. The detection of significantly mutated (or undermutated) genes is completely confounded by the genomic heterogeneity of cancer mutation. Here we present a hierarchical framework that allows modeling of coding point mutations. Application of the model to sequencing data from 17 cancer types demonstrates an increased power to detect known cancer driver genes and identifies new significantly mutated genes with highly plausible biological functions. The signal of negative selection is very subtle, but is detectable in several cancer types and in a pan-cancer data set. It is enriched in cell-essential genes identified in a CRISPR screen, as well as in genes with reported roles in cancer.
PolyPhen-2 (Polymorphism Phenotyping v2), available as software and via a Web server, predicts the possible impact of amino acid substitutions on the stability and function of human proteins using structural and comparative evolutionary considerations. It performs functional annotation of single-nucleotide polymorphisms (SNPs), maps coding SNPs to gene transcripts, extracts protein sequence annotations and structural attributes, and builds conservation profiles. It then estimates the probability of the missense mutation being damaging based on a combination of all these properties. PolyPhen-2 features include a high-quality multiple protein sequence alignment pipeline and a prediction method employing machine-learning classification. The software also integrates the UCSC Genome Browser's human genome annotations and MultiZ multiple alignments of vertebrate genomes with the human genome. PolyPhen-2 is capable of analyzing large volumes of data produced by next-generation sequencing projects, thanks to built-in support for high-performance computing environments like Grid Engine and Platform LSF.
Cancer genomes contain large numbers of somatic mutations but few of these mutations drive tumor development. Current approaches either identify driver genes on the basis of mutational recurrence or approximate the functional consequences of nonsynonymous mutations by using bioinformatic scores. Passenger mutations are enriched in characteristic nucleotide contexts, whereas driver mutations occur in functional positions, which are not necessarily surrounded by a particular nucleotide context. We observed that mutations in contexts that deviate from the characteristic contexts around passenger mutations provide a signal in favor of driver genes. We therefore developed a method that combines this feature with the signals traditionally used for driver-gene identification. We applied our method to whole-exome sequencing data from 11,873 tumor-normal pairs and identified 460 driver genes that clustered into 21 cancer-related pathways. Our study provides a resource of driver genes across 28 tumor types with additional driver genes identified according to mutations in unusual nucleotide contexts.
Human genetic variation is distributed nonrandomly across the genome, though the principles governing its distribution are only partially known. DNA replication creates opportunities for mutation, and the timing of DNA replication correlates with the density of SNPs across the human genome. To enable deeper investigation of how DNA replication timing relates to human mutation and variation, we generated a high-resolution map of the human genome’s replication timing program and analyzed its relationship to point mutations, copy number variations, and the meiotic recombination hotspots utilized by males and females. DNA replication timing associated with point mutations far more strongly than predicted from earlier analyses and showed a stronger relationship to transversion than transition mutations. Structural mutations arising from recombination-based mechanisms and recombination hotspots used more extensively by females were enriched in early-replicating parts of the genome, though these relationships appeared to relate more strongly to the genomic distribution of causative sequence features. These results indicate differential and sex-specific relationships of DNA replication timing to different forms of mutation and recombination.
Cancer is a disease potentiated by mutations in somatic cells. Cancer mutations are not distributed uniformly along the human genome. Instead, different human genomic regions vary by up to fivefold in the local density of cancer somatic mutations, posing a fundamental problem for statistical methods used in cancer genomics. Epigenomic organization has been proposed as a major determinant of the cancer mutational landscape. However, both somatic mutagenesis and epigenomic features are highly cell-type-specific. We investigated the distribution of mutations in multiple independent samples of diverse cancer types and compared them to cell-type-specific epigenomic features. Here we show that chromatin accessibility and modification, together with replication timing, explain up to 86% of the variance in mutation rates along cancer genomes. The best predictors of local somatic mutation density are epigenomic features derived from the most likely cell type of origin of the corresponding malignancy. Moreover, we find that cell-of-origin chromatin features are much stronger determinants of cancer mutation profiles than chromatin features of matched cancer cell lines. Furthermore, we show that the cell type of origin of a cancer can be accurately determined based on the distribution of mutations along its genome. Thus, the DNA sequence of a cancer genome encompasses a wealth of information about the identity and epigenomic features of its cell of origin.
The genetic basis of most traits is highly polygenic and dominated by non-coding alleles. It is widely assumed that such alleles exert small regulatory effects on the expression of cis-linked genes. However, despite the availability of gene expression and epigenomic datasets, few variant-to-gene links have emerged. It is unclear whether these sparse results are due to limitations in available data and methods, or to deficiencies in the underlying assumed model. To better distinguish between these possibilities, we identified 220 gene-trait pairs in which protein-coding variants influence a complex trait or its Mendelian cognate. Despite the presence of expression quantitative trait loci near most GWAS associations, by applying a gene-based approach we found limited evidence that the baseline expression of trait-related genes explains GWAS associations, whether using colocalization methods (8% of genes implicated), transcription-wide association (2% of genes implicated), or a combination of regulatory annotations and distance (4% of genes implicated). These results contradict the hypothesis that most complex trait-associated variants coincide with homeostatic expression QTLs, suggesting that better models are needed. The field must confront this deficit and pursue this 'missing regulation.'
Genetic studies have revealed thousands of loci predisposing to hundreds of human diseases and traits, revealing important biological pathways and defining novel therapeutic hypotheses. However, the genes discovered to date typically explain less than half of the apparent heritability. Because efforts have largely focused on common genetic variants, one hypothesis is that much of the missing heritability is due to rare genetic variants. Studies of common variants are typically referred to as genome-wide association studies, whereas studies of rare variants are often simply called sequencing studies. Because they are actually closely related, we use the terms common variant association study (CVAS) and rare variant association study (RVAS). In this paper, we outline the similarities and differences between RVAS and CVAS and describe a conceptual framework for the design of RVAS. We apply the framework to address key questions about the sample sizes needed to detect association, the relative merits of testing disruptive alleles vs. missense alleles, frequency thresholds for filtering alleles, the value of predictors of the functional impact of missense alleles, the potential utility of isolated populations, the value of gene-set analysis, and the utility of de novo mutations. The optimal design depends critically on the selection coefficient against deleterious alleles and thus varies across genes. The analysis shows that common variant and rare variant studies require similarly large sample collections. In particular, a well-powered RVAS should involve discovery sets with at least 25,000 cases, together with a substantial replication set.
Most autoimmune-disease-risk effects identified by genome-wide association studies (GWAS) localize to open chromatin with gene-regulatory activity. GWAS loci are also enriched in expression quantitative trait loci (eQTLs), thus suggesting that most risk variants alter gene expression. However, because causal variants are difficult to identify, and cis-eQTLs occur frequently, it remains challenging to identify specific instances of disease-relevant changes to gene regulation. Here, we used a novel joint likelihood framework with higher resolution than that of previous methods to identify loci where autoimmune-disease risk and an eQTL are driven by a single shared genetic effect. Using eQTLs from three major immune subpopulations, we found shared effects in only ∼25% of the loci examined. Thus, we show that a fraction of gene-regulatory changes suggest strong mechanistic hypotheses for disease risk, but we conclude that most risk mechanisms are not likely to involve changes in basal gene expression.