Understanding the core set of genes that are necessary for basic developmental functions is one of the central goals in biology. Studies in model organisms identified a significant fraction of essential genes through the analysis of null mutations that lead to lethality. Recent large-scale next-generation sequencing efforts have provided unprecedented data on genetic variation in humans. However, evolutionary and genomic characteristics of human essential genes have never been directly studied on a genome-wide scale. Here we use detailed phenotypic resources available for the mouse and deep genomic sequencing data from human populations to characterize patterns of genetic variation and mutational burden in a set of 2,472 human orthologs of known essential genes in the mouse. Consistent with the action of strong purifying selection, these genes exhibit comparatively reduced levels of sequence variation, a skew in allele frequency towards rare variants, and increased conservation across the primate and rodent lineages relative to the remainder of genes in the genome. In individual genomes we observed ~12 rare mutations within essential genes predicted to be damaging. Consistent with the hypothesis that mutations in essential genes are risk factors for neurodevelopmental disease, we show that de novo variants in patients with Autism Spectrum Disorder are more likely to occur in this collection of genes. While incomplete, our set of human orthologs shows characteristics fully consistent with essential function in humans and thus provides a resource to inform and facilitate interpretation of sequence data in studies of human disease.
Germline mutation is the mechanism by which genetic variation in a population is created. Inferences derived from mutation rate models are fundamental to many population genetics methods. Previous models have demonstrated that nucleotides flanking polymorphic sites (the local sequence context) explain variation in the probability that a site is polymorphic. However, these models face limitations as the size of the local sequence context window expands. These include a lack of robustness to data sparsity at typical sample sizes, a lack of regularization to generate parsimonious models, and a lack of quantified uncertainty in estimated rates to facilitate comparison between models. To address these limitations, we developed Baymer, a regularized Bayesian hierarchical tree model that captures the heterogeneous effect of sequence contexts on polymorphism probabilities. Baymer implements an adaptive Metropolis-within-Gibbs Markov Chain Monte Carlo sampling scheme to estimate the posterior distributions of sequence-context-based probabilities that a site is polymorphic. We show that Baymer accurately infers polymorphism probabilities and well-calibrated posterior distributions, robustly handles data sparsity, appropriately regularizes to return parsimonious models, and scales computationally at least up to 9-mer context windows. We demonstrate application of Baymer in three ways: first, identifying differences in polymorphism probabilities between continental populations in the 1000 Genomes Phase 3 dataset; second, in a sparse data setting, examining the use of polymorphism models as a proxy for de novo mutation probabilities as a function of variant age, sequence context window size, and demographic history; and third, comparing model concordance between different great ape species. We find a shared context-dependent mutation rate architecture underlying our models, enabling a transfer-learning-inspired strategy for modeling germline mutations.
In summary, Baymer is an accurate polymorphism probability estimation algorithm that automatically adapts to data sparsity at different sequence context levels, thereby making efficient use of the available data.
The rate of single-nucleotide polymorphism varies substantially across the human genome and fundamentally influences evolution and incidence of genetic disease. Previous studies have only considered the immediately flanking nucleotides around a polymorphic site (the site's trinucleotide sequence context) to study polymorphism levels across the genome. Moreover, the impact of larger sequence contexts has not been fully clarified, even though context substantially influences rates of polymorphism. Using a new statistical framework and data from the 1000 Genomes Project, we demonstrate that a heptanucleotide context explains >81% of variability in substitution probabilities, highlighting new mutation-promoting motifs at ApT dinucleotide, CAAT and TACG sequences. Our approach also identifies previously undocumented variability in C-to-T substitutions at CpG sites, which is not immediately explained by differential methylation intensity. Using our model, we present informative substitution intolerance scores for genes and a new intolerance score for amino acids, and we demonstrate clinical use of the model in neuropsychiatric diseases.
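To make the notion of a sequence context concrete, the following is a minimal sketch that tabulates, for each k-mer window centered on a site, the raw fraction of sites observed to be polymorphic. It is not the statistical framework described above: the function name, its parameters, and the toy sequence are hypothetical, and the sketch ignores substitution class, strand symmetry, and any model fitting.

```python
from collections import Counter


def context_polymorphism_rates(sequence, polymorphic_positions, k=7):
    """Return {context: fraction of sites with that context that are polymorphic}.

    Hypothetical helper for illustration only. `sequence` is a reference
    string, `polymorphic_positions` holds 0-based indices of polymorphic
    sites, and `k` is the (odd) context width; k=7 is a heptanucleotide.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so the focal site is centered")
    flank = k // 2
    polymorphic = set(polymorphic_positions)
    totals = Counter()  # how often each context occurs
    hits = Counter()    # how often the centered site is polymorphic
    for pos in range(flank, len(sequence) - flank):
        context = sequence[pos - flank : pos + flank + 1]
        totals[context] += 1
        if pos in polymorphic:
            hits[context] += 1
    return {ctx: hits[ctx] / n for ctx, n in totals.items()}


# Toy usage with a trinucleotide window (k=3) on a made-up sequence.
rates = context_polymorphism_rates("ACGACGACG", {1, 4}, k=3)
```

With wider windows the number of possible contexts grows as 4^k, so most contexts are observed rarely; raw proportions like these become noisy, which is the sparsity problem that motivates a model-based approach.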
Balancing selection occurs when multiple alleles are maintained in a population, which can result in their preservation over long evolutionary time periods. A characteristic signature of this long-term balancing selection is an excess number of intermediate frequency polymorphisms near the balanced variant. However, the expected distribution of allele frequencies at these loci has not been extensively detailed, and therefore existing summary statistic methods do not explicitly take it into account. Using simulations, we show that new mutations which arise in close proximity to a site targeted by balancing selection accumulate at frequencies nearly identical to that of the balanced allele. In order to scan the genome for balancing selection, we propose a new summary statistic, β, which detects these clusters of alleles at similar frequencies. Simulation studies show that compared with existing summary statistics, our measure has improved power to detect balancing selection, and is reasonably powered in non-equilibrium demographic models and under a range of recombination and mutation rates. We compute β on 1000 Genomes Project data to identify loci potentially subjected to long-term balancing selection in humans. We report two balanced haplotypes, localized to the genes WFS1 and CADM2, that are strongly linked to association signals for complex traits. Our approach is computationally efficient and applicable to species that lack appropriate outgroup sequences, allowing for well-powered analysis of selection in the wide variety of species for which population data are rapidly being generated.
Observational studies have identified height as a strong risk factor for atrial fibrillation, but this finding may be limited by residual confounding. We aimed to examine genetic variation in height within the Mendelian randomization (MR) framework to determine whether height has a causal effect on risk of atrial fibrillation.
In summary-level analyses, MR was performed using summary statistics from genome-wide association studies of height (GIANT/UK Biobank; 693,529 individuals) and atrial fibrillation (AFGen; 65,446 cases and 522,744 controls), finding that each 1-SD increase in genetically predicted height increased the odds of atrial fibrillation (odds ratio [OR] 1.34; 95% CI 1.29 to 1.40; p = 5 × 10⁻⁴²). This result remained consistent in sensitivity analyses with MR methods that make different assumptions about the presence of pleiotropy, and when accounting for the effects of traditional cardiovascular risk factors on atrial fibrillation. Individual-level phenome-wide association studies of height and a height genetic risk score were performed among 6,567 European-ancestry participants of the Penn Medicine Biobank (median age at enrollment 63 years, interquartile range 55-72; 38% female; recruitment 2008-2015), confirming prior observational associations between height and atrial fibrillation. Individual-level MR confirmed that each 1-SD increase in height increased the odds of atrial fibrillation, including adjustment for clinical and echocardiographic confounders (OR 1.89; 95% CI 1.50 to 2.40; p = 0.007). The main limitations of this study include potential bias from pleiotropic effects of genetic variants, and lack of generalizability of individual-level findings to non-European populations.
In this study, we observed evidence that height is likely a positive causal risk factor for atrial fibrillation. Further study is needed to determine whether risk prediction tools including height or anthropometric risk factors can be used to improve screening and primary prevention of atrial fibrillation, and whether biological pathways involved in height may offer new targets for treatment of atrial fibrillation.
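As an illustration of how a summary-level MR estimate is formed, here is a minimal sketch of the fixed-effect inverse-variance weighted (IVW) estimator, one standard summary-level MR method. Whether the study's primary analysis used exactly this estimator is an assumption, and the function names and toy per-variant effect sizes below are hypothetical rather than taken from the study.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate (log odds per unit of exposure).

    Each variant contributes the ratio beta_outcome / beta_exposure,
    weighted by beta_exposure**2 / se_outcome**2. Hypothetical helper.
    """
    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)
    )
    beta = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return beta, standard_error


def odds_ratio_with_ci(beta, standard_error, z=1.96):
    """Convert a log-odds estimate into an odds ratio with a 95% CI."""
    return (
        math.exp(beta),
        math.exp(beta - z * standard_error),
        math.exp(beta + z * standard_error),
    )


# Toy inputs (made up): two variants whose outcome effects are 0.3 times
# their exposure effects, so the IVW estimate is 0.3 on the log-odds scale.
beta, se = ivw_estimate([0.1, 0.2], [0.03, 0.06], [0.01, 0.01])
odds_ratio, ci_low, ci_high = odds_ratio_with_ci(beta, se)
```

The sensitivity analyses mentioned above relax the assumption, built into this estimator, that no variant affects the outcome other than through the exposure.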
Technological advances make it possible to use high-throughput sequencing as a primary discovery tool of medical genetics, specifically for assaying rare variation. Still, this approach faces the analytic challenge that the influence of very rare variants can only be evaluated effectively as a group. A further complication is that any given rare variant could have no effect, could increase risk, or could be protective. We propose here the C-alpha test statistic as a novel approach for testing for the presence of this mixture of effects across a set of rare variants. Unlike existing burden tests, C-alpha, by testing the variance rather than the mean, maintains consistent power when the target set contains both risk and protective variants. Through simulations and analysis of case/control data, we demonstrate good power relative to existing methods that assess the burden of rare variants in individuals.
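The idea of testing the variance rather than the mean can be sketched as follows. This is a minimal illustration, assuming the commonly cited form of the statistic in which each variant's case count is compared with a binomial expectation and the excess squared deviation is summed across variants; the function names and toy counts are hypothetical, and the sketch omits any special handling of singleton variants.

```python
import math


def c_alpha_z(case_counts, total_counts, p0):
    """Z-score for over-dispersion of case counts across rare variants.

    case_counts[i]  : copies of variant i seen in cases
    total_counts[i] : copies of variant i seen in cases and controls
    p0              : expected fraction in cases under the null
    Hypothetical helper, not a reference implementation.
    """
    statistic = 0.0
    variance = 0.0
    for y, n in zip(case_counts, total_counts):
        mean = n * p0
        binom_var = n * p0 * (1.0 - p0)
        # Observed squared deviation minus its binomial expectation.
        statistic += (y - mean) ** 2 - binom_var
        # Null variance of that term, summed over all possible counts u.
        for u in range(n + 1):
            prob = math.comb(n, u) * p0**u * (1.0 - p0) ** (n - u)
            variance += ((u - mean) ** 2 - binom_var) ** 2 * prob
    if variance == 0.0:
        raise ValueError("null variance is zero; no informative variants")
    return statistic / math.sqrt(variance)


def one_sided_p(z):
    """Upper-tail p-value under a standard normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Toy data (made up): one variant seen only in cases, one only in
# controls, one evenly split. A burden test would see no net excess.
z = c_alpha_z([10, 0, 5], [10, 10, 10], p0=0.5)
p = one_sided_p(z)
```

In the toy data the total case count (15 of 30) matches the null expectation exactly, so a mean-based burden test has nothing to detect, while the variance-based score is large because the risk and protective variants both deviate strongly.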
Genome-wide association (GWA) studies have identified numerous, replicable, genetic associations between common single nucleotide polymorphisms (SNPs) and risk of common autoimmune and inflammatory (immune-mediated) diseases, some of which are shared between two diseases. Along with epidemiological and clinical evidence, this suggests that some genetic risk factors may be shared across diseases, as is the case with alleles in the Major Histocompatibility Locus. In this work we evaluate the extent of this sharing for 107 immune disease-risk SNPs in seven diseases: celiac disease, Crohn's disease, multiple sclerosis, psoriasis, rheumatoid arthritis, systemic lupus erythematosus, and type 1 diabetes. We have developed a novel statistic for Cross Phenotype Meta-Analysis (CPMA) which detects association of a SNP with multiple, but not necessarily all, phenotypes. With it, we find evidence that 47/107 (44%) immune-mediated disease risk SNPs are associated with multiple, but not all, immune-mediated diseases (SNP-wise P(CPMA) < 0.01). We also show that distinct groups of interacting proteins are encoded near SNPs which predispose to the same subsets of diseases; we propose these as the mechanistic basis of shared disease risk. We are thus able to leverage genetic data across diseases to construct biological hypotheses about the underlying mechanism of pathogenesis.
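The abstract does not define the CPMA statistic, so the following is only a sketch under a stated assumption: that it can be formalized as a likelihood-ratio test of whether the values of -ln(p) for one SNP across phenotypes follow an exponential distribution with rate 1, as expected when the SNP is associated with none of them, against an exponential with a freely estimated rate. The function name and toy p-values are hypothetical.

```python
import math


def cpma(p_values):
    """Likelihood-ratio statistic and p-value for one SNP across phenotypes.

    Assumption (not stated in the text above): under no association,
    -ln(p) is exponential with rate 1; the alternative fits the rate by
    maximum likelihood. The statistic is referred to a chi-square
    distribution with 1 degree of freedom.
    """
    x = [-math.log(p) for p in p_values]
    n = len(x)
    total = sum(x)
    rate_hat = n / total  # maximum-likelihood rate

    def log_likelihood(rate):
        return n * math.log(rate) - rate * total

    # Clamp at zero to absorb floating-point rounding when rate_hat is ~1.
    statistic = max(
        0.0, -2.0 * (log_likelihood(1.0) - log_likelihood(rate_hat))
    )
    # Upper tail of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Toy p-values (made up) for one SNP in seven diseases: strong signals
# in three diseases and nothing in the other four.
stat, p = cpma([1e-8, 1e-6, 1e-5, 0.3, 0.6, 0.8, 0.9])
```

Because the test looks at the overall shape of the p-value distribution rather than requiring every phenotype to be significant, a SNP associated with a subset of diseases can still produce a large statistic.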
The identification of signals of very recent positive selection provides information about the adaptation of modern humans to local conditions. We report here on a genome-wide scan for signals of very recent positive selection in favor of variants that have not yet reached fixation. We describe a new analytical method for scanning single nucleotide polymorphism (SNP) data for signals of recent selection, and apply this to data from the International HapMap Project. In all three continental groups we find widespread signals of recent positive selection. Most signals are region-specific, though a significant excess are shared across groups. Contrary to some earlier low-resolution studies that suggested a paucity of recent selection in sub-Saharan Africans, we find that by some measures our strongest signals of selection are from the Yoruba population. Finally, since these signals indicate the existence of genetic variants that have substantially different fitnesses, they must indicate loci that are the source of significant phenotypic variation. Though the relevant phenotypes are generally not known, such loci should be of particular interest in mapping studies of complex traits. For this purpose we have developed a set of SNPs that can be used to tag the strongest approximately 250 signals of recent selection in each population.
Success along the tenure track requires more than hard work and long hours. Here, the experiences of a recently tenured professor are distilled into a collection of tips to assist others along the path.
A number of epidemiological and genetic studies have attempted to determine whether levels of circulating lipids are associated with risks of various cancers, including breast cancer (BC). However, it remains unclear whether a causal relationship exists between lipids and BC. If alteration of lipid levels also reduced risk of BC, this could present a target for disease prevention. This study aimed to assess a potential causal relationship between genetic variants associated with plasma lipid traits (high-density lipoprotein, HDL; low-density lipoprotein, LDL; triglycerides, TGs) and risk for BC using Mendelian randomization (MR).
Data from genome-wide association studies in up to 215,551 participants from the Million Veteran Program (MVP) were used to construct genetic instruments for plasma lipid traits. The effect of these instruments on BC risk was evaluated using genetic data from the BCAC (Breast Cancer Association Consortium) based on 122,977 BC cases and 105,974 controls. Using MR, we observed that a 1-standard-deviation genetically determined increase in HDL levels is associated with an increased risk for all BCs (HDL: odds ratio [OR] = 1.08, 95% confidence interval [CI] = 1.04-1.13, P < 0.001). Multivariable MR analysis, which adjusted for the effects of LDL, TGs, body mass index (BMI), and age at menarche, corroborated this observation for HDL (OR = 1.06, 95% CI = 1.03-1.10, P = 4.9 × 10⁻⁴) and also found a relationship between LDL and BC risk (OR = 1.03, 95% CI = 1.01-1.07, P = 0.02). We did not observe a difference in these relationships when stratified by breast tumor estrogen receptor (ER) status. We repeated this analysis using genetic variants independent of the leading association at core HDL pathway genes and found that these variants were also associated with risk for BCs (OR = 1.11, 95% CI = 1.06-1.16, P = 1.5 × 10⁻⁶), including locus-specific associations at ABCA1 (ATP Binding Cassette Subfamily A Member 1), APOE-APOC1-APOC4-APOC2 (Apolipoproteins E, C1, C4, and C2), and CETP (Cholesteryl Ester Transfer Protein). In addition, we found evidence that genetic variation at the ABO locus is associated with both lipid levels and BC. Through multiple statistical approaches, we minimized and tested for the confounding effects of pleiotropy and population stratification on our analysis; however, the possible existence of residual pleiotropy and stratification remains a limitation of this study.
We observed that genetically elevated plasma HDL and LDL levels appear to be associated with increased BC risk. Future studies are required to understand the mechanism underlying this putative causal relationship, with the goal of developing potential therapeutic strategies aimed at altering the cholesterol-mediated effect on BC risk.