High-throughput sequencing technology has enabled population-based studies of the role of the human microbiome in disease etiology and exposure response. Distance-based analysis is a popular strategy for evaluating the overall association between microbiome diversity and outcome, wherein the phylogenetic distance between individuals’ microbiome profiles is computed and tested for association via permutation. Despite their practical popularity, distance-based approaches suffer from important challenges, especially in selecting the best distance and extending the methods to alternative outcomes, such as survival outcomes. We propose the microbiome regression-based kernel association test (MiRKAT), which directly regresses the outcome on the microbiome profiles via the semi-parametric kernel machine regression framework. MiRKAT allows for easy covariate adjustment and extension to alternative outcomes while non-parametrically modeling the microbiome through a kernel that incorporates phylogenetic distance. It uses a variance-component score statistic to test for the association with analytical p value calculation. The model also allows simultaneous examination of multiple distances, alleviating the problem of choosing the best distance. Our simulations demonstrated that MiRKAT provides correctly controlled type I error and adequate power in detecting overall association. “Optimal” MiRKAT, which considers multiple candidate distances, is robust in that it suffers from little power loss in comparison to when the best distance is used and can achieve tremendous power gain in comparison to when a poor distance is chosen. Finally, we applied MiRKAT to real microbiome datasets to show that microbial communities are associated with smoking and with fecal protease levels after confounders are controlled for.
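The kernel-based test described in this abstract can be illustrated with a minimal sketch. It assumes the pairwise distance matrix is turned into a kernel by Gower centering and that the null model is an ordinary linear regression of the outcome on covariates. The function names are hypothetical, and the p value here comes from permuting residuals as a simple stand-in for the analytical calculation the abstract mentions.

```python
import numpy as np

def distance_to_kernel(D):
    """Gower-center a pairwise distance matrix D into a PSD kernel (assumed construction)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * J @ (D ** 2) @ J
    # Clip negative eigenvalues so the kernel is positive semi-definite.
    w, V = np.linalg.eigh((K + K.T) / 2)
    return (V * np.clip(w, 0, None)) @ V.T

def null_residuals(y, X):
    """Residuals of y after least-squares regression on covariates X (include an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def score_statistic(y, X, K):
    """Variance-component score statistic Q = r' K r with r the null-model residuals."""
    r = null_residuals(y, X)
    return float(r @ K @ r)

def permutation_pvalue(y, X, K, n_perm=999, seed=0):
    """Approximate p value by permuting residuals (a stand-in, not the analytical method)."""
    rng = np.random.default_rng(seed)
    r = null_residuals(y, X)
    q_obs = r @ K @ r
    hits = 0
    for _ in range(n_perm):
        rp = rng.permutation(r)
        if rp @ K @ rp >= q_obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Several candidate distances could each be passed through `distance_to_kernel` and tested separately. Combining them into a single omnibus test, as "Optimal" MiRKAT does, needs additional machinery that this sketch does not attempt.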
GWAS have emerged as popular tools for identifying genetic variants that are associated with disease risk. Standard analysis of a case-control GWAS involves assessing the association between each individual genotyped SNP and disease risk. However, this approach suffers from limited reproducibility and difficulties in detecting multi-SNP and epistatic effects. As an alternative analytical strategy, we propose grouping SNPs together into SNP sets on the basis of proximity to genomic features such as genes or haplotype blocks, then testing the joint effect of each SNP set. Testing of each SNP set proceeds via the logistic kernel-machine-based test, which is based on a statistical framework that allows for flexible modeling of epistatic and nonlinear SNP effects. This flexibility and the ability to naturally adjust for covariate effects are important features of our test that make it appealing in comparison to individual SNP tests and existing multimarker tests. Using simulated data based on the International HapMap Project, we show that SNP-set testing can have improved power over standard individual-SNP analysis under a wide range of settings. In particular, we find that our approach has higher power than individual-SNP analysis when the median correlation between the disease-susceptibility variant and the genotyped SNPs is moderate to high. When the correlation is low, both individual-SNP analysis and the SNP-set analysis tend to have low power. We apply SNP-set analysis to analyze the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer GWAS discovery-phase data.
Depression is a common condition, but current treatments are only effective in a subset of individuals. To identify new treatment targets, we integrated depression genome-wide association study (GWAS) results (N = 500,199) with human brain proteomes (N = 376) to perform a proteome-wide association study of depression followed by Mendelian randomization. We identified 19 genes that were consistent with being causal in depression, acting via their respective cis-regulated brain protein abundance. We replicated nine of these genes using an independent depression GWAS (N = 307,353) and another human brain proteomic dataset (N = 152). Eleven of the 19 genes also had cis-regulated mRNA levels that were associated with depression, based on integration of the depression GWAS with human brain transcriptomes (N = 888). Meta-analysis of the discovery and replication proteome-wide association study analyses identified 25 brain proteins consistent with being causal in depression, 20 of which were not previously implicated in depression by GWAS. Together, these findings provide promising brain protein targets for further mechanistic and therapeutic studies.
Transcriptome-wide association studies (TWAS) have been widely used to integrate transcriptomic and genetic data to study complex human diseases. Within a test dataset lacking transcriptomic data, traditional two-stage TWAS methods first impute gene expression by creating a weighted sum that aggregates SNPs with their corresponding cis-eQTL effects on reference transcriptome. Traditional TWAS methods then employ a linear regression model to assess the association between imputed gene expression and test phenotype, thereby assuming the effect of a cis-eQTL SNP on test phenotype is a linear function of the eQTL's estimated effect on reference transcriptome. To increase TWAS robustness to this assumption, we propose a novel Variance-Component TWAS procedure (VC-TWAS) that assumes the effects of cis-eQTL SNPs on phenotype are random (with variance proportional to corresponding reference cis-eQTL effects) rather than fixed. VC-TWAS is applicable to both continuous and dichotomous phenotypes, as well as individual-level and summary-level GWAS data. Using simulated data, we show VC-TWAS is more powerful than traditional TWAS methods based on a two-stage Burden test, especially when eQTL genetic effects on test phenotype are no longer a linear function of their eQTL genetic effects on reference transcriptome. We further applied VC-TWAS to both individual-level (N = ~3.4K) and summary-level (N = ~54K) GWAS data to study Alzheimer's dementia (AD). With the individual-level data, we detected 13 significant risk genes including 6 known GWAS risk genes such as TOMM40 that were missed by traditional TWAS methods. With the summary-level data, we detected 57 significant risk genes considering only cis-SNPs and 71 significant genes considering both cis- and trans- SNPs, which also validated our findings with the individual-level GWAS data. Our VC-TWAS method is implemented in the TIGAR tool for public use.
The 3q29 deletion confers increased risk for neuropsychiatric phenotypes including intellectual disability, autism spectrum disorder, generalized anxiety disorder, and a >40-fold increased risk for schizophrenia. To investigate consequences of the 3q29 deletion in an experimental system, we used CRISPR/Cas9 technology to introduce a heterozygous deletion into the syntenic interval on C57BL/6 mouse chromosome 16. mRNA abundance for 20 of the 21 genes in the interval was reduced by ~50%, while protein levels were reduced for only a subset of these, suggesting a compensatory mechanism. Mice harboring the deletion manifested behavioral impairments in multiple domains including social interaction, cognitive function, acoustic startle, and amphetamine sensitivity, with some sex-dependent manifestations. In addition, 3q29 deletion mice showed reduced body weight throughout development consistent with the phenotype of 3q29 deletion syndrome patients. Of the genes within the interval, DLG1 has been hypothesized as a contributor to the neuropsychiatric phenotypes. However, we show that Dlg1+/- mice did not exhibit the behavioral deficits seen in mice harboring the full 3q29 deletion. These data demonstrate the following: the 3q29 deletion mice are a valuable experimental system that can be used to interrogate the biology of 3q29 deletion syndrome; behavioral manifestations of the 3q29 deletion may have sex-dependent effects; and mouse-specific behavior phenotypes associated with the 3q29 deletion are not solely due to haploinsufficiency of Dlg1.
DNA methylation is an important epigenetic mechanism that has been linked to complex diseases and is of great interest to researchers as a potential link between genome, environment, and disease. As the scale of DNA methylation association studies approaches that of genome‐wide association studies, issues such as population stratification will need to be addressed. It is well‐documented that failure to adjust for population stratification can lead to false positives in genetic association studies, but population stratification is often unaccounted for in DNA methylation studies. Here, we propose several approaches to correct for population stratification using principal components (PCs) from different subsets of genome‐wide methylation data. We first illustrate the potential for confounding due to population stratification by demonstrating widespread associations between DNA methylation and race in 388 individuals (365 African American and 23 Caucasian). We subsequently evaluate the performance of our PC‐based approaches and other methods in adjusting for confounding due to population stratification. Our simulations show that (1) all of the methods considered are effective at removing inflation due to population stratification, and (2) maximum power can be obtained with single‐nucleotide polymorphism (SNP)‐based PCs, followed by methylation‐based PCs, which outperform both surrogate variable analysis and genomic control. Among our different approaches to computing methylation‐based PCs, we find that PCs based on CpG sites chosen for their potential to proxy nearby SNPs can provide a powerful and computationally efficient approach to adjust for population stratification in DNA methylation studies when genome‐wide SNP data are unavailable.
Fragile X syndrome (FXS), a common inherited form of mental retardation, is caused by the functional absence of the fragile X mental retardation protein (FMRP), an RNA-binding protein that regulates the translation of specific mRNAs at synapses. Altered synaptic plasticity has been described in a mouse FXS model. However, the mechanism by which the loss of FMRP alters synaptic function, and subsequently causes the mental impairment, is unknown. Here, in cultured hippocampal neurons, we used siRNAs against Fmr1 to demonstrate that a reduction of FMRP in dendrites leads to an increase in internalization of the α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid receptor (AMPAR) subunit, GluR1, in dendrites. This abnormal AMPAR trafficking was caused by spontaneous action potential-driven network activity without synaptic stimulation by an exogenous agonist and was rescued by 2-methyl-6-phenylethynyl-pyridine (MPEP), an mGluR5-specific inverse agonist. Because AMPAR internalization depends on local protein synthesis after mGluR5 stimulation, FMRP, a negative regulator of translation, may be viewed as a counterbalancing signal, wherein the absence of FMRP leads to an apparent excess of mGluR5 signaling in dendrites. Because AMPAR trafficking is a driving process for synaptic plasticity underlying learning and memory, our data suggest that hypersensitive AMPAR internalization in response to excess mGluR signaling may represent a principal cellular defect in FXS, which may be corrected by using mGluR antagonists.
Many case-control tests of rare variation are implemented in statistical frameworks that make correction for confounders like population stratification difficult. Simple permutation of disease status is unacceptable for resolving this issue because the replicate data sets do not have the same confounding as the original data set. These limitations make it difficult to apply rare-variant tests to samples in which confounding most likely exists, e.g., samples collected from admixed populations. To enable the use of such rare-variant methods in structured samples, as well as to facilitate permutation tests for any situation in which case-control tests require adjustment for confounding covariates, we propose to establish the significance of a rare-variant test via a modified permutation procedure. Our procedure uses Fisher’s noncentral hypergeometric distribution to generate permuted data sets with the same structure present in the actual data set such that inference is valid in the presence of confounding factors. We use simulated sequence data based on coalescent models to show that our permutation strategy corrects for confounding due to population stratification that, if ignored, would otherwise inflate the size of a rare-variant test. We further illustrate the approach by using sequence data from the Dallas Heart Study of energy metabolism traits. Researchers can implement our permutation approach by using the R package BiasedUrn.
Recent efforts have focused on developing methylation risk scores (MRS), a weighted sum of the individual's DNA methylation (DNAm) values at pre-selected CpG sites. Most current MRS approaches that use epigenome-wide association study (EWAS) summary statistics include only genome-wide significant CpG sites and do not consider co-methylation. New methods that relax the p-value threshold to include more CpG sites and account for the inter-correlation of DNAm might improve the predictive performance of MRS. We paired informed co-methylation pruning with p-value thresholding to generate pruning and thresholding (P+T) MRS and evaluated its performance among multi-ancestry populations. Through simulation studies and real data analyses, we demonstrated that pruning provides an improvement over simple thresholding methods for prediction of phenotypes. We demonstrated that European-derived summary statistics can be used to develop P+T MRS among other populations such as African populations. However, the prediction accuracy of P+T MRS may differ across multi-ancestry populations because of environmental, cultural, and social differences.
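The pruning-and-thresholding idea can be sketched as follows. This is a generic greedy version, not the authors' "informed" pruning procedure, whose details the abstract does not give. The function names, the thresholds `p_max` and `r_max`, and the use of a co-methylation correlation matrix from reference data are all assumptions made for illustration.

```python
import numpy as np

def prune_and_threshold(pvals, corr, p_max=1e-3, r_max=0.3):
    """Return indices of CpG sites kept after p-value thresholding and greedy pruning.

    pvals : EWAS p values, one per CpG site
    corr  : site-by-site co-methylation correlation matrix (assumed from reference data)
    """
    kept = []
    # Visit sites from most to least significant.
    for j in np.argsort(pvals):
        if pvals[j] > p_max:
            break  # every remaining site fails the threshold
        # Keep the site only if it is weakly correlated with all sites kept so far.
        if all(abs(corr[j, k]) < r_max for k in kept):
            kept.append(int(j))
    return kept

def methylation_risk_score(dnam, betas, kept):
    """Weighted sum of DNAm values (samples x sites) over the kept sites."""
    return dnam[:, kept] @ betas[kept]
```

With four sites where the two most significant are strongly co-methylated, only the more significant of the pair survives pruning, and the score sums the EWAS-weighted DNAm values over the remaining sites.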
Association mapping of complex traits typically employs tagSNP genotype data to identify a trait locus within a region of interest. However, considerable debate exists regarding the most powerful strategy for utilizing such tagSNP data for inference. A popular approach tests each tagSNP within the region individually, but such tests could lose power as a result of incomplete linkage disequilibrium between the genotyped tagSNP and the trait locus. Alternatively, one can jointly test all tagSNPs simultaneously within the region (by using genotypes or haplotypes), but such multivariate tests have large degrees of freedom that can also compromise power. Here, we consider a semiparametric model for quantitative-trait mapping that uses genetic information from multiple tagSNPs simultaneously in analysis but produces a test statistic with reduced degrees of freedom compared to existing multivariate approaches. We fit this model by using a dimension-reducing technique called least-squares kernel machines, which we show is identical to analysis using a specific linear mixed model (which we can fit by using standard software packages like SAS and R). Using simulated SNP data based on real data from the International HapMap Project, we demonstrate that our approach often has superior performance for association mapping of quantitative traits compared to the popular approach of single-tagSNP testing. Our approach is also flexible, because it allows easy modeling of covariates and, if interest exists, high-dimensional interactions among tagSNPs and environmental predictors.
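The stated equivalence between least-squares kernel machines and a linear mixed model can be checked numerically in a small sketch. It assumes the model y = Xβ + h(Z) + e with a fixed penalty λ, which corresponds to the variance ratio σ²/τ when h ~ N(0, τK) and e ~ N(0, σ²I). The Gaussian kernel and all function names are illustrative choices. In practice the variance components would be estimated (for example by REML) rather than fixed as they are here.

```python
import numpy as np

def gaussian_kernel(Z, rho):
    """K[i, j] = exp(-||z_i - z_j||^2 / rho); one possible kernel choice."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / rho)

def lskm_fit(y, X, K, lam):
    """Penalized least squares: minimize ||y - X b - K a||^2 + lam * a' K a."""
    p = X.shape[1]
    # Normal equations for (b, a), stacked into one linear system.
    A = np.block([[X.T @ X, X.T @ K],
                  [K @ X, K @ K + lam * K]])
    rhs = np.concatenate([X.T @ y, K @ y])
    sol = np.linalg.solve(A, rhs)
    beta, alpha = sol[:p], sol[p:]
    return beta, K @ alpha

def mixed_model_fit(y, X, K, lam):
    """GLS estimate of beta and BLUP of h when lam = sigma^2 / tau is known."""
    n = len(y)
    Vinv = np.linalg.inv(K + lam * np.eye(n))
    beta = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
    h = K @ Vinv @ (y - X @ beta)
    return beta, h
```

For a positive-definite kernel and the same λ, the two functions return the same covariate effects and the same fitted h up to numerical error. This is the sense in which the kernel machine fit can be obtained from standard mixed-model software.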