As large-scale studies of gene expression with multiple sources of biological and technical variation become widely adopted, characterizing these drivers of variation becomes essential to understanding disease biology and regulatory genetics.
We describe a statistical and visualization framework, variancePartition, to prioritize drivers of variation based on a genome-wide summary, and identify genes that deviate from the genome-wide trend. Using a linear mixed model, variancePartition quantifies variation in each expression trait attributable to differences in disease status, sex, cell or tissue type, ancestry, genetic background, experimental stimulus, or technical variables. Analysis of four large-scale transcriptome profiling datasets illustrates that variancePartition recovers striking patterns of biological and technical variation that are reproducible across multiple datasets.
Our open source software, variancePartition, enables rapid interpretation of complex gene expression studies as well as other high-throughput genomics assays. variancePartition is available from Bioconductor: http://bioconductor.org/packages/variancePartition.
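To make the idea of attributing expression variation to a source concrete, here is a minimal sketch for the simplest possible case: one gene, one grouping variable (individual), and a balanced design. It uses a method-of-moments (one-way ANOVA) estimate rather than the linear mixed model that variancePartition fits, and the function name `variance_fractions` is an illustrative assumption, not part of the package.

```python
from statistics import mean


def variance_fractions(groups):
    """Split the variance of one expression trait into between-individual
    and residual fractions (balanced one-way design, method of moments).

    groups: list of lists, one inner list of expression values per individual.
    Returns (fraction_between, fraction_residual), which sum to 1.
    """
    k = len(groups)  # number of individuals
    if k < 2:
        raise ValueError("need at least two individuals")
    n = len(groups[0])  # replicates per individual
    if n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need a balanced design with at least two replicates")

    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]

    # Mean squares from the classical one-way ANOVA table.
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    ms_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    ) / (k * (n - 1))

    # Method-of-moments variance components; truncate negatives at zero.
    var_between = max((ms_between - ms_within) / n, 0.0)
    var_residual = ms_within
    total = var_between + var_residual
    if total == 0:
        raise ValueError("expression values are constant; nothing to partition")
    return var_between / total, var_residual / total


# Three individuals, two replicates each: replicates agree closely, so
# almost all variation is attributed to differences between individuals.
between, residual = variance_fractions([[1.0, 1.2], [3.0, 3.2], [5.0, 5.2]])
print(f"between individuals: {between:.3f}, residual: {residual:.3f}")
```

Repeating this per gene and summarizing the fractions across genes gives the kind of genome-wide view the abstract describes; a mixed model generalizes it to several variables and unbalanced designs.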
Abstract
Summary
Large-scale transcriptome studies with multiple samples per individual are widely used to study disease biology. Yet, current methods for differential expression are inadequate for cross-individual testing for these repeated measures designs. Most problematic, we observe across multiple datasets that current methods can give reproducible false-positive findings that are driven by genetic regulation of gene expression, yet are unrelated to the trait of interest. Here, we introduce a statistical software package, dream, that increases power, controls the false positive rate, enables multiple types of hypothesis tests, and integrates with standard workflows. In 12 analyses in 6 independent datasets, dream yields biological insight not found with existing software while addressing the issue of reproducible false-positive findings.
Availability and implementation
Dream is available within the variancePartition Bioconductor package at http://bioconductor.org/packages/variancePartition.
Contact
gabriel.hoffman@mssm.edu
Supplementary information
Supplementary data are available at Bioinformatics online.
Population structure and kinship are widespread confounding factors in genome-wide association studies (GWAS). It has been standard practice to include principal components of the genotypes in a regression model in order to account for population structure. More recently, the linear mixed model (LMM) has emerged as a powerful method for simultaneously accounting for population structure and kinship. The statistical theory underlying the differences in empirical performance between modeling principal components as fixed versus random effects has not been thoroughly examined. We undertake an analysis to formalize the relationship between these widely used methods and elucidate the statistical properties of each. Moreover, we introduce a new statistic, effective degrees of freedom, that serves as a metric of model complexity and a novel low rank linear mixed model (LRLMM) to learn the dimensionality of the correction for population structure and kinship, and we assess its performance through simulations. A comparison of the results of LRLMM and a standard LMM analysis applied to GWAS data from the Multi-Ethnic Study of Atherosclerosis (MESA) illustrates how our theoretical results translate into empirical properties of the mixed model. Finally, the analysis demonstrates the ability of the LRLMM to substantially boost the strength of an association for HDL cholesterol in Europeans.
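The abstract does not state how effective degrees of freedom is defined. A common definition for a random-effect fit, assumed here, is the trace of the matrix that maps observations to fitted values. For a kinship matrix with eigenvalues s_i and a ratio delta of residual to genetic variance, that trace is the sum of s_i / (s_i + delta). The sketch below uses hypothetical names (`effective_df`, `kinship_eigenvalues`, `delta`) and is meant only to show how fixed-effect principal components and a random effect differ in model complexity.

```python
def effective_df(kinship_eigenvalues, delta):
    """Trace of K (K + delta * I)^-1 computed from the eigenvalues of K.

    kinship_eigenvalues: non-negative eigenvalues of the kinship matrix.
    delta: residual variance divided by genetic variance (must be > 0).
    """
    if delta <= 0:
        raise ValueError("delta must be positive")
    if any(s < 0 for s in kinship_eigenvalues):
        raise ValueError("eigenvalues must be non-negative")
    # Each eigen-direction contributes a shrinkage factor between 0 and 1.
    return sum(s / (s + delta) for s in kinship_eigenvalues)


eigenvalues = [4.0, 1.0, 0.0]

# Random effect: every direction is used, but each is shrunk.
print(effective_df(eigenvalues, delta=1.0))

# Low-rank variant (assumed form): keep only the top component.
print(effective_df(eigenvalues[:1], delta=1.0))
```

Including k principal components as fixed effects spends exactly k degrees of freedom with no shrinkage, whereas the random-effect form spends a fractional amount on every component. Under this assumed definition, that is the sense in which a single statistic can compare the complexity of the two corrections.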
The development of human induced pluripotent stem cells (hiPSCs) has made possible patient-specific modeling across the spectrum of human disease. Here, we discuss recent advances in psychiatric genomics and post-mortem studies that provide critical insights concerning cell-type composition and sample size that should be considered when designing hiPSC-based studies of complex genetic disease. We review recent hiPSC-based models of schizophrenia (SZ) in light of our new understanding of critical power limitations in the design of hiPSC-based studies of complex genetic disorders. Three possible solutions are a movement towards genetically stratified cohorts of rare variant patients, application of CRISPR technologies to engineer isogenic neural cells to study the impact of common variants, and integration of advanced genetics and hiPSC-based datasets in future studies. Overall, we emphasize that to advance the reproducibility and relevance of hiPSC-based studies, stem cell biologists must contemplate statistical and biological considerations that are already well accepted in the field of genetics. We conclude with a discussion of the hypothesis of biological convergence of disease (through molecular, cellular, circuit, and patient-level phenotypes) and how this might emerge through hiPSC-based studies.
While large-scale, genome-wide association studies (GWAS) have identified hundreds of loci associated with brain-related traits, identification of the variants, genes and molecular mechanisms underlying these traits remains challenging. Integration of GWAS with expression quantitative trait loci (eQTLs) and identification of shared genetic architecture have been widely adopted to nominate genes and candidate causal variants. However, this approach is limited by sample size, statistical power and linkage disequilibrium. We developed the multivariate multiple QTL approach and performed a large-scale, multi-ancestry eQTL meta-analysis to increase power and fine-mapping resolution. Analysis of 3,983 RNA-sequenced samples from 2,119 donors, including 474 non-European individuals, yielded an effective sample size of 3,154. Joint statistical fine-mapping of eQTL and GWAS identified 329 variant-trait pairs for 24 brain-related traits driven by 204 unique candidate causal variants for 189 unique genes. This integrative analysis identifies candidate causal variants and elucidates potential regulatory mechanisms for genes underlying schizophrenia, bipolar disorder and Alzheimer's disease.
The power of human induced pluripotent stem cell (hiPSC)-based studies to resolve the smaller effects of common variants within the size of cohorts that can be realistically assembled remains uncertain. We identified and accounted for a variety of technical and biological sources of variation in a large case/control schizophrenia (SZ) hiPSC-derived cohort of neural progenitor cells and neurons. Reducing the stochastic effects of the differentiation process by correcting for cell type composition boosted the SZ signal and increased the concordance with post-mortem data sets. We predict a growing convergence between hiPSC and post-mortem studies as both approaches expand to larger cohort sizes. For studies of complex genetic disorders, to maximize the power of hiPSC cohorts currently feasible, in most cases and whenever possible, we recommend expanding the number of individuals even at the expense of the number of replicate hiPSC clones.
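The recommendation to favor more individuals over more replicate clones can be illustrated with the standard design-effect approximation for clustered observations. This is a textbook formula, not the power analysis used in the study, and the function name, the line budget of 60, and the within-individual correlation of 0.5 are illustrative assumptions.

```python
def effective_sample_size(n_individuals, n_clones, rho):
    """Approximate number of independent observations when each individual
    contributes n_clones correlated lines.

    rho: correlation between clones from the same individual (0 to 1).
    Uses the design effect 1 + (n_clones - 1) * rho.
    """
    if n_individuals < 1 or n_clones < 1:
        raise ValueError("counts must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie between 0 and 1")
    return n_individuals * n_clones / (1 + (n_clones - 1) * rho)


# Same budget of 60 hiPSC lines, spent two different ways.
rho = 0.5  # assumed within-individual correlation, for illustration only
print(effective_sample_size(20, 3, rho))  # 20 individuals x 3 clones
print(effective_sample_size(60, 1, rho))  # 60 individuals x 1 clone
```

Under these assumptions, the design with one clone per individual yields twice the effective sample size of the three-clone design for the same number of lines. The gap widens as clones from the same donor become more similar.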
The mechanisms by which common risk variants of small effect interact to contribute to complex genetic disorders are unclear. Here, we apply a genetic approach, using isogenic human induced pluripotent stem cells, to evaluate the effects of schizophrenia (SZ)-associated common variants predicted to function as SZ expression quantitative trait loci (eQTLs). By integrating CRISPR-mediated gene editing, activation and repression technologies to study one putative SZ eQTL (FURIN rs4702) and four top-ranked SZ eQTL genes (FURIN, SNAP91, TSNARE1 and CLCN3), our platform resolves pre- and postsynaptic neuronal deficits, recapitulates genotype-dependent gene expression differences and identifies convergence downstream of SZ eQTL gene perturbations. Our observations highlight the cell-type-specific effects of common variants and demonstrate a synergistic effect between SZ eQTL genes that converges on synaptic function. We propose that the links between rare and common variants implicated in psychiatric disease risk constitute a potentially generalizable phenomenon occurring more widely in complex genetic disorders.
Variability in induced pluripotent stem cell (iPSC) lines remains a concern for disease modeling and regenerative medicine. We have used RNA-sequencing analysis and linear mixed models to examine the sources of gene expression variability in 317 human iPSC lines from 101 individuals. We found that ∼50% of genome-wide expression variability is explained by variation across individuals and identified a set of expression quantitative trait loci that contribute to this variation. These analyses coupled with allele-specific expression show that iPSCs retain a donor-specific gene expression pattern. Network, pathway, and key driver analyses showed that Polycomb targets contribute significantly to the non-genetic variability seen within and across individuals, highlighting this chromatin regulator as a likely source of reprogramming-based variability. Our findings therefore shed light on variation between iPSC lines and illustrate the potential for our dataset and other similar large-scale analyses to identify underlying drivers relevant to iPSC applications.
• Gene expression analysis characterizes 317 human iPSC lines from 101 individuals
• eQTLs contribute significantly to cross-individual variation in iPSC lines
• Polycomb target genes are a significant source of non-genetic variation
• Predictive networks highlight candidate key drivers of differentiation efficiency
Using large-scale analyses of over 300 iPSC lines, Chang, Quertermous, Lemischka, and colleagues of the NHLBI NextGen consortium examine sources of gene expression variation between lines and illustrate how this approach can identify genetic and non-genetic drivers relevant to line variation with implications for iPSC characterization and disease modeling.
Transcriptome-wide association studies integrate gene expression data with common risk variation to identify gene-trait associations. By incorporating epigenome data to estimate the functional importance of genetic variation on gene expression, we generate a small but significant improvement in the accuracy of transcriptome prediction and increase the power to detect significant expression-trait associations. Joint analysis of 14 large-scale transcriptome datasets and 58 traits identifies 13,724 significant expression-trait associations that converge on biological processes and relevant phenotypes in human and mouse phenotype databases. We perform drug repurposing analysis and identify compounds that mimic, or reverse, trait-specific changes. We identify genes that exhibit agonistic pleiotropy for genetically correlated traits that converge on shared biological pathways and elucidate distinct processes in disease etiopathogenesis. Overall, this comprehensive analysis provides insight into the specificity and convergence of gene expression on susceptibility to complex traits.
Abstract
Identifying functional variants underlying disease risk and adoption of personalized medicine are currently limited by the challenge of interpreting the functional consequences of genetic variants. Predicting the functional effects of disease-associated protein-coding variants is increasingly routine. Yet, the vast majority of risk variants are non-coding, and predicting their functional consequences and prioritizing variants for functional validation remain major challenges. Here, we develop a deep learning model to accurately predict locus-specific signals from four epigenetic assays using only DNA sequence as input. Given the predicted epigenetic signal from DNA sequence for the reference and alternative alleles at a given locus, we generate a score of the predicted epigenetic consequences for 438 million variants observed in previous sequencing projects. These impact scores are assay-specific, are predictive of allele-specific transcription factor binding and are enriched for variants associated with gene expression and disease risk. Nucleotide-level functional consequence scores for non-coding variants can refine the mechanism of known functional variants, identify novel risk variants and prioritize downstream experiments.
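The scoring step described above compares a sequence-based prediction for the reference allele with the prediction for the alternative allele. The abstract does not say how the two predictions are combined, so the sketch below assumes a simple difference. `toy_predict_signal` is a placeholder (GC fraction of the window) standing in for the trained deep learning model, and all names are hypothetical.

```python
def toy_predict_signal(sequence):
    """Placeholder predictor: GC fraction of the window.

    Stands in for a trained sequence-to-signal model; not the real model.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def impact_score(ref_window, position, alt_base, predict=toy_predict_signal):
    """Predicted signal for the alternative allele minus the reference.

    ref_window: reference DNA sequence surrounding the variant.
    position: 0-based index of the variant within ref_window.
    alt_base: single-nucleotide alternative allele.
    predict: any function mapping a sequence to a predicted signal.
    """
    if not 0 <= position < len(ref_window):
        raise IndexError("variant position lies outside the window")
    if len(alt_base) != 1:
        raise ValueError("this sketch handles single-nucleotide variants only")
    # Substitute the alternative allele into the reference window.
    alt_window = ref_window[:position] + alt_base + ref_window[position + 1:]
    return predict(alt_window) - predict(ref_window)


# An A-to-G change in an AT-only window raises the toy signal.
print(impact_score("AATT", position=1, alt_base="G"))
```

Because the predictor is passed in as an argument, the same scoring function could be applied once per assay, which matches the abstract's statement that the impact scores are assay-specific.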