Single-cell RNA-sequencing (scRNA-seq) has emerged as a revolutionary tool that allows us to address scientific questions that eluded examination just a few years ago. With the advantages of scRNA-seq come computational challenges that are just beginning to be addressed. In this article, we highlight the computational methods available for the design and analysis of scRNA-seq experiments, their advantages and disadvantages in various settings, the open questions for which novel methods are needed, and expected future developments in this exciting area.
Motivation: The power of a microarray experiment derives from the identification of genes differentially regulated across biological conditions. To date, differential regulation is most often taken to mean differential expression, and a number of useful methods for identifying differentially expressed (DE) genes or gene sets are available. However, such methods are not able to identify many relevant classes of differentially regulated genes. One important example concerns differentially co-expressed (DC) genes. Results: We propose an approach, gene set co-expression analysis (GSCA), to identify DC gene sets. The GSCA approach provides a false discovery rate controlled list of interesting gene sets, does not require that genes be highly correlated in at least one biological condition and is readily applied to data from individual or multiple experiments, as we demonstrate using data from studies of lung cancer and diabetes. Availability: The GSCA approach is implemented in R and available at www.biostat.wisc.edu/∼kendzior/GSCA/. Contact: kendzior@biostat.wisc.edu Supplementary information: Supplementary data are available at Bioinformatics online.
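To make the idea of differential co-expression concrete, the following is a minimal Python sketch under stated assumptions, not the GSCA implementation itself. It summarizes one gene set by the root-mean-square difference between pairwise gene-gene correlations in two conditions and assesses that summary by permuting sample labels. The function names (`pearson`, `dispersion_index`, `permutation_pvalue`) and the default number of permutations are hypothetical choices made for illustration, and false discovery rate control across many gene sets is not shown.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dispersion_index(cond_a, cond_b):
    """Root-mean-square difference in pairwise correlations.

    cond_a, cond_b: one list per gene, each holding that gene's expression
    across the samples of the condition (same gene order in both).
    """
    pairs = list(combinations(range(len(cond_a)), 2))  # each gene pair once
    sq = [
        (pearson(cond_a[i], cond_a[j]) - pearson(cond_b[i], cond_b[j])) ** 2
        for i, j in pairs
    ]
    return math.sqrt(sum(sq) / len(sq))


def permutation_pvalue(cond_a, cond_b, n_perm=200, seed=0):
    """Permute sample labels to get a reference distribution for the index."""
    rng = random.Random(seed)
    observed = dispersion_index(cond_a, cond_b)
    n_a = len(cond_a[0])
    # Concatenate samples gene by gene so whole samples are shuffled together.
    pooled = [a + b for a, b in zip(cond_a, cond_b)]
    n_total = len(pooled[0])
    exceed = 0
    for _ in range(n_perm):
        idx = list(range(n_total))
        rng.shuffle(idx)
        perm_a = [[g[k] for k in idx[:n_a]] for g in pooled]
        perm_b = [[g[k] for k in idx[n_a:]] for g in pooled]
        if dispersion_index(perm_a, perm_b) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Because the summary is built from differences in correlation, a gene set can score highly even when no individual gene changes its mean expression, which is the class of signal that DE methods miss.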
The normalization of RNA-seq data is essential for accurate downstream inference, but the assumptions upon which most normalization methods are based are not applicable in the single-cell setting. Consequently, applying existing normalization methods to single-cell RNA-seq data introduces artifacts that bias downstream analyses. To address this, we introduce SCnorm for accurate and efficient normalization of single-cell RNA-seq data.
An important challenge in pre-processing data from droplet-based single-cell RNA sequencing protocols is distinguishing barcodes associated with real cells from those binding background reads. Existing methods test barcodes individually and consequently do not leverage the strong cell-to-cell correlation present in most datasets. To improve cell detection, we introduce CB2, a cluster-based approach for distinguishing real cells from background barcodes. As demonstrated in simulated and case study datasets, CB2 has increased power for identifying real cells which allows for the identification of novel subpopulations and improves the precision of downstream analyses.
Messenger RNA expression is important in normal development and differentiation, as well as in manifestation of disease. RNA-seq experiments allow for the identification of differentially expressed (DE) genes and their corresponding isoforms on a genome-wide scale. However, statistical methods are required to ensure that accurate identifications are made. A number of methods exist for identifying DE genes, but far fewer are available for identifying DE isoforms. When isoform DE is of interest, investigators often apply gene-level (count-based) methods directly to estimates of isoform counts. Doing so is not recommended. In short, estimating isoform expression is relatively straightforward for some groups of isoforms, but more challenging for others. This results in estimation uncertainty that varies across isoform groups. Count-based methods were not designed to accommodate this varying uncertainty, and consequently, application of them for isoform inference results in reduced power for some classes of isoforms and increased false discoveries for others.
Taking advantage of the merits of empirical Bayesian methods, we have developed EBSeq for identifying DE isoforms in an RNA-seq experiment comparing two or more biological conditions. Results demonstrate substantially improved power and performance of EBSeq for identifying DE isoforms. EBSeq also proves to be a robust approach for identifying DE genes.
An R package containing examples and sample datasets is available at http://www.biostat.wisc.edu/kendzior/EBSEQ/.
Supplementary data are available at Bioinformatics online.
The ability to quantify cellular heterogeneity is a major advantage of single-cell technologies. However, statistical methods often treat cellular heterogeneity as a nuisance. We present a novel method to characterize differences in expression in the presence of distinct expression states within and among biological conditions. We demonstrate that this framework can detect differential expression patterns under a wide range of settings. Compared to existing approaches, this method has higher power to detect subtle differences in gene expression distributions that are more complex than a mean shift, and can characterize those differences. The freely available R package scDD implements the approach.
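As an illustration of a difference "more complex than a mean shift", the following minimal Python sketch compares a unimodal expression distribution with a bimodal one that has roughly the same mean. It is not the scDD model, which the abstract does not describe in detail; the two-sample Kolmogorov-Smirnov statistic is used here only as a simple stand-in, and the function name `ks_statistic` and the simulated values are hypothetical.

```python
import random


def ks_statistic(x, y):
    """Largest vertical gap between the empirical CDFs of two samples."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # Step both CDFs past every observation equal to v (handles ties).
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


# Simulated log-expression for one gene (assumed values, for illustration).
rng = random.Random(0)
unimodal = [rng.gauss(5, 1) for _ in range(200)]  # one expression state
bimodal = [rng.gauss(2, 1) for _ in range(100)] + [
    rng.gauss(8, 1) for _ in range(100)
]  # two expression states centered on the same overall mean

mean_gap = sum(unimodal) / len(unimodal) - sum(bimodal) / len(bimodal)
ks_gap = ks_statistic(unimodal, bimodal)
print(f"difference in means: {mean_gap:.2f}; KS statistic: {ks_gap:.2f}")
```

A comparison of means sees little difference between the two conditions here, whereas a statistic that compares whole distributions does; characterizing the kind of difference, as the abstract describes, requires modeling the expression states themselves.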
Genome-wide association studies (GWAS) have demonstrated the ability to identify the strongest causal common variants in complex human diseases. However, to date, the massive data generated from GWAS have not been maximally explored to identify true associations that fail to meet the stringent level of association required to achieve genome-wide significance. Genetics of gene expression (GGE) studies have shown promise towards identifying DNA variations associated with disease and providing a path to functionally characterize findings from GWAS. Here, we present the first empiric study to systematically characterize the set of single nucleotide polymorphisms associated with expression (eSNPs) in liver, subcutaneous fat, and omental fat tissues, demonstrating these eSNPs are significantly more enriched for SNPs that associate with type 2 diabetes (T2D) in three large-scale GWAS than a matched set of randomly selected SNPs. This enrichment for T2D association increases as we restrict to eSNPs that correspond to genes comprising gene networks constructed from adipose gene expression data isolated from a mouse population segregating a T2D phenotype. Finally, by restricting to eSNPs corresponding to genes comprising an adipose subnetwork strongly predicted as causal for T2D, we dramatically increased the enrichment for SNPs associated with T2D and were able to identify a functionally related set of diabetes susceptibility genes. We identified and validated malic enzyme 1 (Me1) as a key regulator of this T2D subnetwork in mouse and provided support for the association of this gene to T2D in humans. This integration of eSNPs and networks provides a novel approach to identify disease susceptibility networks rather than the single SNPs or genes traditionally identified through GWAS, thereby extracting additional value from the wealth of data currently being generated by GWAS.
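The enrichment comparison described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the study's procedure: it compares the fraction of eSNPs with a small GWAS p-value against the same fraction in random SNP sets of equal size. The function name `enrichment_vs_random`, the 0.05 threshold, and the number of draws are hypothetical, and the random sets here are not matched on any SNP property, whereas the study used a matched set.

```python
import random


def enrichment_vs_random(gwas_p, esnp_idx, threshold=0.05, n_draws=1000, seed=0):
    """Compare GWAS hit rate among eSNPs with random SNP sets of equal size.

    gwas_p:   list of GWAS p-values, one per SNP
    esnp_idx: indices into gwas_p of the SNPs classified as eSNPs
    Returns (observed hit fraction, fold enrichment, empirical p-value).
    """
    rng = random.Random(seed)
    hit = [p < threshold for p in gwas_p]
    k = len(esnp_idx)
    observed = sum(hit[i] for i in esnp_idx) / k

    # Reference distribution: hit fraction in unmatched random sets of size k.
    null = []
    for _ in range(n_draws):
        draw = rng.sample(range(len(hit)), k)
        null.append(sum(hit[i] for i in draw) / k)

    null_mean = sum(null) / n_draws
    fold = observed / null_mean if null_mean > 0 else float("inf")
    # Add-one correction keeps the empirical p-value strictly positive.
    p_emp = (sum(f >= observed for f in null) + 1) / (n_draws + 1)
    return observed, fold, p_emp
```

Restricting `esnp_idx` to eSNPs whose genes fall in a network or subnetwork of interest, and rerunning the comparison, would mirror the successive restrictions described in the abstract.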
Aging has been associated with widespread changes at the gene expression level in multiple mammalian tissues. We have used high density oligonucleotide arrays and novel statistical methods to identify specific transcriptional classes that may uncover biological processes that play a central role in mammalian aging.
We identified 712 transcripts that are differentially expressed in young (5-month-old) and old (25-month-old) mouse skeletal muscle. Caloric restriction (CR) completely or partially reversed 87% of the changes in expression. Examination of individual genes revealed a transcriptional profile indicative of increased p53 activity in the older muscle. To determine whether the increase in p53 activity is associated with transcriptional activation of apoptotic targets, we performed RT-PCR on four well-known mediators of p53-induced apoptosis: puma, noxa, tnfrsf10b and bok. Expression levels for these proapoptotic genes increased significantly with age (P < 0.05), while CR significantly lowered expression levels for these genes as compared to control-fed old mice (P < 0.05). Age-related induction of p53-related genes was observed in multiple tissues, but was not observed in young SOD2+/- and GPX4+/- mice, suggesting that oxidative stress does not induce the expression of these genes. Western blot analysis confirmed that protein levels for both p21 and GADD45a, two established transcriptional targets of p53, were higher in the older muscle tissue.
These observations support a role for a p53-mediated transcriptional program in mammalian aging and suggest that mechanisms other than reactive oxygen species are involved in the age-related transcriptional activation of p53 targets.
Development of treatments for vocal dysphonia has been inhibited by lack of human vocal fold (VF) mucosa models because of difficulty in procuring VF epithelial cells, epithelial cells' limited proliferative capacity and absence of cell lines. Here we report development of engineered VF mucosae from hiPSC, transfected via TALEN constructs for green fluorescent protein, that mimic development of VF epithelial cells in utero. Modulation of FGF signaling achieves stratified squamous epithelium from definitive and anterior foregut derived cultures. Robust culturing of these cells on collagen-fibroblast constructs produces three-dimensional models comparable to in vivo VF mucosa. Furthermore, we demonstrate mucosal inflammation upon exposure of these constructs to 5% cigarette smoke extract. Upregulation of pro-inflammatory genes in epithelium and fibroblasts leads to aberrant VF mucosa remodeling. Collectively, our results demonstrate that hiPSC-derived VF mucosa is a versatile tool for future investigation of genetic and molecular mechanisms underlying epithelium-fibroblasts interactions in health and disease.
Physical forces, such as mechanical stress, are essential for tissue homeostasis and influence gene expression of cells. In particular, the fibroblast has demonstrated sensitivity to extracellular matrices with assumed adaptation upon various mechanical loads. The purpose of this study was to compare the vocal fold fibroblast genotype, known for its unique mechanically stressful tissue environment, with cellular counterparts at various other anatomic locales to identify differences in functional gene expression profiles.
By using RNA-seq technology, we identified differentially expressed gene programs (DESeq2) among seven normal human fibroblast primary cell lines from healthy cadavers, which included: vocal fold, trachea, lung, abdomen, scalp, upper gingiva, and soft palate. Unsupervised gene expression analysis yielded 6216 genes differentially expressed across all anatomic sites. Hierarchical cluster analysis revealed grouping based on anatomic site origin rather than donor, suggesting global fibroblast phenotype heterogeneity. Sex and age-related effects were negligible. Functional enrichment analyses based on separate post-hoc 2-group comparisons revealed several functional themes within the vocal fold fibroblast related to transcription factors for signaling pathways regulating pluripotency of stem cells and extracellular matrix components such as cell signaling, migration, proliferation, and differentiation potential.
Human fibroblasts display a phenomenon of global topographic differentiation, which is maintained in isolation via in vitro assays. Epigenetic mechanical influences on vocal fold tissue may play a role in uniquely modelling and maintaining the local environmental cellular niche during homeostasis with vocal fold fibroblasts distinctly specialized related to their anatomic positional and developmental origins established during embryogenesis.