Abstract
Motivation
Quality control (QC) is a critical step in single-cell RNA-seq (scRNA-seq) data analysis. Low-quality cells are removed from the analysis during the QC process to avoid misinterpretation of the data. An important QC metric is the mitochondrial proportion (mtDNA%), which is used as a threshold to filter out low-quality cells. Early publications in the field established a threshold of 5% and since then, it has been used as a default in several software packages for scRNA-seq data analysis, and adopted as a standard in many scRNA-seq studies. However, the validity of using a uniform threshold across different species, single-cell technologies, tissues and cell types has not been adequately assessed.
Results
We systematically analyzed 5 530 106 cells reported in 1349 annotated datasets available in the PanglaoDB database and found that the average mtDNA% in scRNA-seq data across human tissues is significantly higher than in mouse tissues. This difference is not confounded by the platform used to generate the data. Based on this finding, we propose new reference values of the mtDNA% for 121 tissues of mouse and 44 tissues of humans. In general, for mouse tissues, the 5% threshold performs well to distinguish between healthy and low-quality cells. However, for human tissues, the 5% threshold should be reconsidered as it fails to accurately discriminate between healthy and low-quality cells in 29.5% (13 of 44) tissues analyzed. We conclude that omitting the mtDNA% QC filter or adopting a suboptimal mtDNA% threshold may lead to erroneous biological interpretations of scRNA-seq data.
Availability and implementation
The code used to download datasets, perform the analyses and produce the figures is available at https://github.com/dosorio/mtProportion.
Supplementary information
Supplementary data are available at Bioinformatics online.
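The mtDNA% filter discussed in the abstract above can be illustrated with a minimal sketch. It is not the authors' code (that lives in the repository linked above). The function names, the toy counts, and the alternative 10% cut-off are hypothetical and used only for illustration. The sketch assumes mitochondrial genes are recognizable by the gene-symbol prefix `MT-` (human) or `mt-` (mouse).

```python
from typing import Dict, List


def mt_percent(counts: Dict[str, int], mt_prefix: str = "MT-") -> float:
    """Percentage of a cell's total counts that map to mitochondrial genes.

    Assumption: mitochondrial genes are identified by a symbol prefix,
    matched case-insensitively so that "MT-" and "mt-" both work.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    prefix = mt_prefix.upper()
    mt = sum(c for gene, c in counts.items() if gene.upper().startswith(prefix))
    return 100.0 * mt / total


def filter_cells(cells: Dict[str, Dict[str, int]], threshold: float) -> List[str]:
    """Return the IDs of cells whose mtDNA% does not exceed the threshold."""
    return [cell_id for cell_id, counts in cells.items()
            if mt_percent(counts) <= threshold]


# Toy data (hypothetical counts): three cells with 3%, 8% and 30% mtDNA.
cells = {
    "cell_a": {"MT-CO1": 2, "MT-ND1": 1, "ACTB": 60, "GAPDH": 37},
    "cell_b": {"MT-CO1": 8, "ACTB": 50, "GAPDH": 42},
    "cell_c": {"MT-CO1": 30, "ACTB": 40, "GAPDH": 30},
}

print(filter_cells(cells, 5.0))   # ['cell_a']
print(filter_cells(cells, 10.0))  # ['cell_a', 'cell_b'] (illustrative cut-off only)
```

The two calls show why the choice of threshold matters: a cell with 8% mitochondrial counts is discarded under the 5% default but retained under a more permissive, tissue-specific value.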
Abstract
Motivation
Single-cell RNA sequencing (scRNA-seq) technology has revolutionized the way research is done in biomedical sciences. It provides an unprecedented level of resolution across individual cells for studying cell heterogeneity and gene expression variability. Analyzing scRNA-seq data is challenging though, due to the sparsity and high dimensionality of the data.
Results
I developed scGEAToolbox, a Matlab toolbox for scRNA-seq data analysis. It contains a comprehensive set of functions for data normalization, feature selection, batch correction, imputation, cell clustering, trajectory/pseudotime analysis and network construction, which can be combined and integrated to build custom workflows. Although most of the functions are implemented in native Matlab, wrapper functions are provided to allow users to call third-party tools developed in Matlab or other languages. Furthermore, scGEAToolbox is equipped with graphical user interfaces built with App Designer, making it an easy-to-use application for quick data processing.
Availability and implementation
https://github.com/jamesjcai/scGEAToolbox.
Supplementary information
Supplementary data are available at Bioinformatics online.
Summary
We propose an empirical Bayes formulation of the structure learning problem, where the prior specification assumes that all node variables have the same error variance, an assumption known to ensure the identifiability of the underlying causal directed acyclic graph. To facilitate efficient posterior computation, we approximate the posterior probability of each ordering by that of a best directed acyclic graph model, which naturally leads to an order-based Markov chain Monte Carlo algorithm. Strong selection consistency for our model in high-dimensional settings is proved under a condition that allows heterogeneous error variances, and the mixing behaviour of our sampler is theoretically investigated. Furthermore, we propose a new iterative top-down algorithm, which quickly yields an approximate solution to the structure learning problem and can be used to initialize the Markov chain Monte Carlo sampler. We demonstrate that our method outperforms other state-of-the-art algorithms under various simulation settings, and conclude the paper with a single-cell real-data study illustrating practical advantages of the proposed method.
Abstract
Motivation
Characterizing cells with rare molecular phenotypes is one of the promises of high-throughput single-cell RNA sequencing (scRNA-seq) techniques. However, collecting enough cells with the desired molecular phenotype in a single experiment is challenging, requiring several sample preprocessing steps to filter and collect the desired cells experimentally before sequencing. Data integration of multiple public single-cell experiments stands as a solution to this problem, allowing the collection of enough cells exhibiting the desired molecular signatures. By increasing the sample size of the desired cell type, this approach enables a robust cell type transcriptome characterization.
Results
Here, we introduce rPanglaoDB, an R package to download and merge the uniformly processed and annotated scRNA-seq data provided by the PanglaoDB database. To show the potential of rPanglaoDB for collecting rare cell types by integrating multiple public datasets, we present a biological application collecting and characterizing a set of 157 fibrocytes. Fibrocytes are a rare monocyte-derived cell type that exhibits both the inflammatory features of macrophages and the tissue remodeling properties of fibroblasts. This constitutes the first report of an unbiased transcriptome profile of fibrocytes. We compared the transcriptomic profile of the fibrocytes against the fibroblasts collected from the same tissue samples and confirmed their association with healing processes in tissue damage and infection through the activation of the prostaglandin biosynthesis and regulation pathway.
Availability and implementation
rPanglaoDB is implemented as an R package available through the CRAN repositories https://CRAN.R-project.org/package=rPanglaoDB.
In the fungal pathogen Cryptococcus neoformans, the switch from yeast to hypha is an important morphological process preceding the meiotic events during sexual development. Morphotype is also known to be associated with cryptococcal virulence potential. Previous studies identified the regulator Znf2 as a key decision maker for hypha formation and as an anti-virulence factor. By a forward genetic screen, we discovered that a long non-coding RNA (lncRNA) RZE1 functions upstream of ZNF2 in regulating yeast-to-hypha transition. We demonstrate that RZE1 functions primarily in cis and less effectively in trans. Interestingly, RZE1's function is restricted to its native nucleus. Accordingly, RZE1 does not appear to directly affect Znf2 translation or the subcellular localization of Znf2 protein. Transcriptome analysis indicates that the loss of RZE1 reduces the transcript level of ZNF2 and Znf2's prominent downstream targets. In addition, microscopic examination using single molecule fluorescent in situ hybridization (smFISH) indicates that the loss of RZE1 increases the ratio of ZNF2 transcripts in the nucleus versus those in the cytoplasm. Taken together, this lncRNA controls Cryptococcus yeast-to-hypha transition through regulating the key morphogenesis regulator Znf2. This is the first functional characterization of a lncRNA in a human fungal pathogen. Given the potentially large number of lncRNAs in the genomes of Cryptococcus and other fungal pathogens, the findings implicate lncRNAs as an additional layer of genetic regulation during fungal development that may well contribute to the complexity in these "simple" eukaryotes.
Expression quantitative trait loci (eQTL) studies have established convincing relationships between genetic variants and gene expression. Most of these studies focused on the mean of gene expression level, but not the variance of gene expression level (i.e., gene expression variability). In the present study, we systematically explore genome-wide association between genetic variants and gene expression variability in humans. We adapt the double generalized linear model (dglm) to simultaneously fit the means and the variances of gene expression among the three possible genotypes of a biallelic SNP. The genomic loci showing significant association between the variances of gene expression and the genotypes are termed expression variability QTL (evQTL). Using a data set of gene expression in lymphoblastoid cell lines (LCLs) derived from 210 HapMap individuals, we identify cis-acting evQTL involving 218 distinct genes, among which 8 genes, ADCY1, CTNNA2, DAAM2, FERMT2, IL6, PLOD2, SNX7, and TNFRSF11B, are cross-validated using an extra expression data set of the same LCLs. We also identify ∼300 trans-acting evQTL between >13,000 common SNPs and 500 randomly selected representative genes. We employ two distinct scenarios, emphasizing single-SNP and multiple-SNP effects on expression variability, to explain the formation of evQTL. We argue that detecting evQTL may represent a novel method for effectively screening for genetic interactions, especially when the multiple-SNP influence on expression variability is implied. The implication of our results for revealing genetic mechanisms of gene expression variability is discussed.
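The idea of an evQTL, a locus where the spread of expression (rather than its mean) differs among genotypes, can be illustrated with a small sketch. This is not the dglm used in the study. As a simpler stand-in, it computes a Brown-Forsythe statistic (a median-based Levene-type test) across the three genotype groups and obtains a p-value by permuting genotype labels. All function names and expression values are hypothetical.

```python
import random
from statistics import mean, median
from typing import Dict, List, Sequence


def brown_forsythe_stat(groups: Sequence[Sequence[float]]) -> float:
    """F-like statistic comparing spread across groups.

    Each value is replaced by its absolute deviation from its group median,
    then a one-way ANOVA F statistic is computed on those deviations.
    """
    devs = []
    for g in groups:
        med = median(g)
        devs.append([abs(x - med) for x in g])
    n_total = sum(len(d) for d in devs)
    k = len(devs)
    grand = sum(sum(d) for d in devs) / n_total
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in devs)
    within = sum((z - mean(d)) ** 2 for d in devs for z in d)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return ((n_total - k) / (k - 1)) * between / within


def permutation_pvalue(expression: Sequence[float], genotypes: Sequence[int],
                       n_perm: int = 1000, seed: int = 0) -> float:
    """Permutation p-value for the variance difference among genotypes."""
    rng = random.Random(seed)

    def stat(labels: Sequence[int]) -> float:
        groups: Dict[int, List[float]] = {}
        for value, label in zip(expression, labels):
            groups.setdefault(label, []).append(value)
        return brown_forsythe_stat(list(groups.values()))

    observed = stat(genotypes)
    labels = list(genotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any genotype-expression link
        if stat(labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: same central expression for all genotypes, increasing spread.
g0 = [10.0, 10.1, 9.9, 10.05, 9.95, 10.0, 10.1, 9.9]   # genotype 0
g1 = [10.0, 11.0, 9.0, 10.5, 9.5, 10.8, 9.2, 10.0]     # genotype 1
g2 = [10.0, 13.0, 7.0, 12.0, 8.0, 14.0, 6.0, 10.0]     # genotype 2
expression = g0 + g1 + g2
genotypes = [0] * len(g0) + [1] * len(g1) + [2] * len(g2)

print(round(brown_forsythe_stat([g0, g1, g2]), 2))
print(permutation_pvalue(expression, genotypes))
```

A mean-based eQTL test would find nothing in this toy example because all three groups share the same median. The variance-based statistic is what separates them. Unlike this sketch, a dglm fits the mean and the dispersion jointly.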
Duchenne muscular dystrophy (DMD) causes progressive disability in 1 of every 5,000 boys due to the lack of functional dystrophin protein. Despite much advancement in knowledge about DMD disease presentation and progression (attributable in part to studies using mouse and canine models of the disease), current DMD treatments are not equally effective in all patients. There remains, therefore, a need for translational animal models in which novel treatment targets can be identified and evaluated. Golden Retriever muscular dystrophy (GRMD) is a phenotypically and genetically homologous animal model of DMD. As with DMD, speed of disease progression in GRMD varies substantially. However, unlike DMD, all GRMD dogs possess the same causal mutation; therefore genetic modifiers of phenotypic variation are relatively easier to identify. Furthermore, the GRMD dogs used in this study reside within the same colony, reducing the confounding effects of environment on phenotypic variation. To detect modifiers of disease progression, we developed gene expression profiles using RNA sequencing for 9 dogs: 6 GRMD dogs (3 with faster-progressing and 3 with slower-progressing disease, based on quantitative, objective biomarkers) and 3 control dogs from the same colony. All dogs were evaluated at 2 time points: early disease onset (3 months of age) and the point at which GRMD stabilizes (6 months of age) using quantitative, objective biomarkers identified as robust against the effects of relatedness/inbreeding. Across all comparisons, the most differentially expressed genes fell into 3 categories: myogenesis/muscle regeneration, metabolism, and inflammation. Our findings are largely in concordance with DMD and mouse model studies, reinforcing the utility of GRMD as a translational model. Novel findings include the strong up-regulation of chitinase 3-like 1 (CHI3L1) in faster-progressing GRMD dogs, suggesting previously unexplored mechanisms underlie progression speed in GRMD and DMD.
In summary, our findings support the utility of RNA sequencing for evaluating potential biomarkers of GRMD progression speed, and are valuable for identifying new avenues of exploration in DMD research.
The metastasis-associated lung adenocarcinoma transcript 1 (MALAT1) is a long noncoding RNA and its overexpression is associated with the development of many types of malignancy. MALAT1 null mice show no overt phenotype. However, in transcriptome analysis of MALAT1 null mice we found significant upregulation of nuclear factor-erythroid 2 p45-related factor 2 (Nrf2) regulated antioxidant genes including Nqo1 and Cat with significant reduction in reactive oxygen species (ROS) and greatly reduced ROS-generated protein carbonylation in hepatocytes and islets. We performed lncRNA pulldown assay using biotinylated antisense oligonucleotides against MALAT1 and found MALAT1 interacted with Nrf2, suggesting Nrf2 is transcriptionally regulated by MALAT1. Exposure to excessive ROS has been shown to cause insulin resistance through activation of c-Jun N-terminal kinase (JNK) which leads to inhibition of insulin receptor substrate 1 (IRS-1) and insulin-induced phosphorylation of serine/threonine kinase Akt. We found MALAT1 ablation suppressed JNK activity with concomitant insulin-induced activation of IRS-1 and phosphorylation of Akt, suggesting MALAT1 regulated insulin responses. MALAT1 null mice exhibited sensitized insulin-signaling response to fast-refeeding and glucose/insulin challenges and significantly increased insulin secretion in response to glucose challenge in isolated MALAT1 null islets, suggesting an increased insulin sensitivity. In summary, we demonstrate that MALAT1 plays an important role in regulating insulin sensitivity and has potential as a therapeutic target for the treatment of diabetes as well as other diseases caused by excessive exposure to ROS.
Gene expression as an intermediate molecular phenotype has been a focus of research interest. In particular, studies of expression quantitative trait loci (eQTL) have offered promise for understanding gene regulation through the discovery of genetic variants that explain variation in gene expression levels. Existing eQTL methods are designed for assessing the effects of common variants, but not rare variants. Here, we address the problem by establishing a novel analytical framework for evaluating the effects of rare or private variants on gene expression. Our method starts from the identification of outlier individuals that show markedly different gene expression from the majority of a population, and then reveals the contributions of private SNPs to the aberrant gene expression in these outliers. Using population-scale mRNA sequencing data, we identify outlier individuals using a multivariate approach. We find that outlier individuals are more readily detected with respect to gene sets that include genes involved in cellular regulation and signal transduction, and less likely to be detected with respect to the gene sets with genes involved in metabolic pathways and other fundamental molecular functions. Analysis of polymorphic data suggests that private SNPs of outlier individuals are enriched in the enhancer and promoter regions of corresponding aberrantly-expressed genes, suggesting a specific regulatory role of private SNPs, while the commonly-occurring regulatory genetic variants (i.e., eQTL SNPs) show little evidence of involvement. Additional data suggest that non-genetic factors may also underlie aberrant gene expression. Taken together, our findings advance a novel viewpoint relevant to situations wherein common eQTLs fail to predict gene expression when heritable, rare inter-individual variation exists.
The analytical framework we describe, taking into consideration the reality of differential phenotypic robustness, may be valuable for investigating complex traits and conditions.
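The abstract above does not specify which multivariate approach is used to identify outlier individuals, so the following is only a sketch of one common choice, the Mahalanobis distance, applied to a hypothetical two-gene set. The function names and expression values are illustrative. The sketch assumes that squared distances roughly follow a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exp(-d²/2).

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple


def mahalanobis_sq_2d(points: Sequence[Tuple[float, float]]) -> List[float]:
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    out = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the closed-form inverse of a 2x2 matrix.
        out.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return out


def flag_outliers(points: Sequence[Tuple[float, float]],
                  alpha: float = 0.01) -> List[int]:
    """Indices of individuals whose approximate tail probability is below alpha."""
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points))
            if math.exp(-d2 / 2.0) < alpha]


# Toy data: 18 individuals whose two genes are tightly co-expressed,
# plus one individual (index 18) that breaks the co-expression pattern.
individuals = [(float(x), x + s) for x in range(1, 10) for s in (0.5, -0.5)]
individuals.append((5.0, 9.0))

print(flag_outliers(individuals))  # [18]
```

The flagged individual is unremarkable for either gene alone (both values lie within the observed ranges). It is aberrant only relative to the joint pattern, which is the point of using a multivariate criterion. Because the outlier itself inflates the covariance estimate, a robust estimator would usually be preferred in practice.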