The emerging single-cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscapes at the single-cell resolution. ScRNA-seq data analysis is complicated by excess ...zero counts, the so-called dropouts due to low amounts of mRNA sequenced within individual cells. We introduce scImpute, a statistical method to accurately and robustly impute the dropouts in scRNA-seq data. scImpute automatically identifies likely dropouts, and only perform imputation on these values without introducing new biases to the rest data. scImpute also detects outlier cells and excludes them from imputation. Evaluation based on both simulated and real human and mouse scRNA-seq data suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropouts. scImpute is shown to identify likely dropouts, enhance the clustering of cell subpopulations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics.
To investigate molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along the pseudotime inferred from single-cell RNA-sequencing ...data. However, existing methods do not account for pseudotime inference uncertainty, and they have either ill-posed p-values or restrictive models. Here we propose PseudotimeDE, a DE gene identification method that adapts to various pseudotime inference methods, accounts for pseudotime inference uncertainty, and outputs well-calibrated p-values. Comprehensive simulations and real-data applications verify that PseudotimeDE outperforms existing methods in false discovery rate control and power.
Researchers view vast zeros in single-cell RNA-seq data differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as missing data to be ...corrected. To help address the controversy, here we discuss the sources of biological and non-biological zeros; introduce five mechanisms of adding non-biological zeros in computational benchmarking; evaluate the impacts of non-biological zeros on data analysis; benchmark three input data types: observed counts, imputed counts, and binarized counts; discuss the open questions regarding non-biological zeros; and advocate the importance of transparent analysis.
When identifying differentially expressed genes between two conditions using human population RNA-seq samples, we found a phenomenon by permutation analysis: two popular bioinformatics methods, ...DESeq2 and edgeR, have unexpectedly high false discovery rates. Expanding the analysis to limma-voom, NOISeq, dearseq, and Wilcoxon rank-sum test, we found that FDR control is often failed except for the Wilcoxon rank-sum test. Particularly, the actual FDRs of DESeq2 and edgeR sometimes exceed 20% when the target FDR is 5%. Based on these results, for population-level RNA-seq studies with large sample sizes, we recommend the Wilcoxon rank-sum test.
Statistics requantitates the central dogma Li, Jingyi Jessica; Biggin, Mark D
Science (American Association for the Advancement of Science),
03/2015, Letnik:
347, Številka:
6226
Journal Article
Recenzirano
Odprti dostop
Transcription, not translation, chiefly determines protein abundance in mammals
Also see Research Article by
Jovanovic
et al.
Mammalian proteins are expressed at ∼10
3
to 10
8
molecules per cell (
1
...). Differences between cell types, between normal and disease states, and between individuals are largely defined by changes in the abundance of proteins, which are in turn determined by rates of transcription, messenger RNA (mRNA) degradation, translation, and protein degradation. If the rates for one of these steps differ much more than the rates of the other three, that step would be dominant in defining the variation in protein expression. Over the past decade, system-wide studies have claimed that in animals, differences in translation rates predominate (
2
–
5
). On page 1112 of this issue, Jovanovic
et al.
(
6
), as well as recent studies by Battle
et al.
(
7
) and Li
et al.
(
1
), challenge this conclusion, suggesting that transcriptional control makes the larger contribution.
A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability ...of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data-mbImpute-to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. We demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and mbImpute preserves non-zero distributions of taxa abundances.
A pressing challenge in single-cell transcriptomics is to benchmark experimental protocols and computational methods. A solution is to use computational simulators, but existing simulators cannot ...simultaneously achieve three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill this gap, we propose scDesign2, a transparent simulator that achieves all three goals and generates high-fidelity synthetic data for multiple single-cell gene expression count-based technologies. In particular, scDesign2 is advantageous in its transparent use of probabilistic models and its ability to capture gene correlations via copulas.
In single-cell RNA sequencing (scRNA-seq), doublets form when two cells are encapsulated into one reaction volume. The existence of doublets, which appear to be—but are not—real cells, is a key ...confounder in scRNA-seq data analysis. Computational methods have been developed to detect doublets in scRNA-seq data; however, the scRNA-seq field lacks a comprehensive benchmarking of these methods, making it difficult for researchers to choose an appropriate method for specific analyses. We conducted a systematic benchmark study of nine cutting-edge computational doublet-detection methods. Our study included 16 real datasets, which contained experimentally annotated doublets, and 112 realistic synthetic datasets. We compared doublet-detection methods regarding detection accuracy under various experimental settings, impacts on downstream analyses, and computational efficiencies. Our results show that existing methods exhibited diverse performance and distinct advantages in different aspects. Overall, the DoubletFinder method has the best detection accuracy, and the cxds method has the highest computational efficiency. A record of this paper’s transparent peer review process is included in the Supplemental Information.
Display omitted
•Benchmark of nine computational doublet-detection methods for scRNA-seq data•Evaluation based on the most comprehensive set of real data with labeled doublets•Diverse performance and distinct advantages of different doublet-detection methods•DoubletFinder excels in detection accuracy; cxds leads in computational efficiency
We conduct a systematic benchmark study of nine cutting-edge computational doublet-detection methods. We evaluate the methods’ detection accuracy, impacts on downstream analyses, and computational efficiency, using a comprehensive set of real and synthetic data. Although no method dominates in all aspects, the DoubletFinder and cxds methods have the best detection accuracy and computational efficiency, respectively.
Two-dimensional (2D) embedding methods are crucial for single-cell data visualization. Popular methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation ...and projection (UMAP) are commonly used for visualizing cell clusters; however, it is well known that t-SNE and UMAP's 2D embeddings might not reliably inform the similarities among cell clusters. Motivated by this challenge, we present a statistical method, scDEED, for detecting dubious cell embeddings output by a 2D-embedding method. By calculating a reliability score for every cell embedding based on the similarity between the cell's 2D-embedding neighbors and pre-embedding neighbors, scDEED identifies the cell embeddings with low reliability scores as dubious and those with high reliability scores as trustworthy. Moreover, by minimizing the number of dubious cell embeddings, scDEED provides intuitive guidance for optimizing the hyperparameters of an embedding method. We show the effectiveness of scDEED on multiple datasets for detecting dubious cell embeddings and optimizing the hyperparameters of t-SNE and UMAP.