Recent advances in single-cell biotechnologies have resulted in high-dimensional datasets with increased complexity, making feature selection an essential technique for single-cell data analysis. ...Here, we revisit feature selection techniques and summarise recent developments. We review their application to a range of single-cell data types generated from traditional cytometry and imaging technologies and the latest array of single-cell omics technologies. We highlight some of the challenges and future directions and finally consider their scalability and make general recommendations on each type of feature selection method. We hope this review stimulates future research and application of feature selection in the single-cell era.
The identification of genes that vary across spatial domains in tissues and cells is an essential step for spatial transcriptomics data analysis. Given the critical role it serves for downstream data ...interpretations, various methods for detecting spatially variable genes (SVGs) have been proposed. However, the lack of benchmarking complicates the selection of a suitable method.
Here we systematically evaluate a panel of popular SVG detection methods on a large collection of spatial transcriptomics datasets, covering various tissue types, biotechnologies, and spatial resolutions. We address questions including whether different methods select a similar set of SVGs, how reliable is the reported statistical significance from each method, how accurate and robust is each method in terms of SVG detection, and how well the selected SVGs perform in downstream applications such as clustering of spatial domains. Besides these, practical considerations such as computational time and memory usage are also crucial for deciding which method to use.
Our study evaluates the performance of each method from multiple aspects and highlights the discrepancy among different methods when calling statistically significant SVGs across diverse datasets. Overall, our work provides useful considerations for choosing methods for identifying SVGs and serves as a key reference for the future development of related methods.
Single-cell RNA-seq (scRNA-seq) data simulation is critical for evaluating computational methods for analysing scRNA-seq data especially when ground truth is experimentally unattainable. The ...reliability of evaluation depends on the ability of simulation methods to capture properties of experimental data. However, while many scRNA-seq data simulation methods have been proposed, a systematic evaluation of these methods is lacking. We develop a comprehensive evaluation framework, SimBench, including a kernel density estimation measure to benchmark 12 simulation methods through 35 scRNA-seq experimental datasets. We evaluate the simulation methods on a panel of data properties, ability to maintain biological signals, scalability and applicability. Our benchmark uncovers performance differences among the methods and highlights the varying difficulties in simulating data characteristics. Furthermore, we identify several limitations including maintaining heterogeneity of distribution. These results, together with the framework and datasets made publicly available as R packages, will guide simulation methods selection and their future development.
A key task in single-cell RNA-seq (scRNA-seq) data analysis is to accurately detect the number of cell types in the sample, which can be critical for downstream analyses such as cell type ...identification. Various scRNA-seq data clustering algorithms have been specifically designed to automatically estimate the number of cell types through optimising the number of clusters in a dataset. The lack of benchmark studies, however, complicates the choice of the methods.
We systematically benchmark a range of popular clustering algorithms on estimating the number of cell types in a variety of settings by sampling from the Tabula Muris data to create scRNA-seq datasets with a varying number of cell types, varying number of cells in each cell type, and different cell type proportions. The large number of datasets enables us to assess the performance of the algorithms, covering four broad categories of approaches, from various aspects using a panel of criteria. We further cross-compared the performance on datasets with high cell numbers using Tabula Muris and Tabula Sapiens data.
We identify the strengths and weaknesses of each method on multiple criteria including the deviation of estimation from the true number of cell types, variability of estimation, clustering concordance of cells to their predefined cell types, and running time and peak memory usage. We then summarise these results into a multi-aspect recommendation to the users. The proposed stability-based approach for estimating the number of cell types is implemented in an R package and is freely available from ( https://github.com/PYangLab/scCCESS ).
Automated cell type identification is a key computational challenge in single‐cell RNA‐sequencing (scRNA‐seq) data. To capitalise on the large collection of well‐annotated scRNA‐seq datasets, we ...developed scClassify, a multiscale classification framework based on ensemble learning and cell type hierarchies constructed from single or multiple annotated datasets as references. scClassify enables the estimation of sample size required for accurate classification of cell types in a cell type hierarchy and allows joint classification of cells when multiple references are available. We show that scClassify consistently performs better than other supervised cell type classification methods across 114 pairs of reference and testing data, representing a diverse combination of sizes, technologies and levels of complexity, and further demonstrate the unique components of scClassify through simulations and compendia of experimental datasets. Finally, we demonstrate the scalability of scClassify on large single‐cell atlases and highlight a novel application of identifying subpopulations of cells from the Tabula Muris data that were unidentified in the original publication. Together, scClassify represents state‐of‐the‐art methodology in automated cell type identification from scRNA‐seq data.
Synopsis
scClassify is a multiscale classification framework based on ensemble learning and cell type hierarchies, enabling sample size estimation required for accurate cell type classification and joint classification of cells using multiple references.
scClassify performs multiscale cell type classification based on cell type hierarchies constructed from single or multiple reference datasets.
It implements a post‐hoc clustering procedure for discovering novel cell types from cells that are unassigned due to the absence of their types in the reference data.
It enables the estimation of the number of cells required in a reference dataset to accurately discriminate a given cell type in a cell type hierarchy.
Application to large atlas datasets such as Tabula Muris demonstrates its ability to refine cell types and identify cells from sub‐populations.
scClassify is a multiscale classification framework based on ensemble learning and cell type hierarchies, enabling sample size estimation required for accurate cell type classification and joint classification of cells using multiple references.
Concerted examination of multiple collections of single-cell RNA sequencing (RNA-seq) data promises further biological insights that cannot be uncovered with individual datasets. Here we present ...scMerge, an algorithm that integrates multiple single-cell RNA-seq datasets using factor analysis of stably expressed genes and pseudoreplicates across datasets. Using a large collection of public datasets, we benchmark scMerge against published methods and demonstrate that it consistently provides improved cell type separation by removing unwanted factors; scMerge can also enhance biological discovery through robust data integration, which we show through the inference of development trajectory in a liver dataset collection.
Abstract
Background
Feature selection is an essential task in single-cell RNA-seq (scRNA-seq) data analysis and can be critical for gene dimension reduction and downstream analyses, such as gene ...marker identification and cell type classification. Most popular methods for feature selection from scRNA-seq data are based on the concept of differential distribution wherein a statistical model is used to detect changes in gene expression among cell types. Recent development of deep learning-based feature selection methods provides an alternative approach compared to traditional differential distribution-based methods in that the importance of a gene is determined by neural networks.
Results
In this work, we explore the utility of various deep learning-based feature selection methods for scRNA-seq data analysis. We sample from Tabula Muris and Tabula Sapiens atlases to create scRNA-seq datasets with a range of data properties and evaluate the performance of traditional and deep learning-based feature selection methods for cell type classification, feature selection reproducibility and diversity, and computational time.
Conclusions
Our study provides a reference for future development and application of deep learning-based feature selection methods for single-cell omics data analyses.
Mammary stem/progenitor cells (MaSCs) maintain self-renewal of the mammary epithelium during puberty and pregnancy. DNA methylation provides a potential epigenetic mechanism for maintaining cellular ...memory during self-renewal. Although DNA methyltransferases (DNMTs) are dispensable for embryonic stem cell maintenance, their role in maintaining MaSCs and cancer stem cells (CSCs) in constantly replenishing mammary epithelium is unclear. Here we show that DNMT1 is indispensable for MaSC maintenance. Furthermore, we find that DNMT1 expression is elevated in mammary tumours, and mammary gland-specific DNMT1 deletion protects mice from mammary tumorigenesis by limiting the CSC pool. Through genome-scale methylation studies, we identify ISL1 as a direct DNMT1 target, hypermethylated and downregulated in mammary tumours and CSCs. DNMT inhibition or ISL1 expression in breast cancer cells limits CSC population. Altogether, our studies uncover an essential role for DNMT1 in MaSC and CSC maintenance and identify DNMT1-ISL1 axis as a potential therapeutic target for breast cancer treatment.
Cell type-specific master transcription factors (TFs) play vital roles in defining cell identity and function. However, the roles ubiquitous factors play in the specification of cell identity remain ...underappreciated. Here we show that the ubiquitous CCAAT-binding NF-Y complex is required for the maintenance of embryonic stem cell (ESC) identity and is an essential component of the core pluripotency network. Genome-wide studies in ESCs and neurons reveal that NF-Y regulates not only genes with housekeeping functions through cell type-invariant promoter-proximal binding, but also genes required for cell identity by binding to cell type-specific enhancers with master TFs. Mechanistically, NF-Y’s distinct DNA-binding mode promotes master/pioneer TF binding at enhancers by facilitating a permissive chromatin conformation. Our studies unearth a conceptually unique function for histone-fold domain (HFD) protein NF-Y in promoting chromatin accessibility and suggest that other HFD proteins with analogous structural and DNA-binding properties may function in similar ways.
Display omitted
•NF-Y uses distinct modes to regulate housekeeping and cell-identity programs•NF-Y co-occupies active enhancers with cell type-specific master TFs•NF-Y promotes master TF binding by facilitating a favorable chromatin conformation•NF-Y is required for the maintenance of ESC identity
The ubiquitously expressed NF-Y regulates key cell-cycle genes by binding to their proximal promoters. Oldfield et al. demonstrate a cell type-specific role for NF-Y at active enhancers, where it promotes chromatin accessibility for cell type-specific master transcription factors through its distinct DNA-binding mode.
Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data ...analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification.
Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used.
Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS.