Abstract
Motivation
Single-cell RNA sequencing (scRNA-seq) technologies enable the study of transcriptional heterogeneity at the resolution of individual cells and have an increasing impact on biomedical research. However, it is known that these methods sometimes wrongly consider two or more cells as single cells, and that a number of so-called doublets is present in the output of such experiments. Treating doublets as single cells in downstream analyses can severely bias a study’s conclusions, and therefore computational strategies for the identification of doublets are needed.
Results
With scds, we propose two new approaches for in silico doublet identification: Co-expression based doublet scoring (cxds) and binary classification based doublet scoring (bcds). The co-expression based approach, cxds, utilizes binarized (absence/presence) gene expression data and, employing a binomial model for the co-expression of pairs of genes, yields interpretable doublet annotations. bcds, on the other hand, uses a binary classification approach to discriminate artificial doublets from original data. We apply our methods and existing computational doublet identification approaches to four datasets with experimental doublet annotations and find that our methods perform at least as well as the state of the art, at comparably little computational cost. We observe appreciable differences between methods and across datasets and that no approach dominates all others. In summary, scds presents a scalable, competitive approach that allows for doublet annotation of datasets with thousands of cells in a matter of seconds.
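The co-expression idea behind cxds can be illustrated with a minimal sketch. It assumes a toy scoring rule that is not taken from the scds package: for each gene pair, a binomial null (genes detected independently) gives the probability of seeing at least as many cells with exactly one of the two genes present as were observed, and a cell's score sums the negative log of that probability over the pairs it co-expresses. The function names and the toy count matrix are hypothetical, and the actual scds implementation may differ in its details.

```python
import math
from itertools import combinations


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by direct summation (fine for small n)."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def coexpression_doublet_scores(counts):
    """Toy co-expression doublet score (an assumption, not the scds code).

    counts: list of cells, each a list of per-gene counts.
    Returns one score per cell; higher means more doublet-like.
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    # Binarize to presence/absence of each gene in each cell.
    present = [[1 if c > 0 else 0 for c in cell] for cell in counts]
    # Fraction of cells in which each gene is detected.
    freq = [sum(cell[g] for cell in present) / n_cells for g in range(n_genes)]

    pair_score = {}
    for i, j in combinations(range(n_genes), 2):
        # Null probability that exactly one of the two genes is present.
        p_excl = freq[i] * (1 - freq[j]) + (1 - freq[i]) * freq[j]
        # Observed number of cells with exactly one of the two genes present.
        k_excl = sum(1 for cell in present if cell[i] != cell[j])
        tail = binom_upper_tail(k_excl, n_cells, p_excl)
        # Pairs that are mutually exclusive more often than expected score high.
        pair_score[(i, j)] = -math.log(max(tail, 1e-300))

    # A cell accumulates the scores of every gene pair it co-expresses.
    return [
        sum(s for (i, j), s in pair_score.items() if cell[i] and cell[j])
        for cell in present
    ]


# Hypothetical data: two marker genes that rarely co-occur; the last cell has both.
toy_counts = [[5, 0], [3, 0], [4, 0], [0, 2], [0, 6], [0, 1], [2, 3]]
toy_scores = coexpression_doublet_scores(toy_counts)
```

In this toy example only the last cell co-expresses the two otherwise mutually exclusive genes, so it is the only cell with a non-zero score. The gene pairs contributing to a cell's score can be listed directly, which is one way to read the "interpretable" annotations mentioned above.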
Availability and implementation
scds is implemented as a Bioconductor R package (doi: 10.18129/B9.bioc.scds).
Supplementary information
Supplementary data are available at Bioinformatics online.
Non-coding gene regulatory enhancers are essential to transcription in mammalian cells. As a result, a large variety of experimental and computational strategies have been developed to identify cis-regulatory enhancer sequences. Given the differences in the biological signals assayed, some variation in the enhancers identified by different methods is expected; however, the concordance of enhancers identified by different methods has not been comprehensively evaluated. This is critically needed, since in practice, most studies consider enhancers identified by only a single method. Here, we compare enhancer sets from eleven representative strategies in four biological contexts.
All sets we evaluated overlap significantly more than expected by chance; however, there is significant dissimilarity in their genomic, evolutionary, and functional characteristics, both at the element and base-pair level, within each context. The disagreement is sufficient to influence interpretation of candidate SNPs from GWAS, and to lead to disparate conclusions about enhancer and disease mechanisms. Most regions identified as enhancers are supported by only one method, and we find limited evidence that regions identified by multiple methods are better candidates than those identified by a single method. As a result, we cannot recommend the use of any single enhancer identification strategy in all settings.
Our results highlight the inherent complexity of enhancer biology and identify an important challenge to mapping the genetic architecture of complex disease. Greater appreciation of how the diverse enhancer identification strategies in use today relate to the dynamic activity of gene regulatory regions is needed to enable robust and reproducible results.
Heart development is a continuous process involving significant remodeling during embryogenesis and neonatal stages. To date, several groups have used single-cell sequencing to characterize the heart transcriptomes but failed to capture the progression of heart development at most stages. This has left gaps in understanding the contribution of each cell type across cardiac development. Here, we report the transcriptional profile of the murine heart from early embryogenesis to late neonatal stages. Through further analysis of this dataset, we identify several transcriptional features. We identify gene expression modules enriched at early embryonic and neonatal stages; multiple cell types in the left and right atria are transcriptionally distinct at neonatal stages; many congenital heart defect-associated genes have cell type-specific expression; stage-unique ligand-receptor interactions are mostly between epicardial cells and other cell types at neonatal stages; and mutants of epicardium-expressed genes Wt1 and Tbx18 have different heart defects. This dataset serves as an invaluable source of information for studies of heart development.
Gene-regulatory enhancers have been identified using various approaches, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background and then predicting their tissue specificity. EnhancerFinder uses a multiple kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and diverse functional genomics datasets from a variety of cell types. In contrast with prediction approaches that define enhancers based on histone marks or p300 sites from a single cell line, we trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser. We comprehensively evaluated EnhancerFinder using cross validation and found that our integrative method improves the identification of enhancers over approaches that consider a single type of data, such as sequence motifs, evolutionary conservation, or the binding of enhancer-associated proteins. We find that VISTA enhancers active in embryonic heart are easier to identify than enhancers active in several other embryonic tissues, likely due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and lead SNPs from genome-wide association studies. We demonstrate the utility of EnhancerFinder predictions through in vivo validation of novel embryonic gene regulatory enhancers from three developmental transcription factor loci.
Our genome-wide developmental enhancer predictions are freely available as a UCSC Genome Browser track, which we hope will enable researchers to further investigate questions in developmental biology.
Atoh7 is transiently expressed in retinal progenitor cells (RPCs) and is required for retinal ganglion cell (RGC) differentiation. In humans, a deletion in a distal non-coding regulatory region upstream of ATOH7 is associated with optic nerve atrophy and blindness. Here, we functionally interrogate the significance of the Atoh7 regulatory landscape to retinogenesis in mice. Deletion of the Atoh7 enhancer structure leads to RGC deficiency, optic nerve hypoplasia, and retinal blood vascular abnormalities, phenocopying inactivation of Atoh7. Further, loss of the Atoh7 remote enhancer impacts ipsilaterally projecting RGCs and disrupts proper axonal projections to the visual thalamus. Deletion of the Atoh7 remote enhancer is also associated with the dysregulation of axonogenesis genes, including the derepression of the axon repulsive cue Robo3. Our data provide insights into how Atoh7 enhancer elements function to promote RGC development and optic nerve formation and highlight a key role of Atoh7 in the transcriptional control of axon guidance molecules.
• The Atoh7 cis-regulatory landscape is required for RGC development and optic nerve formation
• Deletion of the Atoh7 remote enhancer impacts ipsilateral RGCs and retinotectal mapping
• Loss of Atoh7 leads to derepression of the axonal guidance cue Robo3
In this study, Mehta et al. dissect the roles of Atoh7 cis-regulatory elements during retinogenesis. They demonstrate that deleting the enhancer landscape upstream of Atoh7 phenocopies Atoh7 mutant mice histologically and transcriptionally. They also found that loss of the remote enhancer affects ipsilateral RGCs and their projections to the brain.
Elimination of the proliferating germline extends lifespan in C. elegans. This phenomenon provides a unique platform to understand how complex metazoans retain metabolic homeostasis when challenged with major physiological perturbations. Here, we demonstrate that two conserved transcription regulators essential for the longevity of germline-less adults, DAF-16/FOXO3A and TCER-1/TCERG1, concurrently enhance the expression of multiple genes involved in lipid synthesis and breakdown, and that both gene classes promote longevity. Lipidomic analyses revealed that key lipogenic processes, including de novo fatty acid synthesis, triglyceride production, desaturation and elongation, are augmented upon germline removal. Our data suggest that lipid anabolic and catabolic pathways are coordinately augmented in response to germline loss, and this metabolic shift helps preserve lipid homeostasis. DAF-16 and TCER-1 also perform essential inhibitory functions in germline-ablated animals. TCER-1 inhibits the somatic gene-expression program that facilitates reproduction and represses anti-longevity genes, whereas DAF-16 impedes ribosome biogenesis. Additionally, we discovered that TCER-1 is critical for optimal fertility in normal adults, suggesting that the protein acts as a switch supporting reproductive fitness or longevity depending on the presence or absence of the germline. Collectively, our data offer insights into how organisms adapt to changes in reproductive status, by utilizing the activating and repressive functions of transcription factors and coordinating fat production and degradation.
Stem cells are defined as self-renewing cell populations that can differentiate into multiple distinct cell types. However, hundreds of different human cell lines from embryonic, fetal and adult sources have been called stem cells, even though they range from pluripotent cells (typified by embryonic stem cells, which are capable of virtually unlimited proliferation and differentiation) to adult stem cell lines, which can generate a far more limited repertoire of differentiated cell types. The rapid increase in reports of new sources of stem cells and their anticipated value to regenerative medicine has highlighted the need for a general, reproducible method for classification of these cells. We report here the creation and analysis of a database of global gene expression profiles (which we call the 'stem cell matrix') that enables the classification of cultured human stem cells in the context of a wide variety of pluripotent, multipotent and differentiated cell types. Using an unsupervised clustering method to categorize a collection of ∼150 cell samples, we discovered that pluripotent stem cell lines group together, whereas other cell types, including brain-derived neural stem cell lines, are very diverse. Using further bioinformatic analysis we uncovered a protein-protein network (PluriNet) that is shared by the pluripotent cells (embryonic stem cells, embryonal carcinomas and induced pluripotent cells). Analysis of published data showed that the PluriNet seems to be a common characteristic of pluripotent cells, including mouse embryonic stem and induced pluripotent cells and human oocytes. Our results offer a new strategy for classifying stem cells and support the idea that pluripotency and self-renewal are under tight control by specific molecular networks.
Motivation: Standard analysis routines for microarray data aim at differentially expressed genes. In this paper, we address the complementary problem of detecting sets of differentially co-expressed genes in two phenotypically distinct sets of expression profiles. Results: We introduce a score for differential co-expression and suggest a computationally efficient algorithm for finding high scoring sets of genes. The use of our novel method is demonstrated in the context of simulations and on real expression data from a clinical study.
The kidney is a complex organ composed of more than 30 terminally differentiated cell types, all of which are required to perform its numerous homeostatic functions. Defects in kidney development are a significant cause of chronic kidney disease in children, which can lead to kidney failure that can only be treated by transplant or dialysis. A better understanding of molecular mechanisms that drive kidney development is important for designing strategies to enhance renal repair and regeneration. In this study, we profiled gene expression in the developing mouse kidney at embryonic day 14.5 at single-cell resolution. Consistent with previous studies, clusters with distinct transcriptional signatures clearly identify major compartments and cell types of the developing kidney. Cell cycle activity distinguishes between the "primed" and "self-renewing" sub-populations of nephron progenitors, with increased expression of the cell cycle-related genes Birc5, Cdca3, Smc2 and Smc4 in "primed" nephron progenitors. In addition, augmented expression of the cell cycle-related genes Birc5, Cks2, Ccnb1, Ccnd1 and Tuba1a/b was detected in immature distal tubules, suggesting cell cycle regulation may be required for early events of nephron patterning and tubular fusion between the distal nephron and collecting duct epithelia.
Non-protein-coding genetic variants are a major driver of the genetic risk for human disease; however, identifying which non-coding variants contribute to diseases and their mechanisms remains challenging. In silico variant prioritization methods quantify a variant’s severity, but for most methods, the specific phenotype and disease context of the prediction remain poorly defined. For example, many commonly used methods provide a single, organism-wide score for each variant, while other methods summarize a variant’s impact in certain tissues and/or cell types. Here, we propose a complementary disease-specific variant prioritization scheme, which is motivated by the observation that variants contributing to disease often operate through specific biological mechanisms. We combine tissue/cell-type-specific variant scores (e.g., GenoSkyline, FitCons2, DNA accessibility) into disease-specific scores with a logistic regression approach and apply it to ∼25,000 non-coding variants spanning 111 diseases. We show that this disease-specific aggregation significantly improves the association of common non-coding genetic variants with disease (average precision: 0.151, baseline = 0.09), compared with organism-wide scores (GenoCanyon, LINSIGHT, GWAVA, Eigen, CADD; average precision: 0.129, baseline = 0.09). Furthermore, disease similarities based on data-driven aggregation weights highlight meaningful disease groups and provide information about the tissues and cell types that drive these similarities. We also show that the similarities learned in this way are complementary to genetic similarities as quantified by genetic correlation. Overall, our approach demonstrates the strengths of disease-specific variant prioritization, leads to improvement in non-coding variant prioritization, and enables interpretable models that link variants to disease via specific tissues and/or cell types.
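To make the aggregation and the reported metric concrete, the following is a minimal sketch rather than the authors' pipeline. It assumes that a disease-specific score is a logistic function of a weighted sum of tissue-specific scores, and that the "baseline" for average precision is the fraction of positive variants (the expected value for a random ranking). The function names, weights, and toy labels are all hypothetical.

```python
import math


def disease_specific_score(tissue_scores, weights, intercept):
    """Logistic combination of tissue/cell-type scores (weights are hypothetical)."""
    z = intercept + sum(w * x for w, x in zip(weights, tissue_scores))
    return 1.0 / (1.0 + math.exp(-z))


def average_precision(labels, scores):
    """Mean of precision-at-rank over the ranks of the positive items.

    labels: 1 for disease-associated variants, 0 otherwise.
    Ties in scores keep their input order (Python's sort is stable).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def prevalence_baseline(labels):
    """Fraction of positives; assumed here to be the reported 'baseline'."""
    return sum(labels) / len(labels)


# Hypothetical toy example: four variants, two of them disease-associated.
toy_labels = [1, 0, 1, 0]
toy_features = [[0.9, 0.1], [0.6, 0.2], [0.5, 0.1], [0.1, 0.0]]
toy_scores = [disease_specific_score(x, [2.0, 1.0], -1.0) for x in toy_features]
toy_ap = average_precision(toy_labels, toy_scores)
toy_baseline = prevalence_baseline(toy_labels)
```

Under these assumptions, an average precision above the prevalence baseline indicates that the ranking places disease-associated variants ahead of what a random ordering would achieve, and the fitted weights indicate which tissues or cell types carry the signal for a given disease.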
Non-coding genetic variants constitute the majority of disease-associated genetic variation in humans. In this study, Liang et al. show that variant prioritization within a specific disease context improves performance and that it enables the linking of variants to disease via specific tissues and cell types.