Super-enhancers are clusters of transcriptional enhancers that drive cell-type-specific gene expression and are crucial to cell identity. Many disease-associated sequence variations are enriched in super-enhancer regions of disease-relevant cell types. Thus, super-enhancers can be used as potential biomarkers for disease diagnosis and therapeutics. Current studies have identified super-enhancers in more than 100 cell types and demonstrated their functional importance. However, a centralized resource to integrate all these findings is not currently available. We developed dbSUPER (http://bioinfo.au.tsinghua.edu.cn/dbsuper/), the first integrated and interactive database of super-enhancers, with the primary goal of providing a resource for assistance in further studies related to transcriptional control of cell identity and disease. dbSUPER provides a responsive and user-friendly web interface to facilitate efficient and comprehensive search and browsing. The data can be easily sent to Galaxy instances, GREAT and Cistrome web-servers for downstream analysis, and can also be visualized in the UCSC genome browser where custom tracks can be added automatically. The data can be downloaded and exported in a variety of formats. Furthermore, dbSUPER lists genes associated with super-enhancers and also links to external databases such as GeneCards, UniProt and Entrez. dbSUPER also provides an overlap analysis tool to annotate user-defined regions. We believe dbSUPER is a valuable resource for the biology and genetic research communities.
Abstract
Summary
The excessive number of zeros in single-cell RNA-seq (scRNA-seq) data includes 'real' zeros due to the on-off nature of gene transcription in single cells and 'dropout' zeros due to technical reasons. Existing differential expression (DE) analysis methods cannot distinguish these two types of zeros. We developed an R package, DEsingle, which employs a Zero-Inflated Negative Binomial model to estimate the proportion of real and dropout zeros and to define and detect three types of DE genes in scRNA-seq data with higher accuracy.
Availability and implementation
The R package DEsingle is freely available at Bioconductor (https://bioconductor.org/packages/DEsingle).
Supplementary information
Supplementary data are available at Bioinformatics online.
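The zero-inflated negative binomial model named in the DEsingle summary can be illustrated with a minimal sketch. This is not the DEsingle implementation (which is an R package); the function names and the parameterization (`theta` for the weight of the always-zero component, `r` and `p` for the negative binomial) are assumptions chosen for illustration, and the summary does not say which mixture component DEsingle labels as "real" and which as "dropout".

```python
import math


def nb_pmf(k, r, p):
    """Negative binomial pmf: P(K = k) with size r > 0 and success probability 0 < p < 1."""
    log_coeff = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coeff + r * math.log(p) + k * math.log1p(-p))


def zinb_pmf(k, theta, r, p):
    """Zero-inflated NB pmf: a point mass at zero (weight theta) mixed with an NB (weight 1 - theta)."""
    nb = nb_pmf(k, r, p)
    if k == 0:
        return theta + (1.0 - theta) * nb
    return (1.0 - theta) * nb


def excess_zero_share(theta, r, p):
    """Fraction of all expected zeros that comes from the point-mass component (hypothetical helper)."""
    return theta / zinb_pmf(0, theta, r, p)


if __name__ == "__main__":
    theta, r, p = 0.3, 2.0, 0.5  # illustrative values, not fitted to any data
    print("P(count = 0):", zinb_pmf(0, theta, r, p))
    print("share of zeros from the point mass:", excess_zero_share(theta, r, p))
```

In a real analysis the three parameters would be estimated per gene and per group by maximum likelihood, and the groups would then be compared; that fitting step is omitted here.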
High-throughput RNA sequencing (RNA-seq) is rapidly emerging as a major quantitative transcriptome profiling platform. Here, we present DEGseq, an R package to identify differentially expressed genes or isoforms for RNA-seq data from different samples. In this package, we integrated three existing methods, and introduced two novel methods based on the MA-plot to detect and visualize gene expression differences.
Availability: The R package and a quick-start vignette are available at http://bioinfo.au.tsinghua.edu.cn/software/degseq
Contact: xwwang@tsinghua.edu.cn; zhangxg@tsinghua.edu.cn
Supplementary information: Supplementary data are available at Bioinformatics online.
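For readers unfamiliar with the MA-plot that the DEGseq abstract refers to, the coordinates can be sketched as follows. This shows only the standard definition of M (log ratio) and A (mean log intensity) for two samples; it does not reproduce DEGseq's statistical tests, and the function name and the toy counts are assumptions made for illustration.

```python
import math


def ma_values(count_a, count_b):
    """Return (M, A) for one gene from positive read counts in two samples."""
    log_a = math.log2(count_a)
    log_b = math.log2(count_b)
    m = log_a - log_b           # M: log2 fold change between the samples
    a = (log_a + log_b) / 2.0   # A: average log2 expression level
    return m, a


if __name__ == "__main__":
    # Toy counts (hypothetical genes); zero counts are skipped because log2(0) is undefined.
    counts = {"geneA": (8, 2), "geneB": (100, 100), "geneC": (0, 5)}
    for gene, (ca, cb) in counts.items():
        if ca > 0 and cb > 0:
            print(gene, ma_values(ca, cb))
```

Plotting M against A for every gene gives the MA-plot; genes far from M = 0 at a given A are candidates for differential expression.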
Abstract
Motivation
Single-cell RNA sequencing (scRNA-seq) techniques have revolutionized the investigation of the transcriptomic landscape in individual cells. Recent advancements in spatial transcriptomic technologies further enable gene expression profiling and spatial organization mapping of cells simultaneously. Among the technologies, imaging-based methods can offer higher spatial resolutions, while they are limited by either the small number of genes imaged or the low gene detection sensitivity. Although several methods have been proposed for enhancing spatially resolved transcriptomics, inadequate accuracy of gene expression prediction and insufficient ability of cell-population identification still impede the applications of these methods.
Results
We propose stPlus, a reference-based method that leverages information in scRNA-seq data to enhance spatial transcriptomics. Based on an auto-encoder with a carefully tailored loss function, stPlus performs joint embedding and predicts spatial gene expression via a weighted k-nearest-neighbor approach. stPlus outperforms baseline methods with higher gene-wise and cell-wise Spearman correlation coefficients. We also introduce a clustering-based approach to assess the enhancement performance systematically. Using the data enhanced by stPlus, cell populations can be better identified than using the measured data. The predicted expression of genes unique to scRNA-seq data can also characterize spatial cell heterogeneity well. In addition, stPlus is robust and scalable to datasets of diverse gene detection sensitivity levels, sample sizes and numbers of spatially measured genes. We anticipate stPlus will facilitate the analysis of spatial transcriptomics.
Availability and implementation
stPlus with detailed documents is freely accessible at http://health.tsinghua.edu.cn/software/stPlus/ and the source code is openly available on https://github.com/xy-chen16/stPlus.
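The weighted k-nearest-neighbor prediction step mentioned in the stPlus results can be sketched in a few lines. This is not the stPlus code: the joint embedding is assumed to be given already (stPlus learns it with an auto-encoder), and the inverse-distance weighting, the function name, and the toy data are assumptions, since the abstract does not state the weighting scheme.

```python
import math


def weighted_knn_predict(query, ref_embeddings, ref_expression, k=3):
    """Predict expression for one spatial cell from its k nearest reference cells.

    query          : embedding of the spatial cell (sequence of floats)
    ref_embeddings : embeddings of scRNA-seq reference cells
    ref_expression : per-reference-cell expression vectors for the genes to predict
    Weights are inverse Euclidean distances (an assumed, illustrative choice).
    """
    distances = sorted(
        (math.dist(query, emb), idx) for idx, emb in enumerate(ref_embeddings)
    )
    nearest = distances[:k]
    weights = [1.0 / (d + 1e-8) for d, _ in nearest]  # small constant avoids division by zero
    total = sum(weights)
    n_genes = len(ref_expression[0])
    prediction = [0.0] * n_genes
    for w, (_, idx) in zip(weights, nearest):
        for g in range(n_genes):
            prediction[g] += (w / total) * ref_expression[idx][g]
    return prediction


if __name__ == "__main__":
    ref_emb = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
    ref_expr = [[1.0, 0.0], [3.0, 4.0], [100.0, 100.0]]
    print(weighted_knn_predict((1.0, 0.0), ref_emb, ref_expr, k=2))
```

The query in the usage example is equidistant from the first two reference cells, so the prediction is simply their average and the distant third cell is ignored.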
Though most of the transcripts are long non-coding RNAs (lncRNAs), little is known about their functions. lncRNAs usually function through interactions with proteins, which implies the importance of identifying the binding proteins of lncRNAs in understanding the molecular mechanisms underlying the functions of lncRNAs. Only a few approaches are available for predicting interactions between lncRNAs and proteins. In this study, we introduce a new method, lncPro.
By encoding RNA and protein sequences into numeric vectors, we used matrix multiplication to score each RNA-protein pair. This score can be used to measure the interaction between the RNA and the protein in a pair. This method effectively discriminates interacting and non-interacting RNA-protein pairs and predicts RNA-protein interactions within a given complex. Applying this method to all human proteins, we found that the long non-coding RNAs we collected tend to interact with nuclear proteins and RNA-binding proteins.
Compared with the existing approaches, our method shortens the time needed to train the scoring matrix and obtains optimal results based on the model being used. The ability to predict associations between lncRNAs and proteins has also been enhanced. Our method provides an idea of how to integrate different information into the prediction process.
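The idea of scoring an RNA-protein pair by matrix multiplication, as described for lncPro, can be written as a bilinear form. The sketch below is only an illustration of that general idea: the vector lengths, the example numbers, and the function name are assumptions, and the actual sequence encoding and the trained matrix used by lncPro are not given in the abstract.

```python
def pair_score(rna_vec, matrix, protein_vec):
    """Bilinear score: rna_vec^T * matrix * protein_vec.

    rna_vec     : numeric encoding of the RNA (length m)
    matrix      : m x n weight matrix (would be learned from known interactions)
    protein_vec : numeric encoding of the protein (length n)
    """
    if len(matrix) != len(rna_vec) or any(len(row) != len(protein_vec) for row in matrix):
        raise ValueError("matrix shape must be len(rna_vec) x len(protein_vec)")
    return sum(
        rna_vec[i] * matrix[i][j] * protein_vec[j]
        for i in range(len(rna_vec))
        for j in range(len(protein_vec))
    )


if __name__ == "__main__":
    rna = [1.0, 2.0]                          # hypothetical RNA encoding
    weights = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # hypothetical 2 x 3 matrix
    protein = [3.0, 4.0, 5.0]                 # hypothetical protein encoding
    print(pair_score(rna, weights, protein))
```

A higher score would indicate a more likely interaction; choosing a threshold requires labelled interacting and non-interacting pairs.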
To understand the roles they play in complex diseases, genes need to be investigated in the networks they are involved in. Integration of gene expression and network data is a promising approach to prioritize disease-associated genes. Some methods have been developed in this field, but the problem is still far from being solved.
In this paper, we developed a method, Networked Gene Prioritizer (NGP), to prioritize cancer-associated genes. Applications on several breast cancer and lung cancer datasets demonstrated that NGP performs better than the existing methods. It provides stable top-ranking genes between independent datasets. The genes top-ranked by NGP are enriched in cancer-associated pathways. The genes top-ranked by NGP (PLK1, MCM2, MCM3, MCM7, MCM10 and SKP2) might coordinate to promote cell cycle related processes in cancer but not in normal cells.
In this paper, we have developed a method named NGP, to prioritize cancer-associated genes. Our results demonstrated that NGP performs better than the existing methods.
Alternative splicing (AS) is one crucial step of gene expression that must be tightly regulated during neurodevelopment. However, the precise timing of developmental splicing switches and the underlying regulatory mechanisms are poorly understood. Here we systematically analyze the temporal regulation of AS in a large number of transcriptome profiles of developing mouse cortices, in vivo purified neuronal subtypes, and neurons differentiated in vitro. Our analysis reveals early-switch and late-switch exons in genes with distinct functions, and these switches accurately define neuronal maturation stages. Integrative modeling suggests that these switches are under direct and combinatorial regulation by distinct sets of neuronal RNA-binding proteins including Nova, Rbfox, Mbnl, and Ptbp. Surprisingly, various neuronal subtypes in the sensory systems lack Nova and/or Rbfox expression. These neurons retain the "immature" splicing program in early-switch exons, affecting numerous synaptic genes. These results provide new insights into the organization and regulation of the neurodevelopmental transcriptome.
Metagenomic sequencing is a powerful technology for studying the mixture of microbes or the microbiomes on humans and in the environment. One basic task of analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of microbiome composition, limited availability of known reference genomes, and usually insufficient sequencing coverage.
As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses.
We posed the question of what the total genome length of all distinct species in a microbial community is, and introduced a method to answer it.
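The quantities this abstract builds on, the set of observed distinct k-mers and their coverage distribution, can be computed with a short sketch. It covers only the counting step: estimating the number of unseen k-mers from the coverage distribution and the k-mer redundancy index (KRI) are not implemented, reverse complements are not merged, and the function names and toy reads are assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def coverage_histogram(counts):
    """Map each coverage value c to the number of distinct k-mers observed exactly c times."""
    return Counter(counts.values())


if __name__ == "__main__":
    reads = ["ACGTAC", "CGTACG"]  # toy reads, not real sequencing data
    counts = kmer_counts(reads, k=3)
    print("distinct observed 3-mers:", len(counts))
    print("coverage histogram:", dict(coverage_histogram(counts)))
```

With low sequencing coverage, many k-mers appear once or not at all, which is why the observed distinct count underestimates the true total and an estimation step is needed.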
Super-enhancers and stretch enhancers represent classes of transcriptional enhancers that have been shown to control the expression of cell identity genes and carry disease- and trait-associated variants. Specifically, super-enhancers are clusters of enhancers defined based on the binding occupancy of master transcription factors, chromatin regulators, or chromatin marks, while stretch enhancers are large chromatin-defined regulatory regions of at least 3,000 base pairs. Several studies have characterized these regulatory regions in numerous cell types and tissues to decipher their functional importance. However, the differences and similarities between these regulatory regions have not been fully assessed. We integrated genomic, epigenomic, and transcriptomic data from ten human cell types to perform a comparative analysis of super-enhancers and stretch enhancers with respect to their chromatin profiles, cell type-specificity, and ability to control gene expression. We found that stretch enhancers are more abundant, more distal to transcription start sites, cover twice as much of the genome, and are significantly less conserved than super-enhancers. In contrast, super-enhancers are significantly more enriched for active chromatin marks and the cohesin complex, and more transcriptionally active than stretch enhancers. Importantly, a vast majority of super-enhancers (85%) overlap with only a small subset of stretch enhancers (13%), which are enriched for cell type-specific biological functions and control cell identity genes. These results suggest that super-enhancers are transcriptionally more active and cell type-specific than stretch enhancers, and importantly, most of the stretch enhancers that are distinct from super-enhancers do not show an association with cell identity genes, are less active, and are more likely to be poised enhancers.
GIVE is a framework and library for creating portable and personalized genome browsers. It makes visualizing genomic data as easy as building a laboratory homepage.