Cancer genomic, transcriptomic, and proteomic profiling has generated extensive data that necessitate the development of tools for their analysis and dissemination. We developed UALCAN to provide a portal for easily exploring, analyzing, and visualizing these data, allowing users to integrate the data to better understand the genes, proteins, and pathways perturbed in cancer and make discoveries. The UALCAN web portal enables analyzing and delivering cancer transcriptome, proteomics, and patient survival data to the cancer research community. With data obtained from The Cancer Genome Atlas (TCGA) project, UALCAN has enabled users to evaluate protein-coding gene expression and its impact on patient survival across 33 types of cancers. The web portal has been used extensively since its release and has gained wide popularity, as shown by its use by cancer researchers in more than 100 countries. The present manuscript highlights the work we have undertaken and the updates that we have made to UALCAN since its release in 2017. Extensive user feedback motivated us to expand the resource by including data on a) microRNAs (miRNAs), long non-coding RNAs (lncRNAs), and promoter DNA methylation from TCGA and b) mass spectrometry-based proteomics from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). UALCAN provides easy access to pre-computed, tumor subgroup-based gene/protein expression, promoter DNA methylation status, and Kaplan-Meier survival analyses. It also provides new visualization features to comprehend and integrate observations and aids in generating hypotheses for testing. UALCAN is accessible at http://ualcan.path.uab.edu
Differential expression (DE) analysis between cell types in scRNA-seq data, which must capture the data's complicated features, is crucial. Recently, various methods have been developed for scRNA-seq data analysis, based on different modeling frameworks, assumptions, strategies, and test statistics that account for various data features. scDEA is a recently developed ensemble learning-based DE analysis method that uses Lancaster's combination to combine p-values generated by 12 individual DE analysis methods, producing more accurate and stable results than the individual methods. The objective of our study is to propose a new ensemble learning-based DE analysis method, scHD4E, using only the 4 top-performing individual methods. These 4 methods were selected through an evaluation process using six real scRNA-seq data sets. We conducted comprehensive experiments on five experimental data sets to evaluate our proposed method in terms of sample size effects, batch effects, type I error control, gene ontology enrichment analysis, runtime, identified matched DE genes, and semantic similarity measurement between methods. We also performed similar analyses (except for the last 3 criteria) and computed performance measures such as accuracy, F1 score, and Matthews correlation coefficient for a simulated data set. The results show that scHD4E performs better than all the individual methods and scDEA in all of the above respects. We expect that scHD4E will help data scientists detect differentially expressed genes (DEGs) in scRNA-seq data analysis. To implement our proposed method, an R package, scHD4E, and its Shiny application have been developed and are available at the following links: https://github.com/bbiswas1989/scHD4E and https://github.com/bbiswas1989/scHD4E-Shiny.
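The Lancaster combination mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the scHD4E or scDEA implementation: the function names and the weights are hypothetical, and to stay within the standard library the weights are restricted to positive even integers (used as chi-square degrees of freedom), for which the chi-square survival function has a closed form. With every weight equal to 2 the procedure reduces to Fisher's method.

```python
import math


def chi2_sf_even(x, df):
    """Survival function P(X > x) of a chi-square variable with even integer df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch supports positive even degrees of freedom only")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Closed form for even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def chi2_isf_even(p, df):
    """Inverse survival function: the x with P(X > x) = p, found by bisection."""
    p = min(max(p, 1e-300), 1.0)  # clamp to avoid an infinite quantile at p = 0
    if p == 1.0:
        return 0.0
    lo, hi = 0.0, 1.0
    while chi2_sf_even(hi, df) > p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even(mid, df) > p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def lancaster_combine(p_values, weights):
    """Combine per-method p-values for one gene with Lancaster's weighted test.

    Each p-value is mapped to a chi-square quantile with df equal to its weight;
    the sum of these quantiles is chi-square with df equal to the sum of weights.
    """
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    statistic = sum(chi2_isf_even(p, w) for p, w in zip(p_values, weights))
    return chi2_sf_even(statistic, sum(weights))


# Hypothetical p-values for one gene from four DE methods, with made-up weights.
combined = lancaster_combine([0.01, 0.20, 0.03, 0.08], [4, 2, 2, 2])
```

In practice the weights would be chosen per method (the abstract does not say how), and a library routine for the chi-square distribution would replace the closed-form helper so that non-integer weights can be used.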
• Differential expression analysis between cell types in scRNA-seq data, which must capture the data's complicated features, is crucial for identifying the key genes and their impacts on organism development.
• We have proposed a new ensemble learning-based DE analysis method, scHD4E, using only the 4 top-performing individual methods (selected from 44 DE analysis methods through an evaluation process applied to six real scRNA-seq data sets). scHD4E was developed using Lancaster's combined probability test, which combines the p-values generated by the top 4 individual DE analysis methods.
• Our method showed more stable and precise results than the existing ensemble learning-based DE analysis method scDEA and the individual methods in terms of sample size effects, batch effects, type I error control, gene ontology (GO) enrichment analysis, runtime, identified matched DEGs, and semantic similarity measurement between methods.
• scHD4E performed better on performance measures such as accuracy, MCC, and F1 score, which were calculated for simulated data.
• An R package, scHD4E, and its Shiny application have been developed to implement our proposed method and are available on GitHub.
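As an illustration of the performance measures named in these highlights, the following Python sketch computes accuracy, F1 score, and the Matthews correlation coefficient (MCC) from simulated ground truth and a method's DE calls. The function name and the toy labels are hypothetical; this is not code from the scHD4E package.

```python
import math


def classification_metrics(truth, called):
    """Return accuracy, F1 score, and MCC for binary DE calls.

    truth  -- iterable of 0/1 flags: gene is truly DE in the simulation
    called -- iterable of 0/1 flags: gene was reported as DE by a method
    """
    pairs = list(zip(truth, called))
    tp = sum(1 for t, c in pairs if t and c)
    tn = sum(1 for t, c in pairs if not t and not c)
    fp = sum(1 for t, c in pairs if not t and c)
    fn = sum(1 for t, c in pairs if t and not c)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # 0 by convention when undefined
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy example: 8 simulated genes, 3 of them truly DE.
truth = [1, 1, 1, 0, 0, 0, 0, 0]
called = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(truth, called)
```

MCC is often preferred over accuracy for DE benchmarks because truly DE genes are usually a small minority, so a method that calls nothing can still score a high accuracy.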
RNA-seq is now the technology of choice for genome-wide differential gene expression experiments, but it is not clear how many biological replicates are needed to ensure valid biological interpretation of the results or which statistical tools are best for analyzing the data. An RNA-seq experiment with 48 biological replicates in each of two conditions was performed to answer these questions and provide guidelines for experimental design. With three biological replicates, nine of the 11 tools evaluated found only 20%-40% of the significantly differentially expressed (SDE) genes identified with the full set of 42 clean replicates. This rises to >85% for the subset of SDE genes changing in expression by more than fourfold. To achieve >85% for all SDE genes regardless of fold change requires more than 20 biological replicates. The same nine tools successfully control their false discovery rate at ≲5% for all numbers of replicates, while the remaining two tools fail to control their FDR adequately, particularly for low numbers of replicates. For future RNA-seq experiments, these results suggest that at least six biological replicates should be used, rising to at least 12 when it is important to identify SDE genes for all fold changes. If fewer than 12 replicates are used, a superior combination of true positive and false positive performances makes edgeR and DESeq2 the leading tools. For higher replicate numbers, minimizing false positives is more important and DESeq marginally outperforms the other tools.
SCANPY is a scalable toolkit for analyzing single-cell gene expression data. It includes methods for preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks. Its Python-based implementation efficiently deals with data sets of more than one million cells (https://github.com/theislab/Scanpy). Along with SCANPY, we present ANNDATA, a generic class for handling annotated data matrices (https://github.com/theislab/anndata).
As the number of single‐cell transcriptomics datasets grows, the natural next step is to integrate the accumulating data to achieve a common ontology of cell types and states. However, it is not straightforward to compare gene expression levels across datasets and to automatically assign cell type labels in a new dataset based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of scRNA‐seq data, while accounting for uncertainty caused by biological and measurement noise. We also introduce single‐cell ANnotation using Variational Inference (scANVI), a semi‐supervised variant of scVI designed to leverage existing cell state annotations. We demonstrate that scVI and scANVI compare favorably to state‐of‐the‐art methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings. In contrast to existing methods, scVI and scANVI integrate multiple datasets with a single generative model that can be directly used for downstream tasks, such as differential expression. Both methods are easily accessible through scvi‐tools.
SYNOPSIS
This study demonstrates the ability of scVI to integrate single‐cell RNA‐seq datasets in a variety of settings and presents scANVI, a new development based on scVI for automated annotation of cell types and states.
In scVI, datasets from different labs and technologies are integrated in a joint latent space.
In scANVI, cell type annotations are transferred between datasets and across different scenarios.
Uncertainties of differential gene expression in multiple samples are quantified.
The performance of scVI and scANVI in data integration and cell state annotation is superior to other related methods.
Genes showing higher expression in either tumor or metastatic tissues can help in better understanding tumor formation and can serve as biomarkers of progression or as potential therapy targets. Our goal was to establish an integrated database using available transcriptome-level datasets and to create a web platform which enables the mining of this database by comparing normal, tumor and metastatic data across all genes in real time. We utilized data generated by either gene arrays from the Gene Expression Omnibus of the National Center for Biotechnology Information (NCBI-GEO) or RNA-seq from The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Treatments (TARGET), and The Genotype-Tissue Expression (GTEx) repositories. The altered expression within different platforms was analyzed separately. Statistical significance was computed using Mann-Whitney or Kruskal-Wallis tests. False Discovery Rate (FDR) was computed using the Benjamini-Hochberg method. The entire database contains 56,938 samples, including 33,520 samples from 3180 gene chip-based studies (453 metastatic, 29,376 tumorous and 3691 normal samples), 11,010 samples from TCGA (394 metastatic, 9886 tumorous and 730 normal), 1193 samples from TARGET (1 metastatic, 1180 tumorous and 12 normal) and 11,215 normal samples from GTEx. The most consistently upregulated genes across multiple tumor types were TOP2A (FC = 7.8), SPP1 (FC = 7.0) and CENPA (FC = 6.03), and the most consistently downregulated gene was ADH1B (FC = 0.15). Validation of differential expression using equally sized training and test sets confirmed the reliability of the database in breast, colon, and lung cancer at an FDR below 10%. The online analysis platform enables unrestricted mining of the database and is accessible at TNMplot.com.
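The Benjamini-Hochberg step named in this abstract can be sketched in a few lines of Python. This is a generic illustration of the procedure, not the TNMplot implementation; the function name and the example p-values are hypothetical, and the p-values are assumed to have already been produced by per-gene tests such as Mann-Whitney.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the
    # adjusted values: adj_(k) = min over j >= k of p_(j) * m / j.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
# Genes passing a 10% FDR threshold, as index positions into `raw`.
significant = [i for i, q in enumerate(fdr) if q < 0.10]
```

A gene is reported as significant at a chosen FDR level when its adjusted value falls below that level, which matches the "FDR below 10%" criterion used in the validation described above.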
Here we walk through an end-to-end gene-level RNA-seq differential expression workflow using Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, and prepare a count matrix which tallies the number of RNA-seq reads/fragments within each gene for each sample. We will perform exploratory data analysis (EDA) for quality assessment and to explore the relationship between samples, perform differential gene expression analysis, and visually explore the results.