System-wide profiling of genes and proteins in mammalian cells produces lists of differentially expressed genes/proteins that need to be further analyzed for their collective functions in order to extract new knowledge. Once unbiased lists of genes or proteins are generated from such experiments, these lists are used as input for computing enrichment with existing lists created from prior knowledge organized into gene-set libraries. While many enrichment analysis tools and gene-set library databases have been developed, there is still room for improvement.
Here, we present Enrichr, an integrative web-based and mobile software application that includes new gene-set libraries, an alternative approach to rank enriched terms, and various interactive visualization approaches to display enrichment results using the JavaScript library, Data Driven Documents (D3). The software can also be embedded into any tool that performs gene list analysis. We applied Enrichr to analyze nine cancer cell lines by comparing their enrichment signatures to the enrichment signatures of matched normal tissues. We observed a common pattern of upregulation of the polycomb group PRC2 and enrichment for the histone mark H3K27me3 in many cancer cell lines, as well as alterations in Toll-like receptor and interleukin signaling in K562 cells when compared with normal myeloid CD33+ cells. Such analyses provide global visualization of critical differences between normal tissues and cancer cell lines but can be applied to many other scenarios.
Enrichr is an easy-to-use, intuitive, web-based enrichment analysis tool that provides various types of visualization summaries of the collective functions of gene lists. Enrichr is open source and freely available online at: http://amp.pharm.mssm.edu/Enrichr.
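The enrichment computation described above, an input gene list compared against the sets in a gene-set library, can be illustrated with the standard one-sided hypergeometric (Fisher exact) test. The following is a minimal sketch under stated assumptions and not Enrichr's own scoring, which the abstract describes as an alternative ranking approach. The function names, the toy library, and the background size of 20,000 genes are all hypothetical choices made for illustration.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, set_size, background):
    """Return P(X >= overlap) when list_size genes are drawn without
    replacement from `background` genes, set_size of which are in the set."""
    max_k = min(list_size, set_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, max_k + 1)
    )
    return tail / total


def enrich(gene_list, library, background=20000):
    """Score every term in `library` and sort by ascending p-value.

    `background` is an assumed number of genes in the universe.
    """
    genes = set(gene_list)
    results = []
    for term, members in library.items():
        members = set(members)
        overlap = genes & members
        p = hypergeom_pvalue(len(overlap), len(genes), len(members), background)
        results.append((term, len(overlap), p))
    return sorted(results, key=lambda r: r[2])


# Toy library and input list (illustrative only, not real Enrichr libraries).
toy_library = {
    "PRC2 targets (toy)": ["EZH2", "SUZ12", "EED", "HOXA1"],
    "TLR signaling (toy)": ["TLR4", "MYD88", "IRAK1"],
}
toy_list = ["EZH2", "SUZ12", "EED", "TP53"]

for term, n_overlap, p in enrich(toy_list, toy_library):
    print(term, n_overlap, p)
```

Exact integer binomial coefficients keep the sketch dependency-free. A production tool would also correct the p-values for multiple testing across all terms in a library.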
Cluster heatmaps are widely used in biology and other fields to uncover clustering patterns in data matrices. Most cluster heatmap packages provide utility functions to divide the dendrograms at a certain level to obtain clusters, but it is often difficult to locate the appropriate cut in the dendrogram to obtain the clusters seen in the heatmap or computed by a statistical method. Multiple cuts are required if the clusters are located at different levels in the dendrogram.
We developed DendroX, a web app that provides interactive visualization of a dendrogram where users can divide the dendrogram at any level and into any number of clusters and pass the labels of the identified clusters for functional analysis. Helper functions are provided to extract linkage matrices from cluster heatmap objects in R or Python to serve as input to the app. A graphical user interface was also developed to help prepare input files for DendroX from data matrices stored in delimited text files. The app is scalable and has been tested on dendrograms with tens of thousands of leaf nodes. As a case study, we clustered the gene expression signatures of 297 bioactive chemical compounds in the LINCS L1000 dataset and visualized them in DendroX. Seventeen biologically meaningful clusters were identified based on the structure of the dendrogram and the expression patterns in the heatmap. We found that one of the clusters, consisting mostly of naturally occurring compounds, has not been previously reported and that its members share broad anticancer, anti-inflammatory and antioxidant activities.
DendroX solves the problem of matching visually and computationally determined clusters in a cluster heatmap and helps users navigate among different parts of a dendrogram. The identification of a cluster of naturally occurring compounds with shared bioactivities implicates a convergence of biological effects through divergent mechanisms.
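Selecting clusters at different levels of a dendrogram can be sketched as collecting the leaves under chosen internal nodes of a linkage matrix. The example below is an illustration only and not DendroX code. It assumes the linkage matrix follows SciPy's convention: row i merges two cluster ids, stores the merge distance and leaf count, and creates a new cluster with id n + i, where n is the number of leaves. The function names, the hand-written matrix, and the compound labels are hypothetical.

```python
def leaves_under(linkage, node):
    """Return the sorted leaf indices below `node`.

    `linkage` is assumed to use SciPy's layout: row i is
    [left_id, right_id, distance, leaf_count] and defines cluster n + i.
    """
    n = len(linkage) + 1  # number of leaves
    stack, leaves = [node], []
    while stack:
        current = stack.pop()
        if current < n:
            leaves.append(current)  # ids below n are original observations
        else:
            left, right = linkage[current - n][:2]
            stack.extend((int(right), int(left)))
    return sorted(leaves)


def clusters_from_nodes(linkage, nodes, labels):
    """Map each selected internal node to the labels of its leaves."""
    return {
        node: [labels[i] for i in leaves_under(linkage, node)]
        for node in nodes
    }


# Hand-written linkage matrix for five leaves (illustrative values).
toy_linkage = [
    [0, 1, 0.2, 2],  # node 5 = {0, 1}
    [2, 3, 0.3, 2],  # node 6 = {2, 3}
    [5, 4, 0.9, 3],  # node 7 = {0, 1, 4}
    [6, 7, 1.5, 5],  # node 8 = root
]
toy_labels = ["cpdA", "cpdB", "cpdC", "cpdD", "cpdE"]

# Nodes 6 and 7 sit at different heights (0.3 and 0.9), so a single
# horizontal cut could not return exactly these two clusters.
print(clusters_from_nodes(toy_linkage, [6, 7], toy_labels))
```

An explicit stack is used instead of recursion so that the traversal would not hit Python's recursion limit on deep dendrograms.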
We provide a single-cell atlas of idiopathic pulmonary fibrosis (IPF), a fatal interstitial lung disease, by profiling 312,928 cells from 32 IPF, 28 smoker and nonsmoker controls, and 18 chronic obstructive pulmonary disease (COPD) lungs. Among epithelial cells enriched in IPF, we identify a previously unidentified population of aberrant basaloid cells that coexpress basal epithelial, mesenchymal, senescence, and developmental markers and are located at the edge of myofibroblast foci in the IPF lung. Among vascular endothelial cells, we identify an ectopically expanded cell population transcriptomically identical to bronchial restricted vascular endothelial cells in IPF. We confirm the presence of both populations by immunohistochemistry and independent datasets. Among stromal cells, we identify IPF myofibroblasts and invasive fibroblasts with partially overlapping cells in control and COPD lungs. Last, we confirm previous findings of profibrotic macrophage populations in the IPF lung. Our comprehensive catalog reveals the complexity and diversity of aberrant cellular populations in IPF.
The comutation plot is a widely used visualization method to deliver a global view of the mutation landscape of large-scale genomic studies. Current tools for creating comutation plots are either offline packages that require coding or online web servers with varied features. When a package is used, it often requires repetitive runs of code to adjust a single feature that might take only a few clicks in a web app. But web apps mostly have limited capacity for customization and cannot handle very large genomic files.
To improve on existing tools, we identified the features that are most frequently adjusted when creating a plot and incorporated them in Comut-viz, which interactively filters and visualizes mutation data as downloadable plots. It includes colored labels for numeric metadata, a preloaded palette for changing colors and two input boxes for adjusting width and height. It accepts standard mutation annotation format (MAF) files as input and can handle large MAF files with more than 200 k rows. As a front-end-only app, Comut-viz guarantees privacy of user data and no latency in the analysis.
Comut-viz is a highly responsive and extensible web app to make comutation plots. It provides customization for frequently adjusted features and accepts large genomic files as input. It is suitable for genomic studies with more than a thousand samples.
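The data behind a comutation plot is a gene-by-sample grid of mutation types, which can be derived from a MAF file as sketched below. This is an illustration only and not Comut-viz code, since Comut-viz is a front-end web app. It assumes a tab-delimited MAF with the standard `Hugo_Symbol`, `Tumor_Sample_Barcode` and `Variant_Classification` columns. The function name, the `top_n` parameter, the `Multi_Hit` label for repeated hits, and the sample data are assumptions made for the example.

```python
import csv
from collections import defaultdict


def comutation_matrix(maf_text, top_n=10):
    """Build gene -> sample -> variant class from tab-delimited MAF text.

    Returns (genes ranked by number of mutated samples, sorted samples,
    cell values for the ranked genes).
    """
    # MAF files may begin with '#' comment lines such as a version header.
    lines = [ln for ln in maf_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")

    cells = defaultdict(dict)
    samples = set()
    for row in reader:
        gene = row["Hugo_Symbol"]
        sample = row["Tumor_Sample_Barcode"]
        samples.add(sample)
        if sample in cells[gene]:
            # Assumed convention: collapse repeated hits into one label.
            cells[gene][sample] = "Multi_Hit"
        else:
            cells[gene][sample] = row["Variant_Classification"]

    # Most frequently mutated genes first; ties broken alphabetically.
    ranked = sorted(cells, key=lambda g: (-len(cells[g]), g))[:top_n]
    return ranked, sorted(samples), {g: cells[g] for g in ranked}


# Made-up MAF content for illustration.
toy_maf = "\n".join([
    "#version 2.4",
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification",
    "TP53\tS1\tMissense_Mutation",
    "TP53\tS2\tNonsense_Mutation",
    "KRAS\tS1\tMissense_Mutation",
    "TP53\tS1\tFrame_Shift_Del",
])

genes, sample_ids, grid = comutation_matrix(toy_maf)
for gene in genes:
    print(gene, [grid[gene].get(s, "") for s in sample_ids])
```

Each gene stores only its mutated samples, so the grid stays sparse, which matters once a MAF file reaches hundreds of thousands of rows.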
For the Library of Integrated Network-based Cellular Signatures (LINCS) project, many gene expression signatures using the L1000 technology have been produced. The L1000 technology is a cost-effective method to profile gene expression at large scale. LINCS Canvas Browser (LCB) is an interactive HTML5 web-based software application that facilitates querying, browsing and interrogating much of the currently available LINCS L1000 data. LCB implements two compacted layered canvases, one to visualize clustered L1000 expression data, and the other to display enrichment analysis results using 30 different gene set libraries. Clicking on an experimental condition highlights gene-sets enriched for the differentially expressed genes from the selected experiment. A search interface allows users to input gene lists and query them against over 100 000 conditions to find the top matching experiments. The tool integrates many resources for an unprecedented potential for new discoveries in systems biology and systems pharmacology. The LCB application is available at http://www.maayanlab.net/LINCS/LCB. Customized versions will be made part of the http://lincscloud.org and http://lincs.hms.harvard.edu websites.
Identifying differentially expressed genes (DEG) is a fundamental step in studies that perform genome-wide expression profiling. Typically, DEG are identified by univariate approaches such as Significance Analysis of Microarrays (SAM) or Linear Models for Microarray Data (LIMMA) for processing cDNA microarrays, and differential gene expression analysis based on the negative binomial distribution (DESeq) or Empirical analysis of Digital Gene Expression data in R (edgeR) for RNA-seq profiling.
Here we present a new geometrical multivariate approach to identify DEG called the Characteristic Direction. We demonstrate that the Characteristic Direction method is significantly more sensitive than existing methods for identifying DEG in the context of transcription factor (TF) and drug perturbation responses over a large number of microarray experiments. We also benchmarked the Characteristic Direction method using synthetic data, as well as RNA-Seq data. A large collection of microarray expression data from TF perturbations (73 experiments) and drug perturbations (130 experiments) extracted from the Gene Expression Omnibus (GEO), as well as an RNA-Seq study that profiled genome-wide gene expression and STAT3 DNA binding in two subtypes of diffuse large B-cell lymphoma, were used for benchmarking the method using real data. ChIP-Seq data identifying DNA binding sites of the perturbed TFs, as well as known drug targets of the perturbing drugs, were used as a prior-knowledge silver standard for validation. In all cases the Characteristic Direction DEG calling method outperformed other methods. We find that when drugs are applied to cells in various contexts, the proteins that interact with the drug targets are differentially expressed and more of the corresponding genes are discovered by the Characteristic Direction method. In addition, we show that the Characteristic Direction conceptualization can be used to perform improved gene set enrichment analyses when compared with the gene-set enrichment analysis (GSEA) and the hypergeometric test.
The application of the Characteristic Direction method may shed new light on relevant biological mechanisms that would have remained undiscovered by the current state-of-the-art DEG methods. The method is freely accessible via various open source code implementations using four popular programming languages: R, Python, MATLAB and Mathematica, all available at: http://www.maayanlab.net/CD.
More effective use of targeted anti-cancer drugs depends on elucidating the connection between the molecular states induced by drug treatment and the cellular phenotypes controlled by these states, such as cytostasis and death. This is particularly true when mutation of a single gene is inadequate as a predictor of drug response. The current paper describes a data set of ~600 drug/cell line pairs collected as part of the NIH LINCS Program ( http://www.lincsproject.org/ ) in which molecular data (reduced dimensionality transcript L1000 profiles) were recorded across dose and time in parallel with phenotypic data on cellular cytostasis and cytotoxicity. We report that transcriptional and phenotypic responses correlate with each other in general, but whereas inhibitors of chaperones and cell cycle kinases induce similar transcriptional changes across cell lines, changes induced by drugs that inhibit intra-cellular signaling kinases are cell-type specific. In some drug/cell line pairs significant changes in transcription are observed without a change in cell growth or survival; analysis of such pairs identifies drug equivalence classes and, in one case, synergistic drug interactions. In this case, synergy involves cell-type specific suppression of an adaptive drug response.
Plasma cell-free DNA (cfDNA) fragmentomics has demonstrated significant differentiation power between cancer patients and healthy individuals, but little is known about it in pancreatic and biliary tract cancers. The aim of this study is to characterize cfDNA fragmentomics in biliopancreatic cancers and develop an accurate method for cancer detection.
One hundred forty-seven patients with biliopancreatic cancers and 71 non-cancer volunteers were enrolled, including 55 patients with cholangiocarcinoma, 30 with gallbladder cancer, and 62 with pancreatic cancer. Low-coverage whole-genome sequencing (median coverage: 2.9×) was performed on plasma cfDNA. Three cfDNA fragmentomic features, namely fragment size, end motif and nucleosome footprint, were used to construct a stacked machine learning model for cancer detection. Integration of carbohydrate antigen 19-9 (CA19-9) was explored to improve model performance.
The stacked model showed robust performance for cancer detection (area under the curve (AUC) of 0.978 in the training cohort and 0.941 in the validation cohort), and remained consistent even at an extremely low sequencing coverage of 0.5× (AUC: 0.905). In addition, our method could help differentiate biliopancreatic cancer subtypes. Integrating the stacked model with CA19-9 to generate the final detection model distinguished biliopancreatic cancers from non-cancer samples with high accuracy (AUC of 0.995).
Our model demonstrated the ultrasensitivity of plasma cfDNA fragmentomics in detecting biliopancreatic cancers, addressing the limited accuracy of the widely used serum biomarker CA19-9, and provided an affordable way to perform accurate noninvasive biliopancreatic cancer screening in clinical practice.
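The AUC values reported above summarize how well model scores separate cancer from non-cancer samples. As a minimal sketch, AUC can be computed as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name, the labels and the scores below are made up for illustration and are unrelated to the study's data or model.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cancer, 0 for non-cancer (assumed encoding).
    scores: model outputs, where higher means more likely cancer.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0  # positive ranked above negative
            elif p == q:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Illustrative labels and scores (not data from the study).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred. A rank-based formulation gives the same value for larger datasets.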
The B-myb (MYBL2) gene is a member of the MYB family of transcription factors and is involved in cell cycle regulation, DNA replication, and maintenance of genomic integrity. However, its function during adult development and hematopoiesis is unknown. We show here that conditional inactivation of B-myb in vivo results in depletion of the hematopoietic stem cell (HSC) pool, leading to profound reductions in mature lymphoid, erythroid, and myeloid cells. This defect is autonomous to the bone marrow and is first evident in stem cells, which accumulate in the S and G₂/M phases. B-myb inactivation also causes defects in the myeloid progenitor compartment, consisting of depletion of common myeloid progenitors but relative sparing of granulocyte–macrophage progenitors. Microarray studies indicate that B-myb–null LSK⁺ cells differentially express genes that direct myeloid lineage development and commitment, suggesting that B-myb is a key player in controlling cell fate. Collectively, these studies demonstrate that B-myb is essential for HSC and progenitor maintenance and survival during hematopoiesis.
Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180 184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr, including the ability to submit fuzzy sets and upload BED files, an improved application programming interface, and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr.