Gene-set enrichment analysis is a useful technique to help functionally characterize large gene lists, such as the results of gene expression experiments. This technique finds functionally coherent gene-sets, such as pathways, that are statistically over-represented in a given gene list. Ideally, the number of resulting sets is smaller than the number of genes in the list, thus simplifying interpretation. However, the increasing number and redundancy of gene-sets used by many current enrichment analysis tools work against this ideal.
To overcome gene-set redundancy and help in the interpretation of large gene lists, we developed "Enrichment Map", a network-based visualization method for gene-set enrichment results. Gene-sets are organized in a network, where each set is a node and edges represent gene overlap between sets. Automated network layout groups related gene-sets into network clusters, enabling the user to quickly identify the major enriched functional themes and more easily interpret the enrichment results.
Enrichment Map is a significant advance in the interpretation of enrichment analysis. Any research project that generates a list of genes can take advantage of this visualization framework. Enrichment Map is implemented as a freely available and user-friendly plug-in for the Cytoscape network visualization software (http://baderlab.org/Software/EnrichmentMap/).
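The network construction described above (each gene-set is a node, edges represent gene overlap) can be sketched as follows. This is a minimal illustration, not the plug-in's implementation: the overlap coefficient, the 0.5 threshold, the function names and the toy gene-sets are all assumptions chosen for the example.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def build_enrichment_network(gene_sets, threshold=0.5):
    """Return (nodes, edges) for a dict mapping gene-set name -> set of genes.

    An edge (name_a, name_b, score) is kept when the overlap score
    reaches the threshold. The threshold value is an assumption.
    """
    nodes = sorted(gene_sets)
    edges = []
    for name_a, name_b in combinations(nodes, 2):
        score = overlap_coefficient(gene_sets[name_a], gene_sets[name_b])
        if score >= threshold:
            edges.append((name_a, name_b, score))
    return nodes, edges


# Hypothetical toy gene-sets, for illustration only.
toy_sets = {
    "cell_cycle": {"CDK1", "CCNB1", "CDC20", "PLK1"},
    "mitosis": {"CDK1", "CCNB1", "PLK1", "AURKA"},
    "dna_repair": {"BRCA1", "RAD51", "ATM"},
}

nodes, edges = build_enrichment_network(toy_sets, threshold=0.5)
for name_a, name_b, score in edges:
    print(f"{name_a} -- {name_b} (overlap {score:.2f})")
```

In this toy input the two redundant sets ("cell_cycle" and "mitosis") become connected, while "dna_repair" stays isolated, which is the kind of grouping a network layout would then turn into visible clusters.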
Clinical research and practice in the 21st century is poised to be transformed by analysis of computable electronic medical records and population-level genome-scale patient profiles. Genomic data capture genetic and environmental state, providing information on heterogeneity in disease and treatment outcome, but genomic-based clinical risk scores are limited. Achieving the goal of routine precision medicine that takes advantage of these rich genomics data will require computational methods that support heterogeneous data, have excellent predictive performance, and ideally, provide biologically interpretable results. Traditional machine-learning approaches excel at performance, but often have limited interpretability. Patient similarity networks are an emerging paradigm for precision medicine, in which patients are clustered or classified based on their similarities in various features, including genomic profiles. This strategy is analogous to standard medical diagnosis, has excellent performance, is interpretable, and can preserve patient privacy. We review new methods based on patient similarity networks, including Similarity Network Fusion for patient clustering and netDx for patient classification. While these methods are already useful, much work is required to improve their scalability for contemporary genetic cohorts, optimize parameters, and incorporate a wide range of genomics and clinical data. The coming 5 years will provide an opportunity to assess the utility of network-based algorithms for precision medicine.
•Future clinics will combine clinical and genomic data with cellular models for precision medicine.
•Statistical risk calculators using genomics need to be interpretable due to small sample sizes.
•Patient similarity networks are a new model to integrate data to cluster/classify patients.
•Patient similarity networks are accurate, intuitive, preserve patient privacy, and supply mechanistic insight.
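The idea of classifying a patient by similarity to already-labelled patients can be illustrated with a small sketch. It is not Similarity Network Fusion or netDx: Pearson correlation as the similarity measure, the mean-similarity decision rule, the function names and the toy profiles are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no similarity information
    return cov / sqrt(var_x * var_y)


def classify_by_similarity(new_profile, labelled_profiles):
    """Assign the label whose patients are, on average, most similar.

    labelled_profiles is a list of (label, profile) pairs.
    Returns (best_label, mean similarity per label).
    """
    scores = {}
    for label, profile in labelled_profiles:
        scores.setdefault(label, []).append(pearson(new_profile, profile))
    mean_scores = {label: sum(v) / len(v) for label, v in scores.items()}
    best_label = max(mean_scores, key=mean_scores.get)
    return best_label, mean_scores


# Hypothetical toy expression profiles, for illustration only.
cohort = [
    ("responder", [1, 2, 3, 4]),
    ("responder", [2, 3, 4, 6]),
    ("non_responder", [4, 3, 2, 1]),
    ("non_responder", [5, 3, 2, 2]),
]

label, mean_scores = classify_by_similarity([1, 2, 4, 5], cohort)
print(label, mean_scores)
```

The per-label mean similarities are returned alongside the prediction because they show which labelled patients drove the decision, which is the interpretability argument made for similarity-based approaches.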
Abstract
GeneMANIA (http://genemania.org) is a flexible user-friendly web site for generating hypotheses about gene function, analyzing gene lists and prioritizing genes for functional assays. Given a query gene list, GeneMANIA finds functionally similar genes using a wealth of genomics and proteomics data. In this mode, it weights each functional genomic dataset according to its predictive value for the query. Another use of GeneMANIA is gene function prediction. Given a single query gene, GeneMANIA finds genes likely to share function with it based on their interactions with it. Enriched Gene Ontology categories among this set can point to the function of the gene. Nine organisms are currently supported (Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, Drosophila melanogaster, Escherichia coli, Homo sapiens, Mus musculus, Rattus norvegicus and Saccharomyces cerevisiae). Hundreds of data sets and hundreds of millions of interactions have been collected from GEO, BioGRID, IRefIndex and I2D, as well as organism-specific functional genomics data sets. Users can customize their search by selecting specific data sets to query and by uploading their own data sets to analyze. We have recently updated the user interface to GeneMANIA to make it more intuitive and make more efficient use of visual space. GeneMANIA can now be used effectively on a variety of devices.
Large‐scale cancer genome sequencing has uncovered thousands of gene mutations, but distinguishing tumor driver genes from functionally neutral passenger mutations is a major challenge. We analyzed 800 cancer genomes of eight types to find single‐nucleotide variants (SNVs) that precisely target phosphorylation machinery, important in cancer development and drug targeting. Assuming that cancer‐related biological systems involve unexpectedly frequent mutations, we used novel algorithms to identify genes with significant phosphorylation‐associated SNVs (pSNVs), phospho‐mutated pathways, kinase networks, drug targets, and clinically correlated signaling modules. We highlight increased survival of patients with TP53 pSNVs, hierarchically organized cancer kinase modules, a novel pSNV in EGFR, and an immune‐related network of pSNVs that correlates with prolonged survival in ovarian cancer. Our findings include multiple actionable cancer gene candidates (FLNB, GRM1, POU2F1), protein complexes (HCF1, ASF1), and kinases (PRKCZ). This study demonstrates new ways of interpreting cancer genomes and presents new leads for cancer research.
Phosphorylation sites of human proteins are frequently mutated in cancer. Statistical analysis of phosphorylation‐associated single nucleotide variants (pSNVs) predicts novel cancer drivers and phospho‐mutation mechanisms in known cancer genes.
Synopsis
We designed the ActiveDriver method to identify significantly mutated signaling regions in proteins. ActiveDriver is complementary to standard frequency‐based methods of mutation significance and helps interpret rare, but site‐specific mutations.
Analysis of somatic mutations in 800 cancer genomes reveals dozens of known and novel cancer genes, including potential drivers that are apparent only when integrating multiple cancer types.
Pathway and network analysis identifies systems with significantly enriched pSNVs, including kinase modules and protein complexes.
Clinical data analysis identifies phospho‐mutations of TP53 that correlate with prolonged patient survival in ovarian and brain cancer. Kinase network analysis highlights multiple survival‐associated signaling modules with pSNVs.
In the post-genomic era, the rapid increase in high-throughput data calls for computational tools capable of integrating data of diverse types and facilitating recognition of biologically meaningful patterns within them. For example, protein-protein interaction data sets have been clustered to identify stable complexes, but scientists lack easily accessible tools to facilitate combined analyses of multiple data sets from different types of experiments. Here we present clusterMaker, a Cytoscape plugin that implements several clustering algorithms and provides network, dendrogram, and heat map views of the results. The Cytoscape network is linked to all of the other views, so that a selection in one is immediately reflected in the others. clusterMaker is the first Cytoscape plugin to implement such a wide variety of clustering algorithms and visualizations, including the only implementations of hierarchical clustering, dendrogram plus heat map visualization (tree view), k-means, k-medoid, SCPS, AutoSOME, and native (Java) MCL.
Results are presented in the form of three scenarios of use: analysis of protein expression data using a recently published mouse interactome and a mouse microarray data set of nearly one hundred diverse cell/tissue types; the identification of protein complexes in the yeast Saccharomyces cerevisiae; and the cluster analysis of the vicinal oxygen chelate (VOC) enzyme superfamily. For scenario one, we explore functionally enriched mouse interactomes specific to particular cellular phenotypes and apply fuzzy clustering. For scenario two, we explore the prefoldin complex in detail using both physical and genetic interaction clusters. For scenario three, we explore the possible annotation of a protein as a methylmalonyl-CoA epimerase within the VOC superfamily. Cytoscape session files for all three scenarios are provided in the Additional Files section.
The Cytoscape plugin clusterMaker provides a number of clustering algorithms and visualizations that can be used independently or in combination for analysis and visualization of biological data sets, and for confirming or generating hypotheses about biological function. Several of these visualizations and algorithms are only available to Cytoscape users through the clusterMaker plugin. clusterMaker is available via the Cytoscape plugin manager.
Large-scale genomic studies have identified multiple somatic aberrations in breast cancer, including copy number alterations and point mutations. Still, identifying causal variants and emergent vulnerabilities that arise as a consequence of genetic alterations remain major challenges. We performed whole-genome small hairpin RNA (shRNA) “dropout screens” on 77 breast cancer cell lines. Using a hierarchical linear regression algorithm to score our screen results and integrate them with accompanying detailed genetic and proteomic information, we identify vulnerabilities in breast cancer, including candidate “drivers,” and reveal general functional genomic properties of cancer cells. Comparisons of gene essentiality with drug sensitivity data suggest potential resistance mechanisms, effects of existing anti-cancer drugs, and opportunities for combination therapy. Finally, we demonstrate the utility of this large dataset by identifying BRD4 as a potential target in luminal breast cancer and PIK3CA mutations as a resistance determinant for BET-inhibitors.
•We screened 77 breast cancer lines using a genome-wide pooled shRNA library
•We developed an algorithm (siMEM) to improve identification of context-dependent genes
•Integrating screen results with genomic data reveals potential “drivers”
•BRD4 is essential for luminal cancer, and mutant PIK3CA confers BET-I resistance
Pooled shRNA screens of a large panel of breast cancer cell lines, coupled with an improved analytical tool, siMEM, and integration with genomic and proteomic data, identify general and context-dependent essential genes in breast cancer. This study constitutes the largest functional characterization of breast cancer to date.
Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in a gene list more than would be expected by chance. We explain the procedures of pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome-sequencing experiments. The protocol comprises three major steps: definition of a gene list from omics data, determination of statistically enriched pathways, and visualization and interpretation of the results. We describe how to use this protocol with published examples of differentially expressed genes and mutated cancer genes; however, the principles can be applied to diverse types of omics data. The protocol describes innovative visualization techniques, provides comprehensive background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can be performed in ~4.5 h and is designed for use by biologists with no prior bioinformatics training.
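To make "enriched more than would be expected by chance" concrete, one common formulation is a one-sided hypergeometric test on the overlap between a gene list and a pathway. The sketch below is an illustration under that assumption and is not claimed to be the exact procedure of any tool named in the protocol; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(background_size, pathway_size, list_size, overlap):
    """One-sided p-value: probability of seeing at least `overlap` pathway
    genes when `list_size` genes are drawn without replacement from a
    background of `background_size` genes, `pathway_size` of which
    belong to the pathway.
    """
    max_overlap = min(pathway_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(pathway_size, k) * comb(background_size - pathway_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical toy numbers: 20 background genes, a 5-gene pathway,
# a 4-gene list, 3 of which fall in the pathway.
p_value = hypergeom_enrichment_p(
    background_size=20, pathway_size=5, list_size=4, overlap=3
)
print(f"p = {p_value:.4f}")
```

In a real analysis this test would be repeated for every pathway, so the resulting p-values would need multiple-testing correction before any pathway is reported as enriched.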
Abstract
Motivation
The explosive increase of biomedical literature has made information extraction an increasingly important tool for biomedical research. A fundamental task is the recognition of biomedical named entities in text (BNER) such as genes/proteins, diseases and species. Recently, a domain-independent method based on deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF), has been shown to outperform state-of-the-art entity-specific BNER tools. However, this method is dependent on gold-standard corpora (GSCs) consisting of hand-labeled entities, which tend to be small but highly reliable. Alternatives to GSCs are silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, these corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning. In this work, we analyze to what extent transfer learning improves upon state-of-the-art results for BNER.
Results
We demonstrate that transferring a deep neural network (DNN) trained on a large, noisy SSC to a smaller, but more reliable GSC significantly improves upon state-of-the-art results for BNER. Compared to a state-of-the-art baseline evaluated on 23 GSCs covering four different entity classes, transfer learning results in an average reduction in error of approximately 11%. We found transfer learning to be especially beneficial for target datasets with a small number of labels (approximately 6000 or fewer).
Availability and implementation
Source code for the LSTM-CRF is available at https://github.com/Franck-Dernoncourt/NeuroNER/ and links to the corpora are available at https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/.
Supplementary information
Supplementary data are available at Bioinformatics online.
Single-cell RNA sequencing (scRNA-seq) can map cell types, states and transitions during dynamic biological processes such as tissue development and regeneration. Many trajectory inference methods have been developed to order cells by their progression through a dynamic process. However, when time-series data are available, most of these methods do not consider the available time information when ordering cells and are instead designed to work only on a single scRNA-seq data snapshot. We present Tempora, a novel cell trajectory inference method that orders cells using time information from time-series scRNA-seq data. In performance comparison tests, Tempora inferred known developmental lineages from three diverse tissue development time-series data sets, beating state-of-the-art methods in accuracy and speed. Tempora works at the level of cell clusters (types) and uses biological pathway information to help identify cell type relationships. This approach increases gene expression signal from single cells, processing speed, and interpretability of the inferred trajectory. Our results demonstrate the utility of a combination of time and pathway information to supervise trajectory inference for scRNA-seq based analysis.
Cytoscape.js is an open-source JavaScript-based graph library. Its most common use case is as a visualization software component, so it can be used to render interactive graphs in a web browser. It can also be used in a headless manner, which is useful for graph operations in a server environment such as Node.js.
Cytoscape.js is implemented in JavaScript. Documentation, downloads and source code are available at http://js.cytoscape.org.
Contact: gary.bader@utoronto.ca