Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identifies biological pathways that are enriched in ...a gene list more than would be expected by chance. We explain the procedures of pathway enrichment analysis and present a practical step-by-step guide to help interpret gene lists resulting from RNA-seq and genome-sequencing experiments. The protocol comprises three major steps: definition of a gene list from omics data, determination of statistically enriched pathways, and visualization and interpretation of the results. We describe how to use this protocol with published examples of differentially expressed genes and mutated cancer genes; however, the principles can be applied to diverse types of omics data. The protocol describes innovative visualization techniques, provides comprehensive background and troubleshooting guidelines, and uses freely available and frequently updated software, including g:Profiler, Gene Set Enrichment Analysis (GSEA), Cytoscape and EnrichmentMap. The complete protocol can be performed in ~4.5 h and is designed for use by biologists with no prior bioinformatics training.
Functional enrichment analysis is a key step in interpreting gene lists discovered in diverse high-throughput experiments. g:Profiler studies flat and ranked gene lists and finds statistically ...significant Gene Ontology terms, pathways and other gene function related terms. Translation of hundreds of gene identifiers is another core feature of g:Profiler. Since its first publication in 2007, our web server has become a popular tool of choice among basic and translational researchers. Timeliness is a major advantage of g:Profiler as genome and pathway information is synchronized with the Ensembl database in quarterly updates. g:Profiler supports 213 species including mammals and other vertebrates, plants, insects and fungi. The 2016 update of g:Profiler introduces several novel features. We have added further functional datasets to interpret gene lists, including transcription factor binding site predictions, Mendelian disease annotations, information about protein expression and complexes and gene mappings of human genetic polymorphisms. Besides the interactive web interface, g:Profiler can be accessed in computational pipelines using our R package, Python interface and BioJS component. g:Profiler is freely available at http://biit.cs.ut.ee/gprofiler/.
Somatic mutations in cancer genomes are associated with DNA replication timing (RT) and chromatin accessibility (CA), however these observations are based on normal tissues and cell lines while ...primary cancer epigenomes remain uncharacterised. Here we use machine learning to model megabase-scale mutation burden in 2,500 whole cancer genomes and 17 cancer types via a compendium of 900 CA and RT profiles covering primary cancers, normal tissues, and cell lines. CA profiles of primary cancers, rather than those of normal tissues, are most predictive of regional mutagenesis in most cancer types. Feature prioritisation shows that the epigenomes of matching cancer types and organ systems are often the strongest predictors of regional mutation burden, highlighting disease-specific associations of mutational processes. The genomic distributions of mutational signatures are also shaped by the epigenomes of matched cancer and tissue types, with SBS5/40, carcinogenic and unknown signatures most accurately predicted by our models. In contrast, fewer associations of RT and regional mutagenesis are found. Lastly, the models highlight genomic regions with overrepresented mutations that dramatically exceed epigenome-derived expectations and show a pan-cancer convergence to genes and pathways involved in development and oncogenesis, indicating the potential of this approach for coding and non-coding driver discovery. The association of regional mutational processes with the epigenomes of primary cancers suggests that the landscape of passenger mutations is predominantly shaped by the epigenomes of cancer cells after oncogenic transformation.
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Large‐scale cancer genome sequencing has uncovered thousands of gene mutations, but distinguishing tumor driver genes from functionally neutral passenger mutations is a major challenge. We analyzed ...800 cancer genomes of eight types to find single‐nucleotide variants (SNVs) that precisely target phosphorylation machinery, important in cancer development and drug targeting. Assuming that cancer‐related biological systems involve unexpectedly frequent mutations, we used novel algorithms to identify genes with significant phosphorylation‐associated SNVs (pSNVs), phospho‐mutated pathways, kinase networks, drug targets, and clinically correlated signaling modules. We highlight increased survival of patients with TP53 pSNVs, hierarchically organized cancer kinase modules, a novel pSNV in EGFR, and an immune‐related network of pSNVs that correlates with prolonged survival in ovarian cancer. Our findings include multiple actionable cancer gene candidates (FLNB, GRM1, POU2F1), protein complexes (HCF1, ASF1), and kinases (PRKCZ). This study demonstrates new ways of interpreting cancer genomes and presents new leads for cancer research.
Phosphorylation sites of human proteins are frequently mutated in cancer. Statistical analysis of phosphorylation‐associated single nucleotide variants (pSNVs) predicts novel cancer drivers and phospho‐mutation mechanisms in known cancer genes.
Synopsis
Phosphorylation sites of human proteins are frequently mutated in cancer. Statistical analysis of phosphorylation‐associated single nucleotide variants (pSNVs) predicts novel cancer drivers and phospho‐mutation mechanisms in known cancer genes.
We designed the ActiveDriver method to identify significantly mutated signaling regions in proteins. ActiveDriver is complementary to standard frequency‐based methods of mutation significance and helps interpret rare, but site‐specific mutations.
Analysis of somatic mutations in 800 cancer genomes reveals dozens of known and novel cancer genes, including potential drivers that are apparent only when integrating multiple cancer types.
Pathway and network analysis identifies systems with significantly enriched pSNVs, including kinase modules and protein complexes.
Clinical data analysis identifies phospho‐mutations of TP53 that correlate with prolonged patient survival in ovarian and brain cancer. Kinase network analysis highlights multiple survival‐associated signaling modules with pSNVs.
Full text
Available for:
FZAB, GIS, IJS, IZUM, KILJ, NLZOH, NUK, OILJ, PILJ, PNG, SAZU, SBCE, SBMB, UL, UM, UPUK
Somatic mutations in cancer genomes include drivers that provide selective advantages to tumor cells and passengers present due to genome instability. Discovery of pan-cancer drivers will help ...characterize biological systems important in multiple cancers and lead to development of better therapies. Driver genes are most often identified by their recurrent mutations across tumor samples. However, some mutations are more important for protein function than others. Thus considering the location of mutations with respect to functional protein sites can predict their mechanisms of action and improve the sensitivity of driver gene detection. Protein phosphorylation is a post-translational modification central to cancer biology and treatment, and frequently altered by driver mutations. Here we used our ActiveDriver method to analyze known phosphorylation sites mutated by single nucleotide variants (SNVs) in The Cancer Genome Atlas Research Network (TCGA) pan-cancer dataset of 3,185 genomes and 12 cancer types. Phosphorylation-related SNVs (pSNVs) occur in ~90% of tumors, show increased conservation and functional mutation impact compared to other protein-coding mutations, and are enriched in cancer genes and pathways. Gene-centric analysis found 150 known and candidate cancer genes with significant pSNV recurrence. Using a novel computational method, we predict that 29% of these mutations directly abolish phosphorylation or modify kinase target sites to rewire signaling pathways. This analysis shows that incorporation of information about protein signaling sites will improve computational pipelines for variant function prediction.
Full text
Available for:
IZUM, KILJ, NUK, PILJ, PNG, SAZU, UL, UM, UPUK
Interpreting the impact of human genome variation on phenotype is challenging. The functional effect of protein-coding variants is often predicted using sequence conservation and population frequency ...data, however other factors are likely relevant. We hypothesized that variants in protein post-translational modification (PTM) sites contribute to phenotype variation and disease. We analyzed fraction of rare variants and non-synonymous to synonymous variant ratio (Ka/Ks) in 7,500 human genomes and found a significant negative selection signal in PTM regions independent of six factors, including conservation, codon usage, and GC-content, that is widely distributed across tissue-specific genes and function classes. PTM regions are also enriched in known disease mutations, suggesting that PTM variation is more likely deleterious. PTM constraint also affects flanking sequence around modified residues and increases around clustered sites, indicating presence of functionally important short linear motifs. Using target site motifs of 124 kinases, we predict that at least ∼180,000 motif-breaker amino acid residues that disrupt PTM sites when substituted, and highlight kinase motifs that show specific negative selection and enrichment of disease mutations. We provide this dataset with corresponding hypothesized mechanisms as a community resource. As an example of our integrative approach, we propose that PTPN11 variants in Noonan syndrome aberrantly activate the protein by disrupting an uncharacterized cluster of phosphorylation sites. Further, as PTMs are molecular switches that are modulated by drugs, we study mutated binding sites of PTM enzymes in disease genes and define a drug-disease network containing 413 novel predicted disease-gene links.
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
Protein phosphorylation is important in cellular pathways and altered in disease. We developed MIMP (http://mimp.baderlab.org/), a machine learning method to predict the impact of missense ...single-nucleotide variants (SNVs) on kinase-substrate interactions. MIMP analyzes kinase sequence specificities and predicts whether SNVs disrupt existing phosphorylation sites or create new sites. This helps discover mutations that modify protein function by altering kinase networks and provides insight into disease biology and therapy development.
Full text
Available for:
IZUM, KILJ, NUK, PILJ, PNG, SAZU, UL, UM, UPUK
Cancer genomes are shaped by mutational processes with complex spatial variation at multiple scales. Entire classes of regulatory elements are affected by local variations in mutation frequency. ...However, the underlying mechanisms with functional and genetic determinants remain poorly understood.
We characterise the mutational landscape of 1.3 million gene-regulatory and chromatin architectural elements in 2419 whole cancer genomes with transcriptional and pathway activity, functional conservation and recurrent driver events. We develop RM2, a statistical model that quantifies mutational enrichment or depletion in classes of genomic elements through genetic, trinucleotide and megabase-scale effects. We report a map of localised mutational processes affecting CTCF binding sites, transcription start sites (TSS) and tissue-specific open-chromatin regions. Increased mutation frequency in TSSs associates with mRNA abundance in most cancer types, while open-chromatin regions are generally enriched in mutations. We identify ~ 10,000 CTCF binding sites with core DNA motifs and constitutive binding in 66 cell types that represent focal points of mutagenesis. We detect site-specific mutational signature enrichments, such as SBS40 in open-chromatin regions in prostate cancer and SBS17b in CTCF binding sites in gastrointestinal cancers. Candidate drivers of localised mutagenesis are also apparent: BRAF mutations associate with mutational enrichments at CTCF binding sites in melanoma, and ARID1A mutations with TSS-specific mutagenesis in pancreatic cancer.
Our method and catalogue of localised mutational processes provide novel perspectives to cancer genome evolution, mutagenesis, DNA repair and driver gene discovery. The functional and genetic correlates of mutational processes suggest mechanistic hypotheses for future studies.