Deep learning has become a powerful paradigm to analyze the binding sites of regulatory factors including RNA-binding proteins (RBPs), owing to its strength to learn complex features from possibly ...multiple sources of raw data. However, the interpretability of these models, which is crucial to improve our understanding of RBP binding preferences and functions, has not yet been investigated in significant detail. We have designed a multitask and multimodal deep neural network for characterizing in vivo RBP targets. The model incorporates not only the sequence but also the region type of the binding sites as input, which helps the model to boost the prediction performance. To interpret the model, we quantified the contribution of the input features to the predictive score of each RBP. Learning across multiple RBPs at once, we are able to avoid experimental biases and to identify the RNA sequence motifs and transcript context patterns that are the most important for the predictions of each individual RBP. Our findings are consistent with known motifs and binding behaviors and can provide new insights about the regulatory functions of RBPs.
By mapping the positions of millions of translating ribosomes in the cell, ribosome profiling (Ribo-seq) has established its role as a powerful tool to study gene expression. Several laboratories ...have introduced modifications to the experimental protocol and expanded the repertoire of biochemical methods to study translation transcriptome-wide. However, the diversity of protocols highlights a need for standardization. At the same time, different computational analysis strategies have used Ribo-seq data to identify the set of translated sequences with high confidence. In this review we present an overview of such methodologies, outlining their assumptions, data requirements, and availability. At the interface between RNA and proteins, Ribo-seq can complement data from multiple omics approaches, zooming in on the central role of translation in the molecular cell.
Ribo-seq has become an established protocol to identify translated transcript regions via deep sequencing, closing the gap between RNA sequencing and proteomics.
Recently developed Ribo-seq data analysis strategies use different features as hallmarks of translation. Specifically, the ability to monitor the positions of translating ribosomes with single-nucleotide precision has driven the development of computational tools that rely on ‘subcodon resolution’. Knowing the concrete assumptions and precise goals of different approaches is crucial.
In addition to addressing translation-focused questions, from defining open reading frames to identifying alternative translation initiation sites and estimating differential translation rates, Ribo-seq data show great promise for integrative efforts combining additional omics approaches.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UL, UM, UPCLJ, UPUK, ZRSKP
3.
Deep learning for genomics using Janggu Kopp, Wolfgang; Monti, Remo; Tamburrini, Annalaura ...
Nature communications,
07/2020, Volume:
11, Issue:
1
Journal Article
Peer reviewed
Open access
Abstract
In recent years, numerous applications have demonstrated the potential of deep learning for an improved understanding of biological processes. However, most deep learning tools developed so ...far are designed to address a specific question on a fixed dataset and/or by a fixed model architecture. Here we present Janggu, a python library facilitates deep learning for genomics applications, aiming to ease data acquisition and model evaluation. Among its key features are special dataset objects, which form a unified and flexible data acquisition and pre-processing framework for genomics data that enables streamlining of future research applications through reusable components. Through a numpy-like interface, these dataset objects are directly compatible with popular deep learning libraries, including keras or pytorch. Janggu offers the possibility to visualize predictions as genomic tracks or by exporting them to the bigWig format as well as utilities for keras-based models. We illustrate the functionality of Janggu on several deep learning genomics applications. First, we evaluate different model topologies for the task of predicting binding sites for the transcription factor JunD. Second, we demonstrate the framework on published models for predicting chromatin effects. Third, we show that promoter usage measured by CAGE can be predicted using DNase hypersensitivity, histone modifications and DNA sequence features. We improve the performance of these models due to a novel feature in Janggu that allows us to include high-order sequence features. We believe that Janggu will help to significantly reduce repetitive programming overhead for deep learning applications in genomics, and will enable computational biologists to rapidly assess biological hypotheses.
Efforts to precisely identify tumor human leukocyte antigen (HLA) bound peptides capable of mediating T cell-based tumor rejection still face important challenges. Recent studies suggest that ...non-canonical tumor-specific HLA peptides derived from annotated non-coding regions could elicit anti-tumor immune responses. However, sensitive and accurate mass spectrometry (MS)-based proteogenomics approaches are required to robustly identify these non-canonical peptides. We present an MS-based analytical approach that characterizes the non-canonical tumor HLA peptide repertoire, by incorporating whole exome sequencing, bulk and single-cell transcriptomics, ribosome profiling, and two MS/MS search tools in combination. This approach results in the accurate identification of hundreds of shared and tumor-specific non-canonical HLA peptides, including an immunogenic peptide derived from an open reading frame downstream of the melanoma stem cell marker gene ABCB5. These findings hold great promise for the discovery of previously unknown tumor antigens for cancer immunotherapy.
Recent methodological advances allowed the identification of an increasing number of RNA-binding proteins (RBPs) and their RNA-binding sites. Most of those methods rely, however, on capturing ...proteins associated to polyadenylated RNAs which neglects RBPs bound to non-adenylate RNA classes (tRNA, rRNA, pre-mRNA) as well as the vast majority of species that lack poly-A tails in their mRNAs (including all archea and bacteria). We have developed the Phenol Toluol extraction (PTex) protocol that does not rely on a specific RNA sequence or motif for isolation of cross-linked ribonucleoproteins (RNPs), but rather purifies them based entirely on their physicochemical properties. PTex captures RBPs that bind to RNA as short as 30 nt, RNPs directly from animal tissue and can be used to simplify complex workflows such as PAR-CLIP. Finally, we provide a global RNA-bound proteome of human HEK293 cells and the bacterium Salmonella Typhimurium.
Translation has a fundamental function in defining the fate of the transcribed genome. RNA-sequencing (RNA-seq) data enable the quantification of complex transcript mixtures, often detecting several ...transcript isoforms of unknown functions for one gene. Here, we describe ORFquant, a method to annotate and quantify translation at the level of single open reading frames (ORFs), using information from Ribo-seq data. By developing an approach for transcript filtering, we quantify translation transcriptome-wide, revealing translated ORFs on multiple isoforms per gene. For most genes, one ORF represents the dominant translation product, but we also detect genes with translated ORFs on multiple transcript isoforms, including targets of RNA surveillance mechanisms. Measuring translation across human cell lines reveals the extent of gene-specific differences in protein production, supported by steady-state protein abundance estimates. Computational analysis of Ribo-seq data with ORFquant (https://github.com/lcalviell/ORFquant) provides insights into the heterogeneous functions of complex transcriptomes.
Full text
Available for:
IJS, NUK, SBMB, UL, UM, UPUK
RNA-sequencing protocols can quantify gene expression regulation from transcription to protein synthesis. Ribosome profiling (Ribo-seq) maps the positions of translating ribosomes over the entire ...transcriptome. We have developed RiboTaper (available at https://ohlerlab.mdc-berlin.de/software/), a rigorous statistical approach that identifies translated regions on the basis of the characteristic three-nucleotide periodicity of Ribo-seq data. We used RiboTaper with deep Ribo-seq data from HEK293 cells to derive an extensive map of translation that covered open reading frame (ORF) annotations for more than 11,000 protein-coding genes. We also found distinct ribosomal signatures for several hundred upstream ORFs and ORFs in annotated noncoding genes (ncORFs). Mass spectrometry data confirmed that RiboTaper achieved excellent coverage of the cellular proteome. Although dozens of novel peptide products were validated in this manner, few of the currently annotated long noncoding RNAs appeared to encode stable polypeptides. RiboTaper is a powerful method for comprehensive de novo identification of actively used ORFs from Ribo-seq data.
Pervasive transcription of the human genome results in a heterogeneous mix of coding RNAs and long noncoding RNAs (lncRNAs). Only a small fraction of lncRNAs have demonstrated regulatory functions, ...thus making functional lncRNAs difficult to distinguish from nonfunctional transcriptional byproducts. This difficulty has resulted in numerous competing human lncRNA classifications that are complicated by a steady increase in the number of annotated lncRNAs. To address these challenges, we quantitatively examined transcription, splicing, degradation, localization and translation for coding and noncoding human genes. We observed that annotated lncRNAs had lower synthesis and higher degradation rates than mRNAs and discovered mechanistic differences explaining slower lncRNA splicing. We grouped genes into classes with similar RNA metabolism profiles, containing both mRNAs and lncRNAs to varying extents. These classes exhibited distinct RNA metabolism, different evolutionary patterns and differential sensitivity to cellular RNA-regulatory pathways. Our classification provides an alternative to genomic context-driven annotations of lncRNAs.
Full text
Available for:
IJS, NUK, SBMB, UL, UM, UPUK
The extent to which alternative splicing and long intergenic noncoding RNAs (lincRNAs) contribute to the specialized functions of cells within an organ is poorly understood. We generated a ...comprehensive dataset of gene expression from individual cell types of the Arabidopsis root. Comparisons across cell types revealed that alternative splicing tends to remove parts of coding regions from a longer, major isoform, providing evidence for a progressive mechanism of splicing. Cell-type-specific intron retention suggested a possible origin for this common form of alternative splicing. Coordinated alternative splicing across developmental stages pointed to a role in regulating differentiation. Consistent with this hypothesis, distinct isoforms of a transcription factor were shown to control developmental transitions. lincRNAs were generally lowly expressed at the level of individual cell types, but co-expression clusters provided clues as to their function. Our results highlight insights gained from analysis of expression at the level of individual cell types.
•Cell type expression analyses characterize alt splicing and lincRNAs•Splicing tends to remove parts of coding regions from a longer, major isoform•Coordinated alternative splicing appears to regulate differentiation in the root•For the majority of proteins, peptide and major isoform abundance correlate well
Li et al. present a comprehensive dataset of gene expression generated using short-read sequencing from individual cell types and developmental zones of the Arabidopsis root, complemented by long-read sequencing and quantitative proteomics analyses. The data in this resource characterize cell-type-specific and developmental-stage-specific alternative splicing and lincRNA expression.
Full text
Available for:
GEOZS, IJS, IMTLJ, KILJ, KISLJ, NLZOH, NUK, OILJ, PNG, SAZU, SBCE, SBJE, UILJ, UL, UM, UPCLJ, UPUK, ZAGLJ, ZRSKP
DNase-seq and ATAC-seq are broadly used methods to assay open chromatin regions genome-wide. The single nucleotide resolution of DNase-seq has been further exploited to infer transcription factor ...binding sites (TFBSs) in regulatory regions through footprinting. Recent studies have demonstrated the sequence bias of DNase I and its adverse effects on footprinting efficiency. However, footprinting and the impact of sequence bias have not been extensively studied for ATAC-seq.
Here, we undertake a systematic comparison of the two methods and show that a modification to the ATAC-seq protocol increases its yield and its agreement with DNase-seq data from the same cell line. We demonstrate that the two methods have distinct sequence biases and correct for these protocol-specific biases when performing footprinting. Despite the differences in footprint shapes, the locations of the inferred footprints in ATAC-seq and DNase-seq are largely concordant. However, the protocol-specific sequence biases in conjunction with the sequence content of TFBSs impact the discrimination of footprint from the background, which leads to one method outperforming the other for some TFs. Finally, we address the depth required for reproducible identification of open chromatin regions and TF footprints.
We demonstrate that the impact of bias correction on footprinting performance is greater for DNase-seq than for ATAC-seq and that DNase-seq footprinting leads to better performance. It is possible to infer concordant footprints by using replicates, highlighting the importance of reproducibility assessment. The results presented here provide an overview of the advantages and limitations of footprinting analyses using ATAC-seq and DNase-seq.