As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.
The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as these models are exposed during training solely to the sequence variation between human genes that arose through evolution, raising the question of the extent to which they capture genuine causal signals.
Here we confront predictions of state-of-the-art models of transcription regulation with data from two large-scale observational studies and five deep perturbation assays. The most advanced of these sequence-based models, Enformer, by and large captures causal determinants of human promoters. However, the models fail to capture the causal effects of enhancers on expression, notably at medium to long distances and particularly for highly expressed promoters. More generally, the predicted impact of distal elements on gene expression is small, and the ability to correctly integrate long-range information is significantly more limited than the receptive fields of the models suggest. This is likely caused by the escalating class imbalance between actual and candidate regulatory elements as distance increases.
Our results suggest that sequence-based models have advanced to the point that in silico study of promoter regions and promoter variants can provide meaningful insights, and we provide practical guidance on how to use them. Moreover, we foresee that training models that accurately account for distal elements will require significantly more data, and in particular new kinds of data.
Aberrant splicing is a major cause of rare diseases. However, its prediction from genome sequence alone remains in most cases inconclusive. Recently, RNA sequencing has proven to be an effective complementary avenue to detect aberrant splicing. Here, we develop FRASER, an algorithm to detect aberrant splicing from RNA sequencing data. Unlike existing methods, FRASER captures not only alternative splicing but also intron retention events. Capturing intron retention typically doubles the number of detected aberrant events and here identified a pathogenic intron retention in MCOLN1 causing mucolipidosis. FRASER automatically controls for latent confounders, which are widespread and substantially affect sensitivity. Moreover, FRASER is based on a count distribution and multiple testing correction, reducing the number of calls by two orders of magnitude over commonly applied z-score cutoffs, with only a minor loss of sensitivity. We demonstrate the application of FRASER to rare disease diagnostics by reprioritizing a pathogenic aberrant exon truncation in TAZ from a published dataset. FRASER is easy to use and freely available.
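To illustrate the statistical point, here is a minimal sketch on simulated splice-read counts, not FRASER's actual method (which fits count distributions while controlling for latent confounders). It shows why a count-based test with Benjamini-Hochberg correction calls far fewer outliers than a plain z-score cutoff while still recovering a clearly aberrant sample; all counts and rates below are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated split-read counts for one intron across 100 samples:
# k successful splice reads out of n total, true usage proportion 0.5.
n = rng.integers(20, 200, size=100)
k = rng.binomial(n, 0.5)
k[0] = 0  # one genuinely aberrant sample: the intron is never spliced

psi = k / n  # observed splicing proportion

# Commonly applied heuristic: flag samples whose z score exceeds a cutoff.
z = (psi - psi.mean()) / psi.std()
z_calls = np.abs(z) > 2

# Count-based alternative: exact binomial p-value per sample,
# then Benjamini-Hochberg multiple testing correction at FDR 0.05.
pvals = np.array([stats.binomtest(int(ki), int(ni), 0.5).pvalue
                  for ki, ni in zip(k, n)])
m = len(pvals)
order = np.argsort(pvals)
passed = pvals[order] <= 0.05 * np.arange(1, m + 1) / m
count_calls = np.zeros(m, dtype=bool)
if passed.any():
    last = passed.nonzero()[0].max()
    count_calls[order[:last + 1]] = True

print(f"z-score calls: {z_calls.sum()}, count-based calls: {count_calls.sum()}")
```

The z-score cutoff ignores coverage, so low-coverage samples drift past the threshold by chance; the exact test weighs each observation by its read depth, which is the intuition behind the reduction in calls.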
TT-seq maps the human transient transcriptome. Schwalb, Björn; Michel, Margaux; Zacher, Benedikt ...
Science (American Association for the Advancement of Science), 06/2016, Volume 352, Issue 6290.
Journal Article, Peer-reviewed.
Pervasive transcription of the genome produces both stable and transient RNAs. We developed transient transcriptome sequencing (TT-seq), a protocol that uniformly maps the entire range of RNA-producing units and estimates rates of RNA synthesis and degradation. Application of TT-seq to human K562 cells recovers stable messenger RNAs and long intergenic noncoding RNAs and additionally maps transient enhancer, antisense, and promoter-associated RNAs. TT-seq analysis shows that enhancer RNAs are short-lived and lack U1 motifs and secondary structure. TT-seq also maps transient RNA downstream of polyadenylation sites and uncovers sites of transcription termination; we found, on average, four transcription termination sites, distributed in a window with a median width of ~3300 base pairs. Termination sites coincide with a DNA motif associated with pausing of RNA polymerase before its release from the genome.
RNA splicing is an essential part of eukaryotic gene expression. Although the mechanism of splicing has been extensively studied in vitro, in vivo kinetics for the two-step splicing reaction remain poorly understood. Here, we combine transient transcriptome sequencing (TT-seq) and mathematical modeling to quantify RNA metabolic rates at donor and acceptor splice sites across the human genome. Splicing occurs in the range of minutes and is limited by the speed of RNA polymerase elongation. Splicing kinetics strongly depends on the position and nature of nucleotides flanking splice sites, and on structural interactions between unspliced RNA and small nuclear RNAs in spliceosomal intermediates. Finally, we introduce the 'yield' of splicing as the efficiency of converting unspliced to spliced RNA and show that it is highest for mRNAs and independent of splicing kinetics. These results lead to quantitative models describing how splicing rates and yield are encoded in the human genome.
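The kind of quantitative model the abstract refers to can be made concrete with a schematic first-order kinetic scheme. The rate constants below are invented for illustration; this is not the paper's fitted model:

```python
import numpy as np

# Schematic first-order model of RNA metabolism at one splice site:
#   unspliced precursor u:  du/dt = alpha - beta * u
#   spliced product     s:  ds/dt = beta * u - gamma * s
# alpha: synthesis rate, beta: splicing rate, gamma: degradation rate (1/min).
alpha, beta, gamma = 10.0, 2.0, 0.5  # illustrative values, not fitted

# Steady states follow from setting the derivatives to zero.
u_ss = alpha / beta            # steady-state unspliced level
s_ss = beta * u_ss / gamma     # steady-state spliced level

# Splicing half-life: time for half of a precursor cohort to be spliced.
t_half = np.log(2) / beta      # ~0.35 min, i.e. "range of minutes"

# Forward-Euler integration from zero confirms convergence to steady state.
u = s = 0.0
dt = 0.001
for _ in range(20_000):        # 20 simulated minutes
    u += (alpha - beta * u) * dt
    s += (beta * u - gamma * s) * dt

print(u_ss, s_ss, t_half, round(u, 3), round(s, 3))
```

In this scheme the 'yield' of splicing is the fraction of synthesized precursor that ends up spliced; adding a precursor degradation pathway would make it less than one.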
Genome‐, transcriptome‐ and proteome‐wide measurements provide insights into how biological systems are regulated. However, fundamental aspects relating to which human proteins exist, where they are expressed and in which quantities are not fully understood. Therefore, we generated a quantitative proteome and transcriptome abundance atlas of 29 paired healthy human tissues from the Human Protein Atlas project representing human genes by 18,072 transcripts and 13,640 proteins including 37 without prior protein‐level evidence. The analysis revealed that hundreds of proteins, particularly in testis, could not be detected even for highly expressed mRNAs, that few proteins show tissue‐specific expression, that strong differences between mRNA and protein quantities within and across tissues exist and that protein expression is often more stable across tissues than that of transcripts. Only 238 of 9,848 amino acid variants found by exome sequencing could be confidently detected at the protein level, showing that proteogenomics remains challenging, needs better computational methods and requires rigorous validation. Many uses of this resource can be envisaged including the study of gene/protein expression regulation and biomarker specificity evaluation.
Synopsis
Proteome and transcriptome quantification across tissues reveals which human genes exist as transcripts and proteins, where they are expressed and in which approximate quantities. Tissue‐specific protein expression is found to be a rare and quantitative rather than qualitative characteristic.
The study presents the most comprehensive atlas of protein expression to date, across 29 healthy human tissues.
Protein‐level evidence is provided for 13,640 genes and 15,257 isoforms, including 37 missing proteins.
Tissue‐specific protein expression is a rare and quantitative, rather than qualitative, characteristic.
Proteogenomics is still challenging and needs rigorous validation by synthetic peptides.
The 5' untranslated region (5'UTR) plays a key role in regulating mRNA translation and consequently protein abundance. Accurate modeling of 5'UTR regulatory sequences should therefore provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5'UTR sequence with a high degree of accuracy. However, this model is restricted to the sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without substantial loss of information. Here, we introduce frame pooling, a novel neural network operation that enables MRL prediction for 5'UTRs of any length. Our model shows state-of-the-art performance on fixed-length randomized sequences, while generalizing better to longer sequences and to a variety of translation-related genome-wide datasets. We demonstrate variant interpretation on a 5'UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks, and our model, released open source, could help pinpoint pathogenic genetic variants.
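A schematic of the idea behind frame pooling, as read from the abstract: pool per-position features separately for each of the three reading-frame offsets, so the output size is independent of sequence length. Grouping positions by index mod 3 is an illustrative assumption; the published operation may anchor frames relative to the start codon instead:

```python
import numpy as np

def frame_pool(features: np.ndarray) -> np.ndarray:
    """Max-pool a (length, channels) feature map separately per frame.

    Positions are grouped by index mod 3 (an illustrative choice; the
    published operation may anchor frames to the start codon) and
    max-pooled per group, giving a (3 * channels,) vector for any length.
    """
    L, C = features.shape
    frames = [features[f::3] for f in range(3)]  # positions f, f+3, f+6, ...
    pooled = [fr.max(axis=0) if len(fr) else np.zeros(C) for fr in frames]
    return np.concatenate(pooled)

rng = np.random.default_rng(1)
short = rng.normal(size=(17, 8))    # features for a 17-nt 5'UTR
long_ = rng.normal(size=(230, 8))   # features for a 230-nt 5'UTR
print(frame_pool(short).shape, frame_pool(long_).shape)  # both (24,)
```

Because the pooled dimension is fixed, a dense head trained on reporter-assay lengths can still accept full-length human 5'UTRs, which is the length-generalization property the abstract describes.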
Metabolic pathway analysis has been recognized as a central approach to the structural analysis of metabolic networks. The concept of elementary (flux) modes provides a rigorous formalism to describe and assess pathways and has proven valuable for many applications. However, computing elementary modes is a hard computational task. In recent years, the number of algorithms dedicated to it has multiplied, calling for a summarizing point of view and for continued improvement of the current methods.
We show that computing the set of elementary modes is equivalent to computing the set of extreme rays of a convex cone. This standard mathematical representation provides a unified framework that encompasses the most prominent algorithmic methods for computing elementary modes and allows a clear comparison between them. Taking lessons from this benchmark, we introduce a new method, the binary approach, which computes elementary modes as binary patterns of participating reactions, from which the respective stoichiometric coefficients can be computed in a post-processing step. We implemented the binary approach in FluxAnalyzer 5.1, software that is free for academic use. The binary approach decreases the memory demand by up to 96% without loss of speed, making it the most efficient method available to date for computing elementary modes.
The equivalence between elementary modes and extreme ray computations offers opportunities for employing tools from polyhedral computation for metabolic pathway analysis. The new binary approach introduced herein was derived from this general theoretical framework and facilitates the computation of elementary modes in considerably larger networks.
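The deferred post-processing step can be sketched as follows: given the binary pattern of participating reactions of one elementary mode, the stoichiometric coefficients are recovered from the null space of the corresponding column submatrix of the stoichiometric matrix (which is one-dimensional for a genuine elementary mode). The toy network and helper below are illustrative, not FluxAnalyzer's implementation:

```python
import numpy as np

# Toy network: metabolites A, B; reactions
#   R1: -> A,   R2: A -> B,   R3: B ->,   R4: A ->
S = np.array([
    [1, -1,  0, -1],   # A
    [0,  1, -1,  0],   # B
], dtype=float)

def coefficients_from_pattern(S, pattern):
    """Recover flux coefficients of one elementary mode from its binary
    pattern of participating reactions: solve S_sub @ v = 0 for the
    participating columns via the null space (last right singular vector).
    Assumes the pattern pins down a one-dimensional null space."""
    cols = np.flatnonzero(pattern)
    _, _, vt = np.linalg.svd(S[:, cols])
    v = vt[-1]
    v = v / np.abs(v).max()   # scale so the largest coefficient is 1
    if v.sum() < 0:           # orient fluxes to be nonnegative here
        v = -v
    full = np.zeros(S.shape[1])
    full[cols] = v
    return full

# Elementary mode using R1, R2, R3: A is produced, converted to B, exported.
mode = coefficients_from_pattern(S, [1, 1, 1, 0])
print(mode)  # [1. 1. 1. 0.]
```

Storing only the binary patterns during enumeration and deferring this solve is what yields the memory savings the abstract reports.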
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing), with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with performance matching or exceeding the state of the art. Our models, available in the repository Kipoi, apply to variants, including indels, directly from VCF files.