Our genome contains tens of thousands of long noncoding RNAs (lncRNAs), many of which are likely to have genetic regulatory functions. It has been proposed that lncRNAs are organized into combinations of discrete functional domains, but the nature of these and their identification remain elusive. One class of sequence elements that is enriched in lncRNAs is represented by transposable elements (TEs), repetitive mobile genetic sequences that have contributed widely to genome evolution through a process termed exaptation. Here, we link these two concepts by proposing that exonic TEs act as RNA domains that are essential for lncRNA function. We term such elements Repeat Insertion Domains of LncRNAs (RIDLs). A growing number of RIDLs have been experimentally defined, where TE-derived fragments of lncRNA act as RNA-, DNA-, and protein-binding domains. We propose that these reflect a more general phenomenon of exaptation during lncRNA evolution, where inserted TE sequences are repurposed as recognition sites for both proteins and nucleic acids. We discuss a series of genomic screens that may be used in the future to systematically discover RIDLs. The RIDL hypothesis has the potential to explain how functional evolution can keep pace with the rapid gene evolution observed in lncRNAs. More practically, TE maps may in the future be used to predict lncRNA function.
Because of ever-increasing throughput requirements of sequencing data, most existing short-read aligners have been designed to focus on speed at the expense of accuracy. The Genome Multitool (GEM) mapper can leverage string matching by filtration to search the alignment space more efficiently, simultaneously delivering precision (performing fully tunable exhaustive searches that return all existing matches, including gapped ones) and speed (being several times faster than comparable state-of-the-art tools).
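To make "string matching by filtration" concrete, the following is a minimal sketch of the textbook pigeonhole filter, not the GEM mapper's actual algorithm. It assumes mismatches only (no gaps) and a plain in-memory string rather than an index. The function names `hamming` and `filtration_search` are hypothetical. The idea is that if a pattern matches with at most e mismatches, then at least one of e + 1 non-overlapping pieces of the pattern must match exactly, so only positions suggested by an exact piece hit need full verification.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def filtration_search(text: str, pattern: str, max_mismatches: int) -> list[int]:
    """Return all start positions where pattern occurs in text with at most
    max_mismatches substitutions (illustrative pigeonhole filtration)."""
    m = len(pattern)
    pieces = max_mismatches + 1
    if m < pieces:
        raise ValueError("pattern too short for this many mismatches")

    # Split the pattern into `pieces` non-empty, non-overlapping seeds.
    bounds = [m * i // pieces for i in range(pieces + 1)]

    candidates = set()
    for j in range(pieces):
        start, end = bounds[j], bounds[j + 1]
        seed = pattern[start:end]
        # Filtration step: every exact seed hit proposes one candidate start.
        pos = text.find(seed)
        while pos != -1:
            cand = pos - start
            if 0 <= cand <= len(text) - m:
                candidates.add(cand)
            pos = text.find(seed, pos + 1)

    # Verification step: check each candidate against the full pattern.
    return sorted(
        c for c in candidates
        if hamming(text[c:c + m], pattern) <= max_mismatches
    )
```

Because every true match must contain at least one exact seed, the search remains exhaustive for the chosen mismatch budget while verifying only a small set of candidate positions.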
Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.
Cross-species comparisons of genomes, transcriptomes and gene regulation are now feasible at unprecedented resolution and throughput, enabling the comparison of human and mouse biology at the molecular level. Insights have been gained into the degree of conservation between human and mouse at the level of not only gene expression but also epigenetics and inter-individual variation. However, a number of limitations exist, including incomplete transcriptome characterization and difficulties in identifying orthologous phenotypes and cell types, which are beginning to be addressed by emerging technologies. Ultimately, these comparisons will help to identify the conditions under which the mouse is a suitable model of human physiology and disease, and optimize the use of animal models.
Alternative splicing (AS) is a fundamental step in eukaryotic mRNA biogenesis. Here, we develop an efficient and reproducible pipeline for the discovery of genetic variants that affect AS (splicing QTLs, sQTLs). We use it to analyze the GTEx dataset, generating a comprehensive catalog of sQTLs in the human genome. Downstream analysis of this catalog provides insight into the mechanisms underlying splicing regulation. We report that a core set of sQTLs is shared across multiple tissues. sQTLs often target the global splicing pattern of genes, rather than individual splicing events. Many also affect the expression of the same or other genes, uncovering regulatory loci that act through different mechanisms. sQTLs tend to be located in post-transcriptionally spliced introns, which would function as hotspots for splicing regulation. While many variants affect splicing patterns by altering the sequence of splice sites, many more modify the binding sites of RNA-binding proteins. Genetic variants affecting splicing can have a stronger phenotypic impact than those affecting gene expression.
We present ggsashimi, a command-line tool for the visualization of splicing events across multiple samples. Given a specified genomic region, ggsashimi creates sashimi plots for individual RNA-seq experiments as well as aggregated plots for groups of experiments, a feature unique to this software. Compared to existing programs that generate sashimi plots, it uses popular bioinformatics file formats, is annotation-independent, and allows the visualization of splicing events even for large genomic regions by scaling down the genomic segments between splice sites. ggsashimi is freely available at https://github.com/guigolab/ggsashimi. It is implemented in Python, and internally generates R code for plotting.
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (http://gemlibrary.sourceforge.net).
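As a rough illustration of what a per-position mappability value can look like, here is a brute-force sketch under simplifying assumptions. It is not the GEM mappability program, which is mapping-based and supports mismatches. The sketch assumes mappability at a position is the inverse of the number of times the k-mer starting there occurs in the sequence, and it counts exact matches on the forward strand only. The function name `kmer_mappability` is hypothetical.

```python
from collections import Counter


def kmer_mappability(genome: str, k: int) -> list[float]:
    """Return, for each k-mer start position, 1 / (number of exact
    occurrences of that k-mer in the genome).

    Simplifying assumptions: exact matches only, forward strand only.
    A value of 1.0 means the k-mer is unique; lower values mean repeats.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and len(genome)")

    # Enumerate every k-mer once, then count how often each one occurs.
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]
```

For example, `kmer_mappability("ACGTACGA", 3)` assigns 0.5 to the two positions where `ACG` starts and 1.0 to the others. Reads originating from low-scoring positions cannot be placed unambiguously, which is why such a track is useful when interpreting SNP calls or expression estimates.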
Recent footprinting studies have made the surprising observation that long noncoding RNAs (lncRNAs) physically interact with ribosomes. However, these findings remain controversial, and the overall proportion of cytoplasmic lncRNAs involved is unknown. Here we make a global, absolute estimate of the cytoplasmic and ribosome-associated population of stringently filtered lncRNAs in a human cell line using polysome profiling coupled to spike-in normalized microarray analysis. Fifty-four percent of expressed lncRNAs are detected in the cytoplasm. The majority of these (70%) have >50% of their cytoplasmic copies associated with polysomal fractions. These interactions are lost upon disruption of ribosomes by puromycin. Polysomal lncRNAs are distinguished by a number of 5' mRNA-like features, including capping and 5'UTR length. On the other hand, nonpolysomal "free cytoplasmic" lncRNAs have more conserved promoters and a wider range of expression across cell types. Exons of polysomal lncRNAs are depleted of endogenous retroviral insertions, suggesting a role for repetitive elements in lncRNA localization. Finally, we show that blocking of ribosomal elongation results in stabilization of many associated lncRNAs. Together these findings suggest that the ribosome is the default destination for the majority of cytoplasmic long noncoding RNAs and may play a role in their degradation.
The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high-quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference-quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature, to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.
High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.