PacBio single molecule real-time sequencing is a third-generation sequencing technique producing long reads, with comparatively lower throughput and higher error rate. Errors include numerous indels and complicate downstream analysis like mapping or de novo assembly. A hybrid strategy that takes advantage of the high accuracy of second-generation short reads has been proposed for correcting long reads. Mapping of short reads on long reads provides sufficient coverage to eliminate up to 99% of errors, but at the expense of prohibitive running times and considerable amounts of disk and memory space.
We present LoRDEC, a hybrid error correction method that builds a succinct de Bruijn graph representing the short reads, and seeks a corrective sequence for each erroneous region in the long reads by traversing chosen paths in the graph. Compared with available tools, LoRDEC is at least six times faster and requires at least 93% less memory or disk space, while achieving comparable accuracy. Availability and implementation: LoRDEC is written in C++, tested on Linux platforms and freely available at http://atgc.lirmm.fr/lordec.
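The graph traversal described above can be illustrated with a minimal sketch. It is not LoRDEC's implementation: it stores k-mers in a plain Python set rather than a succinct structure, and it returns the shortest bridging path by breadth-first search instead of choosing the path that best matches the erroneous region. The function names (`build_kmer_set`, `successors`, `bridge`) and the `max_len` bound are hypothetical.

```python
from collections import deque

def build_kmer_set(reads, k):
    """Collect every k-mer of the short reads (the nodes of the de Bruijn graph)."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def successors(kmer, kmers):
    """Nodes reachable by one edge: drop the first base, append A, C, G or T."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]

def bridge(source, target, kmers, max_len):
    """Breadth-first search for a path from the source k-mer to the target k-mer.

    Returns the sequence spelled by the shortest path, or None when no path
    spelling at most max_len bases exists. Simplification: a real corrector
    would compare candidate paths against the long-read region.
    """
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        node, seq = queue.popleft()
        if node == target:
            return seq
        if len(seq) >= max_len:
            continue  # do not extend paths beyond the length bound
        for nxt in successors(node, kmers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None

# Two overlapping short reads; the long read is assumed to contain the
# k-mers ACGT and TGCA with an erroneous region between them.
kmers = build_kmer_set(["ACGTTG", "GTTGCA"], k=4)
corrected = bridge("ACGT", "TGCA", kmers, max_len=20)
```

In this toy case the two flanking k-mers are joined by a single path, so `corrected` spells the sequence supported by the short reads; with a too-small `max_len` the search gives up and returns `None`.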
New long read sequencing technologies, like PacBio SMRT and Oxford NanoPore, can produce sequencing reads up to 50 000 bp long but with an error rate of at least 15%. Reducing the error rate is necessary for subsequent utilization of the reads in, e.g. de novo genome assembly. The error correction problem has been tackled either by aligning the long reads against each other or by a hybrid approach that uses the more accurate short reads produced by second generation sequencing technologies to correct the long reads.
We present an error correction method that uses long reads only. The method consists of two phases: first, we use an iterative alignment-free correction method based on de Bruijn graphs with increasing length of k-mers, and second, the corrected reads are further polished using long-distance dependencies that are found using multiple alignments. According to our experiments, the proposed method is the most accurate one relying on long reads only for read sets with high coverage. Furthermore, when the coverage of the read set is at least 75×, the throughput of the new method is at least 20% higher.
LoRMA is freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/ .
Contact: leena.salmela@cs.helsinki.fi.
The last decade has seen mRNA modification emerge as a new layer of gene expression regulation. The Fat mass and obesity-associated protein (FTO) was the first identified eraser of N6-methyladenosine (m6A) adducts, the most widespread modification in eukaryotic messenger RNA. This discovery of a reversible and dynamic RNA modification, aided by recent technological advances in RNA mass spectrometry and sequencing, has led to the birth of the field of epitranscriptomics. FTO has attracted much of the attention of epitranscriptomics researchers, resulting in the publication of numerous, yet contradictory, studies describing the regulatory role of FTO in gene expression and central biological processes. These incongruities may be explained by a wide spectrum of FTO substrates and RNA sequence preferences: FTO binds multiple RNA species (mRNA, snRNA and tRNA) and can demethylate internal m6A in mRNA and snRNA, N6,2′-O-dimethyladenosine (m6Am) adjacent to the mRNA cap, and N1-methyladenosine (m1A) in tRNA. Here, we review current knowledge related to FTO function in healthy and cancer cells. In particular, we emphasize the divergent role(s) attributed to FTO in different tissues and subcellular and molecular contexts.
Cancer stem cells (CSCs) are a small but critical cell population for cancer biology since they display inherent resistance to standard therapies and give rise to metastases. Despite accruing evidence establishing a link between deregulation of epitranscriptome-related players and tumorigenic process, the role of messenger RNA (mRNA) modifications in the regulation of CSC properties remains poorly understood. Here, we show that the cytoplasmic pool of fat mass and obesity-associated protein (FTO) impedes CSC abilities in colorectal cancer through its N6,2′-O-dimethyladenosine (m6Am) demethylase activity. While m6Am is strategically located next to the m7G-mRNA cap, its biological function is not well understood and has not been addressed in cancer. Low FTO expression in patient-derived cell lines elevates m6Am level in mRNA which results in enhanced in vivo tumorigenicity and chemoresistance. Inhibition of the nuclear m6Am methyltransferase, PCIF1/CAPAM, fully reverses this phenotype, stressing the role of m6Am modification in stem-like properties acquisition. FTO-mediated regulation of m6Am marking constitutes a reversible pathway controlling CSC abilities. Altogether, our findings bring to light the first biological function of the m6Am modification and its potential adverse consequences for colorectal cancer management.
Faithful transcription initiation is critical for accurate gene expression, yet the mechanisms underlying specific transcription start site (TSS) selection in mammals remain unclear. Here, we show that the histone-fold domain protein NF-Y, a ubiquitously expressed transcription factor, controls the fidelity of transcription initiation at gene promoters in mouse embryonic stem cells. We report that NF-Y maintains the region upstream of TSSs in a nucleosome-depleted state while simultaneously protecting this accessible region against aberrant and/or ectopic transcription initiation. We find that loss of NF-Y binding in mammalian cells disrupts the promoter chromatin landscape, leading to nucleosomal encroachment over the canonical TSS. Importantly, this chromatin rearrangement is accompanied by upstream relocation of the transcription pre-initiation complex and ectopic transcription initiation. Further, this phenomenon generates aberrant extended transcripts that undergo translation, disrupting gene expression profiles. These results suggest NF-Y is a central player in TSS selection in metazoans and highlight the deleterious consequences of inaccurate transcription initiation.
Microsatellites are short, tandemly-repeated DNA sequences which are widely distributed among genomes. Their structure, role and evolution can be analyzed based on exhaustive extraction from sequenced genomes. Several dedicated algorithms have been developed for this purpose. Here, we compared the detection efficiency of five of them (TRF, Mreps, Sputnik, STAR, and RepeatMasker).
Our analysis was first conducted on the human X chromosome, and microsatellite distributions were characterized by microsatellite number, length, and divergence from a pure motif. The algorithms work with user-defined parameters, and we demonstrate that the parameter values chosen can strongly influence microsatellite distributions. The five algorithms were then compared by fixing parameter settings, and the analysis was extended to three other genomes (Saccharomyces cerevisiae, Neurospora crassa and Drosophila melanogaster) spanning a wide range of size and structure. Significant differences for all characteristics of microsatellites were observed among algorithms, but not among genomes, for both perfect and imperfect microsatellites. Striking differences were detected for short microsatellites (below 20 bp), regardless of motif.
Since the algorithm used strongly influences empirical distributions, studies analyzing microsatellite evolution based on a comparison between empirical and theoretical size distributions should be considered with caution. We also discuss why a typological definition of microsatellites limits our capacity to capture their genomic distributions.
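The sensitivity to user-defined parameters noted above can be illustrated with a minimal sketch. It is not one of the five compared tools: it detects only perfect repeats with a regular expression, and the function name and the `min_repeats` parameter are hypothetical. Changing the minimum repeat count alone changes which short microsatellites are reported.

```python
import re

def find_perfect_microsatellites(seq, motif_len, min_repeats):
    """Return (start, end, motif) for every perfect tandem repeat of a motif
    of length motif_len occurring at least min_repeats times in a row.

    Simplifications: no mismatches or indels are tolerated, and a
    homopolymer run such as AAAA is also reported as a repeat of 'AA'.
    """
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]

seq = "TTACACACACGGATATGC"
strict = find_perfect_microsatellites(seq, motif_len=2, min_repeats=3)
lenient = find_perfect_microsatellites(seq, motif_len=2, min_repeats=2)
```

On this toy sequence the stricter setting reports only the (AC)4 run, while the lenient setting also reports the (AT)2 run, so the empirical length distribution depends on the threshold chosen.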
Gene expression is the synthesis of proteins from the information encoded on DNA. One of the two main steps of gene expression is the translation of messenger RNA (mRNA) into polypeptide sequences of amino acids. Here, by taking into account mRNA degradation, we model the motion of ribosomes along mRNA with a ballistic model where particles advance along a filament without excluded volume interactions. Unidirectional models of transport have previously been used to fit the average density of ribosomes obtained by the experimental ribo-sequencing (Ribo-seq) technique in order to obtain the kinetic rates. The degradation rate is not, however, accounted for and experimental data from different experiments are needed to have enough parameters for the fit. Here, we propose an entirely novel experimental setup and theoretical framework consisting of splitting the mRNAs into categories depending on the number of ribosomes from one to four. We solve analytically the ballistic model for a fixed number of ribosomes per mRNA, study the different regimes of degradation, and propose a criterion for the quality of the inverse fit. The proposed method provides a high sensitivity to the mRNA degradation rate. The additional equations coming from using the monosome (single ribosome) and polysome (arbitrary number) ribo-seq profiles enable us to determine all the kinetic rates in terms of the experimentally accessible mRNA degradation rate.
In humans and mice, meiotic recombination events cluster into narrow hotspots whose genomic positions are defined by the PRDM9 protein via its DNA binding domain constituted of an array of zinc fingers (ZnFs). High polymorphism and rapid divergence of the Prdm9 gene ZnF domain appear to involve positive selection at DNA-recognition amino-acid positions, but the nature of the underlying evolutionary pressures remains a puzzle. Here we explore the variability of the Prdm9 ZnF array in wild mice, and uncover a high allelic diversity of both ZnF copy number and identity with the characterization of 113 alleles. We analyze features of the diversity of ZnF identity, which is mostly due to non-synonymous changes at codons -1, 3 and 6 of each ZnF, corresponding to amino-acids involved in DNA binding. Using methods adapted to the minisatellite structure of the ZnF array, we infer a phylogenetic tree of these alleles. We find the sister species Mus spicilegus and M. macedonicus as well as the three house mouse (Mus musculus) subspecies to be polyphyletic. However, some sublineages have expanded independently in Mus musculus musculus and M. m. domesticus, the latter further showing phylogeographic substructure. Compared to random genomic regions and non-coding minisatellites, none of these patterns appears exceptional. In silico prediction of DNA binding sites for each allele, overlap of their alignments to the genome and relative coverage of the different families of interspersed repeated elements suggest a large diversity between PRDM9 variants with a potential for highly divergent distributions of recombination events in the genome with little correlation to evolutionary distance. By compiling PRDM9 ZnF protein sequences in Primates, Muridae and Equids, we find different diversity patterns among the three amino-acids most critical for the DNA-recognition function, suggesting different diversification timescales.
Protozoan parasites are known to attach a specific and diverse group of proteins to their plasma membrane via a GPI anchor. In malaria parasites, GPI-anchored proteins (GPI-APs) have been shown to play an important role in host-pathogen interactions and a key function in host cell invasion and immune evasion. Because of their immunogenic properties, some of these proteins have been considered as malaria vaccine candidates. However, identification of all possible GPI-APs encoded by these parasites remains challenging due to their sequence diversity and limitations of the tools used for their characterization.
The FT-GPI software was developed to detect GPI-APs based on the presence of a hydrophobic helix at both ends of the premature peptide. FT-GPI was implemented in C++ and applied to study the GPI-proteome of 46 isolates of the order Haemosporida. Using the GPI proteome of Plasmodium falciparum strain 3D7 and Plasmodium vivax strain Sal-1, a heuristic method was defined to select the most sensitive and specific FT-GPI software parameters.
FT-GPI enabled revision of the GPI-proteome of P. falciparum and P. vivax, including the identification of novel GPI-APs. Orthology- and synteny-based analyses showed that 19 of the 37 GPI-APs found in the order Haemosporida are conserved among Plasmodium species. Our analyses suggest that gene duplication and deletion events may have contributed significantly to the evolution of the GPI proteome, and its composition correlates with speciation.
FT-GPI-based prediction is a useful tool for mining GPI-APs and gaining further insights into their evolution and sequence diversity. This resource may also help identify new protein candidates for the development of vaccines for malaria and other parasitic diseases.
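The idea of screening for a hydrophobic stretch at both ends of a protein can be illustrated with a minimal sketch. It is not FT-GPI's algorithm, whose scoring and parameters are not described here: it uses a sliding-window mean of the Kyte-Doolittle hydropathy scale, and the function names, `end_len`, `window` and `threshold` values are hypothetical choices for illustration only.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def max_window_hydropathy(segment, window=10):
    """Highest mean hydropathy over any window of the segment.

    Returns -inf when the segment is shorter than the window.
    Non-standard residue codes raise KeyError in this sketch.
    """
    scores = [KD[aa] for aa in segment]
    if len(scores) < window:
        return float("-inf")
    return max(sum(scores[i:i + window]) / window
               for i in range(len(scores) - window + 1))

def has_hydrophobic_ends(protein, end_len=30, window=10, threshold=1.6):
    """True when both the N-terminal and C-terminal end_len residues contain
    a window whose mean hydropathy reaches the (assumed) threshold."""
    n_term = max_window_hydropathy(protein[:end_len], window)
    c_term = max_window_hydropathy(protein[-end_len:], window)
    return n_term >= threshold and c_term >= threshold

# Toy sequences: hydrophobic at both ends versus charged throughout.
candidate = "M" + "L" * 12 + "DEKR" * 10 + "L" * 12
soluble = "DEKR" * 20
```

A screen of this kind only flags candidates; deciding which window length and threshold give the best sensitivity and specificity would require calibration on proteins with known GPI-anchor status, as the abstract describes for the actual software.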
Motivation: Tandem repeats consist of approximate and adjacent repetitions of a DNA motif. Such repeats account for large portions of eukaryotic genomes and have also been found in other life kingdoms. Owing to their polymorphism, tandem repeats have proven useful in genome cartography, forensic and population studies, etc. Nevertheless, they are not systematically detected nor annotated in genome projects. Partially because of this lack of data, their evolution is still poorly understood. Results: In this work, we design an exact algorithm to locate approximate tandem repeats (ATR) of a motif in a DNA sequence. Given a motif and a DNA sequence, our method, named STAR, identifies all segments of the sequence that correspond to significant approximate tandem repetitions of the motif. In our model, an Exact Tandem Repeat (ETR) comes from the tandem duplication of the motif and an ATR derives from an ETR by a series of point mutations. An ATR can then be encoded as a number of duplications of the motif together with a list of mutations. Consequently, any sequence that is not an ATR cannot be encoded efficiently by this description, while a true ATR can. Our method uses the minimum description length criterion to identify which sequence segments are ATR. Our optimization procedure guarantees that STAR finds a combination of ATR that minimizes this criterion. Availability: for use at http://atgc.lirmm.fr/star Supplementary information: an appendix is available at http://atgc.lirmm.fr/star under ‘Paper and contacts’.
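The minimum description length idea can be illustrated with a minimal sketch. It is not STAR's encoding or optimization procedure: it assumes a fixed alignment of the segment to the motif starting at the first base, allows substitutions only, and uses hypothetical bit costs (2 bits per literal base; for each mutation, a position plus one of three alternative bases). A segment is called an ATR when the repeat description is shorter than the literal one.

```python
import math

def literal_cost(segment):
    """Bits needed to write the segment base by base (2 bits per base)."""
    return 2 * len(segment)

def repeat_cost(segment, motif):
    """Bits needed to describe the segment as tandem copies of the motif
    plus a list of substitutions (assumed cost model, substitutions only)."""
    n = len(segment)
    exact = (motif * (n // len(motif) + 1))[:n]   # the exact tandem repeat
    mismatches = sum(a != b for a, b in zip(segment, exact))
    length_bits = math.log2(n + 1)                # how long the repeat runs
    per_mutation = math.log2(n) + math.log2(3)    # position + replacement base
    return length_bits + mismatches * per_mutation

def is_atr(segment, motif):
    """True when the repeat description is shorter than the literal one."""
    return repeat_cost(segment, motif) < literal_cost(segment)

exact_repeat = "ACACACACACACACAC"
one_mutation = "ACACAGACACACACAC"
unrelated = "AGTCCGATTGCAAGTC"
```

With these costs, an exact repeat and a repeat carrying one substitution are both cheaper to describe through the motif, whereas an unrelated segment needs so many listed mutations that the literal description wins.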