PacBio sequencers produce two types of characteristic reads (continuous long reads: long and high error rate and circular consensus sequencing: short and low error rate), both of which could be useful for de novo assembly of genomes. Currently, there is no available simulator that targets the specific generation of PacBio libraries.
Our analysis of 13 PacBio datasets showed characteristic features of PacBio reads (e.g. the read length of PacBio reads follows a log-normal distribution). We have developed a read simulator, PBSIM, that captures these features using either a model-based or sampling-based method. Using PBSIM, we conducted several hybrid error correction and assembly tests for PacBio reads, suggesting that a continuous long reads coverage depth of at least 15 in combination with a circular consensus sequencing coverage depth of at least 30 achieved extensive assembly results.
PBSIM is freely available from the web under the GNU GPL v2 license (http://code.google.com/p/pbsim/).
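As an illustration of the model-based idea described above, the following minimal sketch draws read lengths from a log-normal distribution. It is not PBSIM's code: the function name `sample_read_lengths`, the length bounds, and the mean and standard deviation used in the example are all assumptions chosen for illustration.

```python
import math
import random


def sample_read_lengths(n, mean_len, sd_len, min_len=100, max_len=25000, seed=0):
    """Draw n read lengths from a log-normal distribution truncated to [min_len, max_len].

    mean_len and sd_len describe the lengths themselves; they are converted
    to the parameters (mu, sigma) of the underlying normal distribution.
    All default values are hypothetical, not taken from PBSIM.
    """
    sigma2 = math.log(1.0 + (sd_len / mean_len) ** 2)
    mu = math.log(mean_len) - sigma2 / 2.0
    sigma = math.sqrt(sigma2)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    lengths = []
    while len(lengths) < n:
        length = int(round(rng.lognormvariate(mu, sigma)))
        # Rejection step: discard draws outside the allowed length range.
        if min_len <= length <= max_len:
            lengths.append(length)
    return lengths


if __name__ == "__main__":
    lengths = sample_read_lengths(2000, mean_len=3000, sd_len=2300)
    print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

A sampling-based alternative would instead draw lengths directly from the reads of a real dataset, which avoids assuming any particular distribution.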
Abstract
RNA secondary structure around translation initiation sites strongly affects the abundance of expressed proteins in Escherichia coli. However, detailed secondary structural features governing protein abundance remain elusive. Recent advances in high-throughput DNA synthesis and experimental systems enable us to obtain large amounts of data. Here, we evaluated six types of structural features using two large-scale datasets. We found that accessibility, which is the probability that a given region around the start codon has no base-paired nucleotides, showed the highest correlation with protein abundance in both datasets. Accessibility showed a significantly higher correlation (Spearman’s ρ = 0.709) than the widely used minimum free energy (0.554) in one of the datasets. Interestingly, accessibility showed the highest correlation only when it was calculated by a log-linear model, indicating that the RNA structural model and how to utilize it are important. Furthermore, by combining the accessibility and activity of the Shine-Dalgarno sequence, we devised a method for predicting protein abundance more accurately than existing methods. We inferred that the log-linear model has a broader probabilistic distribution than the widely used Turner energy model, which contributed to more accurate quantification of ribosome accessibility to translation initiation sites.
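The comparison above rests on Spearman's rank correlation between a structural feature and measured protein abundance. A minimal sketch of that statistic follows; the function names and the toy numbers are assumptions for illustration and are not data from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical values: a feature score per construct and its abundance.
    feature = [0.10, 0.35, 0.20, 0.80, 0.55]
    abundance = [12.0, 40.0, 30.0, 95.0, 50.0]
    print(spearman_rho(feature, abundance))
```

Because only ranks enter the calculation, the statistic captures any monotonic relationship, which suits features such as accessibility and free energy that live on different scales.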
Abstract
Motivation
By detecting homology among RNAs, the probabilistic consideration of RNA structural alignments has improved the prediction accuracy of significant RNA prediction problems. Predicting an RNA consensus secondary structure from an RNA sequence alignment is a fundamental research objective because in the detection of conserved base-pairings among RNA homologs, predicting an RNA consensus secondary structure is more convenient than predicting an RNA structural alignment.
Results
We developed and implemented ConsAlifold, a dynamic programming-based method that predicts the consensus secondary structure of an RNA sequence alignment. ConsAlifold considers RNA structural alignments. ConsAlifold achieves moderate running time and the best prediction accuracy of RNA consensus secondary structures among available prediction methods.
Availability and implementation
ConsAlifold, data and Python scripts for generating both figures and tables are freely available at https://github.com/heartsh/consalifold.
Supplementary information
Supplementary data are available at Bioinformatics online.
Recent studies have revealed that large numbers of non-coding RNAs are transcribed in humans, but only a few of them have been identified with their functions. Identification of the interaction target RNAs of the non-coding RNAs is an important step in predicting their functions. The current experimental methods to identify RNA-RNA interactions, however, are not fast enough to apply to a whole human transcriptome. Therefore, computational predictions of RNA-RNA interactions are desirable, but this is a challenging task due to the huge computational costs involved.
Here, we report comprehensive predictions of the interaction targets of lncRNAs in a whole human transcriptome for the first time. To achieve this, we developed an integrated pipeline for predicting RNA-RNA interactions on the K computer, which is one of the fastest super-computers in the world. Comparisons with experimentally-validated lncRNA-RNA interactions support the quality of the predictions. Additionally, we have developed a database that catalogs the predicted lncRNA-RNA interactions to provide fundamental information about the targets of lncRNAs.
Motivation: Recent studies have shown that the methods for predicting secondary structures of RNAs on the basis of posterior decoding of the base-pairing probabilities have an advantage with respect to prediction accuracy over the conventionally utilized minimum free energy methods. However, there is room for improvement in the objective functions presented in previous studies, which are maximized in the posterior decoding with respect to the accuracy measures for secondary structures. Results: We propose novel estimators which improve the accuracy of secondary structure prediction of RNAs. The proposed estimators maximize an objective function which is the weighted sum of the expected number of the true positives and that of the true negatives of the base pairs. The proposed estimators are also improved versions of the ones used in previous works, namely CONTRAfold for secondary structure prediction from a single RNA sequence and McCaskill-MEA for common secondary structure prediction from multiple alignments of RNA sequences. We clarify the relations between the proposed estimators and the estimators presented in previous works, and theoretically show that the previous estimators include additional unnecessary terms in the evaluation measures with respect to the accuracy. Furthermore, computational experiments confirm the theoretical analysis by indicating improvement in the empirical accuracy. The proposed estimators represent extensions of the centroid estimators proposed in Ding et al. and Carvalho and Lawrence, and are applicable to a wide variety of problems in bioinformatics. Availability: Supporting information and the CentroidFold software are available online at: http://www.ncrna.org/software/centroidfold/. Contact: hamada-michiaki@aist.go.jp Supplementary information: Supplementary data are available at Bioinformatics online.
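The objective described above, a weighted sum of expected true positives and expected true negatives of base pairs, can be illustrated with a small sketch. Under the assumption that the weight is a parameter gamma on the true positives, maximizing the objective reduces to choosing a nested set of pairs that maximizes the sum of (gamma + 1) * p_ij - 1, where p_ij is the base-pairing probability. The code below solves this with a Nussinov-style dynamic program. It is not the CentroidFold implementation: the function name, the dictionary input format, and the toy probabilities are assumptions for illustration.

```python
def gamma_centroid_pairs(bpp, n, gamma=1.0):
    """Pick nested base pairs maximizing sum((gamma + 1) * p_ij - 1).

    bpp   : dict mapping (i, j) with i < j (0-based) to a pairing probability
    n     : sequence length
    gamma : weight on expected true positives (assumed parameterization)
    """
    def gain(i, j):
        return (gamma + 1.0) * bpp.get((i, j), 0.0) - 1.0

    # best[i][j] = best total gain on positions i..j; empty ranges score 0.
    best = [[0.0] * n for _ in range(n)]

    def get(i, j):
        return best[i][j] if i < j else 0.0

    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            value = max(get(i + 1, j), get(i, j - 1))  # leave i or j unpaired
            if gain(i, j) > 0:  # pairing i with j only helps if the gain is positive
                value = max(value, get(i + 1, j - 1) + gain(i, j))
            for k in range(i + 1, j):  # split into two independent parts
                value = max(value, get(i, k) + get(k + 1, j))
            best[i][j] = value

    # Traceback: recover one optimal set of pairs.
    pairs, stack = [], [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if best[i][j] == get(i + 1, j):
            stack.append((i + 1, j))
        elif best[i][j] == get(i, j - 1):
            stack.append((i, j - 1))
        elif gain(i, j) > 0 and best[i][j] == get(i + 1, j - 1) + gain(i, j):
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if best[i][j] == get(i, k) + get(k + 1, j):
                    stack.append((i, k))
                    stack.append((k + 1, j))
                    break
    return sorted(pairs)


if __name__ == "__main__":
    toy_bpp = {(0, 5): 0.9, (1, 4): 0.8, (2, 3): 0.2}  # hypothetical probabilities
    print(gamma_centroid_pairs(toy_bpp, 6, gamma=1.0))
    print(gamma_centroid_pairs(toy_bpp, 6, gamma=9.0))
```

In this formulation a pair can only be selected when its probability exceeds 1 / (gamma + 1), so a larger gamma admits lower-probability pairs and trades specificity for sensitivity. For simplicity the sketch enforces no minimum hairpin loop length.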
Analysis of secondary structures is essential for understanding the functions of RNAs. Because RNA molecules thermally fluctuate, it is necessary to analyze the probability distributions of their secondary structures. Existing methods, however, are not applicable to long RNAs owing to their high computational complexity. Additionally, previous research has suffered from two numerical difficulties: overflow and significant numerical errors.
In this research, we reduced the computational complexity of calculating the landscape of the probability distribution of secondary structures by introducing a maximum-span constraint. In addition, we resolved numerical computation problems through two techniques: extended logsumexp and accuracy-guaranteed numerical computation. We analyzed the stability of the secondary structures of 16S ribosomal RNAs at various temperatures without overflow. The results obtained are consistent with previous research on thermophilic bacteria, suggesting that our method is applicable in thermal stability analysis. Furthermore, we quantitatively assessed numerical stability using our method.
These results demonstrate that the proposed method is applicable to long RNAs.
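To show why overflow arises when summing Boltzmann-like weights and how working in log space avoids it, here is a minimal sketch of the standard logsumexp trick. It is only the textbook version, not the extended logsumexp described above, and the function name and example values are assumptions for illustration.

```python
import math


def logsumexp(log_values):
    """Return log(sum(exp(v) for v in log_values)) without overflowing.

    Subtracting the maximum before exponentiating keeps every exponent
    at or below zero, so math.exp never overflows.
    """
    finite = [v for v in log_values if v != float("-inf")]
    if not finite:
        return float("-inf")  # log of an empty (or all-zero) sum
    largest = max(finite)
    return largest + math.log(sum(math.exp(v - largest) for v in finite))


if __name__ == "__main__":
    # math.exp(1000.0) alone raises OverflowError, but the log-space sum is fine.
    print(logsumexp([1000.0, 1000.0]))
```

Partition-function recursions for long RNAs accumulate sums of this kind at every step, which is why a log-space formulation is a common starting point for avoiding overflow.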
Ustiloxins A and B are toxic cyclic tetrapeptides, Tyr-Val/Ala-Ile-Gly (Y-V/A-I-G), that were originally identified from Ustilaginoidea virens, a pathogenic fungus affecting rice plants. Contrary to our report that ustiloxin B is ribosomally synthesized in Aspergillus flavus, a recent report suggested that ustiloxins are synthesized by a non-ribosomal peptide synthetase in U.virens. Thus, we analyzed the U.virens genome, to identify the responsible gene cluster.
The biosynthetic gene cluster was identified from the genome of U.virens based on homologies to the ribosomal peptide biosynthetic gene cluster for ustiloxin B identified from A.flavus. It contains a gene encoding precursor protein having five Tyr-Val-Ile-Gly and three Tyr-Ala-Ile-Gly motifs for ustiloxins A and B, respectively, strongly indicating that ustiloxins A and B from U.virens are ribosomally synthesized.
Accession codes of the U.virens and A.flavus gene clusters in NCBI are BR001221 and BR001206, respectively. Supplementary data are available at Bioinformatics online.
PIWI-interacting RNAs (piRNAs) silence retrotransposons in Drosophila germ lines by associating with the PIWI proteins Argonaute 3 (AGO3), Aubergine (Aub) and Piwi. piRNAs in Drosophila are produced from intergenic repetitive genes and piRNA clusters by two systems: the primary processing pathway and the amplification loop. The amplification loop occurs in a Dicer-independent, PIWI-Slicer-dependent manner. However, primary piRNA processing remains elusive. Here we analysed piRNA processing in a Drosophila ovarian somatic cell line where Piwi, but not Aub or AGO3, is expressed; thus, only the primary piRNAs exist. In addition to flamenco, a Piwi-specific piRNA cluster, traffic jam (tj), a large Maf gene, was determined as a new piRNA cluster. piRNAs arising from tj correspond to the untranslated regions of tj messenger RNA and are sense-oriented. piRNA loading on to Piwi may occur in the cytoplasm. zucchini, a gene encoding a putative cytoplasmic nuclease, is required for tj-derived piRNA production. In tj and piwi mutant ovaries, somatic cells fail to intermingle with germ cells and Fasciclin III is overexpressed. Loss of tj abolishes Piwi expression in gonadal somatic cells. Thus, in gonadal somatic cells, tj gives rise simultaneously to two different molecules: the TJ protein, which activates Piwi expression, and piRNAs, which define the Piwi targets for silencing.
Abstract
Long-read sequencers, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencers, have improved their read length and accuracy, thereby opening up unprecedented research. Many tools and algorithms have been developed to analyze long reads, and rapid progress in PacBio and ONT has further accelerated their development. Together with the development of high-throughput sequencing technologies and their analysis tools, many read simulators have been developed and effectively utilized. PBSIM is one of the popular long-read simulators. In this study, we developed PBSIM3 with three new functions: error models for long reads, multi-pass sequencing for high-fidelity read simulation and transcriptome sequencing simulation. Therefore, PBSIM3 is now able to meet a wide range of long-read simulation requirements.
The mammalian cell nucleus contains dozens of membrane-less nuclear bodies that play significant roles in various aspects of gene expression. Several nuclear bodies are nucleated by specific architectural noncoding RNAs (arcRNAs) acting as structural scaffolds. We have reported that a minor population of cellular RNAs exhibits an unusual semi-extractable feature upon using the conventional procedure of RNA preparation and that needle shearing or heating of cell lysates remarkably improves extraction of dozens of RNAs. Because semi-extractable RNAs, including known arcRNAs, commonly localize in nuclear bodies, this feature may be a hallmark of arcRNAs. Using the semi-extractability of RNA, we performed genome-wide screening of semi-extractable long noncoding RNAs to identify new candidate arcRNAs under hyperosmotic and heat stress conditions. After screening stress-inducible and semi-extractable RNAs, hundreds of readthrough downstream-of-gene (DoG) transcripts over several hundreds of kilobases, many of which were not detected among RNAs prepared by the conventional extraction procedure, were found to be stress-inducible and semi-extractable. We further characterized some of the abundant DoGs and found that stress-inducible transient extension of the 3'-UTR made DoGs semi-extractable. Furthermore, they were localized in distinct nuclear foci that were sensitive to 1,6-hexanediol. These data suggest that semi-extractable DoGs exhibit arcRNA-like features and our semi-extractable RNA-seq is a powerful tool to extensively monitor DoGs that are induced under specific physiological conditions.