PacBio Sequencing and Its Applications Rhoads, Anthony; Au, Kin Fai
Genomics, proteomics & bioinformatics,
10/2015, Volume:
13, Issue:
5
Journal Article
Peer reviewed
Open access
Single-molecule, real-time sequencing developed by Pacific BioSciences offers longer read lengths than the second-generation sequencing (SGS) technologies, making it well-suited for unsolved problems ...in genome, transcriptome, and epigenetics research. The highly-contiguous de novo assemblies using PacBio sequencing can close gaps in current reference assemblies and characterize structural variation (SV) in personal genomes. With longer reads, we can sequence through extended repetitive regions and detect mutations, many of which are associated with dis- eases. Moreover, PacBio transcriptome sequencing is advantageous for the identification of gene isoforms and facilitates reliable discoveries of novel genes and novel isoforms of annotated genes, due to its ability to sequence full-length transcripts or fragments with significant lengths. Addition- ally, PacBio's sequencing technique provides information that is useful for the direct detection of base modifications, such as methylation. In addition to using PacBio sequencing alone, many hybrid sequencing strategies have been developed to make use of more accurate short reads in conjunction with PacBio long reads. In general, hybrid sequencing strategies are more affordable and scalable especially for small-size laboratories than using PacBio Sequencing alone. The advent of PacBio sequencing has made available much information that could not be obtained via SGS alone.
...the accuracy, accessibility, and cost efficiency of long-read sequencing are improving dramatically, which boosts the long read-based research in many topics. ...we are not only in the midst of a ...new revolution in sequencing technology but also the next revolution in biomedical research. The new software phasebook adapts a divide-and-conquer strategy to improve the coverage of haplotype-resolved de novo genome assembly 3. Since sample preparation, such as high-molecular-weight DNA extraction, could influence the data quality and thus assembly significantly, the end-to-end plant genome assembly workflow LeafGO optimizes the steps from sample preparation to computational analysis 4. The application of LR-Split-seq to the C2C12 myogenic system found the distinct patterns of alternative transcription start sites and/or alternative internal exon usage in different cell clusters. Besides cDNA sequencing, nanopore sequencing can measure native RNA molecules directly, so Schulz et al. were able to revisit the reliability of exitron identification by comparing the data of direct RNA sequencing and cDNA sequencing 11. ...Chen et al. developed a computational platform Inspector to evaluate genome assembly 15. Because of the large variance of long-read sequencing data quality, such as read length and error profile, the performance of assemblers was examined in different data scenarios (e.g., PacBio CLR and HiFi reads and nanopore data).
Third-generation sequencing technologies have advanced the progress of the biological research by generating reads that are substantially longer than second-generation sequencing technologies. ...However, their notorious high error rate impedes straightforward data analysis and limits their application. A handful of error correction methods for these error-prone long reads have been developed to date. The output data quality is very important for downstream analysis, whereas computing resources could limit the utility of some computing-intense tools. There is a lack of standardized assessments for these long-read error-correction methods.
Here, we present a comparative performance assessment of ten state-of-the-art error-correction methods for long reads. We established a common set of benchmarks for performance assessment, including sensitivity, accuracy, output rate, alignment rate, output read length, run time, and memory usage, as well as the effects of error correction on two downstream applications of long reads: de novo assembly and resolving haplotype sequences.
Taking into account all of these metrics, we provide a suggestive guideline for method choice based on available data size, computing resources, and individual research goals.
Retroviral overexpression of reprogramming factors (Oct4, Sox2, Klf4, c-Myc) generates induced pluripotent stem cells (iPSCs). However, the integration of foreign DNA could induce genomic ...dysregulation. Cell-permeant proteins (CPPs) could overcome this limitation. To date, this approach has proved exceedingly inefficient. We discovered a striking difference in the pattern of gene expression induced by viral versus CPP-based delivery of the reprogramming factors, suggesting that a signaling pathway required for efficient nuclear reprogramming was activated by the retroviral, but not CPP approach. In gain- and loss-of-function studies, we find that the toll-like receptor 3 (TLR3) pathway enables efficient induction of pluripotency by viral or mmRNA approaches. Stimulation of TLR3 causes rapid and global changes in the expression of epigenetic modifiers to enhance chromatin remodeling and nuclear reprogramming. Activation of inflammatory pathways are required for efficient nuclear reprogramming in the induction of pluripotency.
Display omitted
▸ TLR3 knockdown reduces the efficiency and yield of human iPSC generation ▸ TLR3 activation enhances human iPSC generation by cell permeant peptides ▸ TLR3 activation enables epigenetic changes to promote an open chromatin state ▸ Innate immune activation enhances nuclear reprogramming and cell plasticity
Efficient reprogramming of somatic cells to pluripotency depends on the activation of innate immune pathways, which cause changes in the expression of epigenetic modifiers and promote chromatin remodeling.
Sperm contributes diverse RNAs to the zygote. While sperm small RNAs have been shown to impact offspring phenotypes, our knowledge of the sperm transcriptome, especially the composition of long RNAs, ...has been limited by the lack of sensitive, high-throughput experimental techniques that can distinguish intact RNAs from fragmented RNAs, known to abound in sperm. Here, we integrate single-molecule long-read sequencing with short-read sequencing to detect sperm intact RNAs (spiRNAs). We identify 3440 spiRNA species in mice and 4100 in humans. The spiRNA profile consists of both mRNAs and long non-coding RNAs, is evolutionarily conserved between mice and humans, and displays an enrichment in mRNAs encoding for ribosome. In sum, we characterize the landscape of intact long RNAs in sperm, paving the way for future studies on their biogenesis and functions. Our experimental and bioinformatics approaches can be applied to other tissues and organisms to detect intact transcripts.
The error-prone third-generation sequencing (TGS) long reads can be corrected by the high-quality second-generation sequencing (SGS) short reads, which is referred to as hybrid error correction. We ...here investigate the influences of the principal algorithmic factors of two major types of hybrid error correction methods by mathematical modeling and analysis on both simulated and real data. Our study reveals the distribution of accuracy gain with respect to the original long read error rate. We also demonstrate that the original error rate of 19% is the limit for perfect correction, beyond which long reads are too error-prone to be corrected by these methods.
The recent development of third generation sequencing (TGS) generates much longer reads than second generation sequencing (SGS) and thus provides a chance to solve problems that are difficult to ...study through SGS alone. However, higher raw read error rates are an intrinsic drawback in most TGS technologies. Here we present a computational method, LSC, to perform error correction of TGS long reads (LR) by SGS short reads (SR). Aiming to reduce the error rate in homopolymer runs in the main TGS platform, the PacBio® RS, LSC applies a homopolymer compression (HC) transformation strategy to increase the sensitivity of SR-LR alignment without scarifying alignment accuracy. We applied LSC to 100,000 PacBio long reads from human brain cerebellum RNA-seq data and 64 million single-end 75 bp reads from human brain RNA-seq data. The results show LSC can correct PacBio long reads to reduce the error rate by more than 3 folds. The improved accuracy greatly benefits many downstream analyses, such as directional gene isoform detection in RNA-seq study. Compared with another hybrid correction tool, LSC can achieve over double the sensitivity and similar specificity.
Summary
Danshen, Salvia miltiorrhiza Bunge, is one of the most widely used herbs in traditional Chinese medicine, wherein its rhizome/roots are particularly valued. The corresponding bioactive ...components include the tanshinone diterpenoids, the biosynthesis of which is a subject of considerable interest. Previous investigations of the S. miltiorrhiza transcriptome have relied on short‐read next‐generation sequencing (NGS) technology, and the vast majority of the resulting isotigs do not represent full‐length cDNA sequences. Moreover, these efforts have been targeted at either whole plants or hairy root cultures. Here, we demonstrate that the tanshinone pigments are produced and accumulate in the root periderm, and apply a combination of NGS and single‐molecule real‐time (SMRT) sequencing to various root tissues, particularly including the periderm, to provide a more complete view of the S. miltiorrhiza transcriptome, with further insight into tanshinone biosynthesis as well. In addition, the use of SMRT long‐read sequencing offered the ability to examine alternative splicing, which was found to occur in approximately 40% of the detected gene loci, including several involved in isoprenoid/terpenoid metabolism.
Significance Statement
Here, we combined NGS and SMRT sequencing to generate a more complete/full‐length S. miltiorrhiza transcriptome. Our study provides a general model for studying the full‐length transcriptome, alternative splicing, and the biosynthesis of active constituents in medicinal plants, as well as important information about the important Danshen constituent of traditional Chinese medicine more specifically.
Genetic efficiency in higher organisms depends on mechanisms to create multiple functions from single genes. To investigate this question for an enzyme family, we chose aminoacyl tRNA synthetases ...(AARSs). They are exceptional in their progressive and accretive proliferation of noncatalytic domains as the Tree of Life is ascended. Here we report discovery of a large number of natural catalytic nulls (CNs) for each human AARS. Splicing events retain noncatalytic domains while ablating the catalytic domain to create CNs with diverse functions. Each synthetase is converted into several new signaling proteins with biological activities "orthogonal" to that of the catalytic parent. We suggest that splice variants with nonenzymatic functions may be more general, as evidenced by recent findings of other catalytically inactive splice-variant enzymes.
RNA-sequencing (RNA-seq) is an essential technique for transcriptome studies, hundreds of analysis tools have been developed since it was debuted. Although recent efforts have attempted to assess the ...latest available tools, they have not evaluated the analysis workflows comprehensively to unleash the power within RNA-seq. Here we conduct an extensive study analysing a broad spectrum of RNA-seq workflows. Surpassing the expression analysis scope, our work also includes assessment of RNA variant-calling, RNA editing and RNA fusion detection techniques. Specifically, we examine both short- and long-read RNA-seq technologies, 39 analysis tools resulting in ~120 combinations, and ~490 analyses involving 15 samples with a variety of germline, cancer and stem cell data sets. We report the performance and propose a comprehensive RNA-seq analysis protocol, named RNACocktail, along with a computational pipeline achieving high accuracy. Validation on different samples reveals that our proposed protocol could help researchers extract more biologically relevant predictions by broad analysis of the transcriptome.RNA-seq is widely used for transcriptome analysis. Here, the authors analyse a wide spectrum of RNA-seq workflows and present a comprehensive analysis protocol named RNACocktail as well as a computational pipeline leveraging the widely used tools for accurate RNA-seq analysis.