Arguably the most basic step in the analysis of next generation sequencing (NGS) data involves the extraction of mappable reads from the raw reads produced by sequencing instruments. The presence of barcodes, adaptors and artifacts subject to sequencing errors makes this step non-trivial.
Here I present TagDust2, a generic approach utilizing a library of hidden Markov models (HMMs) to accurately extract reads from a wide array of possible read architectures. TagDust2 extracts more reads of higher quality than other approaches. Multiplexed single-end and paired-end libraries, as well as libraries containing unique molecular identifiers, are fully supported. Two additional post-processing steps are included to exclude known contaminants and filter out low-complexity sequences. Finally, TagDust2 can automatically detect the library type of sequenced data from a predefined selection.
Taken together, TagDust2 is a feature-rich, flexible and adaptive solution to go from raw to mappable NGS reads in a single step. The ability to recognize and record the contents of raw reads will help to automate and demystify the initial, and often poorly documented, steps in NGS data analysis pipelines. TagDust2 is freely available at: http://tagdust.sourceforge.net .
Single-cell RNA sequencing has been widely adopted to estimate the cellular composition of heterogeneous tissues and obtain transcriptional profiles of individual cells. Multiple approaches for optimal sample dissociation and storage of single cells have been proposed as have single-nuclei profiling methods. What has been lacking is a systematic comparison of their relative biases and benefits.
Here, we compare gene expression and cellular composition of single-cell suspensions prepared from adult mouse kidney using two tissue dissociation protocols. For each sample, we also compare fresh cells to cryopreserved and methanol-fixed cells. Lastly, we compare this single-cell data to that generated using three single-nucleus RNA sequencing workflows. Our data confirms prior reports that digestion on ice avoids the stress response observed with 37 °C dissociation. It also reveals cell types more abundant either in the cold or warm dissociations that may represent populations that require gentler or harsher conditions to be released intact. For cell storage, cryopreservation of dissociated cells results in a major loss of epithelial cell types; in contrast, methanol fixation maintains the cellular composition but suffers from ambient RNA leakage. Finally, cell type composition differences are observed between single-cell and single-nucleus RNA sequencing libraries. In particular, we note an underrepresentation of T, B, and NK lymphocytes in the single-nucleus libraries.
Systematic comparison of recovered cell types and their transcriptional profiles across the workflows has highlighted protocol-specific biases and thus enables researchers starting single-cell experiments to make an informed choice.
Abstract
Motivation
Kalign is an efficient multiple sequence alignment (MSA) program capable of aligning thousands of protein or nucleotide sequences. However, current alignment problems involving large numbers of sequences are exceeding Kalign’s original design specifications. Here we present a completely re-written and updated version to meet current and future alignment challenges.
Results
Kalign now uses a SIMD (single instruction, multiple data) accelerated version of the bit-parallel Gene Myers algorithm to estimate pairwise distances, and adopts a sequence embedding strategy and the bisecting K-means algorithm to rapidly construct guide trees for thousands of sequences. The new version maintains high alignment accuracy on both protein and nucleotide alignments and scales better than other MSA tools.
Availability and implementation
The source code of Kalign and code to reproduce the results are found here: https://github.com/timolassmann/kalign.
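The bit-parallel distance estimate mentioned in the Results above can be illustrated with a minimal scalar sketch. This is not Kalign's implementation (which is SIMD-accelerated C); it is a plain Python rendering of the Myers bit-vector recurrence in Hyyrö's formulation for global edit distance, and the function name `myers_edit_distance` is a hypothetical choice made for illustration.

```python
def myers_edit_distance(pattern: str, text: str) -> int:
    """Global (Levenshtein) edit distance via the Myers bit-vector recurrence.

    Illustrative sketch only: Python integers are unbounded, so one "word"
    holds the whole pattern. A SIMD implementation would instead use
    fixed-width machine registers.
    """
    m = len(pattern)
    if m == 0:
        return len(text)

    mask = (1 << m) - 1          # keeps vectors to m bits
    high = 1 << (m - 1)          # bit for the last pattern row

    # peq[c] has bit i set when pattern[i] == c
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    pv = mask                    # vertical +1 deltas (first column is 0..m)
    mv = 0                       # vertical -1 deltas
    score = m                    # distance in the last row, current column

    for ch in text:
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | ~(xh | pv)     # horizontal +1 deltas
        mh = pv & xh             # horizontal -1 deltas

        # Update the score from the horizontal delta in the last row.
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1

        # Shift in a 1 for the top boundary row (global distance).
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv

    return score
```

Each column of the dynamic programming matrix is processed with a constant number of bitwise operations, which is what makes the approach attractive for estimating many pairwise distances.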
The alignment of multiple protein sequences is a fundamental step in the analysis of biological data. It has traditionally been applied to analyzing protein families for conserved motifs, phylogeny, structural properties, and to improve sensitivity in homology searching. The availability of complete genome sequences has increased the demands on multiple sequence alignment (MSA) programs. Current MSA methods suffer from being either too inaccurate or too computationally expensive to be applied effectively in large-scale comparative genomics.
We developed Kalign, a method employing the Wu-Manber string-matching algorithm, to improve both the accuracy and speed of multiple sequence alignment. We compared the speed and accuracy of Kalign to other popular methods using Balibase, Prefab, and a new large test set. Kalign was as accurate as the best other methods on small alignments, but significantly more accurate when aligning large and distantly related sets of sequences. In our comparisons, Kalign was about 10 times faster than ClustalW and, depending on the alignment size, up to 50 times faster than popular iterative methods.
Kalign is a fast and robust alignment method. It is especially well suited for the increasingly important task of aligning large numbers of sequences.
Single-cell transcriptomic profiling is a powerful tool to explore cellular heterogeneity. However, most of these methods focus on the 3'-end of polyadenylated transcripts and provide only a partial view of the transcriptome. We introduce C1 CAGE, a method for the detection of transcript 5'-ends with an original sample multiplexing strategy in the C1 microfluidic system. We first quantify the performance of C1 CAGE and find it as accurate and sensitive as other methods in the C1 system. We then use it to profile promoter and enhancer activities in the cellular response to TGF-β of lung cancer cells and discover subpopulations of cells differing in their response. We also describe enhancer RNA dynamics revealing transcriptional bursts in subsets of cells with transcripts arising from either strand in a mutually exclusive manner, validated using single molecule fluorescence in situ hybridization.
Next generation sequencing is a standard tool used in clinical diagnostics. In Mendelian diseases the challenge is to discover the single etiological variant among thousands of benign or functionally unrelated variants. After calling variants from aligned sequencing reads, variant prioritisation tools are used to examine the conservation or potential functional consequences of variants. We hypothesised that the performance of variant prioritisation tools may vary by disease phenotype. To test this we created benchmark data sets for variants associated with different disease phenotypes. We found that performance of 24 tested tools is highly variable and differs by disease phenotype. The task of identifying a causative variant amongst a large number of benign variants is challenging for all tools, highlighting the need for further development in the field. Based on our observations, we recommend use of five top performers found in this study (FATHMM, M-CAP, MetaLR, MetaSVM and VEST3). In addition we provide tables indicating which analytical approach works best in which disease context. Variant prioritisation tools are best suited to investigate variants associated with well-studied genetic diseases, as these variants are more readily available during algorithm development than variants associated with rare diseases. We anticipate that further development into disease focussed tools will lead to significant improvements.
Motivation: Next-generation parallel sequencing technologies produce large quantities of short sequence reads. Due to experimental procedures, various types of artifacts are commonly sequenced alongside the targeted RNA or DNA sequences. Identification of such artifacts is important during the development of novel sequencing assays and for the downstream analysis of the sequenced libraries. Results: Here we present TagDust, a program identifying artifactual sequences in large sequencing runs. Given a user-defined cutoff for the false discovery rate, TagDust identifies all reads explainable by combinations and partial matches to known sequences used during library preparation. We demonstrate the quality of our method on sequencing runs performed on Illumina's Genome Analyzer platform. Availability: Executables and documentation are available from http://genome.gsc.riken.jp/osc/english/software/. Contact: timolassmann@gmail.com
Motivation: The sequence alignment map format (SAM) is a commonly used format to store the alignments between millions of short reads and a reference genome. Often certain positions within the reads are inherently more likely to contain errors due to the protocols used to prepare the samples. Such biases can have adverse effects on both mapping rate and accuracy. To understand the relationship between potential protocol biases and poor mapping we wrote SAMStat, a simple C program plotting nucleotide overrepresentation and other statistics in mapped and unmapped reads in a concise HTML page. Collecting such statistics also makes it easy to highlight problems in the data processing and enables non-experts to track data quality over time.
Results: We demonstrate that studying sequence features in mapped data can be used to identify biases particular to one sequencing protocol. Once identified, such biases can be considered in the downstream analysis or even be removed by read trimming or filtering techniques.
Availability: SAMStat is open source and freely available as a C program running on all Unix-compatible platforms. The source code is available from http://samstat.sourceforge.net.
Contact: timolassmann@gmail.com
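As a rough illustration of the kind of statistic described in the SAMStat abstract above, the following minimal sketch tabulates nucleotide frequencies at each read position. It is not SAMStat's code (SAMStat is a C program operating on SAM files); the function name `per_position_base_frequencies` and the plain list-of-strings input are assumptions made for illustration.

```python
from collections import Counter


def per_position_base_frequencies(reads):
    """Return one {base: fraction} dict per read position.

    Hypothetical helper: `reads` is any iterable of sequence strings.
    Reads may differ in length; each position is normalised by the
    number of reads that actually cover it.
    """
    counts = []  # counts[pos] is a Counter of bases seen at that position
    for read in reads:
        for pos, base in enumerate(read.upper()):
            if pos == len(counts):
                counts.append(Counter())
            counts[pos][base] += 1

    frequencies = []
    for counter in counts:
        total = sum(counter.values())
        frequencies.append({base: n / total for base, n in counter.items()})
    return frequencies
```

Computing this table separately for mapped and unmapped reads and comparing the two would be one simple way to spot positions where a particular base is overrepresented.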
Mast cells (MCs) mature exclusively in peripheral tissues, hampering research into their developmental and functional programs. Here, we employed deep cap analysis of gene expression on skin-derived MCs to generate the most comprehensive view of the human MC transcriptome ever reported. An advantage is that MCs were embedded in the FANTOM5 project, giving the opportunity to contrast their molecular signature against a multitude of human samples. We demonstrate that MCs possess a unique and surprising transcriptional landscape, combining hematopoietic genes with those exclusively active in MCs and genes not previously reported as expressed by MCs (several of them markers of unrelated tissues). We also found functional bone morphogenetic protein receptors transducing activatory signals in MCs. Conversely, several immune-related genes frequently studied in MCs were not expressed or were weakly expressed. Comparing MCs ex vivo with cultured counterparts revealed profound changes in the MC transcriptome in in vitro surroundings. We also determined the promoter usage of MC-expressed genes and identified associated motifs active in the lineage. Befitting their uniqueness, MCs had no close relative in the hematopoietic network (also only distantly related with basophils). This rich data set reveals that our knowledge of human MCs is still limited, but with this resource, novel functional programs of MCs may soon be discovered.
• Generated a reference transcriptome for ex vivo, cultured, and stimulated mast cells, contrasted against a broad collection of primary cells.
• Identified BMPs as function-modulating factors for mast cells.