Abstract
Background
Calling germline SNP variants from bisulfite-converted sequencing data poses a challenge for conventional software, which have no inherent capability to dissociate true ...polymorphisms from artificial mutations induced by the chemical treatment. Nevertheless, SNP data is desirable both for genotyping and to understand the DNA methylome in the context of the genetic background. The confounding effect of bisulfite conversion however can be conceptually resolved by observing differences in allele counts on a per-strand basis, whereby artificial mutations are reflected by non-complementary base pairs.
Results
Herein, we present a computational pre-processing approach for adapting sequence alignment data, thus indirectly enabling downstream analysis on a per-strand basis using conventional variant calling software such as GATK or Freebayes. In comparison to specialised tools, the method represents a marked improvement in precision-sensitivity based on high-quality, published benchmark datasets for both human and model plant variants.
Conclusion
The presented “double-masking” procedure represents an open source, easy-to-use method to facilitate accurate variant calling using conventional software, thus negating any dependency on specialised tools and mitigating the need to generate additional, conventional sequencing libraries alongside bisulfite sequencing experiments. The method is available at
https://github.com/bio15anu/revelio
and an implementation with Freebayes is available at
https://github.com/EpiDiverse/SNP
Microarrays are a powerful tool for transcriptome analysis. Best results are obtained using high-quality RNA samples for preparation and hybridization. Issues with RNA integrity can lead to low data ...quality and failure of the microarray experiment.
Microarray intensity data contains information to estimate the RNA quality of the sample. We here study the interplay of the characteristics of RNA surface hybridization with the effects of partly truncated transcripts on probe intensity. The 3'/5' intensity gradient, the basis of microarray RNA quality measures, is shown to depend on the degree of competitive binding of specific and of non-specific targets to a particular probe, on the degree of saturation of the probes with bound transcripts and on the distance of the probe from the 3'-end of the transcript. Increasing degrees of non-specific hybridization or of saturation reduce the 3'/5' intensity gradient and if not taken into account, this leads to biased results in common quality measures for GeneChip arrays such as affyslope or the control probe intensity ratio. We also found that short probe sets near the 3'-end of the transcripts are prone to non-specific hybridization presumable because of inaccurate positional assignment and the existence of transcript isoforms with variable 3' UTRs. Poor RNA quality is associated with a decreased amount of RNA material hybridized on the array paralleled by a decreased total signal level. Additionally, it causes a gene-specific loss of signal due to the positional bias of transcript abundance which requires an individual, gene-specific correction. We propose a new RNA quality measure that considers the hybridization mode. Graphical characteristics are introduced allowing assessment of RNA quality of each single array ('tongs plot' and 'degradation hook'). Furthermore, we suggest a method to correct for effects of RNA degradation on microarray intensities.
The presented RNA degradation measure has best correlation with the independent RNA integrity measure RIN, and therefore presents itself as a valuable tool for quality control and even for the study of RNA degradation. When RNA degradation effects are detected in microarray experiments, a correction of the induced bias in probe intensities is advised.
The brightness of the probe spots on expression microarrays intends to measure the abundance of specific mRNA targets. Probes with runs of at least three guanines (G) in their sequence show abnormal ...high intensities which reflect rather probe effects than target concentrations. This G-bias requires correction prior to downstream expression analysis.
Longer runs of three or more consecutive G along the probe sequence and in particular triple degenerated G at its solution end ((GGG)1-effect) are associated with exceptionally large probe intensities on GeneChip expression arrays. This intensity bias is related to non-specific hybridization and affects both perfect match and mismatch probes. The (GGG)1-effect tends to increase gradually for microarrays of later GeneChip generations. It was found for DNA/RNA as well as for DNA/DNA probe/target-hybridization chemistries. Amplification of sample RNA using T7-primers is associated with strong positive amplitudes of the G-bias whereas alternative amplification protocols using random primers give rise to much smaller and partly even negative amplitudes. We applied positional dependent sensitivity models to analyze the specifics of probe intensities in the context of all possible short sequence motifs of one to four adjacent nucleotides along the 25meric probe sequence. Most of the longer motifs are adequately described using a nearest-neighbor (NN) model. In contrast, runs of degenerated guanines require explicit consideration of next nearest neighbors (GGG terms). Preprocessing methods such as vsn, RMA, dChip, MAS5 and gcRMA only insufficiently remove the G-bias from data.
Positional and motif dependent sensitivity models accounts for sequence effects of oligonucleotide probe intensities. We propose a positional dependent NN+GGG hybrid model to correct the intensity bias associated with probes containing poly-G motifs. It is implemented as a single-chip based calibration algorithm for GeneChips which can be applied in a pre-correction step prior to standard preprocessing.
Here we present the results of a large-scale bioinformatics annotation of non-coding RNA loci in 48 avian genomes. Our approach uses probabilistic models of hand-curated families from the Rfam ...database to infer conserved RNA families within each avian genome. We supplement these annotations with predictions from the tRNA annotation tool, tRNAscan-SE and microRNAs from miRBase. We identify 34 lncRNA-associated loci that are conserved between birds and mammals and validate 12 of these in chicken. We report several intriguing cases where a reported mammalian lncRNA, but not its function, is conserved. We also demonstrate extensive conservation of classical ncRNAs (e.g., tRNAs) and more recently discovered ncRNAs (e.g., snoRNAs and miRNAs) in birds. Furthermore, we describe numerous "losses" of several RNA families, and attribute these to either genuine loss, divergence or missing data. In particular, we show that many of these losses are due to the challenges associated with assembling avian microchromosomes. These combined results illustrate the utility of applying homology-based methods for annotating novel vertebrate genomes.
Single nucleotide polymorphism (SNP) arrays are important tools widely used for genotyping and copy number estimation. This technology utilizes the specific affinity of fragmented DNA for binding to ...surface-attached oligonucleotide DNA probes. We analyze the variability of the probe signals of Affymetrix GeneChip SNP arrays as a function of the probe sequence to identify relevant sequence motifs which potentially cause systematic biases of genotyping and copy number estimates.
The probe design of GeneChip SNP arrays enables us to disentangle different sources of intensity modulations such as the number of mismatches per duplex, matched and mismatched base pairings including nearest and next-nearest neighbors and their position along the probe sequence. The effect of probe sequence was estimated in terms of triple-motifs with central matches and mismatches which include all 256 combinations of possible base pairings. The probe/target interactions on the chip can be decomposed into nearest neighbor contributions which correlate well with free energy terms of DNA/DNA-interactions in solution. The effect of mismatches is about twice as large as that of canonical pairings. Runs of guanines (G) and the particular type of mismatched pairings formed in cross-allelic probe/target duplexes constitute sources of systematic biases of the probe signals with consequences for genotyping and copy number estimates. The poly-G effect seems to be related to the crowded arrangement of probes which facilitates complex formation of neighboring probes with at minimum three adjacent G's in their sequence.
The applied method of "triple-averaging" represents a model-free approach to estimate the mean intensity contributions of different sequence motifs which can be applied in calibration algorithms to correct signal values for sequence effects. Rules for appropriate sequence corrections are suggested.
Gene expression experiments aim to accurately quantify thousands of transcripts in parallel. Factors posterior to RNA extraction can, however, impair their accurate representation. RNA degradation ...and differences in the efficiency of amplification affect raw intensity measurements using Affymetrix expression arrays. The positional intensity decay of specifically hybridized probes along the transcript they intend to interrogate is used to estimate the RNA quality in a sample and to correct probe intensities for the degradation bias. This functionality, for which no previous software solution is available, is implemented in the R/Bioconductor package AffyRNADegradation presented here.
The package is available via Bioconductor at the URL http://bioconductor.org/packages/release/bioc/html/AffyRNA Degradation.html
Large-scale RNA sequencing has revealed a large number of long mRNA-like transcripts (lncRNAs) that do not code for proteins. The evolutionary history of these lncRNAs has been notoriously hard to ...study systematically due to their low level of sequence conservation that precludes comprehensive homology-based surveys and makes them nearly impossible to align. An increasing number of special cases, however, has been shown to be at least as old as the vertebrate lineage. Here we use the conservation of splice sites to trace the evolution of lncRNAs. We show that >85% of the human GENCODE lncRNAs were already present at the divergence of placental mammals and many hundreds of these RNAs date back even further. Nevertheless, we observe a fast turnover of intron/exon structures. We conclude that lncRNA genes are evolutionary ancient components of vertebrate genomes that show an unexpected and unprecedented evolutionary plasticity. We offer a public web service (http://splicemap.bioinf.uni-leipzig.de) that allows to retrieve sets of orthologous splice sites and to produce overview maps of evolutionarily conserved splice sites for visualization and further analysis. An electronic supplement containing the ncRNA data sets used in this study is available at http://www.bioinf.uni-leipzig.de/publications/supplements/12-001.
Research in evolutionary biology has been progressively influenced by big data such as massive genome and transcriptome sequencing data, scalar measurements of several phenotypes on tens to thousands ...of individuals, as well as from collecting worldwide environmental data at an increasingly detailed scale. The handling and analysis of such data require computational skills that usually exceed the abilities of most traditionally trained evolutionary biologists. Here we discuss the advantages, challenges and considerations for organizing and running bioinformatics training courses of 2–3 weeks in length to introduce evolutionary biologists to the computational analysis of big data. Extended courses have the advantage of offering trainees the opportunity to learn a more comprehensive set of complementary topics and skills and allowing for more time to practice newly acquired competences. Many organizational aspects are common to any course, as the need to define precise learning objectives and the selection of appropriate and highly motivated instructors and trainees, among others. However, other features assume particular importance in extended bioinformatics training courses. To successfully implement a learning-by-doing philosophy, sufficient and enthusiastic teaching assistants (TAs) are necessary to offer prompt help to trainees. Further, a good balance between theoretical background and practice time needs to be provided and assured that the schedule includes enough flexibility for extra review sessions or further discussions if desired. A final project enables trainees to apply their newly learned skills to real data or case studies of their interest. To promote a friendly atmosphere throughout the course and to build a close-knit community after the course, allow time for some scientific discussions and social activities. In addition, to not exhaust trainees and TAs, some leisure time needs to be organized. Finally, all organization should be done while keeping the budget within fair limits. In order to create a sustainable course that constantly improves and adapts to the trainees’ needs, gathering short- and long-term feedback after the end of the course is important. Based on our experience we have collected a set of recommendations to effectively organize and run extended bioinformatics training courses for evolutionary biologists, which we here want to share with the community. They offer a complementary way for the practical teaching of modern evolutionary biology and reaching out to the biological community.
To connect human biology to fish biomedical models, we sequenced the genome of spotted gar (Lepisosteus oculatus), whose lineage diverged from teleosts before teleost genome duplication (TGD). The ...slowly evolving gar genome has conserved in content and size many entire chromosomes from bony vertebrate ancestors. Gar bridges teleosts to tetrapods by illuminating the evolution of immunity, mineralization and development (mediated, for example, by Hox, ParaHox and microRNA genes). Numerous conserved noncoding elements (CNEs; often cis regulatory) undetectable in direct human-teleost comparisons become apparent using gar: functional studies uncovered conserved roles for such cryptic CNEs, facilitating annotation of sequences identified in human genome-wide association studies. Transcriptomic analyses showed that the sums of expression domains and expression levels for duplicated teleost genes often approximate the patterns and levels of expression for gar genes, consistent with subfunctionalization. The gar genome provides a resource for understanding evolution after genome duplication, the origin of vertebrate genomes and the function of human regulatory sequences.
Small non-coding RNAs (ncRNAs) such as microRNAs, snoRNAs and tRNAs are a diverse collection of molecules with several important biological functions. Current methods for high-throughput sequencing ...for the first time offer the opportunity to investigate the entire ncRNAome in an essentially unbiased way. However, there is a substantial need for methods that allow a convenient analysis of these overwhelmingly large data sets. Here, we present DARIO, a free web service that allows to study short read data from small RNA-seq experiments. It provides a wide range of analysis features, including quality control, read normalization, ncRNA quantification and prediction of putative ncRNA candidates. The DARIO web site can be accessed at http://dario.bioinf.uni-leipzig.de/.