Taxonomic classification of marker-gene sequences is an important step in microbiome analysis.
We present q2-feature-classifier (https://github.com/qiime2/q2-feature-classifier), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of the other commonly used marker-gene classification methods evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated "novel" marker-gene sequences, are available in our extensible benchmarking framework, tax-credit (https://github.com/caporaso-lab/tax-credit-data).
Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.
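The idea behind an alignment-based taxonomy consensus method can be sketched as follows. This is a minimal illustration, not the plugin's implementation: the function name `consensus_taxonomy`, the `min_consensus` threshold value, and the example lineages are all assumptions made for the sketch. Given the lineages of the top alignment hits for one query, it keeps each rank only while a sufficient fraction of hits agree on it.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage supported by >= min_consensus of all hits.

    hit_taxonomies: list of semicolon-delimited lineage strings (hypothetical format).
    """
    if not hit_taxonomies:
        return []
    lineages = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    consensus = []
    depth = 0
    while True:
        # Only hits that match the consensus so far and reach this depth may vote.
        candidates = [
            lin[depth]
            for lin in lineages
            if len(lin) > depth and lin[:depth] == consensus
        ]
        if not candidates:
            break
        label, count = Counter(candidates).most_common(1)[0]
        # Support is measured against all hits, so truncated lineages count as dissent.
        if count / len(lineages) < min_consensus:
            break
        consensus.append(label)
        depth += 1
    return consensus


# Hypothetical top hits for a single query sequence.
hits = [
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__acidophilus",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__crispatus",
    "k__Bacteria; p__Firmicutes; g__Lactobacillus; s__acidophilus",
    "k__Bacteria; p__Firmicutes; g__Streptococcus",
]
result = consensus_taxonomy(hits)
```

In this toy input only two of the four hits agree on a species, so the assignment stops at genus under the assumed threshold; lowering the threshold trades depth of classification against the risk of over-classification, which is one reason parameter tuning matters.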
Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which compares favourably with the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.
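The effect of replacing uniform class priors with abundance-informed ones can be illustrated with a toy multinomial naive Bayes classifier over k-mers. This is a sketch under stated assumptions, not the q2-clawback implementation: the reference sequences, the prior values, and the function names (`kmers`, `train`, `classify`) are hypothetical. The two references differ only outside the region covered by the query, so the k-mer likelihoods tie and the prior alone decides the assignment.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Return the overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(references, k=3):
    """Count k-mers per reference label; return the counts and vocabulary size."""
    vocab = {km for seq in references.values() for km in kmers(seq, k)}
    models = {}
    for label, seq in references.items():
        counts = Counter(kmers(seq, k))
        models[label] = (counts, sum(counts.values()))
    return models, len(vocab)


def classify(query, models, vocab_size, priors=None, k=3, alpha=1.0):
    """Score each label as log prior + sum of smoothed k-mer log-likelihoods."""
    labels = list(models)
    if priors is None:
        # The uniform assumption: every reference species is equally likely.
        priors = {label: 1.0 / len(labels) for label in labels}
    scores = {}
    for label in labels:
        counts, total = models[label]
        score = math.log(priors[label])
        for km in kmers(query, k):
            # Additive smoothing, with one extra slot for unseen k-mers.
            score += math.log((counts[km] + alpha) / (total + alpha * (vocab_size + 1)))
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical references that differ only in their final base.
references = {"species_a": "ACGTACGTGGA", "species_b": "ACGTACGTGGC"}
models, vocab_size = train(references)
query = "ACGTACGT"

label_uniform, scores_uniform = classify(query, models, vocab_size)
# Assumed environment-specific abundances, for illustration only.
label_weighted, scores_weighted = classify(
    query, models, vocab_size, priors={"species_a": 0.1, "species_b": 0.9}
)
```

With uniform priors the two scores are identical and the choice between species is arbitrary; with the weighted priors the classifier prefers the species that is more abundant in the assumed environment.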
Mutations contribute significantly to developing diversity in biological capabilities. Mutagenesis is an adaptive feature of normal development, e.g. generating diversity in immune cells...
...There is increasing interest in developing diagnostics that discriminate individual mutagenic mechanisms in a range of applications that include identifying population-specific mutagenesis and resolving distinct mutation signatures in cancer samples. Analyses for these applications assume that mutagenic mechanisms have a distinct relationship with neighboring bases that allows them to be distinguished. Direct support for this assumption is limited to a small number of simple cases, e.g., CpG hypermutability. We have evaluated whether the mechanistic origin of a point mutation can be resolved using only sequence context for a more complicated case. We contrasted single nucleotide variants originating from the multitude of mutagenic processes that normally operate in the mouse germline with those induced by the potent mutagen N-ethyl-N-nitrosourea (ENU). The considerable overlap in the mutation spectra of these two samples makes this a challenging problem. Employing a new, robust log-linear modeling method, we demonstrate that neighboring bases contain information regarding point mutation direction that differs between the ENU-induced and spontaneous mutation variant classes. A logistic regression classifier exhibited strong performance at discriminating between the different mutation classes. Concordance between the feature set of the best classifier and information content analyses suggests our results can be generalized to other mutation classification problems. We conclude that machine learning can be used to build a practical classification tool to identify the mutation mechanism for individual genetic variants. Software implementing our approach is freely available under an open-source license.
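One common way to present sequence context to a classifier such as logistic regression is to one-hot encode the bases flanking the mutated position. The sketch below shows only that encoding step; it is an illustrative assumption and not necessarily the feature set used in the work, and the names `one_hot_context` and `feature_names` are hypothetical.

```python
BASES = "ACGT"


def one_hot_context(context, centre=None):
    """Encode the bases flanking a mutated position as a flat 0/1 vector.

    context: sequence window with the mutated base at index `centre`
    (defaults to the middle of the window).
    """
    if centre is None:
        centre = len(context) // 2
    features = []
    for pos, base in enumerate(context):
        if pos == centre:
            continue  # the mutated base itself is not a neighbour
        # Four indicator values per flanking position; unknown bases give all zeros.
        features.extend(1 if base == b else 0 for b in BASES)
    return features


def feature_names(flank):
    """Human-readable names matching one_hot_context for a symmetric window."""
    return [
        f"pos{offset:+d}={b}"
        for offset in range(-flank, flank + 1)
        if offset != 0
        for b in BASES
    ]


# A 5-base window: two flanking bases either side of the mutated G.
vector = one_hot_context("ACGTA")
names = feature_names(2)
```

Each variant then becomes one row of a design matrix, and the fitted coefficients can be read back against `names` to see which flanking positions and bases carry the discriminating signal.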
Mutation processes differ between types of point mutation, genomic locations, cells, and biological species. For some point mutations, specific neighboring bases are known to be mechanistically influential. Beyond these cases, numerous questions remain unresolved, including: what are the sequence motifs that affect point mutations? How large are the motifs? Are they strand symmetric? And, do they vary between samples? We present new log-linear models that allow explicit examination of these questions, along with sequence logo style visualization to enable identifying specific motifs. We demonstrate the performance of these methods by analyzing mutation processes in human germline and malignant melanoma. We recapitulate the known CpG effect, and identify novel motifs, including a highly significant motif associated with A→G mutations. We show that major effects of neighbors on germline mutation lie within [Formula: see text] of the mutating base. Models are also presented for contrasting the entire mutation spectra (the distribution of the different point mutations). We show the spectra vary significantly between autosomes and X-chromosome, with a difference in T→C transition dominating. Analyses of malignant melanoma confirmed reported characteristic features of this cancer, including statistically significant strand asymmetry, and markedly different neighboring influences. The methods we present are made freely available as a Python library at https://bitbucket.org/pycogent3/mutationmotif.
Phylogenetic analyses of toxin gene families have revolutionised our understanding of the origin and evolution of reptile venoms, leading to the current hypothesis that venom evolved once in squamate reptiles. However, because of a lack of homologous squamate non-toxin sequences, these conclusions rely on the implicit assumption that recruitments of protein families into venom are both rare and irreversible. Here we use sequences of homologous non-toxin proteins from two snake species to test these assumptions. Phylogenetic and ancestral-state analyses revealed frequent nesting of 'physiological' proteins within venom toxin clades, suggesting early ancestral recruitment into venom followed by reverse recruitment of toxins back to physiological roles. These results provide evidence that protein recruitment into venoms from physiological functions is not a one-way process, but dynamic, with reversal of function and/or co-expression of toxins in different tissues. This requires a major reassessment of our previous understanding of how animal venoms evolve.
Cytosine methylation is one of several reversible epigenetic modifications of DNA that allow a greater flexibility in the relationship between genotype and phenotype. Methylation in the simplest models dampens gene expression by modifying regions of DNA critical for transcription factor binding. The capacity to methylate DNA is variable in the insects due to diverse histories of gene loss and duplication of DNA methylases. Mosquitoes, like Drosophila melanogaster, possess only a single methylase, DNMT2.
Here we characterise the methylome of the mosquito Aedes aegypti and examine its relationship to transcription and test the effects of infection with a virulent strain of the endosymbiont Wolbachia on the stability of methylation patterns.
We see that methylation in the A. aegypti genome is associated with reduced transcription and is most common in the promoters of genes relating to regulation of transcription and metabolism. Similar gene classes are also methylated in aphids and honeybees, suggesting either conservation or convergence of methylation patterns. In addition to this evidence of evolutionary stability, we also show that infection with the virulent wMelPop Wolbachia strain induces additional methylation and demethylation events in the genome. While most of these changes seem random with respect to gene function and have no detected effect on transcription, there does appear to be enrichment of genes associated with membrane function. Given that Wolbachia lives within a membrane-bound vacuole of host origin and retains a large number of genes for transporting host amino acids, inorganic ions and ATP despite a severely reduced genome, these changes might represent an evolved strategy for manipulating the host environments for its own gain. Testing for a direct link between these methylation changes and expression, however, will require study across a broader range of developmental stages and tissues with methods that detect splice variants.
Although it has been clearly established that well-positioned histone H2A.Z-containing nucleosomes flank the nucleosome-depleted region (NDR) at the transcriptional start site (TSS) of active mammalian genes, how this chromatin-based information is transmitted through the cell cycle is unknown. We show here that in mouse trophoblast stem cells, the amount of histone H2A.Z at promoters decreased during S phase, coinciding with homotypic (H2A.Z-H2A.Z) nucleosomes flanking the TSS becoming heterotypic (H2A.Z-H2A). To our surprise, these nucleosomes remained heterotypic at M phase. At the TSS, we identified an unstable heterotypic histone H2A.Z-containing nucleosome in G1 phase that was lost after DNA replication. These dynamic changes at the TSS mirror a global expansion of the NDR at S and M phases, which, unexpectedly, is unrelated to transcriptional activity. Coincident with the loss of histone H2A.Z at promoters, histone H2A.Z is targeted to the centromere when mitosis begins.
q2-sample-classifier is a plugin for the QIIME 2 microbiome bioinformatics platform that facilitates access, reproducibility, and interpretation of supervised learning (SL) methods for a broad audience of non-bioinformatics specialists.
Continuous-time Markov processes are often used to model the complex natural phenomenon of sequence evolution. To make the process of sequence evolution tractable, simplifying assumptions are often made about the sequence properties and the underlying process. The validity of one such assumption, time-homogeneity, has never been explored. Violations of this assumption can be found by identifying non-embeddability. A process is non-embeddable if it cannot be embedded in a continuous time-homogeneous Markov process. In this study, non-embeddability was demonstrated to exist when modelling sequence evolution with Markov models. Evidence of non-embeddability was found primarily at the third codon position, possibly resulting from changes in mutation rate over time. Outgroup edges and those with a deeper time depth were found to have an increased probability of the underlying process being non-embeddable. Overall, low levels of non-embeddability were detected when examining individual edges of triads across a diverse set of alignments. Subsequent phylogenetic reconstruction analyses demonstrated that non-embeddability could affect the correct prediction of phylogenies, but only at extremely low levels. Despite the existence of non-embeddability, there is minimal evidence of violations of the local time homogeneity assumption and consequently the impact is likely to be minor.
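Embeddability is easiest to see in the two-state case, which serves here as a simplified stand-in for the four-state nucleotide processes discussed above. A 2x2 transition matrix P = [[1-a, a], [b, 1-b]] can be written as exp(Q) for a valid rate matrix Q exactly when det(P) = 1 - a - b is positive. The sketch below, with the hypothetical function name `embed_two_state` and elapsed time fixed at 1, returns that Q or reports non-embeddability; larger state spaces need a matrix logarithm and more involved criteria.

```python
import math


def embed_two_state(a, b):
    """Return Q with exp(Q) = [[1-a, a], [b, 1-b]], or None if not embeddable.

    a, b: the two off-diagonal transition probabilities of a 2-state chain.
    """
    if not (0.0 <= a <= 1.0 and 0.0 <= b <= 1.0):
        raise ValueError("a and b must be probabilities")
    s = a + b
    if s == 0.0:
        return [[0.0, 0.0], [0.0, 0.0]]  # identity matrix: no change at all
    if s >= 1.0:
        return None  # det(P) = 1 - s <= 0, so no time-homogeneous process fits
    # With M = [[-a, a], [b, -b]] we have M @ M = -s * M, which gives
    # exp(c * M) = I + M * (1 - exp(-c * s)) / s; solving for exp(c * M) = P:
    scale = -math.log(1.0 - s) / s
    return [[-a * scale, a * scale], [b * scale, -b * scale]]


embeddable_q = embed_two_state(0.1, 0.2)      # det(P) = 0.7 > 0
non_embeddable_q = embed_two_state(0.6, 0.7)  # det(P) = -0.3 < 0
```

The second matrix is a perfectly valid set of transition probabilities for a discrete step, yet no single rate matrix applied over the whole interval can produce it, which is the sense in which non-embeddability signals a violation of time-homogeneity.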