Clustal Omega is a widely used package for carrying out multiple sequence alignment. Here, we describe some recent additions to the package and benchmark some alternative ways of making alignments. ...These benchmarks are based on protein structure comparisons or predictions and include a recently described method based on secondary structure prediction. In general, Clustal Omega is fast enough to make very large alignments and the accuracy of protein alignments is high when compared to alternative packages. The package is freely available as executables or source code from www.clustal.org or can be run on‐line from a variety of sites, especially the EBI www.ebi.ac.uk.
Clustal omega Sievers, Fabian; Higgins, Desmond G
Current protocols in bioinformatics,
December 2014, Letnik:
48
Journal Article
Clustal Omega is a package for making multiple sequence alignments of amino acid or nucleotide sequences, quickly and accurately. It is a complete upgrade and rewrite of earlier Clustal programs. ...This unit describes how to run Clustal Omega interactively from a command line, although it can also be run online from several sites. The unit describes a basic protocol for taking a set of unaligned sequences and producing a full alignment. There are also protocols for using an external HMM or iteration to help improve an alignment.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a ...bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high‐quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high‐quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Metagenomics uses nucleic acid sequencing to characterize species diversity in different niches such as environmental biomes or the human microbiome. Most studies have used 16S rRNA amplicon ...sequencing to identify bacteria. However, the decreasing cost of sequencing has resulted in a gradual shift away from amplicon analyses and towards shotgun metagenomic sequencing. Shotgun metagenomic data can be used to identify a wide range of species, but have rarely been applied to fungal identification. Here, we develop a sequence classification pipeline, FindFungi, and use it to identify fungal sequences in public metagenome datasets. We focus primarily on animal metagenomes, especially those from pig and mouse microbiomes. We identified fungi in 39 of 70 datasets comprising 71 fungal species. At least 11 pathogenic species with zoonotic potential were identified, including Candida tropicalis. We identified Pseudogymnoascus species from 13 Antarctic soil samples initially analyzed for the presence of bacteria capable of degrading diesel oil. We also show that Candida tropicalis and Candida loboi are likely the same species. In addition, we identify several examples where contaminating DNA was erroneously included in fungal genome assemblies.
Abstract
Motivation
Secondary structure prediction accuracy (SSPA) in the QuanTest benchmark can be used to measure accuracy of a multiple sequence alignment. SSPA correlates well with the ...sum-of-pairs score, if the results are averaged over many alignments but not on an alignment-by-alignment basis. This is due to a sub-optimal selection of reference and non-reference sequences in QuanTest.
Results
We develop an improved strategy for selecting reference and non-reference sequences for a new benchmark, QuanTest2. In QuanTest2, SSPA and SP correlate better on an alignment-by-alignment basis than in QuanTest. Guide-trees for QuanTest2 are more balanced with respect to reference sequences than in QuanTest. QuanTest2 scores correlate well with other well-established benchmarks.
Availability and implementation
QuanTest2 is available at http://bioinf.ucd.ie/quantest2.tar, comprises of reference and non-reference sequence sets and a scoring script.
Supplementary information
Supplementary data are available at Bioinformatics online
Numerous feature selection methods have been applied to the identification of differentially expressed genes in microarray data. These include simple fold change, classical t-statistic and moderated ...t-statistics. Even though these methods return gene lists that are often dissimilar, few direct comparisons of these exist. We present an empirical study in which we compare some of the most commonly used feature selection methods. We apply these to 9 publicly available datasets, and compare, both the gene lists produced and how these perform in class prediction of test datasets.
In this study, we compared the efficiency of the feature selection methods; significance analysis of microarrays (SAM), analysis of variance (ANOVA), empirical bayes t-statistic, template matching, maxT, between group analysis (BGA), Area under the receiver operating characteristic (ROC) curve, the Welch t-statistic, fold change, rank products, and sets of randomly selected genes. In each case these methods were applied to 9 different binary (two class) microarray datasets. Firstly we found little agreement in gene lists produced by the different methods. Only 8 to 21% of genes were in common across all 10 feature selection methods. Secondly, we evaluated the class prediction efficiency of each gene list in training and test cross-validation using four supervised classifiers.
We report that the choice of feature selection method, the number of genes in the genelist, the number of cases (samples) and the noise in the dataset, substantially influence classification success. Recommendations are made for choice of feature selection. Area under a ROC curve performed well with datasets that had low levels of noise and large sample size. Rank products performs well when datasets had low numbers of samples or high levels of noise. The Empirical bayes t-statistic performed well across a range of sample sizes.
Clustal Omega is a completely rewritten and revised version of the widely used Clustal series of programs for multiple sequence alignment. It can deal with very large numbers (many tens of thousands) ...of DNA/RNA or protein sequences due to its use of the mBED algorithm for calculating guide trees. This algorithm allows very large alignment problems to be tackled very quickly, even on personal computers. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models. The program currently is used from the command line or can be run on line.
The olfactory receptor (OR) gene families, which govern mammalian olfaction, have undergone extensive expansion and contraction through duplication and pseudogenization. Previous studies have shown ...that broadly defined environmental adaptations (e.g., terrestrial vs. aquatic) are correlated with the number of functional and non-functional OR genes retained. However, to date, no study has examined species-specific gene duplications in multiple phylogenetically divergent mammals to elucidate OR evolution and adaptation. Here, we identify the OR gene families driving adaptation to different ecological niches by mapping the fate of species-specific gene duplications in the OR repertoire of 94 diverse mammalian taxa, using molecular phylogenomic methods. We analyze >70,000 OR gene sequences mined from whole genomes, generated from novel amplicon sequencing data, and collated with data from previous studies, comprising one of the largest OR studies to date. For the first time, we demonstrate statistically significant patterns of OR species-specific gene duplications associated with the presence of a functioning vomeronasal organ. With respect to dietary niche, we uncover a novel link between a large number of duplications in OR family 5/8/9 and herbivory. Our results also highlight differences between social and solitary niches, indicating that a greater OR repertoire expansion may be associated with a solitary lifestyle. This study demonstrates the utility of species-specific duplications in elucidating gene family evolution, revealing how the OR repertoire has undergone expansion and contraction with respect to a number of ecological adaptations in mammals.
Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has lead to the development of many methods and packages for MSA over the past 30 ...years. Being able to compare different methods has been problematic and has relied on gold standard benchmark datasets of 'true' alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on using a small subset of sequences of known structure and do not fairly test the quality of a full MSA.
In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. SSPA measures the quality of an entire alignment however, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA alignment options and by including different levels of mis-alignment into MSA, and examining the effects on the scores.
QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz.
quan.le@ucd.ie.
Supplementary data are available at Bioinformatics online.
Clustal Omega is a version, completely rewritten and revised in 2011, of the widely used Clustal series of programs for multiple sequence alignment. It can deal with very large numbers (many tens of ...thousands) of DNA/RNA or protein sequences due to its use of the mBed algorithm for calculating guide-trees. This algorithm allows very large alignment problems to be tackled very quickly, even on personal computers. The accuracy of the program has been considerably improved over earlier Clustal programs, through the use of the HHalign method for aligning profile hidden Markov models. The program currently is used from the command-line or can be run online.