The average amino acid identity (AAI) is an index of pairwise genomic relatedness, and multiple studies have proposed its application in prokaryotic taxonomy and related disciplines. AAI demonstrates better resolution in elucidating taxonomic structure beyond the species rank when compared with average nucleotide identity (ANI), which is a standard criterion in species delineation. However, an efficient and easy-to-use computational tool for AAI calculation in large-scale taxonomic studies is not yet available. Here, we introduce a bioinformatic pipeline, named EzAAI, which allows for rapid and accurate AAI calculation in prokaryote sequences. The EzAAI tool is based on the MMSeqs2 program and computes AAI values almost identical to those generated by the standard BLAST algorithm with significant improvements in the speed of these evaluations. Our pipeline also provides a function for hierarchical clustering to create dendrograms, which is an essential part of any taxonomic study. EzAAI is available for download as a standalone JAVA program at http://leb.snu.ac.kr/ezaai.
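The hierarchical clustering step mentioned above can be illustrated with a minimal sketch. It assumes average linkage (UPGMA) on the distance 100 − AAI; the abstract does not state which linkage EzAAI uses, and the function name `upgma` and the AAI values in the usage note are hypothetical.

```python
from itertools import combinations


def upgma(labels, aai):
    """Average-linkage clustering on distance = 100 - AAI.

    labels: genome names; aai: symmetric matrix of AAI percentages.
    Returns the tree topology as a Newick string (no branch lengths).
    Assumption: UPGMA is used only for illustration.
    """
    # cluster id -> (Newick text, number of genomes in the cluster)
    clusters = {i: (label, 1) for i, label in enumerate(labels)}
    dist = {
        frozenset((i, j)): 100.0 - aai[i][j]
        for i, j in combinations(range(len(labels)), 2)
    }
    next_id = len(labels)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        (text_a, size_a), (text_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean of the two old distances.
        for c in clusters:
            dist[frozenset((next_id, c))] = (
                dist[frozenset((a, c))] * size_a
                + dist[frozenset((b, c))] * size_b
            ) / (size_a + size_b)
        clusters[next_id] = (f"({text_a},{text_b})", size_a + size_b)
        next_id += 1
    (newick, _), = clusters.values()
    return newick + ";"
```

With four hypothetical genomes where A/B share 95 % AAI, C/D share 90 %, and all cross pairs are near 60 %, `upgma(["A", "B", "C", "D"], matrix)` groups A with B and C with D before joining the two pairs.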
The recent advent of DNA sequencing technologies facilitates the use of genome sequencing data that provide means for more informative and precise classification and identification of members of the Bacteria and Archaea. Because the current species definition is based on the comparison of genome sequences between type and other strains in a given species, building a genome database with correct taxonomic information is of paramount importance to enhance our efforts in exploring prokaryotic diversity and discovering novel species as well as for routine identifications. Here we introduce an integrated database, called EzBioCloud, that holds the taxonomic hierarchy of the Bacteria and Archaea, which is represented by quality-controlled 16S rRNA gene and genome sequences. Whole-genome assemblies in the NCBI Assembly Database were screened for low quality and subjected to a composite identification bioinformatics pipeline that employs gene-based searches followed by the calculation of average nucleotide identity. As a result, the database is made of 61 700 species/phylotypes, including 13 132 with validly published names, and 62 362 whole-genome assemblies that were identified taxonomically at the genus, species and subspecies levels. Genomic properties, such as genome size and DNA G+C content, and the occurrence in human microbiome data were calculated for each genus or higher taxa. This united database of taxonomy, 16S rRNA gene and genome sequences, with accompanying bioinformatics tools, should accelerate genome-based classification and identification of members of the Bacteria and Archaea. The database and related search tools are available at www.ezbiocloud.net/.
Among available genome relatedness indices, average nucleotide identity (ANI) is one of the most robust measurements of genomic relatedness between strains, and has great potential in the taxonomy of bacteria and archaea as a substitute for the labour-intensive DNA–DNA hybridization (DDH) technique. An ANI threshold range (95–96 %) for species demarcation had previously been suggested based on comparative investigation between DDH and ANI values, albeit with rather limited datasets. Furthermore, its generality was not tested on all lineages of prokaryotes. Here, we investigated the overall distribution of ANI values generated by pairwise comparison of 6787 genomes of prokaryotes belonging to 22 phyla to see whether the suggested range can be applied to all species. There was an apparent distinction in the overall ANI distribution between intra- and interspecies relationships at around 95–96 % ANI. We went on to determine which level of 16S rRNA gene sequence similarity corresponds to the currently accepted ANI threshold for species demarcation using over one million comparisons. A twofold cross-validation statistical test revealed that 98.65 % 16S rRNA gene sequence similarity can be used as the threshold for differentiating two species, which is consistent with previous suggestions (98.2–99.0 %) derived from comparative studies between DDH and 16S rRNA gene sequence similarity. Our findings should be useful in accelerating the use of genomic sequence data in the taxonomy of bacteria and archaea.
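As a small illustration of how the thresholds quoted above might be applied in practice, the sketch below encodes the 95–96 % ANI range and the 98.65 % 16S rRNA gene similarity cut-off as simple decision rules. The function names and the "borderline" label for values inside the ANI range are assumptions made for this example, not part of the study.

```python
def classify_ani(ani, lower=95.0, upper=96.0):
    """Interpret a pairwise ANI value (%) against the 95-96 % range.

    Assumption: values inside the range are reported as 'borderline'
    rather than forced to either side.
    """
    if ani < lower:
        return "different species"
    if ani > upper:
        return "same species"
    return "borderline"


def below_16s_threshold(similarity, threshold=98.65):
    """True when 16S rRNA gene similarity (%) is below the cut-off,
    i.e. the two strains can be treated as different species."""
    return similarity < threshold
```

For example, `classify_ani(93.2)` reports different species, while a pair with 99.2 % 16S similarity is not separated by `below_16s_threshold` and would still need a genome-based comparison.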
Average nucleotide identity (ANI) is a category of computational analysis that can be used to define species boundaries of Archaea and Bacteria. Calculating ANI usually involves the fragmentation of genome sequences, followed by nucleotide sequence search, alignment, and identity calculation. The original algorithm to calculate ANI used the BLAST program as its search engine. An improved ANI algorithm, called OrthoANI, was developed to accommodate the concept of orthology. Here, we compared four algorithms to compute ANI, namely ANIb (ANI algorithm using BLAST), ANIm (ANI using MUMmer), OrthoANIb (OrthoANI using BLAST) and OrthoANIu (OrthoANI using USEARCH) using >100,000 pairs of genomes with various genome sizes. Compared with ANIb, which is considered the standard, OrthoANIb and OrthoANIu exhibited good correlation across the whole range of ANI values. ANIm showed poor correlation for ANI of <90%. ANIm and OrthoANIu ran faster than ANIb by an order of magnitude. When genomes larger than 7 Mbp were analysed, the run-times of ANIm and OrthoANIu were shorter than that of ANIb by 53- and 22-fold, respectively. In conclusion, ANI calculation can be greatly sped up by the OrthoANIu method without losing accuracy. A web-service that can be used to calculate OrthoANIu between a pair of genome sequences is available at http://www.ezbiocloud.net/tools/ani. For large-scale calculation and integration in bioinformatics pipelines, a standalone JAVA program is available for download at http://www.ezbiocloud.net/tools/orthoaniu.
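The general ANI workflow described above (fragmentation, search, alignment, identity calculation) can be sketched in a toy form. This is not ANIb, ANIm or OrthoANI: the search step is replaced by a brute-force ungapped sliding-window comparison, and the fragment size, the identity cut-off and all function names are hypothetical choices made so the example stays self-contained.

```python
def fragments(seq, size):
    """Cut a sequence into consecutive, non-overlapping fragments.
    A trailing piece shorter than `size` is dropped."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def best_identity(fragment, reference):
    """Toy stand-in for the search/alignment step: the best ungapped
    identity (%) of the fragment over every offset in the reference."""
    n = len(fragment)
    best = 0.0
    for start in range(len(reference) - n + 1):
        window = reference[start:start + n]
        matches = sum(a == b for a, b in zip(fragment, window))
        best = max(best, 100.0 * matches / n)
    return best


def toy_ani(query, reference, size=10, min_identity=30.0):
    """Mean identity of query fragments whose best hit passes the cut-off.
    Returns None when no fragment qualifies."""
    hits = [best_identity(f, reference) for f in fragments(query, size)]
    kept = [h for h in hits if h >= min_identity]
    return sum(kept) / len(kept) if kept else None
```

Real tools differ mainly in the search engine (BLAST, MUMmer, USEARCH) and in how hits are filtered; the averaging step at the end is the part this sketch shares with them.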
Advances in DNA sequencing technology allow the routine use of genome sequences in the various fields of microbiology. The information held in genome sequences has proved to provide an objective and reliable basis for the taxonomy of prokaryotes. Here, we describe the minimal standards for the quality of genome sequences and how they can be applied for taxonomic purposes.
Genome-based phylogeny plays a central role in the future taxonomy and phylogenetics of Bacteria and Archaea by replacing 16S rRNA gene phylogeny. The concatenated core gene alignments are frequently used for such a purpose. The bacterial core genes are defined as single-copy, homologous genes that are present in most of the known bacterial species. There have been several studies describing such a gene set, but the number of species considered was rather small. Here we present the up-to-date bacterial core gene set, named UBCG, and software suites to accommodate necessary steps to generate and evaluate phylogenetic trees. The method was successfully used to infer the phylogenomic relationships of Escherichia and related taxa and can be used for sets of genomes at any taxonomic rank of Bacteria. The UBCG pipeline and file viewer are freely available at https://www.ezbiocloud.net/tools/ubcg and https://www.ezbiocloud.net/tools/ubcg_viewer, respectively.
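The idea of a concatenated core gene alignment can be shown with a minimal sketch. It assumes each core gene has already been aligned separately and that a genome lacking a gene is padded with gap characters; the function name, the input layout and the gap-padding rule are illustrative assumptions, not a description of the UBCG implementation.

```python
def concatenate_alignments(gene_alignments, genomes):
    """Join per-gene alignments into one supermatrix row per genome.

    gene_alignments: {gene name: {genome name: aligned sequence}}
    genomes: genome names to include, in output order.
    Genes are concatenated in sorted name order; a genome missing a gene
    receives a run of '-' of that gene's alignment width (assumption).
    """
    rows = {genome: [] for genome in genomes}
    for gene in sorted(gene_alignments):
        alignment = gene_alignments[gene]
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences for {gene} differ in aligned length")
        width = widths.pop()
        for genome in genomes:
            rows[genome].append(alignment.get(genome, "-" * width))
    return {genome: "".join(parts) for genome, parts in rows.items()}
```

The resulting rows all have the same length, so they can be written out as a single alignment and passed to a tree-building program.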
Phylogenomic tree reconstruction has recently become a routine and critical task to elucidate the evolutionary relationships among bacterial species. The most widely used method utilizes concatenated core genes that are universally present in a single copy throughout the bacterial domain. In our previous study, a bioinformatics pipeline termed Up-to-date Bacterial Core Genes (UBCG) was developed with a set of bacterial core genes selected from 1,429 species covering 28 phyla. In this study, we present a revised bacterial core gene set, named UBCG2, that was selected from a more extensive set of genome sequences belonging to 3,508 species spanning 43 phyla. UBCG2 comprises 81 genes spanning nine Clusters of Orthologous Groups of proteins (COGs) functional categories. The new gene set and complete pipeline are available at http://leb.snu.ac.kr/ubcg2.
The gut microbiota can affect the health of its host, including humans. Mouse models have been used extensively to study the relationships between the host and the gut microbiota. With the development of cost-effective high-throughput DNA sequencing, several methods have been used to identify members of the gut microbiota of laboratory mice. In recent years, the amount of research and knowledge about the mouse gut microbiota has exploded, leading to significant breakthroughs in understanding of the taxonomic composition of and variation in this community. In addition, the rapidly increasing volume of data has allowed the development of public resources for exploring the mouse gut microbiota. In this review, we describe the concepts and pros and cons of basic methodologies that can be used to determine the gut bacterial profile in laboratory mice. We also present the key bacterial components of the mouse gut microbiota from the phylum to the species level and then compare them with those identified in other references. Additionally, we discuss variations in the mouse gut microbiota and their association with experiments using mice. Finally, we summarize the properties and functions of currently available public resources for exploring the mouse gut microbiota.