The opportunities for bacterial population genomics that are being realised by the application of parallel nucleotide sequencing require novel bioinformatics platforms. These must be capable of the storage, retrieval, and analysis of linked phenotypic and genotypic information in an accessible, scalable and computationally efficient manner.
The Bacterial Isolate Genome Sequence Database (BIGSdb) is a scalable, open source, web-accessible database system that meets these needs, enabling phenotype and sequence data, which can range from a single sequence read to whole genome data, to be efficiently linked for a limitless number of bacterial specimens. The system builds on the widely used mlstdbNet software, developed for the storage and distribution of multilocus sequence typing (MLST) data, and incorporates the capacity to define and identify any number of loci and genetic variants at those loci within the stored nucleotide sequences. These loci can be further organised into 'schemes' for isolate characterisation or for evolutionary or functional analyses. Isolates and loci can be indexed by multiple names and any number of alternative schemes can be accommodated, enabling cross-referencing of different studies and approaches. LIMS functionality of the software enables linkage to and organisation of laboratory samples. The data are easily linked to external databases, and fine-grained authentication of access permits multiple users to participate in community annotation by setting up or contributing to different schemes within the database. Some of the applications of BIGSdb are illustrated with the genera Neisseria and Streptococcus. The BIGSdb source code and documentation are available at http://pubmlst.org/software/database/bigsdb/.
Genomic data can be used to characterise bacterial isolates in many different ways, but they can also be efficiently exploited for evolutionary or functional studies. BIGSdb represents a freely available resource that will assist the broader community in the elucidation of the structure and function of bacteria by means of a population genomics approach.
The PubMLST.org website hosts a collection of open-access, curated databases that integrate population sequence data with provenance and phenotype information for over 100 different microbial species and genera. Although the PubMLST website was conceived as part of the development of the first multi-locus sequence typing (MLST) scheme in 1998, the software it uses, the Bacterial Isolate Genome Sequence Database (BIGSdb, published in 2010), enables PubMLST to include all levels of sequence data, from single gene sequences up to and including complete, finished genomes. Here we describe developments in the BIGSdb software made from publication to June 2018 and show how the platform realises microbial population genomics for a wide range of applications. The system is based on the gene-by-gene analysis of microbial genomes, with each deposited sequence annotated and curated to identify the genes present and systematically catalogue their variation. Originally intended as a means of characterising isolates with typing schemes, the synthesis of sequences and records of genetic variation with provenance and phenotype data permits highly scalable (whole genome sequence data for tens of thousands of isolates) means of addressing a wide range of functional questions, including: the prediction of antimicrobial resistance; likely cross-reactivity with vaccine antigens; and the functional activities of different variants that lead to key phenotypes. There are no limitations to the number of sequences, genetic loci, allelic variants or schemes (combinations of loci) that can be included, enabling each database to represent an expanding catalogue of the genetic variation of the population in question.
In addition to providing web-accessible analyses and links to third-party analysis and visualisation tools, the BIGSdb software includes a RESTful application programming interface (API) that enables access to all the underlying data for third-party applications and data analysis pipelines.
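The idea of a scheme as a combination of loci, with each isolate characterised by its allele designations at those loci, can be illustrated with a minimal sketch. This is not BIGSdb code: the three-locus scheme, the profile table and the function name `sequence_type` are all hypothetical, chosen only to show how an allelic profile maps to a sequence type (ST).

```python
from typing import Dict, Optional, Tuple

# Hypothetical three-locus scheme (real MLST schemes typically use seven loci).
SCHEME_LOCI: Tuple[str, ...] = ("abcZ", "adk", "aroE")

# Hypothetical profile table: allele numbers in SCHEME_LOCI order -> ST.
PROFILES: Dict[Tuple[int, ...], int] = {
    (1, 3, 1): 1,
    (2, 3, 4): 11,
}


def sequence_type(alleles: Dict[str, int]) -> Optional[int]:
    """Return the ST for an isolate's allele designations.

    Returns None when a scheme locus has no designation or when the
    profile has not been defined in the (hypothetical) profile table.
    """
    try:
        profile = tuple(alleles[locus] for locus in SCHEME_LOCI)
    except KeyError:
        return None  # incomplete profile
    return PROFILES.get(profile)
```

Because the scheme is just an ordered list of loci, the same allele designations could be reused by any number of alternative schemes without re-annotating the sequences.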
The increasing availability of hundreds of whole bacterial genomes provides opportunities for enhanced understanding of the genes and alleles responsible for clinically important phenotypes and how they evolved. However, it is a significant challenge to develop easy-to-use and scalable methods for characterizing these large and complex data and relating them to disease epidemiology. Existing approaches typically focus on either homologous sequence variation in genes that are shared by all isolates, or non-homologous sequence variation, focusing on genes that are differentially present in the population. Here we present a comparative genomics approach that simultaneously approximates core and accessory genome variation in pathogen populations and apply it to pathogenic species in the genus Campylobacter. A total of 7 published Campylobacter jejuni and Campylobacter coli genomes were selected to represent diversity across these species, and a list of all loci that were present at least once was compiled. After filtering duplicates, a 7-isolate reference pan-genome of 3,933 loci was defined. A core genome of 1,035 genes was ubiquitous in the sample, accounting for 59% of the genes in each isolate (average genome size of 1.68 Mb). The accessory genome contained 2,792 genes. A Campylobacter population sample of 192 genomes was screened for the presence of reference pan-genome loci, with gene presence defined as a BLAST match of ≥ 70% identity over ≥ 50% of the locus length, and sequences were aligned using MUSCLE on a gene-by-gene basis. A total of 21 genes were present only in C. coli and 27 only in C. jejuni, providing information about functional differences associated with species and novel epidemiological markers for population genomic analyses. Homologs of these genes were found in several of the genomes used to define the pan-genome and, therefore, would not have been identified using a single reference strain approach.
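The gene-presence rule quoted above (a BLAST match of ≥ 70% identity over ≥ 50% of the locus length) and the split into core and accessory loci can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the `Hit` record and all function names are hypothetical, and the hit values are assumed to have been parsed from BLAST output elsewhere.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    """Hypothetical summary of the best BLAST match for one reference locus."""
    locus: str
    percent_identity: float
    alignment_length: int
    locus_length: int


def locus_present(hit: Hit, min_identity: float = 70.0,
                  min_coverage: float = 50.0) -> bool:
    """Apply the identity and coverage thresholds stated in the text."""
    coverage = 100.0 * hit.alignment_length / hit.locus_length
    return hit.percent_identity >= min_identity and coverage >= min_coverage


def loci_present(hits: Iterable[Hit]) -> Set[str]:
    """Names of the loci scored as present in one genome."""
    return {h.locus for h in hits if locus_present(h)}


def split_core_accessory(
    presence: Dict[str, Set[str]]
) -> Tuple[Set[str], Set[str]]:
    """Core = loci present in every isolate; accessory = all other loci."""
    if not presence:
        return set(), set()
    per_isolate = list(presence.values())
    core = set.intersection(*per_isolate)
    accessory = set().union(*per_isolate) - core
    return core, accessory
```

For example, `split_core_accessory({"i1": {"a", "b"}, "i2": {"a", "c"}})` gives `{"a"}` as core and `{"b", "c"}` as accessory.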
Acinetobacter species assigned to the Acinetobacter calcoaceticus-baumannii (Acb) complex are Gram-negative bacteria responsible for a large number of human infections. The population structure of Acb has been studied using two 7-gene MLST schemes, introduced by Bartual and coworkers (Oxford scheme) and by Diancourt and coworkers (Pasteur scheme). The schemes have three genes in common but underlie two coexisting nomenclatures of sequence types and clonal complexes, which complicates communication on A. baumannii genotypes. The aim of this study was to compare the characteristics of the two schemes to make a recommendation about their usage. Using genome sequences of 730 strains of the Acb complex, we evaluated the phylogenetic congruence of MLST schemes, the correspondence between sequence types, their discriminative power and genotyping reliability from genomic sequences.
ST re-assignments highlighted the presence of a second copy of the Oxford gdhB locus, present in 553/730 genomes, that has led to the creation of artefactual profiles and STs. The reliability of the two MLST schemes was tested statistically by comparing MLST-based phylogenies to two reference phylogenies (core-genome genes and genome-wide SNPs) using topology-based and likelihood-based tests. Additionally, each MLST gene fragment was evaluated by correlating the pairwise nucleotide distances between each pair of genomes calculated on the core genome and on each single gene fragment. The Pasteur scheme appears to be less discriminant among closely related isolates, but less affected by homologous recombination and more appropriate for precise strain classification in clonal groups, which within this scheme are more often correctly monophyletic. Statistical tests evaluate the tree deriving from the Oxford scheme as more similar to the reference genome trees. Our results, together with previous work, indicate that the Oxford scheme has important issues: gdhB paralogy, recombination, primer sequences, and position of the genes on the genome. While there is no complete agreement in all analyses, when considered as a whole the above results indicate that the Pasteur scheme is more appropriate for population biology and epidemiological studies of A. baumannii and related species, and we propose that it should be the scheme of choice during the transition toward, and in parallel with, core genome MLST.
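The per-fragment evaluation described above, correlating pairwise nucleotide distances computed on the core genome with those computed on a single gene fragment, can be sketched as below. The abstract does not state which distance or correlation coefficient was used, so the uncorrected p-distance and Pearson's r here are assumptions, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned and non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_distances(aligned: Dict[str, str]) -> List[float]:
    """Distances for every pair of genomes, in sorted-name pair order."""
    names = sorted(aligned)
    return [p_distance(aligned[i], aligned[j])
            for i, j in combinations(names, 2)]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient (an assumed choice of statistic)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series of at least 2 values")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant series")
    return cov / sqrt(vx * vy)
```

Calling `pearson(pairwise_distances(core_alignment), pairwise_distances(fragment_alignment))` with the same genome names in both alignments gives one score per gene fragment. A fragment whose distances track the core genome closely scores near 1.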
The genus Bordetella includes bacteria that are found in the environment and/or associated with humans and other animals. A few closely related species, including Bordetella pertussis, are human pathogens that cause diseases such as whooping cough. Here, we present a large database of Bordetella isolates and genomes and develop genotyping systems for the genus and for the B. pertussis clade. To generate the database, we merge previously existing databases from Oxford University and Institut Pasteur, import genomes from public repositories, and add 83 newly sequenced B. bronchiseptica genomes. The public database currently includes 2582 Bordetella isolates and their provenance data, and 2085 genomes (https://bigsdb.pasteur.fr/bordetella/). We use core-genome multilocus sequence typing (cgMLST) to develop genotyping systems for the whole genus and for B. pertussis, as well as specific schemes to define antigenic, virulence and macrolide resistance profiles. Phylogenetic analyses allow us to redefine evolutionary relationships among known Bordetella species, and to propose potential new species. Our database provides an expandable resource for genotyping of environmental and clinical Bordetella isolates, thus facilitating evolutionary and epidemiological research on whooping cough and other Bordetella infections.
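As an illustration of how cgMLST profiles are commonly compared, the sketch below counts the loci at which two isolates carry different alleles. It is a generic example and not the genotyping system described above: the function name is hypothetical, and ignoring loci with a missing designation in either isolate is one common convention, assumed here.

```python
from typing import Dict, Optional

Profile = Dict[str, Optional[int]]  # locus name -> allele number (None = missing)


def allelic_distance(p1: Profile, p2: Profile) -> int:
    """Number of shared, fully designated loci with different alleles.

    Loci absent from either profile, or with a missing (None) designation
    in either profile, are skipped rather than counted as differences.
    """
    distance = 0
    for locus, allele in p1.items():
        other = p2.get(locus)
        if allele is None or other is None:
            continue
        if allele != other:
            distance += 1
    return distance
```

A matrix of such distances between all pairs of isolates is a typical starting point for clustering or tree building.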
Following the association of Cronobacter spp. with several publicized fatal outbreaks of meningitis and necrotising enterocolitis in neonatal intensive care units, the World Health Organization (WHO) in 2004 requested the establishment of a molecular typing scheme to enable the international control of the organism. This paper presents the application of Next Generation Sequencing (NGS) to Cronobacter, which has led to the establishment of the Cronobacter PubMLST genome and sequence definition database (http://pubmlst.org/cronobacter/) containing over 1000 isolates with metadata, along with the recognition of specific clonal lineages linked to neonatal meningitis and adult infections.
Whole genome sequencing and multilocus sequence typing (MLST) have supported the formal recognition of the genus Cronobacter, composed of seven species, to replace the former single species Enterobacter sakazakii. Applying the 7-locus MLST scheme to 1007 strains revealed 298 definable sequence types, yet only C. sakazakii clonal complex 4 (CC4) was principally associated with neonatal meningitis. This clonal lineage has been confirmed using ribosomal-MLST (51 loci) and whole genome-MLST (1865 loci) to analyse 107 whole genomes via the Cronobacter PubMLST database. This database has enabled the retrospective analysis of historic cases and outbreaks following re-identification of those strains.
The Cronobacter PubMLST database offers a central, open access, reliable sequence-based repository for researchers. It has the capacity to create new analysis schemes 'on the fly', and to integrate metadata (source, geographic distribution, clinical presentation). It is also expandable and adaptable to changes in taxonomy, and able to support the development of reliable detection methods of use to industry and regulatory authorities. Therefore it meets the WHO (2004) request for the establishment of a typing scheme for this emergent bacterial pathogen. Whole genome sequencing has additionally shown a range of potential virulence and environmental fitness traits which may account for the association of C. sakazakii CC4 with pathogenicity and its propensity for neonatal CNS infection.
Mycobacterium bovis (M. bovis) is a causative agent of bovine tuberculosis, a significant source of morbidity and mortality in the global cattle industry. The Randomised Badger Culling Trial was a field experiment carried out between 1998 and 2005 in the South West of England. As part of this trial, M. bovis isolates were collected from contemporaneous and overlapping populations of badgers and cattle within ten defined trial areas. We combined whole genome sequences from 1,442 isolates with location and cattle movement data, identifying transmission clusters and inferring rates and routes of transmission of M. bovis. Most trial areas contained a single transmission cluster that had been established shortly before sampling, often contemporaneous with the expansion of bovine tuberculosis in the 1980s. The estimated rate of transmission from badger to cattle was approximately two times higher than from cattle to badger, and the rate of within-species transmission considerably exceeded these for both species. We identified long distance transmission events linked to cattle movement, recurrence of herd breakdown by infection within the same transmission clusters and superspreader events driven by cattle but not badgers. Overall, our data suggest that the transmission clusters in different parts of South West England that are still evident today were established by long-distance seeding events involving cattle movement, not by recrudescence from a long-established wildlife reservoir. Clusters are maintained primarily by within-species transmission, with less frequent spill-over both from badger to cattle and cattle to badger.
Highly parallel, 'second generation' sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary.
The performance of de novo short-read assembly followed by automatic annotation using the PubMLST.org Neisseria database was assessed and evaluated for 108 diverse, representative, and well-characterised Neisseria meningitidis isolates. High-quality sequences were obtained for >99% of known meningococcal genes among the de novo assembled genomes and four resequenced genomes, and less than 1% of reassembled genes had sequence discrepancies or misassembled sequences. A core genome of 1600 loci, present in at least 95% of the population, was determined using the Genome Comparator tool. Genealogical relationships compatible with, but at a higher resolution than, those identified by multilocus sequence typing were obtained with core genome comparisons and ribosomal protein gene analysis, which revealed a genomic structure for a number of previously described phenotypes. This unified system for cataloguing Neisseria genetic variation in the genome was implemented and used for multiple analyses, and the data are publicly available in the PubMLST Neisseria database.
The de novo assembly, combined with automated gene-by-gene annotation, generates high quality draft genomes in which the majority of protein-encoding genes are present with high accuracy. The approach catalogues diversity efficiently, permits analyses of a single genome or multiple genome comparisons, and is a practical approach to interpreting WGS data for large bacterial population samples. The method generates novel insights into the biology of the meningococcus and improves our understanding of the whole population structure, not just disease causing lineages.
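The core genome criterion mentioned above (loci present in at least 95% of the population) can be expressed as a simple frequency filter. The sketch below is not the Genome Comparator implementation. It assumes that locus presence per isolate has already been determined, and the function name `core_loci` is hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def core_loci(presence: Dict[str, Set[str]],
              min_fraction: float = 0.95) -> Set[str]:
    """Loci found in at least `min_fraction` of the isolates.

    `presence` maps each isolate name to the set of loci detected in it.
    """
    n_isolates = len(presence)
    if n_isolates == 0:
        return set()
    # Count, for each locus, how many isolates carry it.
    counts = Counter(locus for loci in presence.values() for locus in loci)
    return {locus for locus, c in counts.items()
            if c / n_isolates >= min_fraction}
```

Relaxing the threshold below 100% keeps loci that are missing from a few isolates, for example because of incomplete draft assemblies.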
Acinetobacter baumannii is a troublesome opportunistic pathogen with a high capacity for clonal dissemination. We announce the establishment of a database for the ampC locus in A. baumannii, in which novel ampC alleles are differentiated based on the occurrence of ≥ 1 nucleotide change, regardless of whether it is silent or missense. The database is openly accessible at the PubMLST platform for A. baumannii (http://pubmlst.org/abaumannii/). Forty-eight distinctive alleles of the ampC locus have so far been identified and deposited in the database. Isolates from clonal complex 1 (CC1), according to the Pasteur multilocus sequence typing scheme, had a variety of the ampC locus alleles, including alleles 1, 3, 4, 5, 6, 7, 8, 13, 14, 17, and 18. On the other hand, isolates from CC2 had the ampC alleles 2, 3, 19, 20, 21, 22, 23, 24, 26, 27, 28, and 46. Allele 3 was characteristic for sequence types ST3 or ST32. The ampC alleles 10, 16, and 25 were characteristic for CC10, ST16, and CC25, respectively. Our study points out that novel gene databases, in which alleles are numbered based on differences in their nucleotide identities, should replace traditional records that use amino acid substitutions to define new alleles.
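The allele-numbering rule described above, under which any nucleotide change defines a new allele whether or not it changes the protein, can be illustrated with a small sketch. The `AlleleRegistry` class and the short example sequences are hypothetical and are not taken from the ampC database.

```python
from typing import Dict


class AlleleRegistry:
    """Assigns sequential allele numbers to distinct nucleotide sequences."""

    def __init__(self) -> None:
        self._numbers: Dict[str, int] = {}

    def assign(self, sequence: str) -> int:
        """Return the existing number for a known sequence, else a new one.

        Comparison is on the exact nucleotide string (case-insensitive),
        so a silent substitution still produces a new allele.
        """
        key = sequence.upper()
        if key not in self._numbers:
            self._numbers[key] = len(self._numbers) + 1
        return self._numbers[key]
```

For example, `ATGGCT` and `ATGGCC` both encode Met-Ala, yet they receive different allele numbers under this rule, whereas a scheme based on amino acid substitutions would treat them as the same variant.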