Reference databases with wide taxonomic coverage are greatly needed in many fields of biology, most particularly for the taxonomic assignment of metabarcoding sequences. Therefore, it is fundamental to be able to access and pool data from different primary databases. The COInr database is a freely available, easy‐to‐access database of COI reference sequences extracted from the BOLD and NCBI nucleotide databases. It is a comprehensive database, not limited to a taxon, a gene region or a taxonomic rank; therefore, it is a good starting point for creating custom databases. Sequences are dereplicated between databases and within taxa. Each taxon has a unique taxonomic identifier (taxID), fundamental to avoid ambiguous associations of homonyms and synonyms in the source database. TaxIDs form a coherent hierarchical system fully compatible with the NCBI taxIDs, allowing their full or ranked lineages to be created. The mkCOInr tool is a series of Perl scripts designed to download sequences from BOLD and NCBI, to build the COInr database and to customize it according to the users’ needs. It is possible to select or eliminate sequences for a list of taxa, select a specific gene region, select for minimum taxonomic resolution, add new custom sequences, and format the database for BLAST, VTAM, QIIME and RDP classifier. This is a semi‐automated pipeline using command lines in a Linux environment. The COInr database can be downloaded from https://doi.org/10.5281/zenodo.6555985 and mkCOInr and its full documentation are available at https://github.com/meglecz/mkCOInr.
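As an illustration of the within‐taxon dereplication mentioned above, the following is a minimal Python sketch, not the Perl implementation used by the tool itself. It assumes that dereplication means dropping a sequence when it is identical to, or fully contained in, another sequence carrying the same taxID; the function name `dereplicate_within_taxa` and the `(seq_id, taxid, sequence)` record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int, str]  # (sequence ID, taxID, sequence); hypothetical layout


def dereplicate_within_taxa(records: Iterable[Record]) -> List[Record]:
    """Keep one representative per distinct sequence within each taxID.

    Assumption: a sequence is redundant when it is identical to, or a
    substring of, a sequence already kept for the same taxID.
    """
    by_taxid: Dict[int, List[Record]] = defaultdict(list)
    for seq_id, taxid, seq in records:
        by_taxid[taxid].append((seq_id, taxid, seq.upper()))

    kept: List[Record] = []
    for taxid in sorted(by_taxid):
        representatives: List[Record] = []
        # Longest first, so shorter sequences are compared against longer kept ones.
        for record in sorted(by_taxid[taxid], key=lambda r: len(r[2]), reverse=True):
            if not any(record[2] in rep[2] for rep in representatives):
                representatives.append(record)
        kept.extend(representatives)
    return kept


example = [
    ("s1", 10, "ACGTACGT"),
    ("s2", 10, "CGTA"),       # contained in s1, same taxID
    ("s3", 10, "acgtacgt"),   # identical to s1 apart from case
    ("s4", 20, "CGTA"),       # same sequence as s2 but a different taxID
    ("s5", 10, "TTTT"),
]
kept_ids = [seq_id for seq_id, _, _ in dereplicate_within_taxa(example)]
```

Identical sequences assigned to different taxIDs are deliberately kept in this sketch, because removing them would discard information about which taxa share a haplotype.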
Microsatellites (or SSRs: simple sequence repeats) are among the most frequently used DNA markers in many areas of research. The use of microsatellite markers is limited by the difficulties involved in their de novo isolation from species for which no genomic resources are available. We describe here a high‐throughput method for isolating microsatellite markers based on coupling multiplex microsatellite enrichment and next‐generation sequencing on 454 GS‐FLX Titanium platforms. The procedure was calibrated on a model species (Apis mellifera) and validated on 13 other species from various taxonomic groups (animals, plants and fungi), including taxa for which severe difficulties were previously encountered using traditional methods. We obtained from 11 497 to 34 483 sequences depending on the species and the number of detected microsatellite loci ranged from 199 to 5791. We thus demonstrated that this procedure can be readily and successfully applied to a large variety of taxonomic groups, at much lower cost than would have been possible with traditional protocols. This method is expected to speed up the acquisition of high‐quality genetic markers for nonmodel organisms.
Metabarcoding is now a widely used method for biodiversity studies. Taxonomic assignment of environmental sequences is one of the key steps of metabarcoding. Assignments based on the lowest common ancestor (LCA) method generally rely on fixed arbitrary thresholds, which are generally not well adapted to the assignment of taxonomically diverse groups with variable coverage in reference databases. mkLTG is an LCA-based method that uses a series of percentage of identity thresholds, starting from stringent parameters and relaxing them if necessary. All parameters can be set separately for each percentage of identity threshold, which makes this tool adaptable to different databases, genetic markers and diverse taxonomic groups. An optimization step was included using the COI marker and a comprehensive, non-redundant database. The mkLTG tool is a command-line application with few dependencies that runs on all operating systems; it is therefore easy to include in complex pipelines. All scripts, including those for benchmarking, are freely available at https://github.com/meglecz/mkLTG.
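The idea of stepping through identity thresholds from stringent to permissive can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the tool's own algorithm or parameter set: each hit is assumed to be a `(percent_identity, lineage)` pair with a root-to-leaf lineage, each threshold level is reduced to a hypothetical `(min_identity, min_hits)` pair, and the names `lowest_common_ancestor` and `assign` are invented for the example.

```python
from typing import List, Optional, Sequence, Tuple

Hit = Tuple[float, List[str]]  # (percent identity, root-to-leaf lineage); hypothetical


def lowest_common_ancestor(lineages: Sequence[List[str]]) -> List[str]:
    """Return the longest prefix shared by all root-to-leaf lineages."""
    shared: List[str] = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def assign(hits: Sequence[Hit],
           thresholds: Sequence[Tuple[float, int]]) -> Optional[List[str]]:
    """Try each (min_identity, min_hits) level in order, most stringent first.

    The first level with enough retained hits yields the LCA of those hits;
    None is returned when no level is satisfied.
    """
    for min_identity, min_hits in thresholds:
        kept = [lineage for identity, lineage in hits if identity >= min_identity]
        if len(kept) >= min_hits:
            return lowest_common_ancestor(kept)
    return None


hits = [
    (99.0, ["Animalia", "Arthropoda", "Insecta", "Diptera"]),
    (92.0, ["Animalia", "Arthropoda", "Insecta", "Coleoptera"]),
    (91.0, ["Animalia", "Arthropoda", "Insecta", "Diptera"]),
]
# Two levels: at least 2 hits at >= 97% identity, otherwise at least 2 hits at >= 90%.
result = assign(hits, [(97.0, 2), (90.0, 2)])
```

In this example the 97% level retains only one hit, so the assignment falls back to the 90% level, where the three retained lineages agree down to the class but not the order.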
The rapid evolution of 454 GS-FLX sequencing technology has not been accompanied by a reassessment of the quality and accuracy of the sequences obtained. Current strategies for decision-making and error-correction are based on an initial analysis by Huse et al. in 2007, for the older GS20 system based on experimental sequences. We analyze here the quality of 454 sequencing data and identify factors playing a role in sequencing error, through the use of an extensive dataset for Roche control DNA fragments.
We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in certain positions, and its distribution was linked to several experimental variables. The main factors related to error are the presence of homopolymers, position in the sequence, size of the sequence and spatial localization in PT plates for insertion and deletion errors. These factors can be described by considering seven variables. No single variable can account for the error rate distribution, but most of the variation is explained by the combination of all seven variables.
The pattern identified here calls for the use of internal controls and error-correcting base callers, to correct for errors, when available (e.g. when sequencing amplicons). For shotgun libraries, the use of both sequencing primers and deep coverage, combined with the use of random sequencing primer sites, should partly compensate for even high error rates, although it may prove more difficult than previously thought to distinguish between low-frequency alleles and errors.
QDD is an open access program providing a user-friendly tool for microsatellite detection and primer design from large sets of DNA sequences. The program is designed to deal with all steps of treatment of raw sequences obtained from pyrosequencing of enriched DNA libraries, but it is also applicable to data obtained through other sequencing methods, using FASTA files as input. The following tasks are completed by QDD: tag sorting, adapter/vector removal, elimination of redundant sequences, detection of possible genomic multicopies (duplicated loci or transposable elements), stringent selection of target microsatellites and customizable primer design. It can treat up to one million sequences of a few hundred base pairs in the tag-sorting step, and up to 50 000 sequences in a single input file for the steps involving estimation of sequence similarity. Availability: QDD is freely available under the GPL licence for Windows and Linux from the following web site: http://www.univ-provence.fr/gsite/Local/egee/dir/meglecz/QDD.html Contact: emese.meglecz@univ-provence.fr Supplementary information: Supplementary data are available at Bioinformatics online.
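To make the microsatellite detection step concrete, the snippet below shows one common way of finding perfect tandem repeats with a regular expression. It is a minimal Python sketch and not the detection code of the program described above; the function name `find_microsatellites` and the default values for `min_repeats` and `max_motif` are assumptions chosen for illustration.

```python
import re
from typing import List, Tuple


def find_microsatellites(seq: str, min_repeats: int = 5,
                         max_motif: int = 6) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for each perfect tandem repeat.

    The lazy group captures the shortest motif of 1..max_motif bases, and the
    backreference requires at least min_repeats - 1 further copies after it.
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_repeats - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]


dinucleotide = find_microsatellites("GGTT" + "AC" * 8 + "GGA")
trinucleotide = find_microsatellites("atg" * 5)
too_short = find_microsatellites("ATG" * 4)
```

Only perfect repeats are reported here; compound or interrupted microsatellites, and the flanking-region checks needed for primer design, would require additional logic.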
Microsatellite marker development has been greatly simplified by the use of high‐throughput sequencing followed by in silico microsatellite detection and primer design. However, the selection of markers designed by the existing pipelines depends either on arbitrary criteria or on older studies of PCR success. Based on wet laboratory experiments, we have identified the following factors that are most likely to influence genotyping success rate: alignment score between the primers and the amplicon; the distance between primers and microsatellites; the length of the PCR product; target region complexity and the number of reads underlying the sequence. The QDD pipeline has been modified to include these most pertinent factors in the output to help the selection of markers. Furthermore, new features are also included in the present version: (i) not only raw sequencing reads are accepted as input, but also contigs, allowing the analysis of assembled high‐coverage data; (ii) input data can be in either fasta or fastq format to facilitate the use of Illumina and IonTorrent reads; (iii) a comparison to known transposable elements allows their detection; (iv) a contamination check can be carried out by BLASTing potential markers against the nucleotide (nt) database of NCBI; (v) QDD3 is now also available embedded in a virtual machine, making installation easier and operating system independent. It can be used both from the command line and integrated into a Galaxy server, providing a user‐friendly interface, as well as the possibility to utilize a large variety of NGS tools.
Genetic data show that many nominal species are composed of more than one biological species, and thus contain cryptic species in the broad sense (including overlooked species). When ignored, cryptic species generate confusion which, beyond biodiversity or vulnerability underestimation, blurs our understanding of ecological and evolutionary processes and may impact the soundness of decisions in conservation or medicine. However, very few hypotheses have been tested about factors that predispose a taxon to contain cryptic or overlooked species. To fill this gap, we surveyed the literature on free‐living marine metazoans and built two data sets, one of 187,603 nominal species and another of 83 classes or phyla, to test several hypotheses, correcting for sequence data availability, taxon size and phylogenetic relatedness. We found a strong effect of scientific history: the probability of a taxon containing cryptic species was highest for the earliest described species and varied among time periods in a way potentially consistent with an influence of prevailing scientific theories. The probability of cryptic species being present was also increased for species with large distribution ranges. They were more frequent in the north polar and south polar zones, contradicting previous predictions of more cryptic species in the tropics, and supporting the hypothesis that many cryptic species diverged recently. The number of cryptic species varied among classes, with an excess in hydrozoans and polychaetes, and a deficit in actinopterygians, for example, but precise class ranking was relatively sensitive to the statistical model used.
For all models, biological traits, rather than phylum, appeared responsible for the variation among classes: there were fewer cryptic species than expected in classes with hard skeletons (perhaps because they provide good characters for taxonomy) and image‐forming vision (in which selection against heterospecific mating may enhance morphological divergence), and more in classes with internal fertilisation. We estimate that among marine free‐living metazoans, several thousand additional cryptic species complexes could be identified as more sequence data become available. The factors identified as important for marine animal cryptic species are likely important for other biomes and taxa and should aid many areas in biology that rely on accurate species identification.
Downloading large batches of DNA sequences can be useful to create custom databases containing, for example, sequences of a particular genomic region or a group of organisms. These sequences can be found in NCBI databases and accessed via a web browser (GUI) or directly via the NCBI API. While the GUI is user-friendly, it lacks certain functionalities. At the other extreme, the API is flexible but requires coding knowledge. NSDPY is a Python package that combines flexibility and ease of use to download large numbers of DNA sequences and includes several taxonomic or filtering options, such as batch downloading sequences for a list of taxa, downloading sequences including taxonomic lineage or filtering CDS sequences for a specific gene. NSDPY is available on PyPI; it is written to minimize dependencies on other packages and to be used directly from the terminal by simple command lines, so that most users can use it without prior coding experience.
The main objective of this work was to develop and validate a robust and reliable “from‐benchtop‐to‐desktop” metabarcoding workflow to investigate the diet of invertebrate‐eaters. We applied our workflow to faecal DNA samples of an invertebrate‐eating fish species. A fragment of the cytochrome c oxidase I (COI) gene was amplified by combining two minibarcoding primer sets to maximize the taxonomic coverage. Amplicons were sequenced on an Illumina MiSeq platform. We developed a filtering approach based on a series of nonarbitrary thresholds established from control samples and from molecular replicates to address the elimination of cross‐contamination, PCR/sequencing errors and mistagging artefacts. This resulted in a conservative and informative metabarcoding data set. We developed a taxonomic assignment procedure that combines different approaches and that allowed the identification of ~75% of invertebrate COI variants to the species level. Moreover, based on the diversity of the variants, we introduced a semiquantitative statistic in our diet study, the minimum number of individuals, which is based on the number of distinct variants in each sample. The metabarcoding approach described in this article may guide future diet studies that aim to produce robust data sets associated with a fine and accurate identification of prey items.
The adaptability of plant populations to a changing environment depends on their genetic diversity, which in turn is influenced by the degree of sexual reproduction and gene flow from distant areas. Aquatic macrophytes can reproduce both sexually and asexually, and their reproductive fragments are spread in various ways (e.g. by water). Although these plants are obviously exposed to hydrological changes, the degree of vulnerability may depend on the types of their reproduction and distribution, as well as the hydrological differences of habitats. The aim of this study was to investigate the genetic diversity of the cosmopolitan macrophyte Ceratophyllum demersum in hydrologically different aquatic habitats, i.e. rivers and backwaters separated from the main river bed to a different extent. For this purpose, the first microsatellite primer set was developed for this species. Using 10 developed primer pairs, a high level of genetic variation was explored in C. demersum populations. Overall, more than 80% of the loci were found to be polymorphic, and a total of 46 different multilocus genotypes and 18 private alleles were detected in the 63 individuals examined. The results demonstrated that microsatellite polymorphism in this species depends on habitat hydrology. The greatest genetic variability was revealed in populations of rivers, where flowing water provides constant longitudinal connections with distant habitats. The populations of the hydrologically isolated backwaters showed the lowest microsatellite polymorphism, while plants from an oxbow occasionally flooded by the main river had medium genetic diversity. The results highlight that, in contrast to species that spread independently of water flow or among hydrologically isolated water bodies, macrophytes with exclusive or dominant hydrochory may be most severely affected by habitat fragmentation, for example due to climate change.