Multiplexing, the simultaneous sequencing of multiple barcoded DNA samples on a single flow cell, has made Oxford Nanopore sequencing cost-effective for small genomes. However, it depends on the ability to sort the resulting sequencing reads by barcode, and current demultiplexing tools fail to classify many reads. Here we present Deepbinner, a tool for Oxford Nanopore demultiplexing that uses a deep neural network to classify reads based on the raw electrical read signal. This 'signal-space' approach allows for greater accuracy than existing 'base-space' tools (Albacore and Porechop) for which signals must first be converted to DNA base calls, itself a complex problem that can introduce noise into the barcode sequence. To assess Deepbinner and existing tools, we performed multiplex sequencing on 12 amplicons chosen for their distinguishability. This allowed us to establish a ground truth classification for each read based on internal sequence alone. Deepbinner had the lowest rate of unclassified reads (7.8%) and the highest demultiplexing precision (98.5% of classified reads were correctly assigned). It can be used alone (to maximise the number of classified reads) or in conjunction with other demultiplexers (to maximise precision and minimise false positive classifications). We also found cross-sample chimeric reads (0.3%) and evidence of barcode switching (0.3%) in our dataset, which likely arise during library preparation and may be detrimental for quantitative studies that use multiplexing. Deepbinner is open source (GPLv3) and available at https://github.com/rrwick/Deepbinner.
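The two headline metrics in the abstract above (rate of unclassified reads, and precision among classified reads) can be computed from a ground-truth label and a tool-assigned label per read. The following is a minimal sketch under that assumption; the function name `demux_metrics`, the use of `None` for an unclassified read, and the toy labels are hypothetical and are not taken from Deepbinner or from the study's data.

```python
from typing import Dict, Optional, Sequence


def demux_metrics(truth: Sequence[str],
                  assigned: Sequence[Optional[str]]) -> Dict[str, float]:
    """Return the unclassified rate and precision for one demultiplexer.

    truth    -- ground-truth barcode label for each read
    assigned -- label given by the tool, or None if the read was unclassified
    """
    if len(truth) != len(assigned):
        raise ValueError("truth and assigned must have the same length")
    total = len(truth)
    # Keep only reads that the tool actually placed in a bin.
    classified = [(t, a) for t, a in zip(truth, assigned) if a is not None]
    correct = sum(1 for t, a in classified if t == a)
    return {
        # Share of all reads that the tool left unassigned.
        "unclassified_rate": (total - len(classified)) / total if total else 0.0,
        # Share of classified reads that were assigned to the correct bin.
        "precision": correct / len(classified) if classified else 0.0,
    }


# Toy labels, invented purely for illustration.
truth = ["bc01", "bc01", "bc02", "bc03", "bc03"]
assigned = ["bc01", None, "bc02", "bc02", "bc03"]
metrics = demux_metrics(truth, assigned)
print(metrics)
```

Precision is computed over classified reads only, which matches the abstract's wording ("98.5% of classified reads were correctly assigned"); a tool can therefore trade a higher unclassified rate for higher precision.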
To associate specimens identified by molecular characters to other biological knowledge, we need reference sequences annotated by Linnaean taxonomy. In this study, we (1) report the creation of a comprehensive reference library of DNA barcodes for the arthropods of an entire country (Finland), (2) publish this library, and (3) deliver a new identification tool for insects and spiders, as based on this resource. The reference library contains mtDNA COI barcodes for 11,275 (43%) of 26,437 arthropod species known from Finland, including 10,811 (45%) of 23,956 insect species. To quantify the improvement in identification accuracy enabled by the current reference library, we ran 1000 Finnish insect and spider species through the Barcode of Life Data system (BOLD) identification engine. Of these, 91% were correctly assigned to a unique species when compared to the new reference library alone, 85% were correctly identified when compared to BOLD with the new material included, and 75% with the new material excluded. To capitalize on this resource, we used the new reference material to train a probabilistic taxonomic assignment tool, FinPROTAX, scoring high success. For the full‐length barcode region, the accuracy of taxonomic assignments at the level of classes, orders, families, subfamilies, tribes, genera, and species reached 99.9%, 99.9%, 99.8%, 99.7%, 99.4%, 96.8%, and 88.5%, respectively. The FinBOL arthropod reference library and FinPROTAX are available through the Finnish Biodiversity Information Facility (www.laji.fi) at https://laji.fi/en/theme/protax. Overall, the FinBOL investment represents a massive capacity‐transfer from the taxonomic community of Finland to all sectors of society.
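The coverage percentages quoted in the abstract above follow directly from the species counts it gives. As a small worked check (the helper name `coverage_percent` is hypothetical; only the four counts come from the text):

```python
def coverage_percent(barcoded: int, known: int) -> float:
    """Percentage of known species that have at least one reference barcode."""
    if known <= 0:
        raise ValueError("known species count must be positive")
    return 100.0 * barcoded / known


# Counts as stated in the abstract.
arthropods = coverage_percent(11_275, 26_437)
insects = coverage_percent(10_811, 23_956)

print(f"arthropods: {arthropods:.1f}%  insects: {insects:.1f}%")
```

Rounded to whole percentages, these values agree with the 43% and 45% reported in the abstract.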
DNA-based identification is vital for classifying biological specimens, yet methods to quantify the uncertainty of sequence-based taxonomic assignments are scarce. Challenges arise from noisy reference databases, including mislabelled entries and missing taxa. PROTAX addresses these issues with a probabilistic approach to taxonomic classification, advancing on methods that rely solely on sequence similarity. It provides calibrated probabilistic assignments to a partially populated taxonomic hierarchy, accounting for taxa that lack references and incorrect taxonomic annotation. While effective on smaller scales, global application of PROTAX necessitates substantially larger reference libraries, a goal previously hindered by computational barriers. We introduce PROTAX-GPU, a scalable algorithm capable of leveraging the global Barcode of Life Data System (>14 million specimens) as a reference database. Using graphics processing units (GPU) to accelerate similarity and nearest-neighbour operations and the JAX library for Python integration, we achieve over a 1000× speedup compared with the central processing unit (CPU)-based implementation without compromising PROTAX's key benefits. PROTAX-GPU marks a significant stride towards real-time DNA barcoding, enabling quicker and more efficient species identification in environmental assessments. This capability opens up new avenues for real-time monitoring and analysis of biodiversity, advancing our ability to understand and respond to ecological dynamics. This article is part of the theme issue 'Towards a toolkit for global insect biodiversity monitoring'.
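To make the "similarity and nearest-neighbour operations" mentioned in the abstract above concrete, here is a deliberately simple CPU-only sketch in plain Python. It is not the PROTAX-GPU implementation and does not use JAX; it assumes pre-aligned sequences of equal length, scores similarity as the fraction of matching bases over positions where both sequences carry A, C, G or T, and returns the k most similar references. The function names and toy sequences are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

VALID_BASES = set("ACGT")


def similarity(query: str, ref: str) -> float:
    """Fraction of matching bases over positions where both are A/C/G/T.

    Assumes the two sequences are already aligned to the same length;
    gaps and ambiguous bases are skipped rather than counted as mismatches.
    """
    if len(query) != len(ref):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for q, r in zip(query.upper(), ref.upper()):
        if q in VALID_BASES and r in VALID_BASES:
            compared += 1
            matches += q == r
    return matches / compared if compared else 0.0


def nearest_references(query: str,
                       references: Dict[str, str],
                       k: int = 2) -> List[Tuple[float, str]]:
    """Return the k (similarity, name) pairs with the highest similarity."""
    scored = ((similarity(query, seq), name) for name, seq in references.items())
    return heapq.nlargest(k, scored)


# Toy aligned sequences, invented for illustration.
references = {
    "sp_a": "ACGTACGT",
    "sp_b": "ACGTACGA",
    "sp_c": "TTTTACG-",
}
top = nearest_references("ACGTACGT", references, k=2)
print(top)
```

The loop over all references is the part that becomes expensive with millions of reference sequences, which is the kind of workload the abstract describes moving to the GPU.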
Exhaustive biodiversity data, covering all the taxa in an environment, would be fundamental to understand how global changes influence organisms living at different trophic levels, and to evaluate impacts on interspecific interactions. Molecular approaches such as DNA metabarcoding are boosting our ability to perform biodiversity inventories. Nevertheless, even though a few studies have recently attempted exhaustive reconstructions of communities, holistic assessments remain rare. The majority of metabarcoding studies published in the last years used just one or two markers and analysed a limited number of taxonomic groups. Here, we provide an overview of emerging approaches that can allow all‐taxa biological inventories. Exhaustive biodiversity assessments can be attempted by combining a large number of specific primers, by exploiting the power of universal primers, or by combining specific and universal primers to obtain good information on key taxa while limiting the overlooked biodiversity. Multiplexes of primers, shotgun sequencing and capture enrichment may provide a better coverage of biodiversity compared to standard metabarcoding, but still require major methodological advances. Here, we identify the strengths and limitations of different approaches, and suggest new development lines that might improve broad scale biodiversity analyses in the near future. More holistic reconstructions of ecological communities can greatly increase the value of metabarcoding studies, improving understanding of the consequences of ongoing environmental changes on the multiple components of biodiversity.
Diatoms are frequently used for water quality assessments; however, identification to species level is difficult, time‐consuming and needs in‐depth knowledge of the organisms under investigation, as nonhomoplastic species‐specific morphological characters are scarce. We here investigate how identification methods based on DNA (metabarcoding using NGS platforms) perform in comparison to morphological diatom identification and propose a workflow to optimize diatom fresh water quality assessments. Diatom diversity at seven different sites along the course of the river system Odra and Lusatian Neisse from the source to the mouth is analysed with DNA and morphological methods, which are compared. The NGS technology almost always leads to a higher number of identified taxa (270 via NGS vs. 103 by light microscopy LM), whose presence could subsequently be verified by LM. The sequence‐based approach allows for a much more graduated insight into the taxonomic diversity of the environmental samples. Taxa retrieval varies considerably throughout the river system, depending on species occurrences and the taxonomic depth of the reference databases. Mostly rare taxa from oligotrophic parts of the river systems are less well represented in the reference database used. A workflow for DNA‐based NGS diatom identification is presented. 28 000 diatom sequences were evaluated. Our findings provide evidence that metabarcoding of diatoms via NGS sequencing of the V4 region (18S) has a great potential for water quality assessments and could complement and maybe even improve the identification via light microscopy.
Most arthropod species are undescribed and hidden in specimen-rich samples that are difficult to sort to species using morphological characters. For such samples, sorting to putative species with DNA barcodes is an attractive alternative, but needs cost-effective techniques that are suitable for use in many laboratories around the world. Barcoding using the portable and inexpensive MinION sequencer produced by Oxford Nanopore Technologies (ONT) could be useful for presorting specimen-rich samples with DNA barcodes because it requires little space and is inexpensive. However, similarly important is user-friendly and reliable software for analysis of the ONT data. It is here provided in the form of ONTbarcoder 2.0 that is suitable for all commonly used operating systems and includes a Graphical User Interface (GUI). Compared with an earlier version, ONTbarcoder 2.0 has three key improvements related to the higher read quality obtained with ONT's latest flow cells (R10.4), chemistry (V14 kits) and basecalling model (super-accuracy model). First, the improved read quality of ONT's latest flow cells (R10.4) allows for the use of primers with shorter indices than those previously needed (9 bp vs. 12-13 bp). This decreases the primer cost and can potentially improve PCR success rates. Second, ONTbarcoder now delivers real-time barcoding to complement ONT's real-time sequencing. This means that the first barcodes are obtained within minutes of starting a sequencing run; i.e. flow cell use can be optimized by terminating sequencing runs when most barcodes have already been obtained. The only input needed by ONTbarcoder 2.0 is a demultiplexing sheet and sequencing data (raw or basecalled) generated by either a Mk1B or a Mk1C. Third, we demonstrate that the availability of R10.4 chemistry for the low-cost Flongle flow cell is an attractive option for users who require only 200-250 barcodes at a time.
The analysis of DNA barcode sequences with varying techniques for cluster recognition provides an efficient approach for recognizing putative species (operational taxonomic units, OTUs). This approach accelerates and improves taxonomic workflows by exposing cryptic species and decreasing the risk of synonymy. This study tested the congruence of OTUs resulting from the application of three analytical methods (ABGD, BIN, GMYC) to sequence data for Australian hypertrophine moths. OTUs supported by all three approaches were viewed as robust, but 20% of the OTUs were only recognized by one or two of the methods. These OTUs were examined for three criteria to clarify their status. Monophyly and diagnostic nucleotides were both uninformative, but information on ranges was useful as sympatric sister OTUs were viewed as distinct, while allopatric OTUs were merged. This approach revealed 124 OTUs of Hypertrophinae, a more than twofold increase from the currently recognized 51 species. Because this analytical protocol is both fast and repeatable, it provides a valuable tool for establishing a basic understanding of species boundaries that can be validated with subsequent studies.
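The congruence test described in the abstract above (OTUs recovered by all three methods are treated as robust; the rest need further criteria) can be sketched as a set comparison. This assumes each method's output is available as a set of OTUs, with each OTU represented as a frozen set of specimen identifiers; the function name `split_by_support`, that representation, and the toy specimen IDs are hypothetical, and the sketch does not run ABGD, BIN or GMYC themselves.

```python
from typing import Dict, FrozenSet, Set, Tuple

OTU = FrozenSet[str]      # one OTU = the set of specimen IDs it contains
Partition = Set[OTU]      # one method's output = a set of OTUs


def split_by_support(partitions: Dict[str, Partition]) -> Tuple[Set[OTU], Set[OTU]]:
    """Split OTUs into those found by every method and those found by only some."""
    all_otus: Set[OTU] = set().union(*partitions.values())
    # An OTU is robust only if the identical specimen grouping appears in every method.
    robust = {otu for otu in all_otus
              if all(otu in partition for partition in partitions.values())}
    return robust, all_otus - robust


# Toy partitions, invented for illustration.
partitions = {
    "ABGD": {frozenset({"s1", "s2"}), frozenset({"s3", "s4"})},
    "BIN":  {frozenset({"s1", "s2"}), frozenset({"s3"}), frozenset({"s4"})},
    "GMYC": {frozenset({"s1", "s2"}), frozenset({"s3", "s4"})},
}
robust, needs_review = split_by_support(partitions)
print(len(robust), len(needs_review))
```

In the toy data, the grouping of `s1` and `s2` is supported by all three methods, while the conflicting treatments of `s3` and `s4` land in `needs_review`, where criteria such as sympatry versus allopatry would then be applied.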
DNA barcodes are a useful tool for discovering, understanding, and monitoring biodiversity, which are critical tasks at a time of rapid biodiversity loss. However, widespread adoption of barcodes requires cost-effective and simple barcoding methods. We here present a workflow that satisfies these conditions. It was developed via "innovation through subtraction" and thus requires minimal lab equipment, can be learned within days, reduces the barcode sequencing cost to < 10 cents, and allows fast turnaround from specimen to sequence by using the portable MinION sequencer.
We describe how tagged amplicons can be obtained and sequenced with the real-time MinION sequencer in many settings (field stations, biodiversity labs, citizen science labs, schools). We also provide amplicon coverage recommendations that are based on several runs of the latest generation of MinION flow cells ("R10.3"), which suggest that each run can generate barcodes for > 10,000 specimens. Next, we present novel software, ONTbarcoder, which overcomes the bioinformatics challenges posed by MinION reads. The software is compatible with Windows 10, Macintosh, and Linux, has a graphical user interface (GUI), and can generate thousands of barcodes on a standard laptop within hours based on only two input files (FASTQ, demultiplexing file). We document that MinION barcodes are virtually identical to Sanger and Illumina barcodes for the same specimens (> 99.99%) and provide evidence that MinION flow cells and reads have improved rapidly since 2018.
We propose that barcoding with MinION is the way forward for government agencies, universities, museums, and schools because it combines low consumable and capital cost with scalability. Small projects can use the flow cell dongle ("Flongle") while large projects can rely on MinION flow cells that can be stopped and re-used after collecting sufficient data for a given project.
A two-marker combination of plastid rbcL and matK has previously been recommended as the core plant barcode, to be supplemented with additional markers such as plastid trnH–psbA and nuclear ribosomal internal transcribed spacer (ITS). To assess the effectiveness and universality of these barcode markers in seed plants, we sampled 6,286 individuals representing 1,757 species in 141 genera of 75 families (42 orders) by using four different methods of data analysis. These analyses indicate that (i) the three plastid markers showed high levels of universality (87.1–92.7%), whereas ITS performed relatively well (79%) in angiosperms but not so well in gymnosperms; (ii) in taxonomic groups for which direct sequencing of the marker is possible, ITS showed the highest discriminatory power of the four markers, and a combination of ITS and any plastid DNA marker was able to discriminate 69.9–79.1% of species, compared with only 49.7% with rbcL + matK; and (iii) where multiple individuals of a single species were tested, ascriptions based on ITS and plastid DNA barcodes were incongruent in some samples for 45.2% of the sampled genera (for genera with more than one species sampled). This finding highlights the importance of both sampling multiple individuals and using markers with different modes of inheritance. In cases where it is difficult to amplify and directly sequence ITS in its entirety, just using ITS2 is a useful backup because it is easier to amplify and sequence this subset of the marker. We therefore propose that ITS/ITS2 should be incorporated into the core barcode for seed plants.