Species occurrence records from online databases are an indispensable resource in ecological, biogeographical and palaeontological research. However, issues with data quality, especially incorrect geo‐referencing or dating, can diminish their usefulness. Manual cleaning is time‐consuming, error prone, difficult to reproduce and limited to known geographical areas and taxonomic groups, making it impractical for datasets with thousands or millions of records.
Here, we present CoordinateCleaner, an R‐package to scan datasets of species occurrence records for geo‐referencing and dating imprecisions and data entry errors in a standardized and reproducible way. CoordinateCleaner is tailored to problems common in biological and palaeontological databases and can handle datasets with millions of records. The software includes (a) functions to flag potentially problematic coordinate records based on geographical gazetteers, (b) a global database of 9,691 geo‐referenced biodiversity institutions to identify records that are likely from horticulture or captivity, (c) novel algorithms to identify datasets with rasterized data, conversion errors and strong decimal rounding and (d) spatio‐temporal tests for fossils.
We describe the individual functions available in CoordinateCleaner and demonstrate them on more than 90 million occurrences of flowering plants from the Global Biodiversity Information Facility (GBIF) and 19,000 fossil occurrences from the Palaeobiology Database (PBDB). We find that in GBIF more than 3.4 million records (3.7%) are potentially problematic and that 179 of the tested contributing datasets (18.5%) might be biased by rasterized coordinates. In PBDB, 1,205 records (6.3%) are potentially problematic.
All cleaning functions and the biodiversity institution database are open‐source and available within the CoordinateCleaner R‐package.
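The kind of record-level flagging described in this abstract can be illustrated with a minimal Python sketch. It is not the CoordinateCleaner API (the package itself is written in R): the function names, the flag labels, the 1 km radius and the example institution coordinates are all assumptions made for illustration. The sketch flags coordinates that are out of range, exactly zero, have identical latitude and longitude, or fall close to a listed biodiversity institution.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used for great-circle distances


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flag_record(lat, lon, institutions, radius_km=1.0):
    """Return a list of hypothetical flag labels for one occurrence record.

    institutions: iterable of (lat, lon) tuples; radius_km is an assumed
    default, not a value taken from the package.
    """
    # Coordinates outside the valid range cannot be tested any further.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return ["invalid"]
    flags = []
    if lat == 0 and lon == 0:
        flags.append("zero")  # typical placeholder for missing coordinates
    if lat == lon:
        flags.append("equal")  # identical values often indicate an entry error
    if any(haversine_km(lat, lon, ilat, ilon) <= radius_km for ilat, ilon in institutions):
        flags.append("institution")  # possibly cultivated or captive specimen
    return flags


# Illustrative institution list (approximate coordinates, assumed for the example).
institutions = [(51.4787, -0.2956)]
records = [(51.4790, -0.2950), (0.0, 0.0), (95.0, 10.0), (-23.5, -46.6)]
flagged = {rec: flag_record(*rec, institutions) for rec in records}
```

Returning labels instead of dropping records mirrors the abstract's wording that records are flagged as "potentially problematic": the decision to remove them is left to the user.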
The reliable mapping of species richness is a crucial step for the identification of areas of high conservation priority, alongside other value and threat considerations. This is commonly done by overlapping range maps of individual species, which requires dense availability of occurrence data or relies on assumptions about the presence of species in unsampled areas deemed suitable by environmental niche models. Here, we present a deep learning approach that directly estimates species richness, skipping the step of estimating individual species ranges. We train a neural network model based on species lists from inventory plots, which provide ground truth data for supervised machine learning. The model learns to predict species richness based on spatially associated variables, including climatic and geographic predictors, as well as counts of available species records from online databases. We assess the empirical utility of our approach by producing independently verifiable maps of alpha, beta, and gamma plant diversity at high spatial resolutions for Australia, a continent with highly heterogeneous diversity patterns. Our deep learning framework provides a powerful and flexible new approach for estimating biodiversity patterns, constituting a step forward toward automated biodiversity assessments.
To understand the current biodiversity crisis, it is crucial to determine how humans have affected biodiversity in the past. However, the extent of human involvement in species extinctions from the Late Pleistocene onward remains contentious. Here, we apply Bayesian models to the fossil record to estimate how mammalian extinction rates have changed over the past 126,000 years, inferring specific times of rate increases. We specifically test the hypothesis of human-caused extinctions by using posterior predictive methods. We find that human population size is able to predict past extinctions with 96% accuracy. Predictors based on past climate, in contrast, perform no better than expected by chance, suggesting that climate had a negligible impact on global mammal extinctions. Based on current trends, we predict for the near future a rate escalation of unprecedented magnitude. Our results provide a comprehensive assessment of the human impact on past and predicted future extinctions of mammals.
Abstract
Some of the most extensive terrestrial biomes today consist of open vegetation, including temperate grasslands and tropical savannas. These biomes originated relatively recently in Earth’s history, likely replacing forested habitats in the second half of the Cenozoic. However, the timing of their origination and expansion remains disputed. Here, we present a Bayesian deep learning model that utilizes information from fossil evidence, geologic models, and paleoclimatic proxies to reconstruct paleovegetation, placing the emergence of open habitats in North America at around 23 million years ago. By the time of the onset of the Quaternary glacial cycles, open habitats were covering more than 30% of North America and were expanding at peak rates, to eventually become the most prominent natural vegetation type today. Our entirely data-driven approach demonstrates how deep learning can harness unexplored signals from complex data sets to provide insights into the evolution of Earth’s biomes in time and space.
Birds are among the best-studied animal groups, but their prehistoric diversity is poorly known due to low fossilization potential. Hence, while many human-driven bird extinctions (i.e., extinctions caused directly by human activities such as hunting, as well as indirectly through human-associated impacts such as land use change, fire, and the introduction of invasive species) have been recorded, the true number is likely much larger. Here, by combining recorded extinctions with model estimates based on the completeness of the fossil record, we suggest that at least ~1300-1500 bird species (~12% of the total) have gone extinct since the Late Pleistocene, with 55% of these extinctions undiscovered (not yet discovered or left no trace). We estimate that the Pacific accounts for 61% of total bird extinctions. Bird extinction rate varied through time with an intense episode ~1300 CE, which likely represents the largest human-driven vertebrate extinction wave ever, and a rate 80 (60-95) times the background extinction rate. Thus, humans have already driven more than one in nine bird species to extinction, with likely severe, and potentially irreversible, ecological and evolutionary consequences.
Trees are fundamental for Earth's biodiversity as primary producers and ecosystem engineers and are responsible for many of nature's contributions to people. Yet, many tree species at present are threatened with extinction by human activities. Accurate identification of threatened tree species is necessary to quantify the current biodiversity crisis and to prioritize conservation efforts. However, the most comprehensive dataset of tree species extinction risk, the Red List of the International Union for the Conservation of Nature (IUCN RL), lacks assessments for a substantial number of known tree species. The RL is based on a time-consuming expert-based assessment process, which hampers the inclusion of less-known species and the continued updating of extinction risk assessments. In this study, we used a computational pipeline to approximate RL extinction risk assessments for more than 21,000 tree species (leading to an overall assessment of 89% of all known tree species) using a supervised learning approach trained on available IUCN RL assessments. We harvested the occurrence data for tree species worldwide from online databases, which we used with other publicly available data to design features characterizing the species' geographic range, biome and climatic affinities, and exposure to human footprint. We trained deep neural network models to predict their conservation status, based on these features. We estimated 43% of the assessed tree species to be threatened with extinction and found taxonomic and geographic heterogeneities in the distribution of threatened species. The results are consistent with the recent estimates by the Global Tree Assessment initiative, indicating that our approach provides robust and time-efficient approximations of species' IUCN RL extinction risk assessments.
Evolutionary biology has entered an era of unprecedented amounts of DNA sequence data, as new sequencing technologies such as Massive Parallel Sequencing (MPS) can generate billions of nucleotides within less than a day. The current bottleneck is how to efficiently handle, process, and analyze such large amounts of data in an automated and reproducible way. To tackle these challenges we introduce the Sequence Capture Processor (SECAPR) pipeline for processing raw sequencing data into multiple sequence alignments for downstream phylogenetic and phylogeographic analyses. SECAPR is user-friendly and we provide an exhaustive empirical data tutorial intended for users with no prior experience with analyzing MPS output. SECAPR is particularly useful for the processing of sequence capture (synonyms: target or hybrid enrichment) datasets for non-model organisms, as we demonstrate using an empirical sequence capture dataset of the palm genus (Arecaceae). Various quality control and plotting functions help the user to decide on the most suitable settings for even challenging datasets. SECAPR is an easy-to-use, free, and versatile pipeline, aimed to enable efficient and reproducible processing of MPS data for many samples in parallel.
High-throughput DNA sequencing techniques enable time- and cost-effective sequencing of large portions of the genome. Instead of sequencing and annotating whole genomes, many phylogenetic studies focus sequencing effort on large sets of pre-selected loci, which further reduces costs and bioinformatic challenges while increasing coverage. One common approach that enriches loci before sequencing is often referred to as target sequence capture. This technique has been shown to be applicable to phylogenetic studies of greatly varying evolutionary depth. Moreover, it has proven to produce powerful, large multi-locus DNA sequence datasets suitable for phylogenetic analyses. However, target capture requires careful considerations, which may greatly affect the success of experiments. Here we provide a simple flowchart for designing phylogenomic target capture experiments. We discuss necessary decisions from the identification of target loci to the final bioinformatic processing of sequence data. We outline challenges and solutions related to the taxonomic scope, sample quality, and available genomic resources of target capture projects. We hope this review will serve as a useful roadmap for designing and carrying out successful phylogenetic target capture studies.
The unparalleled biodiversity found in the American tropics (the Neotropics) has attracted the attention of naturalists for centuries. Despite major advances in recent years in our understanding of the origin and diversification of many Neotropical taxa and biotic regions, many questions remain to be answered. Additional biological and geological data are still needed, as well as methodological advances that are capable of bridging these research fields. In this review, aimed primarily at advanced students and early-career scientists, we introduce the concept of "trans-disciplinary biogeography," which refers to the integration of data from multiple areas of research in biology (e.g., community ecology, phylogeography, systematics, historical biogeography) and Earth and the physical sciences (e.g., geology, climatology, palaeontology), as a means to reconstruct the giant puzzle of Neotropical biodiversity and evolution in space and time. We caution against extrapolating results derived from the study of one or a few taxa to convey general scenarios of Neotropical evolution and landscape formation. We urge more coordination and integration of data and ideas among disciplines, transcending their traditional boundaries, as a basis for advancing tomorrow's ground-breaking research. Our review highlights the great opportunities for studying the Neotropical biota to understand the evolution of life.
Aim
The Red List (RL) from the International Union for the Conservation of Nature is the most comprehensive global quantification of extinction risk, and widely used in applied conservation as well as in biogeographic and ecological research. Yet, due to the time‐consuming assessment process, the RL is biased taxonomically and geographically, which limits its application on large scales, in particular for underdocumented areas such as the tropics, or understudied taxa, such as most plants and invertebrates. Here, we present IUCNN, an R‐package implementing deep learning models to predict species RL status from publicly available geographic occurrence records (and other data if available).
Innovation
We implement a user‐friendly workflow to train and validate neural network models, and use them to predict species RL status. IUCNN contains specific functions for extinction risk prediction in the RL framework, including a regression‐based approach to account for the ordinal nature of RL categories, a Bayesian approach for improved uncertainty quantification and a convolutional neural network to predict species RL status based on their raw geographic occurrences. Most analyses run with few lines of code, not requiring users to have prior experience with neural network models. We demonstrate the use of IUCNN on an empirical dataset of ~14,000 orchid species, for which IUCNN models can predict extinction risk within minutes, while outperforming comparable methods based on species occurrence information.
Main conclusions
IUCNN harnesses innovative methodology to estimate the RL status of large numbers of species. By providing estimates of the number and identity of threatened species in custom geographic or taxonomic datasets, IUCNN enables large‐scale automated assessments of the extinction risk of species so far not well represented on the official RL.
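The regression-based treatment of ordinal RL categories mentioned under Innovation can be illustrated with a short Python sketch. It is not IUCNN code (the package is written in R) and does not reproduce its interface. The five-category ordering, the nearest-category decoding rule and all function names are assumptions chosen for illustration. The idea shown is that categories are encoded as ordered integers, a model predicts a continuous score, and the score is mapped back to the nearest category, so that predicting EN for a CR species counts as a smaller error than predicting LC.

```python
import math

# Assumed ordering of Red List categories from lowest to highest risk.
RL_CATEGORIES = ["LC", "NT", "VU", "EN", "CR"]
THREATENED = {"VU", "EN", "CR"}


def encode(category):
    """Map a category label to its ordinal position (LC=0 ... CR=4)."""
    return RL_CATEGORIES.index(category)


def decode(score):
    """Map a continuous regression output to the nearest category.

    Scores are rounded half up and clipped to the valid index range.
    """
    idx = int(math.floor(score + 0.5))
    idx = max(0, min(len(RL_CATEGORIES) - 1, idx))
    return RL_CATEGORIES[idx]


def ordinal_error(true_category, score):
    """Absolute distance, in category steps, between truth and prediction."""
    return abs(encode(true_category) - encode(decode(score)))


def is_threatened(category):
    """Binary view of the same prediction: threatened versus not threatened."""
    return category in THREATENED


# Hypothetical regression outputs for three species.
scores = {"species_a": 0.2, "species_b": 2.4, "species_c": 7.1}
predicted = {name: decode(score) for name, score in scores.items()}
```

A plain classifier would treat all misclassifications alike. Encoding the categories on one numeric axis is one simple way to let the training loss reflect their order.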