The retrieval of plant-related information is a challenging task due to variations in species name mentions as well as spelling or typographical errors across data sources. Scalable solutions are needed for identifying plant name mentions from text and resolving them to accepted taxonomic names.
An Apache Solr-based fuzzy matching system enhanced with the Smith-Waterman alignment algorithm ("Solr-Plant") was developed for mapping and resolution to a plant name and synonym thesaurus. Evaluation of Solr-Plant suggests promising results in terms of both accuracy and processing efficiency on misspelled species names from two benchmark datasets: (1) SALVIAS and (2) National Center for Biotechnology Information (NCBI) Taxonomy. Additional evaluation using the S800 text corpus also reflects high precision and recall. The latest version of the source code is available at https://github.com/bcbi/SolrPlantAPI. A REST-compliant web interface and service for Solr-Plant is hosted at http://bcbi.brown.edu/solrplant.
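The role of Smith-Waterman alignment in fuzzy name matching can be illustrated with a minimal sketch. This is not the Solr-Plant implementation: the scoring parameters (`match`, `mismatch`, `gap`), the helper `best_match`, and the three-name toy thesaurus are assumptions chosen only for illustration, and the sketch ranks candidates directly rather than re-ranking results returned by Solr.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not Solr-Plant settings.
    """
    a, b = a.lower(), b.lower()
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_match(query: str, candidates: list[str]) -> str:
    """Pick the candidate name with the highest alignment score (hypothetical helper)."""
    return max(candidates, key=lambda name: smith_waterman_score(query, name))


# Toy thesaurus for illustration only.
NAMES = ["Quercus alba", "Quercus rubra", "Acer rubrum"]
print(best_match("Quercus albba", NAMES))
```

In a system of the kind described, a search index would typically narrow the thesaurus to a short candidate list first, so that the quadratic-time alignment is only computed for a handful of names.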
Automated techniques are needed for efficient and accurate identification of knowledge linked with biological scientific names. Solr-Plant complements the current state of the art in terms of both efficiency and accuracy in identifying names restricted to the species level. The approach can be extended to identify broader groups of organisms at different taxonomic levels. The results reflect the potential utility of Solr-Plant as a data mining tool for extracting and correcting plant species names.
Relationships between bacterial taxa are traditionally defined using 16S rRNA nucleotide similarity or average nucleotide identity. Improvements in sequencing technology provide additional pairwise information on genome sequences, which may provide valuable information on genomic relationships. Mapping orthologous gene locations between genome pairs, known as synteny, is typically implemented in the discovery of new species and has not been systematically applied to bacterial genomes. Using a data set of 378 bacterial genomes, we developed and tested a new measure of synteny similarity between a pair of genomes, which was scaled onto 16S rRNA distance using covariance matrices. Based on the input gene functions used (i.e., core, antibiotic resistance, and virulence), we observed varying topological arrangements of bacterial relationship networks by applying (i) complete linkage hierarchical clustering and (ii) K-nearest neighbor graph structures to synteny-scaled 16S data. Our metric improved clustering quality compared with state-of-the-art average nucleotide identity metrics while preserving clustering assignments for the highest similarity relationships. Our findings indicate that syntenic relationships provide more granular and interpretable relationships for within-genera taxa compared to pairwise similarity measures, particularly in functional contexts. IMPORTANCE: Given the prevalence and necessity of the 16S rRNA measure in bacterial identification and analysis, this additional analysis adds a functional and synteny-based layer to the identification of relatives and clustering of bacterial genomes. It is also of computational interest to model the bacterial genome as a graph structure, which presents new avenues of genomic analysis for bacteria and their closely related strains and species.
Over the past two decades, there has been a long-standing debate about the impact of taxon sampling on phylogenetic inference. Studies have been based on both real and simulated data sets, within actual and theoretical contexts, and using different inference methods, to study the impact of taxon sampling. In some cases, conflicting conclusions have been drawn for the same data set. The main questions explored in studies to date have been about the effects of using sparse data, adding new taxa, including more characters from genome sequences and using different (or concatenated) locus regions. These questions can be reduced to more fundamental ones about the assessment of data quality and the design guidelines of taxon sampling in phylogenetic inference experiments. This review summarizes progress to date in understanding the impact of taxon sampling on the accuracy of phylogenetic analysis.
Vaccines are effective in preventing Coronavirus Disease 2019 (COVID-19). Vaccine hesitancy, defined as delay in acceptance or refusal of the vaccine, is a major barrier to effective implementation.
Participants were recruited statewide through an English and Spanish social media marketing campaign conducted by a local news station during a one-month period as vaccines were becoming available in Rhode Island (from December 21, 2020 to January 22, 2021). Participants completed an online survey about COVID-19 vaccines and vaccine hesitancy with constructs and items adopted from the Health Belief Model.
A total of 2,007 individuals completed the survey. Eight percent (n = 161) reported vaccine hesitancy. The sample had a median age of 58 years (interquartile range [IQR]: 45, 67) and was majority female (78%), White (96%), Non-Hispanic (94%), and employed (58%); 59% reported an annual individual income of $50,000. COVID-19 vaccine hesitancy was associated with attitudes and behaviors related to COVID-19. A one-unit increase in concern about COVID-19 was associated with a 69% decrease in the odds of vaccine hesitancy (adjusted odds ratio [AOR]: 0.31, 95% CI: 0.26-0.37). A one-level increase in the likelihood of getting the influenza vaccine was associated with a 55% decrease in the odds of vaccine hesitancy (AOR: 0.45, 95% CI: 0.41-0.50).
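The percentages quoted alongside the adjusted odds ratios follow from the usual conversion, percent change in odds = (OR - 1) x 100. The short sketch below reproduces that arithmetic for the two reported AORs; the function name is hypothetical, and the result is a change in odds, not in probability.

```python
def percent_change_in_odds(odds_ratio: float) -> int:
    """Convert an odds ratio to a rounded percent change in odds.

    Negative values are decreases; positive values are increases.
    """
    return round((odds_ratio - 1.0) * 100.0)


# Adjusted odds ratios reported in the abstract.
print(percent_change_in_odds(0.31))  # concern about COVID-19
print(percent_change_in_odds(0.45))  # likelihood of influenza vaccination
```

The same conversion can be applied to each confidence bound to express the interval as a range of percent decreases.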
COVID-19 vaccine hesitancy was relatively low in a state-wide survey in Rhode Island. Future research is needed to better understand and tailor messaging related to vaccine hesitancy.
With the volume of molecular sequence data that is systematically being generated globally, there is a need for centralized resources for data exploration and analytics. DNA Barcode initiatives are on track to generate a compendium of molecular sequence-based signatures for identifying animals and plants. To date, tools for exploring and analyzing these data have only been available in a boutique form, often representing a frustrating hurdle for the many researchers who may not have the resources to install or implement algorithms described by the analytic community. The Barcode of Life Data Portal (BDP) is a first step towards integrating the latest biodiversity informatics innovations with molecular sequence data from DNA barcoding. Through establishment of community-driven standards, based on discussion with the Data Analysis Working Group (DAWG) of the Consortium for the Barcode of Life (CBOL), the BDP provides an infrastructure for incorporation of existing and next-generation DNA barcode analytic applications in an open forum.
The growing amount and availability of electronic health record (EHR) data present enhanced opportunities for discovering new knowledge about diseases. In the past decade, there has been an increasing number of data and text mining studies focused on the identification of disease associations (e.g., disease-disease, disease-drug, and disease-gene) in structured and unstructured EHR data. This chapter presents a knowledge discovery framework for mining the EHR for disease knowledge and describes each step for data selection, preprocessing, transformation, data mining, and interpretation/validation. Topics including natural language processing, standards, and data privacy and security are also discussed in the context of this framework.
Computational algorithms are often used to assess pathogenicity of Variants of Uncertain Significance (VUS) that are found in disease-associated genes. Most computational methods include analysis of protein multiple sequence alignments (PMSA), assessing interspecies variation. Careful validation of PMSA-based methods has been done for relatively few genes, partially because creation of curated PMSAs is labor-intensive. We assessed how PMSA-based computational tools predict the effects of the missense changes in the APC gene, in which pathogenic variants cause Familial Adenomatous Polyposis. Most Pathogenic or Likely Pathogenic (P/LP) APC variants are protein-truncating changes. However, public databases now contain thousands of variants reported as missense. We created a curated APC PMSA that contained >3 substitutions/site, which is large enough for statistically robust in silico analysis. The creation of the PMSA was not easily automated, requiring significant querying and computational analysis of protein and genome sequences. Of 1924 missense APC variants in the NCBI ClinVar database, 1800 (93.5%) are reported as VUS. All but two missense variants listed as P/LP occur at canonical splice or Exonic Splice Enhancer sites. Pathogenicity predictions by five computational tools (Align-GVGD, SIFT, PolyPhen2, MAPP, REVEL) differed widely in their predictions of Pathogenic/Likely Pathogenic (range 17.5-75.0%) and Benign/Likely Benign (range 25.0-82.5%) for APC missense variants in ClinVar. When applied to 21 missense variants reported in ClinVar and securely classified as Benign, the five methods ranged in accuracy from 76.2-100%. Computational PMSA-based methods can be an excellent classifier for variants of some hereditary cancer genes. However, there may be characteristics of the APC gene and protein that confound the results of in silico algorithms. A systematic study of these features could greatly improve the automation of alignment-based techniques and the use of predictive algorithms in hereditary cancer genes.
A variety of bioactive proteins from medicinal leeches, like species of Hirudo, have been characterized and evaluated for their potential therapeutic biomedical properties. However, there has not previously been a comprehensive attempt to fully characterize the salivary transcriptome of a medicinal leech that would allow a clearer understanding of the suite of polypeptides employed by these sanguivorous annelids and provide insights regarding their evolutionary origins. An Expressed Sequence Tag (EST) library-based analysis of the salivary transcriptome of the North American medicinal leech, Macrobdella decora, reveals a complex cocktail of anticoagulants and other bioactive secreted proteins not previously known to exist in a single leech. Transcripts were identified that correspond to each of saratin, bdellin, destabilase, hirudin, decorsin, endoglucoronidase, antistatin, and eglin, as well as to other previously uncharacterized predicted serine protease inhibitors, lectoxin-like c-type lectins, ficolin, disintegrins and histidine-rich proteins. This work provides a lens into the richness of bioactive polypeptides that are associated with sanguivory. In the context of a well-characterized molecular phylogeny of leeches, the results allow for preliminary evaluation of the relative evolutionary origins and historical conservation of leech salivary components. The goal of identifying evolutionarily significant residues associated with biomedically significant phenomena implies continued insights from a broader sampling of blood-feeding leech salivary transcriptomes.
To identify potential opportunities for drug repurposing by developing an automated approach to pre-screen the predicted proteomes of any organism against databases of known drug targets using only freely available resources.
We employed a combination of Ruby scripts that leverage data from the DrugBank and ChEMBL databases, MySQL, and BLAST to predict potential drugs and their targets from 13 published genomes. Results from a previous cell-based screen to identify inhibitors of Cryptosporidium parvum growth were used to validate our in-silico prediction method.
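The general shape of such a pre-screen can be sketched as a join between BLAST hits and a drug-target table. The sketch below is written in Python rather than Ruby, uses an in-memory dictionary in place of MySQL, and is not the authors' pipeline: the function name, the E-value cutoff, and all identifiers in the usage example are hypothetical. It assumes BLAST results in the standard 12-column tabular format (`-outfmt 6`), with parasite proteins as queries and known drug targets as subjects.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def predict_drugs(blast_tsv: str, target_to_drugs: dict[str, list[str]],
                  max_evalue: float = 1e-10) -> dict[str, set[str]]:
    """Map each drug to the query proteins that resemble one of its known targets.

    max_evalue is an illustrative cutoff, not a value taken from the study.
    """
    predictions: dict[str, set[str]] = defaultdict(set)
    reader = csv.DictReader(io.StringIO(blast_tsv),
                            fieldnames=BLAST_COLUMNS, delimiter="\t")
    for hit in reader:
        if float(hit["evalue"]) > max_evalue:
            continue  # discard weak similarity hits
        for drug in target_to_drugs.get(hit["sseqid"], ()):
            predictions[drug].add(hit["qseqid"])
    return dict(predictions)


# Hypothetical example: one strong hit and one weak hit.
example_hits = (
    "protA\tT1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "protB\tT2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20\n"
)
example_targets = {"T1": ["drugX"], "T2": ["drugY"]}
print(predict_drugs(example_hits, example_targets))
```

Keeping the drug-target mapping separate from the similarity search means the same hits can be re-screened against a different target database without rerunning BLAST.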
In-vitro validation of these results, using a cell-based C parvum growth assay, showed that the predicted inhibitors were significantly more likely than expected by chance to have confirmed activity, with 8.9-15.6% of predicted inhibitors confirmed depending on the drug target database used. This method was then used to predict inhibitors for the following 13 disease-causing protozoan parasites: C parvum, Entamoeba histolytica, Giardia intestinalis, Leishmania braziliensis, Leishmania donovani, Leishmania major, Naegleria gruberi (as a proxy for Naegleria fowleri), Plasmodium falciparum, Plasmodium vivax, Toxoplasma gondii, Trichomonas vaginalis, Trypanosoma brucei and Trypanosoma cruzi.
Although proteome-wide screens for drug targets have disadvantages, in-silico methods can be developed that are fast, broad, inexpensive, and effective. In-vitro validation of our results for C parvum indicates that the method presented here can be used to construct a library for more directed small molecule screening, or pipelined into structural modeling and docking programs to facilitate target-based drug development.