Motivation: The scientific community has spent considerable effort on the correct identification of gene and protein names in text, but less on the correct identification of chemical names. Dictionary-based term identification has the power to recognize the diverse representation of chemical information in the literature and map the chemicals to their database identifiers. Results: We developed a dictionary for the identification of small molecules and drugs in text, combining information from UMLS, MeSH, ChEBI, DrugBank, KEGG, HMDB and ChemIDplus. Rule-based term filtering, manual check of highly frequent terms and disambiguation rules were applied. We tested the combined dictionary and the dictionaries derived from the individual resources on an annotated corpus, and conclude the following: (i) each of the different processing steps increases precision with a minor loss of recall; (ii) the overall performance of the combined dictionary is acceptable (precision 0.67, recall 0.40 (0.80 for trivial names)); (iii) the combined dictionary performed better than the dictionary in the chemical recognizer OSCAR3; (iv) the performance of a dictionary based on ChemIDplus alone is comparable to the performance of the combined dictionary. Availability: The combined dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web site http://www.biosemantics.org/chemlist. Contact: k.hettne@erasmusmc.nl Supplementary information: Supplementary data are available at Bioinformatics online.
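As a rough illustration of dictionary-based term identification and of how precision and recall are scored against an annotated corpus, the following minimal Python sketch may help. It is not the authors' implementation: the toy dictionary, the `CHEM:` identifiers, and the function names `find_terms` and `precision_recall` are all hypothetical, and the filtering and disambiguation rules mentioned in the abstract are not modelled.

```python
import re

# Hypothetical toy dictionary (assumption): lowercase term -> made-up identifier.
TOY_DICTIONARY = {
    "aspirin": "CHEM:0001",
    "acetylsalicylic acid": "CHEM:0001",
    "caffeine": "CHEM:0002",
}


def find_terms(text, dictionary):
    """Return (start, end, identifier) for each dictionary term found in text."""
    # Put longer terms first so multi-word names win over shorter alternatives.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def precision_recall(predicted, gold):
    """Precision and recall of predicted annotations against gold annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Calling `find_terms("Aspirin and caffeine were given.", TOY_DICTIONARY)` yields character offsets plus identifiers, which can then be compared with gold-standard annotations through `precision_recall`.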
Objective
To describe the use of artificial intelligence (AI) in medical literature and trial data extraction, and its applications in uro‐oncology. This bridging review, which consolidates information from the diverse applications of AI, highlights how AI users can investigate more sophisticated queries than with traditional methods, leading to synthesis of raw data and complex outputs into more actionable and personalised results, particularly in the field of uro‐oncology.
Methods
Literature and clinical trial searches were performed in PubMed, Dimensions, Embase and Google (1999–2020). The searches focussed on the use of AI and its various forms to facilitate literature searches, clinical guidelines development, and clinical trial data extraction in uro‐oncology. To illustrate how AI can be applied to address questions about optimising therapeutic decision making and individualising treatment regimens, the Dimensions‐linked information platform was searched for ‘prostate cancer’ keywords (76 publications were identified; 48 were included).
Results
AI offers the promise of transforming raw data and complex outputs into actionable insights. Literature and clinical trial searches can be automated, enabling clinicians to develop and analyse publications expeditiously on complex issues such as therapeutic sequencing and to obtain updates on documents that evolve at the pace and scope of the landscape. An AI‐based platform inclusive of 12 trial databases and >100 scientific literature sources enabled the creation of an interactive visualisation.
Conclusion
As the literature and clinical trial landscape continues to grow in complexity and with increasing speed, the ability to pull the right information at the right time from different search engines and resources, while excluding social media bias, becomes more challenging. This review demonstrates that by applying natural language processing and machine learning algorithms, validated and optimised AI leads to a speedier, more personalised, efficient, and focussed search compared with traditional methods.
Academic search concerns the retrieval and profiling of information objects in the domain of academic research. In this paper we report important observations about academic search queries, and provide an algorithmic solution to address a type of failure during search sessions: null queries. We start by providing a general characterization of academic search queries, by analyzing a large-scale transaction log of a leading academic search engine. Unlike previous small-scale analyses of academic search queries, we find important differences with query characteristics known from web search. For example, in academic search there is a substantially bigger proportion of entity queries, and a heavier tail in query length distribution. We then focus on search failures and, in particular, on null queries that lead to an empty search engine result page, on null sessions that contain such null queries, and on users who are prone to issue null queries. In academic search approximately 1 in 10 queries is a null query, and 25% of the sessions contain a null query. They appear in different types of search sessions, and prevent users from achieving their search goal. To address the high rate of null queries in academic search, we consider the task of providing query suggestions. Specifically we focus on a highly frequent query type: non-boolean informational queries. To this end we need to overcome query sparsity and make effective use of session information.
We find that using entities helps to surface more relevant query suggestions in the face of query sparsity. We also find that query suggestions should be conditioned on the type of session in which they are offered to be more effective. After casting the session classification problem as a multi-label classification problem, we generate session-conditional query suggestions based on predicted session type. We find that this session-conditional method leads to significant improvements over a generic query suggestion method. Personalization yields very little further improvement over session-conditional query suggestions.
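The two failure statistics quoted above (the share of null queries and the share of sessions containing one) can be computed from a transaction log along the following lines. This is a minimal sketch under assumptions: the log layout `(session_id, query, result_count)` and the function name `null_rates` are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def null_rates(log):
    """Return (null-query rate, null-session rate) for a query log.

    `log` is an iterable of (session_id, query, result_count) tuples; this
    layout is an assumption made for illustration only.
    """
    total_queries = 0
    null_queries = 0
    session_has_null = defaultdict(bool)
    for session_id, _query, result_count in log:
        total_queries += 1
        is_null = result_count == 0  # empty result page
        null_queries += is_null
        session_has_null[session_id] |= is_null
    if total_queries == 0:
        return 0.0, 0.0
    null_session_rate = sum(session_has_null.values()) / len(session_has_null)
    return null_queries / total_queries, null_session_rate
```

A session counts as a null session as soon as any one of its queries returns no results, which is why the session-level rate can be much higher than the query-level rate.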
Biomarkers, as measurements of defined biological characteristics, can play a pivotal role in estimations of disease risk, early detection, differential diagnosis, assessment of disease progression and outcomes prediction. Studies of cancer biomarkers are published daily; some are well characterized, while others are of growing interest. Managing this flow of information is challenging for scientists and clinicians. We sought to develop a novel text-mining method employing biomarker co-occurrence processing applied to a deeply indexed full-text database to generate time-interval–delimited biomarker co-occurrence networks. Biomarkers across 6 cancer sites and a cancer-agnostic network were successfully characterized in terms of their emergence in the published literature and the context in which they are described. Our approach, which enables us to find publications based on biomarker relationships, identified biomarker relationships not known to existing interaction networks. This search method finds relevant literature that could be missed with keyword searches, even if full text is available. It enables users to extract relevant biological information and may provide new biological insights that could not be achieved by individual review of papers.
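To make the idea of time-interval-delimited co-occurrence networks concrete, here is a minimal Python sketch. It assumes each document has already been reduced to a publication year and a set of biomarker names; that input format, the 5-year default interval, and the function name `cooccurrence_by_interval` are illustrative assumptions rather than the authors' method.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_by_interval(documents, interval_years=5):
    """Count biomarker pairs that co-occur in a document, per time interval.

    `documents` is an iterable of (year, biomarker_names) pairs (assumed
    format). Returns {interval_start_year: Counter({(a, b): count})}, where
    each pair is stored in alphabetical order.
    """
    networks = defaultdict(Counter)
    for year, biomarkers in documents:
        interval_start = year - year % interval_years
        for pair in combinations(sorted(set(biomarkers)), 2):
            networks[interval_start][pair] += 1
    return dict(networks)
```

Each counter can be read as a weighted edge list, so comparing consecutive intervals shows when a biomarker pair first appears together in the literature.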
Massive text mining of the biological literature holds great promise of relating disparate information and discovering new knowledge. However, disambiguation of gene symbols is a major bottleneck.
We developed a simple thesaurus-based disambiguation algorithm that can operate with very little training data. The thesaurus comprises the information from five human genetic databases and MeSH. The extent of the homonym problem for human gene symbols is shown to be substantial (33% of the genes in our combined thesaurus had one or more ambiguous symbols), not only because one symbol can refer to multiple genes, but also because a gene symbol can have many non-gene meanings. A test set of 52,529 Medline abstracts, containing 690 ambiguous human gene symbols taken from OMIM, was automatically generated. Overall accuracy of the disambiguation algorithm was up to 92.7% on the test set.
The ambiguity of human gene symbols is substantial, not only because one symbol may denote multiple genes but particularly because many symbols have other, non-gene meanings. The proposed disambiguation approach resolves most ambiguities in our test set with high accuracy, including the important gene/not a gene decisions. The algorithm is fast and scalable, enabling gene-symbol disambiguation in massive text mining applications.
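The extent of the homonym problem described above can be measured with a simple count over a thesaurus. The sketch below is only an illustration under assumptions: the thesaurus layout (gene identifier mapped to its symbols), the case-insensitive comparison, and the name `ambiguous_gene_fraction` are hypothetical, and the disambiguation algorithm itself is not shown because the abstract does not describe its details.

```python
from collections import defaultdict


def ambiguous_gene_fraction(thesaurus, non_gene_terms=frozenset()):
    """Fraction of genes having at least one ambiguous symbol.

    A symbol is treated as ambiguous if it is shared by more than one gene
    or also appears in `non_gene_terms`. Comparison is case-insensitive,
    which is an assumption of this sketch.
    """
    genes_by_symbol = defaultdict(set)
    for gene_id, symbols in thesaurus.items():
        for symbol in symbols:
            genes_by_symbol[symbol.upper()].add(gene_id)
    non_gene = {term.upper() for term in non_gene_terms}

    def is_ambiguous(symbol):
        key = symbol.upper()
        return len(genes_by_symbol[key]) > 1 or key in non_gene

    if not thesaurus:
        return 0.0
    ambiguous = sum(
        any(is_ambiguous(s) for s in symbols) for symbols in thesaurus.values()
    )
    return ambiguous / len(thesaurus)
```

With a toy thesaurus such as `{"G1": ["PSA", "KLK3"], "G2": ["PSA"], "G3": ["CAT"], "G4": ["BRCA1"]}` (made-up gene identifiers) and `non_gene_terms={"cat"}`, three of the four genes carry an ambiguous symbol.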
INSIDE PC is an artificial intelligence (AI)-based semantic platform designed to help clinicians search the medical literature and determine optimal therapeutic sequencing for advanced prostate cancer. Our development and evaluation of INSIDE PC set new standards for AI-based literature searching.
Defining optimal therapeutic sequencing strategies in prostate cancer (PC) is challenging and may be assisted by artificial intelligence (AI)-based tools for an analysis of the medical literature.
To demonstrate that INSIDE PC can help clinicians query the literature on therapeutic sequencing in PC and to develop previously unestablished practices for evaluating the outputs of AI-based support platforms.
INSIDE PC was developed by customizing PubMed Bidirectional Encoder Representations from Transformers. Publications were ranked and aggregated for relevance using data visualization and analytics. Publications returned by INSIDE PC and PubMed were given normalized discounted cumulative gain (nDCG) scores by PC experts reflecting ranking and relevance.
INSIDE PC for AI-based semantic literature analysis.
INSIDE PC was evaluated for relevance and accuracy for three test questions on the efficacy of therapeutic sequencing of systemic therapies in PC.
In this initial evaluation, INSIDE PC outperformed PubMed for question 1 (novel hormonal therapy [NHT] followed by NHT) for the top five, ten, and 20 publications (nDCG score, +43, +33, and +30 percentage points [pps], respectively). For question 2 (NHT followed by poly adenosine diphosphate ribose polymerase inhibitors [PARPi]), INSIDE PC and PubMed performed similarly. For question 3 (NHT or PARPi followed by 177Lu-prostate-specific membrane antigen-617), INSIDE PC outperformed PubMed for the top five, ten, and 20 publications (+16, +4, and +5 pps, respectively).
We applied INSIDE PC to develop standards for evaluating the performance of AI-based tools for literature extraction. INSIDE PC performed competitively with PubMed and can assist clinicians with therapeutic sequencing in PC.
The medical literature is often very difficult for doctors and patients to search. In this report, we describe INSIDE PC—an artificial intelligence (AI) system created to help search articles published in medical journals and determine the best order of treatments for advanced prostate cancer in a much better time frame. We found that INSIDE PC works as well as another search tool, PubMed, a widely used resource for searching and retrieving articles published in medical journals. Our work with INSIDE PC shows new ways in which AI can be used to search published articles in medical journals and how these systems might be evaluated to support shared decision-making.
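For readers unfamiliar with the normalized discounted cumulative gain (nDCG) scores used in the evaluation above, the following Python sketch shows one common textbook formulation. It is an assumption that this matches the study's exact scoring: here the ideal ranking is obtained by re-sorting the same list of expert relevance grades, and the function names `dcg` and `ndcg` are chosen for illustration.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideally ordered list (0.0 if all zero)."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the most relevant publications first scores 1.0, and the score drops as relevant publications appear lower in the top five, ten, or 20 results.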
Intraindividual variability in electrocardiograms
Schijvenaars, Bob J.A., PhD; van Herpen, Gerard, MD, PhD; Kors, Jan A., PhD
Journal of electrocardiology, 05/2008, Volume 41, Issue 3
Journal Article, peer reviewed
The electrocardiogram (ECG) can be affected by intraindividual variations from various sources that may confuse the diagnosis of the underlying cardiac condition and impair the accuracy of ECG interpretation. Intraindividual variability is a hindrance in serial ECG analysis, where ECGs of the same individual, but taken at different times, are compared. Two sources of intraindividual variability can be distinguished as follows: variability related to the technical circumstances during ECG recording (technical sources) and nonpathologic biologic variability (biological sources). Among the technical sources, variation in electrode positioning between recordings is the most confusing. Of the biological sources, respiratory variations are effective at any time scale, but the most important are age and weight that work on prolonged time scales. Technical problems are best prevented by rigorously sticking to a standard acquisition protocol. Criteria can be adapted to changing circumstances (age, weight), and by computer modeling, it may be possible to correct the ECG diagnosis for some sources of intraindividual variability.
Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact on the number of terms identified by the different rules on a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule.
Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. 7,397 terms were suppressed in the corpus.
We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software tool to apply these rules to the UMLS is freely available at http://biosemantics.org/casper.
We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.
We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models--BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
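Two of the building blocks named above, tf-idf cosine similarity and the Jensen-Shannon divergence, can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the study's pipeline: the tf-idf weighting (raw term count times the natural log of N over document frequency) is just one common variant, the divergence is computed in base 2 so that it lies between 0 and 1, and all function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (documents are token lists)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return [
        {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in Counter(doc).items()
        }
        for doc in documents
    ]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between {term: probability} dicts."""
    mixture = {
        term: 0.5 * (p.get(term, 0.0) + q.get(term, 0.0))
        for term in set(p) | set(q)
    }

    def kl_to_mixture(dist):
        return sum(
            prob * math.log2(prob / mixture[term])
            for term, prob in dist.items()
            if prob > 0
        )

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
```

Identical word distributions give a divergence of 0 and distributions with no shared terms give 1, so lower within-cluster divergence indicates more textually coherent clusters.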