Structural information about chemical compounds is typically conveyed as 2D images of molecular structures in scientific documents. Unfortunately, these depictions are not a machine-readable representation of the molecules. With a backlog of decades of chemical literature in printed form not properly represented in open-access databases, there is high demand for translating graphical molecular depictions into machine-readable formats. This translation process is known as Optical Chemical Structure Recognition (OCSR). Today, we look back on nearly three decades of development in this demanding research field. Most OCSR methods follow a rule-based approach in which the key step of vectorizing the depiction is followed by the interpretation of vectors and nodes as bonds and atoms. In contrast, some of the latest approaches are based on deep neural networks (DNNs). This review provides an overview of the methods and tools that have been published in the field of OCSR. Additionally, a small benchmark study was performed with the available open-source OCSR tools in order to examine their performance.
Named Entity Recognition (NER), a key step in the extraction of health information, faces many challenges in Chinese Electronic Medical Records (EMRs). First, the casual use of Chinese abbreviations and doctors' personal styles can result in multiple expressions of the same entity, and no common Chinese medical dictionary exists to support accurate entity extraction. Second, EMRs contain entities from a variety of categories, and the lengths of entities in different categories vary greatly, which increases the difficulty of extraction for Chinese NER. Entity boundary detection therefore becomes the key to accurate entity extraction from Chinese EMRs, and a model is needed that supports recognition of entities of multiple lengths without relying on any medical dictionary.
In this study, we incorporate part-of-speech (POS) information into a deep learning model to improve the accuracy of Chinese entity boundary detection. To avoid incorrect POS tagging of long entities, we propose a method called reduced POS tagging, which keeps the tags of general words but discards those of likely medical entities. The proposed model, named SM-LSTM-CRF, consists of three layers: a self-matching attention layer, which calculates the relevance of each character to the entire sentence; an LSTM (Long Short-Term Memory) layer, which captures the context features of each character; and a CRF (Conditional Random Field) layer, which labels characters based on their features and transition rules.
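The self-matching attention layer described above can be sketched in plain Python. This is a minimal illustration with toy dot-product scoring, not the authors' exact formulation: each character vector attends over the whole sentence and receives an attention-pooled context vector.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_matching_attention(char_vecs):
    """For each character vector, score it against every character in
    the sentence (dot product), normalize the scores with softmax, and
    return the weighted sum of all character vectors as its context."""
    contexts = []
    for q in char_vecs:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in char_vecs]
        weights = softmax(scores)
        ctx = [sum(w * k[d] for w, k in zip(weights, char_vecs))
               for d in range(len(q))]
        contexts.append(ctx)
    return contexts
```

Each output vector is a convex combination of the input vectors, so a character's context summarizes its relevance to the whole sentence, as the layer description states.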
Experimental results on a Chinese EMR dataset show that the F1 score of SM-LSTM-CRF is 2.59% higher than that of LSTM-CRF. Adding the POS feature to the model yields a further improvement of about 7.74% in F1. Reduced POS tagging reduces false tagging of long entities, raising the F1 score by another 2.42% to a final F1 of 80.07%.
The POS feature produced by reduced POS tagging, together with the self-matching attention mechanism, tightly constrains entity boundaries and performs well in the recognition of clinical entities.
Named entity recognition (NER) is a key step in clinical natural language processing (NLP). Traditionally, rule-based systems leverage prior knowledge to define rules that identify named entities. Recently, deep learning-based NER systems have become increasingly popular. Contextualized word embeddings, a new type of word representation, dynamically capture word sense from context and have proven successful in many deep learning-based systems in both the general and medical domains. However, very few studies have investigated the effects of combining multiple contextualized embeddings and prior knowledge on the clinical NER task.
This study aims to improve the performance of NER in clinical text by combining multiple contextualized embeddings and prior knowledge.
In this study, we investigate the effects of combining multiple contextualized word embeddings with a classic word embedding in deep neural networks to predict named entities in clinical text. We also investigate whether a semantic lexicon could further improve the performance of the clinical NER system.
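The embedding-combination step can be sketched as follows: each token's final representation is the concatenation of the vectors assigned by each embedding source. The toy lookup tables below are hypothetical stand-ins for ELMo, Flair, and classic word-embedding encoders, which in practice would be full neural models.

```python
def stack_embeddings(*embedding_tables):
    """Return an embedder that maps each token to the concatenation of
    the vectors from every source (toy dict lookups stand in for real
    contextualized encoders such as ELMo or Flair)."""
    def embed(tokens):
        # sum(..., []) concatenates the per-source vectors for a token
        return [sum((table[t] for table in embedding_tables), [])
                for t in tokens]
    return embed
```

The concatenated vector then feeds the downstream tagging network; adding a lexicon feature, as the study does, amounts to stacking one more source.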
By combining contextualized embeddings such as ELMo and Flair, our system achieves an F1 score of 87.30% when trained on only a portion of the 2010 Informatics for Integrating Biology and the Bedside (i2b2) NER task dataset. After incorporating the medical lexicon into the word embedding, the F1 score further increased to 87.44%. Another finding was that our system could still achieve an F1 score of 85.36% when the training data was reduced to 40% of its original size.
Combining contextualized embeddings can be beneficial for the clinical NER task. Moreover, a semantic lexicon can be used to further improve the performance of the clinical NER system.
Multimodal named entity recognition (MNER) for social media aims to detect named entities in user-generated posts with the aid of visual information from attached images. Existing methods use pretrained visual models or visual grounding (VG) toolkits to learn visual information. However, they still suffer from a mismatch issue: the visual features extracted by the visual encoder are inconsistent with the actual requirements of cross-modal interaction. Ideally, the visual encoder should actively extract visual information guided by the text, which inherently provides the blueprint of the desired visual features. In this article, we present an end-to-end VG framework for the MNER task (VG-MNER), which adaptively learns text-related visual features. Specifically, we introduce a backbone network with a feature fusion module to learn and aggregate multisize visual representations. We then develop a text-related visual attention mechanism to refine the visual features. Notably, an entity-image contrastive loss is designed to guide the training of the visual encoder. The proposed model outperforms several state-of-the-art methods, achieving F1 scores of 75.62% and 88.11% on two benchmark datasets. Experimental results demonstrate the effectiveness of leveraging text-related visual information in the MNER task.
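An entity-image contrastive objective of this kind is commonly implemented as an InfoNCE-style loss; the article's exact loss may differ, so the following plain-Python version is only a minimal sketch: each entity embedding should score highest against its own image among all images in the batch.

```python
import math

def contrastive_loss(entity_vecs, image_vecs, temperature=0.1):
    """InfoNCE-style loss over a batch of paired (entity, image)
    embeddings: for each entity i, penalize any image j != i that
    scores close to the matching image i."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i, e in enumerate(entity_vecs):
        logits = [dot(e, v) / temperature for v in image_vecs]
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]  # -log softmax of the matching pair
    return loss / len(entity_vecs)
```

Minimizing this pulls each entity embedding toward its own image and away from the others, which is how the loss can steer the visual encoder toward text-relevant features.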
Named entity recognition (NER) of medical text is a basic task in electronic medical text processing. In recent years, Chinese named entity recognition systems, especially in the medical field, have suffered from insufficient semantic information and poor encoding ability. To address these problems, we put forward an overlapping neural network for medical named entity recognition. Compared with recent mainstream methods for the sequence tagging task, our proposed method learns context features automatically and handles the encoding process better. Comparative experiments were carried out on a medical NER dataset, and the results show that the proposed overlapping neural network model achieves better performance than state-of-the-art models.
Natural Language Processing (NLP) tasks such as relation extraction and knowledge graph construction build on named entity recognition (NER). To improve the recognition of Chinese entities in industrial processes, a NER model based on the BiLSTM-CRF network is proposed. First, abstracts of industrial-process patents were collected from the Internet with a web crawler; after data cleaning, de-duplication, and encoding, they formed a dataset. The data were then fed into a BiLSTM for bidirectional encoding to obtain long-sequence semantic features, which were decoded by a Conditional Random Field (CRF). By learning the dependencies between tags, the CRF obtains the optimal tag sequence, from which the correct entities are identified. On the self-built industrial-process dataset, the precision, recall, and F1-score of the model are 97.17%, 99.41%, and 98.27%, respectively, demonstrating that the model can effectively improve Chinese entity recognition for industrial processes.
Recent deep learning approaches have shown promising results for named entity recognition (NER). A reasonable assumption for training robust deep learning models is that a sufficient amount of high-quality annotated training data is available. However, in many real-world scenarios, labeled training data is scarce. In this paper we consider two use cases: generic entity extraction from financial documents and from biomedical documents. First, we developed a character-based model for NER in financial documents and a word- and character-based model with attention for NER in biomedical documents. We then analyzed how transfer learning addresses the problem of limited training data in a target domain. We demonstrate through experiments that NER models trained on labeled data from a source domain can serve as base models and then be fine-tuned with a small amount of labeled data to recognize different named entity classes in a target domain. Language models have also attracted interest for improving NER as a way of coping with limited labeled data; currently the most successful is BERT. Because of its success in state-of-the-art models, we integrate BERT-based representations into our biomedical NER model alongside word and character information. The results are compared with a state-of-the-art model applied to a benchmark biomedical corpus.
Knowledge Graphs (KGs) have proven effective for representing and modeling structured information, especially in the medical domain. However, obtaining structured medical information usually depends on manual processing by medical experts, and the construction of a Medical Knowledge Graph (MKG) remains a crucial problem in medical informatization. This work presents a novel method for constructing an MKG to drive the application of Rational Drug Use (RDU). We first collect and preprocess corpora from various types of resources, and then develop a medical ontology by studying the concepts in the RDU domain, authoritative books, and drug instructions. Based on the medical ontology, we formulate a scheme to annotate the corpora and construct a dataset for extracting entities and relations. We use two mechanisms to extract entities and relations, respectively: the former is based on deep learning, while the latter is rule-based. In the last stage, we disambiguate and standardize the results of entity-relation extraction to construct and enrich the MKG. The experimental results verify the effectiveness of the proposed methods.
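The rule-based relation extraction stage can be illustrated with a single pattern; the pattern, relation name, and wording below are hypothetical examples for illustration, not the paper's actual rules.

```python
import re

# Hypothetical rule: in drug-instruction text, the template
# "<drug> is indicated for <condition>" yields a
# (drug, TREATS, condition) triple for the knowledge graph.
TREATS = re.compile(r"(\w+) is indicated for (\w+)")

def extract_relations(sentence):
    """Apply the rule to one sentence and return relation triples."""
    return [(drug, "TREATS", cond) for drug, cond in TREATS.findall(sentence)]
```

A real system would hold many such rules, one per relation type, and run them over sentences whose entities were already found by the deep learning extractor.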