The Chinese named entity recognition (NER) task is a sub-task of information extraction whose goal is to find, identify and classify relevant entities, such as the names of people, places and organizations, in a given piece of unstructured text. Chinese NER is a fundamental task in the field of natural language processing (NLP) and plays an important role in many downstream NLP tasks, including information retrieval, relation extraction and question answering systems. This paper provides a comprehensive review of existing neural network-based word-character lattice structures for Chinese NER models. Firstly, this paper explains why Chinese NER is more difficult than English NER, citing challenges such as the difficulty of determining entity boundaries in Chinese text and the complexity of Chinese grammatical structures. Secondly, this paper investigates the most representative lattice-structured Chinese NER models under diff
Named Entity Recognition (NER) plays a pivotal role in knowledge extraction and in improving the intelligence of edge computing. The effectiveness of span-based NER models predominantly depends on the representation of spans. Existing methods primarily utilize semantic features to represent spans, often neglecting other vital information. This paper proposes a method that incorporates Part of Speech (POS) information into span representations to overcome this limitation. Central to this methodology is a span POS encoder designed to extract the POS feature of spans. To migrate the method to edge devices, this paper introduces a fast span POS encoder, which significantly reduces the time complexity of POS feature extraction. Building upon this innovation, a span-based NER model named IPSI (Incorporating Part of Speech Information in span representation) is developed, exhibiting outstanding performance on nested and flat datasets. Comparing the original and fast span POS encoders reveals that while the fast encoder slightly compromises performance, it markedly accelerates the training and inference processes. Finally, through a series of experiments and sample analyses, this article explores the intrinsic mechanism through which the span POS feature influences entity recognition and further illustrates the importance of the POS feature.
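The abstract does not spell out how a POS feature is folded into a span representation, but the general idea can be sketched as follows. All names, dimensions, the tag inventory and the mean-pooling choice below are illustrative assumptions, not the published IPSI encoder.

```python
import numpy as np

# Hypothetical sketch: augmenting a span's semantic vector with a pooled
# part-of-speech (POS) feature, in the spirit of a span POS encoder.

POS_TAGS = {"NOUN": 0, "VERB": 1, "ADJ": 2, "DET": 3}

rng = np.random.default_rng(0)
pos_embed = rng.normal(size=(len(POS_TAGS), 8))  # learnable in a real model

def span_pos_feature(tags):
    """Mean-pool the POS embeddings of the tokens inside a span."""
    idx = [POS_TAGS[t] for t in tags]
    return pos_embed[idx].mean(axis=0)

def span_representation(semantic_vec, tags):
    """Concatenate the semantic span vector with its POS feature."""
    return np.concatenate([semantic_vec, span_pos_feature(tags)])

sem = rng.normal(size=16)               # e.g. from a BERT-style encoder
rep = span_representation(sem, ["DET", "NOUN"])
print(rep.shape)  # (24,) = 16 semantic dims + 8 POS dims
```

A span classifier would then score `rep` instead of the purely semantic vector, which is how the POS signal reaches the entity decision.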
•A span POS encoder is proposed to extract the POS feature of a span.
•An efficient variant of the span POS encoder is developed to cater to the limited computational capacity of edge devices.
•An NER model is developed that incorporates the POS feature to enhance span representation, outperforming competitor models.
•The significance of the POS feature for NER is substantiated through a series of experiments.
As an important foundational task in the field of natural language processing, the Chinese named entity recognition (NER) task has received widespread attention in recent years. Self-distillation plays a role in exploring the potential of the knowledge carried by internal parameters in the BERT NER model, but few studies have noticed the impact of different-granularity semantic information during the distillation process. In this paper, we propose a multi-level semantic enhancement approach based on self-distillation BERT for Chinese named entity recognition. We first design a feasible data augmentation method to improve training quality when handling complex entity compositions, then construct a boundary smoothing module to achieve moderate learning of entity boundaries. Besides, we utilize a distillation reweighting method to let the model acquire balanced entity and context knowledge. On the two Chinese named entity recognition benchmark datasets Weibo and Resume, our model achieves F1 scores of 72.09% and 96.93%, respectively. Compared to three different basic distillation BERT models, our model also produces better results. The source code is available at https://github.com/lookmedandan/MSE.
•I proposed a deep neural network for biomedical named entity recognition.
•A conditional random field was added to capture relationships between entities.
•Word vectors were initialized using pretrained word embeddings.
•Character-level representations of words were useful for handling out-of-vocabulary issues.
Biomedical named entity recognition (BNER), which extracts important named entities such as genes and proteins, is a challenging task in automated systems that mine knowledge in biomedical texts. Previous state-of-the-art systems required large amounts of task-specific knowledge in the form of feature engineering, lexicons and data pre-processing to achieve high performance. In this paper, we introduce a novel neural network architecture that automatically benefits from both word- and character-level representations, using a combination of bidirectional long short-term memory (LSTM) and a conditional random field (CRF), eliminating the need for most feature engineering. We evaluate our system on two datasets: the JNLPBA corpus and the BioCreAtIvE II Gene Mention (GM) corpus. We obtained state-of-the-art performance, outperforming the previous systems. To the best of our knowledge, we are the first to investigate the combination of deep neural networks, CRF, word embeddings and character-level representations in recognizing biomedical named entities.
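The CRF layer in such a BiLSTM-CRF stack contributes a structured decoding step: the best tag sequence is found by Viterbi search over per-position emission scores (from the BiLSTM) and tag-to-tag transition scores. The sketch below shows only that decoding step; the emission and transition values are toy numbers, not learned parameters.

```python
import numpy as np

# Sketch of CRF decoding on top of BiLSTM emissions: Viterbi search
# for the highest-scoring tag sequence.

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, n_tags); transitions: (n_tags, n_tags)."""
    seq_len, n_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    for t in range(1, seq_len):
        # score of arriving at each tag from every previous tag
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t][best[-1]]))
    return best[::-1]

# Toy example: 3 tags (O, B-Gene, I-Gene); the large negative entry
# forbids the illegal transition O -> I-Gene.
emissions = np.array([[0.1, 2.0, 0.0],
                      [0.0, 0.1, 1.5],
                      [1.0, 0.2, 0.3]])
transitions = np.array([[0.0, 0.0, -10.0],   # from O
                        [0.0, 0.0, 1.0],     # from B-Gene
                        [0.0, 0.0, 0.5]])    # from I-Gene
print(viterbi_decode(emissions, transitions))  # [1, 2, 0] = B-Gene, I-Gene, O
```

This global decoding is what lets the CRF enforce tagging constraints (such as `I-` never following `O`) that a per-token softmax cannot.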
The COrona VIrus Disease 19 (COVID-19) pandemic required the work of experts worldwide to tackle it. Despite the abundance of new studies, privacy laws prevent their dissemination for medical investigations: through clinical de-identification, the Protected Health Information (PHI) contained therein can be anonymized so that medical records can be shared and published. The automation of clinical de-identification through deep learning techniques has proven to be less effective for languages other than English due to the scarcity of data sets. Hence, a new Italian de-identification data set has been created from the COVID-19 clinical records made available by the Italian Society of Radiology (SIRM). Two multilingual deep learning systems have been developed for this low-resource language scenario: the objective is to investigate their ability to transfer knowledge between different languages while maintaining the features necessary to correctly perform the Named Entity Recognition task for de-identification. The systems were trained using four different strategies, using both the English Informatics for Integrating Biology & the Bedside (i2b2) 2014 and the new Italian SIRM COVID-19 data sets, then evaluated on the latter. These approaches have demonstrated the effectiveness of cross-lingual transfer learning for de-identifying medical records written in a low-resource language such as Italian, using a high-resource language such as English.
•Comparison of multilingual deep learning systems through clinical de-identification.
•Proposal and testing of 4 possible training approaches for low-resource languages.
•Construction of a new annotated Italian dataset from public COVID-19 medical records.
In the finance domain, nested named entity recognition has become a hot topic among named entity recognition tasks. Traditional nested entity recognition methods easily ignore the dependency relationships between entities, and these methods are mostly suited to the English general domain. Therefore, we propose a Chinese nested entity recognition method for the finance domain based on a heterogeneous graph network (HGFNER). This method consists of two parts: a boundary division model for candidate entities and an internal relationship graph model for candidate entities. First, the boundary division model, which introduces expert knowledge, partitions the flat entities contained in the text and segments the text to address issues such as long entity boundaries and strong domain features in the Chinese finance domain. Then, heterogeneous graphs represent the internal structure of entities from both spatial and syntactic dependency perspectives, enabling the model to learn dependency relationships between entities from multiple angles. Meanwhile, to preserve the operational efficiency of the model, we also propose a fast matching algorithm, DAAC_BM, for n-gram sequences in domain dictionaries, which solves the memory overflow and space waste faced by multi-pattern fast matching algorithms in Chinese matching. In addition, we propose a Chinese nested entity dataset, CFNE, for the financial field, which, as far as we know, is the first publicly available annotated dataset in this field. HGFNER achieves a state-of-the-art macro-F1 value on CFNE, reaching 86.41%.
The biomedical literature is growing rapidly, and the extraction of meaningful information from the large amount of literature is increasingly important. Biomedical named entity (BioNE) identification is one of the critical and fundamental tasks in biomedical text mining. Accurate identification of entities in the literature facilitates the performance of other tasks. Given that an end-to-end neural network can automatically extract features, several deep learning-based methods have been proposed for BioNE recognition (BioNER), yielding state-of-the-art performance. In this review, we comprehensively summarize deep learning-based methods for BioNER and the datasets used in training and testing. The deep learning methods are classified into four categories: single neural network-based, multitask learning-based, transfer learning-based and hybrid model-based methods. They can be applied to BioNER in multiple domains, and the results are determined by the dataset size and type. Lastly, we discuss the future development and opportunities of BioNER methods.
Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease–gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease–gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.
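The co-occurrence scoring idea described above can be sketched in a few lines: a disease–gene pair scores more when the two names co-occur within a sentence than when they merely appear in the same document. The weights below (1.0 within-sentence, 0.25 between-sentence) and the substring matching are illustrative assumptions, not the published DISEASES scheme.

```python
# Illustrative co-occurrence scorer: within-sentence co-mentions are
# weighted more heavily than document-level co-mentions.

def cooccurrence_score(sentences, gene, disease,
                       w_within=1.0, w_between=0.25):
    score = 0.0
    gene_seen = disease_seen = False
    for sent in sentences:
        has_gene = gene in sent
        has_disease = disease in sent
        if has_gene and has_disease:
            score += w_within          # same-sentence co-mention
        gene_seen |= has_gene
        disease_seen |= has_disease
    if gene_seen and disease_seen:
        score += w_between             # document-level co-mention
    return score

abstract = [
    "BRCA1 mutations are associated with breast cancer.",
    "The gene is also studied in ovarian tissue.",
]
print(cooccurrence_score(abstract, "BRCA1", "breast cancer"))  # 1.25
```

A production system would of course replace the substring test with the dictionary-based tagger's entity spans and aggregate scores over the whole corpus.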
Chinese medicine is a unique and complex medical system with complete and rich scientific theories. The textual data of Traditional Chinese Medicine (TCM) contains a large amount of relevant knowledge in the field of TCM, which can guide accurate disease diagnosis as well as efficient disease prevention and treatment. Existing TCM texts are disorganized and lack a uniform standard. For this reason, this paper proposes a joint extraction framework that uses graph convolutional networks to extract joint entity relations from document-level TCM texts to achieve TCM entity relation mining. More specifically, we first fine-tune the pre-trained language model using TCM domain knowledge to obtain a task-specific model. Taking the integrity of TCM into account, we extract complete entities as well as the relations corresponding to diagnosis and treatment from document-level medical cases by using multiple features such as word fusion coding, TCM lexicon information, and multi-relational graph convolutional networks. The experimental results show that the proposed method outperforms the state-of-the-art methods. It achieves an F1-score of 90.7% for Named Entity Recognition and 76.14% for Relation Extraction on the TCM dataset, significantly improving the ability to extract entity relations from TCM texts. Code is available at https://github.com/xxxxwx/TCMERE.
•The proposed method can alleviate entity nesting and relation overlapping.
•A large amount of data is collected to train the model in the TCM domain.
•Multi-type TCM dictionaries are built.
•Radical-level features used to enrich the semantic information of the characters.
•Self-attention mechanism used to capture the dependencies between characters.
•Our AR-CCNER model outperforms previous state-of-the-art models.
Named entity recognition is a fundamental and crucial task in medical natural language processing. In the medical field, Chinese clinical named entity recognition identifies the boundaries and types of medical entities in unstructured text such as electronic medical records. Recently, a composition model of bidirectional Long Short-term Memory networks (BiLSTMs) and a conditional random field (BiLSTM-CRF) based on character-level semantics has achieved great success in Chinese clinical named entity recognition tasks. However, this method can only capture contextual semantics between characters in sentences. Chinese characters are logograms with deeper semantic information hidden inside them, which the BiLSTM-CRF model fails to capture. In addition, some entities in a sentence depend on one another, but the Long Short-term Memory (LSTM) network does not perfectly capture long-term dependencies between characters. We therefore propose a BiLSTM-CRF model based on radical-level features and a self-attention mechanism to solve these problems. We use a convolutional neural network (CNN) to extract radical-level features, aiming to capture the intrinsic and internal relevance of characters. In addition, we use a self-attention mechanism to capture the dependencies between characters regardless of their distance. Experiments show that our model achieves F1-scores of 93.00% and 86.34% on the CCKS-2017 and TP_CNER datasets, respectively.
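The self-attention component mentioned above lets every character position attend to every other, so dependencies are modeled regardless of distance. A minimal single-head, scaled dot-product sketch (dimensions and the parameter-free Q=K=V form are simplifying assumptions, not the paper's exact layer):

```python
import numpy as np

# Minimal scaled dot-product self-attention over character features.

def self_attention(x):
    """x: (seq_len, d) character representations (e.g. BiLSTM outputs)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ x                            # context-mixed outputs

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))    # 5 characters, 4-dim features
out = self_attention(x)
print(out.shape)  # (5, 4)
```

Each output row is a similarity-weighted mixture of all positions, which is why a first and last character can influence each other directly rather than through a long LSTM recurrence.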