We present a multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering-based semi-supervised features induced on large amounts of unlabeled text. Understanding via empirical experimentation how to effectively combine various types of clustering features allows us to seamlessly export our system to other datasets and languages. The result is a simple but highly competitive system which obtains state-of-the-art results across five languages and twelve datasets. The results are reported on standard shared-task evaluation data such as CoNLL for English, Spanish, and Dutch. Furthermore, and despite the lack of linguistically motivated features, we also report best results for languages such as Basque and German. In addition, we demonstrate that our method obtains very competitive results even when the amount of supervised data is cut in half, alleviating the dependency on manually annotated data. Finally, the results show that our emphasis on clustering features is crucial for developing robust out-of-domain models. The system and models are freely available to facilitate their use and guarantee the reproducibility of results.
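The abstract above does not spell out the clustering features themselves; a minimal sketch of the common pattern (shallow local features combined with Brown-cluster path prefixes) is shown below. The tiny cluster lexicon and prefix lengths are invented for illustration, not taken from the paper's models.

```python
# Sketch: shallow local features plus Brown-cluster prefix features for
# one token. The cluster paths below are toy values, not real clusters.
BROWN_CLUSTERS = {
    "london": "0110",
    "paris": "0111",
    "monday": "1010",
}

def token_features(tokens, i, prefix_lengths=(2, 4)):
    """Local (shape/affix) plus cluster-prefix features for token i."""
    tok = tokens[i]
    feats = {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "suffix3": tok.lower()[-3:],
    }
    path = BROWN_CLUSTERS.get(tok.lower())
    if path is not None:
        # Prefixes of the cluster bit-path give features at several
        # granularities, which helps the features transfer across domains.
        for p in prefix_lengths:
            feats[f"brown_prefix{p}"] = path[:p]
    return feats

feats = token_features(["He", "visited", "London"], 2)
```

Using several prefix lengths of the same cluster path is what lets one feature set work at both coarse and fine granularity, which is in the spirit of the robustness claim above.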
Named entity recognition (NER) identifies and categorizes entities in unstructured text and serves as a fundamental task for a variety of natural language processing (NLP) applications. In particular, emerging few-shot NER methods, which aim to learn model parameters well from few samples, have received considerable attention. Dominant few-shot NER methods usually employ pre-trained language models (PLMs) as their basic architecture and fine-tune the model parameters on a few NER samples. Since the sample size is small and PLMs contain a large number of parameters, fine-tuning may leave the PLM parameters highly biased. To address this issue, this study introduces semantic distribution distance constraints to optimize the fine-tuning process of few-shot NER models and develops a framework named Semantic Constraints on few-shot Named Entity Recognition (SCNER). Specifically, the framework formulates the general knowledge transfer of PLMs as an optimal transport procedure with a semantic prior, and a Semantics-induced Optimal Transport (SOT) regularizer is developed to exploit the importance and similarities of tokens within sentences. SOT builds the semantic distribution of the sentence and defines the transport costs between tokens to achieve token-level optimal transport. Finally, SOT is employed as a regularization term for few-shot NER, introducing the semantic distribution distance constraint to effectively transfer general knowledge from PLMs. Experiments on four public datasets demonstrate that the proposed method significantly improves the performance of NER models in both few-shot and fully supervised scenarios. SCNER is a general framework that can be applied to a variety of models without adding learning parameters, and it can enhance the generalization ability and adaptability of various few-shot NER models.
•This paper highlights the contribution of semantic distribution distance constraints to general knowledge transfer.
•The proposed approach is a general framework that adds no extra learning parameters.
•Extensive experiments show the approach significantly improves the performance of the baseline models.
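SOT's exact formulation is in the paper; the sketch below only shows the generic entropy-regularized (Sinkhorn) optimal transport distance that such a token-level regularizer builds on. The marginals (token importance weights) and cost matrix here are hypothetical placeholders.

```python
import numpy as np

# Generic entropy-regularized OT (Sinkhorn) distance between two token
# distributions under a token-to-token cost matrix. This is a sketch of
# the kind of quantity an SOT-style regularizer adds to the loss, not
# SCNER's exact design.
def sinkhorn_distance(mu, nu, cost, reg=0.1, n_iters=200):
    """mu, nu: token importance distributions; cost: (len(mu), len(nu))."""
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)               # alternating marginal scaling
        u = mu / (K @ v)
    plan = u[:, None] * K * v[None, :]   # approximate transport plan
    return float(np.sum(plan * cost))

# Two-token sentences whose tokens align one-to-one (zero-cost diagonal):
d = sinkhorn_distance(np.array([0.5, 0.5]), np.array([0.5, 0.5]),
                      np.array([[0.0, 1.0], [1.0, 0.0]]))
```

In a fine-tuning loop, a term like `lambda * sinkhorn_distance(...)` between the PLM's and the fine-tuned model's token distributions would act as the distribution-distance constraint described above.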
Identifying adverse drug reaction (ADR) entities in text is a crucial task for pharmacology and the basis for the ADR relation extraction task. Publicly available resources for this task include PubMed abstracts, social media, and other sources. Among these, social media data reflect drug users' reactions after taking medicine in real time and are updated quickly. However, the very small quantity of annotated social media data has led to little research on them. Moreover, social media texts are colloquial and use informal vocabulary, which poses a major challenge for ADR named entity recognition (NER). In this work, we present an adversarial transfer learning architecture for the ADR NER task. Our model improves performance on Twitter data (the target resource) by incorporating biomedical domain information from PubMed (the source resource). Additionally, we set a scale parameter in the final loss function to address the bias in model training caused by imbalanced amounts of data. Without adding any manually designed features, our approach achieves state-of-the-art performance with an F1-score of 68.58% on Twitter ADR data.
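The scale parameter mentioned above down-weights whichever loss term comes from the larger corpus so it does not dominate training. A minimal sketch of one common heuristic for choosing that scale (not necessarily the paper's exact parameterization):

```python
# Sketch: scale the source-domain (PubMed) loss by the corpus-size
# ratio so the small target corpus (Twitter) is not drowned out.
# The alpha knob and the ratio heuristic are illustrative assumptions.
def combined_loss(target_loss, source_loss, n_target, n_source, alpha=1.0):
    scale = alpha * n_target / n_source   # down-weight the larger source
    return target_loss + scale * source_loss

# Toy numbers: 1k annotated tweets vs 10k PubMed sentences.
loss = combined_loss(target_loss=0.8, source_loss=0.5,
                     n_target=1_000, n_source=10_000)
```

With ten times more source data, the source loss contributes only a tenth of its raw value, which is one simple way to realize the bias correction the abstract describes.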
The automatic extraction of geospatial information is an important aspect of data mining. Computer systems capable of discovering geographic information in natural language rely on a complex process called geoparsing, which includes two important tasks: geographic entity recognition and toponym resolution. The first task can be approached with machine learning, in which case a model is trained to recognize sequences of characters (words) corresponding to geographic entities. The second task consists of assigning such entities their most likely coordinates; frequently, this involves resolving referential ambiguities. In this paper, we propose an extensible geoparsing approach that combines geographic entity recognition based on a neural network model with disambiguation based on what we call dynamic context disambiguation. Once place names are recognized in an input text, they are resolved using a grammar in which a set of rules specifies how ambiguities can be resolved, much as a person would, considering the context. The result is an assignment of the most likely geographic properties to the recognized places. We propose an assessment measure based on a ranking of closeness between the predicted and actual locations of a place name; on this measure, our method outperforms OpenStreetMap Nominatim. We include further measures to assess place-name recognition and the prediction of what we call geographic levels (the administrative jurisdiction of places).
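The core intuition of context-based toponym disambiguation can be sketched as follows: among a toponym's candidate gazetteer entries, prefer the one closest to other places already resolved in the same text. The candidate coordinates and the crude planar distance below are illustrative assumptions, not the paper's grammar-driven rules.

```python
import math

# Sketch of context-based toponym disambiguation. The gazetteer entries
# are invented approximations (Springfield, IL vs Springfield, MA).
CANDIDATES = {
    "Springfield": [(39.80, -89.64), (42.10, -72.59)],
}

def resolve(toponym, context_coords):
    """Pick the candidate minimizing total distance to context places."""
    def dist(a, b):
        # Crude planar distance over (lat, lon); a real system would
        # use great-circle distance.
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return min(
        CANDIDATES[toponym],
        key=lambda c: sum(dist(c, ctx) for ctx in context_coords),
    )

# Context mentions Chicago, so the Illinois candidate should win.
best = resolve("Springfield", [(41.88, -87.63)])
```

The grammar described in the abstract generalizes this idea: instead of a single distance heuristic, a set of rules decides which contextual evidence applies.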
Recent developments in the field of artificial intelligence have led to renewed interest in natural language processing. Named entity recognition (NER) is a classic problem in natural language processing, and investigating named entities is a continuing concern within information extraction. Names of people, organizations, locations, events, etc. are named entities (NEs). Although extensive research has been carried out on NER, few studies explore NER in the Marathi language. This paper provides a conceptual framework based on the gazetteer matching technique for developing a named entity recognition system for Marathi. A combination of rules and regular expressions handles temporal and numerical pattern recognition. The techniques presented here are systematic, clear, and effective for building processing tools for morphologically rich languages. The described system reports satisfactory performance, with 62.64% NE identification and 72.27% NE classification accuracy.
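The gazetteer-plus-rules design above can be sketched in a few lines: dictionary lookup for name entities, regular expressions for temporal and numerical patterns. The tiny gazetteer, entity labels, and patterns below are illustrative, not the system's actual resources (which would be Marathi-script).

```python
import re

# Sketch of gazetteer matching plus regex rules for temporal/numerical
# patterns. Entries and patterns are toy examples.
GAZETTEER = {"pune": "LOCATION", "mumbai": "LOCATION", "tata": "ORGANIZATION"}
PATTERNS = [
    (re.compile(r"\d{1,2}/\d{1,2}/\d{4}"), "DATE"),
    (re.compile(r"\d+(\.\d+)?%"), "PERCENT"),
]

def tag(tokens):
    """Label each token via gazetteer lookup, then regex rules, else O."""
    tags = []
    for tok in tokens:
        label = GAZETTEER.get(tok.lower())
        if label is None:
            label = next((t for p, t in PATTERNS if p.fullmatch(tok)), "O")
        tags.append((tok, label))
    return tags

tags = tag(["Pune", "grew", "4.5%", "on", "12/05/2021"])
```

Real gazetteer systems add multi-word matching and morphological normalization on top of this lookup, which matters for a morphologically rich language like Marathi.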
•Local and global self-attention mechanisms are used for character embedding.
•A CNN with multi-size filters is used to extract character-level information for NER.
•A cross-attention method that fuses character and word embeddings for NER is proposed.
•A modified Mogrifier LSTM is presented to improve the performance of NER.
•The proposed methods, integrated with a transformer-based model, achieve good performance.
Clinical and biomedical concept extraction is critical for medical analysis of clinical and biomedical documents from the professional literature, EHRs, and PHRs. Named entity recognition (NER) accurately marks essential information in the literature based on the characteristics of the target entity, providing a method for extracting clinical and biomedical concepts. NER performance depends heavily on the embeddings used, so recent studies have proposed generating word embeddings from character-level information, which strengthens their representational power.
In this paper, we present a novel neural network model that combines an attention mechanism network and a convolutional neural network (CNN) to further improve character-level embedding. First, an attention mechanism is applied simultaneously to the local and global character embeddings. Then, a CNN with multi-size filters is used to extract more information at the character level, capturing more meaningful features from words of various lengths. In addition, a cross-attention method leverages the interaction between word embedding and character embedding to generate the final word representation. Finally, we modify the Mogrifier LSTM to make it suitable for NER tasks and integrate it into our model. Experimental results show that our method is effective and that the model outperforms the baseline models. We also apply the proposed methods to a transformer-based model and obtain an F1-score of 90.36 on NCBI-Disease.
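The multi-size-filter step above has a standard shape: convolve the character matrix with filters of several widths and max-pool each over time, so words of different lengths still yield a fixed-size vector. A minimal NumPy sketch with illustrative dimensions and untrained random filters:

```python
import numpy as np

# Sketch of character-level embedding via multi-size convolution filters
# and max-over-time pooling. All dimensions are illustrative.
rng = np.random.default_rng(0)

def char_cnn_embedding(char_vecs, filter_widths=(2, 3, 4), n_filters=8):
    """char_vecs: (word_len, char_dim), word_len >= max(filter_widths).
    Returns a fixed-size vector concatenating max-pooled features."""
    word_len, char_dim = char_vecs.shape
    pooled = []
    for w in filter_widths:
        # Hypothetical random filters; a trained model would learn these.
        filters = rng.standard_normal((n_filters, w * char_dim))
        windows = np.stack([char_vecs[i:i + w].ravel()
                            for i in range(word_len - w + 1)])
        conv = windows @ filters.T              # (n_windows, n_filters)
        pooled.append(conv.max(axis=0))         # max over time
    return np.concatenate(pooled)               # (len(widths)*n_filters,)

# A 6-character word with 5-dim character vectors -> 24-dim embedding.
emb = char_cnn_embedding(rng.standard_normal((6, 5)))
```

Max-over-time pooling is what makes the output length independent of word length, which is the property the abstract's "words of various lengths" claim relies on.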
•Prior information is incorporated into the model through the MRC framework.
•Context information is encoded into each sample by an N-gram sample pre-processing mechanism.
•Global features are extracted for each token with a CNN, and local features are extracted with an LSTM.
Recent advances in natural language representation have enabled the internal state of an upstream trained model to migrate to downstream tasks such as named entity recognition (NER). To better exploit pretrained models for NER, the latest approaches implement NER within the machine reading comprehension (MRC) framework. However, existing MRC approaches do not account for the limited performance of reading comprehension models caused by the absence of contextual information in a single sample. Moreover, only word-level features are employed in the feature extraction phase of existing approaches. In this paper, a novel MRC model named GFMRC is proposed for NER. GFMRC enhances the MRC model with contextual information and hybrid features. In the preprocessing stage, the samples of the initial MRC dataset are spliced with N-gram information. In the feature extraction stage, global features are extracted for each token using a CNN, and local features are extracted using an LSTM. Experiments on both Chinese and English datasets demonstrate the effectiveness of the proposed model: compared to BERT-MRC+DSC, the improvements on the English CoNLL 2003, English OntoNotes 5.0, Chinese MSRA, and Chinese OntoNotes 4.0 datasets are 0.07%, 0.23%, 0.04%, and 0.26%, respectively.
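The N-gram splicing step described above amounts to widening each MRC sample with its neighboring sentences. A minimal sketch, where the window size and separator are illustrative choices rather than GFMRC's actual settings:

```python
# Sketch: splice each MRC sample with its n neighboring sentences so
# the reader model sees context beyond a single sentence. Window size
# and separator are illustrative assumptions.
def add_context(sentences, index, n=1, sep=" "):
    """Return sentence `index` joined with up to n neighbors per side."""
    lo = max(0, index - n)
    hi = min(len(sentences), index + n + 1)
    return sep.join(sentences[lo:hi])

docs = ["Sent A.", "Sent B.", "Sent C."]
sample = add_context(docs, 1)   # middle sentence gets both neighbors
```

At document boundaries the window simply shrinks, so the first and last samples still get whatever one-sided context exists.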
To clarify the risk factors and propagation characteristics affecting railway safety, we learn from historical reports to build a connected network of hazards and accidents, forming a knowledge graph (KG), and apply it to railway hazard identification and risk assessment. First, open-source British railway accident/incident reports are selected as the data source, and a text augmentation algorithm from text mining is introduced and optimized for data enhancement. An ensemble model is then constructed from a hidden Markov model, a conditional random field (CRF), a bidirectional long short-term memory network (Bi-LSTM), and a Bi-LSTM-CRF deep learning network to perform named entity recognition on the reports. Next, the random forest algorithm standardizes the classification of entities, and the multi-dimensional knowledge graph network is established. Finally, after defining a series of safety-related feature parameters, the obtained KG is applied to the quantitative assessment of the risk level of hazards. The results show that this approach realizes the visualization and quantitative description of the potential relationships among hazards, faults, and accidents by exploring the topological relationships of the railway accident network, further assisting the formulation of railway risk preventive measures.
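At decoding time, the CRF layers in the ensemble above all rely on Viterbi decoding: given per-token label scores and a label-transition matrix, recover the highest-scoring label sequence. A NumPy sketch with toy scores (two labels, standing in for tags like O and HAZARD), not the trained model's parameters:

```python
import numpy as np

# Sketch of CRF Viterbi decoding as used on top of a Bi-LSTM encoder.
def viterbi(emissions, transitions):
    """emissions: (T, L) per-token label scores;
    transitions: (L, L) from->to label transition scores."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # total[i, j] = best score ending in label i, then moving to j.
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):      # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy 3-token sentence, 2 labels; emissions favor labels 0, 1, 0.
best = viterbi(np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]),
               np.zeros((2, 2)))
```

With non-zero transition scores, the same routine enforces label-sequence constraints (e.g. penalizing invalid tag transitions), which is the CRF layer's contribution over per-token argmax.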