Drug abuse data on the official website of the National Narcotics Board of Indonesia are not up-to-date. Moreover, the board's drug reports are only published once a year. This study aims to utilize online news sites as a data source for collecting information about drug abuse in Indonesia. It also builds a named entity recognition (NER) model to extract information from news texts. The primary NER model uses a convolutional neural network-long short-term memory (CNN-LSTM) architecture because it achieves good performance with relatively short computation time. The baseline NER model uses a bidirectional long short-term memory-conditional random field (BiLSTM-CRF) architecture because it is easy to implement with the Flair framework. The primary model achieves an F1 score of 82.54%, whereas the baseline model achieves only 69.67%. The raw data extracted by NER are then processed to produce the number of drug suspects in Indonesia from 2018 to 2020. However, the resulting data are not as complete as similar data sourced from the Indonesian National Narcotics Board's publications.
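The post-processing step described above, turning BIO-tagged NER output into yearly suspect counts, can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline; the `SUSPECT` tag name and the `(year, tokens, tags)` layout are assumptions for the example.

```python
from collections import Counter

def extract_entities(tokens, tags):
    """Collect entity spans from BIO labels (e.g., B-SUSPECT / I-SUSPECT)."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

def suspects_per_year(articles):
    """articles: list of (year, tokens, tags) triples from the NER model."""
    counts = Counter()
    for year, tokens, tags in articles:
        counts[year] += len(extract_entities(tokens, tags))
    return dict(counts)
```

A call such as `suspects_per_year(tagged_articles)` would yield a per-year tally like the 2018-2020 figures the study compares against the board's publications.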
Named entity recognition (NER) is the task of identifying mentions of rigid designators in text belonging to predefined semantic types such as person, location, and organization. NER serves as the foundation for many natural language applications such as question answering, text summarization, and machine translation. Early NER systems achieved good performance at the cost of heavy human engineering to design domain-specific features and rules. In recent years, deep learning, empowered by continuous real-valued vector representations and semantic composition through nonlinear processing, has been employed in NER systems, yielding state-of-the-art performance. In this paper, we provide a comprehensive review of existing deep learning techniques for NER. We first introduce NER resources, including tagged NER corpora and off-the-shelf NER tools. Then, we systematically categorize existing works along three axes: distributed representations for input, context encoder, and tag decoder. Next, we survey the most representative methods applying deep learning techniques to new NER problem settings and applications. Finally, we present the challenges faced by NER systems and outline future directions in this area.
Named Entity Recognition (NER) in social media posts is challenging because the texts are usually short and lack context. Most recent works show that visual information can boost NER performance, since images can provide complementary contextual information for texts. However, image-level features ignore the mapping relations between fine-grained visual objects and textual entities, which leads to detection errors for entities of different types. To better exploit visual and textual information in NER, we propose an adversarial gated bilinear attention neural network (AGBAN). The model jointly extracts entity-related features from both visual objects and texts, and leverages adversarial training to map the two different representations into a shared representation. As a result, domain information contained in an image can be transferred and applied to extract named entities from the text associated with the image. Experimental results on a Tweets dataset demonstrate that our model outperforms state-of-the-art methods. Moreover, we systematically evaluate the effectiveness of the proposed gated bilinear attention network in capturing the interactions between multimodal features, i.e., visual objects and textual words. Our results indicate that adversarial training can effectively exploit commonalities across heterogeneous data sources, which improves NER performance compared to models that exploit only text data or combine image-level visual features.
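The core fusion idea, bilinear attention from a word over visual objects, gated by how useful the visual context looks, can be sketched in a toy form. This is a simplified illustration of the general mechanism, not AGBAN's actual layers; all parameters (`W`, `gate_w`, `gate_b`) are illustrative.

```python
import math

def bilinear_score(x, y, W):
    """Bilinear compatibility x^T W y for plain-list vectors."""
    Wy = [sum(W[i][j] * y[j] for j in range(len(y))) for i in range(len(x))]
    return sum(x[i] * Wy[i] for i in range(len(x)))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def gated_bilinear_attention(word, objects, W, gate_w, gate_b):
    """Fuse visual object features into a word representation.
    Attention weights come from bilinear scores; a sigmoid gate decides
    how much visual context to mix into the word vector."""
    weights = softmax([bilinear_score(word, obj, W) for obj in objects])
    visual = [sum(w * obj[d] for w, obj in zip(weights, objects))
              for d in range(len(word))]
    gate = 1.0 / (1.0 + math.exp(-(sum(g * v for g, v in zip(gate_w, visual)) + gate_b)))
    return [wd + gate * vd for wd, vd in zip(word, visual)]
```

With an identity `W`, the word vector attends most to the visually most similar object, which is the intuition behind mapping fine-grained objects to textual entities.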
Named entity recognition (NER) identifies and categorizes entities in unstructured text, which serves as a fundamental task for a variety of natural language processing (NLP) applications. In particular, emerging few-shot NER methods aim to learn model parameters well from few samples and have received considerable attention. The dominant few-shot NER methods usually employ pre-trained language models (PLMs) as their basic architecture and fine-tune the model parameters with few NER samples. Since the sample size is small and PLMs have a large number of parameters, fine-tuning may leave the parameters of PLMs highly biased. To address this issue, this study introduces semantic distribution distance constraints to optimize the fine-tuning process of few-shot NER models and develops a framework named Semantic Constraints on few-shot Named Entity Recognition (SCNER). Specifically, the framework formulates the general knowledge transfer of PLMs as an optimal transport procedure with a semantic prior, and a Semantics-induced Optimal Transport (SOT) regularizer is developed to exploit the importance and similarities of tokens within sentences. SOT builds the semantic distribution of the sentence and defines transport costs between tokens to achieve a token-level optimal transport procedure. Finally, SOT is employed as a regularization term for few-shot NER, introducing a semantic distribution distance constraint that effectively transfers general knowledge from PLMs. Experiments on four public datasets demonstrate that the proposed method significantly improves the performance of NER models in both few-shot and fully supervised scenarios. SCNER is a general framework that can be applied to a variety of models without adding learning parameters, and it can enhance the generalization ability and adaptability of various few-shot NER models.
•This paper highlights the contribution of the semantic distribution distance constraints on general knowledge transfer.
•The proposed approach is a common framework without adding extra learning parameters.
•Extensive experiments show the approach significantly improves the performance of the baseline models.
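The token-level optimal transport at the heart of SOT can be sketched with a generic entropy-regularized Sinkhorn solver. This is a stand-in sketch: the paper's actual regularizer derives the cost matrix from token importance and similarity, whereas here `cost`, `eps`, and the iteration count are illustrative.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized optimal transport between two token
    distributions a and b, given a pairwise cost matrix."""
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * len(a)
    v = [1.0] * len(b)
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(len(b))) for i in range(len(a))]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(len(a))) for j in range(len(b))]
    # transport plan: how much mass moves from token i to token j
    return [[u[i] * K[i][j] * v[j] for j in range(len(b))] for i in range(len(a))]

def transport_distance(cost, a, b):
    """Total transport cost, usable as a distribution-distance regularizer."""
    plan = sinkhorn(cost, a, b)
    return sum(plan[i][j] * cost[i][j]
               for i in range(len(a)) for j in range(len(b)))
```

Used as a regularization term, a small `transport_distance` between the fine-tuned and pre-trained token distributions keeps fine-tuning from drifting too far from the PLM's general knowledge.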
With the advent of Web 2.0, many online platforms produce massive amounts of textual data. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from it. One approach to effectively harnessing this unstructured textual data is to transform it into structured text. Hence, this study presents an overview of approaches that can be applied to extract key insights from textual data in a structured way. To this end, Named Entity Recognition and Relation Extraction are the main focus of this review. The former deals with the identification of named entities, and the latter with the problem of extracting relations between sets of entities. This study covers early approaches as well as the developments made up to now using machine learning models. The survey findings conclude that deep-learning-based hybrid and joint models currently define the state of the art. It is also observed that annotated benchmark datasets for various textual-data generators, such as Twitter and other social forums, are not available, and this scarcity of datasets has resulted in relatively little progress in these domains. Additionally, the majority of state-of-the-art techniques are offline and computationally expensive. Last, with the increasing focus on deep-learning frameworks, there is a need to understand and explain the processes underlying deep architectures.
•Obtain a pre-trained BERT model of Chinese clinical records, public and available for the community.
•Incorporate dictionary features and radical features into a deep learning model, BERT + BiLSTM + CRF.
•Outperform all other methods on the CCKS-2017 and CCKS-2018 clinical named entity recognition datasets.
Clinical Named Entity Recognition (CNER) is a critical task which aims to identify and classify clinical terms in electronic medical records. In recent years, deep neural networks have achieved significant success in CNER. However, these methods require high-quality, large-scale labeled clinical data, which is challenging and expensive to obtain, especially for Chinese clinical records. To tackle the Chinese CNER task, we pre-train a BERT model on unlabeled Chinese clinical records, which leverages unlabeled domain-specific knowledge. Additional layers, Long Short-Term Memory (LSTM) and Conditional Random Field (CRF), are used to extract text features and decode the predicted tags, respectively. In addition, we propose a new strategy to incorporate dictionary features into the model, and radical features of Chinese characters are used to further improve performance. Our ensemble model outperforms state-of-the-art models, achieving, to the best of our knowledge, an 89.56% strict F1 score on the CCKS-2018 dataset and a 91.60% F1 score on the CCKS-2017 dataset.
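The CRF decoding layer that appears in this and several of the other architectures above reduces, at inference time, to Viterbi decoding over emission and transition scores. A minimal sketch (toy scores, not trained parameters):

```python
def viterbi_decode(emissions, transitions, tags):
    """Find the highest-scoring tag sequence given per-token emission
    scores and tag-to-tag transition scores (the CRF decoding step).
    emissions: list of {tag: score} dicts, one per token.
    transitions: {(prev_tag, tag): score}."""
    n = len(emissions)
    # score[t][tag] = best score of any path ending at position t with `tag`
    score = [dict() for _ in range(n)]
    back = [dict() for _ in range(n)]
    for tag in tags:
        score[0][tag] = emissions[0][tag]
    for t in range(1, n):
        for tag in tags:
            best_prev = max(tags, key=lambda p: score[t - 1][p] + transitions[(p, tag)])
            score[t][tag] = (score[t - 1][best_prev]
                             + transitions[(best_prev, tag)] + emissions[t][tag])
            back[t][tag] = best_prev
    last = max(tags, key=lambda tg: score[n - 1][tg])
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

A strongly negative transition score for an illegal move such as `O -> I` is how the CRF layer keeps BIO sequences well-formed even when per-token emissions disagree.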
Navigating the knowledge in Stack Overflow (SO) remains challenging. To make posts vivid to users, SO allows users to write and edit posts with Markdown or HTML, so that they can leverage various formatting styles (e.g., bold, italic, and code) to highlight important information. Nonetheless, there have been limited studies on this highlighted information.
In our recent study, we carried out the first large-scale exploratory study of the information highlighted in SO answers. To extend that work, we develop approaches to automatically recommend highlighted content with formatting styles, using neural network architectures originally designed for the Named Entity Recognition task.
In this paper, we studied 31,169,429 answers from Stack Overflow. To train recommendation models, we built CNN-based and BERT-based models for each type of formatting (i.e., Bold, Italic, Code, and Heading) using the information-highlighting dataset we collected from SO answers.
Our models achieve a precision ranging from 0.50 to 0.72 across formatting types. It is easier to build a model to recommend Code than the other types. Models for the text formatting types (i.e., Heading, Bold, and Italic) suffer from low recall. Our analysis of failure cases indicates that the majority are missed identifications. One explanation is that the models easily learn frequently highlighted words but struggle with less frequent ones (i.e., long-tail knowledge).
Our findings suggest that it is possible to develop recommendation models for highlighting information for answers with different formatting styles on Stack Overflow.
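Treating highlighting as an NER task requires converting formatted answer text into token-level BIO labels. A rough sketch of that preprocessing (the handling of only `**bold**` and `` `code` `` spans, and the whitespace tokenization, are simplifying assumptions, not the paper's exact procedure):

```python
import re

def markdown_to_bio(text):
    """Turn a Markdown snippet into (token, BIO-tag) pairs where tokens
    inside **bold** or `code` spans are labeled as highlighted."""
    pairs = []
    pattern = re.compile(r"\*\*(.+?)\*\*|`(.+?)`")
    pos = 0
    for m in pattern.finditer(text):
        for tok in text[pos:m.start()].split():
            pairs.append((tok, "O"))          # plain text outside any span
        span = m.group(1) or m.group(2)
        label = "BOLD" if m.group(1) else "CODE"
        for i, tok in enumerate(span.split()):
            pairs.append((tok, ("B-" if i == 0 else "I-") + label))
        pos = m.end()
    for tok in text[pos:].split():
        pairs.append((tok, "O"))
    return pairs
```

With labels in this shape, standard NER architectures can be trained per formatting type, as described above.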
Named entity recognition (NER) based on deep neural networks shows competitive performance when trained on large-scale human-annotated data. However, such models face challenges in low-resource settings, where labeled data are scarce. A typical solution is pseudo-labeling, which assigns pseudo-labels to the certain (i.e., high-confidence) tokens of unlabeled sentences while discarding the uncertain (i.e., low-confidence) ones. Two challenges remain, however: (1) discarding the uncertain tokens leads to low utilization of unlabeled data; and (2) pseudo-labeling with a confidence threshold suffers from an intrinsic quality-quantity trade-off. In this work, we propose Uncertainty-Aware Contrastive Learning (UACL) for semi-supervised named entity recognition. Specifically, UACL first utilizes a Gaussian-based class-wise token separation mechanism to dynamically distinguish certain from uncertain tokens, self-adaptively adjusting the confidence threshold to balance the quantity and quality of pseudo-labeled certain tokens. We then perform pseudo-supervised learning on the certain tokens and contrastive learning on the uncertain ones, which not only improves the utilization of unlabeled data but also provides uncertainty-aware guidance for model training. Furthermore, our method leverages uncertain tokens to optimize token representations, further improving performance. Extensive experimental results on four benchmarks demonstrate that our approach surpasses previously leading low-resource baselines.
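The class-wise, self-adaptive split between certain and uncertain tokens can be illustrated with a simple stand-in: set each class's threshold from the statistics of its own confidence scores rather than a fixed global cutoff. This is a sketch of the general idea; the paper's actual mechanism fits Gaussians per class, and the `mean - stdev` rule here is an illustrative simplification.

```python
import statistics

def split_certain_uncertain(token_probs):
    """token_probs: list of (token, predicted_class, confidence) triples.
    Per class, set the threshold to mean - stdev of that class's
    confidences, so the cutoff adapts to each class's score distribution."""
    by_class = {}
    for tok, cls, p in token_probs:
        by_class.setdefault(cls, []).append(p)
    thresholds = {
        cls: statistics.mean(ps) - (statistics.stdev(ps) if len(ps) > 1 else 0.0)
        for cls, ps in by_class.items()
    }
    certain = [(t, c) for t, c, p in token_probs if p >= thresholds[c]]
    uncertain = [(t, c) for t, c, p in token_probs if p < thresholds[c]]
    return certain, uncertain, thresholds
```

Certain tokens would then feed the pseudo-supervised loss, while uncertain tokens feed the contrastive objective instead of being discarded.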
Few-shot Named Entity Recognition (NER) systems aim to classify unseen named entity types from limited labeled examples. Significant progress has been made using large-scale pre-trained language models. However, boundary information and hidden relationships between entities beyond the sequence, which provide additional information and play crucial roles in few-shot NER, have received little attention in recent state-of-the-art methods. In this paper, we propose CEPTNER, a Contrastive learning Enhanced Prototypical network for Two-stage few-shot Named Entity Recognition, which leverages meta-learning and a prototypical network to identify unseen entity types with limited labeled data. Concretely, we first detect candidate entity boundaries in the boundary detection stage; then a prototypical network filters out false boundaries and assigns types to the remaining spans in the entity classification stage. Additionally, we conduct entity-level contrastive learning to explore the internal relationships between entities, which provides extra information while optimizing the prototypical network. CEPTNER is evaluated on two widely used few-shot NER datasets and a few-shot slot tagging dataset: Few-NERD, CrossNER, and SNIPS. Extensive experiments on these benchmarks show the superiority of CEPTNER over previous few-shot NER methods.
•A two-stage approach named CEPTNER is proposed to address the few-shot NER problem.
•Designing a span filter helps to remove falsely generated spans.
•Utilizing contrastive learning to explore hidden relationships is effective.
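The prototypical classification step in the second stage follows a standard pattern: each entity type's prototype is the mean of its support-set embeddings, and a query span is assigned to the nearest prototype. A minimal sketch with toy 2-dimensional embeddings (real embeddings would come from the encoder):

```python
def mean_vector(vectors):
    """Class prototype: coordinate-wise mean of support embeddings."""
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(len(vectors[0]))]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify_span(span_vec, support):
    """support: {entity_type: [embedding, ...]} built from the few
    labeled examples; the query span goes to the nearest prototype."""
    prototypes = {cls: mean_vector(vecs) for cls, vecs in support.items()}
    return min(prototypes, key=lambda cls: sq_dist(span_vec, prototypes[cls]))
```

The entity-level contrastive loss described above would additionally pull same-type span embeddings together and push different types apart, tightening these prototype clusters.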
OBJECTIVE: Biomedical Named Entity Recognition (bio NER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bio NER by considering additional external contexts. Unlike prior methods that mainly use the original input sequences for sequence labeling, the model takes additional contexts into account to enhance the representation of entities in the original sequences, since additional contexts can provide enhanced information for explaining biomedical concepts.
METHODS: To exploit an additional context, given an original input sequence, the model first retrieves relevant sentences from PubMed and then ranks the retrieved sentences to form the context. It next combines the context with the original input sequence to form a new, enhanced sequence. The original and enhanced sequences are fed into PubMedBERT to learn feature representations. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. The final named entity label prediction is made by a CRF layer. The model is trained jointly in an end-to-end manner to take advantage of the additional context for NER of the original sequence.
RESULTS: Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirm the contribution of additional contexts for bio NER.
CONCLUSION: The promising results confirm three important points. First, the additional context from PubMed helps to improve the quality of biomedical entity recognition. Second, PubMed is more appropriate than the Google search engine for providing relevant information for bio NER. Finally, relevant sentences in the context are more beneficial than irrelevant ones for providing enhanced information for the original input sequences. The model is flexible enough to integrate any additional context type for the NER task.
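The retrieve-rank-concatenate step in METHODS can be sketched with a simple lexical-overlap ranker. This is a stand-in for the paper's ranking model; the Jaccard scoring, `top_k` value, and `[SEP]` joining are illustrative assumptions.

```python
def overlap_score(query, sentence):
    """Jaccard similarity over lowercase word sets, a crude relevance proxy."""
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / max(len(q | s), 1)

def build_enhanced_sequence(original, retrieved, top_k=2, sep="[SEP]"):
    """Rank retrieved sentences by overlap with the original input and
    append the top-k as context, forming the enhanced sequence fed to
    the encoder."""
    ranked = sorted(retrieved, key=lambda s: overlap_score(original, s), reverse=True)
    return f" {sep} ".join([original] + ranked[:top_k])
```

The enhanced sequence and the original sequence would then both be encoded (here, by PubMedBERT with BiLSTM and CRF on top) so the extra context informs the labels of the original tokens.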