•We achieve named entity recognition in the machine reading comprehension framework.
•We explore the effect of different model components on named entity recognition.
•We use the query to introduce external knowledge and explore its impact on performance.
•Our method obtains state-of-the-art performance on six biomedical datasets.
Recognition of biomedical entities in the literature is a challenging research focus and the foundation for extracting the large amount of biomedical knowledge held in unstructured text into structured formats. Treating biomedical named entity recognition (BioNER) as a sequence labeling problem is currently the conventional approach; this method, however, often cannot take full advantage of the semantic information in the dataset, and its performance is not always satisfactory. In this work, instead of treating BioNER as a sequence labeling problem, we formulate it as a machine reading comprehension (MRC) problem. This formulation introduces more prior knowledge through well-designed queries and no longer needs decoding steps such as conditional random fields (CRFs). We conduct experiments on six BioNER datasets, and the results demonstrate the effectiveness of our method. It achieves state-of-the-art (SOTA) performance on the BC4CHEMD, BC5CDR-Chem, BC5CDR-Disease, NCBI-Disease, BC2GM and JNLPBA datasets, with F1-scores of 92.92%, 94.19%, 87.83%, 90.04%, 85.48% and 78.93%, respectively.
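To make the MRC formulation concrete, here is a minimal Python sketch that recasts one labeled BioNER sentence as a (query, context, answer-spans) instance; the query template, field names and helper function are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch: recasting a labeled BioNER sentence as an MRC instance.
# The query template and field names are hypothetical illustrations of the
# general idea, not the paper's exact query design.

def to_mrc_example(tokens, spans, entity_type, query_templates):
    """Turn one labeled sentence into a query/context/answers instance.

    tokens: list of words in the sentence.
    spans: (start, end) inclusive token indices of entities of entity_type.
    """
    query = query_templates[entity_type]   # the query carries prior knowledge
    return {
        "query": query,
        "context": " ".join(tokens),
        "start_positions": [s for s, _ in spans],
        "end_positions": [e for _, e in spans],
        "answers": [" ".join(tokens[s:e + 1]) for s, e in spans],
    }

# Hypothetical query encoding external knowledge about the entity type.
templates = {"Disease": "Find all disease mentions, i.e. disorders, syndromes "
                        "or abnormal medical conditions, in the text."}

example = to_mrc_example(
    ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."],
    [(4, 5)], "Disease", templates)
print(example["answers"])   # ['breast cancer']
```

A span-extraction model is then trained to predict start and end positions given the query-context pair, which is why no CRF decoding layer is required.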
•The lattice LSTM-CRF method is applied to Chinese clinical named entity recognition (CNER).
•We introduce adversarial training (AT) to CNER research, let alone Chinese CNER.
•We compare character embedding, word embedding and lattice LSTM for the CNER task.
•The model gains robustness while also achieving highly competitive performance.
Clinical named entity recognition (CNER), which aims to automatically detect clinical entities in electronic health records (EHRs), is a critical step for further clinical text mining. Recently, more and more deep learning models have been applied to Chinese CNER. However, these models do not make full use of the information in EHRs, because they are either word-based or character-based. In addition, neural models tend to be locally unstable, and even tiny perturbations may mislead them. In this paper, we first propose a novel adversarial-training-based lattice LSTM with a conditional random field layer (AT-lattice LSTM-CRF) for Chinese CNER. The lattice LSTM is used to capture richer information in EHRs. As a powerful regularization method, AT can improve the robustness of neural models by adding perturbations to the training data. We then conduct experiments with the proposed neural model on the CCKS-2017 Task 2 dataset. The results show that the proposed model achieves highly competitive performance (an F1-score of 89.64%) compared to other prevalent neural models and can serve as a strong baseline for further research in this field.
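The abstract does not spell out the perturbation scheme, so the PyTorch sketch below shows one common way adversarial training on input embeddings can be realized, a fast-gradient-style perturbation along the loss gradient; the toy model, dimensions and epsilon are illustrative assumptions rather than the paper's configuration.

```python
# Sketch of adversarial training on input embeddings (fast-gradient style):
# compute the clean loss, perturb the embeddings along the loss gradient,
# and add the loss on the perturbed copy. Illustrative, not the paper's code.
import torch
import torch.nn as nn

def adversarial_loss(model, loss_fn, embeds, labels, epsilon=1.0):
    embeds = embeds.clone().detach().requires_grad_(True)
    clean_loss = loss_fn(model(embeds), labels)
    clean_loss.backward()                       # gradient w.r.t. the embeddings
    r_adv = epsilon * embeds.grad / (embeds.grad.norm() + 1e-12)
    adv_loss = loss_fn(model(embeds.detach() + r_adv), labels)
    return clean_loss.detach() + adv_loss       # clean + adversarial terms

# Toy usage: a linear tagger over 8-dim "embeddings" of a 5-token sentence.
model = nn.Linear(8, 4)                         # 4 tag classes
embeds, labels = torch.randn(5, 8), torch.randint(0, 4, (5,))
loss = adversarial_loss(model, nn.CrossEntropyLoss(), embeds, labels)
loss.backward()                                 # grads for the optimizer step
```

Because the perturbation is bounded in norm and points in the direction that most increases the loss, it acts as the regularizer the abstract describes rather than as a data-corruption attack.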
•Radical-level features are used to enrich the semantic information of characters.
•A self-attention mechanism is used to capture the dependencies between characters.
•Our AR-CCNER model outperforms previous state-of-the-art models.
Named entity recognition is a fundamental and crucial task in medical natural language processing. In the medical field, Chinese clinical named entity recognition identifies the boundaries and types of medical entities in unstructured text such as electronic medical records. Recently, a composition of bidirectional Long Short-Term Memory networks and a conditional random field (BiLSTM-CRF) based on character-level semantics has achieved great success in Chinese clinical named entity recognition tasks. However, this method only captures contextual semantics between the characters in a sentence; Chinese characters are ideographic, with deeper semantic information hidden inside them, and the BiLSTM-CRF model fails to exploit this information. In addition, some entities in a sentence depend on one another, but the Long Short-Term Memory (LSTM) network does not capture long-range dependencies between characters perfectly. We therefore propose a BiLSTM-CRF model based on radical-level features and a self-attention mechanism to solve these problems. We use a convolutional neural network (CNN) to extract radical-level features, aiming to capture the intrinsic, internal relevance of characters. In addition, we use a self-attention mechanism to capture dependencies between characters regardless of their distance. Experiments show that our model achieves F1-scores of 93.00% and 86.34% on the CCKS-2017 and TP_CNER datasets, respectively.
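A compact PyTorch sketch of the two ideas, a CNN pooled over each character's radical embeddings, concatenated with the character embedding and fed through self-attention, is given below; all vocabulary sizes, dimensions and the fixed radicals-per-character layout are illustrative assumptions, not the paper's architecture details.

```python
# Sketch: radical-level CNN features plus self-attention over characters.
# Vocabulary sizes, dimensions and radicals-per-character are illustrative.
import torch
import torch.nn as nn

class RadicalAttnEncoder(nn.Module):
    def __init__(self, n_chars=3000, n_radicals=300, char_dim=64,
                 rad_dim=32, heads=4):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.rad_emb = nn.Embedding(n_radicals, rad_dim)
        # CNN over the radical sequence of each character.
        self.rad_cnn = nn.Conv1d(rad_dim, rad_dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(char_dim + rad_dim, heads,
                                          batch_first=True)

    def forward(self, chars, radicals):
        # chars: (batch, seq); radicals: (batch, seq, radicals_per_char)
        b, s, r = radicals.shape
        rad = self.rad_emb(radicals).view(b * s, r, -1).transpose(1, 2)
        rad = torch.relu(self.rad_cnn(rad)).max(dim=2).values   # max-pool
        rad = rad.view(b, s, -1)
        x = torch.cat([self.char_emb(chars), rad], dim=-1)
        out, _ = self.attn(x, x, x)   # dependencies regardless of distance
        return out                    # would feed a CRF tagging layer

enc = RadicalAttnEncoder()
h = enc(torch.randint(0, 3000, (2, 10)), torch.randint(0, 300, (2, 10, 5)))
print(h.shape)   # torch.Size([2, 10, 96])
```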
Few-shot named entity recognition (NER) exploits a limited number of annotated instances to identify named mentions. Effectively transferring internal or external resources thus becomes the key to few-shot NER. While existing prompt tuning methods have shown remarkable few-shot performance, they still fail to make full use of knowledge. In this work, we investigate the integration of rich knowledge into prompt tuning for stronger few-shot NER. We propose incorporating the deep prompt tuning framework with threefold knowledge (namely TKDP): the internal (1) context knowledge and the external (2) label knowledge and (3) sememe knowledge. TKDP encodes these three feature sources and incorporates them into soft prompt embeddings, which are further injected into an existing pre-trained language model to facilitate predictions. On five benchmark datasets, our knowledge-enriched model improves by at most 11.53% F1 over the raw deep prompt method and significantly outperforms 9 strong baseline systems in the 5-/10-/20-shot settings, showing great potential in few-shot NER. Our TKDP framework can be broadly adapted to other few-shot tasks without much effort.
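As a rough illustration of knowledge-enriched soft prompts in the spirit of TKDP, the sketch below projects three knowledge vectors into prompt embeddings and prepends them to a model's token embeddings; all shapes, the shared prompt length and the linear projections are assumptions, not the paper's implementation.

```python
# Sketch: three knowledge sources (context, label, sememe) projected into
# soft prompt embeddings and prepended to the token embeddings of a
# (typically frozen) language model. Shapes are illustrative.
import torch
import torch.nn as nn

class KnowledgePrompt(nn.Module):
    def __init__(self, know_dim=128, model_dim=768, prompt_len=8):
        super().__init__()
        # One projection per knowledge source: context, label, sememe.
        self.proj = nn.ModuleList(
            [nn.Linear(know_dim, model_dim * prompt_len) for _ in range(3)])
        self.prompt_len, self.model_dim = prompt_len, model_dim

    def forward(self, token_embeds, context_k, label_k, sememe_k):
        # token_embeds: (batch, seq, model_dim); *_k: (batch, know_dim)
        prompts = sum(p(k) for p, k in
                      zip(self.proj, (context_k, label_k, sememe_k)))
        prompts = prompts.view(-1, self.prompt_len, self.model_dim)
        return torch.cat([prompts, token_embeds], dim=1)  # prepend prompts

kp = KnowledgePrompt()
out = kp(torch.randn(2, 16, 768), *(torch.randn(2, 128) for _ in range(3)))
print(out.shape)   # torch.Size([2, 24, 768])
```

Only the projection layers need gradient updates, which is what makes prompt tuning attractive in the few-shot regime.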
•The NCBI disease corpus is built as a gold-standard resource for disease recognition.
•793 PubMed abstracts are annotated with disease mentions and concepts (MeSH/OMIM).
•14 annotators produced a high consistency level and inter-annotator agreement.
•Normalization benchmark results demonstrate the utility of the corpus.
•The corpus is publicly available to the community.
Information encoded in natural language in biomedical literature publications is only useful if efficient and reliable ways of accessing and analyzing that information are available. Natural language processing and text mining tools are therefore essential for extracting valuable information. However, the development of powerful, highly effective tools to automatically detect central biomedical concepts such as diseases is conditional on the availability of annotated corpora.
This paper presents the disease name and concept annotations of the NCBI disease corpus, a collection of 793 PubMed abstracts fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community. Each PubMed abstract was manually annotated by two annotators with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM®). Manual curation was performed using PubTator, which allowed the use of pre-annotations as a pre-step to manual annotations. Fourteen annotators were randomly paired and differing annotations were discussed for reaching a consensus in two annotation phases. In this setting, a high inter-annotator agreement was observed. Finally, all results were checked against annotations of the rest of the corpus to assure corpus-wide consistency.
The public release of the NCBI disease corpus contains 6892 disease mentions mapped to 790 unique disease concepts. Of these, 88% link to a MeSH identifier, while the rest contain an OMIM identifier. We were able to link 91% of the mentions to a single disease concept, while the rest are described as a combination of concepts. To help researchers use the corpus to design and test disease identification methods, we have prepared the corpus as training, testing and development sets. To demonstrate its utility, we conducted a benchmarking experiment comparing three knowledge-based disease normalization methods, the best of which achieved an F-measure of 63.7%. These results show that the NCBI disease corpus has the potential to significantly improve the state of the art in disease name recognition and normalization research by providing a high-quality gold standard, thus enabling the development of machine-learning-based approaches for such tasks.
The NCBI disease corpus, guidelines and other associated resources are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/.
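For readers new to the normalization task benchmarked above, the following sketch shows the simplest knowledge-based approach: a lexically normalized dictionary lookup from a disease mention to a concept identifier. The two lexicon entries are illustrative placeholders; a real system would load the full MeSH/OMIM vocabularies and add fuzzier matching.

```python
# Illustrative sketch of a basic knowledge-based normalization step:
# case/punctuation-normalized dictionary lookup from mention to concept ID.
# The lexicon entries below are illustrative, not a real resource.
import re

lexicon = {   # normalized surface form -> concept identifier
    "breast cancer": "MeSH:D001943",
    "colorectal cancer": "MeSH:D015179",
}

def normalize(mention):
    """Lowercase, strip punctuation, collapse whitespace, then look up."""
    key = re.sub(r"[^a-z0-9 ]", " ", mention.lower())
    key = re.sub(r"\s+", " ", key).strip()
    return lexicon.get(key)   # None when the mention is out of the lexicon

print(normalize("Breast Cancer"))    # MeSH:D001943
print(normalize("breast-cancer"))    # MeSH:D001943
```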
With the rapid development of natural language processing technologies, improving the accuracy of named entity recognition, an upstream task of natural language processing, is of great significance for subsequent text-processing tasks. However, because of the differences between the Chinese and English languages, it is difficult to transfer the results of English named entity recognition research effectively to Chinese. Therefore, the key issues in current Chinese named entity recognition research are analyzed from the following four aspects. Firstly, taking the development of named entity recognition as the main thread, the advantages and disadvantages, common methods and research results of each stage are comprehensively discussed. Secondly, Chinese text preprocessing methods are summarized from the perspectives of sequence annotation, evaluation metrics, Chinese word segmentation methods and datasets. Then, aiming at methods for fusing Chinese character and word features, …
As an essential task for the architecture, engineering, and construction (AEC) industry, information processing and acquisition from unstructured textual data based on natural language processing (NLP) is gaining increasing attention. Although deep learning (DL) models for NLP tasks have been investigated for years, domain-specific pretrained DL models and their advantages are seldom investigated in the AEC domain. Therefore, this work developed large-scale domain corpora and pretrained domain-specific language models for the AEC domain, and then systematically explored various transfer learning and fine-tuning techniques to assess the performance of pretrained DL models on various NLP tasks. First, both in-domain and close-domain Chinese corpora were developed. Then, two types of pretrained models, static word embedding models and contextual word embedding models, were pretrained on the various domain corpora. Finally, several widely used DL models for NLP tasks were further trained and tested on the various pretrained models. The results show that domain corpora can further improve the performance of both static and contextual word embedding-based DL models in text classification (TC) and named entity recognition (NER) tasks. Meanwhile, contextual word embedding-based DL models significantly outperform static word embedding-based DL methods in TC and NER tasks, with maximum improvements of 8.1% and 3.8% in F1-score, respectively. This research contributes to the body of knowledge in two ways: (1) demonstrating the advantages of domain corpora and pretrained DL models, and (2) releasing the first domain-specific dataset and pretrained language models, named ARCBERT, for the AEC domain. Thus, this work sheds light on the adoption and application of pretrained models in the AEC domain.
•The first domain corpora for the architecture, engineering, and construction (AEC) domain are proposed.
•The first domain-specific pretrained language model for typical natural language processing (NLP) tasks in AEC is proposed.
•Systematic experiments illustrate the effect of domain corpora and transfer learning techniques.
•The proposed ARCBERT outperforms static word embedding-based methods on all typical NLP tasks, increasing F1 by up to 8.1%.
•The performance of deep learning models is improved without increasing manual annotation effort.
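A hedged sketch of the contextual-model fine-tuning setup described above, using the Hugging Face Transformers token-classification API, follows; the checkpoint path "path/to/arcbert" is a placeholder for a locally pretrained model rather than a published model ID, and the label count is an assumption.

```python
# Sketch: fine-tuning a domain-pretrained encoder for NER with a
# token-classification head. "path/to/arcbert" is a placeholder path.
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("path/to/arcbert")
model = AutoModelForTokenClassification.from_pretrained(
    "path/to/arcbert", num_labels=9)   # e.g. BIO tags for 4 entity types + O

inputs = tokenizer("基坑支护结构出现渗漏水", return_tensors="pt")
logits = model(**inputs).logits        # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)       # per-token label IDs, decoded to BIO
```

Training then proceeds with the usual cross-entropy loss over token labels; freezing lower encoder layers is one of the transfer learning variants such experiments typically compare.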
Clinical concept extraction using transformers
Yang, Xi; Bian, Jiang; Hogan, William R., et al.
Journal of the American Medical Informatics Association (JAMIA), 12/2020, Volume 27, Issue 12
Journal article, peer reviewed, open access
The goal of this study is to explore transformer-based models (e.g., Bidirectional Encoder Representations from Transformers [BERT]) for clinical concept extraction and to develop an open-source package with pretrained clinical models to facilitate concept extraction and other downstream natural language processing (NLP) tasks in the medical domain.
We systematically explored 4 widely used transformer-based architectures, including BERT, RoBERTa, ALBERT, and ELECTRA, for extracting various types of clinical concepts using 3 public datasets from the 2010 and 2012 i2b2 challenges and the 2018 n2c2 challenge. We examined general transformer models pretrained on general English corpora as well as clinical transformer models pretrained on a clinical corpus, and compared them with a long short-term memory conditional random fields (LSTM-CRFs) model as a baseline. Furthermore, we integrated the 4 clinical transformer-based models into an open-source package.
The RoBERTa-MIMIC model achieved state-of-the-art performance on 3 public clinical concept extraction datasets with F1-scores of 0.8994, 0.8053, and 0.8907, respectively. Compared to the baseline LSTM-CRFs model, RoBERTa-MIMIC remarkably improved the F1-score by approximately 4% and 6% on the 2010 and 2012 i2b2 datasets. This study demonstrated the efficiency of transformer-based models for clinical concept extraction. Our methods and systems can be applied to other clinical tasks. The clinical transformer package with 4 pretrained clinical models is publicly available at https://github.com/uf-hobi-informatics-lab/ClinicalTransformerNER. We believe this package will improve current practice on clinical concept extraction and other tasks in the medical domain.
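For context on how such scores are computed, the sketch below implements the strict entity-level F1 that clinical concept extraction challenges conventionally report, counting a prediction as correct only when both span and type match a gold entity; each challenge's official scorer may differ in details, so this is an assumption-level illustration.

```python
# Sketch: strict entity-level F1. A predicted entity counts as a true
# positive only if its document, span boundaries and type all match gold.
def entity_f1(gold_spans, pred_spans):
    """gold_spans/pred_spans: sets of (doc_id, start, end, type) tuples."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("doc1", 0, 2, "PROBLEM"), ("doc1", 7, 9, "TREATMENT")}
pred = {("doc1", 0, 2, "PROBLEM"), ("doc1", 7, 8, "TREATMENT")}  # 1 boundary miss
print(round(entity_f1(gold, pred), 2))   # 0.5
```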
We present a multilingual named entity recognition approach based on a robust and general set of features across languages and datasets. Our system combines shallow local information with clustering-based semi-supervised features induced over large amounts of unlabeled text. Understanding via empirical experimentation how to effectively combine various types of clustering features allows us to seamlessly export our system to other datasets and languages. The result is a simple but highly competitive system that obtains state-of-the-art results across five languages and twelve datasets. Results are reported on standard shared-task evaluation data such as CoNLL for English, Spanish and Dutch. Furthermore, despite the lack of linguistically motivated features, we also report the best results for languages such as Basque and German. In addition, we demonstrate that our method obtains very competitive results even when the amount of supervised data is cut in half, alleviating the dependency on manually annotated data. Finally, the results show that our emphasis on clustering features is crucial for developing robust out-of-domain models. The system and models are freely available to facilitate their use and guarantee the reproducibility of results.
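As an illustration of the feature combination described above, the sketch below builds per-token feature dictionaries mixing shallow surface cues with prefixes of a word's Brown-cluster bit string at several depths; the cluster strings and prefix lengths are made-up examples, not the system's actual resources.

```python
# Sketch: shallow local features plus Brown-cluster prefix features for one
# token, as fed to a linear sequence model. Cluster bit strings are made up.
brown_clusters = {"aspirin": "110100101", "London": "0111010001"}

def token_features(tokens, i):
    w = tokens[i]
    feats = {
        "lower": w.lower(),
        "is_title": w.istitle(),
        "prefix3": w[:3],
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }
    bits = brown_clusters.get(w, "")
    for p in (4, 6, 10):              # cluster prefixes at several depths
        if bits:
            feats[f"brown_{p}"] = bits[:p]
    return feats

print(token_features(["took", "aspirin", "today"], 1))
```

Using several prefix lengths lets the model back off from fine-grained clusters to coarser ones, which is one reason cluster features generalize well out of domain.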