Abstract
Because a computer cannot directly understand the text corpora used in NLP tasks, the first step is to represent the characteristics of natural language numerically, and word-vector technology provides a good way to do so. Because Word2vec considers context and has fewer dimensions, it is now one of the more popular word embeddings. However, due to the particularities of Chinese, word2vec cannot accurately identify polysemous words. In this paper, a lightweight and effective method is used to merge vocabulary information into the character representation. This approach avoids designing complex sequence-modeling architectures: for any neural network model, simply fine-tuning the character input layer can introduce vocabulary information. The model also uses a modified LSTM to bridge the gap between the LSTM and Transformer models. The interaction between input and context provides a richer modeling space and significantly improves performance on all four public datasets.
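As a rough illustration of the lexicon-merging idea described above, the following is a minimal sketch, not the paper's implementation: each character's vector is concatenated with the mean vector of all lexicon words covering that position, and the result can feed any sequence tagger. The toy lexicon and random 8-dimensional vectors are stand-ins.

```python
import numpy as np

# Toy stand-ins: 8-d random vectors for characters and lexicon words.
DIM = 8
char_vecs = {c: np.random.rand(DIM) for c in "南京市长江大桥"}
lexicon = {w: np.random.rand(DIM) for w in ["南京", "南京市", "市长", "长江", "大桥"]}

def augment(sentence):
    """Concatenate each character's vector with the mean vector of all
    lexicon words that cover that position (zeros if none match)."""
    rows = []
    for i, ch in enumerate(sentence):
        covering = [lexicon[sentence[s:e]]
                    for s in range(i + 1)
                    for e in range(i + 1, len(sentence) + 1)
                    if sentence[s:e] in lexicon]
        word_part = np.mean(covering, axis=0) if covering else np.zeros(DIM)
        rows.append(np.concatenate([char_vecs[ch], word_part]))
    return np.stack(rows)  # (len(sentence), 2 * DIM): input to any tagger

print(augment("南京市长江大桥").shape)  # (7, 16)
```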
•Propose a technically simple model using character-level features and an attention method.
•Effective word representation through the design of a high-level feature embedding.
•Comparison with state-of-the-art methods demonstrates the empirical strength of the study.
With the rapid advancement of technology and the necessity of processing large amounts of data, biomedical Named Entity Recognition (NER) has become an essential technique for information extraction in the biomedical field. NER, which is a sequence-labeling task, has been performed using various traditional techniques including dictionary-, rule-, machine learning-, and deep learning-based methods. However, as existing biomedical NER models are insufficient to handle new and unseen entity types from the growing biomedical data, the development of more effective and accurate biomedical NER models is being widely researched. Among biomedical NER models utilizing deep learning approaches, there have been only a few studies involving the design of high-level features in the embedding layer. In this regard, herein, we propose a deep learning NER model that effectively represents biomedical word tokens through the design of a combinatorial feature embedding. The proposed model is based on Bidirectional Long Short-Term Memory (bi-LSTM) with Conditional Random Field (CRF) and enhanced by integrating two different character-level representations extracted from a Convolutional Neural Network (CNN) and bi-LSTM. Additionally, an attention mechanism is applied to the model to focus on the relevant tokens in the sentence, which alleviates the long-term dependency problem of the LSTM model and allows effective recognition of entities. The proposed model was evaluated on two benchmark datasets, JNLPBA and NCBI-Disease, and a comparative analysis with existing models was performed. The proposed model achieved relatively high performance with an F1-score of 86.93% on NCBI-Disease, and competitive performance on JNLPBA with an F1-score of 75.31%.
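The combinatorial character-level embedding described above can be pictured with a short PyTorch sketch. This is a hedged reconstruction, not the authors' code: dimensions, vocabulary sizes, and the max-pooling choice are guesses, and the module's output would feed the bi-LSTM-CRF tagger.

```python
import torch
import torch.nn as nn

class ComboCharEmbedding(nn.Module):
    """Fuse CNN- and bi-LSTM-derived character features with a word
    embedding, as the abstract describes (all dims are illustrative)."""
    def __init__(self, n_chars=100, n_words=10000, c_dim=25, w_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.word_emb = nn.Embedding(n_words, w_dim)
        self.conv = nn.Conv1d(c_dim, c_dim, kernel_size=3, padding=1)
        self.char_lstm = nn.LSTM(c_dim, c_dim, bidirectional=True,
                                 batch_first=True)

    def forward(self, word_ids, char_ids):
        # char_ids: (batch, seq_len, max_word_len)
        b, s, l = char_ids.shape
        c = self.char_emb(char_ids).view(b * s, l, -1)
        cnn_feat = torch.max(self.conv(c.transpose(1, 2)), dim=2).values
        _, (h, _) = self.char_lstm(c)                 # h: (2, b*s, c_dim)
        lstm_feat = torch.cat([h[0], h[1]], dim=-1)   # (b*s, 2*c_dim)
        chars = torch.cat([cnn_feat, lstm_feat], -1).view(b, s, -1)
        # (batch, seq_len, w_dim + 3*c_dim) -> feed to bi-LSTM-CRF
        return torch.cat([self.word_emb(word_ids), chars], dim=-1)
```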
Mineral named entity recognition (MNER) is the extraction of specific types of entities from unstructured Chinese mineral text, which is a prerequisite for building a mineral knowledge graph. MNER can also provide important data support for work related to mineral resources. Chinese mineral text has many types of entities, complex semantics, and a large number of rare characters. To extract entities from the Chinese mineral literature, this paper proposes an MNER model based on deep learning. To create word embeddings for mineral text, Bidirectional Encoder Representations from Transformers (BERT) is used. Moreover, the transition matrix of the Conditional Random Field (CRF) algorithm is incorporated to improve the accuracy of sequence labeling. Finally, experiments are conducted on the constructed dataset. The results show that the model can effectively recognize seven mineral entity types with an average F1-score of 0.842.
•Present a BERT-based model for Chinese mineral named entity recognition.
•Realize the extraction of seven kinds of mineral entities.
•Adopt multiple approaches to optimize the NER model.
•Construct a Chinese corpus for pre-trained models in the mineral field.
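To make the role of the CRF transition matrix concrete, here is a minimal NumPy sketch of Viterbi decoding over emission scores such as those a BERT encoder would produce. It illustrates the standard mechanism, not this paper's exact implementation.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Decode the best tag path. emissions: (seq_len, n_tags) log-scores
    per token; transitions: (n_tags, n_tags) log-scores for tag bigrams."""
    n, t = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, t), dtype=int)
    for i in range(1, n):
        # total[prev, cur] = score[prev] + transitions[prev, cur] + emissions[i, cur]
        total = score[:, None] + transitions + emissions[i]
        back[i] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1]

em = np.log(np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]))
tr = np.log(np.array([[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.3, 0.1, 0.6]]))
print(viterbi(em, tr))  # [0, 1, 2]
```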
We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia. Most NER systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes.
We first classify each Wikipedia article into named entity (NE) types, training and evaluating on 7200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy.
We transform the links between articles into NE annotations by projecting the target article's classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards.
We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic NE annotation (Richman and Schone, 2008; Mika et al., 2008); competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.
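The link-projection step can be sketched in a few lines of Python. This is an illustrative simplification: a toy article-type table, naive whitespace tokenisation, and none of the link inference or corpus tweaking described above.

```python
import re

# Hypothetical output of the article-classification stage.
article_types = {"Barack Obama": "PER", "Berlin": "LOC", "Siemens": "ORG"}

def project(wikitext):
    """Turn [[Target|anchor]] / [[Target]] links into BIO NE annotations
    by projecting the target article's class onto the anchor text."""
    tokens, tags = [], []
    for chunk in re.split(r"(\[\[[^\]]+\]\])", wikitext):
        m = re.match(r"\[\[([^|\]]+)\|?([^\]]*)\]\]", chunk)
        if m:
            target, anchor = m.group(1), m.group(2) or m.group(1)
            tag = article_types.get(target, "O")
            for j, tok in enumerate(anchor.split()):
                tokens.append(tok)
                tags.append("O" if tag == "O" else ("B-" if j == 0 else "I-") + tag)
        else:
            for tok in chunk.split():
                tokens.append(tok)
                tags.append("O")
    return list(zip(tokens, tags))

print(project("[[Barack Obama]] visited [[Berlin]] in 2013 ."))
# [('Barack', 'B-PER'), ('Obama', 'I-PER'), ('visited', 'O'),
#  ('Berlin', 'B-LOC'), ('in', 'O'), ('2013', 'O'), ('.', 'O')]
```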
•We incorporate dictionaries into deep neural networks for Chinese clinical NER (CNER).
•Two architectures and five feature representation schemes are proposed.
•Our proposed method is able to handle rare and unseen entities.
•Our proposed method is highly competitive with state-of-the-art deep learning methods.
Clinical named entity recognition aims to identify and classify clinical terms such as diseases, symptoms, treatments, exams, and body parts in electronic health records, which is a fundamental and crucial task for clinical and translational research. In recent years, deep neural networks have achieved significant success in named entity recognition and many other natural language processing tasks. Most of these algorithms are trained end to end, and can automatically learn features from large-scale labeled datasets. However, these data-driven methods typically lack the capability of processing rare or unseen entities. Previous statistical methods and feature engineering practice have demonstrated that human knowledge can provide valuable information for handling rare and unseen cases. In this paper, we propose a new model which combines data-driven deep learning approaches and knowledge-driven dictionary approaches. Specifically, we incorporate dictionaries into deep neural networks. In addition, two different architectures that extend the bi-directional long short-term memory neural network and five different feature representation schemes are also proposed to handle the task. Computational results on the CCKS-2017 Task 2 benchmark dataset show that the proposed method achieves highly competitive performance compared with state-of-the-art deep learning methods.
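As one concrete, hypothetical example of a dictionary feature representation scheme, the sketch below tags each character with a B/I/E/S/O flag for the longest dictionary match covering it; such flags would be embedded and concatenated with character embeddings. The paper proposes five schemes, and this one is only illustrative.

```python
# Toy clinical dictionary; real systems would use large curated lexicons.
DICT = {"糖尿病", "高血压", "头痛"}

def dict_features(sentence, max_len=4):
    """Tag each character with B/I/E/S (inside a dictionary match) or O;
    these flags would be embedded alongside the character embeddings."""
    feats = ["O"] * len(sentence)
    for i in range(len(sentence)):
        for l in range(max_len, 0, -1):        # prefer the longest match
            w = sentence[i:i + l]
            if len(w) == l and w in DICT and all(f == "O" for f in feats[i:i + l]):
                feats[i:i + l] = ["S"] if l == 1 else ["B"] + ["I"] * (l - 2) + ["E"]
                break
    return feats

print(dict_features("患者有糖尿病和头痛"))
# ['O', 'O', 'O', 'B', 'I', 'E', 'O', 'B', 'E']
```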
•Multimodal embeddings are used to explore useful information.
•A data augmentation method is proposed for more semantic information.
•Multi-level CNN is constructed to fuse short-term and long-term information.
•An attention mechanism is designed to capture global context information.
Named entity recognition (NER) is a fundamental task in Chinese natural language processing (NLP). Recently, Chinese clinical NER has also attracted continuous research attention because it is an essential preparation for clinical data mining. The prevailing deep learning method for Chinese clinical NER is based on the long short-term memory (LSTM) network. However, the recurrent structure of LSTM makes it difficult to utilize GPU parallelism, which to some extent lowers model efficiency. Besides, when the sentence is long, LSTM can hardly capture global context information. To address these issues, we propose a novel and efficient model based entirely on convolutional neural networks (CNN), which can fully utilize GPU parallelism to improve model efficiency. Moreover, we construct a multi-level CNN to capture short-term and long-term context information. We also design a simple attention mechanism to obtain global context information, which is conducive to improving model performance in sequence labeling tasks. Besides, a data augmentation method is proposed to expand the data volume and explore more semantic information. Extensive experiments show that our model achieves competitive performance with higher efficiency compared with other remarkable clinical NER models.
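A hedged PyTorch sketch of the overall shape of such a model follows; the dilation rates, dimensions, and single-head additive attention are illustrative guesses, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelCNNTagger(nn.Module):
    """Stacked dilated convolutions widen the receptive field (short- to
    long-term context); attention adds a global context vector."""
    def __init__(self, vocab=5000, dim=128, n_tags=10):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, 3, padding=d, dilation=d) for d in (1, 2, 4))
        self.att = nn.Linear(dim, 1)
        self.out = nn.Linear(2 * dim, n_tags)

    def forward(self, ids):                      # ids: (batch, seq_len)
        h = self.emb(ids).transpose(1, 2)        # (batch, dim, seq_len)
        for conv in self.convs:                  # multi-level context
            h = F.relu(conv(h))
        h = h.transpose(1, 2)                    # (batch, seq_len, dim)
        a = torch.softmax(self.att(h), dim=1)    # attention over positions
        g = (a * h).sum(1, keepdim=True).expand_as(h)  # global context
        return self.out(torch.cat([h, g], -1))   # per-token tag scores
```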
Instant analysis of cybersecurity reports is a fundamental challenge for security experts, as an immeasurable amount of cyber information is generated on a daily basis, which necessitates automated information extraction tools to facilitate querying and retrieval of data. Hence, we present Open-CyKG: an Open Cyber Threat Intelligence (CTI) Knowledge Graph (KG) framework that is constructed using an attention-based neural Open Information Extraction (OIE) model to extract valuable cyber threat information from unstructured Advanced Persistent Threat (APT) reports. More specifically, we first identify relevant entities by developing a neural cybersecurity Named Entity Recognizer (NER) that aids in labeling relation triples generated by the OIE model. Afterwards, the extracted structured data is canonicalized to build the KG by employing fusion techniques using word embeddings. As a result, security professionals can execute queries to retrieve valuable information from the Open-CyKG framework. Experimental results demonstrate that our proposed components that build up Open-CyKG outperform state-of-the-art models. Our implementation of Open-CyKG is publicly available at https://github.com/IS5882/Open-CyKG.
•We design an attention-based Open Information Extraction (OIE) model.
•We develop a Named Entity Recognition (NER) model to label cybersecurity terms.
•We present Open-CyKG as an Open Cyber Threat Intelligence Knowledge Graph.
•We canonicalize the Knowledge Graph (KG) using contextualized word embeddings.
•We demonstrate Information Retrieval (IR) from Open-CyKG with two example queries.
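To illustrate the canonicalization step, here is a toy sketch that greedily merges entity mentions whose embedding cosine similarity exceeds a threshold. The vectors, names, and threshold are stand-ins for the contextualized embeddings and fusion technique the framework actually uses.

```python
import numpy as np

# Hypothetical contextualized embeddings of extracted mentions (toy 3-d).
mentions = {
    "Microsoft Corp": np.array([0.90, 0.10, 0.20]),
    "Microsoft":      np.array([0.88, 0.12, 0.19]),
    "APT28":          np.array([0.10, 0.95, 0.30]),
}

def canonicalize(mentions, threshold=0.95):
    """Greedily attach each mention to the first cluster whose representative
    vector (its first member's, unit-normalised) is cosine-similar enough."""
    clusters = []  # list of (representative_vector, member_names)
    for name, v in mentions.items():
        v = v / np.linalg.norm(v)
        for rep, names in clusters:
            if float(rep @ v) >= threshold:
                names.append(name)
                break
        else:
            clusters.append((v, [name]))
    return [names for _, names in clusters]

print(canonicalize(mentions))  # [['Microsoft Corp', 'Microsoft'], ['APT28']]
```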
Hazardous chemicals are widely used in the production activities of the chemical industry. The risk management of hazardous chemicals is critical to the safety of life and property. Hence, the effective risk management of hazardous chemicals has always been important to the chemical industry. Since a large quantity of knowledge and information on hazardous chemicals is stored in isolated databases, it is challenging to manage hazardous chemicals in an information-rich manner. Herein, we propose a knowledge graph to overcome the information gap between decentralized databases, which improves hazardous chemical management. In the implementation of the knowledge graph, we design an ontology schema for hazardous chemicals management. To help enterprises master the knowledge in the full lifecycle of hazardous chemicals, including production, transportation, storage, etc., we jointly use data from companies and open data from the public domain of hazardous chemicals to construct the knowledge graph. The named entity recognition task is one of the key tasks in the implementation of the knowledge graph, and is of great significance for extracting entity information from unstructured data, namely hazardous chemical accident records. To extract useful information from multi-source data, we adopt the pre-trained BERT-CRF model to conduct named entity recognition on incident records. The model achieves good results, demonstrating its effectiveness for named entity recognition in the chemical industry.
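A minimal PyTorch sketch of a BERT-CRF tagger of this general kind is shown below. It assumes the Hugging Face transformers library and the third-party pytorch-crf package; the checkpoint name and wiring are illustrative, not this study's exact setup.

```python
import torch.nn as nn
from torchcrf import CRF          # pip install pytorch-crf (third-party)
from transformers import AutoModel

class BertCrfTagger(nn.Module):
    """Minimal sketch: BERT token states -> linear emissions -> CRF layer."""
    def __init__(self, n_tags, model_name="bert-base-chinese"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.fc = nn.Linear(self.bert.config.hidden_size, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        h = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.fc(h)
        mask = attention_mask.bool()
        if tags is not None:               # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)   # inference: tag paths
```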
In the context of the escalating use of social media in Arabic-speaking countries, driven by improved internet access, affordable smartphones, and a growing digital connectivity trend, this study addresses a significant challenge: the widespread dissemination of fake news. The ease and rapidity of spreading information on social media, coupled with a lack of stringent fact-checking measures, exacerbate the issue of misinformation. Our study examines how language features, especially Named Entity Recognition (NER) features, play a role in detecting fake news. We built two models: an AraBERT Multi-task Learning (MTL)-based model for classifying Arabic fake news, and a token classification model that focuses on fake news NER features. The study combines embedding vectors from these models using an embedding fusion technique and applies machine learning algorithms for fake news detection in Arabic. We also introduce a feature selection algorithm named RLTTAO, which improves the performance of the Triangulation Topology Aggregation Optimizer (TTAO) using Reinforcement Learning and random opposition-based learning; selecting relevant features in this way further improves the fusion process. Our results show that incorporating NER features enhances the accuracy of fake news detection on 5 out of 7 datasets, with an average improvement of 1.62%.
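As a toy illustration of embedding fusion, the sketch below simply concatenates sentence-level and NER-feature vectors before a scikit-learn classifier. The dimensions and random data are placeholders, and the RLTTAO selection step is not shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins: 768-d AraBERT sentence vectors and 64-d NER-feature
# vectors from the token-classification model (random toy data here).
rng = np.random.default_rng(0)
sent_emb = rng.normal(size=(200, 768))
ner_emb = rng.normal(size=(200, 64))
labels = rng.integers(0, 2, size=200)               # 1 = fake, 0 = real

fused = np.concatenate([sent_emb, ner_emb], axis=1)  # embedding fusion
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.score(fused, labels))
```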
•The research team developed a tool to automatically estimate the impact of Industry 4.0 on job profiles and skills.
•The state of the art indicates text mining as an effective support for achieving the research goal.
•The system proved to be cost-effective and easily replicable in other contexts.
•The 4.0-ready job profiles seem to have a stronger component of horizontal skills.
•Managerial roles seem to be more impacted by Industry 4.0 technologies.
Industry 4.0 is introducing rapid and epochal changes and challenges. Among these, the issue of skills and job profiles is assuming a critical role. In fact, the literature highlights not only the necessary integration of existing skills in professional profiles, but also the inevitable creation of new ones to properly manage the digitalisation trends. However, the state of the art mostly focuses on building models to assess the digital maturity of companies, while the impact on the labor market remains a hazy issue. Moreover, the literature tends to offer qualitative approaches to the topic, making the results uncertain; on the other hand, quantitative approaches tend to be applied mainly to structured databases, while the supply and demand of competences (findable in CVs, vacancies, or firms' job profiles) are less treated. The goal of the present research is to develop a measure for quantifying the readiness of employees belonging to a big firm with respect to the Industry 4.0 paradigm. To reach the goal, a data-driven approach based on text mining techniques is applied to a case study. In particular, the present methodology makes use of a previously developed enriched dictionary of technologies and methods 4.0 (Chiarello et al., 2018). The source is used to analyze job profile descriptions belonging to Whirlpool, a multinational company with a structured database of jobs and skills. The process allows the identification of technologies, techniques, and related skills contained in job descriptions. Starting from these, the Industry 4.0 impact on each job profile is measured. Finally, the metadata of the job profiles are analyzed to evaluate to which extent the skills of 4.0-ready and non-4.0-ready profiles differ. In the end, the work provides a framework for estimating the Industry 4.0 readiness of enterprises' human capital which proves to be fast, adaptable, and reusable.
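The dictionary-matching core of such a pipeline can be sketched as follows. The term list is a tiny stand-in for the enriched dictionary of Chiarello et al. (2018), and the normalised hit count is only one plausible readiness measure.

```python
# Toy Industry 4.0 term list; the real dictionary is far larger and enriched.
I40_TERMS = {"machine learning", "iot", "additive manufacturing",
             "big data", "digital twin", "cyber security"}

def readiness(job_description):
    """Return a [0, 1] readiness score (fraction of dictionary terms found)
    together with the matched terms."""
    text = job_description.lower()
    hits = {t for t in I40_TERMS if t in text}
    return len(hits) / len(I40_TERMS), sorted(hits)

score, found = readiness(
    "Leads IoT and big data initiatives; applies machine learning to QC.")
print(score, found)  # 0.5 ['big data', 'iot', 'machine learning']
```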