Code-switching entails mixing multiple languages and is an increasingly common phenomenon in social media texts. Usually, code-mixed texts are written in a single script, even though the languages involved have different scripts. Pre-trained multilingual models primarily utilize data in the native script of each language. In existing studies, code-switched texts are used as they are. However, using the native script for each language can generate better representations of the text owing to the pre-trained knowledge. Therefore, this study proposes a cross-language-script knowledge-sharing architecture that uses cross-attention and alignment of the text representations in the individual language scripts. Experimental results on two datasets containing Nepali-English and Hindi-English code-switched texts demonstrate the effectiveness of the proposed method. Interpreting the model with a model-explainability technique illustrates the sharing of language-specific knowledge between the language-specific representations.
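A minimal sketch of the kind of cross-attention fusion this abstract describes, assuming two pre-trained encoders (one per script) have already produced token representations; all module names, dimensions, and the pooling choice below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: token representations of the same text in two scripts
# attend to each other, and the fused outputs are combined for classification.
import torch
import torch.nn as nn

class CrossScriptFusion(nn.Module):
    def __init__(self, dim=768, heads=8, n_classes=2):
        super().__init__()
        # One cross-attention block per direction (script A -> B and B -> A).
        self.attn_a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_b2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)

    def forward(self, h_a, h_b):
        # h_a: (batch, len_a, dim) representations in script A
        # h_b: (batch, len_b, dim) representations in script B
        fused_a, _ = self.attn_a2b(query=h_a, key=h_b, value=h_b)
        fused_b, _ = self.attn_b2a(query=h_b, key=h_a, value=h_a)
        # Mean-pool each fused sequence and concatenate for classification.
        pooled = torch.cat([fused_a.mean(dim=1), fused_b.mean(dim=1)], dim=-1)
        return self.classifier(pooled)

model = CrossScriptFusion()
logits = model(torch.randn(4, 32, 768), torch.randn(4, 40, 768))
print(logits.shape)  # torch.Size([4, 2])
```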
As the amount of generated information grows, reading and summarizing large collections of texts becomes a challenging task. Many documents do not come with descriptive terms, thus requiring humans to generate keywords on the fly. The need to automate this task demands the development of keyword extraction systems that can automatically identify keywords within a text. One approach is to resort to machine-learning algorithms. These, however, depend on large annotated text corpora, which are not always available. An alternative solution is an unsupervised approach. In this article, we describe YAKE!, a lightweight unsupervised automatic keyword extraction method that rests on statistical text features extracted from single documents to select the most relevant keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language, or domain. To demonstrate the merits and significance of YAKE!, we compare it against ten state-of-the-art unsupervised approaches and one supervised method. Experimental results on twenty datasets show that YAKE! significantly outperforms the other unsupervised methods on texts of different sizes, languages, and domains.
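For illustration, YAKE! is available as the `yake` package on PyPI; the parameter values below are illustrative, not recommendations from the paper.

```python
# Unsupervised keyword extraction with the yake package (pip install yake).
# YAKE! scores candidates from single-document statistics; lower score = more relevant.
import yake

text = (
    "Keyword extraction identifies the terms that best describe a document. "
    "Unsupervised methods such as YAKE! need no training corpus or dictionaries."
)

extractor = yake.KeywordExtractor(lan="en", n=3, top=5)  # up to 3-word keyphrases
for keyword, score in extractor.extract_keywords(text):
    print(f"{score:.4f}  {keyword}")
```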
Interest point detection is one of the most fundamental and critical problems in computer vision and image processing. In this paper, we carry out a comprehensive review of image feature information (IFI) extraction techniques for interest point detection. To systematically introduce how existing interest point detection methods extract IFI from an input image, we propose a taxonomy of IFI extraction techniques for interest point detection. According to this taxonomy, we discuss the different types of IFI extraction techniques. Furthermore, we identify the main unresolved issues in existing IFI extraction techniques, as well as interest point detection methods that have not been discussed before. The existing popular datasets and evaluation standards are presented, and the performance of fifteen state-of-the-art approaches is evaluated and discussed. Finally, future research directions for IFI extraction techniques for interest point detection are elaborated.
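As a concrete reference point for the detectors this survey covers, a classic intensity-based method such as the Harris corner detector can be run in a few lines with OpenCV; the synthetic image and threshold here are only for demonstration.

```python
# Classic intensity-based interest point detection with OpenCV
# (pip install opencv-python); Harris responds to corner-like IFI.
import cv2
import numpy as np

# Synthetic test image: a white square on a black background has 4 corners.
gray = np.zeros((200, 200), dtype=np.float32)
gray[50:150, 50:150] = 255.0

response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
corners = np.argwhere(response > 0.01 * response.max())  # (row, col) points
print(f"detected {len(corners)} candidate interest points")
```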
Entity and relation extraction is a task that combines detecting entity mentions and recognizing the semantic relationships between entities in unstructured text. We propose a hybrid neural network model to extract entities and their relationships without any handcrafted features. The hybrid neural network contains a novel bidirectional encoder-decoder LSTM module (BiLSTM-ED) for entity extraction and a CNN module for relation classification. The contextual information of entities obtained by BiLSTM-ED is further passed to the CNN module to improve relation classification. We conduct experiments on the public ACE05 dataset (Automatic Content Extraction program) to verify the effectiveness of our method. The proposed method achieves state-of-the-art results on the entity and relation extraction task.
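A minimal sketch of how such a hybrid pipeline could be wired, assuming the decoder's per-token states carry the contextual entity information that feeds the CNN branch; layer sizes, tag sets, and the pooling are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of the hybrid pipeline: a BiLSTM encoder-decoder
# tags entities, and its contextual states feed a CNN relation classifier.
import torch
import torch.nn as nn

class HybridEntityRelation(nn.Module):
    def __init__(self, vocab=10000, dim=128, n_tags=9, n_rels=7):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(2 * dim, dim, batch_first=True)
        self.tagger = nn.Linear(dim, n_tags)   # entity tag per token
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.rel_clf = nn.Linear(dim, n_rels)  # relation label per sentence

    def forward(self, tokens):
        enc, _ = self.encoder(self.embed(tokens))  # (B, T, 2*dim)
        dec, _ = self.decoder(enc)                 # (B, T, dim) contextual states
        tag_logits = self.tagger(dec)
        # Reuse the decoder's contextual entity information in the CNN branch.
        conv = torch.relu(self.conv(dec.transpose(1, 2)))   # (B, dim, T)
        rel_logits = self.rel_clf(conv.max(dim=2).values)   # max-pool over time
        return tag_logits, rel_logits

model = HybridEntityRelation()
tags, rels = model(torch.randint(0, 10000, (4, 20)))
print(tags.shape, rels.shape)  # torch.Size([4, 20, 9]) torch.Size([4, 7])
```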
Document-level relation extraction (RE) aims to simultaneously predict relations (including no-relation cases, denoted NA) between all entity pairs in a document. It is typically formulated as a relation classification task over entities detected in advance and solved with a hard-label training regime, which, however, neglects the divergence of the NA class and the correlations among the other classes. This article introduces progressive self-distillation (PSD), a new training regime that employs online self-knowledge distillation (KD) to produce and incorporate soft labels for document-level RE. The key idea of PSD is to gradually soften hard labels using past predictions from the RE model itself, adjusted adaptively as training proceeds. As such, PSD learns only one RE model within a single training pass, requiring no extra computation or annotation to pretrain another high-capacity teacher. PSD is conceptually simple, easy to implement, and generally applicable to various RE models to further improve their performance, without introducing additional parameters or significantly increasing training overhead. It is also a general framework that can be flexibly extended to distilling various types of knowledge, rather than being restricted to soft labels. Extensive experiments on four benchmark datasets verify the effectiveness and generality of the proposed approach. The code is available at https://github.com/GaoJieCN/psd
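The gist of the label-softening step can be sketched as blending the one-hot hard label with the model's own earlier predictions under a schedule that grows over training; the linear schedule and KL loss below are illustrative assumptions, not the paper's exact adaptive scheme.

```python
# Illustrative core of progressive self-distillation: soften hard labels with
# the model's own past predictions, weighting them more as training proceeds.
import torch
import torch.nn.functional as F

def psd_targets(hard_labels, past_logits, step, total_steps, n_classes):
    # hard_labels: (B,) class indices; past_logits: (B, C) from an earlier epoch
    alpha = step / total_steps                       # illustrative linear schedule
    one_hot = F.one_hot(hard_labels, n_classes).float()
    past_probs = F.softmax(past_logits, dim=-1)
    return (1 - alpha) * one_hot + alpha * past_probs  # progressively softened

targets = psd_targets(torch.tensor([0, 2]), torch.randn(2, 4),
                      step=300, total_steps=1000, n_classes=4)
current_logits = torch.randn(2, 4)  # stand-in for the current model's output
loss = F.kl_div(F.log_softmax(current_logits, dim=-1), targets,
                reduction="batchmean")  # train against the soft targets
```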
Cross-domain Named Entity Recognition (NER) transfers knowledge learned from a resource-rich source domain to improve learning in a low-resource target domain. Most existing works are designed on the sequence labeling framework, defining entity detection and type prediction as a monolithic process. However, they typically ignore the discrepant transferability of these two sub-tasks: the former, locating spans corresponding to entities, is largely domain-robust, whereas the latter must handle entity types that differ across domains. Combining them into an entangled learning problem adds to the complexity of domain transfer. In this work, we propose a novel divide-and-transfer paradigm in which the sub-tasks are learned by separate functional modules, each with its own cross-domain transfer. To demonstrate the effectiveness of divide-and-transfer, we concretely implement two NER frameworks by applying this paradigm with different cross-domain transfer strategies. Experimental results on 10 domain pairs show the notable superiority of the proposed frameworks. Experimental analyses indicate that the significant advantages of the divide-and-transfer paradigm over prior monolithic ones stem from its better performance on low-resource data and much greater transferability. This gives us a new insight into cross-domain NER. Our code is available on GitHub.
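A schematic of the divide-and-transfer idea, under the assumption that the two sub-tasks sit in separate modules so each can be transferred independently; the module shapes and the specific transfer choices are hypothetical, not the paper's frameworks.

```python
# Hypothetical sketch: span detection and type prediction live in separate
# modules, so the domain-robust detector can transfer as-is while the
# type classifier is adapted to the target domain's label set.
import torch
import torch.nn as nn

class SpanDetector(nn.Module):        # largely domain-robust: locate entity spans
    def __init__(self, dim=768):
        super().__init__()
        self.boundary = nn.Linear(dim, 3)  # B/I/O boundary tag per token

    def forward(self, h):
        return self.boundary(h)

class TypeClassifier(nn.Module):      # domain-specific: label detected spans
    def __init__(self, dim=768, n_types=4):
        super().__init__()
        self.types = nn.Linear(dim, n_types)

    def forward(self, span_repr):
        return self.types(span_repr)

detector = SpanDetector()                 # reuse source-trained weights directly
target_typer = TypeClassifier(n_types=6)  # target domain has its own entity types

h = torch.randn(2, 10, 768)               # encoder states for a toy batch
span_logits = detector(h)
type_logits = target_typer(h.mean(dim=1)) # toy span representation
```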
We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 447 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space.
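A SPOTL fact extends the classic subject-predicate-object triple with time and location dimensions; a minimal illustrative encoding (the field types and the example fact are assumptions for demonstration, not YAGO2's internal format):

```python
# Minimal illustrative encoding of a SPOTL fact: the classic SPO triple
# extended with time and location, as in YAGO2.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpotlFact:
    subject: str
    predicate: str
    obj: str                         # "obj" avoids shadowing Python's built-in
    time: Optional[str] = None       # e.g., a point or interval in time
    location: Optional[str] = None   # e.g., a GeoNames entity

fact = SpotlFact("Albert_Einstein", "wonPrize", "Nobel_Prize_in_Physics",
                 time="1921", location="Stockholm")
print(fact)
```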
The growing number of scientific papers and document sources underscores the need for methods capable of evaluating the quality of publications. Researchers looking for relevant papers for their studies need ways to assess the scientific value of these documents. One approach involves using semantic search engines that can automatically extract important knowledge from the growing body of text. In this study, we introduce a new metric called "MAATrica," which serves as the foundation for an innovative method designed to evaluate research papers. MAATrica offers a new way to analyze and categorize text, focusing on the consistency of research documents in the life sciences, particularly in the fields of medicinal and nutraceutical chemistry. This method utilizes semantic descriptions to cover in silico experiments, as well as in vitro and in vivo assays. Created to aid in evaluation processes like peer review, MAATrica uses toolkits and semantic applications to build the proposed measure, identify scientific entities, and gather information. We have applied MAATrica to roughly 90,000 papers and present our findings here.
• MAATrica is a novel metric for assessing the coherence of methodologies in research papers in the fields of medicinal and nutraceutical chemistry.
• MAATrica utilizes SciWalker (SW) as a semantic search engine, together with ontologies, to automate knowledge extraction and evaluate research papers.
• The MAATrica metric has been tested and validated on a dataset of approximately 90,000 papers within the SW platform.
• MAATrica's reliability has been tested through comparisons with manual evaluations, revealing strong agreement and potential to support peer review.
• MAATrica employs a user-controlled, customizable ontology, enabling personalized analysis of research papers in the fields of medicinal and nutraceutical chemistry.
With the advent of Web 2.0, many online platforms produce massive amounts of textual data. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from these data. One approach towards effectively harnessing this unstructured textual data is its transformation into structured text. Hence, this study presents an overview of approaches that can be applied to extract key insights from textual data in a structured way. To this end, Named Entity Recognition and Relation Extraction are the main focus of this review, with a short NER illustration given after this paragraph. The former deals with the identification of named entities, and the latter deals with the problem of extracting relations between sets of entities. This study covers early approaches as well as the developments made up to now using machine-learning models. The survey findings conclude that deep-learning-based hybrid and joint models currently govern the state of the art. It is also observed that annotated benchmark datasets for various textual-data generators, such as Twitter and other social forums, are not available. This scarcity of datasets has resulted in relatively little progress in these domains. Additionally, the majority of state-of-the-art techniques are offline and computationally expensive. Finally, with the increasing focus on deep-learning frameworks, there is a need to understand and explain the processes at work inside deep architectures.
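To make the two tasks concrete, an off-the-shelf pipeline such as spaCy performs NER in a few lines (this assumes the `en_core_web_sm` model is installed); relation extraction would then go a step further and label the connection between the detected entities.

```python
# Named Entity Recognition with spaCy (pip install spacy;
# python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced Apple's new campus in Austin.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., "Tim Cook PERSON", "Apple ORG"
# Relation extraction would additionally label pairs of detected entities,
# e.g., (Tim Cook, CEO_of, Apple).
```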