Measuring semantic similarity between words can be applied in many areas, such as Artificial Intelligence, Information Processing, Medical Care, and Linguistics. In this paper, we present a new approach to measuring semantic similarity based on edge counting and information-content theory. Specifically, the proposed measure nonlinearly transforms the weighted shortest path length between the compared concepts into a semantic similarity score, and the relation between the parameters and the correlation value is discussed in detail. Experimental results show that the proposed approach not only achieves a high correlation with human ratings but also exhibits better distribution characteristics of the correlation coefficient than several related works in the literature. In addition, the proposed method is computationally efficient owing to the simplified way of weighting the shortest path length between concept pairs.
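As a minimal sketch of the edge-counting idea described above (not the paper's actual formula), the following converts the shortest path length between two concepts in a toy is-a taxonomy into a similarity score via an exponential (nonlinear) transform; the taxonomy and the decay parameter `alpha` are illustrative assumptions only:

```python
import math

# Toy "is-a" taxonomy: child -> parent (a tiny stand-in for WordNet).
PARENT = {
    "car": "vehicle", "bicycle": "vehicle",
    "vehicle": "artifact", "artifact": "entity",
    "dog": "animal", "animal": "entity",
}

def path_length(a, b):
    """Shortest path length (in edges) between two concepts,
    going up through their common ancestors."""
    def ancestors(x):
        depth, out = 0, {}
        while x is not None:
            out[x] = depth
            x = PARENT.get(x)
            depth += 1
        return out
    da, db = ancestors(a), ancestors(b)
    return min(da[x] + db[x] for x in da if x in db)

def similarity(a, b, alpha=0.25):
    """Exponential transform of path length into (0, 1];
    alpha is a hypothetical decay parameter, not the paper's value."""
    return math.exp(-alpha * path_length(a, b))
```

Identical concepts score 1.0, and similarity decays monotonically with path length, which is the qualitative behavior a nonlinear path-based measure needs.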
The goal of this research is to design and implement a chatbot for querying WordNet semantic relations. The study creates a contextual chatbot named WordnetBot, a web application that utilizes technologies such as Dialogflow, React, NodeJS, JavaScript, and MariaDB. The WordNet database, which surpasses other dictionaries through its representation of semantic relations, was used as the data source. Phrase structure analysis extracts the keyword and the semantic relation from a user's message or query, complementing the machine learning and AI capabilities of Dialogflow. The researcher designed an architectural framework for integrating the different components of WordnetBot.
We present a formal Arabic wordnet built on the basis of a carefully designed ontology, hereby referred to as the Arabic Ontology. The ontology provides a formal representation of the concepts that Arabic terms convey, and its content was built with ontological analysis in mind and benchmarked, as far as possible, to scientific advances and rigorous knowledge sources, rather than only to speakers' beliefs, as lexicons typically are. A comprehensive evaluation demonstrated that the current version of the top levels of the ontology can serve as top levels for the majority of Arabic meanings. The ontology currently consists of about 1,800 well-investigated concepts, in addition to 16,000 concepts that are partially validated. It is accessible and searchable through a lexicographic search engine (http://ontology.birzeit.edu) that also includes about 150 Arabic-multilingual lexicons, which are being mapped to and enriched using the ontology. The ontology is fully mapped to Princeton WordNet, Wikidata, and other resources.
The daily rise of superfluous information has made clustering information into meaningful sets challenging. We propose an efficient approach for obtaining semantic clusters from a huge volume of documents. Preprocessing based on lexical-ontological information from WordNet helps reduce the feature space and eliminate synonymy problems among the features. A considerable decrease in computational time is achieved by means of an enhanced k-means clustering algorithm, which computes the starting centroids using a sorting technique based on a Red-Black Tree, thus ensuring efficiency and meaningful clusters. Memoization is utilized in the subsequent stages to avoid redundant computations. Results indicate that our method produces more meaningful clusters than approaches employing word-embedding models such as Word2Vec, FastText, and BERT for feature extraction. Experimental results on the MiniNewsGroup, 20NewsGroup Large, and Reuters-21578 datasets show remarkable clustering outcomes in terms of purity and execution time. On the enormous 20NewsGroup Large dataset, our method achieves a better NMI (Normalized Mutual Information) score than existing methods.
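The seeding-plus-memoization idea can be sketched as follows. This is a simplified stand-in, not the paper's implementation: an ordinary sort replaces the Red-Black-Tree sort (both yield an ordered sequence), initial centroids are taken at evenly spaced positions in that order, and a dictionary memoizes point-to-centroid distances:

```python
import math

def initial_centroids(points, k):
    """Pick k starting centroids from the points sorted by magnitude
    (a simple stand-in for the Red-Black-Tree-based sort), taking
    evenly spaced elements so the seeds span the data range."""
    ordered = sorted(points, key=lambda p: math.hypot(*p))
    step = len(ordered) // k
    return [ordered[i * step + step // 2] for i in range(k)]

def kmeans(points, k, iters=10):
    """Plain Lloyd iterations with deterministic seeding and a
    memoized distance cache to skip repeated computations."""
    cents = initial_centroids(points, k)
    dist_cache = {}  # (point, centroid) -> distance
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [dist_cache.setdefault((p, c), math.dist(p, c)) for c in cents]
            clusters[d.index(min(d))].append(p)
        cents = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else cents[i]
            for i, cl in enumerate(clusters)
        ]
    return cents, clusters
```

Deterministic, spread-out seeds are what makes the iteration count (and hence runtime) predictable compared with random initialization.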
The success of Natural Language Processing (NLP) models, like that of all advanced machine learning models, relies heavily on large-scale lexical resources. For English, the English WordNet (EWN) is a leading example of a large-scale resource that has enabled advances in Natural Language Understanding (NLU) tasks such as word sense disambiguation, question answering, sentiment analysis, and emotion recognition. EWN comprises sets of cognitive synonyms called synsets, which are interlinked by means of conceptual-semantic and lexical relations and where each synset expresses a distinct concept. Other languages, however, still lag behind in having large-scale, rich lexical resources similar to EWN. In this article, we focus on enabling the development of such resources for Arabic. While there have been efforts to develop an Arabic WordNet (AWN), its current version is limited in size and lacks transliteration standards, which are important for compatibility with Arabic NLP tools. Previous efforts to extend AWN resulted in a lexicon, called ArSenL, that overcame the size and transliteration-standard limitations but was limited in accuracy: its heuristic approach considered only surface matching between the English definitions from the Standard Arabic Morphological Analyzer (SAMA) and EWN synset terms, which resulted in inaccurate mappings of Arabic lemmas to EWN synsets. Furthermore, other expansion methods have seen limited exploration because of the expensive manual validation they require. To address these limitations and simultaneously achieve large scale, high accuracy, and standard representations, we formulate the mapping problem as a link prediction problem between a large-scale Arabic lexicon and EWN, where a word in one lexicon is linked to a word in the other if the two words are semantically related.
We use a semi-supervised approach to create a training dataset by finding common terms between the large-scale Arabic resource and AWN. This dataset becomes implicitly linked to EWN and can be used to train and evaluate prediction models. We propose a two-step boosting method: the first step links English translations of SAMA's terms to EWN synsets, and the second step uses surface similarity between SAMA's glosses and EWN synsets. The method yields a new large-scale Arabic lexicon that we call ArSenL 2.0, a sequel to the previously developed sentiment lexicon ArSenL. A comprehensive study covering both intrinsic and extrinsic evaluations shows the superiority of the method over several baseline and state-of-the-art link prediction methods. Compared with the previous ArSenL, ArSenL 2.0 includes a larger set of sentimentally charged adjectives and verbs and shows higher linking accuracy on the ground-truth data. For extrinsic evaluation, ArSenL 2.0 was used for sentiment analysis and here, too, showed higher accuracy than the previous ArSenL.
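The surface-similarity step can be illustrated with a minimal sketch. This is not the paper's method, only a plausible baseline under stated assumptions: glosses are compared as token sets with Jaccard overlap, the synset inventory is a hypothetical dictionary mapping synset ids to definition strings, and the threshold value is made up for illustration:

```python
def jaccard(a, b):
    """Token-set Jaccard overlap between two gloss strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def link_lemma(gloss, synsets, threshold=0.2):
    """Link a lemma's English gloss to the synset whose definition
    overlaps it most; `synsets` maps hypothetical synset ids to
    definition text. Returns None if nothing clears the threshold."""
    best_id, best = None, threshold
    for sid, definition in synsets.items():
        score = jaccard(gloss, definition)
        if score > best:
            best_id, best = sid, score
    return best_id
```

A pure surface matcher like this is exactly the kind of heuristic the abstract says is accuracy-limited, which motivates treating the mapping as learned link prediction instead.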
Various applications in computational linguistics and artificial intelligence employ semantic similarity to solve challenging tasks such as word sense disambiguation, text classification, information retrieval, machine translation, and document clustering. To our knowledge, research to date relies solely on the taxonomic relation "ISA" to evaluate semantic similarity and relatedness between terms. This paper explores the benefits of using all types of non-taxonomic relations in large linked data, such as the WordNet knowledge graph, to enhance existing semantic similarity and relatedness measures. We propose a holistic, poly-relational approach based on a new relation-based information content and non-taxonomic weighted paths to devise a comprehensive semantic similarity and relatedness measure. To demonstrate the benefits of exploiting non-taxonomic relations in a knowledge graph, we deploy them using three strategies at different granularity levels. We conduct experiments on four well-known gold-standard datasets. The proposed method improves over benchmark semantic similarity methods, including state-of-the-art knowledge-graph-embedding techniques, by 3.8%–23.8%, 1.3%–18.3%, 31.8%–117.2%, and 19.1%–111.1% on the MC, RG, WordSim, and Mturk gold-standard datasets, respectively. These results demonstrate the robustness and scalability of the proposed measure, which significantly improves on existing similarity measures.
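The notion of a weighted path over mixed relation types can be sketched as follows. This is an illustrative toy, not the paper's measure: the graph, the relation inventory, and the per-relation weights are all assumptions, with non-taxonomic edges (here "part-of") simply costing more than "isa" edges:

```python
import heapq

# Toy knowledge graph of (node, relation, node) triples.
EDGES = [
    ("wheel", "part-of", "car"), ("car", "isa", "vehicle"),
    ("bicycle", "isa", "vehicle"), ("engine", "part-of", "car"),
]
# Illustrative relation weights, not the paper's fitted values.
WEIGHTS = {"isa": 1.0, "part-of": 1.5}

def weighted_shortest_path(src, dst):
    """Dijkstra over the undirected graph, with edge cost determined
    by the relation type of each edge."""
    adj = {}
    for a, rel, b in EDGES:
        w = WEIGHTS[rel]
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    dist, heap = {src: 0.0}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return float("inf")
```

The point of the sketch: once non-taxonomic edges carry their own weights, concepts unreachable through "ISA" alone (here, "wheel" and "bicycle") still receive a finite, graded path cost.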
Research on interconnecting lexical resources represents a real challenge, because it addresses the difficult problem of semantic understanding and, more precisely, of disambiguating the meanings of words: Word Sense Disambiguation (WSD). In the current and prospective context of the information society, having the fundamental works of a national culture in digital format is strictly necessary. Creating a representative corpus of a language accessible through the Internet is a topical issue throughout the world, the corpus being a concrete, clear picture of the use of that language. In this study we describe the development of a Romanian language GOLD corpus covering the multiple meanings that exist for various words. We propose a corpus annotation standard based on three lexical resources: from the Thesaurus Dictionary of the Romanian Language in electronic format (eDTLR), we extracted a list of words with multiple meanings; from the Reference Corpus for Contemporary Romanian Language (CoRoLa), we extracted contexts in which these words were found; and from the Romanian WordNet (RoWN), we took the sense of the word in the corpus context.
Emotion lexicons are useful in research across various disciplines, but the availability of such resources remains limited for most languages. While existing emotion lexicons typically comprise words, it is a particular meaning of a word (rather than the word itself) that conveys emotion. To mitigate this issue, we present the Emotion Meanings dataset, a novel dataset of 6000 Polish word meanings. The word meanings are derived from the Polish wordnet (plWordNet), a large semantic network interlinking words by means of lexical and conceptual relations. The word meanings were manually rated for valence and arousal, along with a variety of basic emotion categories (anger, disgust, fear, sadness, anticipation, happiness, surprise, and trust). The annotations were found to be highly reliable, as demonstrated by the similarity between data collected in two independent samples: unsupervised (n = 21,317) and supervised (n = 561). Although we found the annotations to be relatively stable across female, male, younger, and older participants, we share both summary data and individual data to enable emotion research on different demographically specific subgroups. The word meanings are further accompanied by relevant metadata derived from open-source linguistic resources. Direct mapping to Princeton WordNet makes the dataset suitable for research on multiple languages. Altogether, this dataset provides a versatile resource that can be employed for emotion research in psychology, cognitive science, psycholinguistics, computational linguistics, and natural language processing.