•This work is a detailed companion reproducibility paper of the methods and experiments proposed in three previous works by Lastra-Díaz and García-Serrano, which introduces a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the aforementioned works.
•This work introduces a new representation model for taxonomies called PosetHERep, and a Java software library called the Half-Edge Semantic Measures Library (HESML) based on it, which implements most ontology-based semantic similarity measures and Information Content (IC) models based on WordNet reported in the literature.
•PosetHERep proposes a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and provides an efficient implementation of a large set of topological queries and graph-based algorithms; it is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs in computational geometry.
•This work also introduces a replication framework and dataset, called WNSimRep v1, provided as supplementary material, whose aim is to assist the exact replication of most similarity measures and IC models reported in the literature.
•Finally, this work introduces an experimental survey on the performance and scalability of the most recent state-of-the-art semantic measures libraries. This survey confirms the statistically significant outperformance of HESML over the state-of-the-art libraries in terms of performance and scalability, as well as the possibility of significantly improving the performance and scalability of semantic measures libraries without caching by using PosetHERep.
This work is a detailed companion reproducibility paper of the methods and experiments proposed by Lastra-Díaz and García-Serrano (2015, 2016) [56–58], which introduces the following contributions: (1) a new and efficient representation model for taxonomies, called PosetHERep, which is an adaptation of the half-edge data structure commonly used to represent discrete manifolds and planar graphs; (2) a new Java software library called the Half-Edge Semantic Measures Library (HESML), based on PosetHERep, which implements most ontology-based semantic similarity measures and Information Content (IC) models reported in the literature; (3) a set of reproducible experiments on word similarity based on HESML and ReproZip with the aim of exactly reproducing the experimental surveys in the three aforementioned works; (4) a replication framework and dataset, called WNSimRep v1, whose aim is to assist the exact replication of most methods reported in the literature; and finally, (5) a set of scalability and performance benchmarks for semantic measures libraries. PosetHERep and HESML are motivated by several drawbacks of the current semantic measures libraries, especially in performance and scalability, as well as in the evaluation of new methods and the replication of previous methods. The reproducible experiments introduced herein are motivated by the lack of a set of large, self-contained and easily reproducible experiments aimed at replicating and confirming previously reported results. Likewise, the WNSimRep v1 dataset is motivated by the discovery of several contradictory results and by difficulties in reproducing previously reported methods and experiments.
PosetHERep proposes a memory-efficient representation for taxonomies which scales linearly with the size of the taxonomy and provides an efficient implementation of most taxonomy-based algorithms used by the semantic measures and IC models, whilst HESML provides an open framework to aid research in the area by providing a simpler and more efficient software architecture than the current software libraries. Finally, we demonstrate the outperformance of HESML over the state-of-the-art libraries, as well as the possibility of significantly improving their performance and scalability without caching by using PosetHERep.
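As a rough illustration of the half-edge idea behind PosetHERep (the class names and layout below are hypothetical, a minimal sketch rather than HESML's actual implementation), each taxonomy edge can be stored as a pair of oriented half-edges, each recording its target vertex, its opposite half-edge, and the next half-edge leaving the same source vertex. Adjacency queries then follow `next` links in time proportional to the vertex degree, with memory linear in the number of edges:

```python
# Minimal half-edge sketch for a taxonomy (illustrative only).
class HalfEdge:
    __slots__ = ("target", "opposite", "next")

    def __init__(self, target):
        self.target = target      # vertex this half-edge points to
        self.opposite = None      # the reversed half-edge of the same edge
        self.next = None          # next half-edge sharing the same source


class Taxonomy:
    def __init__(self):
        self.first = {}           # vertex -> first outgoing half-edge

    def add_edge(self, child, parent):
        up, down = HalfEdge(parent), HalfEdge(child)
        up.opposite, down.opposite = down, up
        # Prepend each half-edge to its source vertex's outgoing list.
        for he, src in ((up, child), (down, parent)):
            he.next = self.first.get(src)
            self.first[src] = he

    def neighbours(self, v):
        """Yield all vertices adjacent to v (parents and children)."""
        he = self.first.get(v)
        while he is not None:
            yield he.target
            he = he.next


tax = Taxonomy()
tax.add_edge("dog", "canine")
tax.add_edge("canine", "animal")
print(sorted(tax.neighbours("canine")))  # ['animal', 'dog']
```

Storing both orientations of every edge is what makes upward (hypernym) and downward (hyponym) traversals equally cheap, which plausibly matters for the topological queries used by path-based similarity measures.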
Konkani WordNet: Corpus-Based Enhancement using Crowdsourcing
Manerkar, Sanjana; Asnani, Kavita; Khorjuvenkar, Preeti Ravindranath ...
ACM Transactions on Asian and Low-Resource Language Information Processing, 07/2022, Volume 21, Issue 4. Journal article; peer reviewed.
Konkani is one of the languages included in the Eighth Schedule of the Indian Constitution. It is the official language of Goa and is spoken mainly in Goa and in some places in Karnataka and Kerala. The Konkani WordNet, or Konkani Shabdamalem (kōṁkanī śabdamālēṁ) as it is referred to, was developed under the Indradhanush WordNet Project Consortium between August 2010 and October 2013. The project was funded by Technology Development for Indian Languages (TDIL), the Department of Electronics & Information Technology (DeitY), and the Ministry of Communication and Information Technology (MCIT). Work on the Konkani WordNet has been halted since the end of the project. Currently, the Konkani WordNet contains around 32,370 synsets. However, to make it a powerful resource for NLP applications in the Konkani language, research toward enhancing the Konkani WordNet through community involvement is needed. Crowdsourcing is a technique in which the knowledge of the crowd is utilized to accomplish a particular task. In this article, we present the details of a crowdsourcing platform named "Konkani Shabdarth" (kōṁkanī śabdārth). Konkani Shabdarth attempts to use the knowledge of Konkani-speaking people to create new synsets and thereby quantitatively enhance the wordnet. It also aims to enhance the overall quality of the Konkani WordNet by validating the existing synsets and adding missing words to them. A text corpus named the "Konkani Shabdarth Corpus" has been created from Konkani literature while implementing the Konkani Shabdarth tool. Using this corpus, 572 root words missing from the Konkani WordNet have been identified and given as input to Konkani Shabdarth. So far, a total of 94 users have registered on the platform, of whom 25 have actually played the game. Currently, 71 new synsets have been obtained for 21 words.
For some of the words, multiple entries for the concept definition have been received. This overlap is essential for automating the validation of synsets. Owing to the pandemic, it has been difficult to train players and get them to actually play the game and contribute. We also studied the impact of adding missing words from other existing Konkani text corpora on the coverage of the Konkani WordNet. The expected increase in the percentage coverage of the Konkani WordNet after adding the missing words from the Konkani Shabdarth Corpus is in the range of 20–27%, compared with an increase in the range of 1–10% for the other corpora.
Quantifying semantic similarity between linguistic items lies at the core of many applications in Natural Language Processing and Artificial Intelligence. It has therefore received a considerable amount of research interest, which in turn has led to a wide range of approaches for measuring semantic similarity. However, these measures are usually limited to handling specific types of linguistic items, e.g., single word senses or entire sentences. Hence, for a downstream application to handle various types of input, multiple measures of semantic similarity are needed, measures that often use different internal representations or have different output scales. In this article we present a unified graph-based approach for measuring semantic similarity which enables effective comparison of linguistic items at multiple levels, from word senses to full texts. Our method first leverages the structural properties of a semantic network to model arbitrary linguistic items through a unified probabilistic representation, and then compares the linguistic items in terms of their representations. We report state-of-the-art performance on multiple datasets pertaining to three different levels: senses, words, and texts.
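The key point of a unified probabilistic representation is that once every linguistic item, from a single sense to a full text, is mapped to a probability distribution over the nodes of the same semantic network, any two items can be compared with one distribution-similarity function. A minimal sketch (the distributions and the use of cosine here are made up for illustration; the paper's actual representation and comparison method may differ):

```python
import math

def cosine(p, q):
    """Cosine similarity between two sparse probability distributions,
    given as dicts mapping semantic-network nodes to weights."""
    dot = sum(p.get(n, 0.0) * q.get(n, 0.0) for n in set(p) | set(q))
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Toy distributions over semantic-network nodes (hypothetical values):
# one for a word sense, one for a whole sentence. Both live in the same
# space, so they are directly comparable despite different granularity.
word_sense = {"animal": 0.5, "dog": 0.4, "pet": 0.1}
sentence = {"animal": 0.3, "dog": 0.3, "pet": 0.2, "walk": 0.2}
print(round(cosine(word_sense, sentence), 3))
```

Because every item type shares one representation and one output scale, a downstream application no longer needs separate similarity measures per input type.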
PolyWordNet is a new lexical database which deals with the organization of the senses of polysemous words. It mimics the way the human mind organizes the senses of polysemous words and their related words to analyze and determine the correct meaning of a polysemous word in context. A related word of a sense of a polysemous word is a word which provides necessary and sufficient context to disambiguate the meaning of the polysemous word. A context containing a polysemous word must contain at least one related word that determines the correct sense of that word. PolyWordNet utilizes this fact to organize the senses of a polysemous word together with their corresponding related words. PolyWordNet is quite different from both a dictionary and WordNet. In a dictionary, words with similar spellings come together; in WordNet, words with similar meanings come together. In WordNet, the same words are connected to multiple senses of the same polysemous word, which introduces ambiguity. This ambiguity is resolved in PolyWordNet by linking each related word with only a single sense of the same polysemous word. PolyWordNet can therefore be used to disambiguate the senses of polysemous words more precisely and more efficiently. Keywords: PolyWordNet, WordNet, Dictionary
Object detectors search a given picture for all objects belonging to a pre-defined set of categories. However, users are often not interested in finding all objects, but only those that pertain to a small set of categories or concepts. Nowadays, the standard approach to this task involves initially employing an object detector to identify all objects within the image, followed by refining the outcomes to retain only the ones of interest. Nevertheless, the object detector does not take advantage of the user's prior intent which, when used, can potentially improve the detection performance of the model. This work presents a method to condition an existing object detector on the user's intent, encoded as one or more concepts from the WordNet graph, to find just those objects of interest. The proposed approach takes advantage of existing datasets for object detection without the need for new annotations, and it allows adapting already existing object detector models with minor changes. The evaluation, performed on the COCO and Visual Genome datasets considering several object detector architectures, shows that conditioning the search on concepts is indeed beneficial. The code and the pre-trained model weights are released at: https://github.com/drigoni/Concept-Conditioned-Object-Detector.
Query expansion (QE) is a well-known technique used to enhance the effectiveness of information retrieval. QE reformulates the initial query by adding similar terms that help in retrieving more relevant results. Several approaches proposed in the literature produce quite favorable results, but they are not evenly favorable for all types of queries (individual and phrase queries). One of the main reasons for this is the use of the same kind of data sources and weighting scheme when expanding both individual and phrase query terms. As a result, the holistic relationship among the query terms is not well captured or scored. To address this issue, we present a new approach to QE using Wikipedia and WordNet as data sources. Specifically, Wikipedia gives rich expansion terms for phrase terms, while WordNet does the same for individual terms. We also propose novel weighting schemes for expansion terms: an in-link score (for terms extracted from Wikipedia) and a tf-idf based scheme (for terms extracted from WordNet). In the proposed Wikipedia-WordNet-based QE technique (WWQE), we weight the expansion terms twice: first, they are scored individually by the weighting scheme, and then the selected expansion terms are scored with respect to the entire query using a correlation score. The proposed approach gains improvements of 24% on the MAP score and 48% on the GMAP score over unexpanded queries on the FIRE dataset. Experimental results show a significant improvement over individual expansion and other related state-of-the-art approaches. We also analyzed the effect of the proposed technique on retrieval effectiveness while varying the number of expansion terms.
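To make the tf-idf side of such a scheme concrete, candidate expansion terms can be scored by their frequency across a document collection, damped by how widely they occur. The sketch below is a generic smoothed tf-idf, not the WWQE paper's exact formula (the function name, the smoothing, and the toy documents are all assumptions for illustration):

```python
import math

def tfidf_scores(candidates, docs):
    """Score candidate expansion terms with a simple smoothed tf-idf.
    docs is a list of tokenized documents (lists of terms)."""
    n = len(docs)
    scores = {}
    for term in candidates:
        tf = sum(doc.count(term) for doc in docs)          # total frequency
        df = sum(1 for doc in docs if term in doc)         # document frequency
        idf = math.log((n + 1) / (df + 1)) + 1             # smoothed idf
        scores[term] = tf * idf
    return scores

docs = [["query", "expansion", "retrieval"],
        ["wordnet", "expansion", "terms"],
        ["retrieval", "evaluation"]]
print(tfidf_scores(["expansion", "retrieval", "wordnet"], docs))
```

In a two-stage scheme like WWQE, scores of this kind would rank candidates individually before a second pass re-scores the survivors against the query as a whole.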
The area of opinion mining and sentiment analysis has grown rapidly; it aims to explore the opinions or text present on different social media platforms through machine-learning techniques involving sentiment analysis, subjectivity analysis, or polarity calculations. Despite the use of various machine-learning techniques and tools for sentiment analysis during elections, a state-of-the-art approach is still needed. To address these challenges, this paper adopts a hybrid approach involving a machine-learning-based sentiment analyzer. Moreover, this paper also compares sentiment analysis techniques for the analysis of political views by applying supervised machine-learning algorithms such as Naïve Bayes and support vector machines (SVM).
Many applications in cognitive science and artificial intelligence utilize semantic similarity and relatedness to solve difficult tasks such as information retrieval, word sense disambiguation, and text classification. Previously, several approaches for evaluating concept similarity and relatedness based on WordNet or Wikipedia have been proposed. WordNet-based methods rely on highly precise knowledge but have limited lexical coverage. In contrast, Wikipedia-based models achieve more coverage but sacrifice knowledge quality. Therefore, in this paper, we focus on developing a comprehensive semantic similarity and relatedness method based on WordNet and Wikipedia. To improve the accuracy of existing measures, we combine various taxonomic and non-taxonomic features of WordNet, including gloss, lemmas, examples, sister terms, derivations, holonyms/meronyms, and hypernyms/hyponyms, with Wikipedia glosses and hyperlinks, to describe concepts. We present a novel technique for extracting 'is-a' and 'part-whole' relationships between concepts using the Wikipedia link structure. The suggested technique identifies taxonomic and non-taxonomic relationships between concepts and offers dense vector representations of concepts. To fully exploit the semantic attributes of WordNet and Wikipedia, the proposed method integrates their semantic knowledge at the feature level, combining semantic similarity and relatedness into a single comprehensive measure. The experimental results demonstrate the effectiveness of the proposed method over state-of-the-art measures on various gold standard benchmarks.
This paper presents a method for measuring the semantic similarity between concepts in Knowledge Graphs (KGs) such as WordNet and DBpedia. Previous work on semantic similarity methods has focused either on the structure of the semantic network between concepts (e.g., path length and depth) or only on the Information Content (IC) of concepts. We propose a semantic similarity method, namely wpath, to combine these two approaches, using IC to weight the shortest path length between concepts. Conventional corpus-based IC is computed from the distributions of concepts over a textual corpus, which requires preparing a domain corpus containing annotated concepts and has a high computational cost. As instances are already extracted from textual corpora and annotated with concepts in KGs, graph-based IC is proposed to compute IC based on the distributions of concepts over instances. Through experiments performed on well-known word similarity datasets, we show that the wpath semantic similarity method produces a statistically significant improvement over other semantic similarity methods. Moreover, in a real category classification evaluation, the wpath method shows the best performance in terms of accuracy and F-score.
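The combination the abstract describes, IC weighting the shortest path length, can be sketched as follows. The wpath measure is commonly stated as sim(c1, c2) = 1 / (1 + len(c1, c2) * k^IC(lcs)), with k in (0, 1] a tuning parameter and lcs the least common subsumer of the two concepts; the numeric inputs below are made-up toy values, not results from the paper:

```python
def wpath(path_len, ic_lcs, k=0.8):
    """wpath similarity: shortest path length between two concepts,
    weighted by the IC of their least common subsumer (LCS).
    path_len and ic_lcs are assumed to be precomputed from a KG."""
    return 1.0 / (1.0 + path_len * k ** ic_lcs)

# Two concept pairs with the same path length: the pair whose LCS is
# more informative (higher IC, i.e., a more specific shared ancestor)
# comes out as more similar, which is the effect the weighting targets.
generic = wpath(path_len=4, ic_lcs=0.5)   # LCS is a very generic concept
specific = wpath(path_len=4, ic_lcs=6.0)  # LCS is a specific concept
assert specific > generic
```

The graph-based IC variant the abstract mentions would supply `ic_lcs` from concept-instance distributions in the KG instead of from an annotated corpus; the similarity formula itself is unchanged.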