This paper presents a method for measuring the semantic similarity between concepts in Knowledge Graphs (KGs) such as WordNet and DBpedia. Previous work on semantic similarity has focused either on the structure of the semantic network between concepts (e.g., path length and depth) or only on the Information Content (IC) of concepts. We propose a semantic similarity method, namely wpath, that combines these two approaches, using IC to weight the shortest path length between concepts. Conventional corpus-based IC is computed from the distributions of concepts over a textual corpus, which requires preparing a domain corpus with annotated concepts and incurs a high computational cost. Since instances in KGs are already extracted from textual corpora and annotated with concepts, we propose a graph-based IC computed from the distributions of concepts over instances. Through experiments on well-known word similarity datasets, we show that the wpath semantic similarity method produces a statistically significant improvement over other semantic similarity methods. Moreover, in a real category classification evaluation, the wpath method shows the best performance in terms of accuracy and F-score.
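For intuition, here is a minimal sketch of the IC-weighted shortest-path idea using NLTK's WordNet interface with corpus-based IC; the exact weighting scheme in the paper may differ, and the parameter `k` and the Brown-corpus IC source are illustrative choices, not the paper's configuration.

```python
# Sketch of an IC-weighted shortest-path similarity (wpath-style).
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
from nltk.corpus import wordnet as wn, wordnet_ic
from nltk.corpus.reader.wordnet import information_content

brown_ic = wordnet_ic.ic('ic-brown.dat')  # corpus-based IC counts

def wpath(s1, s2, k=0.8):
    """Shortest path length between synsets, discounted by the IC of
    their least common subsumer; k in (0, 1] controls the IC effect."""
    lcs_list = s1.lowest_common_hypernyms(s2)
    if not lcs_list:
        return 0.0
    ic_lcs = information_content(lcs_list[0], brown_ic)
    length = s1.shortest_path_distance(s2)  # edge count between synsets
    if length is None:
        return 0.0
    return 1.0 / (1.0 + length * k ** ic_lcs)

print(wpath(wn.synset('car.n.01'), wn.synset('bicycle.n.01')))
```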
Advances in natural language processing provide accessible approaches to analyze psychological open-ended data. However, comprehensive instruments for text analysis of stereotype content are missing. We developed stereotype content dictionaries using a semi-automated method based on WordNet and word embeddings. These stereotype content dictionaries covered over 80% of open-ended stereotypes about salient American social groups, compared to 20% coverage from words extracted directly from the stereotype content literature. The dictionaries showed high levels of internal consistency and validity, predicting stereotype scale ratings and human judgments of online text. We developed the R package Semi-Automated Dictionary Creation for Analyzing Text (SADCAT; https://github.com/gandalfnicolas/SADCAT) for access to the stereotype content dictionaries and the creation of novel dictionaries for constructs of interest. Potential applications of the dictionaries range from advancing person perception theories through laboratory studies and analysis of online data to identifying social biases in artificial intelligence, social media, and other ubiquitous text sources.
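As a rough Python sketch of the general semi-automated idea (the actual SADCAT package is in R, and its procedure is more elaborate): expand seed words with WordNet synonyms, then keep only candidates whose embedding sits close enough to the seed. The `embeddings` dict and the 0.5 cutoff are hypothetical placeholders.

```python
# Seed-word expansion: WordNet synonyms filtered by embedding similarity.
from nltk.corpus import wordnet as wn
import numpy as np

def expand_seed(seed, embeddings, threshold=0.5):
    # Gather lemma names from every synset of the seed word.
    candidates = {lemma.name().replace('_', ' ')
                  for synset in wn.synsets(seed)
                  for lemma in synset.lemmas()}
    kept = []
    for word in candidates:
        if word == seed or word not in embeddings:
            continue
        a, b = embeddings[seed], embeddings[word]
        cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if cosine >= threshold:  # keep semantically close candidates only
            kept.append(word)
    return kept
```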
•A modified WordNet-based similarity measure for word sense disambiguation.
•Lexical chains as a text representation that ideally covers the theme of texts.
•The extracted core semantics are sufficient to reduce the dimensionality of the feature set.
•The proposed scheme is able to correctly estimate the true number of clusters.
•The topic labels are good indicators for recognizing and understanding the clusters.
Traditional clustering algorithms do not consider the semantic relationships among words and therefore cannot accurately represent the meaning of documents. To overcome this problem, introducing semantic information from an ontology such as WordNet has been widely used to improve the quality of text clustering. However, several challenges remain, such as synonymy and polysemy, high dimensionality, extracting core semantics from texts, and assigning appropriate descriptions to the generated clusters. In this paper, we report our attempt at integrating WordNet with lexical chains to alleviate these problems. The proposed approach exploits the ontology's hierarchical structure and relations to provide a more accurate assessment of the similarity between terms for word sense disambiguation. Furthermore, we introduce lexical chains to extract a set of semantically related words from texts, which can represent the semantic content of the texts. Although lexical chains have been extensively used in text summarization, their potential impact on the text clustering problem has not been fully investigated. Our integrated approach identifies the theme of documents from the disambiguated core features it extracts, while simultaneously reducing the dimensionality of the feature space. Experimental results using the proposed framework on Reuters-21578 show that clustering performance improves significantly compared to several classical methods.
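A much-simplified lexical-chain sketch, assuming NLTK's WordNet: a noun joins an existing chain if any of its synsets is identical to, or a direct hypernym/hyponym of, a synset already in the chain; otherwise it starts a new chain. Real chainers add sense disambiguation and distance constraints, which this sketch omits.

```python
# Toy lexical chainer over WordNet noun synsets.
from nltk.corpus import wordnet as wn

def related(s1, s2):
    """True if two synsets are identical or directly linked in the hierarchy."""
    return (s1 == s2
            or s2 in s1.hypernyms() or s2 in s1.hyponyms()
            or s1 in s2.hypernyms() or s1 in s2.hyponyms())

def build_chains(nouns):
    chains = []  # each chain is a list of (word, chosen synset) pairs
    for word in nouns:
        synsets = wn.synsets(word, pos=wn.NOUN)
        placed = False
        for chain in chains:
            if any(related(s, cs) for s in synsets for _, cs in chain):
                chain.append((word, synsets[0]))
                placed = True
                break
        if not placed and synsets:
            chains.append([(word, synsets[0])])
    return chains

print(build_chains(['car', 'automobile', 'vehicle', 'banana']))
```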
Sentences are the basic medium of human communication. This medium is so fluid that words and their meanings admit many interpretations by readers; moreover, a document consisting of thousands of sentences is hard for a reader to digest, so computational power is required to analyse text at that scale. However, the accuracy of machine-generated interpretations of a passage remains actively debated. One source of error is ambiguous words with multiple meanings in a sentence: a passage may be translated incorrectly when the wrong sense is selected during the early phase of sentence translation. In this paper, translating a sentence means determining whether it carries a negative or a positive meaning. This research therefore discusses how to disambiguate a term in a sentence against the WordNet repository by proposing a fuzzy semantic-based similarity model. The proposed model builds on a sentence similarity approach that has been shown to give good results in past research. The paper closes with preliminary results illustrating how the proposed framework operates.
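As a generic stand-in (not the paper's exact fuzzy model), a common WordNet-based sentence similarity scheme scores each word by its best similarity to any word in the other sentence and averages the graded scores, which then play the role of fuzzy membership degrees in [0, 1]. Assumes NLTK with the WordNet corpus installed.

```python
# Word-to-word WordNet path similarity aggregated over a sentence pair.
from nltk.corpus import wordnet as wn

def word_sim(w1, w2):
    """Best path similarity over all synset pairs of the two words."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)  # None when no path exists
            if sim and sim > best:
                best = sim
    return best

def sentence_sim(words1, words2):
    if not words1 or not words2:
        return 0.0
    scores = [max(word_sim(w1, w2) for w2 in words2) for w1 in words1]
    return sum(scores) / len(scores)  # graded score in [0, 1]

print(sentence_sim(['bank', 'money'], ['finance', 'cash']))
```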
In this paper we present an approach for interconnecting some of the most important lexical resources for the Romanian language, including the Corpus for Contemporary Romanian Language (CoRoLa), the Romanian WordNet, and the Thesaurus Dictionary of Romanian Language in Electronic Format (eDTLR). We show how they are indexed against each other to obtain a standard interconnected structure, and we then present a general architecture for a system dedicated to interconnecting lexical resources "on the fly".
Automatic word sense disambiguation has been a topic of interest since the 1950s. Sense disambiguation is not an end in itself; it is an intermediate process, required at some level by natural language processing. It is obviously useful for applications that require language interpretation (message communication, human-machine interaction), but it is also used in fields whose main purpose is not to understand natural language. In this paper we propose an approach to word sense disambiguation for the Romanian language, based on supervised learning, a large feature set, and information gain for dimensionality reduction. On the Romanian words subtask, the extensive feature set combined with a Maximum Entropy classifier yields results that outperform comparable studies.
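A hedged sketch of the pipeline shape, not the paper's implementation: rank features by mutual information (the information-gain criterion for discrete features) and train a Maximum Entropy classifier, which is equivalent to multinomial logistic regression. Feature extraction from Romanian sense-annotated contexts is assumed to happen upstream; `k_features=500` is a placeholder.

```python
# Information-gain feature selection feeding a MaxEnt classifier.
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_wsd_classifier(k_features=500):
    return make_pipeline(
        SelectKBest(mutual_info_classif, k=k_features),  # keep top-k features
        LogisticRegression(max_iter=1000),  # MaxEnt = multinomial logistic
    )

# Usage with a precomputed feature matrix X and sense labels y:
#   clf = build_wsd_classifier().fit(X_train, y_train)
#   predictions = clf.predict(X_test)
```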
WordNet is a collection of words that express meanings, and a central component of its construction is the Synonym Set, or synset. Building synsets requires synonyms and relies on the commutative (symmetric) nature of synonymy; an English-language thesaurus serves as the reference data from which synonyms are taken. Broadly speaking, WordNet differs from a dictionary in that each word's meaning is related to other words, and establishing these equivalences requires a commutativity check. This check yields candidate synonym sets, but the candidates cannot be used directly: a grouping process over the words must still be carried out to produce the final synsets. In this study, the grouping is done with agglomerative clustering, in which a threshold value determines the number of iterations, i.e., serves as the stopping condition for the merging process. The clustering process is run with threshold values from 0.1 to 1 to find the threshold that produces the best synonym sets, and accuracy is calculated and evaluated with the F-measure method to identify the best results.
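A minimal sketch of threshold-driven agglomerative clustering over a word-similarity matrix, using SciPy's hierarchical clustering; the toy matrix and the 0.4 threshold are illustrative only, and real input would be thesaurus-derived similarities swept over thresholds 0.1 to 1 as the abstract describes.

```python
# Agglomerative clustering of words, cut at a distance threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

words = ['big', 'large', 'huge', 'small']
similarity = np.array([[1.0, 0.9, 0.8, 0.1],
                       [0.9, 1.0, 0.7, 0.2],
                       [0.8, 0.7, 1.0, 0.1],
                       [0.1, 0.2, 0.1, 1.0]])
distance = 1.0 - similarity                   # similarity -> distance
condensed = squareform(distance, checks=False)  # condensed form for linkage
tree = linkage(condensed, method='average')     # average-link agglomeration
labels = fcluster(tree, t=0.4, criterion='distance')  # stop at threshold
print(dict(zip(words, labels)))  # cluster label per word
```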
In the development of the Indonesian WordNet, the synonym set is an important component that represents the similarity of meaning between words. Synonym sets are built using the Indonesian Thesaurus as the lexical database: after an extraction process from the Indonesian Thesaurus, we obtain synonym sets that capture the shared word senses between words. In general, WordNet differs from a dictionary in its main focus: a dictionary usually focuses on a single word, whereas WordNet focuses on the meanings of words and their connectedness to other words. As explained in previous research, synonym sets have been constructed with several approaches, namely clustering and Word Sense Disambiguation (WSD). In this article, the approach used to produce synonym sets is the ROCK (Robust Clustering Using Links) algorithm, which uses similarity and link values. The resulting synonym sets are then used for lexical database development. The main focus of this article is therefore to produce synonym sets through the clustering process and to calculate their accuracy using the F-Measure method, with a gold standard for performance calculation and evaluation.
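A sketch of ROCK's core "link" idea: two words are neighbours when their Jaccard similarity over synonym candidate sets reaches a threshold theta, and the link value between two words is their number of common neighbours, which ROCK then maximises within clusters. The toy synonym data and theta value below are illustrative only.

```python
# Link computation at the heart of the ROCK algorithm.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def links(items, theta=0.3):
    names = list(items)
    # Neighbourhood: words whose Jaccard similarity reaches theta.
    neighbours = {w: {v for v in names
                      if v != w and jaccard(items[w], items[v]) >= theta}
                  for w in names}
    # Link value: number of common neighbours per word pair.
    return {(w, v): len(neighbours[w] & neighbours[v])
            for i, w in enumerate(names) for v in names[i + 1:]}

items = {'big': {'large', 'huge'}, 'large': {'big', 'huge'},
         'huge': {'big', 'large'}, 'tiny': {'small'}}
print(links(items))
```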
Measuring semantic similarity between words has applications in many fields, such as artificial intelligence, information processing, medical care, and linguistics. In this paper, we present a new approach to semantic similarity measurement based on edge counting and information content theory. Specifically, the proposed measure nonlinearly transforms the weighted shortest path length between the compared concepts into a semantic similarity score, and the relation between the parameters and the correlation value is discussed in detail. Experimental results show that the proposed approach not only achieves a high correlation with human ratings but also has better distribution characteristics of the correlation coefficient than several related works in the literature. In addition, the proposed method is computationally efficient owing to the simplified way it weights the shortest path length between concept pairs.
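An illustrative sketch of the general scheme only, since the abstract does not give the exact weighting or transform: discount the WordNet path length by the IC of the least common subsumer, then map the weighted length into (0, 1] with a decaying exponential. The `alpha` parameter and the specific discount are assumptions for illustration.

```python
# Nonlinear transform of an IC-weighted shortest path length.
# Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
import math
from nltk.corpus import wordnet as wn, wordnet_ic
from nltk.corpus.reader.wordnet import information_content

brown_ic = wordnet_ic.ic('ic-brown.dat')

def weighted_path_similarity(s1, s2, alpha=0.25):
    path = s1.shortest_path_distance(s2)
    if path is None:
        return 0.0
    lcs = s1.lowest_common_hypernyms(s2)
    ic_lcs = information_content(lcs[0], brown_ic) if lcs else 0.0
    weighted_len = path / (1.0 + ic_lcs)    # deeper LCS shortens the path
    return math.exp(-alpha * weighted_len)  # nonlinear map to (0, 1]

print(weighted_path_similarity(wn.synset('car.n.01'), wn.synset('bicycle.n.01')))
```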