This paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both the syntactic and the lexical resource, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on the identification of the senses used in BulTreeBank, but the senses of a lemma missing from the corpus have also been covered through the exploration of larger corpora. The identified senses have been organized into synsets for the Bulgarian WordNet and then aligned to the Princeton WordNet synsets. Various types of mappings between the two resources are considered from a cross-lingual perspective, with a view to ensuring maximum connectivity and the potential to incorporate language-specific concepts. The mapping between the two WordNets (English and Bulgarian) is a basis for applications such as machine translation and multilingual information retrieval.
The results of the manual mapping of the Polish plWordNet onto the English Princeton WordNet revealed a number of gaps and mismatches between these interlinked lexical resources. Preliminary studies have shown that they embrace wordnet-specific and language-specific differences; in this exploratory study we focus on the latter, also called lacunae. Capitalising on the system of equivalence types and features for linking wordnet senses (Rudnicka et al. 2019), we present a semi-automatic, rule-based diagnostic system developed specifically for the systematic detection and classification of gaps and mismatches between wordnets. First, focusing on noun synsets, we aim to identify the network fragments that are most prone to reveal lexical and referential gaps (Svensén 2009). Second, we attempt to identify areas in the interlinked Polish-English wordnet that require resource expansion or modification of the existing network of inter-lingual relations.
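To make the rule-based diagnostic idea concrete, the sketch below shows one plausible gap-detection rule over inter-lingual links. The triple format, the equivalence-type labels, and the synset identifiers are illustrative assumptions, not the actual plWordNet/PWN schema or the authors' rule set.

```python
# A possible gap-detection rule over inter-lingual links. The triple
# format, equivalence-type labels, and synset identifiers below are
# illustrative assumptions, not the actual plWordNet/PWN schema.
from collections import defaultdict

# (polish_synset_id, english_synset_id, equivalence_type) triples
links = [
    ("pl:oscypek.n.1", "en:cheese.n.01", "inter-register"),
    ("pl:dom.n.1", "en:house.n.01", "strong"),
]

by_synset = defaultdict(list)
for pl_synset, en_synset, eq_type in links:
    by_synset[pl_synset].append(eq_type)

# Rule: a noun synset linked only by weaker equivalence types (never by
# a strong synonymy link) is flagged as a candidate lexical gap.
candidate_gaps = [s for s, types in by_synset.items() if "strong" not in types]
print(candidate_gaps)  # ['pl:oscypek.n.1']
```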
• This paper proposes a methodology for designing a TechWord-based lexical database.
• The approach can improve the text mining performance of technological information.
• This paper defines TechWord, a unit of technological lexical information.
• A TechSynset is constructed by network analysis based on word embedding vectors.
• A case study from the automotive field is presented to validate the proposed approach.
Text mining of technological documents such as patents plays an important role in technology intelligence for technology R&D planning. WordNet, an English-based lexical database, is widely used for pre-processing text data, for example in word lemmatization and synonym search. However, technological vocabulary is complex and specific, and WordNet's ability to analyze technological information is limited because it does not reflect technological features. Thus, to improve the text mining performance of technological information, this study proposes a methodology for designing a TechWord-based lexical database built on the lexical characteristics that differentiate technological words from general words. To do this, we define TechWord, a unit of technological lexical information, and construct a TechSynset, a synonym set over TechWords. First, through dependency parsing, TechWords, the unit words that describe a technology, are identified among nouns and verbs. The importance of their connectivity is investigated by a network centrality index analysis based on the dependency relations of the words. Subsequently, to find synonyms suitable for the target technology domain, a TechSynset is constructed from synset information, with an additional analysis that calculates cosine similarity based on word embedding vectors. To apply the proposed methodology to actual technology-related information analysis, we collect patent data from the automotive field and present the resulting TechWords and TechSynsets. This study improves technological-information-based text mining by structuring the word-to-word link information in technological documents through an automated process.
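A minimal sketch of the two core TechSynset steps described above, assuming dependency edges and word embeddings have already been extracted from a patent corpus. The toy edges, the degree-centrality index, and the 0.9 similarity threshold are illustrative choices, not the authors' exact pipeline.

```python
# Two core steps: centrality-based TechWord selection over dependency
# relations, then embedding-similarity grouping into a TechSynset.
import numpy as np
import networkx as nx

# Word-to-word dependency relations extracted from patent sentences (toy).
dep_edges = [("engine", "control"), ("engine", "cylinder"),
             ("control", "unit"), ("battery", "control")]
g = nx.Graph(dep_edges)

# Connectivity importance via a network centrality index.
centrality = nx.degree_centrality(g)
techwords = [w for w, c in centrality.items() if c >= 0.3]

# Toy embeddings; in practice these come from a model trained on the
# patent corpus (e.g. word2vec).
rng = np.random.default_rng(0)
vectors = {w: rng.random(50) for w in g.nodes}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pair TechWords whose embeddings are close into one TechSynset.
threshold = 0.9
techsynset = {(a, b) for a in techwords for b in techwords
              if a < b and cosine(vectors[a], vectors[b]) >= threshold}
print(techwords, techsynset)
```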
In recent years, researchers have proposed several feature-based methods to measure semantic similarity using knowledge resources like Wikipedia and WordNet. While Wikipedia covers millions of concepts with multiple features, it has some limitations such as articles with limited content and concept ambiguity. Disambiguating these concepts remains a challenge. Conversely, WordNet offers unambiguous terms by covering all possible senses, making it a useful resource for disambiguating Wikipedia concepts. Additionally, WordNet can enrich the limited content of Wikipedia articles. Thus, we present a new approach that combines both resources to enhance previous feature-based methods of semantic similarity. We begin by analyzing the limitations of previous research, followed by introducing a novel method to disambiguate Wikipedia concepts using WordNet’s synonym structure, resulting in more effective disambiguation. Furthermore, we use WordNet to supplement the features in Wikipedia articles and redefine the feature similarity functions. Finally, we train non-linear fitting-based models to measure semantic similarity. Our approach outperforms other previous methods on various benchmarks. To further showcase our approach, we apply our models to develop a movie recommender system using the MovieLens dataset, which consistently outperforms other systems.
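The disambiguation step can be illustrated with a Lesk-style overlap between a Wikipedia article's context and each WordNet sense's gloss and synonyms, assuming NLTK's WordNet is installed (nltk.download('wordnet')). This stands in for the paper's synonym-structure method, which the abstract does not fully specify.

```python
# Pick the WordNet sense of a Wikipedia title whose gloss and synonym
# lemmas overlap the article's context words the most (Lesk-style).
from nltk.corpus import wordnet as wn

def disambiguate(wiki_title, wiki_context_words):
    """Return the best-matching WordNet synset for a Wikipedia concept."""
    context = {w.lower() for w in wiki_context_words}
    best, best_overlap = None, -1
    for synset in wn.synsets(wiki_title):
        signature = set(synset.definition().lower().split())
        signature |= {l.name().lower() for l in synset.lemmas()}
        overlap = len(signature & context)
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

print(disambiguate("bank", ["river", "water", "slope"]))
```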
Identifying the home location of Twitter users is very important in many business and community applications. Therefore, many approaches have been developed to automatically geolocate Twitter users from their tweets. In this paper, a new model to predict the home location of Twitter users based on sentiment analysis (Pre-HLSA) is proposed. It predicts users’ home locations using only their tweets, by analyzing a set of tweet features. Achieving this goal makes it possible to provide geospatial services, especially during epidemic dispersion. Pre-HLSA represents user tweets as a set of extracted features and predicts users’ home locations by analyzing their tweets for sentiments and polarities, even in the absence of geospatial clues. Different classifiers are then applied. The experimental results show promising performance compared to previous methods in terms of accuracy and the mean and median error measures, achieving up to 85% accuracy, a 223 km mean error, and a 96 km median error.
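A minimal sketch of the sentiment-feature idea, assuming NLTK's VADER analyzer (requires nltk.download('vader_lexicon')) and scikit-learn. The three aggregate polarity features, the toy tweets, and the city labels are illustrative assumptions, not the Pre-HLSA feature set.

```python
# Aggregate sentiment polarities of a user's tweets into one feature
# vector, then train a classifier over users with known home locations.
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.linear_model import LogisticRegression

sia = SentimentIntensityAnalyzer()

def user_features(tweets):
    """Mean positive, negative, and compound polarity over a user's tweets."""
    scores = [sia.polarity_scores(t) for t in tweets]
    n = len(scores)
    return [sum(s["pos"] for s in scores) / n,
            sum(s["neg"] for s in scores) / n,
            sum(s["compound"] for s in scores) / n]

users_tweets = [["love the weather here", "great coffee downtown"],
                ["traffic is terrible again", "so humid today"]]
home_locations = ["CityA", "CityB"]  # hypothetical labels

X = [user_features(tweets) for tweets in users_tweets]
clf = LogisticRegression().fit(X, home_locations)
print(clf.predict([user_features(["lovely sunny morning"])]))
```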
Email has become one of the major applications in daily life. The continuous growth in the number of email users has led to a massive increase in unsolicited emails, also known as spam. Managing and classifying this huge number of emails is an important challenge. Most approaches introduced to solve this problem handle the high dimensionality of emails using syntactic feature selection. In this paper, an efficient email filtering approach based on semantic methods is presented. The proposed approach employs the WordNet ontology and applies different semantics-based methods and similarity measures to reduce the huge number of extracted textual features, thereby reducing space and time complexity. Moreover, to obtain a minimal optimal feature set, feature dimensionality reduction has been integrated using feature selection techniques such as Principal Component Analysis (PCA) and Correlation-based Feature Selection (CFS). Experimental results on the standard benchmark Enron dataset show that the proposed semantic filtering approach, combined with feature selection, achieves high computational performance at high space and time reduction rates. A comparative study of several classification algorithms indicates that Logistic Regression achieves the highest accuracy compared to Naïve Bayes, Support Vector Machine, J48, Random Forest, and radial basis function networks. By integrating the CFS feature selection technique, the average recorded accuracy for all the algorithms used is above 90%, with more than 90% feature reduction. The conducted experiments also show that the proposed work performs significantly better, with higher accuracy and lower runtime, than other related works.
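A minimal sketch of the semantic reduction step, assuming NLTK's WordNet: term columns whose terms share a synset are merged before PCA. Merging by shared synsets stands in for the paper's fuller set of semantic measures, CFS is omitted, and the toy matrix and terms are illustrative.

```python
# Merge term columns whose terms share a WordNet synset, then reduce
# the merged semantic features further with PCA.
import numpy as np
from nltk.corpus import wordnet as wn
from sklearn.decomposition import PCA

terms = ["purchase", "buy", "winner", "prize"]
rng = np.random.default_rng(0)
X = rng.random((100, len(terms)))  # toy email-by-term feature matrix

groups, merged_away = [], set()
for i, term in enumerate(terms):
    if i in merged_away:
        continue
    group = [i]
    for j in range(i + 1, len(terms)):
        # Terms sharing at least one synset are treated as one feature.
        if j not in merged_away and set(wn.synsets(term)) & set(wn.synsets(terms[j])):
            group.append(j)
            merged_away.add(j)
    groups.append(group)

X_sem = np.column_stack([X[:, g].sum(axis=1) for g in groups])
X_red = PCA(n_components=2).fit_transform(X_sem)  # toward a minimal set
print(X.shape, X_sem.shape, X_red.shape)
```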
The complex nature of big data resources requires new structuring methods, especially for textual content. WordNet is a good knowledge source for the comprehensive abstraction of natural language, as it has good implementations for many languages. Since WordNet embeds natural language in the form of a complex network, a transformation mechanism, WordNet2Vec, is proposed in this paper. It creates a vector for each word in WordNet. These vectors encapsulate the general position of a given word, that is, its role relative to all other words in the given natural language. Any list or set of such vectors contains knowledge about the context of its components within the whole language. This type of word representation can easily be applied to many analytic tasks such as classification or clustering. The usefulness of the WordNet2Vec method is demonstrated in sentiment analysis, including the classification of an Amazon opinion text dataset with transfer learning.
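On a toy graph, the WordNet2Vec idea can be sketched as follows: each word's vector is its shortest-path distance to every other word in the network, encoding its position relative to the whole vocabulary. The five-node graph is an illustrative stand-in for the full WordNet network the paper uses.

```python
# Each word's vector is its shortest-path distance to every other word
# in the network, i.e. its position relative to the whole vocabulary.
import networkx as nx
import numpy as np

g = nx.Graph([("dog", "canine"), ("canine", "animal"),
              ("cat", "feline"), ("feline", "animal")])

words = sorted(g.nodes)
dist = dict(nx.all_pairs_shortest_path_length(g))

# Row i is the vector for words[i].
vectors = np.array([[dist[w][u] for u in words] for w in words])
print(dict(zip(words, vectors)))
```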
• A text clustering technique based on statistical and semantic features is proposed.
• Combining the statistical and semantic features improves text clustering.
• The formation of lexical chains from WordNet captures important terms.
• Semantic relations such as synonymy and hypernymy help to reduce dimensionality.
Document clustering is a heavily researched problem in text mining. Individual approaches based on statistical features and on semantic features have been used extensively to solve this problem; however, techniques that combine the advantages of both types of features have received less attention. Specifically, as textual data grows immensely in size, there is a need for an approach that combines the advantages of both feature types to give more accurate results within an acceptable amount of time. In this paper, a document clustering technique is proposed that combines the effectiveness of statistical features (using TF-IDF) and semantic features (using lexical chains). It is designed to use fewer features while maintaining comparable, and even better, accuracy for the task of document clustering.
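A minimal sketch of combining the two feature types, assuming NLTK's WordNet and scikit-learn. The one-level hypernym grouping below is a crude stand-in for full lexical-chain construction, and the toy documents are illustrative.

```python
# TF-IDF statistical features merged along hypernym groups (a crude
# lexical-chain proxy), then clustered with k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import wordnet as wn
import numpy as np

docs = ["the dog chased the cat", "a puppy and a kitten played",
        "stocks fell on the market", "shares dropped in trading"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = vec.get_feature_names_out()

def chain_key(term):
    """Map a term to its first noun hypernym so related terms share a column."""
    for s in wn.synsets(term, pos=wn.NOUN):
        if s.hypernyms():
            return s.hypernyms()[0].name()
    return term

keys = [chain_key(t) for t in terms]
merged = {k: np.zeros(len(docs)) for k in keys}
for k, col in zip(keys, X.T):
    merged[k] += col

X_sem = np.column_stack(list(merged.values()))
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_sem)
print(labels)
```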
In the last decade, several lexical-semantic knowledge bases (LKBs) were developed for Portuguese by different teams following different approaches. Most of them are open and freely available to the community. Those LKBs are briefly analysed here, with a focus on size, structure, and overlapping contents. We go further, however, and exploit all of the analysed LKBs in the creation of new LKBs based on their redundant contents. Both the original and the redundancy-based LKBs are then compared, indirectly, based on the performance of automatic procedures that exploit them in four different semantic analysis tasks. In addition to conclusions on the performance of the original LKBs, the results show that, instead of selecting a single LKB, it is generally worth combining the contents of all the open Portuguese LKBs for better results.
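The redundancy-based idea can be sketched as keeping only the relation triples attested in at least two LKBs; the Portuguese triples below are toy examples, not taken from the actual resources.

```python
# Build a redundancy-based LKB: keep relation triples that appear in at
# least two of the input LKBs.
from collections import Counter

lkb_a = {("carro", "synonym", "automóvel"), ("carro", "hypernym", "veículo")}
lkb_b = {("carro", "synonym", "automóvel"), ("cão", "synonym", "cachorro")}
lkb_c = {("carro", "synonym", "automóvel"), ("carro", "hypernym", "veículo")}

counts = Counter(t for lkb in (lkb_a, lkb_b, lkb_c) for t in lkb)
redundant_lkb = {t for t, n in counts.items() if n >= 2}
print(redundant_lkb)
```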
Descriptive clustering typically consists of organizing data instances into groups and generating a practical description for each group. The description should allow a user to proceed without further examination of the individual instances, conveying the content of each group and enabling quick filtering for relevant classes. Often, the choice of descriptions relies on a greedy selection based on heuristic information. We model and integrate descriptive clustering so that it both predicts features from cluster assignments and predicts cluster assignments from a subset of features. For improved extraction of Multi-Labeled Multi-Word Expressions (MLMWEs), we present a domain-independent, clustering-based approach. The framework incorporates real-world information from Wikipedia articles, a generally useful corpus, and their links. We combine association measures to cluster the MLMWEs and then compute an organizing score for each MLMWE based on the closest model assigned to a cluster. Evaluation results, obtained with a combination of association measures for two languages, show an improvement over independent, frequency-based baseline measures and clearly beneficial results for MLMWE extraction.