Automatic classification of sentiment is important for numerous applications such as opinion mining, opinion summarization, contextual advertising, and market analysis. Typically, sentiment classification has been modeled as the problem of training a binary classifier using reviews annotated for positive or negative sentiment. However, sentiment is expressed differently in different domains, and annotating corpora for every possible domain of interest is costly. Applying a sentiment classifier trained on labeled data for a particular domain to classify the sentiment of user reviews in a different domain often results in poor performance, because words that occur in the training (source) domain might not appear in the test (target) domain. We propose a method to overcome this problem in cross-domain sentiment classification. First, we create a sentiment-sensitive distributional thesaurus using labeled data for the source domains and unlabeled data for both source and target domains. Sentiment sensitivity is achieved in the thesaurus by incorporating document-level sentiment labels in the context vectors used as the basis for measuring the distributional similarity between words. Next, we use the created thesaurus to expand feature vectors at train and test time in a binary classifier. The proposed method significantly outperforms numerous baselines and returns results that are comparable with previously proposed cross-domain sentiment classification methods on a benchmark data set containing Amazon user reviews for different types of products. We conduct an extensive empirical analysis of the proposed method on single- and multi-source domain adaptation, unsupervised and supervised domain adaptation, and numerous similarity measures for creating the sentiment-sensitive thesaurus.
Moreover, our comparisons against SentiWordNet, a lexical resource for word polarity, show that the created sentiment-sensitive thesaurus accurately captures words that express similar sentiments.
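The core idea in the abstract above, a thesaurus whose word similarities are computed from context vectors that include document-level sentiment labels, and which is then used to expand feature vectors, can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation; the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def build_context_vectors(docs):
    """docs: list of (tokens, label) pairs; label is '+', '-', or None for
    unlabeled data. Each word's context vector mixes co-occurring words and,
    when available, the document-level sentiment label (the 'sentiment
    sensitivity')."""
    vecs = defaultdict(lambda: defaultdict(float))
    for tokens, label in docs:
        for w in tokens:
            for c in tokens:
                if c != w:
                    vecs[w][c] += 1.0
            if label is not None:
                vecs[w]["LABEL_" + label] += 1.0  # sentiment feature
    return vecs

def cosine(u, v):
    shared = set(u) & set(v)
    num = sum(u[k] * v[k] for k in shared)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def expand_features(tokens, vecs, k=2):
    """Append the k most distributionally similar words to a review's
    feature vector, so that source- and target-domain vocabulary overlap."""
    expanded = list(tokens)
    for w in tokens:
        sims = sorted(((cosine(vecs[w], vecs[c]), c)
                       for c in vecs if c != w), reverse=True)
        expanded.extend(c for _, c in sims[:k])
    return expanded
```

On a toy corpus, a review containing only "excellent" gets expanded with "great", because both words co-occur with similar contexts and the same sentiment label; a classifier trained on one word can then fire on the other.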
SEthesaurus: WordNet in Software Engineering. Chen, Xiang; Chen, Chunyang; Zhang, Dun, et al.
IEEE Transactions on Software Engineering, 09/2021, Volume 47, Issue 9.
Journal article, peer reviewed.
Informal discussions on social platforms (e.g., Stack Overflow, CodeProject) have accumulated a large body of programming knowledge in the form of natural language text. Natural language processing (NLP) techniques can be utilized to harvest this knowledge base for software engineering tasks. However, a consistent vocabulary for each concept is essential to make effective use of these NLP techniques. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms (such as abbreviations, synonyms and misspellings) in informal discussions. Existing techniques to deal with such morphological forms are either designed for general English or mainly resort to domain-specific lexical rules. A thesaurus, which contains software-specific terms and commonly used morphological forms, is desirable to perform normalization for software engineering text. However, constructing this thesaurus manually is a challenging task. In this paper, we propose an automatic unsupervised approach to build such a thesaurus. In particular, we first identify software-specific terms by utilizing a software-specific corpus (e.g., Stack Overflow) and a general corpus (e.g., Wikipedia). Then we infer morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations. Finally, we perform graph analysis on morphological relations. We evaluate the coverage and accuracy of our constructed thesaurus against community-cumulated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our constructed thesaurus by developing three applications and also verify the generality of our approach in constructing thesauruses from data sources in other domains.
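The first step described above, identifying software-specific terms by contrasting a domain corpus with a general corpus, is commonly done by comparing relative term frequencies. A minimal sketch of that contrast step follows; the scoring function and the toy corpora are assumptions for illustration, not the paper's exact formula.

```python
from collections import Counter

def domain_specificity(domain_tokens, general_tokens, min_count=2):
    """Rank terms by how much more frequent they are in the domain corpus
    than in the general corpus (add-one smoothing avoids division by zero
    for terms absent from the general corpus)."""
    d, g = Counter(domain_tokens), Counter(general_tokens)
    nd, ng = sum(d.values()), sum(g.values())
    scores = {}
    for term, c in d.items():
        if c < min_count:          # drop rare, unreliable terms
            continue
        scores[term] = (c / nd) / ((g[term] + 1) / (ng + 1))
    return sorted(scores, key=scores.get, reverse=True)
```

Terms like "npm" score high because they are frequent in the domain corpus but rare in general text, while common words like "the" score near 1 and sink to the bottom of the ranking.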
This article explores three aspects of Masterman's language work and applies them to questions of spiritual intelligence: metaphor, coherence, and ambiguity. First, metaphor, which is ubiquitous in ordinary language, both leads and misleads in religious and scientific understanding. Masterman's case for a "dual-approach" to thinking, both speculative and critical, is explored and tied to concepts of moral-spiritual development per Pierre Hadot and Hannah Arendt. Second, Masterman's work on machine translation presents semantic disambiguation as an emerging coherence wherein one gradually homes in on meaning through features of ordinary language (like redundancy and repetition). This is applied to the problem of comprehending difficult spiritual language, and tied to spiritual stretching and spiritual cartography. Third, Masterman's work with thesauri, rather than relying on words as having fixed meanings, appeals to a concept of semantic spaces, nebulae of variously interconnected meanings. This is constructed into an exhortation to reambiguate overfamiliar religious language, to reinvest one's quotidian surroundings with spiritual meaning through defamiliarization.
What is all this new MeSH about? Nentidis, Anastasios; Krithara, Anastasia; Tsoumakas, Grigorios, et al.
International Journal on Digital Libraries, 12/2021, Volume 22, Issue 4.
Journal article, peer reviewed, open access.
The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual provenance of these new descriptors. In particular, we investigate whether such new descriptors have been previously covered by older descriptors and what is their current relation to them. To this end, we propose a framework to categorize new descriptors based on their current relation to older descriptors. Based on the proposed classification scheme, we quantify, analyze, and present the different types of new descriptors introduced in MeSH during the last fifteen years. The results show that only about 25% of new MeSH descriptors correspond to new emerging concepts, whereas the rest were previously covered by one or more existing descriptors, either implicitly or explicitly. Most of them were covered by a single existing descriptor and they usually end up as descendants of it in the current hierarchy, gradually leading toward a more fine-grained MeSH vocabulary. These insights about the dynamics of the thesaurus are useful for the retrospective study of scientific articles annotated with MeSH, but could also be used to inform the policy of updating the thesaurus in the future.
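The comparison described above, deciding whether a new descriptor refines an existing part of the hierarchy or introduces an emerging concept, can be approximated by diffing two thesaurus versions and checking parentage. This is only a schematic reading of the paper's classification scheme; the data shapes and descriptor names below are hypothetical.

```python
def categorize_new_descriptors(old, new):
    """old/new: dict mapping descriptor name -> set of parent descriptor
    names, for two annual versions of the thesaurus. A new descriptor whose
    parents all existed in the old version is treated here as a refinement
    of existing coverage; otherwise as a candidate emerging concept."""
    added = set(new) - set(old)
    refinements, emerging = set(), set()
    for d in added:
        if new[d] and new[d] <= set(old):
            refinements.add(d)
        else:
            emerging.add(d)
    return refinements, emerging
```

Under this reading, a descriptor attached beneath a long-standing parent counts toward the ~75% that were already implicitly covered, while one with no pre-existing ancestry counts toward the ~25% of emerging concepts.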
The performance of keyword expansion in prior methods is often enhanced by adopting external knowledge. Given a set of initial keywords, this paper proposes a novel method to expand semantically or conceptually related keywords from a domain corpus by employing mass diffusion. A bipartite word network is constructed based on co-occurrence relations between the initial keywords and candidate words. The expanded keywords are identified via a two-step mass diffusion carried out in the bipartite network. Experimental results show that the proposed method outperforms both a typical statistics-based approach and a graph-based approach. Our research is expected to complement the theoretical framework of keyword expansion and is applicable to scenarios such as query expansion, thesaurus construction, and text clustering.
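Mass diffusion on a bipartite network can be sketched as follows. The abstract does not spell out the exact diffusion steps, so this assumes one plausible reading: unit mass on each seed keyword flows to its co-occurring candidates, then back through the keyword side to candidates again, and candidates are ranked by final mass. All names and the toy graph are illustrative.

```python
from collections import defaultdict

def two_step_diffusion(seeds, edges):
    """seeds: iterable of initial keywords. edges: dict keyword -> set of
    candidate words it co-occurs with (the bipartite adjacency). Each hop
    splits a node's mass equally among its neighbors."""
    # Reverse adjacency: candidate -> keywords it co-occurs with.
    back = defaultdict(set)
    for k, cands in edges.items():
        for c in cands:
            back[c].add(k)
    # Step 1: each seed keyword spreads unit mass over its candidates.
    cand_mass = defaultdict(float)
    for k in seeds:
        for c in edges.get(k, ()):
            cand_mass[c] += 1.0 / len(edges[k])
    # Step 2: candidate mass flows back to keywords, then out to candidates.
    final = defaultdict(float)
    for c, m in cand_mass.items():
        for k in back[c]:
            share = m / len(back[c])
            for c2 in edges[k]:
                final[c2] += share / len(edges[k])
    return sorted(final, key=final.get, reverse=True)
```

Candidates linked to several keywords accumulate mass from multiple paths, so words co-occurring with more of the seed set rise to the top of the expansion ranking.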
After years of research on ontology matching, it is reasonable to consider several questions: is the field of ontology matching still making progress? Is this progress significant enough to pursue further research? If so, what are the particularly promising directions? To answer these questions, we review the state of the art of ontology matching and analyze the results of recent ontology matching evaluations. These results show a measurable improvement in the field, albeit at a slowing pace. We conjecture that significant improvements can be obtained only by addressing important challenges for ontology matching. We present such challenges with insights on how to approach them, thereby aiming to direct research into the most promising tracks and to facilitate the progress of the field.
Pointwise mutual information (PMI) is a widely used word similarity measure, but it lacks a clear explanation of how it works. We explore how PMI differs from distributional similarity, and we introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that it outperforms traditional PMI in the application of automatic thesaurus generation and in two word similarity benchmark tasks: human similarity ratings and TOEFL synonym questions. PMImax achieves a correlation coefficient comparable to the best knowledge-based approaches on the Miller-Charles similarity rating data set.
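Plain PMI, the baseline that PMImax extends, is defined as PMI(x, y) = log(p(x, y) / (p(x) p(y))) over corpus co-occurrence counts. The sketch below shows only this baseline (the sense-count coefficients of PMImax are fit empirically in the paper and are not reproduced here); the windowing choice and toy corpus are assumptions.

```python
import math
from collections import Counter

def pmi(corpus, w1, w2, window=2):
    """PMI from co-occurrence counts: log p(x,y) / (p(x) p(y)).
    corpus: list of token lists; a 'co-occurrence' is any pair of words
    within `window` positions of each other."""
    word_counts, pair_counts, n = Counter(), Counter(), 0
    for sent in corpus:
        word_counts.update(sent)
        n += len(sent)
        for i, a in enumerate(sent):
            for b in sent[i + 1 : i + 1 + window]:
                pair_counts[frozenset((a, b))] += 1  # unordered pair
    p_xy = pair_counts[frozenset((w1, w2))] / n
    p_x, p_y = word_counts[w1] / n, word_counts[w2] / n
    return math.log(p_xy / (p_x * p_y)) if p_xy else float("-inf")
```

Word pairs that co-occur more often than their marginal frequencies predict get positive PMI; pairs that never co-occur fall to negative infinity, which is one of the sparsity problems that motivates variants like PMImax.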
Information retrieval systems traditionally rely on textual keywords to index and retrieve documents. Keyword-based retrieval may return inaccurate and incomplete results when different keywords are used to describe the same concept in the documents and in the queries. Furthermore, the relationship between these related keywords may be semantic rather than syntactic, and capturing it thus requires access to comprehensive human world knowledge. Concept-based retrieval methods have attempted to tackle these difficulties by using manually built thesauri, by relying on term co-occurrence data, or by extracting latent word relationships and concepts from a corpus. In this article we introduce a new concept-based retrieval approach based on Explicit Semantic Analysis (ESA), a recently proposed method that augments keyword-based text representation with concept-based features, automatically extracted from massive human knowledge repositories such as Wikipedia. Our approach generates new text features automatically, and we have found that high-quality feature selection becomes crucial in this setting to make the retrieval more focused. However, due to the lack of labeled data, traditional feature selection methods cannot be used; hence we propose new methods that use self-generated labeled training data. The resulting system is evaluated on several TREC datasets, showing superior performance over previous state-of-the-art results.
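The ESA representation described above maps a text into a vector over knowledge-base concepts via an inverted index from words to the articles they appear in. A minimal sketch follows; real ESA weights the index with TF-IDF over Wikipedia, whereas this toy uses raw counts, and the concept articles are invented for illustration.

```python
import math
from collections import defaultdict

def build_concept_index(concepts):
    """concepts: dict concept_name -> list of tokens in its article.
    Returns an inverted index word -> {concept: count}."""
    index = defaultdict(dict)
    for name, tokens in concepts.items():
        for t in tokens:
            index[t][name] = index[t].get(name, 0) + 1
    return index

def esa_vector(tokens, index):
    """Map a text to concept space by summing its words' concept weights."""
    vec = defaultdict(float)
    for t in tokens:
        for concept, w in index.get(t, {}).items():
            vec[concept] += w
    return vec

def esa_similarity(a, b, index):
    """Cosine similarity between two texts in concept space."""
    va, vb = esa_vector(a, index), esa_vector(b, index)
    num = sum(va[c] * vb[c] for c in set(va) & set(vb))
    den = (math.sqrt(sum(x * x for x in va.values()))
           * math.sqrt(sum(x * x for x in vb.values())))
    return num / den if den else 0.0
```

The point of the concept space is that "money" and "loan" share no keyword yet land on the same concept, so their similarity is high, which is exactly the vocabulary-mismatch case keyword retrieval misses.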
The article summarizes experience in the education of aircraft designers with the support of modern information systems, programming languages, databases, and knowledge bases. We consider the problem of designing the interface of a designer's workplace assistant based on a thesaurus of the subject area. We present the existing form of the designer's working interface and its capabilities. Further directions for developing the program and its communication and interaction with the subject-area thesaurus are considered. A conceptual description of the model of ontological support for the intelligent designer's assistant is given.