• We survey methods of representing clinical text using neural networks.
• We provide a “how-to” guide for training these representations on clinical text.
• We describe word models, corpora, evaluation methods, and applications.
Representing words as numerical vectors based on the contexts in which they appear has become the de facto method of analyzing text with machine learning. In this paper, we provide a guide for training these representations on clinical text data, using a survey of relevant research. Specifically, we discuss different types of word representations, clinical text corpora, available pre-trained clinical word vector embeddings, intrinsic and extrinsic evaluation, applications, and limitations of these approaches. This work can be used as a blueprint for clinicians and healthcare workers who may want to incorporate clinical text features in their own models and applications.
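As a concrete starting point for the kind of training this guide describes, here is a minimal sketch of learning static word vectors from a de-identified clinical corpus with gensim's word2vec. The file name, preprocessing, and hyperparameters are illustrative placeholders, not recommendations from the survey.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical corpus: one de-identified clinical note per line.
with open("clinical_notes.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

model = Word2Vec(
    sentences,
    vector_size=200,   # dimensionality of the embeddings
    window=5,          # context window size
    min_count=5,       # drop rare tokens (typos, rare identifiers)
    sg=1,              # skip-gram, often preferred for domain corpora
    workers=4,
)
model.save("clinical_w2v.model")

# Nearest neighbors give a quick sanity check of the learned space.
print(model.wv.most_similar("hypertension", topn=5))
```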
Efficient distributed numerical word representation models (word embeddings) combined with modern machine learning algorithms have recently yielded considerable improvement on automatic document classification tasks. However, the effectiveness of such techniques has not yet been assessed for hierarchical text classification (HTC). This study investigates the application of those models and algorithms to this specific problem through experimentation and analysis. We trained classification models with prominent machine learning algorithm implementations (fastText, XGBoost, SVM, and Keras’ CNN) and notable word embedding generation methods (GloVe, word2vec, and fastText) on publicly available data and evaluated them with measures specifically appropriate for the hierarchical context. FastText achieved an lcaF1 of 0.893 on a single-labeled version of the RCV1 dataset. Our analysis indicates that using word embeddings and their variants is a very promising approach for HTC.
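To make the setup concrete, a minimal sketch of flat supervised classification with the fastText library follows; hierarchical evaluation with lcaF1 would be layered on top, and the file name and hyperparameters here are illustrative assumptions rather than the paper's settings.

```python
import fasttext

# train.txt (hypothetical): one document per line, each prefixed with
# "__label__<category>", which is fastText's supervised input format.
model = fasttext.train_supervised(
    input="train.txt",
    dim=100,        # embedding dimensionality
    epoch=25,
    lr=0.5,
    wordNgrams=2,   # bigram features often help text classification
)

# Predict the most probable label for an unseen document.
labels, probs = model.predict("oil prices rise on supply concerns")
print(labels, probs)
```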
• We present a fully unsupervised CLIR method based on neural topic-enhanced cross-lingual word embeddings.
• We propose a three-part method to model neural topic relevance and word embeddings in a mutual reinforcement way.
• We perform extensive experiments and show that our proposed method outperforms several state-of-the-art methods.
Cross-lingual information retrieval (CLIR) methods have quickly made the transition from translation-based approaches to semantic-based approaches. In this paper, we examine the limitations of current unsupervised neural CLIR methods, especially those leveraging aligned cross-lingual word embedding (CLWE) spaces. At present, CLWEs are normally constructed from monolingual corpora through an iterative induction process, in which homonymy and polysemy are major obstacles. At the same time, contextual text representation methods often fail to significantly outperform static CLWE methods for CLIR. We propose a method that uses a novel neural generative model with Wasserstein autoencoders to learn neural topic-enhanced CLWEs for CLIR purposes; it requires minimal supervision, or none at all. On the CLEF test collections, we perform a comparative evaluation of state-of-the-art semantic CLWE methods alongside our proposed method on neural CLIR tasks. We demonstrate that our method outperforms the existing CLWE methods and multilingual contextual text encoders, and we show that it obtains significant improvements over CLWE methods based upon representative topical embeddings.
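For context, here is a minimal sketch of the standard CLWE alignment baseline this abstract builds on: an orthogonal Procrustes mapping between two monolingual embedding spaces given a seed dictionary. This is not the proposed topic-enhanced Wasserstein-autoencoder method, and the random matrices stand in for real seed-pair vectors.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal W minimizing ||XW - Y||_F, from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Stand-ins for embedding vectors of seed translation pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))   # source-language seed vectors
Y = rng.normal(size=(1000, 300))   # target-language seed vectors

W = procrustes_align(X, Y)
aligned = X @ W   # source vectors mapped into the target space
```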
Word embeddings are fixed-length, dense, and distributed word representations used in natural language processing (NLP) applications. Word embedding models fall into two basic types: non-contextual (static) models and contextual models. The former generates a single embedding for a word regardless of its context, while the latter produces distinct embeddings for a word based on the specific contexts in which it appears. Many studies compare contextual and non-contextual embedding models within their respective groups in different languages. However, very few studies compare the models across the two groups, and no such study exists for Turkish. Such a comparison requires converting contextual embeddings into static embeddings. In this paper, we compare and evaluate the performance of several contextual and non-contextual models in both intrinsic and extrinsic evaluation settings for Turkish. We make a fine-grained comparison by analyzing the syntactic and semantic capabilities of the models separately. The results of the analyses provide insights into the suitability of different embedding models for different types of NLP tasks. We also build a Turkish word embedding repository comprising the embedding models used in this work, which may serve as a valuable resource for researchers and practitioners in the field of Turkish NLP. We make the word embeddings, scripts, and evaluation datasets publicly available.
• A comprehensive analysis of static word embedding models for the Turkish language.
• Intrinsic and extrinsic performances of the models on several tasks are presented.
• BERT with the X2Static method outperformed all models in intrinsic and extrinsic tasks.
• FastText excels in conjugation tasks, while word2vec shines in analogy tasks.
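A minimal sketch of the contextual-to-static conversion step the abstract above mentions: a word's static vector is obtained by averaging its contextual representations over many sentences, in the spirit of X2Static. The model name and the first-occurrence subword matching are assumptions for illustration, not the paper's exact procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# BERTurk, a publicly available Turkish BERT model.
tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
bert = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")
bert.eval()

def static_vector(word, sentences):
    """Average the word's mean-pooled subword states across sentences."""
    word_ids = tok(word, add_special_tokens=False)["input_ids"]
    vecs = []
    for sent in sentences:
        enc = tok(sent, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state[0]  # (seq_len, dim)
        ids = enc["input_ids"][0].tolist()
        # Locate the word's subword span (first occurrence only).
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                vecs.append(hidden[i:i + len(word_ids)].mean(dim=0))
                break
    return torch.stack(vecs).mean(dim=0)
```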
The Geometry of Culture. Kozlowski, Austin C.; Taddy, Matt; Evans, James A. American Sociological Review, vol. 84, no. 5, October 2019. Peer-reviewed journal article; open access.
We argue word embedding models are a useful tool for the study of culture, using a historical analysis of shared understandings of social class as an empirical case. Word embeddings represent semantic relations between words as relationships between vectors in a high-dimensional space, specifying a relational model of meaning consistent with contemporary theories of culture. Dimensions induced by word differences (rich–poor) in these spaces correspond to dimensions of cultural meaning, and the projection of words onto these dimensions reflects widely shared associations, which we validate with surveys. Analyzing text from millions of books published over 100 years, we show that the markers of class continuously shifted amidst the economic transformations of the twentieth century, yet the basic cultural dimensions of class remained remarkably stable. The notable exception is education, which became tightly linked to affluence independent of its association with cultivated taste.
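The projection technique is easy to reproduce. Below is a minimal sketch using pretrained GloVe vectors from gensim's downloader as a stand-in corpus (the study itself trains embeddings on a century of digitized books); the antonym pairs and probe words are illustrative.

```python
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # stand-in embedding space

def dimension(pairs):
    """A cultural dimension: averaged, normalized difference vectors."""
    d = np.mean([wv[a] - wv[b] for a, b in pairs], axis=0)
    return d / np.linalg.norm(d)

affluence = dimension([("rich", "poor"),
                       ("wealthy", "impoverished"),
                       ("affluent", "destitute")])

def project(word, dim):
    """Cosine of the word with the dimension: its position on the axis."""
    v = wv[word]
    return float(v @ dim / np.linalg.norm(v))

for w in ["education", "luxury", "welfare"]:
    print(w, round(project(w, affluence), 3))
```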
Static word embeddings (SWE) and contextualized word embeddings (CWE) are the foundation of modern natural language processing. However, these embeddings suffer from spatial bias in the form of anisotropy, which has been shown to reduce their performance. One method to alleviate the anisotropy is the “whitening” transformation. Whitening is a standard method in signal processing and other areas; however, its effect on SWE and CWE is not well understood. In this study, we conduct an experiment to elucidate the effect of whitening on SWE and CWE. The results indicate that whitening predominantly removes the word frequency bias in SWE, and biases other than the word frequency bias in CWE.
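For reference, a minimal sketch of the whitening transformation applied to an embedding matrix: center the vectors, then rotate and rescale using the eigendecomposition of the covariance so the result has (approximately) identity covariance. The synthetic matrix below stands in for real SWE or CWE vectors.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """X: (n_words, dim) embeddings; returns the whitened matrix."""
    Xc = X - X.mean(axis=0, keepdims=True)   # remove the mean vector
    cov = Xc.T @ Xc / len(Xc)
    U, s, _ = np.linalg.svd(cov)             # cov is symmetric PSD
    W = U / np.sqrt(s + eps)                 # unit variance per component
    return Xc @ W

# Synthetic anisotropic "embeddings": per-dimension variances differ widely.
rng = np.random.default_rng(0)
E = rng.normal(size=(5000, 300)) * rng.uniform(0.1, 3.0, size=300)
Ew = whiten(E)
print(np.allclose(np.cov(Ew, rowvar=False), np.eye(300), atol=0.1))  # True
```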
Estimating the semantic similarity between text data is one of the challenging and open research problems in the field of Natural Language Processing (NLP). The versatility of natural language makes it difficult to define rule-based methods for determining semantic similarity measures. To address this issue, various semantic similarity methods have been proposed over the years. This survey article traces the evolution of such methods, beginning from traditional NLP techniques such as kernel-based methods to the most recent research on transformer-based models, categorizing them by their underlying principles as knowledge-based, corpus-based, deep neural network–based, and hybrid methods. Discussing the strengths and weaknesses of each method, this survey provides a comprehensive view of existing systems for new researchers to experiment with and develop innovative ideas to address the issue of semantic similarity.
Sentiment analysis is an active area of research in language understanding, yet the neural networks applied to it remain underexplored. Most current work recognizes sentiment by concentrating on syntax and vocabulary. Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have both produced remarkable results on this and related natural language processing tasks, but CNNs must stack multiple layers to capture long-term dependencies. In this paper, to improve sentiment understanding, we construct a joint architecture that places an RNN first to capture long-term dependencies, followed by a CNN with a global average pooling layer, on top of GloVe word embeddings learned without supervision from a large Twitter corpus. Experiments show better performance than baseline models on Twitter corpora, with reliable results on sentiment benchmarks: 90.59% on the Stanford Twitter Sentiment Corpus, 89.46% on Sentiment Strength Twitter Data, and 88.72% on the Health Care Reform Dataset. Empirically, the architecture is efficient: with slight hyperparameter tuning, placing a single RNN layer before a single convolutional layer captures long-term dependencies with fewer parameters and higher performance than stacking multiple convolutional layers.
• A convolutional neural network needs many stacked layers to learn to extract local features.
• A recurrent neural network can capture long-term dependencies in a single layer.
• Domain-specific GloVe word embeddings improve the models’ performance.
• The model combines a single RNN layer and one convolutional layer with global average pooling.
• A joint architecture for sentiment analysis on small, medium, and large datasets.
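A minimal Keras sketch of the joint architecture described above, assuming a frozen GloVe embedding layer, one LSTM layer for long-range context, one convolutional layer, and global average pooling; the layer sizes and the random embedding matrix are placeholders, not the paper's configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, seq_len = 20000, 100, 60
glove_matrix = np.random.rand(vocab_size, embed_dim).astype("float32")  # stand-in for GloVe

model = keras.Sequential([
    keras.Input(shape=(seq_len,)),
    layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=keras.initializers.Constant(glove_matrix),
        trainable=False),                      # frozen pretrained vectors
    layers.LSTM(128, return_sequences=True),   # RNN first: long-range context
    layers.Conv1D(64, kernel_size=3, activation="relu"),  # local features
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),     # binary sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```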
This Special Issue highlights the most recent research in the NLP field and discusses related open issues, with a particular focus on emerging approaches for learning, understanding, producing, and grounding language, interactively or autonomously from data, in cognitive and neural systems, as well as on their potential or real applications in different domains.
Question Answering (QA) systems based on Information Retrieval return precise answers to natural language questions by extracting relevant sentences from document collections. However, questions and sentences cannot always be aligned terminologically, which generates errors in sentence retrieval. To improve the retrieval of relevant sentences from documents, this paper proposes a hybrid Query Expansion (QE) approach for QA systems based on lexical resources and word embeddings. In detail, synonyms and hypernyms of relevant terms occurring in the question are first extracted from MultiWordNet and then contextualized to the document collection used in the QA system. Finally, the resulting set is ranked and filtered on the basis of the wording and sense of the question, employing a semantic similarity metric built on top of a Word2Vec model. The latter is trained locally on an extended corpus on the same topic as the documents used in the QA system. This QE approach is implemented in an existing QA system and experimentally evaluated, with respect to different possible configurations and selected baselines, for the Italian language in the Cultural Heritage domain, assessing its effectiveness in retrieving sentences containing proper answers to questions belonging to four different categories.
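A minimal sketch of the hybrid expansion idea: gather synonyms and hypernyms from a WordNet resource, then keep only candidates that a locally trained word2vec model judges close to the original term. NLTK's English WordNet stands in for MultiWordNet, and the model path, threshold, and example term are hypothetical.

```python
from gensim.models import Word2Vec
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

w2v = Word2Vec.load("domain_w2v.model").wv  # hypothetical local model

def expand(term, topn=5, threshold=0.4):
    """Lexical candidates filtered by embedding similarity to the term."""
    candidates = set()
    for syn in wn.synsets(term):
        candidates.update(l.name() for l in syn.lemmas())        # synonyms
        for hyper in syn.hypernyms():
            candidates.update(l.name() for l in hyper.lemmas())  # hypernyms
    scored = [(c, float(w2v.similarity(term, c)))
              for c in candidates if c != term and c in w2v]
    scored = [(c, s) for c, s in scored if s >= threshold]  # semantic filter
    return sorted(scored, key=lambda x: -x[1])[:topn]

print(expand("painting"))
```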