Analyzing complex systems with multimodal data, such as images and text, has recently received tremendous attention. Modeling the relationship between different modalities is the key to addressing this problem. Motivated by recent successful applications of deep learning to unimodal data, in this paper we propose a computational deep neural architecture, the bimodal deep architecture (BDA), for measuring the similarity between different modalities. Our proposed BDA has three closely related consecutive components. For the image and text modalities, the first component can be constructed using popular feature extraction methods for each individual modality. The second component has two types of stacked restricted Boltzmann machines (RBMs). Specifically, for the image modality a binary-binary RBM is stacked over a Gaussian-binary RBM; for the text modality a binary-binary RBM is stacked over a replicated softmax RBM. In the third component, we design a variant of the autoencoder with a predefined loss function for discriminatively learning the regularity between different modalities. We show experimentally the effectiveness of our approach on the task of classifying image tags on publicly available datasets.
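The stacking described above can be sketched minimally. This is not the paper's implementation: the layer sizes, random weights, and `RBMLayer` class are illustrative stand-ins, showing only how an image feature vector passes upward through a Gaussian-binary RBM and then a binary-binary RBM to yield a shared-space code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBMLayer:
    """Upward (visible-to-hidden) pass of an RBM; weights are random stand-ins."""
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
        self.b = np.zeros(n_hidden)

    def hidden_probs(self, v):
        # P(h=1 | v) for binary hidden units; for a Gaussian-binary RBM the
        # visible units are real-valued but this hidden conditional is the same.
        return sigmoid(v @ self.W + self.b)

# Image pathway: Gaussian-binary RBM first, binary-binary RBM stacked on top.
gauss_rbm = RBMLayer(n_visible=128, n_hidden=64)
binary_rbm = RBMLayer(n_visible=64, n_hidden=32)

image_features = rng.normal(size=(5, 128))   # e.g. real-valued image descriptors
h1 = gauss_rbm.hidden_probs(image_features)  # first-layer representation
image_codes = binary_rbm.hidden_probs(h1)    # stacked representation
```

The text pathway would mirror this with a replicated softmax RBM as the bottom layer.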
Microblogging has become a daily communication tool in recent years, and research on microblogs has drawn increasing attention. Microblog emotion classification is a major topic in user intent analysis based on User-Generated Content (UGC). This paper focuses on discriminating between two emotional tendencies: positive and negative. Our system first removes noisy elements from a microblog post, then extracts its features, and finally classifies it using a Support Vector Machine (SVM). Furthermore, we improve the feature extraction and weight computation algorithms by combining a dictionary-based approach with a rule-based approach. The experimental results show that the method is effective.
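The dictionary-plus-rule weighting step can be illustrated with a toy sketch. The lexicon entries and the single negation rule below are hypothetical (the paper's actual dictionary and rules are not given here); the point is only how rule evidence adjusts dictionary-derived feature weights before they reach a classifier such as an SVM.

```python
# Hypothetical sentiment lexicon and negation rule, for illustration only.
LEXICON = {"happy": 1.0, "great": 0.8, "sad": -1.0, "awful": -0.9}
NEGATIONS = {"not", "never"}

def feature_weights(tokens):
    """Dictionary score per token, flipped by a simple negation rule."""
    weights = {}
    for i, tok in enumerate(tokens):
        score = LEXICON.get(tok, 0.0)
        if i > 0 and tokens[i - 1] in NEGATIONS:  # rule: negation flips polarity
            score = -score
        if score:
            weights[tok] = score
    return weights

def polarity(tokens):
    """Sign of the summed weights as a crude positive/negative decision."""
    total = sum(feature_weights(tokens).values())
    return "positive" if total > 0 else "negative"
```

In the full system these weights would populate a feature vector fed to the SVM rather than being summed directly.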
On grey degree of compound grey number Dapeng Wang; Ruifan Li
Proceedings of 2013 IEEE International Conference on Grey systems and Intelligent Services (GSIS),
11/2013
Conference Proceeding
Compound grey numbers and their grey degrees are important for research on grey numbers and their algorithm rules. This paper proposes definitions of the compound grey number and its corresponding grey degree based on the "field" and the axiomatic definition of grey degree, and formulates the grey degree of a compound grey number by giving full consideration to the variations of the "field" when a compound grey number is generated from simple grey numbers, thus enriching the original definition of grey degree. Furthermore, this paper investigates the linear convergence of the grey degree of compound grey numbers. Examples show the plausibility of these definitions and of the linear convergence, and the validity of the Accumulated Generating Operation in excluding outside disturbance.
Extracting key sentences with sentiments from discourse plays an important role in sentiment analysis. Unlike general discourse, Internet news has its own fashion of sentiment expression. In this paper, we attempt to extract key sentiment sentences from Internet news articles, and propose a method, called MSF, that uses multiple-source features. In our method, we first design four sources of features for each sentence: lexical sentiment, global position, word grammar indicator, and title similarity. These features are then linearly combined to obtain a score indicating the probability that the sentence is a key sentiment sentence. Experiments on a publicly available dataset show the effectiveness of our MSF method.
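The linear combination step is simple enough to sketch directly. The four coefficients below are hypothetical (the abstract does not state how the combination weights are set), and the per-sentence feature values are invented for the example.

```python
# Hypothetical combination weights over the four MSF feature sources.
WEIGHTS = {"lexical": 0.4, "position": 0.2, "grammar": 0.2, "title": 0.2}

def msf_score(features):
    """Linear combination of the four MSF feature sources for one sentence."""
    return sum(WEIGHTS[name] * features[name] for name in WEIGHTS)

sentences = [
    {"lexical": 0.9, "position": 1.0, "grammar": 0.5, "title": 0.7},  # lead sentence
    {"lexical": 0.1, "position": 0.2, "grammar": 0.3, "title": 0.0},  # body sentence
]
scores = [msf_score(s) for s in sentences]
best = max(range(len(scores)), key=scores.__getitem__)  # top key-sentiment candidate
```

Ranking sentences by this score and keeping the top few yields the extracted key sentiment sentences.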
Visual grounding aims to locate a specific region in a given image guided by a natural language query. It relies on fine-grained alignment of visual information and text semantics. We propose a one-stage visual grounding model based on cross-modal feature fusion, which regards the task as a coordinate regression problem and implements end-to-end optimization. The bounding-box coordinates are directly predicted from the fused features, but previous fusion methods such as element-wise product, summation, and concatenation are too simple to combine the deep information within feature vectors. To improve the quality of the fused features, we incorporate a co-attention mechanism to deeply transform the representations of the two modalities. We evaluate our grounding model on publicly available datasets, including Flickr30k Entities, RefCOCO, RefCOCO+ and RefCOCOg. Quantitative evaluation results show that the co-attention mechanism plays a positive role in multi-modal feature fusion for visual grounding.
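A minimal form of the co-attention idea can be sketched as follows. This is a generic affinity-based co-attention, not the paper's exact architecture: a region-word affinity matrix lets each modality attend over the other before fusion, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(V, Q):
    """Minimal co-attention via a shared affinity matrix.

    V: (n_regions, d) visual features; Q: (n_words, d) query features.
    Returns cross-attended features with the same shapes.
    """
    A = V @ Q.T                        # (n_regions, n_words) affinity
    V_att = softmax(A, axis=1) @ Q     # each region attends over words
    Q_att = softmax(A.T, axis=1) @ V   # each word attends over regions
    return V_att, Q_att

rng = np.random.default_rng(0)
V_att, Q_att = co_attention(rng.normal(size=(36, 8)), rng.normal(size=(5, 8)))
```

The attended features would then be fused and passed to the coordinate-regression head.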
A hybrid approach to identifying sentiment polarity for new words Yang Yang; Ruifan Li; Yanquan Zhou
2014 4th International Conference on Wireless Communications, Vehicular Technology, Information Theory and Aerospace & Electronic Systems (VITAE)
Conference Proceeding
Microblogs are a typical form of heterogeneous information, and identifying the sentiment polarity of new words in them plays a fundamental role in sentiment analysis. In this paper, we propose a hybrid approach that uses both statistical and syntactic information to identify the sentiment polarity of new words. We first filter noise out of the raw tweets and segment the clean data with POS tagging. Next, we collect new words using filtering rules. Then, we assign each new word a polarity using both statistical and pattern information. We evaluate our approach on a real dataset from Sina Weibo, achieving a relatively high F-score of 0.241 compared with the baseline of 0.22.
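Two steps of this pipeline lend themselves to a short sketch: noise filtering and the final polarity vote. The cleaning rules (URLs, @-mentions, retweet marks) and the weighted combination are assumptions for illustration; the paper's actual filters and scoring are not specified here.

```python
import re

def clean_tweet(text):
    """Strip assumed microblog noise: URLs, @-mentions, and retweet marks."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = text.replace("RT", "")
    return " ".join(text.split())

def assign_polarity(stat_score, pattern_score, alpha=0.5):
    """Hypothetical combination: weighted vote of statistical and pattern evidence."""
    combined = alpha * stat_score + (1 - alpha) * pattern_score
    return "positive" if combined >= 0 else "negative"
```

Between these two steps the cleaned text would be segmented and POS-tagged, and candidate new words collected by filtering rules.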
Differential Networks for Visual Question Answering Wu, Chenfei; Liu, Jinlai; Wang, Xiaojie ...
Proceedings of the ... AAAI Conference on Artificial Intelligence,
07/2019, Volume: 33, Issue: 1
Journal Article
Open Access
The task of Visual Question Answering (VQA) has emerged in recent years for its potential applications. To address the VQA task, a model should efficiently fuse feature elements from both images and questions. Existing models fuse an image feature element v_i and a question feature element q_i directly, for example via the element-wise product v_i * q_i. Such solutions largely ignore two key points: 1) whether v_i and q_i are in the same space, and 2) how to reduce the observation noise in v_i and q_i. We argue that differences between feature elements of the same modality, such as (v_i − v_j) and (q_i − q_j), are more likely to lie in the same space, and that the difference operation helps reduce observation noise. To achieve this, we first propose Differential Networks (DN), a novel plug-and-play module that computes differences between pairwise feature elements. With DN as a tool, we then propose DN-based Fusion (DF), a novel model for the VQA task. We achieve state-of-the-art results on four publicly available datasets. Ablation studies also show the effectiveness of the difference operations in the DF model.
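The core difference operation can be sketched in its naive dense form. This is an illustrative reading, not the paper's implementation (which would factorize the weight tensor for efficiency): each output unit sums weighted pairwise differences (x_i − x_j), which makes the layer invariant to a constant additive offset on the input, the intuition behind its noise-reduction claim.

```python
import numpy as np

rng = np.random.default_rng(0)

def differential_layer(x, W):
    """Naive differential layer: outputs weighted sums of pairwise differences.

    x: (n,) input features; W: (n, n, m) weights; returns an (m,) output.
    """
    diffs = x[:, None] - x[None, :]        # (n, n) matrix of x_i - x_j
    return np.einsum("ij,ijm->m", diffs, W)

n, m = 6, 4
x = rng.normal(size=n)
W = rng.normal(size=(n, n, m))
y = differential_layer(x, W)
```

Because (x_i + c) − (x_j + c) = x_i − x_j, any shared offset c in the input vanishes in the output.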
The task of image captioning aims to automatically generate descriptive sentences for a given image. Most existing works use a recurrent neural network as the language decoder. In this paper, we use a transformer structure to generate descriptive captions. When applied to image captioning, the transformer network has two problems. The first is the disappearance of query-vector information in the stacked network. The second is the lack of spatial information between objects in the decoding process. To solve these problems, we propose an improved Transformer with IoU Position encoding, i.e., TIP. We improve the transformer in two respects. First, we propose an intra-modal attention mechanism to alleviate the problem of vanishing query vectors. Second, we propose an Intersection-over-Union (IoU) spatial position encoding method to enhance the semantic information of images. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our model.
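The IoU quantity underlying the spatial position encoding is standard and easy to sketch. How TIP injects it into the transformer is not specified in the abstract; the comment at the end describes one plausible use (a pairwise spatial bias on attention), which is an assumption.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A pairwise IoU matrix over detected object boxes could then serve as a
# spatial bias for attention between object regions.
boxes = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
iou_matrix = [[iou(a, b) for b in boxes] for a in boxes]
```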
The limited availability of labeled data has become the bottleneck in numerous text-related tasks. Recently, few-shot learning based on pre-trained language models has become an attractive topic. Entailment-based Few-shot Learning (EFL) is an effective approach that transforms a text classification task into a textual entailment task, bridging the gap between downstream tasks and pre-training tasks. However, the performance of the downstream task is sensitive to the manually selected templates in this type of approach. To alleviate this problem, we improve the EFL method with a simple template selection mechanism that leverages a masked language model to assess the quality of candidate templates. We evaluate our method on the FewCLUE shared tasks, and extensive experiments demonstrate the effectiveness of our proposed method.
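The selection mechanism can be sketched abstractly. The `mlm_score` argument stands in for an assumed masked-language-model plausibility score (e.g., an average masked-token log-probability); the toy scorer and templates below are invented, and a real system would back the scorer with a pre-trained MLM.

```python
def select_template(templates, premise, mlm_score):
    """Pick the template whose filled-in hypothesis the scorer finds most plausible."""
    best, best_score = None, float("-inf")
    for tpl in templates:
        hypothesis = tpl.format(text=premise)
        score = mlm_score(hypothesis)
        if score > best_score:
            best, best_score = tpl, score
    return best

# Toy scorer standing in for a pre-trained MLM: prefers shorter hypotheses.
toy_score = lambda s: -len(s)
templates = [
    "{text} It was great.",
    "{text} This review expresses a positive sentiment.",
]
chosen = select_template(templates, "I loved this film.", toy_score)
```

The chosen template then defines the entailment hypothesis used in the EFL classification step.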