Computational saliency models for still images have gained significant popularity in recent years. Saliency prediction from videos, on the other hand, has received relatively little interest from the ...community. Motivated by this, in this paper, we study the use of deep learning for dynamic saliency prediction and propose the so-called spatio-temporal saliency networks. The key to our models is the architecture of two-stream networks where we investigate different fusion mechanisms to integrate spatial and temporal information. We evaluate our models on the dynamic images and eye movements and University of Central Florida-Sports datasets and present highly competitive results against the existing state-of-the-art models. We also carry out some experiments on a number of still images from the MIT300 dataset by exploiting the optical flow maps predicted from these images. Our results show that considering inherent motion information in this way can be helpful for static saliency estimation.
Recent years have witnessed the emergence of new image smoothing techniques which have provided new insights and raised new questions about the nature of this well-studied problem. Specifically, ...these models separate a given image into its structure and texture layers by utilizing non-gradient based definitions for edges or special measures that distinguish edges from oscillations. In this study, we propose an alternative yet simple image smoothing approach which depends on covariance matrices of simple image features, aka the region covariances. The use of second order statistics as a patch descriptor allows us to implicitly capture local structure and texture information and makes our approach particularly effective for structure extraction from texture. Our experimental results have shown that the proposed approach leads to better image decompositions as compared to the state-of-the-art methods and preserves prominent edges and shading well. Moreover, we also demonstrate the applicability of our approach on some image editing and manipulation tasks such as image abstraction, texture and detail enhancement, image composition, inverse halftoning and seam carving.
Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing ...communities. In this survey, we classify the existing approaches based on how they conceptualize this problem, viz., models that cast description as either generation problem or as a retrieval problem over a visual or multimodal representational space. We provide a detailed review of existing models, highlighting their advantages and disadvantages. Moreover, we give an overview of the benchmark image datasets and the evaluation measures that have been developed to assess the quality of machine-generated image descriptions. Finally we extrapolate future directions in the area of automatic image description generation.
The immense amount of videos being uploaded to video sharing platforms makes it impossible for a person to watch all the videos understand what happens in them. Hence, machine learning techniques are ...now deployed to index videos by recognizing key objects, actions and scenes or places. Summarization is another alternative as it offers to extract only important parts while covering the gist of the video content. Ideally, the user may prefer to analyze a certain action or scene by searching a query term within the video. Current summarization methods generally do not take queries into account or require exhaustive data labeling. In this work, we present a weakly supervised
query-focused
video summarization method. Our proposed approach makes use of semantic attributes as an indicator of query relevance and semantic attention maps to locate related regions in the frames and utilizes both within a submodular maximization framework. We conducted experiments on the recently introduced RAD dataset and obtained highly competitive results. Moreover, to better evaluate the performance of our approach on longer videos, we collected a new dataset, which consists of 10 videos from YouTube and annotated with shot-level multiple attributes. Our dataset enables much diverse set of queries that can be used to summarize a video from different perspectives with more degrees of freedom.
Reconstructing high dynamic range (HDR) images of a complex scene involving moving objects and dynamic backgrounds is prone to artifacts. A large number of methods have been proposed that attempt to ...alleviate these artifacts, known as HDR deghosting algorithms. Currently, the quality of these algorithms are judged by subjective evaluations, which are tedious to conduct and get quickly outdated as new algorithms are proposed on a rapid basis. In this paper, we propose an objective metric which aims to simplify this process. Our metric takes a stack of input exposures and the deghosting result and produces a set of artifact maps for different types of artifacts. These artifact maps can be combined to yield a single quality score. We performed a subjective experiment involving 52 subjects and 16 different scenes to validate the agreement of our quality scores with subjective judgements and observed a concordance of almost 80%. Our metric also enables a novel application that we call as hybrid deghosting, in which the output of different deghosting algorithms are combined to obtain a superior deghosting result.
In daily life, humans demonstrate an amazing ability to remember images they see on magazines, commercials, TV, web pages, etc. but automatic prediction of intrinsic memorability of images using ...computer vision and machine learning techniques has only been investigated very recently. Our goal in this article is to explore the role of visual attention and image semantics in understanding image memorability. In particular, we present an attention-driven spatial pooling strategy and show that considering image features from the salient parts of images improves the results of the previous models. We also investigate different semantic properties of images by carrying out an analysis of a diverse set of recently proposed semantic features which encode meta-level object categories, scene attributes, and invoked feelings. We show that these features which are automatically extracted from images provide memorability predictions as nearly accurate as those derived from human annotations. Moreover, our combined model yields results superior to those of state-of-the art fully automatic models.
•We examine the role of visual attention and image semantics in understanding image memorability.•We propose an attention-driven spatial pooling strategy for image memorability.•Considering image features from the salient parts of images improves the results of the previous models.•We also investigate different semantic properties of images.•Combining attention-driven pooling with semantic features yields state-of-the-art results.
Automatic generation of video descriptions in natural language, also called
video captioning
, aims to understand the visual content of the video and produce a natural language sentence depicting the ...objects and actions in the scene. This challenging integrated vision and language problem, however, has been predominantly addressed for English. The lack of data and the linguistic properties of other languages limit the success of existing approaches for such languages. In this paper we target Turkish, a morphologically rich and agglutinative language that has very different properties compared to English. To do so, we create the first large-scale video captioning dataset for this language by carefully translating the English descriptions of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset into Turkish. In addition to enabling research in video captioning in Turkish, the parallel English–Turkish descriptions also enable the study of the role of video context in (multimodal) machine translation. In our experiments, we build models for both video captioning and multimodal machine translation and investigate the effect of different word segmentation approaches and different neural architectures to better address the properties of Turkish. We hope that the MSVD-Turkish dataset and the results reported in this work will lead to better video captioning and multimodal machine translation models for Turkish and other morphology rich and agglutinative languages.
In the past few years, automatically generating descriptions for images has attracted a lot of attention in computer vision and natural language processing research. Among the existing approaches, ...data-driven methods have been proven to be highly effective. These methods compare the given image against a large set of training images to determine a set of relevant images, then generate a description using the associated captions. In this study, the authors propose to integrate an object-based semantic image representation into a deep features-based retrieval framework to select the relevant images. Moreover, they present a novel phrase selection paradigm and a sentence generation model which depends on a joint analysis of salient regions in the input and retrieved images within a clustering framework. The authors demonstrate the effectiveness of their proposed approach on Flickr8K and Flickr30K benchmark datasets and show that their model gives highly competitive results compared with the state-of-the-art models.
To detect visually salient elements of complex natural scenes, computational bottom-up saliency models commonly examine several feature channels such as color and orientation in parallel. They ...compute a separate feature map for each channel and then linearly combine these maps to produce a master saliency map. However, only a few studies have investigated how different feature dimensions contribute to the overall visual saliency. We address this integration issue and propose to use covariance matrices of simple image features (known as region covariance descriptors in the computer vision community; Tuzel, Porikli, & Meer, 2006) as meta-features for saliency estimation. As low-dimensional representations of image patches, region covariances capture local image structures better than standard linear filters, but more importantly, they naturally provide nonlinear integration of different features by modeling their correlations. We also show that first-order statistics of features could be easily incorporated to the proposed approach to improve the performance. Our experimental evaluation on several benchmark data sets demonstrate that the proposed approach outperforms the state-of-art models on various tasks including prediction of human eye fixations, salient object detection, and image-retargeting.
Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement ...techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task since they have substantially more expressive capabilities to allow for improved quality. Motivated by these studies, in this paper, we aim to leverage burst photography to boost the performance and obtain much sharper and more accurate RGB images from extremely dark raw images. The backbone of our proposed framework is a novel coarse-to-fine network architecture that generates high-quality outputs progressively. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce the noise level and improve the color accuracy, we extend this network to a permutation invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature-level. Our experiments demonstrate that our approach leads to perceptually more pleasing results than the state-of-the-art methods by producing more detailed and considerably higher quality images.