Books are a rich source of both fine-grained information, such as what a character, an object, or a scene looks like, and high-level semantics, such as what someone is thinking or feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we propose a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
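As a rough illustration of the alignment step, the sketch below computes a clip-sentence cosine-similarity matrix in a joint embedding space. The array names and shapes are assumptions for illustration, not the paper's implementation, which additionally combines such similarities with other sources via the context-aware CNN.

```python
import numpy as np

def similarity_matrix(clip_embs, sent_embs):
    """Cosine similarities between movie clips and book sentences.

    clip_embs: (M, D) clip embeddings; sent_embs: (N, D) sentence
    embeddings, both assumed to live in a joint video-text space.
    Entry (i, j) scores how well clip i matches sentence j.
    """
    c = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    s = sent_embs / np.linalg.norm(sent_embs, axis=1, keepdims=True)
    return c @ s.T
```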
•Framework allows for direct comparison of vessel segmentation algorithms for thoracic CT.
•Annotated reference dataset available online for evaluation of algorithms.
•Proposed a quantitative scoring system for nine evaluation categories.
•Most vesselness-based methods correctly distinguish vessels from lung parenchyma.
•No method is capable of distinguishing vessels from all other dense structures in the lungs.
The VESSEL12 (VESsel SEgmentation in the Lung) challenge objectively compares the performance of different algorithms for identifying vessels in thoracic computed tomography (CT) scans. Vessel segmentation is fundamental in computer-aided processing of data generated by 3D imaging modalities. As manual vessel segmentation is prohibitively time consuming, any real-world application requires some form of automation. Several approaches exist for automated vessel segmentation, but judging their relative merits is difficult due to a lack of standardized evaluation. We present an annotated reference dataset containing 20 CT scans and propose nine categories for a comprehensive evaluation of vessel segmentation algorithms from both academia and industry. Twenty algorithms participated in the VESSEL12 challenge, held at the International Symposium on Biomedical Imaging (ISBI) 2012. All results have been published on the VESSEL12 website, http://vessel12.grand-challenge.org. The challenge remains ongoing and open to new participants. Our three contributions are: (1) an annotated reference dataset, available online, for the evaluation of new algorithms; (2) a quantitative scoring system for objective comparison of algorithms; and (3) an analysis of the strengths and weaknesses of the various vessel segmentation methods in the presence of various lung diseases.
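As one plausible ingredient of such a quantitative scoring system, the sketch below computes the area under the ROC curve for a soft vessel segmentation against binary voxel annotations. The function name and the rank-sum formulation (ties ignored for brevity) are illustrative assumptions, not the challenge's actual scoring code.

```python
import numpy as np

def soft_segmentation_auc(probs, labels):
    """Area under the ROC curve for a soft vessel segmentation.

    probs: flattened per-voxel vesselness probabilities.
    labels: binary ground-truth annotations (1 = vessel, 0 = non-vessel).
    Uses the rank-sum (Mann-Whitney) formulation of the AUC.
    """
    order = np.argsort(probs)
    ranks = np.empty(len(probs))
    ranks[order] = np.arange(1, len(probs) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```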
Semi-structured documents are a common type of data containing free text in natural language (unstructured data) as well as additional information about the document, or metadata, typically following a schema or controlled vocabulary (structured data). Simultaneous analysis of unstructured and structured data enables the discovery of hidden relationships that cannot be identified from either of these sources when analyzed independently of each other. In this work, we present a visual text analytics tool for semi-structured documents (ViTA-SSD) that aims to support the user in exploring and finding insightful patterns in a semi-structured collection of documents in a visual and interactive manner. It achieves this goal by presenting the user with a set of coordinated visualizations that link the metadata to interactively generated clusters of documents in such a way that relevant patterns can be easily spotted. The system contains two novel approaches in its back end: a feature-learning method to learn a compact representation of the corpus and a fast-clustering approach that has been redesigned to allow user supervision. These novel contributions make it possible for the user to interact with a large and dynamic document collection and to perform several text analytical tasks more efficiently. Finally, we present two use cases that illustrate the suitability of the system for in-depth interactive exploration of semi-structured document collections, two user studies, and results of several evaluations of our text-mining components.
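As a minimal sketch of how user supervision might enter a clustering back end, the snippet below seeds k-means centroids with user-selected exemplar documents. The scheme, names, and use of scikit-learn are assumptions for illustration, not the system's actual redesigned algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def supervised_clusters(doc_vectors, seed_doc_ids):
    """Cluster documents with user supervision via seeded centroids.

    Hypothetical scheme: documents the user marks as cluster exemplars
    become the initial centroids, steering the partition toward the
    user's intent while k-means refines the assignment.
    """
    seeds = doc_vectors[seed_doc_ids]
    km = KMeans(n_clusters=len(seed_doc_ids), init=seeds, n_init=1)
    return km.fit(doc_vectors).labels_
```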
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That change, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
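A minimal sketch of the kind of change described: a triplet ranking loss in which only the hardest in-batch negative contributes, rather than a sum of hinge terms over all negatives. The names and margin value are illustrative placeholders; consult the paper for the exact formulation.

```python
import numpy as np

def hardest_negative_ranking_loss(scores, margin=0.2):
    """Max-of-hinges ranking loss over a batch similarity matrix.

    scores[i, j] is the similarity of image i and caption j; the
    diagonal holds the matched (positive) pairs. Only the hardest
    (highest-scoring) negative in the mini-batch contributes.
    """
    n = scores.shape[0]
    pos = np.diag(scores)                 # s(i, c) for matched pairs
    off = scores - np.eye(n) * 1e9        # mask out the positives
    # Hardest negative caption per image, and hardest image per caption.
    cost_cap = np.maximum(0, margin + off.max(axis=1) - pos)
    cost_img = np.maximum(0, margin + off.max(axis=0) - pos)
    return (cost_cap + cost_img).mean()
```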
Layer Normalization Ba, Jimmy Lei; Kiros, Jamie Ryan; Hinton, Geoffrey E
arXiv (Cornell University), 07/2016
Paper, Journal Article
Open access
Training state-of-the-art deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
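A minimal NumPy sketch of the computation described, assuming a feature-last layout; `gain` and `bias` stand in for the per-neuron adaptive parameters applied after normalization and before the non-linearity.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Layer normalization of the summed inputs for a single case.

    Unlike batch normalization, the mean and variance are computed
    over the neurons of the layer (last axis), not over a mini-batch,
    so the same computation runs at training and test time.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Per-neuron adaptive gain and bias; a non-linearity follows outside.
    return gain * x_hat + bias
```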
In this thesis we introduce conditional neural language models based on log-bilinear and recurrent neural networks, with applications to multimodal learning and natural language understanding. We first introduce an LSTM encoder for learning visual-semantic embeddings that rank the relevance of text to images in a joint embedding space. Next we introduce three log-bilinear models for generating image descriptions that integrate both additive and multiplicative interactions. Beyond image conditioning, we describe a multiplicative conditional neural language model for learning distributed representations of attributes and metadata. Our model allows for contextual word relatedness comparisons through decompositions of a word embedding tensor. Finally we show how the skip-gram model for learning word representations can be abstracted to a conditional recurrent neural language model for unsupervised learning of sentence representations. We introduce a family of models called contextual encoder-decoders and demonstrate how our models can be used to induce generic sentence representations as well as unaligned generation of short stories conditioned on images. This thesis closes by highlighting several open areas of future work.
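As an illustration of the multiplicative conditioning mentioned above, the sketch below uses a common three-way factorization of a word-embedding tensor, in which a word's embedding is gated by an attribute's factor vector. All names, shapes, and the specific factorization are assumptions for illustration, not the thesis's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_attrs, n_factors, dim = 10000, 50, 128, 300

# Factor matrices replacing a full (words x attributes x dim) tensor.
W_word = rng.normal(size=(n_words, n_factors)) * 0.01   # word factors
W_attr = rng.normal(size=(n_attrs, n_factors)) * 0.01   # attribute factors
W_out = rng.normal(size=(n_factors, dim)) * 0.01        # shared projection

def conditioned_embedding(word_id, attr_id):
    """Embedding of a word gated by an attribute (e.g. author, genre),
    enabling word relatedness comparisons that depend on context."""
    return (W_word[word_id] * W_attr[attr_id]) @ W_out
```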
Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through ...curvature-vector products that can be computed on the same order of time as gradients. In this paper we exploit this property and study stochastic HF with gradient and curvature mini-batches independent of the dataset size. We modify Martens' HF for these settings and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. Stochastic Hessian-free optimization gives an intermediary between SGD and HF that achieves competitive performance on both classification and deep autoencoder experiments.
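A minimal sketch of the core machinery: a finite-difference curvature-vector product (one plausible stand-in for the Gauss-Newton products typically used in practice) and a conjugate-gradient solver that consumes it. Here `grad_fn` is a hypothetical closure returning the gradient on a curvature mini-batch.

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    """Curvature-vector product Hv ~ (g(theta + eps*v) - g(theta)) / eps.

    Each product costs on the order of one extra gradient evaluation,
    which is what makes CG-based update directions affordable.
    """
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

def conjugate_gradient(hvp, b, iters=50, tol=1e-10):
    """Solve H x = b for an update direction using only products hvp(v)."""
    x = np.zeros_like(b)
    r = b - hvp(x)          # residual
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Usage sketch (hypothetical names): the HF update direction solves
# H d = -grad on the current gradient and curvature mini-batches.
# d = conjugate_gradient(lambda v: hessian_vector_product(grad_fn, theta, v), -grad)
```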
In this article we describe the approach we applied for the JRS 2012 Data Mining Competition. The task of the competition was the multi-labelled classification of biomedical documents. Our method is motivated by recent work in the machine learning and computer vision communities that highlights the usefulness of feature learning for classification tasks. Our approach uses orthogonal matching pursuit to learn a dictionary from PCA-transformed features. Binary relevance with logistic regression is applied to the encoded representations, leading to a fifth-place performance in the competition. To show the suitability of our approach outside the competition task, we also report state-of-the-art classification performance on the multi-label ASRS dataset.
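A runnable sketch of the pipeline's shape using scikit-learn: PCA, then a learned dictionary encoded with orthogonal matching pursuit, then binary relevance via one-vs-rest logistic regression. The synthetic data, component counts, and sparsity level are placeholders, not the competition settings.

```python
import numpy as np
from sklearn.decomposition import PCA, MiniBatchDictionaryLearning
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 500))                      # placeholder document features
Y = (rng.random((200, 10)) > 0.8).astype(int)   # placeholder multi-label targets

# Step 1: decorrelate and compress the raw features.
X_pca = PCA(n_components=50).fit_transform(X)

# Step 2: learn a dictionary over PCA features; encode with OMP.
dico = MiniBatchDictionaryLearning(n_components=100,
                                   transform_algorithm='omp',
                                   transform_n_nonzero_coefs=5,
                                   random_state=0).fit(X_pca)
codes = dico.transform(X_pca)

# Step 3: binary relevance, i.e. an independent logistic regression per label.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(codes, Y)
```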
We propose a soft-attention-based model for the task of action recognition in videos. We use multi-layered Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units that are deep both spatially and temporally. Our model learns to focus selectively on parts of the video frames and classifies videos after taking a few glimpses. The model essentially learns which parts of the frames are relevant for the task at hand and attaches higher importance to them. We evaluate the model on the UCF-11 (YouTube Action), HMDB-51, and Hollywood2 datasets and analyze how the model focuses its attention depending on the scene and the action being performed.
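A minimal NumPy sketch of one soft-attention glimpse over the spatial locations of a frame's feature cube; the weight names and the scoring function (a single-layer tanh network) are illustrative assumptions rather than the paper's exact parameterization.

```python
import numpy as np

def soft_attention(features, h, W_f, W_h, w):
    """One soft-attention glimpse over a frame.

    features: (L, D) feature vectors at L spatial locations.
    h: (H,) previous LSTM hidden state.
    Scores each location against the hidden state, softmaxes the
    scores into attention weights, and returns the expected
    (attention-weighted) feature to feed the LSTM, plus the weights.
    """
    scores = np.tanh(features @ W_f + h @ W_h) @ w   # (L,) location scores
    scores -= scores.max()                           # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()    # attention distribution
    return alpha @ features, alpha                   # context vector, weights
```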