We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks (RNN) over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions outperform retrieval baselines on both full images and on a new dataset of region-level annotations. Finally, we conduct a large-scale analysis of our RNN language model on the Visual Genome dataset of 4.1 million captions and highlight the differences between image- and region-level caption statistics.
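The "structured objective" this abstract mentions is, in spirit, a bidirectional max-margin ranking loss over a shared embedding space. Below is a minimal PyTorch sketch, assuming image and sentence encoders have already produced batch-aligned embeddings; the function name and margin value are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, sent_emb, margin=0.1):
    """Bidirectional max-margin ranking loss over a batch of matched
    (image, sentence) embedding pairs.

    img_emb, sent_emb: (B, D) tensors; row i of each is a true pair.
    """
    # Cosine similarity matrix: sim[i, j] = sim(image i, sentence j).
    img_emb = F.normalize(img_emb, dim=1)
    sent_emb = F.normalize(sent_emb, dim=1)
    sim = img_emb @ sent_emb.t()                    # (B, B)

    pos = sim.diag().unsqueeze(1)                   # true-pair scores, (B, 1)
    # Hinge on every mismatched pair, in both retrieval directions.
    cost_s = (margin + sim - pos).clamp(min=0)      # image -> sentences
    cost_i = (margin + sim - pos.t()).clamp(min=0)  # sentence -> images
    # True pairs on the diagonal incur no cost.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_s[eye] = 0
    cost_i[eye] = 0
    return cost_s.sum() + cost_i.sum()
```

Every off-diagonal similarity is treated as a negative for both image-to-sentence and sentence-to-image retrieval, which is what drives the two modalities into a common embedding.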
Advances in machine learning and contactless sensors have given rise to ambient intelligence: physical spaces that are sensitive and responsive to the presence of humans. Here we review how this technology could improve our understanding of the metaphorically dark, unobserved spaces of healthcare. In hospital spaces, early applications could soon enable more efficient clinical workflows and improved patient safety in intensive care units and operating rooms. In daily living spaces, ambient intelligence could prolong the independence of older individuals and improve the management of individuals with a chronic disease by understanding everyday behaviour. As with other technologies, transformation into clinical applications at scale must overcome challenges such as rigorous clinical validation, appropriate data privacy and model transparency. Thoughtful use of this technology would enable us to understand the complex interplay between the physical environment and health-critical human behaviours.
One-shot learning of object categories
Li Fei-Fei; Fergus, R.; Perona, P.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 04/2006, Volume 28, Issue 4
Journal Article
Peer reviewed
Open access
Learning visual models of object categories notoriously requires hundreds or thousands of training examples. We show that it is possible to learn much information about a category from just one, or a handful, of images. The key insight is that, rather than learning from scratch, one can take advantage of knowledge coming from previously learned categories, no matter how different these categories might be. We explore a Bayesian implementation of this idea. Object categories are represented by probabilistic models. Prior knowledge is represented as a probability density function on the parameters of these models. The posterior model for an object category is obtained by updating the prior in the light of one or more observations. We test a simple implementation of our algorithm on a database of 101 diverse object categories. We compare category models learned by an implementation of our Bayesian approach to models learned by maximum likelihood (ML) and maximum a posteriori (MAP) methods. We find that on a database of more than 100 categories, the Bayesian approach produces informative models when the number of training examples is too small for other methods to operate successfully.
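The Bayesian machinery this abstract describes reduces to a standard parameter-posterior update followed by a predictive density; in our notation (not the paper's), with prior $p(\theta)$ distilled from previously learned categories and $X_{\text{new}}$ the one or few training images:

```latex
% Posterior over category-model parameters from the few observations:
p(\theta \mid X_{\text{new}}) \;\propto\; p(X_{\text{new}} \mid \theta)\, p(\theta)

% Classification of a query image x uses the predictive density:
p(x \mid X_{\text{new}}) \;=\; \int p(x \mid \theta)\, p(\theta \mid X_{\text{new}})\, d\theta
```

The prior is what lets a single example produce an informative posterior, whereas ML and MAP point estimates have nothing to fall back on when data is scarce.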
Every moment counts in action recognition. A comprehensive understanding of human activity in video requires labeling every frame according to the actions occurring, placing multiple labels densely over a video sequence. To study this problem we extend the existing THUMOS dataset and introduce MultiTHUMOS, a new dataset of dense labels over unconstrained internet videos. Modeling multiple, dense labels benefits from temporal relations within and across classes. We define a novel variant of long short-term memory deep networks for modeling these temporal relations via multiple input and output connections. We show that this model improves action labeling accuracy and further enables deeper understanding tasks ranging from structured retrieval to action prediction.
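As a concrete baseline for the dense, multi-label setup described above, a plain per-frame LSTM with independent sigmoid outputs is a minimal sketch; the paper's variant adds multiple input and output connections across time, and the feature size and class count here are illustrative.

```python
import torch
import torch.nn as nn

class DenseActionLabeler(nn.Module):
    """Per-frame multi-label action classifier: a plain LSTM baseline
    for dense labeling (the paper's variant extends this with multiple
    input/output connections across time steps)."""

    def __init__(self, feat_dim=4096, hidden=512, num_actions=65):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim) per-frame CNN features.
        h, _ = self.lstm(frame_feats)
        # Independent sigmoids: several actions may co-occur in a frame.
        return torch.sigmoid(self.head(h))  # (batch, time, num_actions)

# Training would use nn.BCELoss against dense 0/1 labels per frame/class.
```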
Image Generation from Scene Graphs
Johnson, Justin; Gupta, Agrim; Fei-Fei, Li
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Conference Proceeding
To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicit reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method's ability to generate complex images with multiple objects.
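Below is a minimal NumPy sketch of the kind of graph convolution this abstract describes, where each (subject, predicate, object) triple is processed jointly and the outputs are pooled back onto object nodes. The shapes and the single weight matrix W are simplifications of ours, not the paper's architecture.

```python
import numpy as np

def graph_conv_step(obj_vecs, pred_vecs, edges, W):
    """One pass of a simple scene-graph convolution.

    obj_vecs:  (O, D) object embeddings
    pred_vecs: (T, D) predicate embeddings, one per triple
    edges:     (T, 2) int array of (subject_idx, object_idx)
    W:         (3*D, 3*D) shared weight matrix (hypothetical shape)
    """
    D = obj_vecs.shape[1]
    new_obj = np.zeros_like(obj_vecs)
    new_pred = np.zeros_like(pred_vecs)
    counts = np.zeros(len(obj_vecs))
    for t, (s, o) in enumerate(edges):
        # Process the whole (subject, predicate, object) triple jointly.
        triple = np.concatenate([obj_vecs[s], pred_vecs[t], obj_vecs[o]])
        out = np.maximum(W @ triple, 0)            # ReLU
        new_obj[s] += out[:D]; counts[s] += 1      # message to subject
        new_pred[t] = out[D:2 * D]                 # updated predicate
        new_obj[o] += out[2 * D:]; counts[o] += 1  # message to object
    counts = np.maximum(counts, 1)
    # Average the messages each object received across its triples.
    return new_obj / counts[:, None], new_pred
```

Stacking several such steps lets information flow along relationship paths before the layout (boxes and masks) is predicted from the refined object vectors.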
The intertwined processes of learning and evolution in complex environmental niches have resulted in a remarkable diversity of morphological forms. Moreover, many aspects of animal intelligence are deeply embodied in these evolved morphologies. However, the principles governing relations between environmental complexity, evolved morphology, and the learnability of intelligent control remain elusive, because performing large-scale in silico experiments on evolution and learning is challenging. Here, we introduce Deep Evolutionary Reinforcement Learning (DERL): a computational framework which can evolve diverse agent morphologies to learn challenging locomotion and manipulation tasks in complex environments. Leveraging DERL, we demonstrate several relations between environmental complexity, morphological intelligence and the learnability of control. First, environmental complexity fosters the evolution of morphological intelligence as quantified by the ability of a morphology to facilitate the learning of novel tasks. Second, we demonstrate a morphological Baldwin effect: in our simulations, evolution rapidly selects morphologies that learn faster, thereby enabling behaviors learned late in the lifetime of early ancestors to be expressed early in the descendants' lifetime. Third, we suggest a mechanistic basis for the above relationships through the evolution of morphologies that are more physically stable and energy efficient, and that can therefore facilitate learning and control.
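As a loose, toy illustration of the interplay this abstract describes, the outer loop below scores each "morphology" by the reward its controller reaches within a fixed inner learning budget, so selection implicitly favors fast learners. Everything here, including the scalar stand-in for a morphology, is hypothetical; it is not the DERL framework itself.

```python
import random

def learn_controller(morphology, budget):
    """Stand-in inner loop: reward accrues at a rate set by the
    morphology (here just a float 'learnability' in [0, 1])."""
    reward = 0.0
    for _ in range(budget):
        reward += morphology * random.random() * 1e-3
    return reward

def evolve(pop_size=16, generations=20):
    """Toy outer loop: fitness is reward reached within a fixed
    learning budget, so evolution favors morphologies that learn
    fast (the 'morphological Baldwin effect' in the abstract)."""
    population = [random.random() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda m: learn_controller(m, budget=200),
                        reverse=True)
        survivors = ranked[: pop_size // 2]
        # Mutate survivors to refill the population.
        children = [max(0.0, min(1.0, m + random.gauss(0, 0.05)))
                    for m in survivors]
        population = survivors + children
    return max(population)
```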
The integration of an Fe dopant and interfacial FeOOH into Ni-MOFs (Fe-doped-(Ni-MOFs)/FeOOH) to construct Fe−O−Ni−O−Fe bonding is demonstrated, and the origin of the remarkable electrocatalytic performance of Ni-MOFs is elucidated. X-ray absorption/photoelectron spectroscopy and theoretical calculation results indicate that Fe−O−Ni−O−Fe bonding can facilitate the distorted coordinated structure of the Ni site with a short nickel–oxygen bond and low coordination number, and can promote the redistribution of Ni/Fe charge density to efficiently regulate the adsorption behavior of key intermediates with a near-optimal d-band center. Here the Fe-doped-(Ni-MOFs)/FeOOH with interfacial Fe−O−Ni−O−Fe bonding shows superior catalytic performance for the oxygen evolution reaction (OER), with a low overpotential of 210 mV at 15 mA cm−2 and excellent stability with ≈3% attenuation after a 120 h cycle test. This study provides a novel strategy to design high-performance Ni/Fe-based electrocatalysts for the OER in alkaline media.
Iron doping and FeOOH decoration lead to interfacial Fe−O−Ni−O−Fe bonding in Fe-doped-(Ni-MOF)/FeOOH. This interfacial bonding can regulate the active Ni site to give the appropriate adsorption behavior of intermediates for the oxygen evolution reaction (OER). As a result, Fe-doped-(Ni-MOF)/FeOOH shows outstanding catalytic performance with low overpotential, small Tafel slope, and high durability.
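For context on the metrics quoted above: the overpotential η and the Tafel slope b are related through the standard Tafel relation from electrode kinetics (textbook form, not specific to this work), where j is the current density:

```latex
\eta = a + b \log_{10} j, \qquad b = \frac{2.303\,RT}{\alpha F}
```

A smaller b means less additional overpotential is needed per decade of current density, i.e. faster electrode kinetics; this is why a small Tafel slope is reported alongside the 210 mV overpotential as evidence of good OER activity.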
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggest a multiresolution, foveated architecture as a promising way of speeding up the training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
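For contrast with the temporal-connectivity variants studied above, the simplest treatment of time is late fusion: run a single-frame CNN on sampled frames and average the class scores. A minimal NumPy sketch follows; the array shapes are illustrative, with 487 matching the dataset's class count.

```python
import numpy as np

def late_fusion_predict(frame_scores):
    """Late fusion: average per-frame class scores over time.
    (The paper's early/slow fusion variants instead merge temporal
    information inside the network's convolutional layers.)

    frame_scores: (T, num_classes) softmax outputs per sampled frame.
    """
    return frame_scores.mean(axis=0)  # (num_classes,) video-level scores

# Example: 10 sampled frames, 487 classes.
scores = np.random.rand(10, 487)
scores /= scores.sum(axis=1, keepdims=True)  # normalize rows to sum to 1
video_pred = late_fusion_predict(scores).argmax()
```

The abstract's small gap between single-frame and spatio-temporal models (59.3% vs 60.9%) suggests why such a simple baseline is hard to beat on this dataset.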