Visual7W: Grounded Question Answering in Images
Zhu, Yuke; Groth, Oliver; Bernstein, Michael ...
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2016-June
Conference Proceeding
Open Access
We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacities for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers, in practice, relate to local regions in the images. We establish a semantic link between textual descriptions and image regions by object-level grounding. It enables a new type of QA with visual answers, in addition to the textual answers used in previous work. We study visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
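The spatial-attention idea can be sketched in a few lines: a question representation (e.g. an LSTM hidden state) scores each spatial cell of a convolutional feature map, and a softmax over those scores yields a weighted visual context. This is a minimal numpy illustration of the mechanism, not the paper's exact architecture; the function name and dimensions are ours.

```python
import numpy as np

def spatial_attention(features, query):
    """Attend over a grid of image-region features with a query vector.

    features: (H*W, D) conv feature map flattened over spatial positions
    query:    (D,) e.g. an LSTM hidden state encoding the question so far
    Returns the attention-weighted context vector and the weights.
    """
    scores = features @ query                # (H*W,) relevance of each region
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ features             # (D,) weighted sum of region features
    return context, weights

# Toy example: 4 spatial regions with 3-dim features
feats = np.array([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.],
                  [1., 1., 0.]])
ctx, w = spatial_attention(feats, np.array([2., 0., 0.]))
```

Regions whose features align with the query receive higher weight, so the context vector is dominated by the image regions relevant to the question.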
Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model objects and their relationships using scene graphs, a visually grounded graphical structure of an image. We propose a novel end-to-end model that generates such a structured scene representation from an input image. Our key insight is that the graph generation problem can be formulated as message passing between the primal node graph and its dual edge graph. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods on the Visual Genome dataset and supports relation inference on the NYU Depth v2 dataset.
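The node/edge message-passing scheme can be sketched as alternating updates: each node pools the states of its incident edges, and each edge pools the states of its two endpoints. The following is a simplified numpy sketch of that alternation under our own naming and a plain mean-pooling update, not the paper's learned GRU-based model.

```python
import numpy as np

def message_pass(node_h, edge_h, edges, steps=2):
    """Alternate messages between object (node) states and relationship
    (edge) states on a scene graph.

    node_h: (N, D) node hidden states
    edge_h: (E, D) edge hidden states
    edges:  list of (subject_idx, object_idx) pairs, one per edge
    """
    node_h, edge_h = node_h.copy(), edge_h.copy()
    for _ in range(steps):
        # edge -> node: each node averages the states of its incident edges
        msg_n = np.zeros_like(node_h)
        deg = np.zeros(len(node_h))
        for k, (s, o) in enumerate(edges):
            msg_n[s] += edge_h[k]; msg_n[o] += edge_h[k]
            deg[s] += 1; deg[o] += 1
        node_h = np.tanh(node_h + msg_n / np.maximum(deg, 1)[:, None])
        # node -> edge: each edge averages its endpoint states
        msg_e = np.stack([(node_h[s] + node_h[o]) / 2 for s, o in edges])
        edge_h = np.tanh(edge_h + msg_e)
    return node_h, edge_h

# Toy graph: 3 nodes, edges 0-1 and 1-2
node_h = np.zeros((3, 4))
edge_h = np.ones((2, 4))
nodes_out, edges_out = message_pass(node_h, edge_h, [(0, 1), (1, 2)])
```

After a few rounds, each node's state reflects its relationships and each edge's state reflects its endpoint objects, which is the contextual coupling the joint inference exploits.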
Four high-entropy perovskite (HEP) RETa₃O₉ samples were fabricated via a spark plasma sintering (SPS) method, and the corresponding thermophysical properties and underlying mechanisms were investigated for environmental/thermal barrier coating (E/TBC) applications. The prepared samples maintained low thermal conductivity (1.50 W·m⁻¹·K⁻¹), high hardness (10 GPa), and an appropriate Young's modulus (180 GPa), while the fracture toughness increased to 2.5 MPa·m^(1/2). Nanoindentation results showed that the HEP ceramics had excellent mechanical properties and good compositional homogeneity. We analysed the influence of different parameters of the A-site atoms (the disorder parameters of electronegativity, ionic radius, and atomic mass, as well as the tolerance factor) on the thermal conductivity. Enhanced thermal expansion coefficients, combined with a high melting point and extraordinary phase stability, expanded the applications of the HEP RETa₃O₉ ceramics. The results of this study have motivated a follow-up study on tantalate high-entropy ceramics with desirable properties.
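The A-site descriptors named above have standard closed forms: the Goldschmidt tolerance factor t = (r_A + r_O) / (√2 · (r_B + r_O)) for an ABO₃-type perovskite, and a disorder parameter of the form δ = √(Σ cᵢ (1 − rᵢ/r̄)²) over the mixed A-site species (the same form applies to electronegativity and atomic mass). The sketch below uses these textbook formulas with placeholder ionic radii; the function names, the radii, and the default O²⁻ radius of 1.40 Å are our illustrative choices, not values from the paper.

```python
import math

def tolerance_factor(r_a, r_b, r_o=1.40):
    """Goldschmidt tolerance factor for an ABO3-type perovskite:
    t = (r_A + r_O) / (sqrt(2) * (r_B + r_O)), radii in Angstrom."""
    return (r_a + r_o) / (math.sqrt(2) * (r_b + r_o))

def size_disorder(fractions, radii):
    """A-site size-disorder parameter:
    delta = sqrt(sum_i c_i * (1 - r_i / r_mean)^2)."""
    r_mean = sum(c * r for c, r in zip(fractions, radii))
    return math.sqrt(sum(c * (1 - r / r_mean) ** 2
                         for c, r in zip(fractions, radii)))

# Illustrative equimolar mix of four rare-earth cations on the A site
# (placeholder radii, not taken from the paper)
c = [0.25, 0.25, 0.25, 0.25]
r = [1.16, 1.13, 1.11, 1.08]
delta = size_disorder(c, r)
```

A single-species A site gives δ = 0; larger spreads in radius (or electronegativity, or mass) give larger δ, which is the axis along which such studies correlate disorder with reduced thermal conductivity.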
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about, our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained on the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked “What vehicle is the person riding?”, computers need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that “the person is riding a horse-drawn carriage.” In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images, where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.
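Relationship annotations of this kind are (subject, predicate, object) triples with the nominal terms mapped to WordNet synsets. A minimal sketch of how such canonicalized triples might be stored and queried; the class name and the specific synset identifiers are our illustrative choices:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    """A pairwise relationship triple, with subject and object
    canonicalized to WordNet synset names."""
    subject: str    # e.g. "man.n.01"
    predicate: str  # e.g. "riding"
    object: str     # e.g. "carriage.n.02"

# The two relationships from the carriage example
rels = [
    Relationship("man.n.01", "riding", "carriage.n.02"),
    Relationship("horse.n.01", "pulling", "carriage.n.02"),
]

def objects_in(relationships):
    """All distinct canonical objects mentioned across the triples."""
    return sorted({r.subject for r in relationships}
                  | {r.object for r in relationships})
```

Canonicalizing to synsets is what lets “man”, “person”, and “guy” resolve to a shared concept, so relationship statistics aggregate correctly across annotators.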
Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. It is nontrivial to manually design a robot controller that combines these modalities, which have very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to train directly on real robots due to sample complexity. In this article, we use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. Evaluating our method on a peg insertion task, we show that it generalizes over varying geometries, configurations, and clearances, while being robust to external perturbations. We also systematically study different self-supervised learning objectives and representation learning architectures. Results are presented in simulation and on a physical robot.
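The core pattern is to encode each modality separately and fuse the embeddings into one compact state, with a self-supervised signal (such as whether the visual and haptic streams are time-aligned) shaping the encoders. This is a toy linear sketch of that pattern under our own names and dimensions; the paper's actual encoders are deep networks and its objectives are richer.

```python
import numpy as np

rng = np.random.default_rng(0)
W_v = 0.1 * rng.normal(size=(8, 16))   # toy linear visual encoder
W_h = 0.1 * rng.normal(size=(8, 4))    # toy linear haptic/force encoder

def fuse(visual, haptic):
    """Encode each modality separately, then concatenate into one
    compact multimodal state for a downstream policy."""
    z_v = np.tanh(W_v @ visual)
    z_h = np.tanh(W_h @ haptic)
    return np.concatenate([z_v, z_h])  # (16,) multimodal representation

def alignment_score(visual, haptic):
    """Toy self-supervised signal: cosine similarity between modality
    embeddings, trainable to detect whether streams are time-aligned."""
    z_v = np.tanh(W_v @ visual)
    z_h = np.tanh(W_h @ haptic)
    return float(z_v @ z_h / (np.linalg.norm(z_v) * np.linalg.norm(z_h)))

z = fuse(rng.normal(size=16), rng.normal(size=4))
```

Because the alignment label is free (it comes from timestamps, not human annotation), the representation can be pretrained before any policy learning, which is where the sample-efficiency gain comes from.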
Tool manipulation is vital for enabling robots to complete challenging task goals. It requires reasoning about the desired effect of the task and, thus, properly grasping and manipulating the tool to achieve that effect. Most work in robotics has focused on task-agnostic grasping, which optimizes only for grasp robustness without considering the subsequent manipulation tasks. In this article, we propose the Task-Oriented Grasping Network (TOG-Net) to jointly optimize both task-oriented grasping of a tool and the manipulation policy for that tool. The training process of the model is based on large-scale simulated self-supervision with procedurally generated tool objects. We perform both simulated and real-world experiments on two tool-based manipulation tasks: sweeping and hammering. Our model achieves an overall 71.1% task success rate for sweeping and an 80.0% task success rate for hammering.
The application of thermoelectric technology is hindered by low efficiencies and high costs, demonstrating a strong demand for high-performance thermoelectric materials composed of low-cost, earth-abundant elements. PbS-based materials have attracted much attention for thermoelectric power generation due to their low cost and earth abundance. However, the high lattice thermal conductivities and low electron mobilities of these materials limit their thermoelectric performance. Here, we show that we can largely reduce the lattice thermal conductivity of an n-type PbS-based material to 0.4 W·m⁻¹·K⁻¹ by introducing zigzag nanoprecipitates with a uniform width of around 1 nm. The electron mobility was also successfully improved by reducing the effective mass through Se alloying. Finally, an extraordinary figure of merit of 1.7 at 900 K was realized in an n-type Pb0.93Sb0.05S0.5Se0.5 sample. A thermoelectric power generation module was fabricated with this n-type PbS material and our home-made high-performance p-type PbTe. It demonstrated a high conversion efficiency of 8.0% at a temperature difference of 565 K. Furthermore, a segmented module consisting of n-/p-type Bi₂Te₃ and n-PbS/p-PbTe was fabricated, which exhibited a high conversion efficiency of 11.2% at a temperature difference of 585 K. This efficiency matches those of reported PbTe-based modules, and it was realized at a much lower cost. As a result, low-cost, high-performance n-type PbS-based materials are a promising PbTe alternative that will promote the extensive commercial application of thermoelectric power generation.
A high conversion efficiency of 11.2% was realized in a low-cost PbS-based segmented thermoelectric module.
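The link between the figure of merit and conversion efficiency follows the standard single-stage estimate η_max = (ΔT/T_h) · (√(1 + ZT̄) − 1) / (√(1 + ZT̄) + T_c/T_h), where ZT̄ is the device figure of merit averaged over the leg's temperature span. The sketch below implements that textbook formula only; the reported 8.0% and 11.2% values come from measured modules, and real segmented-module analysis involves contact losses and per-segment averaging that this ignores.

```python
import math

def max_efficiency(t_hot, t_cold, zt_avg):
    """Peak conversion efficiency of a single-stage thermoelectric
    generator: the Carnot factor times a ZT-dependent reduction.
    zt_avg: device figure of merit averaged over the temperature span."""
    carnot = (t_hot - t_cold) / t_hot
    root = math.sqrt(1.0 + zt_avg)
    return carnot * (root - 1.0) / (root + t_cold / t_hot)

# Illustrative call for a 565 K temperature difference with a 300 K cold side
eta = max_efficiency(865.0, 300.0, 1.0)
```

The formula makes the two levers explicit: efficiency rises with the Carnot factor ΔT/T_h and, sublinearly, with the averaged ZT, which is why a peak zT of 1.7 at 900 K translates to single-digit device efficiency once the average over the span is taken.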
A crucial capability of real-world intelligent agents is their ability to plan a sequence of actions to achieve their goals in the visual world. In this work, we address the problem of visual semantic planning: the task of predicting a sequence of actions from visual observations that transform a dynamic environment from an initial state to a goal state. Doing so entails knowledge about objects and their affordances, as well as actions and their preconditions and effects. We propose learning these through interacting with a visual and dynamic environment. Our proposed solution involves bootstrapping reinforcement learning with imitation learning. To ensure cross-task generalization, we develop a deep predictive model based on successor representations. Our experimental results show near-optimal performance across a wide range of tasks in the challenging THOR environment.
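A successor representation predicts the expected discounted sum of future state features under the current policy, ψ(s) = E[Σ_t γ^t φ(s_t) | s_0 = s], and can be learned with a TD update ψ(s) ← ψ(s) + α(φ(s) + γψ(s') − ψ(s)). The following is a tabular toy of that update on a three-state chain, not the paper's deep predictive model; the setup and names are ours.

```python
import numpy as np

def sr_td_update(psi, s, s_next, phi, gamma=0.95, alpha=0.1):
    """One TD update of the successor representation: psi[s] estimates
    the expected discounted sum of future features phi under the policy."""
    target = phi[s] + gamma * psi[s_next]
    psi[s] += alpha * (target - psi[s])
    return psi

# Deterministic 3-state chain 0 -> 1 -> 2 -> 2, one-hot state features
phi = np.eye(3)
psi = np.zeros((3, 3))
for _ in range(200):
    for s, s2 in [(0, 1), (1, 2), (2, 2)]:
        sr_td_update(psi, s, s2, phi)
```

Because ψ factors the environment dynamics out of the reward, a policy's value under any new task reward w is just ψ(s)·w, which is the property that supports cross-task generalization.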
Our goal is to generate a policy to complete an unseen task given just a single video demonstration of the task in a given domain. We hypothesize that to successfully generalize to unseen complex tasks from a single video demonstration, it is necessary to explicitly incorporate the compositional structure of the tasks into the model. To this end, we propose Neural Task Graph (NTG) Networks, which use a conjugate task graph as the intermediate representation to modularize both the video demonstration and the derived policy. We empirically show that NTG achieves inter-task generalization on two complex tasks: Block Stacking in BulletPhysics and Object Collection in AI2-THOR. NTG improves data efficiency with visual input and achieves strong generalization without the need for dense hierarchical supervision. We further show that similar performance trends hold when applied to real-world data. We show that NTG can effectively predict task structure on the JIGSAWS surgical dataset and generalize to unseen tasks.
Manipulating volumetric deformable objects in the real world, like plush toys and pizza dough, brings substantial challenges due to infinite shape variations, non-rigid motions, and partial observability. We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects based on structured implicit neural representations. ACID integrates two new techniques: implicit representations for action-conditional dynamics and geodesics-based contrastive learning. To represent deformable dynamics from partial RGB-D observations, we learn implicit representations of occupancy and flow-based forward dynamics. To accurately identify state change under large non-rigid deformations, we learn a correspondence embedding field through a novel geodesics-based contrastive loss. To evaluate our approach, we develop a simulation framework for manipulating complex deformable shapes in realistic scenes and a benchmark containing over 17,000 action trajectories with six types of plush toys and 78 variants. Our model achieves the best performance in geometry, correspondence, and dynamics predictions over existing approaches. The ACID dynamics models are successfully employed for goal-conditioned deformable manipulation tasks, resulting in a 30% increase in task success rate over the strongest baseline. Furthermore, we apply the simulation-trained ACID model directly to real-world objects and show success in manipulating them into target configurations. https://b0ku1.github.io/acid/
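A geodesics-based contrastive loss of the kind described above treats surface points that are close in geodesic distance as positives and all others as negatives, in an InfoNCE-style objective over per-point embeddings. This is a numpy sketch of that idea under our own naming and thresholding; the paper's loss and embedding field differ in detail.

```python
import numpy as np

def geodesic_contrastive_loss(emb, geo_dist, tau=0.1, pos_thresh=0.05):
    """InfoNCE-style loss over per-point embeddings: points that are
    geodesically close on the surface should embed nearby, far points
    should be pushed apart.

    emb:      (N, D) per-point embeddings
    geo_dist: (N, N) pairwise geodesic distances on the surface
    """
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / tau                   # scaled cosine similarities
    losses = []
    for i in range(len(emb)):
        pos = (geo_dist[i] < pos_thresh) & (np.arange(len(emb)) != i)
        if not pos.any():
            continue
        logits = sim[i].copy()
        logits[i] = -np.inf                   # exclude self-similarity
        log_denom = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
        losses.append(-(logits[pos] - log_denom).mean())
    return float(np.mean(losses))
```

Using geodesic rather than Euclidean distance is the key design choice: under a large bend, two points on opposite flaps can be Euclidean-close yet far along the surface, and only the geodesic criterion keeps them negatives.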