We present a unified framework for understanding 3D hand and object interactions in raw image sequences from egocentric RGB cameras. Given a single RGB image, our model jointly estimates the 3D hand ...and object poses, models their interactions, and recognizes the object and action classes with a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection algorithms but rather is trained end-to-end on single images. We further merge and propagate information in the temporal domain to infer interactions between hand and object trajectories and recognize actions. The complete model takes as input a sequence of frames and outputs per-frame 3D hand and object pose predictions along with the estimates of object and action categories for the entire sequence. We demonstrate state-of-the-art performance of our algorithm even in comparison to the approaches that work on depth data and ground-truth annotations.
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual ...role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated ...hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
We cast the problem of motion segmentation of feature trajectories as linear manifold finding problems and propose a general framework for motion segmentation under affine projections which utilizes ...two properties of trajectory data: geometric constraint and locality. The geometric constraint states that the trajectories of the same motion lie in a low dimensional linear manifold and different motions result in different linear manifolds; locality, by which we mean in a transformed space a data and its neighbors tend to lie in the same linear manifold, provides a cue for efficient estimation of these manifolds. Our algorithm estimates a number of linear manifolds, whose dimensions are unknown beforehand, and segment the trajectories accordingly. It first transforms and normalizes the trajectories; secondly, for each trajectory it estimates a local linear manifold through local sampling; then it derives the affinity matrix based on principal subspace angles between these estimated linear manifolds; at last, spectral clustering is applied to the matrix and gives the segmentation result. Our algorithm is general without restriction on the number of linear manifolds and without prior knowledge of the dimensions of the linear manifolds. We demonstrate in our experiments that it can segment a wide range of motions including independent, articulated, rigid, non-rigid, degenerate, non-degenerate or any combination of them. In some highly challenging cases where other state-of-the-art motion segmentation algorithms may fail, our algorithm gives expected results.
Accurate representation of soft transitions between image regions is essential for high-quality image editing and compositing. Current techniques for generating such representations depend heavily on ...interaction by a skilled visual artist, as creating such accurate object selections is a tedious task. In this work, we introduce
semantic soft segments
, a set of layers that correspond to semantically meaningful regions in an image with accurate soft transitions between different objects. We approach this problem from a spectral segmentation angle and propose a graph structure that embeds texture and color features from the image as well as higher-level semantic information generated by a neural network. The soft segments are generated via eigendecomposition of the carefully constructed Laplacian matrix fully automatically. We demonstrate that otherwise complex image editing tasks can be done with little effort using semantic soft segments.
Visual localization is the task of accurate camera pose estimation in a known scene. It is a key problem in computer vision and robotics, with applications including self-driving cars, ...Structure-from-Motion, SLAM, and Mixed Reality. Traditionally, the localization problem has been tackled using 3D geometry. Recently, end-to-end approaches based on convolutional neural networks have become popular. These methods learn to directly regress the camera pose from an input image. However, they do not achieve the same level of pose accuracy as 3D structure-based methods. To understand this behavior, we develop a theoretical model for camera pose regression. We use our model to predict failure cases for pose regression techniques and verify our predictions through experiments. We furthermore use our model to show that pose regression is more closely related to pose approximation via image retrieval than to accurate pose estimation via 3D structure. A key result is that current approaches do not consistently outperform a handcrafted image retrieval baseline. This clearly shows that additional research is needed before pose regression algorithms are ready to compete with structure-based methods.
Traditional fluid simulations require large computational resources even for an average sized scene with the main bottleneck being a very small time step size, required to guarantee the stability of ...the solution. Despite a large progress in parallel computing and efficient algorithms for pressure computation in the recent years, realtime fluid simulations have been possible only under very restricted conditions. In this paper we propose a novel machine learning based approach, that formulates physics-based fluid simulation as a regression problem, estimating the acceleration of every particle for each frame. We designed a feature vector, directly modelling individual forces and constraints from the Navier-Stokes equations, giving the method strong generalization properties to reliably predict positions and velocities of particles in a large time step setting on yet unseen test videos. We used a regression forest to approximate the behaviour of particles observed in the large training set of simulations obtained using a traditional solver. Our GPU implementation led to a speed-up of one to three orders of magnitude compared to the state-of-the-art position-based fluid solver and runs in real-time for systems with up to 2 million particles.
In this paper, we propose a deep learning architecture that produces accurate dense depth for the outdoor scene from a single color image and a sparse depth. Inspired by the indoor depth completion, ...our network estimates surface normals as the intermediate representation to produce dense depth, and can be trained end-to-end. With a modified encoder-decoder structure, our network effectively fuses the dense color image and the sparse LiDAR depth. To address outdoor specific challenges, our network predicts a confidence mask to handle mixed LiDAR signals near foreground boundaries due to occlusion, and combines estimates from the color image and surface normals with learned attention maps to improve the depth accuracy especially for distant areas. Extensive experiments demonstrate that our model improves upon the state-of-the-art performance on KITTI depth completion benchmark. Ablation study shows the positive impact of each model components to the final performance, and comprehensive analysis shows that our model generalizes well to the input with higher sparsity or from indoor scenes.
Cameras are a crucial exteroceptive sensor for self-driving cars as they are low-cost and small, provide appearance information about the environment, and work in various weather conditions. They can ...be used for multiple purposes such as visual navigation and obstacle detection. We can use a surround multi-camera system to cover the full 360-degree field-of-view around the car. In this way, we avoid blind spots which can otherwise lead to accidents. To minimize the number of cameras needed for surround perception, we utilize fisheye cameras. Consequently, standard vision pipelines for 3D mapping, visual localization, obstacle detection, etc. need to be adapted to take full advantage of the availability of multiple cameras rather than treat each camera individually. In addition, processing of fisheye images has to be supported. In this paper, we describe the camera calibration and subsequent processing pipeline for multi-fisheye-camera systems developed as part of the V-Charge project. This project seeks to enable automated valet parking for self-driving cars. Our pipeline is able to precisely calibrate multi-camera systems, build sparse 3D maps for visual navigation, visually localize the car with respect to these maps, generate accurate dense maps, as well as detect obstacles based on real-time depth map extraction.
•3D visual perception with a multi-camera system•Fisheye cameras for autonomous driving tasks•Pipeline overview for calibration, mapping, localization and obstacle detection•Novel algorithms that directly work on fisheye images