We tackle the problem of large scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph. We present the following three principal contributions. First, we develop a convolutional neural network (CNN) architecture that is trainable in an end-to-end manner directly for the place recognition task. The main component of this architecture, NetVLAD, is a new generalized VLAD layer, inspired by the "Vector of Locally Aggregated Descriptors" image representation commonly used in image retrieval. The layer is readily pluggable into any CNN architecture and amenable to training via backpropagation. Second, we develop a training procedure, based on a new weakly supervised ranking loss, to learn parameters of the architecture in an end-to-end manner from images depicting the same places over time downloaded from Google Street View Time Machine. Finally, we show that the proposed architecture significantly outperforms non-learnt image representations and off-the-shelf CNN descriptors on two challenging place recognition benchmarks, and improves over current state-of-the-art compact image representations on standard image retrieval benchmarks.
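The aggregation the abstract describes can be illustrated with a minimal, dependency-free sketch. Note the assumptions: descriptors and cluster centres are plain Python lists, and the soft-assignment weights here come from a softmax over negative squared distances to the centres, whereas the actual NetVLAD layer learns the assignment via a trainable convolution.

```python
import math

def netvlad_aggregate(descriptors, centers, alpha=1.0):
    """VLAD with soft assignment: V[k] = sum_i a_k(x_i) * (x_i - c_k).

    Here the soft weights a_k are a softmax over -alpha * ||x_i - c_k||^2;
    the real NetVLAD layer learns the assignment with a 1x1 convolution.
    """
    K, D = len(centers), len(centers[0])
    V = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        # squared distances from descriptor x to every cluster centre
        d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        e = [math.exp(-alpha * di) for di in d2]
        s = sum(e)
        a = [ei / s for ei in e]  # soft-assignment weights, sum to 1
        for k, c in enumerate(centers):
            for j in range(D):
                V[k][j] += a[k] * (x[j] - c[j])  # aggregate residuals
    # flatten and L2-normalize, giving the final global descriptor
    flat = [v for row in V for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]
```

The output is a K×D vector suitable for nearest-neighbour search against a database of place descriptors.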
In this paper we consider the motion segmentation problem on sparse and unstructured datasets involving rigid motions, motivated by multibody structure from motion. In particular, we assume only two-frame correspondences as input without prior knowledge about trajectories. Inspired by the success of synchronization methods, we address this problem by introducing a two-stage approach: first, motion segmentation is addressed on image pairs independently; then, two-frame results are combined in a robust way to compute the final multi-frame segmentation. Our synthetic and real experiments demonstrate that the proposed approach is very effective in reducing the errors among two-frame results and it can cope with a large amount of mismatches. Moreover, our method can be profitably used to build a multibody structure from motion pipeline.
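The abstract's second stage (robustly combining independent two-frame results) can be sketched in its simplest form: since each pairwise segmentation uses arbitrary label names, labels must first be aligned before voting. This is only an illustrative stand-in, not the paper's synchronization formulation, and all names are hypothetical.

```python
from collections import Counter

def align_labels(reference, candidate):
    """Relabel `candidate` (a dict: point id -> motion label) so its labels
    agree with `reference` wherever both label the same point, using greedy
    overlap matching. A simple stand-in for label synchronization."""
    overlap = Counter((candidate[p], reference[p])
                      for p in candidate if p in reference)
    mapping = {}
    for (c_lab, r_lab), _ in overlap.most_common():
        if c_lab not in mapping and r_lab not in mapping.values():
            mapping[c_lab] = r_lab
    return {p: mapping.get(lab, lab) for p, lab in candidate.items()}

def merge_segmentations(pairwise):
    """Combine independent two-frame segmentations into one multi-frame
    labeling: align each to the first, then majority-vote per point, which
    suppresses isolated two-frame errors."""
    votes = {}
    reference = dict(pairwise[0])
    for seg in pairwise:
        aligned = align_labels(reference, seg)
        for p, lab in aligned.items():
            votes.setdefault(p, Counter())[lab] += 1
    return {p: c.most_common(1)[0][0] for p, c in votes.items()}
```

Majority voting is the crudest robust combiner; the appeal of synchronization-style methods is precisely that they enforce global consistency rather than trusting a single reference pair.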
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: it is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
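The "detect later" idea above can be sketched with the hard version of the detection criterion: a pixel is a keypoint if some channel of the dense feature map is both the strongest channel at that pixel and a spatial local maximum within that channel. This is an illustrative simplification (the trained model uses a soft, differentiable score), and the feature map here is just nested lists.

```python
def detect_keypoints(fmap):
    """Detect-later sketch over a dense feature map fmap[c][y][x]:
    mark (y, x) as a keypoint if some channel c is (i) the depth-wise
    maximum at that pixel and (ii) a strict local spatial maximum in
    its own channel. Border pixels are skipped for simplicity."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    keypoints = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            c = max(range(C), key=lambda ch: fmap[ch][y][x])  # depth-wise max
            v = fmap[c][y][x]
            neigh = [fmap[c][y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dy, dx) != (0, 0)]
            if v > max(neigh):  # strict spatial maximum
                keypoints.append((y, x))
    return keypoints
```

Because detection operates on deep features rather than raw intensities, the same criterion can fire consistently across day/night or viewpoint changes that destroy low-level corner structure.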
We seek to predict the 6 degree-of-freedom (6DoF) pose of a query photograph with respect to a large indoor 3D map. The contributions of this work are three-fold. First, we develop a new large-scale visual localization method targeted for indoor environments. The method proceeds along three steps: (i) efficient retrieval of candidate poses that ensures scalability to large-scale environments, (ii) pose estimation using dense matching rather than local features to deal with textureless indoor scenes, and (iii) pose verification by virtual view synthesis to cope with significant changes in viewpoint, scene layout, and occluders. Second, we collect a new dataset with reference 6DoF poses for large-scale indoor localization. Query photographs are captured by mobile phones at a different time than the reference 3D map, thus presenting a realistic indoor localization scenario. Third, we demonstrate that our method significantly outperforms current state-of-the-art indoor localization approaches on this new challenging data.
Globally Optimal Hand-Eye Calibration Using Branch-and-Bound
Heller, Jan; Havlena, Michal; Pajdla, Tomas
IEEE Transactions on Pattern Analysis and Machine Intelligence, May 2016, Volume 38, Issue 5
Journal Article, Peer Reviewed
This paper introduces a novel solution to the hand-eye calibration problem. It uses camera measurements directly and, at the same time, requires neither prior knowledge of the external camera calibrations nor a known calibration target. Our algorithm uses a branch-and-bound approach to minimize an objective function based on the epipolar constraint. Further, it employs Linear Programming to decide the bounding step of the algorithm. Our technique is able to recover both the unknown rotation and translation simultaneously, and the solution is guaranteed to be globally optimal with respect to the L∞-norm.
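The branch-and-bound and Linear Programming machinery is beyond a short sketch, but the building block of such an objective, the algebraic epipolar residual, is simple to state. The following is a minimal illustration only; the essential matrix and correspondences are hypothetical inputs, and the paper's actual objective is parameterized by the unknown hand-eye transformation.

```python
def epipolar_residuals(E, correspondences):
    """Algebraic epipolar residuals |x2^T E x1| for pairs of homogeneous
    image points (x1, x2). E is a 3x3 matrix as a list of lists. An
    L-infinity objective like the one in the abstract takes the maximum
    residual over all correspondences."""
    res = []
    for x1, x2 in correspondences:
        Ex1 = [sum(E[i][j] * x1[j] for j in range(3)) for i in range(3)]
        res.append(abs(sum(x2[i] * Ex1[i] for i in range(3))))
    return res

def linf_cost(E, correspondences):
    # The quantity a globally optimal method would minimize over the
    # unknown calibration parameters (which determine E).
    return max(epipolar_residuals(E, correspondences))
```

Minimizing the maximum residual (rather than a sum of squares) is what makes L∞ formulations amenable to global optimality certificates.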
We propose a novel method for the multi-view reconstruction problem. Surfaces which do not have direct support in the input 3D point cloud, and hence need not be photo-consistent but represent real parts of the scene (e.g. low-textured walls, windows, cars), are important for achieving complete reconstructions. We augment the existing Labatut CGF 2009 method with the ability to cope with these difficult surfaces simply by changing the t-edge weights in the construction of surfaces by a minimal s-t cut. Our method uses the Visual-Hull to reconstruct the difficult surfaces which are not sampled densely enough by the input 3D point cloud. We demonstrate the importance of these surfaces on several real-world datasets. We compare our improvement to our implementation of the Labatut CGF 2009 method and show that our method reconstructs difficult surfaces considerably better while preserving thin structures and details at the same quality and computational cost.
The aim of this work is to localize a query photograph by finding other images depicting the same place in a large geotagged image database. This is a challenging task due to changes in viewpoint, imaging conditions and the large size of the image database. The contribution of this work is two-fold. First, we cast the place recognition problem as a classification task and use the available geotags to train a classifier for each location in the database in a similar manner to per-exemplar SVMs in object recognition. Second, as only one or a few positive training examples are available for each location, we propose two methods to calibrate all the per-location SVM classifiers without the need for additional positive training data. The first method relies on p-values from statistical hypothesis testing and uses only the available negative training data. The second method performs an affine calibration by appropriately normalizing the learnt classifier hyperplane and does not need any additional labelled training data. We test the proposed place recognition method with the bag-of-visual-words and Fisher vector image representations suitable for large scale indexing. Experiments are performed on three datasets: 25,000 and 55,000 geotagged street view images of Pittsburgh, and the 24/7 Tokyo benchmark containing 76,000 images with varying illumination conditions. The results show improved place recognition accuracy of the learnt image representation over direct matching of raw image descriptors.
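The first calibration method above admits a very compact empirical sketch: the p-value of a classifier score is simply the fraction of negative-example scores at least as large, which puts all per-location classifiers on a common scale without any positive data. This is the simplest empirical reading of the hypothesis-testing idea, with hypothetical inputs; the paper's exact formulation may differ.

```python
from bisect import bisect_left

def make_pvalue_calibrator(negative_scores):
    """Non-parametric calibration from negatives only: the p-value of a
    classifier score s is the fraction of negative-example scores >= s.
    Lower p-values mean the score is more unusual under the negative
    (wrong-location) distribution, so locations become comparable."""
    neg = sorted(negative_scores)
    n = len(neg)

    def pvalue(s):
        # count of negatives with score >= s, via binary search
        return (n - bisect_left(neg, s)) / n

    return pvalue
```

Each per-location SVM would get its own calibrator built from that classifier's scores on the shared pool of negative images, and retrieval then ranks locations by ascending p-value.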
Selecting visually overlapping image pairs without any prior information is an essential task of large-scale structure from motion (SfM) pipelines. To address this problem, many state-of-the-art image retrieval systems adopt the idea of bag of visual words (BoVW) for computing image-pair similarity. In this paper, we present a method for improving the image pair selection using BoVW. Our method combines a conventional vector-based approach and a set-based approach. For the set similarity, we introduce a modified version of the Simpson (m-Simpson) coefficient. We show the advantage of this measure over three typical set similarity measures and demonstrate that the combination of vector similarity and the m-Simpson coefficient effectively reduces false positives and increases accuracy. To discuss the choice of vocabulary construction, we prepared both a sampled vocabulary on an evaluation dataset and a basic pre-trained vocabulary on a training dataset. In addition, we tested our method on vocabularies of different sizes. Our experimental results show that the proposed method dramatically improves precision scores especially on the sampled vocabulary and performs better than the state-of-the-art methods that use pre-trained vocabularies. We further introduce a method to determine the k value of top-k relevant searches for each image and show that it obtains higher precision at the same recall.
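For reference, the standard Simpson (overlap) coefficient that the m-Simpson measure modifies is shown below, together with a hypothetical combination with a vector similarity in the spirit of the abstract. The specific m-Simpson modification and the combination weights are the paper's own and are not reproduced here.

```python
def simpson(a, b):
    """Simpson (overlap) coefficient between two sets of visual-word ids:
    |A intersect B| / min(|A|, |B|). The paper's m-Simpson coefficient is
    a modified form of this measure (modification not reproduced here)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def pair_similarity(vec_sim, words_a, words_b, weight=0.5):
    """Hypothetical blend of a vector-based score (e.g. BoVW cosine
    similarity) with the set-based score, illustrating how combining the
    two views can down-weight pairs that look similar under only one."""
    return weight * vec_sim + (1 - weight) * simpson(words_a, words_b)
```

The min in the denominator makes the Simpson coefficient insensitive to large differences in set size, which is why it behaves differently from Jaccard-style measures when one image has far more visual words than the other.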
Central catadioptric cameras are cameras which combine lenses and mirrors to capture a very wide field of view with a central projection. In this paper we extend the classical epipolar geometry of perspective cameras to all central catadioptric cameras. Epipolar geometry is formulated as the geometry of corresponding rays in a three-dimensional space. Using the model of image formation of central catadioptric cameras, the constraint on corresponding image points is then derived. It is shown that the corresponding points lie on epipolar conics. In addition, the shape of the conics for all types of central catadioptric cameras is classified. Finally, the theory is verified by experiments with real central catadioptric cameras.
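The "geometry of corresponding rays" formulation can be sketched directly: two rays correspond only if the second ray, the baseline, and the rotated first ray are coplanar. The sketch below assumes the rays are already given; for a central catadioptric camera they would come from the mirror-aware back-projection of image points, which is where the epipolar conics arise.

```python
def cross(u, v):
    """Cross product of two 3-vectors given as lists."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def ray_epipolar_residual(r1, r2, R, t):
    """Epipolar constraint as ray coplanarity: rays r1 (camera 1) and r2
    (camera 2) correspond only if r2, the baseline t, and the rotated ray
    R r1 lie in one plane, i.e. r2 . (t x (R r1)) = 0. This is the
    perspective constraint r2^T [t]_x R r1 = 0 restated for rays, which
    is what makes it applicable to any central projection."""
    Rr1 = [sum(R[i][j] * r1[j] for j in range(3)) for i in range(3)]
    n = cross(t, Rr1)
    return sum(r2[i] * n[i] for i in range(3))
```

In the perspective case the zero set of this residual projects to epipolar lines; under a central catadioptric projection the same zero set maps to the epipolar conics the paper classifies.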