The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with five diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach, we use zero-shot cross-dataset transfer, i.e., we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.
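As an illustration of what an objective "invariant to changes in depth range and scale" can look like, the following is a minimal sketch and not necessarily the paper's exact loss. It assumes each depth map is normalized by its median (shift) and mean absolute deviation (scale) before a mean absolute error is taken; the function names `normalize_depth` and `scale_shift_invariant_loss` are hypothetical.

```python
import numpy as np

def normalize_depth(d):
    """Remove shift (median) and scale (mean absolute deviation) from a depth map.

    Assumption: the map is not constant, so the scale is non-zero.
    """
    shift = np.median(d)
    scale = np.mean(np.abs(d - shift))
    return (d - shift) / scale

def scale_shift_invariant_loss(pred, target):
    """Mean absolute error between normalized prediction and normalized target."""
    return float(np.mean(np.abs(normalize_depth(pred) - normalize_depth(target))))

# Illustrative data only: two random "depth maps".
rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 10.0, size=(4, 5))
target = rng.uniform(1.0, 10.0, size=(4, 5))
base_loss = scale_shift_invariant_loss(pred, target)
# Rescaling or shifting either input (with a positive scale) leaves the loss unchanged.
shifted_loss = scale_shift_invariant_loss(3.0 * pred + 5.0, 0.5 * target - 2.0)
```

Because both inputs are normalized, annotations that differ only by an unknown positive scale and offset produce the same loss, which is the property that makes mixing datasets with incompatible depth ranges possible.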
Learning Aerial Image Segmentation From Online Maps Kaiser, Pascal; Wegner, Jan Dirk; Lucchi, Aurelien ...
IEEE Transactions on Geoscience and Remote Sensing, Nov. 2017, Volume 55, Issue 11
Journal Article
Peer reviewed
Open access
This paper deals with semantic segmentation of high-resolution (aerial) images where a semantic class label is assigned to each pixel via supervised classification as a basis for automatic map generation. Recently, deep convolutional neural networks (CNNs) have shown impressive performance and have quickly become the de-facto standard for semantic segmentation, with the added benefit that task-specific feature design is no longer necessary. However, a major downside of deep learning methods is that they are extremely data hungry, thus aggravating the perennial bottleneck of supervised classification: obtaining enough annotated training data. On the other hand, it has been observed that they are rather robust against noise in the training labels. This opens up the intriguing possibility of avoiding the annotation of huge amounts of training data, and instead training the classifier from existing legacy data or crowd-sourced maps that can exhibit high levels of noise. The question addressed in this paper is: can training with large-scale publicly available labels replace a substantial part of the manual labeling effort and still achieve sufficient performance? Such data will inevitably contain a significant portion of errors, but in return virtually unlimited quantities of it are available in larger parts of the world. We adapt a state-of-the-art CNN architecture for semantic segmentation of buildings and roads in aerial images, and compare its performance when using different training data sets, ranging from manually labeled pixel-accurate ground truth of the same city to automatic training data derived from OpenStreetMap data from distant locations. Our results indicate that satisfactory performance can be obtained with significantly less manual annotation effort by exploiting noisy large-scale training data.
We propose a method to automatically register two point clouds acquired with a terrestrial laser scanner without placing any markers in the scene. What makes this task challenging are the strongly varying point densities caused by the line-of-sight measurement principle, and the huge amount of data. The first property leads to low point densities in potential overlap areas with scans taken from different viewpoints while the latter calls for highly efficient methods in terms of runtime and memory requirements.
A crucial yet largely unsolved step is the initial coarse alignment of two scans without any simplifying assumptions, that is, point clouds are given in arbitrary local coordinates and no knowledge about their relative orientation is available. Once coarse alignment has been solved, scans can easily be fine-registered with standard methods like least-squares surface or Iterative Closest Point matching. In order to drastically thin out the original point clouds while retaining characteristic features, we resort to extracting 3D keypoints. Such clouds of keypoints, which can be viewed as a sparse but nevertheless discriminative representation of the original scans, are then used as input to a very efficient matching method originally developed in computer graphics, called the 4-Points Congruent Sets (4PCS) algorithm. We adapt the 4PCS matching approach to better suit the characteristics of laser scans.
The resulting Keypoint-based 4-Points Congruent Sets (K-4PCS) method is extensively evaluated on challenging indoor and outdoor scans. Beyond the evaluation on real terrestrial laser scans, we also perform experiments with simulated indoor scenes, paying particular attention to the sensitivity of the approach with respect to highly symmetric scenes.
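The 4PCS search itself is not reproduced here. The sketch below only illustrates the step that any correspondence-based coarse alignment ends with: estimating a rigid transform from a handful of matched 3D points, using the standard SVD-based (Kabsch) solution. The function name `rigid_transform` and the synthetic points are assumptions for illustration, not part of the K-4PCS method as published.

```python
import numpy as np

def rigid_transform(src, dst):
    """Rotation R and translation t minimizing sum ||R @ src_i + t - dst_i||^2.

    src, dst: (N, 3) arrays of matched points, N >= 3 and not collinear.
    """
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_centroid).T @ (dst - dst_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection: force det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t

# Synthetic check: four matched keypoints related by a known rotation about z
# and a known translation (illustrative values only).
rng = np.random.default_rng(1)
src = rng.normal(size=(4, 3))
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst = src @ R_true.T + t_true
R_est, t_est = rigid_transform(src, dst)
```

With noise-free correspondences the estimate matches the true transform up to floating-point error; with real scans it would serve as the coarse initial guess that fine registration then refines.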
Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, they often provide the most objective measure of performance and are therefore important guides for research. We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data and create a framework for the standardized evaluation of multiple object tracking methods. The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community, with applications ranging from robot navigation to self-driving cars. This paper collects the first three releases of the benchmark: (i) MOT15, along with numerous state-of-the-art results that were submitted in the last years, (ii) MOT16, which contains new challenging videos, and (iii) MOT17, which extends MOT16 sequences with more precise labels and evaluates tracking performance on three different object detectors. The second and third releases not only offer a significant increase in the number of labeled boxes, but also provide labels for multiple object classes besides pedestrians, as well as the level of visibility for every single object of interest. We finally provide a categorization of state-of-the-art trackers and a broad error analysis. This will help newcomers understand the related work and research trends in the MOT community, and hopefully shed some light on potential future research directions.
Land use (LU) and land cover (LC) are two complementary pieces of cartographic information used for urban planning and environmental monitoring. In the context of New Caledonia, a biodiversity hotspot, the availability of up-to-date LULC maps is essential to monitor the impact of extreme events such as cyclones and human activities on the environment. With the democratization of satellite data and the development of high-performance deep learning techniques, it is possible to create these data automatically. This work aims at determining the best current deep learning configuration (pixel-wise vs. semantic labelling architectures, data augmentation, image preprocessing, …), to perform LULC mapping in a complex, subtropical environment. For this purpose, a specific data set based on SPOT6 satellite data was created and made available for the scientific community as an LULC benchmark in a tropical, complex environment using five representative areas of New Caledonia labelled by a human operator: four used as training sets, and the fifth as a test set. Several architectures were trained and the resulting classification was compared with a state-of-the-art machine learning technique: XGBoost. We also assessed the relevance of popular neo-channels derived from the raw observations in the context of deep learning. The deep learning approach showed comparable results to XGBoost for LC detection and outperformed it on the LU detection task (overall accuracy of 61.45% vs. 51.56%). Finally, adding the LC classification output of the dedicated deep learning architecture to the raw channels input significantly improved the overall accuracy of the deep learning LU classification task (overall accuracy of 63.61%). All the data used in this study are available online for the remote sensing community and for assessing other LULC detection techniques.
In this paper we present a framework for the automatic registration of multiple terrestrial laser scans. The proposed method can handle arbitrary point clouds with reasonable pairwise overlap, without knowledge about their initial orientation and without the need for artificial markers or other specific objects. The framework is divided into a coarse and a fine registration part, which each start with pairwise registration and then enforce consistent global alignment across all scans. While we put forward a complete, functional registration system, the novel contribution of the paper lies in the coarse global alignment step. Merging multiple scans into a consistent network creates loops along which the relative transformations must add up. We pose the task of finding a global alignment as picking the best candidates from a set of putative pairwise registrations, such that they satisfy the loop constraints. This yields a discrete optimization problem that can be solved efficiently with modern combinatorial methods. Having found a coarse global alignment in this way, the framework proceeds by pairwise refinement with standard ICP, followed by global refinement to evenly spread the residual errors.
The framework was tested on six challenging, real-world datasets. The discrete global alignment step effectively detects, removes and corrects failures of the pairwise registration procedure, finally producing a globally consistent coarse scan network which can be used as initial guess for the highly non-convex refinement. Our overall system reaches success rates close to 100% at acceptable runtimes <1h, even in challenging conditions such as scanning in the forest.
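The loop constraint described above can be made concrete with a small sketch: composing the pairwise transforms around a closed loop of scans should give the identity, and the deviation from identity measures how inconsistent a set of candidate registrations is. This is an illustration under simplifying assumptions (homogeneous 4x4 matrices, rotations about the z-axis only) and not the paper's discrete optimization; `make_transform` and `loop_residual` are hypothetical names.

```python
import numpy as np

def make_transform(angle_z, translation):
    """4x4 rigid transform: rotation by angle_z (radians) about z, then translation."""
    c, s = np.cos(angle_z), np.sin(angle_z)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def loop_residual(transforms):
    """Compose transforms around a closed loop and measure the deviation from identity.

    Returns (residual rotation angle in radians, residual translation norm).
    """
    T = np.eye(4)
    for T_ij in transforms:
        T = T @ T_ij
    cos_angle = np.clip((np.trace(T[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos_angle)), float(np.linalg.norm(T[:3, 3]))

# A consistent loop over three scans a -> b -> c -> a (illustrative values).
T_ab = make_transform(0.3, [1.0, 0.0, 0.0])
T_bc = make_transform(-0.5, [0.0, 2.0, 0.0])
T_ca = np.linalg.inv(T_ab @ T_bc)
rot_ok, trans_ok = loop_residual([T_ab, T_bc, T_ca])

# Replacing one edge with a wrong pairwise candidate breaks the loop.
T_bc_wrong = make_transform(0.4, [0.0, 2.0, 0.0])
rot_bad, trans_bad = loop_residual([T_ab, T_bc_wrong, T_ca])
```

A selection procedure in the spirit of the abstract would prefer candidate sets whose loops all have small residuals; how that choice is cast as a combinatorial problem is specific to the paper and not shown here.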
We address the problem of vision-based navigation in busy inner-city locations, using a stereo rig mounted on a mobile platform. In this scenario semantic information becomes important: rather than modeling moving objects as arbitrary obstacles, they should be categorized and tracked in order to predict their future behavior. To this end, we combine classical geometric world mapping with object category detection and tracking. Object-category-specific detectors serve to find instances of the most important object classes (in our case pedestrians and cars). Based on these detections, multi-object tracking recovers the objects’ trajectories, thereby making it possible to predict their future locations, and to employ dynamic path planning. The approach is evaluated on challenging, realistic video sequences recorded at busy inner-city locations.
Hyperspectral sensors capture a portion of the visible and near-infrared spectrum with many narrow spectral bands. This makes it possible to better discriminate objects based on their reflectance spectra and to derive more detailed object properties. For technical reasons, the high spectral resolution comes at the cost of lower spatial resolution. To mitigate that problem, one may combine such images with conventional multispectral images of higher spatial, but lower spectral resolution. The process of fusing the two types of imagery into a product with both high spatial and spectral resolution is called hyperspectral super-resolution. We propose a method that performs hyperspectral super-resolution by jointly unmixing the two input images into pure reflectance spectra of the observed materials, along with the associated mixing coefficients. Joint super-resolution and unmixing is solved by a coupled matrix factorization, taking into account several useful physical constraints. The formulation also includes adaptive spatial regularization to exploit local geometric information from the multispectral image. Moreover, we estimate the relative spatial and spectral responses of the two sensors from the data. That information is required for the super-resolution, but for real-world images it is often known only approximately. In experiments with five public datasets, we show that the proposed approach delivers up to 15% improved hyperspectral super-resolution.
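To make the coupling between the two images concrete, here is a minimal sketch of the forward model that a coupled factorization of this kind typically assumes; it does not implement the paper's solver or regularization. The latent high-resolution image is written as endmember spectra times abundances, the hyperspectral image as its spatially averaged version, and the multispectral image as its spectrally averaged version. All sizes, the symbols `E`, `A`, `S`, `R`, the 1D pixel layout and the box-shaped responses are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bands, n_ms_bands, n_materials = 30, 4, 3   # hypothetical sizes
n_pixels, factor = 16, 4                      # high-res pixels, downsampling factor

# Endmembers: one reflectance spectrum per material, values in [0, 1].
E = rng.uniform(0.0, 1.0, size=(n_bands, n_materials))
# Abundances: non-negative and summing to one per pixel (physical constraints).
A = rng.dirichlet(np.ones(n_materials), size=n_pixels).T   # (materials, pixels)
Z = E @ A                                                  # latent high-res image

# Spatial response: each low-res pixel averages `factor` neighbouring pixels.
S = np.kron(np.eye(n_pixels // factor), np.ones((factor, 1)) / factor)
# Spectral response: each multispectral band averages a contiguous band group.
R = np.zeros((n_ms_bands, n_bands))
for b, idx in enumerate(np.array_split(np.arange(n_bands), n_ms_bands)):
    R[b, idx] = 1.0 / len(idx)

Y_h = Z @ S   # hyperspectral image: all bands, few pixels
Y_m = R @ Z   # multispectral image: few bands, all pixels
```

Both observations are explained by the same `E` and `A`, which is what lets the factorization be solved jointly; in this sketch `R @ Y_h` and `Y_m @ S` coincide, and the spatially averaged abundances `A @ S` still sum to one per pixel.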
Mapping agricultural crops is an important application of remote sensing. However, in many cases it is based either on hyperspectral imagery or on multitemporal coverage, both of which are difficult to scale up to large-scale deployment at high spatial resolution. In the present paper, we evaluate the possibility of crop classification based on single images from very high-resolution (VHR) satellite sensors. The main objective of this work is to expose the performance difference between state-of-the-art parcel-based smoothing and purely data-driven conditional random field (CRF) smoothing, which is as yet unknown. To fulfill this objective, we perform extensive tests with four different classification methods (Support Vector Machines, Random Forest, Gaussian Mixtures, and Maximum Likelihood) to compute the pixel-wise data term; and we also test two different definitions of the pairwise smoothness term. We have performed a detailed evaluation on different multispectral VHR images (Ikonos, QuickBird, Kompsat-2). The main finding of this study is that pairwise CRF smoothing comes close to the state-of-the-art parcel-based method that requires parcel boundaries (average difference almost equal to 2.5%). Our results indicate that a single multispectral (R, G, B, NIR) image is enough to reach satisfactory classification accuracy for six crop classes (corn, pasture, rice, sugar beet, wheat, and tomato) in a Mediterranean climate. Overall, it appears that crop mapping using only one-shot VHR imagery taken at the right time may be a viable alternative, especially since high-resolution multitemporal or hyperspectral coverage as well as parcel boundaries are in practice often not available.