The application of deep learning in robotics leads to very specific problems and research questions that are typically not addressed by the computer vision and machine learning communities. In this paper we discuss a number of robotics-specific learning, reasoning, and embodiment challenges for deep learning. We explain the need for better evaluation metrics, highlight the importance and unique challenges of deep robotic learning in simulation, and explore the spectrum between purely data-driven and model-driven approaches. We hope this paper provides a motivating overview of important research directions to overcome the current limitations and helps to fulfill the promising potential of deep learning in robotics.
Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by varying illumination and weather conditions. Leveraging complementary modalities can enable learning of semantically richer representations that are resilient to such perturbations. Despite the tremendous progress in recent years, most multimodal convolutional neural network approaches directly concatenate feature maps from individual modality streams, rendering the model incapable of focusing only on the relevant complementary information for fusion. To address this limitation, we propose a multimodal semantic segmentation framework that dynamically adapts the fusion of modality-specific features while being sensitive to the object category, spatial location and scene context in a self-supervised manner. Specifically, we propose an architecture consisting of two modality-specific encoder streams that fuse intermediate encoder representations into a single decoder using our proposed self-supervised model adaptation fusion mechanism, which optimally combines complementary features. As intermediate representations are not aligned across modalities, we introduce an attention scheme for better correlation. In addition, we propose a computationally efficient unimodal segmentation architecture termed AdapNet++ that incorporates a new encoder with multiscale residual units and an efficient atrous spatial pyramid pooling that has a larger effective receptive field with more than 10× fewer parameters, complemented with a strong decoder with a multi-resolution supervision scheme that recovers high-resolution details. Comprehensive empirical evaluations on the Cityscapes, Synthia, SUN RGB-D, ScanNet and Freiburg Forest benchmarks demonstrate that both our unimodal and multimodal architectures achieve state-of-the-art performance while simultaneously being efficient in terms of parameters and inference time, as well as demonstrating substantial robustness in adverse perceptual conditions.
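To make the fusion idea above concrete, the following is a minimal, hypothetical PyTorch sketch of a gated fusion block in this spirit: two modality-specific feature maps are concatenated, a small bottleneck predicts per-location weights, and the reweighted features are projected into a single fused stream. Layer sizes and the reduction factor are illustrative assumptions, not the released SSMA implementation.

```python
# Hedged sketch of a gated, self-adaptive fusion block for two modality streams.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Bottleneck that estimates fusion weights from both modalities jointly.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, (2 * channels) // reduction, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d((2 * channels) // reduction, 2 * channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Projection of the reweighted, concatenated features to a single stream.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat_rgb: torch.Tensor, feat_depth: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([feat_rgb, feat_depth], dim=1)   # (B, 2C, H, W)
        weights = self.gate(stacked)                         # dynamic per-location weights
        return self.fuse(stacked * weights)                  # (B, C, H, W)

# Example: fuse mid-level RGB and depth encoder features.
fused = GatedFusion(channels=256)(torch.randn(1, 256, 48, 96), torch.randn(1, 256, 48, 96))
print(fused.shape)
```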
Visual place recognition (VPR) is the process of recognising a previously visited place using visual information, often under varying appearance conditions and viewpoint changes and with computational constraints. VPR is related to the concepts of localisation, loop closure and image retrieval, and is a critical component of many autonomous navigation systems ranging from autonomous vehicles to drones and computer vision systems. While the concept of place recognition has been around for many years, VPR research has grown rapidly as a field over the past decade due to improving camera hardware and the potential of deep learning-based techniques, and has become a widely studied topic in both the computer vision and robotics communities. This growth, however, has led to fragmentation and a lack of standardisation in the field, especially concerning performance evaluation. Moreover, the notion of viewpoint and illumination invariance of VPR techniques has largely been assessed qualitatively and hence ambiguously in the past. In this paper, we address these gaps through a new comprehensive open-source framework for assessing the performance of VPR techniques, dubbed “VPR-Bench”. VPR-Bench (open-sourced at https://github.com/MubarizZaffar/VPR-Bench) introduces two much-needed capabilities for VPR researchers: firstly, it contains a benchmark of 12 fully-integrated datasets and 10 VPR techniques, and secondly, it integrates a comprehensive variation-quantified dataset for quantifying viewpoint and illumination invariance. We apply and analyse popular evaluation metrics for VPR from both the computer vision and robotics communities, and discuss how these different metrics complement and/or replace each other, depending upon the underlying applications and system requirements. Our analysis reveals that no universal state-of-the-art (SOTA) VPR technique exists, since: (a) SOTA performance is achieved by 8 out of the 10 techniques on at least one dataset, and (b) the SOTA technique in one community does not necessarily yield SOTA performance in the other, given the differences in datasets and metrics. Furthermore, we identify key open challenges, since: (c) all 10 techniques suffer greatly in perceptually-aliased and less-structured environments, (d) all techniques suffer from viewpoint variance, where lateral change has less effect than 3D change, and (e) directional illumination change has more adverse effects on matching confidence than uniform illumination change. We also present detailed meta-analyses regarding the roles of varying ground-truths, platforms, application requirements and technique parameters. Finally, VPR-Bench provides a unified implementation to deploy these VPR techniques, metrics and datasets, and is extensible through templates.
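For readers unfamiliar with the metrics discussed here, the sketch below illustrates two common evaluation styles on a toy similarity matrix: Recall@N, popular in computer-vision retrieval, and a precision-recall sweep over match-score thresholds, common in robotics loop-closure evaluation. The data, thresholds and single-best-match convention are synthetic assumptions; this code is not part of VPR-Bench.

```python
# Hedged sketch of Recall@N and a precision-recall sweep for place recognition.
import numpy as np

def recall_at_n(similarity: np.ndarray, gt_match: np.ndarray, n: int = 1) -> float:
    """Fraction of queries with at least one true match among the top-N references."""
    top_n = np.argsort(-similarity, axis=1)[:, :n]
    hits = [gt_match[q, top_n[q]].any() for q in range(similarity.shape[0])]
    return float(np.mean(hits))

def pr_points(similarity: np.ndarray, gt_match: np.ndarray, num_thresholds: int = 50):
    """Precision/recall pairs over match-score thresholds, one best match per query."""
    best = similarity.argmax(axis=1)
    score = similarity.max(axis=1)
    correct = gt_match[np.arange(len(best)), best]
    precision, recall = [], []
    for t in np.linspace(score.min(), score.max(), num_thresholds):
        accepted = score >= t
        if accepted.any():
            precision.append(correct[accepted].mean())
            recall.append(correct[accepted].sum() / max(correct.sum(), 1))
    return np.array(precision), np.array(recall)

# Toy example: query i truly matches reference i; scores are random but boosted
# on the true matches so that some, not all, queries retrieve correctly.
rng = np.random.default_rng(0)
sim = rng.random((100, 500))
gt = np.zeros((100, 500), dtype=bool)
gt[np.arange(100), np.arange(100)] = True
sim[np.arange(100), np.arange(100)] += rng.random(100)
print(recall_at_n(sim, gt, n=5), pr_points(sim, gt)[0][:3])
```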
Vehicle-to-everything (V2X) communication techniques enable the collaboration between vehicles and many other entities in the neighboring environment, which could fundamentally improve the perception system for autonomous driving. However, the lack of a public dataset significantly restricts the research progress of collaborative perception. To fill this gap, we present V2X-Sim, a comprehensive simulated multi-agent perception dataset for V2X-aided autonomous driving. V2X-Sim provides: (1) multi-agent sensor recordings from the road-side unit (RSU) and multiple vehicles that enable collaborative perception, (2) multi-modality sensor streams that facilitate multi-modality perception, and (3) diverse ground truths that support various perception tasks. Meanwhile, we build an open-source testbed and provide a benchmark for the state-of-the-art collaborative perception algorithms on three tasks, including detection, tracking and segmentation. V2X-Sim seeks to stimulate collaborative perception research for autonomous driving before realistic datasets become widely available.
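As a rough illustration of what collaborative perception means in practice, the sketch below shows a simple late-fusion baseline: each agent shares detections in its own frame, the ego vehicle transforms them into a common frame using known poses, and near-duplicate detections are suppressed. The detection format, poses and merging radius are assumptions for illustration, not the V2X-Sim benchmark code.

```python
# Hedged late-fusion sketch for multi-agent (vehicle + road-side unit) detections.
import numpy as np

def to_ego_frame(points_xy, pose):
    """pose = (x, y, yaw) of the sending agent expressed in the ego frame."""
    x, y, yaw = pose
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s], [s, c]])
    return points_xy @ R.T + np.array([x, y])

def merge_detections(agent_detections, agent_poses, radius=1.0):
    """agent_detections: list of (N_i, 3) arrays with columns (x, y, score)."""
    fused = []
    for det, pose in zip(agent_detections, agent_poses):
        xy = to_ego_frame(det[:, :2], pose)
        fused.append(np.column_stack([xy, det[:, 2]]))
    fused = np.concatenate(fused)
    fused = fused[np.argsort(-fused[:, 2])]           # highest score first
    kept = []
    for d in fused:
        if all(np.linalg.norm(d[:2] - k[:2]) > radius for k in kept):
            kept.append(d)                            # suppress near-duplicates
    return np.array(kept)

ego = np.array([[10.0, 2.0, 0.9]])                    # ego's own detection
rsu = np.array([[5.0, -3.0, 0.8], [30.0, 1.0, 0.7]])  # road-side unit detections
print(merge_detections([ego, rsu], [(0.0, 0.0, 0.0), (2.0, 5.0, np.pi / 2)]))
```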
Active vision aims to equip computer vision methods with the ability to dynamically adjust the capturing sensor’s viewpoint, position, or parameters in real time. This dynamic capability allows for improving the accuracy of the perception process. However, training and evaluating an active vision model often requires a large number of annotated images captured under different sensor and environmental settings, in order to emulate actions like moving around, approaching, or moving away from a person and thus effectively model the active perception dynamics. Obviously, collecting and annotating such datasets is a challenging and expensive task. To overcome these limitations, this paper introduces a synthetic image generation pipeline specifically designed to support active vision tasks. The pipeline is developed using a highly realistic simulation framework based on Unity and allows for the generation of images depicting humans, captured at varying view angles, distances, illumination conditions, and backgrounds, supporting a wide range of different tasks. Two annotated datasets, namely ActiveHuman and ActiveFace, are generated using the pipeline, and the effectiveness of the proposed approach is demonstrated by a solid use case that involves training and evaluating an embedding-based active face recognizer. Furthermore, we demonstrate how the proposed generation approach enables expanding existing active face recognition methods by training models that control both left/right movements and the distance to a subject, leveraging the additional information provided by the ActiveFace dataset. To facilitate replication and encourage the use of the generated datasets for training and evaluating other active vision approaches, the associated assets and the developed dataset generation pipeline are to be made publicly available.
•Active vision models require a large number of annotated data samples.
•We propose a synthetic image generation pipeline designed to support active vision.
•A highly realistic simulation framework based on Unity is used.
•Two annotated datasets (ActiveHuman and ActiveFace) are generated.
•Evaluation provided on a challenging active face recognition setup.
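The following is a minimal, hypothetical sketch of the kind of embedding-based active recognition loop such datasets enable: the agent embeds the current view, matches it against a gallery, and picks a new viewpoint action whenever the match margin is low. The capture and embedding functions, the random exploration policy and the margin threshold are placeholders, not the trained models described above.

```python
# Hedged sketch of an embedding-based active face recognition loop.
import numpy as np

ACTIONS = ["stay", "move_left", "move_right", "move_closer", "move_away"]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def recognise(capture, embed, gallery, margin_threshold=0.15, max_steps=5):
    """gallery: dict mapping identity name -> reference embedding (1-D array)."""
    action, name, best = "stay", None, 0.0
    for _ in range(max_steps):
        emb = embed(capture(action))                       # act, then observe
        scores = sorted(((cosine(emb, ref), n) for n, ref in gallery.items()),
                        reverse=True)
        (best, name), (second, _) = scores[0], scores[1]
        if best - second >= margin_threshold:              # confident enough: stop early
            return name, best
        action = np.random.choice(ACTIONS[1:])             # placeholder exploration policy
    return name, best                                       # best guess after the budget

# Toy usage with random embeddings standing in for the camera and face network.
rng = np.random.default_rng(0)
gallery = {"id_a": rng.normal(size=128), "id_b": rng.normal(size=128)}
print(recognise(lambda act: None, lambda img: rng.normal(size=128), gallery))
```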
Although significant progress has been made with convolutional neural networks, it is still difficult to realize accurate and robust stereo matching in real time. In this article, we study how to achieve more accurate and robust disparity estimation under a real-time requirement. To this end, we propose a Multi-scale Volume Fusion (MVF) module and embed it to improve matching accuracy. To achieve real-time performance, we propose an innovative way to use 3D convolution: it is employed only during training, for guidance and supervision, keeping inference lightweight. Based on these two structures, we design an end-to-end stereo matching method called the 3D Convolution Guided and Multi-scale Cost Volume Fusion Network (CGFNet). Experimental results show that CGFNet has better generalization performance on cross-domain datasets, achieving more accurate disparity estimation in challenging regions without an additional fine-tuning process. On the KITTI benchmark, CGFNet reaches D1-all = 1.98%, a substantial improvement among state-of-the-art (SOTA) real-time models, and processes a pair of images within 38 ms (26 fps). The results are notable when considering both matching accuracy and real-time performance.
•We designed a multi-scale cost volume fusion module to achieve robust matching.
•We adopted a 3D convolution guided branch for better cost aggregation.
•Our CGFNet achieves competitive results: D1-all = 1.98% on KITTI at 26 fps.
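To illustrate the cost-volume side of such an architecture, here is a hedged PyTorch sketch of a correlation-based cost volume computed at two feature scales and fused by trilinear upsampling. The feature sizes, the two-scale setup and fusion by averaging are assumptions for exposition, not CGFNet's exact design.

```python
# Hedged sketch of a multi-scale correlation cost volume for stereo matching.
import torch
import torch.nn.functional as F

def correlation_volume(feat_l: torch.Tensor, feat_r: torch.Tensor, max_disp: int):
    """Return a (B, max_disp, H, W) cost volume from (B, C, H, W) feature maps."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_l * feat_r).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_l[..., d:] * feat_r[..., :-d]).mean(dim=1)
    return volume

def fused_multiscale_volume(feats_l, feats_r, max_disp: int):
    """feats_l/feats_r: [fine, coarse] feature maps; coarse is at half resolution."""
    h, w = feats_l[0].shape[-2:]
    fused = 0
    for fl, fr, scale in zip(feats_l, feats_r, (1, 2)):
        v = correlation_volume(fl, fr, max_disp // scale)
        # Upsample both the disparity and spatial dimensions to the fine resolution.
        v = F.interpolate(v.unsqueeze(1), size=(max_disp, h, w),
                          mode="trilinear", align_corners=False).squeeze(1)
        fused = fused + v
    return fused / 2

fine_l, fine_r = torch.randn(1, 32, 64, 128), torch.randn(1, 32, 64, 128)
coarse_l, coarse_r = torch.randn(1, 64, 32, 64), torch.randn(1, 64, 32, 64)
print(fused_multiscale_volume([fine_l, coarse_l], [fine_r, coarse_r], max_disp=48).shape)
```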
Once an academic venture, autonomous driving has received unparalleled corporate funding in the last decade. Still, the operating conditions of current autonomous cars are mostly restricted to ideal scenarios. This means that driving in challenging illumination conditions such as night, sunrise, and sunset remains an open problem. In these cases, standard cameras are being pushed to their limits in terms of low-light and high-dynamic-range performance. To address these challenges, we propose DSEC, a new dataset that contains such demanding illumination conditions and provides a rich set of sensory data. DSEC offers data from a wide-baseline stereo setup of two color frame cameras and two high-resolution monochrome event cameras. In addition, we collect lidar data and RTK GPS measurements, both hardware-synchronized with all camera data. One of the distinctive features of this dataset is the inclusion of high-resolution event cameras. Event cameras have received increasing attention for their high temporal resolution and high dynamic range performance. However, due to their novelty, event camera datasets in driving scenarios are rare. This work presents the first high-resolution, large-scale stereo dataset with event cameras. The dataset contains 53 sequences collected by driving in a variety of illumination conditions and provides ground-truth disparity for the development and evaluation of event-based stereo algorithms.
This work addresses the problem of semantic scene understanding under fog. Although marked progress has been made in semantic scene understanding, it is mainly concentrated on clear-weather scenes. Extending semantic segmentation methods to adverse weather conditions such as fog is crucial for outdoor applications. In this paper, we propose a novel method, named Curriculum Model Adaptation (CMAda), which gradually adapts a semantic segmentation model from light synthetic fog to dense real fog in multiple steps, using both labeled synthetic foggy data and unlabeled real foggy data. The method is based on the fact that the results of semantic segmentation in moderately adverse conditions (light fog) can be bootstrapped to solve the same problem in highly adverse conditions (dense fog). CMAda is extensible to other adverse conditions and provides a new paradigm for learning with synthetic data and unlabeled real data. In addition, we present four other main stand-alone contributions: (1) a novel method to add synthetic fog to real, clear-weather scenes using semantic input; (2) a new fog density estimator; (3) a novel fog densification method for real foggy scenes without known depth; and (4) the Foggy Zurich dataset comprising 3808 real foggy images, with pixel-level semantic annotations for 40 images with dense fog. Our experiments show that (1) our fog simulation and fog density estimator outperform their state-of-the-art counterparts with respect to the task of semantic foggy scene understanding (SFSU); and (2) CMAda significantly improves the performance of state-of-the-art models for SFSU, benefiting from both our synthetic and real foggy data. The foggy datasets and code are publicly available.
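The curriculum idea can be summarised in a few lines of hedged Python: at each step the admissible fog density grows, real foggy images up to that density are pseudo-labelled by the model from the previous step, and the next model is trained on denser synthetic fog plus those pseudo-labelled real images. The training, pseudo-labelling and density-estimation routines below are placeholder callables, not the paper's exact procedure.

```python
# High-level sketch of a curriculum adaptation loop from light to dense fog.
def curriculum_model_adaptation(model, synthetic_fog, real_fog, densities,
                                train, pseudo_label, estimate_density):
    """
    synthetic_fog: list of (image, label, density) rendered with a fog simulator
    real_fog:      list of unlabeled real foggy images
    densities:     increasing fog-density thresholds, e.g. [0.01, 0.03, 0.06]
    """
    for step, d in enumerate(densities):
        # Synthetic images up to the current density keep their rendered labels.
        synthetic = [(img, lbl) for img, lbl, dens in synthetic_fog if dens <= d]
        # Real images whose estimated fog density is below the current threshold
        # receive pseudo-labels from the model trained in the previous step.
        real = [(img, pseudo_label(model, img))
                for img in real_fog if estimate_density(img) <= d] if step > 0 else []
        model = train(model, synthetic + real)
    return model
```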
Event cameras are novel vision sensors that report per-pixel brightness changes as a stream of asynchronous "events". They offer significant advantages compared to standard cameras due to their high temporal resolution, high dynamic range and lack of motion blur. However, events only measure the varying component of the visual signal, which limits their ability to encode scene context. By contrast, standard cameras measure absolute intensity frames, which capture a much richer representation of the scene. Both sensors are thus complementary. However, due to the asynchronous nature of events, combining them with synchronous images remains challenging, especially for learning-based methods. This is because traditional recurrent neural networks (RNNs) are not designed for asynchronous and irregular data from additional sensors. To address this challenge, we introduce Recurrent Asynchronous Multimodal (RAM) networks, which generalize traditional RNNs to handle asynchronous and irregular data from multiple sensors. Inspired by traditional RNNs, RAM networks maintain a hidden state that is updated asynchronously and can be queried at any time to generate a prediction. We apply this novel architecture to monocular depth estimation with events and frames, where we show an improvement over state-of-the-art methods by up to 30% in terms of mean absolute depth error. To enable further research on multimodal learning with events, we release EventScape, a new dataset with events, intensity frames, semantic labels, and depth maps recorded in the CARLA simulator.
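A minimal PyTorch sketch of the asynchronous-update idea follows: each sensor owns its own recurrent cell, any cell can update a shared hidden state whenever its sensor produces a measurement, and a decoder can read the state out at an arbitrary query time. Feature sizes, the GRU cells and the linear decoder are illustrative assumptions, not the RAM network's published architecture.

```python
# Hedged sketch of asynchronous, per-sensor updates to a shared recurrent state.
import torch
import torch.nn as nn

class AsyncRecurrentFusion(nn.Module):
    def __init__(self, feat_dims=None, hidden=256, out=1):
        super().__init__()
        feat_dims = feat_dims or {"events": 64, "frames": 128}
        # One recurrent cell per sensor, all writing to the same hidden state.
        self.cells = nn.ModuleDict({k: nn.GRUCell(d, hidden) for k, d in feat_dims.items()})
        self.decoder = nn.Linear(hidden, out)
        self.hidden_size = hidden

    def init_state(self, batch):
        return torch.zeros(batch, self.hidden_size)

    def update(self, state, sensor: str, features: torch.Tensor):
        """Called whenever `sensor` delivers a new (possibly irregular) measurement."""
        return self.cells[sensor](features, state)

    def query(self, state):
        """Read out a prediction at an arbitrary time from the latest state."""
        return self.decoder(state)

model = AsyncRecurrentFusion()
state = model.init_state(batch=1)
state = model.update(state, "events", torch.randn(1, 64))   # event packet arrives
state = model.update(state, "frames", torch.randn(1, 128))  # later, a frame arrives
state = model.update(state, "events", torch.randn(1, 64))   # more events
print(model.query(state))
```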
Autonomous navigation in agricultural environments is challenged by varying field conditions that arise in arable fields. State-of-the-art solutions for autonomous navigation in such environments require expensive hardware, such as Real-Time Kinematic Global Navigation Satellite Systems. This paper presents a robust crop row detection algorithm that withstands such field variations using inexpensive cameras. Existing datasets for crop row detection do not represent all the possible field variations. A dataset of sugar beet images was created representing 11 field variations comprising multiple growth stages, light levels, varying weed densities, curved crop rows, and discontinuous crop rows. The proposed pipeline segments the crop rows using a deep learning-based method and employs the predicted segmentation mask to extract the central crop row using a novel central crop row selection algorithm. The crop row detection algorithm was tested for crop row detection performance and for the capability of visual servoing along a crop row. The visual servoing-based navigation was tested in a realistic simulation scenario with real ground and plant textures. Our algorithm demonstrated robust vision-based crop row detection in challenging field conditions, outperforming the baseline.
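As a rough sketch of how a predicted segmentation mask can be turned into a steering command (simplified relative to the paper's central crop row selection algorithm), the code below fits a line to crop pixels in a band around the image centreline and applies a proportional controller on the resulting lateral offset and heading error. The band width and controller gains are arbitrary illustrative values.

```python
# Hedged sketch: central crop row line fit and a proportional steering command.
import numpy as np

def central_row_line(mask: np.ndarray, band_fraction: float = 0.2):
    """Fit column = a * row + b to crop pixels near the image's vertical centreline."""
    h, w = mask.shape
    half_band = int(band_fraction * w / 2)
    rows, cols = np.nonzero(mask[:, w // 2 - half_band: w // 2 + half_band])
    cols = cols + (w // 2 - half_band)                 # back to full-image coordinates
    a, b = np.polyfit(rows, cols, deg=1)               # least-squares line fit
    return a, b

def steering_command(mask, k_offset=0.005, k_heading=1.0):
    h, w = mask.shape
    a, b = central_row_line(mask)
    offset = (a * (h - 1) + b) - w / 2                 # lateral error at the bottom row
    heading = np.arctan(a)                             # row inclination in image space
    return -k_offset * offset - k_heading * heading    # simple proportional controller

# Synthetic, slightly slanted crop row as a stand-in for a segmentation mask.
mask = np.zeros((240, 320), dtype=bool)
rr = np.arange(240)
mask[rr, np.clip(160 + (rr // 12), 0, 319)] = True
print(steering_command(mask))
```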