Deep networks trained on demonstrations of human driving have learned to follow roads and avoid obstacles. However, driving policies trained via imitation learning cannot be controlled at test time. A vehicle trained end-to-end to imitate an expert cannot be guided to take a specific turn at an upcoming intersection. This limits the utility of such systems. We propose to condition imitation learning on high-level command input. At test time, the learned driving policy functions as a chauffeur that handles sensorimotor coordination but continues to respond to navigational commands. We evaluate different architectures for conditional imitation learning in vision-based driving. We conduct experiments in realistic three-dimensional simulations of urban driving and on a 1/5 scale robotic truck that is trained to drive in a residential area. Both systems drive based on visual input yet remain responsive to high-level navigational commands.
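To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of a command-conditioned policy in a branched style: a shared perception backbone feeds one action head per high-level command, and the command selects which head drives the vehicle. All module names and layer sizes are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of command-conditioned imitation learning (branched variant),
# assuming a PyTorch setup; names and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn

class ConditionalPolicy(nn.Module):
    def __init__(self, num_commands=4, feat_dim=512, action_dim=2):
        super().__init__()
        # Shared perception backbone: image -> feature vector.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # One action head per high-level command
        # (e.g. follow lane, turn left, turn right, go straight).
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                          nn.Linear(128, action_dim))
            for _ in range(num_commands)
        )

    def forward(self, image, command):
        # image: (B, 3, H, W); command: (B,) long tensor of command indices.
        feats = self.backbone(image)                                   # (B, feat_dim)
        outs = torch.stack([b(feats) for b in self.branches], dim=1)   # (B, C, action_dim)
        idx = command.view(-1, 1, 1).expand(-1, 1, outs.size(-1))
        return outs.gather(1, idx).squeeze(1)   # select the commanded branch's action
```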
Motivated by the astonishing capabilities of natural intelligent agents and inspired by theories from psychology, this paper explores the idea that perception gets coupled to 3D properties of the world via interaction with the environment. Existing approaches to depth estimation require either massive amounts of annotated training data or some form of hard-coded geometric constraint. This paper explores a new approach to learning depth perception requiring neither of those. Specifically, we propose a novel global-local network architecture that can be trained with the data observed by a robot exploring an environment: images and extremely sparse depth measurements, down to even a single pixel per image. From a pair of consecutive images, the proposed network outputs a latent representation of the camera's and scene's parameters, and a dense depth map. Experiments on several datasets show that, when ground truth is available even for just one of the image pixels, the proposed network can learn monocular dense depth estimation up to 22.5% more accurately than state-of-the-art approaches. We believe that this work, in addition to its scientific interest, lays the foundations to learn depth with extremely sparse supervision, which can be valuable to all robotic systems acting under severe bandwidth or sensing constraints.
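As an illustration of training under extremely sparse depth supervision, the sketch below applies a regression loss only at pixels that carry a measurement, down to a single pixel per image. This is a minimal, hedged example; the tensor names and the L1 penalty are assumptions, not the paper's loss.

```python
# Hedged sketch of supervision with extremely sparse depth: the loss is computed
# only at pixels that carry a measurement (possibly a single pixel per image).
# pred, target: (B, 1, H, W) depth maps; mask: (B, 1, H, W) marking measured pixels.
import torch

def sparse_depth_loss(pred, target, mask):
    valid = mask.bool()
    if valid.sum() == 0:
        return pred.sum() * 0.0   # no measurement in this batch: zero loss
    return (pred[valid] - target[valid]).abs().mean()
```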
A classical problem in computer vision is to infer a 3D scene representation from a few images that can be used to render novel views at interactive rates. Previous work focuses on reconstructing pre-defined 3D representations, e.g. textured meshes, or implicit representations, e.g. radiance fields, and often requires input images with precise camera poses and long processing times for each novel scene. In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesizes novel views, all in a single feed-forward pass. To calculate the scene representation, we propose a generalization of the Vision Transformer to sets of images, enabling global information integration and hence 3D reasoning. An efficient decoder transformer parameterizes the light field by attending into the scene representation to render novel views. Learning is supervised end-to-end by minimizing a novel-view reconstruction error. We show that this method outperforms recent baselines in terms of PSNR and speed on synthetic datasets, including a new dataset created for the paper. Further, we demonstrate that SRT scales to support interactive visualization and semantic segmentation of real-world outdoor environments using Street View imagery.
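The decoder described above can be pictured as a cross-attention step: a query derived from a camera ray attends into the set-latent scene representation and is mapped to a color, parameterizing the light field. The sketch below is an assumption-laden simplification (ray parameterization, dimensions, and head count are invented), not the SRT implementation.

```python
# Rough sketch of a light-field decoder: each ray query cross-attends into the
# set-latent scene representation to predict that ray's color. Illustrative only.
import torch
import torch.nn as nn

class LightFieldDecoder(nn.Module):
    def __init__(self, latent_dim=256, ray_dim=6):
        super().__init__()
        self.ray_embed = nn.Linear(ray_dim, latent_dim)   # origin + direction -> query
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
        self.to_rgb = nn.Linear(latent_dim, 3)

    def forward(self, rays, scene_tokens):
        # rays: (B, R, 6) ray origins and directions;
        # scene_tokens: (B, N, latent_dim) set-latent scene representation.
        q = self.ray_embed(rays)
        out, _ = self.attn(q, scene_tokens, scene_tokens)
        return self.to_rgb(out)   # (B, R, 3) predicted colors, one per ray
```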
The present study deals with the fabrication of light-reflecting materials used in pixelated scintillator detectors. For the first time, reflecting surfaces for pixels of different sizes (from 0.8 to 3.2 mm) were obtained via a low-cost DLP 3D printing technique. The material for the reflectors was a new composite of transparent ultraviolet-light-cured resin and TiO2 as a light-scattering filler. It was observed that TiO2 showed better performance compared to other pigments such as BaSO4, hBN, or cubic zirconia. The reflector formation rate was about 1 cm per hour, with the possibility of producing several units simultaneously, which simplifies the wrapping procedure. It was found that the regular pattern of the fabricated reflectors (stair-step effect) could increase light collection from a scintillator. The reflective properties of such surfaces were comparable to those of a conventional reflective coating (e.g., Teflon wrapping).
• A new light-reflecting material comprising TiO2 and a binder resin has been studied.
• Reflectors with complex shapes were formed via a low-cost DLP 3D printing technique.
• The reflecting properties of the composite are comparable to standard Teflon wrapping.
• The reflectors can be successfully used in the assembly of scintillator arrays.
Autonomous micro aerial vehicles still struggle with fast and agile maneuvers, dynamic environments, imperfect sensing, and state estimation drift. Autonomous drone racing brings these challenges to the fore. Human pilots can fly a previously unseen track after a handful of practice runs. In contrast, state-of-the-art autonomous navigation algorithms require either a precise metric map of the environment or a large amount of training data collected in the track of interest. To bridge this gap, we propose an approach that can fly a new track in a previously unseen environment without a precise map or expensive data collection. Our approach represents the global track layout with coarse gate locations, which can be easily estimated from a single demonstration flight. At test time, a convolutional network predicts the poses of the closest gates along with their uncertainty. These predictions are incorporated by an extended Kalman filter to maintain optimal maximum-a-posteriori estimates of gate locations. This allows the framework to cope with misleading high-variance estimates that could stem from poor observability or lack of visible gates. Given the estimated gate poses, we use model predictive control to quickly and accurately navigate through the track. We conduct extensive experiments in the physical world, demonstrating agile and robust flight through complex and diverse previously-unseen race tracks. The presented approach was used to win the IROS 2018 Autonomous Drone Race Competition, outracing the second-place team by a factor of two.
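The filtering step is easy to illustrate: a Kalman update weights the network's gate-pose prediction by its predicted variance, so a high-variance (poorly observed) prediction barely moves the estimate. The per-axis scalar form below, with made-up numbers, is a hedged sketch of that behavior, not the authors' filter.

```python
# Minimal sketch of fusing a network's gate-pose prediction and its predicted
# variance in a Kalman update; diagonal/per-axis form for brevity. Illustrative only.
import numpy as np

def kalman_update(mu, var, z, r):
    """mu, var: prior gate-pose estimate and its variance;
    z, r: network prediction and its predicted measurement variance.
    A high-variance (low-confidence) prediction barely moves the estimate."""
    k = var / (var + r)          # Kalman gain per axis
    mu_new = mu + k * (z - mu)
    var_new = (1.0 - k) * var
    return mu_new, var_new

mu, var = np.array([4.0, 1.0, 2.0]), np.array([0.5, 0.5, 0.5])
z, r = np.array([4.4, 0.8, 2.1]), np.array([0.1, 5.0, 0.1])  # 2nd axis poorly observed
print(kalman_update(mu, var, z, r))  # 2nd axis stays close to the prior estimate
```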
Text-to-Image (T2I) models have made significant advancements in recent years, but they still struggle to accurately capture intricate details specified in complex compositional prompts. While fine-tuning T2I models with reward objectives has shown promise, it suffers from "reward hacking" and may not generalize well to unseen prompt distributions. In this work, we propose Reward-based Noise Optimization (ReNO), a novel approach that enhances T2I models at inference by optimizing the initial noise based on the signal from one or multiple human preference reward models. Remarkably, solving this optimization problem with gradient ascent for 50 iterations yields impressive results on four different one-step models across two competitive benchmarks, T2I-CompBench and GenEval. Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models. Extensive user studies demonstrate that our model is preferred nearly twice as often as the popular SDXL model and is on par with the proprietary Stable Diffusion 3 with 8B parameters. Moreover, given the same computational resources, a ReNO-optimized one-step model outperforms widely-used open-source models such as SDXL and PixArt-$\alpha$, highlighting the efficiency and effectiveness of ReNO in enhancing T2I model performance at inference time. Code is available at https://github.com/ExplainableML/ReNO.
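The core loop of the approach is compact: treat the initial noise as the optimization variable and ascend the reward of the one-step generation. The sketch below assumes stand-in callables `generate` (a differentiable one-step sampler) and `reward_model` (a preference reward); it is a simplified illustration, not the released ReNO code at the URL above.

```python
# Hedged sketch of reward-based noise optimization: the initial latent noise is
# the only optimization variable; the model weights stay frozen.
import torch

def reno_optimize(generate, reward_model, prompt, shape, steps=50, lr=0.05):
    noise = torch.randn(shape, requires_grad=True)
    opt = torch.optim.SGD([noise], lr=lr)
    for _ in range(steps):
        image = generate(noise, prompt)       # differentiable one-step sampler
        loss = -reward_model(image, prompt)   # negate: gradient *ascent* on reward
        opt.zero_grad()
        loss.backward()
        opt.step()
    return noise.detach()   # optimized initial noise for the final generation
```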
We address the problem of learning accurate 3D shape and camera pose from a collection of unlabeled category-specific images. We train a convolutional network to predict both the shape and the pose from a single image by minimizing the reprojection error: given several views of an object, the projections of the predicted shapes to the predicted camera poses should match the provided views. To deal with pose ambiguity, we introduce an ensemble of pose predictors which we then distill to a single "student" model. To allow for efficient learning of high-fidelity shapes, we represent the shapes by point clouds and devise a formulation allowing for their differentiable projection. Our experiments show that the distilled ensemble of pose predictors learns to estimate the pose accurately, while the point cloud representation allows detailed shape models to be predicted. The supplementary video can be found at https://www.youtube.com/watch?v=LuIGovKeo60
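The reprojection objective rests on a differentiable projection of the predicted point cloud under the predicted camera pose. Below is a minimal pinhole-projection sketch; the pose convention and the single focal-length parameter are simplifying assumptions for illustration, not the paper's formulation.

```python
# Sketch of the differentiable projection at the core of a reprojection loss:
# a predicted point cloud, transformed by a predicted camera pose, is projected
# with a pinhole model. Gradients flow through to both shape and pose.
import torch

def project_points(points, R, t, f=1.0):
    """points: (N, 3) in object frame; R: (3, 3) rotation, t: (3,) translation."""
    cam = points @ R.T + t               # object frame -> camera frame
    z = cam[:, 2:3].clamp(min=1e-6)      # keep the division finite and differentiable
    return f * cam[:, :2] / z            # (N, 2) image-plane coordinates
```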