With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition ...has been inevitably influenced by this wave of revolution, consequentially entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, methodology and performance. This survey is aimed at summarizing and analyzing the major changes and significant progresses of scene text detection and recognition in the deep learning era. Through this article, we devote to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we will emphasize the dramatic differences brought by deep learning and remaining grand challenges. We expect that this review paper would serve as a reference book for researchers in this field. Related resources are also collected in our Github repository (
https://github.com/Jyouhou/SceneTextPapers
).
This paper presents a method for the 3D reconstruction of a piecewise‐planar surface from range images, typically laser scans with millions of points. The reconstructed surface is a watertight ...polygonal mesh that conforms to observations at a given scale in the visible planar parts of the scene, and that is plausible in hidden parts. We formulate surface reconstruction as a discrete optimization problem based on detected and hypothesized planes. One of our major contributions, besides a treatment of data anisotropy and novel surface hypotheses, is a regularization of the reconstructed surface w.r.t. the length of edges and the number of corners. Compared to classical area‐based regularization, it better captures surface complexity and is therefore better suited for man‐made environments, such as buildings. To handle the underlying higher‐order potentials, that are problematic for MRF optimizers, we formulate minimization as a sparse mixed‐integer linear programming problem and obtain an approximate solution using a simple relaxation. Experiments show that it is fast and reaches near‐optimal solutions.
The detection of anomalous structures in natural image data is of utmost importance for numerous tasks in the field of computer vision. The development of methods for unsupervised anomaly detection ...requires data on which to train and evaluate new approaches and ideas. We introduce the MVTec anomaly detection dataset containing 5354 high-resolution color images of different object and texture categories. It contains normal, i.e., defect-free images intended for training and images with anomalies intended for testing. The anomalies manifest themselves in the form of over 70 different types of defects such as scratches, dents, contaminations, and various structural changes. In addition, we provide pixel-precise ground truth annotations for all anomalies. We conduct a thorough evaluation of current state-of-the-art unsupervised anomaly detection methods based on deep architectures such as convolutional autoencoders, generative adversarial networks, and feature descriptors using pretrained convolutional neural networks, as well as classical computer vision methods. We highlight the advantages and disadvantages of multiple performance metrics as well as threshold estimation techniques. This benchmark indicates that methods that leverage descriptors of pretrained networks outperform all other approaches and deep-learning-based generative models show considerable room for improvement.
Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, ...they often provide the most objective measure of performance and are therefore important guides for research. We present
MOTChallenge
, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data and create a framework for the standardized evaluation of multiple object tracking methods. The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community, with applications ranging from robot navigation to self-driving cars. This paper collects the first three releases of the benchmark: (i)
MOT15
, along with numerous state-of-the-art results that were submitted in the last years, (ii)
MOT16
, which contains new challenging videos, and (iii)
MOT17
, that extends
MOT16
sequences with more precise labels and evaluates tracking performance on three different object detectors. The second and third release not only offers a significant increase in the number of labeled boxes, but also provide labels for multiple object classes beside pedestrians, as well as the level of visibility for every single object of interest. We finally provide a categorization of state-of-the-art trackers and a broad error analysis. This will help newcomers understand the related work and research trends in the MOT community, and hopefully shed some light into potential future research directions.
Advances in animal motion tracking and pose recognition have been a game changer in the study of animal behavior. Recently, an increasing number of works go ‘deeper’ than tracking, and address ...automated recognition of animals’ internal states such as emotions and pain with the aim of improving animal welfare, making this a timely moment for a systematization of the field. This paper provides a comprehensive survey of computer vision-based research on recognition of pain and emotional states in animals, addressing both facial and bodily behavior analysis. We summarize the efforts that have been presented so far within this topic—classifying them across different dimensions, highlight challenges and research gaps, and provide best practice recommendations for advancing the field, and some future directions for research.
We propose a technique for producing ‘visual explanations’ for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable. Our ...approach—Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say ‘dog’ in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept. Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fully-connected layers (
e.g.
VGG), (2) CNNs used for structured outputs (
e.g.
captioning), (3) CNNs used in tasks with multi-modal inputs (
e.g.
visual question answering) or reinforcement learning, all
without architectural changes or re-training
. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative visualization, Guided Grad-CAM, and apply it to image classification, image captioning, and visual question answering (VQA) models, including ResNet-based architectures. In the context of image classification models, our visualizations (a) lend insights into failure modes of these models (showing that seemingly unreasonable predictions have reasonable explanations), (b) outperform previous methods on the ILSVRC-15 weakly-supervised localization task, (c) are robust to adversarial perturbations, (d) are more faithful to the underlying model, and (e) help achieve model generalization by identifying dataset bias. For image captioning and VQA, our visualizations show that even non-attention based models learn to localize discriminative regions of input image. We devise a way to identify important neurons through Grad-CAM and combine it with neuron names (Bau et al. in Computer vision and pattern recognition, 2017) to provide textual explanations for model decisions. Finally, we design and conduct human studies to measure if Grad-CAM explanations help users establish appropriate trust in predictions from deep networks and show that Grad-CAM helps untrained users successfully discern a ‘stronger’ deep network from a ‘weaker’ one even when both make identical predictions. Our code is available at
https://github.com/ramprs/grad-cam/
, along with a demo on CloudCV (Agrawal et al., in: Mobile cloud visual media computing, pp 265–290. Springer, 2015) (
http://gradcam.cloudcv.org
) and a video at
http://youtu.be/COjUB9Izk6E
.