Visual object tracking has become one of the most active research topics in computer vision, with growing interest in both commercial development and academic research. Many visual trackers have been proposed in the last two decades. Recent studies of computer vision for dynamic scenes include motion detection, object classification, environment modeling, tracking of moving objects, understanding of object behaviors, object identification, and data fusion from multiple sensors. This paper provides an in-depth overview of recent object tracking research. Object tracking tasks in realistic scenarios often face challenging problems such as camera motion, occlusion, illumination effects, clutter, and similar appearance. A variety of trackers have been published that combine multiple techniques to solve multiple visual tracking sub-problems. This paper also reviews the latest research trend in object tracking based on convolutional neural networks, which is receiving growing attention. Finally, the paper discusses the future challenges and research directions for object tracking problems that still need extensive study in the coming years.
Group Normalization Wu, Yuxin; He, Kaiming
International Journal of Computer Vision,
03/2020, Volume 128, Issue 3
Journal Article
Peer-reviewed
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN’s usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained on ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO (https://github.com/facebookresearch/Detectron/blob/master/projects/GN), and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
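The grouping step described in this abstract can be illustrated with a minimal NumPy sketch. It assumes an input laid out as (batch, channels, height, width); the function name `group_norm`, the optional per-channel `gamma` and `beta` parameters, and the `eps` value are illustrative assumptions, not the authors' released code.

```python
import numpy as np


def group_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within groups of channels.

    Hypothetical helper for illustration; gamma and beta are optional
    per-channel scale and shift arrays of shape (C,).
    """
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")

    # Split the channel axis into (groups, channels_per_group).
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)

    # Statistics are computed per sample and per group, never across the batch.
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)

    normed = ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

    if gamma is not None:
        normed = normed * gamma.reshape(1, c, 1, 1)
    if beta is not None:
        normed = normed + beta.reshape(1, c, 1, 1)
    return normed
```

Because no statistic is taken over the batch axis, normalizing one sample alone gives the same result as normalizing it as part of a larger batch, which is the batch-size independence the abstract refers to.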
Multi-object tracking (MOT) has been notoriously difficult to evaluate. Previous metrics overemphasize the importance of either detection or association. To address this, we present a novel MOT evaluation metric, higher order tracking accuracy (HOTA), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers. HOTA decomposes into a family of sub-metrics which are able to evaluate each of five basic error types separately, which enables clear analysis of tracking performance. We evaluate the effectiveness of HOTA on the MOTChallenge benchmark, and show that it is able to capture important aspects of MOT performance not previously taken into account by established metrics. Furthermore, we show HOTA scores better align with human visual evaluation of tracking performance.
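The abstract does not state the formula, so the following is only a sketch of how a detection term and an association term might be balanced at a single localization threshold. It assumes the commonly cited form in which the score is the geometric mean of a detection accuracy, TP / (TP + FN + FP), and the mean association score of the true positives. The function names and the input format are hypothetical.

```python
import math


def association_score(tpa, fna, fpa):
    """Association accuracy for one true-positive match (assumed definition)."""
    return tpa / (tpa + fna + fpa)


def hota_at_threshold(assoc_scores, num_fn, num_fp):
    """Geometric mean of detection and association accuracy at one threshold.

    assoc_scores holds one association score per true-positive detection;
    num_fn and num_fp are the false-negative and false-positive counts.
    """
    num_tp = len(assoc_scores)
    total = num_tp + num_fn + num_fp
    if num_tp == 0 or total == 0:
        return 0.0
    det_a = num_tp / total                 # detection accuracy
    ass_a = sum(assoc_scores) / num_tp     # mean association accuracy
    return math.sqrt(det_a * ass_a)
```

Under this assumption, a final score would be obtained by averaging the per-threshold values over a range of localization thresholds; the geometric mean keeps a tracker from scoring well on detection alone or association alone.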
Computer vision systems in real-world applications need to be robust to partial occlusion while also being explainable. In this work, we show that black-box deep convolutional neural networks (DCNNs) have only limited robustness to partial occlusion. We overcome these limitations by unifying DCNNs with part-based models into Compositional Convolutional Neural Networks (CompositionalNets), an interpretable deep architecture with innate robustness to partial occlusion. Specifically, we propose to replace the fully connected classification head of DCNNs with a differentiable compositional model that can be trained end-to-end. The structure of the compositional model enables CompositionalNets to decompose images into objects and context, as well as to further decompose object representations in terms of individual parts and the objects’ pose. The generative nature of our compositional model enables it to localize occluders and to recognize objects based on their non-occluded parts. We conduct extensive experiments in terms of image classification and object detection on images of artificially occluded objects from the PASCAL3D+ and ImageNet datasets, and real images of partially occluded vehicles from the MS-COCO dataset. Our experiments show that CompositionalNets made from several popular DCNN backbones (VGG-16, ResNet50, ResNext) improve by a large margin over their non-compositional counterparts at classifying and detecting partially occluded objects. Furthermore, they can localize occluders accurately despite being trained with class-level supervision only. Finally, we demonstrate that CompositionalNets provide human interpretable predictions as their individual components can be understood as detecting parts and estimating an object’s viewpoint.
Knowledge Distillation: A Survey Gou, Jianping; Yu, Baosheng; Maybank, Stephen J. ...
International Journal of Computer Vision,
06/2021, Volume 129, Issue 6
Journal Article
Peer-reviewed
Open access
In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapidly increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher–student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed, and comments on future research are put forward and discussed.
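As an illustration of learning a student from a teacher, here is a minimal sketch of one common response-based formulation: the student matches the teacher's temperature-softened class probabilities while also fitting the ground-truth label. The abstract does not specify a loss, so this particular form, the function names, and the default `temperature` and `alpha` values are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Ordinary cross-entropy of the student against the true label.
    ce = -math.log(softmax(student_logits)[label])

    # The temperature**2 factor keeps the soft-target term on a comparable
    # scale as the temperature changes.
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

A higher temperature spreads the teacher's probability mass over more classes, so the student also learns from the relative scores of the incorrect classes.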
With the growing volume of online information, recommender systems have been an effective strategy to overcome information overload. The utility of recommender systems cannot be overstated, given their widespread adoption in many web applications, along with their potential to ameliorate many problems related to over-choice. In recent years, deep learning has garnered considerable interest in many research fields such as computer vision and natural language processing, owing not only to stellar performance but also to the attractive property of learning feature representations from scratch. The influence of deep learning is also pervasive, recently demonstrating its effectiveness when applied to information retrieval and recommender systems research. The field of deep learning in recommender systems is flourishing. This article aims to provide a comprehensive review of recent research efforts on deep learning-based recommender systems. More concretely, we devise a taxonomy of deep learning-based recommendation models and provide a comprehensive summary of the state of the art. Finally, we expand on current trends and provide new perspectives pertaining to this new and exciting development of the field.
Driven by rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition is to infer human actions (present state) based upon complete action executions, and action prediction is to predict human actions (future state) based upon incomplete action executions. These two tasks have become particularly prevalent topics recently because of their rapidly emerging real-world applications, such as visual surveillance, autonomous driving, entertainment, and video retrieval. Many attempts have been made over the last few decades to build a robust and effective framework for action recognition and prediction. In this paper, we survey the state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also systematically discussed.
Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present a densely annotated dataset, ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both of the benchmarks and re-implement state-of-the-art models for open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for the semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.
Estimation of the human pose from a monocular camera has been an emerging research topic in the computer vision community with many applications. Recently, benefiting from deep learning technologies, a significant amount of research has advanced monocular human pose estimation in both 2D and 3D. Although there have been some works summarizing different approaches, it still remains challenging for researchers to have an in-depth view of how these approaches work from 2D to 3D. In this article, we provide a comprehensive and holistic 2D-to-3D perspective to tackle this problem. First, we comprehensively summarize the 2D and 3D representations of the human body. Then, we summarize the mainstream and milestone approaches for these human body representations since the year 2014 under unified frameworks. In particular, we provide insightful analyses of the intrinsic connections and the evolution of methods from 2D to 3D pose estimation. Furthermore, we analyze the solutions for challenging cases, such as the lack of data, the inherent ambiguity between 2D and 3D, and complex multi-person scenarios. Next, we summarize the benchmarks, evaluation metrics, and the quantitative performance of popular approaches. Finally, we discuss the challenges and reflect on promising directions for future research. We believe this survey will provide readers (researchers, engineers, developers, etc.) with a deep and insightful understanding of monocular human pose estimation.
Semantic segmentation is an extensively studied task in computer vision, with numerous methods proposed every year. Thanks to the advent of deep learning in semantic segmentation, the performance on existing benchmarks is close to saturation. A natural question then arises: Does the superior performance on the closed (and frequently re-used) test sets transfer to the open visual world with unconstrained variations? In this paper, we take steps toward answering the question by exposing failures of existing semantic segmentation methods in the open visual world under the constraint of very limited human labeling effort. Inspired by previous research on model falsification, we start from an arbitrarily large image set, and automatically sample a small image set by maximizing the discrepancy (MAD) between two segmentation methods. The selected images have the greatest potential in falsifying either (or both) of the two methods. We also explicitly enforce several conditions to diversify the exposed failures, corresponding to different underlying root causes. A segmentation method whose failures are more difficult to expose in the MAD competition is considered better. We conduct a thorough MAD diagnosis of ten PASCAL VOC semantic segmentation algorithms. With detailed analysis of experimental results, we point out strengths and weaknesses of the competing algorithms, as well as potential research directions for further advancement in semantic segmentation. The code is publicly available at https://github.com/QTJiebin/MAD_Segmentation.
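The sampling step described in this abstract, picking the images on which two segmentation methods disagree most, can be sketched as follows. This is a simplified illustration and not the code from the linked repository: it assumes each prediction is a flattened list of per-pixel class labels, measures discrepancy as one minus the mean IoU between the two predictions, and omits the diversity conditions mentioned in the abstract. All function names are hypothetical.

```python
def mean_iou(pred_a, pred_b):
    """Mean IoU between two flattened label maps of equal length."""
    classes = set(pred_a) | set(pred_b)
    if not classes:
        return 1.0  # two empty maps agree trivially
    ious = []
    for c in classes:
        inter = sum(1 for a, b in zip(pred_a, pred_b) if a == c and b == c)
        union = sum(1 for a, b in zip(pred_a, pred_b) if a == c or b == c)
        ious.append(inter / union)
    return sum(ious) / len(ious)


def select_max_discrepancy(preds_a, preds_b, k):
    """Return the k image names on which the two methods disagree most.

    preds_a and preds_b map image names to flattened label maps produced
    by the two segmentation methods being compared.
    """
    discrepancy = {
        name: 1.0 - mean_iou(preds_a[name], preds_b[name])
        for name in preds_a
    }
    return sorted(discrepancy, key=discrepancy.get, reverse=True)[:k]


# Toy usage with three 4-pixel "images".
method_a = {"img1": [0, 0, 1, 1], "img2": [0, 0, 1, 1], "img3": [0, 1, 1, 1]}
method_b = {"img1": [0, 0, 1, 1], "img2": [1, 1, 0, 0], "img3": [0, 0, 1, 1]}
top_two = select_max_discrepancy(method_a, method_b, 2)
```

Only the selected images need human labels, since at least one of the two methods must be wrong wherever they disagree strongly.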