Distilled Siamese Networks for Visual Tracking
Shen, Jianbing; Liu, Yuanpei; Dong, Xingping ...
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12/2022, Volume 44, Issue 12
Journal Article
Peer reviewed
Open access
In recent years, Siamese network based trackers have significantly advanced the state-of-the-art in real-time tracking. Despite their success, Siamese trackers tend to suffer from high memory costs, which restrict their applicability to mobile devices with tight memory budgets. To address this issue, we propose a distilled Siamese tracking framework to learn small, fast and accurate trackers (students), which capture critical knowledge from large Siamese trackers (teachers) via a teacher-students knowledge distillation model. This model is intuitively inspired by the one-teacher-versus-multiple-students learning method typically employed in schools. In particular, our model contains a single teacher-student distillation module and a student-student knowledge sharing mechanism. The former is designed with a tracking-specific distillation strategy to transfer knowledge from a teacher to students. The latter enables mutual learning between students for in-depth knowledge understanding. Extensive empirical evaluations on several popular Siamese trackers demonstrate the generality and effectiveness of our framework. Moreover, the results on five tracking benchmarks show that the proposed distilled trackers achieve compression rates of up to 18× and frame rates of 265 FPS, while obtaining tracking accuracy comparable to the base models.
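The paper's tracking-specific distillation strategy is more elaborate, but the core teacher-students objective can be sketched as temperature-scaled KL terms: one from the teacher to each student, plus pairwise terms between students for knowledge sharing. A minimal NumPy sketch; the function names, temperature, and weighting below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def distillation_losses(teacher_logits, student_logits_list, T=4.0):
    """For each student, return KL(teacher || student) at temperature T
    (teacher-student distillation) and the mean pairwise KL from the
    other students (student-student knowledge sharing)."""
    p_t = softmax(teacher_logits, T)
    p_s = [softmax(s, T) for s in student_logits_list]
    teach = [kl(p_t, p) * T * T for p in p_s]  # usual T^2 gradient scaling
    share = []
    for i, pi in enumerate(p_s):
        others = [kl(pj, pi) for j, pj in enumerate(p_s) if j != i]
        share.append(float(np.mean(others)) if others else 0.0)
    return teach, share
```

A student whose logits already match the teacher incurs zero distillation loss, while the sharing terms pull disagreeing students toward each other.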
With efficient appearance learning models, the discriminative correlation filter (DCF) has proven very successful in recent video object tracking benchmarks and competitions. However, the existing DCF paradigm suffers from two major issues, i.e., spatial boundary effect and temporal filter degradation. To mitigate these challenges, we propose a new DCF-based tracking method. The key innovations of the proposed method include adaptive spatial feature selection and temporal consistent constraints, with which the new tracker enables joint spatial-temporal filter learning in a lower dimensional discriminative manifold. More specifically, we apply structured spatial sparsity constraints to multi-channel filters. Consequently, the process of learning spatial filters can be approximated by the lasso regularization. To encourage temporal consistency, the filter model is restricted to lie around its historical value and updated locally to preserve the global structure in the manifold. Last, a unified optimization framework is proposed to jointly select temporal consistency preserving spatial features and learn discriminative filters with the augmented Lagrangian method. Qualitative and quantitative evaluations have been conducted on a number of well-known benchmarking datasets such as OTB2013, OTB50, OTB100, Temple-Colour, UAV123, and VOT2018. The experimental results demonstrate the superiority of the proposed method over the state-of-the-art approaches.
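The paper solves the joint spatial-temporal learning with structured sparsity and the augmented Lagrangian; as a much-reduced sketch, a single-channel ridge-regularized correlation filter with a convex-combination temporal update illustrates the two basic ingredients (discriminative filter learning in the Fourier domain, and keeping the filter near its historical value). All names and hyper-parameters here are illustrative, not the paper's formulation:

```python
import numpy as np

def train_dcf(x, y, lam=1e-2):
    """Closed-form single-channel correlation filter in the Fourier
    domain: H = conj(X) * Y / (|X|^2 + lam), i.e. ridge regression."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (X * np.conj(X) + lam)

def update_dcf(H_prev, H_new, mu=0.05):
    """Simplified temporal-consistency update: keep the filter close to
    its historical value via a convex combination (a stand-in for the
    paper's manifold-preserving local update)."""
    return (1.0 - mu) * H_prev + mu * H_new

def respond(H, z):
    """Correlation response of filter H on a search patch z; the peak
    location gives the estimated target translation."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))
```

Training on a patch with a delta (or Gaussian) label and re-running the filter on the same patch should reproduce a response peaked at the labeled target position.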
Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems. It entails estimating the trajectory of the target in an image sequence, given only its initial location and segmentation, or its rough approximation in the form of a bounding box. Discriminative Correlation Filters (DCFs) and deep Siamese Networks (SNs) have emerged as dominating tracking paradigms, which have led to significant progress. Following the rapid evolution of visual object tracking in the last decade, this survey presents a systematic and thorough review of more than 90 DCF and Siamese trackers, based on results in nine tracking benchmarks. First, we present the background theory of both the DCF and Siamese tracking core formulations. Then, we distinguish and comprehensively review the shared as well as specific open research challenges in both these tracking paradigms. Furthermore, we thoroughly analyze the performance of DCF and Siamese trackers on nine benchmarks, covering different experimental aspects of visual tracking: datasets, evaluation metrics, performance, and speed comparisons. We finish the survey by presenting recommendations and suggestions for distinguished open challenges based on our analysis.
The end-to-end image fusion framework has achieved promising performance, with dedicated convolutional networks aggregating the multi-modal local appearance. However, long-range dependencies are directly neglected in existing CNN fusion approaches, impeding the balanced image-level perception required for complex-scenario fusion. In this paper, therefore, we propose an infrared and visible image fusion algorithm based on the transformer module and adversarial learning. Inspired by its global interaction power, we use the transformer technique to learn effective global fusion relations. In particular, shallow features extracted by a CNN interact in the proposed transformer fusion module to refine the fusion relationship within the spatial scope and across channels simultaneously. Besides, adversarial learning is designed into the training process to improve the output discrimination by imposing competitive consistency with the inputs, reflecting the specific characteristics of infrared and visible images. The experimental performance demonstrates the effectiveness of the proposed modules, with superior improvement against the state-of-the-art, generalising a novel paradigm via transformer and adversarial learning in the fusion task.
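The cross-modal interaction at the heart of the transformer fusion module can be sketched as bidirectional attention between the two modalities' token sets: infrared tokens query the visible features and vice versa, and the results are merged. A toy single-head NumPy sketch under those assumptions; the averaging merge and the absence of projections and channel attention are simplifications, not the paper's architecture:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (single head, no projections)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def transformer_fuse(f_ir, f_vis):
    """Toy cross-modal fusion of token sets of shape (N, d), where N is
    the number of spatial tokens and d the channel dimension: each
    modality attends over the other, then the two views are averaged."""
    ir2vis = attention(f_ir, f_vis, f_vis)  # IR tokens query visible features
    vis2ir = attention(f_vis, f_ir, f_ir)   # visible tokens query IR features
    return 0.5 * (ir2vis + vis2ir)
```

Because each output token is a convex combination of the other modality's tokens, every fused token mixes information from both inputs rather than from a local window only.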
The Siamese tracking framework has attracted much attention due to its scalability and efficiency in recent years. However, it is less effective at recognizing arbitrary targets under diverse appearance variations, especially in complex scenarios with background distractors and illumination changes. To this end, we propose a Siamese Residual Network that formulates the characteristics of a specific given target for visual tracking, consisting of an identity branch and a residual branch. The identity branch is a generic offline-trained similarity-matching network, which distinguishes the target from the background at the class level. To complement the identity branch in handling complex scenarios and dramatic target appearance variations, we develop a residual branch learned from samples of exact target states and online distractors under the guidance of the identity branch. These two branches, representing arbitrary targets with both class-level and sample-level features, achieve accurate target localization under complicated scenarios. In addition, we propose an adaptive KL-based scheme for updating the residual branch effectively while avoiding overfitting to a long-retained target appearance. Extensive experimental results on OTB-2013, OTB-2015, VOT2016, VOT-2018, VOT-2019, Temple-Color-128, and LaSOT show that the proposed method performs favorably against state-of-the-art trackers.
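One way to read the two-branch design is as a weighted combination of the class-level and sample-level response maps, with the KL-based scheme gating the residual-branch update when the two branches disagree too strongly. The sketch below follows that reading; the fusion weight, threshold, and gating rule are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def norm_response(r, eps=1e-12):
    """Shift and normalize a response map into a distribution."""
    r = np.asarray(r, dtype=float).ravel()
    r = r - r.min() + eps
    return r / r.sum()

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def fuse_and_maybe_update(identity_resp, residual_resp, alpha=0.5, tau=0.5):
    """Combine class-level (identity) and sample-level (residual)
    responses, and gate the residual-branch update by the KL divergence
    between the normalized maps: large disagreement suggests the
    residual model is drifting, so the update is skipped. alpha and tau
    are illustrative hyper-parameters."""
    fused = alpha * np.asarray(identity_resp, dtype=float) \
        + (1.0 - alpha) * np.asarray(residual_resp, dtype=float)
    div = kl_div(norm_response(identity_resp), norm_response(residual_resp))
    return fused, div < tau
```

When the two branches produce the same response, the divergence is zero and the update proceeds; strongly conflicting responses suppress it.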
Visual object tracking is a significant computer vision task which can be applied to many domains, such as visual surveillance, human computer interaction, and video compression. Despite extensive research on this topic, it still suffers from difficulties in handling complex object appearance changes caused by factors such as illumination variation, partial occlusion, shape deformation, and camera motion. Therefore, effective modeling of the 2D appearance of tracked objects is a key issue for the success of a visual tracker. In the literature, researchers have proposed a variety of 2D appearance models. To help readers swiftly learn the recent advances in 2D appearance models for visual object tracking, we contribute this survey, which provides a detailed review of the existing 2D appearance models. In particular, this survey takes a module-based architecture that enables readers to easily grasp the key points of visual object tracking. In this survey, we first decompose the problem of appearance modeling into two different processing stages: visual representation and statistical modeling. Then, different 2D appearance models are categorized and discussed with respect to their composition modules. Finally, we address several issues of interest as well as the remaining challenges for future research on this topic. The contributions of this survey are fourfold. First, we review the literature of visual representations according to their feature-construction mechanisms (i.e., local and global). Second, the existing statistical modeling schemes for tracking-by-detection are reviewed according to their model-construction mechanisms: generative, discriminative, and hybrid generative-discriminative. Third, each type of visual representations or statistical modeling techniques is analyzed and discussed from a theoretical or practical viewpoint. Fourth, the existing benchmark resources (e.g., source codes and video datasets) are examined in this survey.
The aim of this work is to introduce a novel visual object tracking model based on a Siamese network and a vision transformer. Tracking is performed by multiple tokens exploiting the learning and memorization capabilities of vision transformers. The tracking problem is therefore divided into multiple sub-tasks, and experiments use multiple tokens to learn each individual sub-task. This makes it possible to learn a robust characterization of the problem with an explainable architecture, revealing the motivation behind the choices the neural network makes. This is due to the attention mechanism in the transformer, whose token representations make it straightforward, compared with other architectures and methodologies, to identify where the model's interest is focused. Several experiments are performed on benchmark data, showing the tracker to be among the best performing compared with the state of the art in explainability, precision, robustness and speed.
•Proxy token guided transformer-based baseline for vision-language tracking.
•Densely annotated long-term vision-language tracking dataset.
•Extensive experiments on a new long-term vision-language tracking dataset.
Tracking by vision-language is an emergent topic. Previous researchers mainly adopt CNNs and sequential models for video and language encoding; however, their methods are limited by poor generalization performance. To address this problem, this paper presents a novel vision-language tracking framework based on the Transformer. Specifically, our proposed framework contains an image encoder, a language encoder, a cross-modal fusion module, and task-specific heads. We adopt a residual network and BERT for image and language embedding, respectively. More importantly, we propose a proxy token guided cross-modal fusion module based on the transformer network, which links the vision and language features effectively and efficiently. The proxy token acts as a proxy for the word embeddings and interacts with the visual feature. By absorbing vision information, the proxy token is used to modulate the word embeddings and make them attend to the visual feature. Finally, we obtain the organically fused features via a dynamic modal aggregation method and feed them into the task-specific heads for tracking. Extensive experiments demonstrate that our method sets a new state-of-the-art on multiple language-assisted tracking datasets, including OTB-LANG, LaSOT, TNL2K, and a newly proposed Ref-LTB50 annotated with dense language specifications. Source code of this paper will be publicly available.
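The two-step role of the proxy token described above (absorb vision information, then modulate the word embeddings before they attend to the visual feature) can be sketched with plain scaled dot-product attention. This is a schematic reading of the description, not the paper's architecture; the multiplicative modulation and all shapes are illustrative assumptions:

```python
import numpy as np

def attn(Q, K, V):
    """Scaled dot-product attention (single head, no projections)."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def proxy_token_fusion(word_emb, vis_feat, proxy):
    """Proxy-guided cross-modal fusion. Shapes: word embeddings (L, d),
    visual tokens (N, d), proxy token (1, d).
    Step 1: the proxy absorbs vision information by attending over the
    visual tokens. Step 2: the vision-aware proxy modulates the word
    embeddings, which then attend to the visual feature."""
    proxy_v = attn(proxy, vis_feat, vis_feat)   # proxy absorbs vision
    words_mod = word_emb * (1.0 + proxy_v)      # modulate word embeddings
    return attn(words_mod, vis_feat, vis_feat)  # words attend to vision
```

The proxy thus acts as a single bottleneck token carrying visual context into the language stream before the final cross-attention, rather than letting every word attend to raw visual tokens unconditioned.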