This paper presents a hunt-inspired Transformer for visual object tracking, dubbed HuntFormer. HuntFormer focuses on robust target detection and identification, simulating natural hunting processes. Specifically, it comprises two essential modules: a predictor for detection and a verifier for identification. The predictor emulates the detection stage with a motion-trajectory-guided particle filter, which identifies potential target locations by predicting the motion state within a particle filtering framework. It uses spatio-temporal correlation scores between dynamic target templates and the search region to guide the learning process and generate a set of reliable particles. This enables the base tracker to narrow its search range to focus on the target, and to swiftly re-detect the target in case of model drift. Once the target is re-detected, the verifier assesses whether the detection result is a reliable tracked target. The verifier maintains a dynamic memory that stores reliable target templates and their corresponding locations along the motion trajectory, and probabilistically models the uncertainty of the appearance information in this memory. The resulting uncertainty score determines whether the memory is updated. Ultimately, the predictor and the verifier collaborate to ensure a robust tracking outcome. Extensive evaluations on six challenging benchmark datasets demonstrate HuntFormer's favorable performance against various state-of-the-art trackers. Notably, in the VOT-LT2022 tracking challenge, HuntFormer took third place with an F-score of 0.598, narrowly behind the second-place F-score of 0.600.
• The novel tracker is inspired by hunting and contains a predictor and a verifier.
• Predictor based on trajectory-guided particle filtering to narrow the search region.
• Verifier with uncertainty modelling to reduce overconfidence issues in tracking.
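The trajectory-guided particle filtering step described above can be sketched as a standard bootstrap particle filter. The sketch below is a minimal, generic version: the guidance by spatio-temporal correlation scores is abstracted into a caller-supplied `likelihood` function, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def particle_filter_step(particles, weights, motion_noise, likelihood):
    """One predict-update-resample cycle of a bootstrap particle filter.

    particles : (N, 2) array of candidate target centres (x, y)
    weights   : (N,) normalised importance weights
    likelihood: callable mapping (N, 2) positions to per-particle scores
    """
    n = len(particles)
    # Predict: diffuse particles with a simple random-walk motion model.
    particles = particles + np.random.normal(0.0, motion_noise, particles.shape)
    # Update: reweight particles by an appearance/correlation likelihood.
    weights = weights * likelihood(particles)
    weights = weights / weights.sum()
    # Resample when the effective sample size collapses, so the particle
    # set stays concentrated on plausible target locations.
    ess = 1.0 / np.sum(weights ** 2)
    if ess < n / 2:
        idx = np.random.choice(n, size=n, p=weights)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights
```

The tracked position estimate is then the weighted mean of the particle cloud; in the paper's setting, the likelihood would come from the correlation scores between dynamic templates and the search region.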
The Transformer has shown great strength in visual object tracking due to its effective attention mechanism, but most prevailing Transformer-based trackers only explore temporal information frame by frame, thus overlooking the rich context information inherent in videos. To alleviate this problem, we propose a Transformer-based tracker that learns immediate appearance change information in videos, called IAC-tracker. The proposed tracker enhances the perception of the immediate motion state to improve single-target tracking performance. IAC-tracker contains three key components: a spatial information extractor (SIE) with a superior attention mechanism to progressively extract spatial information, a temporal information extractor (TIE) with a designed temporal attention mechanism to progressively learn immediate target appearance change, and a novel spatial–temporal context enhanced fusion module that integrates the information from SIE and TIE for the final prediction head. Comparison experiments with state-of-the-art trackers on six challenging datasets demonstrate the superior performance of IAC-tracker at real-time running speed.
• A transformer tracking framework modeling spatial–temporal features is proposed.
• A temporal information extractor is proposed to learn immediate appearance change.
• A spatial–temporal context enhanced fusion module is proposed to integrate features.
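The temporal-attention idea behind the TIE can be illustrated with plain self-attention along the time axis: the current frame's features query the recent history, so sudden appearance change shows up as a shift in the attended context. This is a generic sketch, not the paper's exact TIE design, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(frame_feats):
    """Attend the current frame's features over earlier frames.

    frame_feats : (T, d) one feature vector per frame, last row = current
    Returns the current feature concatenated with its temporal context.
    """
    q = frame_feats[-1:]                          # current frame as query
    k = v = frame_feats[:-1]                      # history as keys/values
    d = frame_feats.shape[1]
    attn = softmax(q @ k.T / np.sqrt(d))          # (1, T-1) weights
    context = attn @ v                            # history-weighted context
    return np.concatenate([q, context], axis=1)   # fuse with current frame
```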
The essence of Siamese trackers is similarity matching between a target template deep feature and a search region deep feature. With the successful application of the Transformer in the vision community, the similarity matching manner is moving from convolution matching to Transformer matching. While this transition achieves a performance boost, we find that an intuitive complementarity exists between convolution matching and Transformer matching. Therefore, employing only one of the two matchings is suboptimal for trackers, and exploiting their complementarity holds great potential. To this end, we present a Matching Knowledge Fusion (MKF) module that efficiently integrates a convolution matching and an enhanced Transformer matching to exploit this complementarity. Furthermore, to address the issue that noisy and ambiguous attention weights in Transformer matching degrade the matching results, we propose a novel mechanism that uses complementary matching knowledge to correct the attention weights. Based on the Matching Knowledge Fusion module, we build a simple but effective tracker, dubbed MKFTrack. Extensive experiments demonstrate the favorable performance of our tracker against state-of-the-art ones.
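The complementarity the abstract points to can be made concrete with a toy score-level fusion: convolution matching scores each search token against a pooled template kernel (local, translation-equivariant), while attention matching lets each search token weigh all template tokens (global, content-adaptive). This is a minimal sketch with illustrative names, not the MKF module itself, which fuses matching at the feature level and corrects attention weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv_match(template, search):
    """Correlation-style matching: score each search token against the
    average template token, as a stand-in for cross-correlation."""
    kernel = template.mean(axis=0)
    return search @ kernel

def transformer_match(template, search):
    """Attention-style matching: each search token attends over all
    template tokens and accumulates an attention-weighted similarity."""
    d = template.shape[1]
    sim = search @ template.T                     # (n, m) token similarities
    attn = softmax(sim / np.sqrt(d))              # attention weights
    return (attn * sim).sum(axis=1)

def fuse(template, search, alpha=0.5):
    """Blend the two score maps; alpha is an illustrative fusion weight."""
    return alpha * conv_match(template, search) \
        + (1 - alpha) * transformer_match(template, search)
```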
Hyperparameters are numerical pre-sets whose values are assigned prior to the commencement of a learning process. Selecting appropriate hyperparameters is often critical for achieving satisfactory performance in many vision problems, such as deep learning-based visual object tracking. However, it is often difficult to determine their optimal values, especially if they are specific to each video input. Most hyperparameter optimization algorithms tend to search a generic range and are imposed blindly on all sequences. In this paper, we propose a novel dynamical hyperparameter optimization method that adaptively optimizes hyperparameters for a given sequence using an action-prediction network leveraged on continuous deep Q-learning. Since the observation space for object tracking is significantly more complex than those in traditional control problems, existing continuous deep Q-learning algorithms cannot be directly applied. To overcome this challenge, we introduce an efficient heuristic strategy to handle the high-dimensional state space while also accelerating the convergence behavior. The proposed algorithm is applied to improve two representative trackers, a Siamese-based one and a correlation-filter-based one, to evaluate its generalizability. Their superior performances on several popular benchmarks are clearly demonstrated. Our source code is available at https://github.com/shenjianbing/dqltracking.
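The core idea of learning which hyperparameter value earns the best tracking reward can be illustrated with a drastically simplified, tabular stand-in for the paper's continuous deep Q-learning: discretize one hyperparameter, pick values epsilon-greedily, and update value estimates from observed rewards. All names and the reward function are illustrative assumptions.

```python
import numpy as np

def adapt_hyperparameter(reward_fn, candidates, episodes=200,
                         lr=0.2, eps=0.3, seed=0):
    """Epsilon-greedy value learning over discretized candidates.

    reward_fn : callable, maps a hyperparameter value to a scalar
                tracking-quality proxy (illustrative stand-in).
    candidates: list of discrete hyperparameter values to choose from.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros(len(candidates))          # value estimate per candidate
    for _ in range(episodes):
        # Explore a random candidate with probability eps, else exploit.
        if rng.random() < eps:
            a = int(rng.integers(len(candidates)))
        else:
            a = int(q.argmax())
        # Incremental update toward the observed reward.
        q[a] += lr * (reward_fn(candidates[a]) - q[a])
    return candidates[int(q.argmax())]
```

The actual method instead predicts continuous actions from a high-dimensional tracking state with a deep network, which is what makes the heuristic state-handling strategy necessary.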
Distilled Siamese Networks for Visual Tracking. Shen, Jianbing; Liu, Yuanpei; Dong, Xingping ...
IEEE Transactions on Pattern Analysis and Machine Intelligence, 12/2022, Volume 44, Issue 12. Journal article, peer reviewed, open access.
In recent years, Siamese network based trackers have significantly advanced the state-of-the-art in real-time tracking. Despite their success, Siamese trackers tend to suffer from high memory costs, which restrict their applicability to mobile devices with tight memory budgets. To address this issue, we propose a distilled Siamese tracking framework to learn small, fast and accurate trackers (students), which capture critical knowledge from large Siamese trackers (teachers) through a teacher-students knowledge distillation model. This model is intuitively inspired by the one-teacher-versus-multiple-students learning method typically employed in schools. In particular, our model contains a single teacher-student distillation module and a student-student knowledge sharing mechanism. The former is designed using a tracking-specific distillation strategy to transfer knowledge from a teacher to students. The latter is utilized for mutual learning between students to enable in-depth knowledge understanding. Extensive empirical evaluations on several popular Siamese trackers demonstrate the generality and effectiveness of our framework. Moreover, the results on five tracking benchmarks show that the proposed distilled trackers achieve compression rates of up to 18× and frame rates of 265 FPS, while obtaining tracking accuracy comparable to the base models.
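The teacher-student transfer described above is typically driven by a distillation objective: a KL divergence between temperature-softened teacher and student outputs. The sketch below is the generic knowledge-distillation loss, not the paper's tracking-specific strategy; names and the temperature value are illustrative.

```python
import numpy as np

def softened(logits, temperature):
    """Temperature-softened softmax over a vector of logits."""
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softened(teacher_logits, temperature)
    q = softened(student_logits, temperature)
    return float(np.sum(p * np.log(p / q))) * temperature ** 2
```

A tracking-specific variant would apply such a loss over response maps or bounding-box outputs rather than classification logits, and the student-student sharing mechanism adds a second, mutual-learning term between students.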
With efficient appearance learning models, the discriminative correlation filter (DCF) has proven very successful in recent video object tracking benchmarks and competitions. However, the existing DCF paradigm suffers from two major issues, i.e., spatial boundary effect and temporal filter degradation. To mitigate these challenges, we propose a new DCF-based tracking method. The key innovations of the proposed method include adaptive spatial feature selection and temporal consistency constraints, with which the new tracker enables joint spatial-temporal filter learning in a lower-dimensional discriminative manifold. More specifically, we apply structured spatial sparsity constraints to multi-channel filters. Consequently, the process of learning spatial filters can be approximated by lasso regularization. To encourage temporal consistency, the filter model is restricted to lie around its historical value and updated locally to preserve the global structure in the manifold. Lastly, a unified optimization framework is proposed to jointly select temporal-consistency-preserving spatial features and learn discriminative filters with the augmented Lagrangian method. Qualitative and quantitative evaluations have been conducted on a number of well-known benchmark datasets such as OTB2013, OTB50, OTB100, Temple-Colour, UAV123, and VOT2018. The experimental results demonstrate the superiority of the proposed method over state-of-the-art approaches.
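The two key ingredients above each have a simple algebraic core: lasso regularization is solved with the soft-thresholding (l1 proximal) operator, which zeroes out weak spatial filter coefficients, and temporal consistency can be imposed by shrinking the new filter toward its historical value. These are minimal, generic sketches of those two steps, not the paper's full augmented-Lagrangian solver; `mu` is an illustrative consistency weight.

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: the lasso step that sparsifies
    the spatial filter, suppressing boundary/background coefficients."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def temporally_consistent_update(w_new, w_prev, mu):
    """Closed-form minimizer of ||w - w_new||^2 + mu * ||w - w_prev||^2:
    pulls the freshly learned filter toward its historical value."""
    return (w_new + mu * w_prev) / (1.0 + mu)
```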
The end-to-end image fusion framework has achieved promising performance, with dedicated convolutional networks aggregating the multi-modal local appearance. However, long-range dependencies are directly neglected in existing CNN fusion approaches, which impedes balancing the entire image-level perception for complex-scenario fusion. In this paper, we therefore propose an infrared and visible image fusion algorithm based on the transformer module and adversarial learning. Inspired by its global interaction power, we use the transformer to learn effective global fusion relations. In particular, shallow features extracted by a CNN interact in the proposed transformer fusion module to refine the fusion relationship within the spatial scope and across channels simultaneously. Besides, adversarial learning is employed during training to improve the discrimination of the output by imposing competitive consistency with the inputs, reflecting the specific characteristics of infrared and visible images. The experimental results demonstrate the effectiveness of the proposed modules, with superior improvement over the state-of-the-art, generalising a novel transformer-and-adversarial-learning paradigm for the fusion task.
Accurate and robust visual object tracking is one of the most challenging and fundamental computer vision problems. It entails estimating the trajectory of the target in an image sequence, given only its initial location and segmentation, or its rough approximation in the form of a bounding box. Discriminative Correlation Filters (DCFs) and deep Siamese Networks (SNs) have emerged as the dominating tracking paradigms, and have led to significant progress. Following the rapid evolution of visual object tracking in the last decade, this survey presents a systematic and thorough review of more than 90 DCF and Siamese trackers, based on results in nine tracking benchmarks. First, we present the background theory of both the DCF and Siamese tracking core formulations. Then, we distinguish and comprehensively review the shared as well as specific open research challenges in both these tracking paradigms. Furthermore, we thoroughly analyze the performance of DCF and Siamese trackers on nine benchmarks, covering different experimental aspects of visual tracking: datasets, evaluation metrics, performance, and speed comparisons. We finish the survey by presenting recommendations and suggestions for distinguished open challenges based on our analysis.
The Siamese tracking framework has attracted much attention in recent years due to its scalability and efficiency. However, it is less effective at recognizing arbitrary targets undergoing various appearance variations, especially in complex scenarios with background distractors and illumination changes. To this end, we propose a Siamese Residual Network that formulates the characteristics of a specific given target for visual tracking, consisting of an identity branch and a residual branch. The identity branch consists of a generic offline-trained similarity-matching network, which distinguishes the target from the background at the class level. To complement the identity branch in handling complex scenarios and dramatic target appearance variations, we develop a residual branch learned from samples of exact target states and online distractors under the guidance of the identity branch. These two branches, representing arbitrary targets with both class-level and sample-level features, achieve accurate target localization under complicated scenarios. In addition, we propose an adaptive KL-based scheme for updating the residual branch effectively while avoiding overfitting to a long-retained target appearance. Extensive experimental results on OTB-2013, OTB-2015, VOT2016, VOT2018, VOT2019, Temple-Color-128, and LaSOT show that the proposed method performs favorably against state-of-the-art trackers.
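A KL-based update scheme of the kind mentioned above can be sketched as a divergence gate: the online model is refreshed only when the new appearance distribution stays close to the retained one, so a single corrupted frame cannot overwrite a long-retained target appearance. This is a minimal, generic sketch over appearance histograms; the threshold and all names are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def kl_gate(new_hist, model_hist, threshold=0.5, eps=1e-8):
    """Return True if the model should be updated with the new sample.

    new_hist, model_hist : unnormalised appearance histograms.
    threshold            : illustrative KL cutoff; a large divergence
                           signals a distractor or occlusion, so the
                           retained model is kept unchanged.
    """
    p = np.asarray(new_hist, dtype=float) + eps
    q = np.asarray(model_hist, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    kl = float(np.sum(p * np.log(p / q)))
    return kl < threshold
```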