Deep learning has shown remarkable success in remote sensing change detection (CD), which aims to identify semantic change regions between co-registered satellite image pairs acquired at distinct timestamps. However, existing convolutional neural network (CNN) and transformer-based frameworks often struggle to accurately segment semantic change regions. Moreover, transformer-based methods with standard self-attention suffer from quadratic computational complexity with respect to the image resolution, making them less practical for CD tasks with limited training data. To address these issues, we propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions while reducing the model size. Our ELGC-Net comprises a Siamese encoder, fusion modules, and a decoder. The focus of our design is the introduction of an Efficient Local-Global Context Aggregator (ELGCA) module within the encoder, which captures enhanced global context and local spatial information through a novel pooled-transpose (PT) attention and depthwise convolution, respectively. The PT attention employs pooling operations for robust feature extraction and minimizes computational cost with transposed attention. Extensive experiments on three challenging CD datasets demonstrate that ELGC-Net outperforms existing methods. Compared to the recent transformer-based CD approach ChangeFormer, ELGC-Net achieves a 1.4% gain in the intersection over union (IoU) metric on the LEVIR-CD dataset while significantly reducing the number of trainable parameters. Our proposed ELGC-Net sets a new state of the art on remote sensing change detection benchmarks. Finally, we also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity that is suitable for resource-constrained settings while achieving comparable performance. Our source code is publicly available at https://github.com/techmn/elgcnet.
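The efficiency claim for PT attention can be illustrated with a minimal sketch: keys and values are average-pooled to a fixed budget, and the attention map is taken over channels rather than spatial positions, so it stays C x C regardless of resolution. This is a hypothetical simplification (learned projections, multi-head structure, and normalization are omitted), not the published implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pt_attention(x, pool=4):
    """Pooled-transpose attention sketch.

    x: (N, C) flattened spatial tokens. Keys/values are average-pooled to
    cut cost, and attention is computed across channels (a C x C map), so
    the attention matrix does not grow with image resolution."""
    n, c = x.shape
    m = n // pool
    pooled = x[: m * pool].reshape(m, pool, c).mean(axis=1)  # (M, C) pooled tokens
    q, k, v = x, pooled, pooled                              # projections omitted
    channel_attn = softmax(k.T @ v / np.sqrt(m), axis=-1)    # (C, C) channel mixing
    return q @ channel_attn                                  # (N, C), linear in N
```

Because the quadratic term is in C rather than N, doubling the input resolution doubles (rather than quadruples) the attention cost in this sketch.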
Current transformer-based change detection (CD) approaches either employ a model pretrained on the large-scale ImageNet image classification dataset or rely on first pretraining on another CD dataset and then fine-tuning on the target benchmark. This strategy is driven by the fact that transformers typically require a large amount of training data to learn inductive biases, which standard CD datasets cannot provide due to their small size. We develop an end-to-end CD approach with transformers that is trained from scratch and yet achieves state-of-the-art performance on five benchmarks. Instead of using conventional self-attention, which struggles to capture inductive biases when trained from scratch, our architecture utilizes a shuffled sparse-attention operation that focuses on selected sparse informative regions to capture the inherent characteristics of the CD data. Moreover, we introduce a change-enhanced feature fusion (CEFF) module that fuses the features from input image pairs by performing a per-channel re-weighting. Our CEFF module aids in enhancing the relevant semantic changes while suppressing the noisy ones. Extensive experiments on five CD datasets reveal the merits of the proposed contributions, achieving gains as high as 1.35% in intersection over union (IoU) score compared to the best published results in the literature. The code is available at https://github.com/mustansarfiaz/ScratchFormer.
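The per-channel re-weighting idea behind CEFF can be sketched as follows: channels where the bi-temporal features disagree (likely change) are up-weighted before fusion. This is an illustrative reduction under assumed choices (absolute difference as change evidence, sigmoid weighting, additive fusion), not the paper's exact module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ceff_fuse(f1, f2):
    """Per-channel re-weighting fusion sketch.

    f1, f2: (H, W, C) features from the two acquisition times."""
    diff = np.abs(f1 - f2)                     # per-pixel change evidence
    weights = sigmoid(diff.mean(axis=(0, 1)))  # (C,) channel weights
    # emphasize channels that carry change, damp the noisy ones
    return weights * (f1 + f2)
```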
We propose to improve visual object tracking by introducing a soft-mask-based low-level feature fusion technique, further strengthened by integrating channel and spatial attention mechanisms. The proposed approach is integrated within a Siamese framework to demonstrate its effectiveness for visual object tracking. The soft mask gives more importance to the target regions than to other regions, enabling effective target feature representation and increasing discriminative power. The low-level feature fusion improves the tracker's robustness against distractors. The channel attention identifies more discriminative channels for better target representation, while the spatial attention complements the soft-mask-based approach to better localize target objects in challenging tracking scenarios. We evaluated our proposed approach over five publicly available benchmark datasets and performed extensive comparisons with 39 state-of-the-art tracking algorithms. The proposed tracker demonstrates excellent performance compared to the existing state-of-the-art trackers.
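A soft mask that emphasizes the target region over its surroundings can be sketched as a Gaussian weighting applied to low-level features before fusion. The Gaussian form and parameters here are illustrative assumptions; the tracker's actual mask generation may differ.

```python
import numpy as np

def soft_mask(h, w, center, sigma):
    """Gaussian soft mask in (0, 1], peaking at the target center."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def masked_fuse(low_feat, mask):
    """Weight low-level features so target regions dominate the fusion.

    low_feat: (H, W, C) feature map; mask: (H, W) soft mask."""
    return low_feat * mask[..., None]
```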
CNN-based trackers, especially those based on Siamese networks, have recently attracted considerable attention because of their relatively good performance and low computational cost. For many Siamese trackers, learning a generic object model from a large-scale dataset is still a challenging task. In the current study, we introduce input noise as regularization in the training data to improve the generalization of the learned model. We propose an Input-Regularized Channel Attentional Siamese (IRCA-Siam) tracker which exhibits improved generalization compared to the current state-of-the-art trackers. In particular, we exploit offline learning by introducing additive noise for input data augmentation to mitigate the overfitting problem. We propose feature fusion from noisy and clean input channels, which improves target localization. Channel attention integrated with our framework helps find more useful target features, resulting in further performance improvement. Our proposed IRCA-Siam enhances the tracker's target/background discrimination and improves fault tolerance and generalization. An extensive experimental evaluation on six benchmark datasets, including OTB2013, OTB2015, TC128, UAV123, VOT2016 and VOT2017, demonstrates superior performance of the proposed IRCA-Siam tracker compared to 30 existing state-of-the-art trackers.
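The additive-noise input regularization and noisy/clean fusion can be sketched in a few lines. The noise level and the averaging fusion are assumed values for illustration, not the paper's settings.

```python
import numpy as np

def noisy_clean_pair(x, std=0.05, seed=0):
    """Return the clean input and an additive-noise copy for augmentation.

    x: input image/array; std: assumed Gaussian noise level."""
    rng = np.random.default_rng(seed)
    noisy = x + rng.normal(0.0, std, size=x.shape)
    return x, noisy

def fuse_channels(feat_clean, feat_noisy):
    """Simple averaging fusion of features from clean and noisy inputs;
    the learned fusion in the tracker may be more elaborate."""
    return 0.5 * (feat_clean + feat_noisy)
```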
Template-based learning, particularly with Siamese networks, has recently become popular due to its balance of accuracy and speed. However, preserving tracker robustness against challenging scenarios at real-time speed is a primary concern for visual object tracking. Siamese trackers have difficulty handling continual target appearance changes because they learn limited discrimination between target and background information. This paper presents stacked channel-spatial attention within Siamese networks to improve tracker robustness without sacrificing fast tracking speed. The proposed channel attention strengthens target-specific channels by increasing their weights while reducing the importance of irrelevant channels with lower weights. Spatial attention focuses on the most informative region of the target feature map. We integrate the proposed channel and spatial attention modules to enhance tracking performance through end-to-end learning. The proposed tracking framework learns what and where to highlight in the target information for efficient tracking. Experimental results on the widely used OTB100, OTB50, VOT2016, VOT2017/18, TC-128, and UAV123 benchmarks verify that the proposed tracker achieves outstanding performance compared with state-of-the-art trackers.
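The stacked "what then where" design can be sketched with pooled-statistics attention: channel weights from globally pooled responses, followed by a spatial map from channel-pooled responses. The learned projections of the actual modules are omitted; this is a hypothetical reduction.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Re-weight channels by their global response ('what' to attend).

    feat: (H, W, C) feature map."""
    w = sigmoid(feat.mean(axis=(0, 1)))             # (C,) channel weights
    return feat * w

def spatial_attention(feat):
    """Highlight informative positions via a channel-pooled map ('where')."""
    m = sigmoid(feat.mean(axis=-1, keepdims=True))  # (H, W, 1) spatial map
    return feat * m

def stacked_attention(feat):
    # channel attention first, then spatial, matching the stacked design
    return spatial_attention(channel_attention(feat))
```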
In recent years, visual object tracking has become a very active research area, and an increasing number of tracking algorithms are being proposed each year. This is because tracking has wide applications in various real-world problems such as human-computer interaction, autonomous vehicles, robotics, surveillance, and security, to name a few. In the current study, we review the latest trends and advances in the tracking area and evaluate the robustness of different trackers based on their feature extraction methods. The first part of this work is a comprehensive survey of recently proposed trackers. We broadly categorize trackers into Correlation Filter based Trackers (CFTs) and Non-CFTs, and further classify each category into various types based on architecture and tracking mechanism. In the second part of this work, we experimentally evaluate 24 recent trackers for robustness and compare handcrafted and deep feature based trackers. We observe that trackers using deep features perform better, though in some cases a fusion of both increases performance significantly. To overcome the drawbacks of existing benchmarks, a new benchmark, Object Tracking and Temple Color (OTTC), is also proposed and used in the evaluation of different algorithms. We analyze the performance of trackers over 11 different challenges in OTTC and 3 other benchmarks. Our study concludes that Discriminative Correlation Filter (DCF) based trackers perform better than the others, and also reveals that the inclusion of different types of regularization over DCF often boosts tracking performance. Finally, we sum up our study by pointing out some insights and indicating future trends in the visual object tracking field.
Person search (PS) is a challenging computer vision problem where the objective is to achieve joint optimization of pedestrian detection and re-identification (ReID). Although previous advancements have shown promising performance in fully and weakly supervised settings, there exists a major gap in investigating the domain adaptation ability of PS models. In this paper, we propose a diligent domain adaptive mixer (DDAM) for person search (DDAM-PS) framework that aims to bridge the gap and improve knowledge transfer from the labeled source domain to the unlabeled target domain. Specifically, we introduce a novel DDAM module that generates moderate mixed-domain representations by combining source and target domain representations. The proposed DDAM module encourages domain mixing to minimize the distance between the two extreme domains, thereby enhancing the ReID task. To achieve this, we introduce two bridge losses and a disparity loss. The objective of the two bridge losses is to guide the moderate mixed-domain representations to maintain an appropriate distance from both the source and target domain representations. The disparity loss aims to prevent the moderate mixed-domain representations from being biased towards either the source or target domain, thereby avoiding overfitting. Furthermore, we address the conflict between the two subtasks, localization and ReID, during domain adaptation. To handle this cross-task conflict, we forcefully decouple the norm-aware embedding, which aids in better learning of the moderate mixed-domain representation. We conduct experiments to validate the effectiveness of our proposed method. Our approach demonstrates favorable performance on the challenging PRW and CUHK-SYSU datasets. Our source code is publicly available at \url{https://github.com/mustansarfiaz/DDAM-PS}.
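The interplay of the bridge and disparity objectives can be illustrated with simple Euclidean distances: the bridge term keeps the mixed representation between the two domains, while the disparity term penalizes any bias toward one side. This is an assumed simplification for intuition; the paper's exact loss formulations may differ.

```python
import numpy as np

def bridge_disparity_losses(src, tgt, mixed):
    """Illustrative bridge and disparity terms.

    src, tgt, mixed: (D,) source, target, and mixed-domain representations."""
    d_src = np.linalg.norm(mixed - src)   # distance to the source domain
    d_tgt = np.linalg.norm(mixed - tgt)   # distance to the target domain
    bridge = d_src + d_tgt                # keep mixed between both domains
    disparity = abs(d_src - d_tgt)        # penalize bias toward either side
    return bridge, disparity
```

With these definitions, a representation exactly midway between the two domains incurs zero disparity, matching the intuition that the mixed domain should favor neither extreme.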
Recently, transformers have been widely used in medical image segmentation to capture long-range and global dependencies using self-attention. However, they often struggle to learn local details, which limits their ability to capture the irregular shapes and sizes of tissues and the indistinct boundaries between them, both of which are critical for accurate segmentation. To alleviate this issue, we propose a network named GA2Net, which comprises an encoder, a bottleneck, and a decoder. The encoder computes multi-scale features. In the bottleneck, we propose a hierarchical-gated feature aggregation (HGFA) module that introduces a novel spatial gating mechanism to enrich the multi-scale features. To effectively learn the shapes and sizes of tissues, we apply deep supervision in the bottleneck. GA2Net uses adaptive aggregation (AA) within the decoder to adjust the receptive field for each location in the feature map, replacing the traditional concatenation/summation operations in the skip connections of U-Net-like architectures. Furthermore, we propose mask-guided feature attention (MGFA) modules within the decoder that learn salient features using foreground priors to adequately grasp the intricate structural and contour information of tissues. We also apply intermediate supervision at each stage of the decoder, which further improves the model's ability to locate tissue boundaries. Our extensive experimental results illustrate that GA2Net significantly outperforms existing state-of-the-art methods on eight medical image segmentation datasets: five polyp, one skin lesion, one multiple myeloma cell segmentation, and one cardiac MRI dataset. We then perform an extensive ablation study to validate the capabilities of our method. Code is available at https://github.com/mustansarfiaz/ga2net.
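The spatial gating used for multi-scale aggregation can be sketched as a learned-free gate map that blends coarse and fine features per position. This is a hypothetical reduction of the HGFA idea (the learned gating layers and hierarchy are omitted), intended only to show the mechanism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_gate(coarse, fine):
    """Gated blending of two adjacent scales.

    coarse, fine: (H, W, C) features, with coarse already upsampled
    to the fine resolution."""
    gate = sigmoid(coarse.mean(axis=-1, keepdims=True))  # (H, W, 1) gate map
    return gate * fine + (1.0 - gate) * coarse           # per-position mix
```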
•We propose GA2Net to capture complex shapes of the tissues for better segmentation.
•Our HGFA enhances the most relevant feature information for pixel-precise segmentation using deep supervision.
•Adaptive aggregation adjusts the receptive fields for each stage's features.
•Our MGFA modules in the decoder are crucial to obtaining accurate boundaries of the tissue objects.
•Experiments on eight medical segmentation benchmarks demonstrate the merits of our contributions.
Existing deep learning approaches overlook semantic cues that are crucial for semantic segmentation in complex scenarios, including cluttered backgrounds and translucent objects. To handle these challenges, we propose a feature amplification network (FANet) as a backbone network that incorporates semantic information using a novel feature enhancement module at multiple stages. To achieve this, we propose an adaptive feature enhancement (AFE) block that benefits from both a spatial context module (SCM) and a feature refinement module (FRM) in a parallel fashion. The SCM exploits larger kernels to increase the receptive field and handle scale variations in the scene, whereas our novel FRM generates semantic cues that capture both low-frequency and high-frequency regions for better segmentation. We perform experiments on the challenging real-world ZeroWaste-f dataset, which contains background clutter and translucent objects. Our experimental results demonstrate state-of-the-art performance compared to existing methods.
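The low-/high-frequency split behind the FRM can be illustrated by decomposing a feature map into a smoothed component and its residual, then re-weighting the two. The box blur and the alpha/beta weights are assumptions for illustration; the actual module is learned.

```python
import numpy as np

def box_blur(x, k=3):
    """Box-filter smoothing of an (H, W, C) feature map (edge padding)."""
    r = k // 2
    p = np.pad(x, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def freq_refine(feat, alpha=1.0, beta=1.0):
    """Split features into low- and high-frequency parts and re-weight.

    alpha/beta are hypothetical scalar weights; a learned module would
    predict them per channel or per position."""
    low = feat if False else box_blur(feat)  # smooth regions
    high = feat - box_blur(feat)             # edges and fine detail
    return alpha * low + beta * high
```

With alpha = beta = 1 the decomposition is exact (low + high reconstructs the input); changing the weights shifts emphasis between smooth regions and boundaries.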