Traditional feature matching methods, such as the scale-invariant feature transform (SIFT), usually use image intensity or gradient information to detect and describe feature points; however, both intensity and gradient are sensitive to nonlinear radiation distortions (NRD). To solve this problem, this paper proposes a novel feature matching algorithm that is robust to large NRD, called the radiation-variation insensitive feature transform (RIFT). RIFT makes three main contributions. First, RIFT uses phase congruency (PC) instead of image intensity for feature point detection. RIFT considers both the number and the repeatability of feature points and detects both corner points and edge points on the PC map. Second, RIFT proposes a novel maximum index map (MIM) for feature description. The MIM is constructed from the log-Gabor convolution sequence and is much more robust to NRD than the traditional gradient map. Thus, RIFT not only largely improves the stability of feature detection but also overcomes the limitation of gradient information for feature description. Third, RIFT analyses the inherent influence of rotations on the values of the MIM and achieves rotation invariance. We use six different types of multi-modal image datasets to evaluate RIFT, including optical-optical, infrared-optical, synthetic aperture radar (SAR)-optical, depth-optical, map-optical, and day-night datasets. Experimental results show that RIFT is superior to SIFT and SAR-SIFT on multi-modal images. To the best of our knowledge, RIFT is the first feature matching algorithm that achieves good performance on all the abovementioned types of multi-modal images. The source code of RIFT and the multi-modal image datasets are publicly available.
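To make the MIM idea concrete, here is a minimal single-scale sketch of building a maximum index map from a log-Gabor convolution sequence. The filter parameters (f0, sigma_f, n_orient) are illustrative assumptions, not the values used in the paper, and RIFT additionally sums amplitudes over several scales per orientation before taking the argmax.

```python
# Sketch: maximum index map (MIM) from a log-Gabor filter bank.
import numpy as np

def log_gabor_bank(rows, cols, n_orient=6, f0=0.1, sigma_f=0.55, sigma_theta=0.4):
    """Return one frequency-domain log-Gabor filter per orientation."""
    u = np.fft.fftfreq(cols)
    v = np.fft.fftfreq(rows)
    U, V = np.meshgrid(u, v)
    radius = np.sqrt(U**2 + V**2)
    radius[0, 0] = 1.0                        # avoid log(0) at the DC term
    theta = np.arctan2(-V, U)
    radial = np.exp(-(np.log(radius / f0))**2 / (2 * np.log(sigma_f)**2))
    radial[0, 0] = 0.0                        # zero DC response
    filters = []
    for o in range(n_orient):
        angle = o * np.pi / n_orient
        # wrapped angular distance to the filter orientation
        d_theta = np.arctan2(np.sin(theta - angle), np.cos(theta - angle))
        filters.append(radial * np.exp(-d_theta**2 / (2 * sigma_theta**2)))
    return filters

def maximum_index_map(image):
    """Index of the orientation with the largest log-Gabor amplitude."""
    F = np.fft.fft2(image.astype(np.float64))
    amplitudes = [np.abs(np.fft.ifft2(F * h)) for h in log_gabor_bank(*image.shape)]
    return np.argmax(np.stack(amplitudes), axis=0).astype(np.uint8)
```

Because the MIM stores a filter index rather than a gradient magnitude, it changes far less than intensity or gradient maps under nonlinear radiometric differences.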
The goal of cross-view image matching for geo-localization is to determine the location of a given ground-view image (front view) by matching it against a group of geo-tagged satellite-view images (vertical view). The rapid development of unmanned aerial vehicle (UAV) technology in recent years has provided a real viewpoint of roughly 45 degrees (oblique view) that bridges the visual gap between views. However, existing methods ignore the direct geometric correspondence between UAV and satellite views and rely only on brute-force feature matching, leading to inferior performance. In this context, we propose an end-to-end cross-view matching method that integrates a cross-view synthesis module and a geo-localization module and fully considers both the spatial correspondence of UAV-satellite views and the surrounding area information. Specifically, the cross-view synthesis module has two parts: the oblique UAV view is first converted to a vertical view by a perspective projection transformation (PPT), which brings the UAV image closer to the satellite image; then conditional generative adversarial nets (CGAN) synthesize a UAV image in a vertical-view style close to the real satellite image, trained with the converted UAV image as the input and the real satellite image as the label. The geo-localization module builds on the existing local pattern network (LPN), which explicitly considers the surrounding environment of the target building. These modules are integrated into a single architecture, called PCL, in which they mutually reinforce each other. Our method outperforms existing UAV-satellite cross-view methods, improving accuracy by about 5%.
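A minimal sketch of the PPT step follows: warping an oblique UAV view toward a vertical (satellite-like) view with a single homography. The file names and the four point correspondences are hypothetical; in practice they would come from camera geometry or calibration, and the CGAN stage would refine the warped result.

```python
# Sketch: perspective projection transformation (oblique -> vertical view).
import cv2
import numpy as np

uav = cv2.imread("uav_oblique.jpg")           # hypothetical input path
h, w = uav.shape[:2]

# Corners of the ground plane as seen obliquely (assumed coordinates)...
src = np.float32([[120, 300], [w - 80, 280], [w - 20, h - 10], [40, h - 5]])
# ...mapped to a rectangle, i.e. a vertical view of the same plane.
dst = np.float32([[0, 0], [w, 0], [w, h], [0, h]])

H = cv2.getPerspectiveTransform(src, dst)
vertical_view = cv2.warpPerspective(uav, H, (w, h))
cv2.imwrite("uav_vertical.jpg", vertical_view)
```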
Passive millimeter-wave (PMMW) imagers have become common devices for detecting concealed objects in security screening. However, current detection methods rely primarily on two-dimensional (2D) information, which is ineffective for objects whose grayscale values are similar to the human body's. Exploiting the 3D features of the objects helps tackle this challenge, and image matching is generally the first step before 3D reconstruction. However, existing matching methods struggle to produce reliable keypoints in textureless and noisy PMMW images, since they mainly focus on texture regions with distinct features, such as corners and lines. To tackle this challenge, we present a weakly supervised framework, named PMMWPoint, to learn keypoints in PMMW images. Specifically, we first pre-train a diffusion model on unlabeled PMMW images in an unsupervised manner to better model the intricate characteristics of PMMW images. We then employ a few labeled keypoints as guidance for the network to detect keypoints in textureless regions. To compensate for the sparsity of manual labeling, a self-paced keypoint augmentation strategy progressively increases the quantity and quality of ground-truth labels during training. In addition, we propose an improved contrastive loss for better descriptor learning by integrating information from the detection and description branches. Extensive experiments prove the superiority of PMMWPoint in producing dense and accurate keypoints, delivering notable improvements of +4.8% and +1.2% in homography accuracy over unsupervised baselines on the PMMW and HPatches datasets, respectively. To our knowledge, this work represents the first effort to address the problem of PMMW image matching.
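As a rough illustration of coupling detection and description in a contrastive loss, here is an InfoNCE-style stand-in under our own assumptions, not the paper's exact formulation: descriptors of corresponding keypoints are pulled together, and each pair's contribution is weighted by the detector confidences.

```python
# Sketch: detection-weighted contrastive descriptor loss (assumed form).
import torch
import torch.nn.functional as F

def detection_weighted_contrastive_loss(desc_a, desc_b, score_a, score_b, tau=0.07):
    """desc_*: (N, D) descriptors of N corresponding keypoints.
    score_*: (N,) detection confidences in [0, 1]."""
    desc_a = F.normalize(desc_a, dim=1)
    desc_b = F.normalize(desc_b, dim=1)
    logits = desc_a @ desc_b.t() / tau            # (N, N) similarity matrix
    targets = torch.arange(desc_a.size(0), device=desc_a.device)
    per_pair = F.cross_entropy(logits, targets, reduction="none")
    weights = score_a * score_b                   # couple detection & description
    return (weights * per_pair).sum() / weights.sum().clamp_min(1e-8)
```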
Severe nonlinear radiation distortion (NRD) is the bottleneck problem of multimodal image matching. Although many efforts have been made in the past few years, such as the radiation-variation insensitive feature transform (RIFT) and the histogram of orientated phase congruency (HOPC), almost all these methods are based on frequency-domain information, which suffers from high computational overhead and memory footprint. In this article, we propose a simple but very effective multimodal feature matching algorithm in the spatial domain, called the locally normalized image feature transform (LNIFT). We first propose a local normalization filter that converts original images into normalized images for feature detection and description, which largely reduces the NRD between multimodal images. We demonstrate that normalized matching pairs have a much larger correlation coefficient than the original ones. We then detect oriented FAST and rotated BRIEF (ORB) keypoints on the normalized images and use an adaptive nonmaximal suppression (ANMS) strategy to improve the distribution of keypoints. We also describe keypoints on the normalized images with a histogram of oriented gradient (HOG)-style descriptor. LNIFT achieves the same rotation invariance as ORB without any additional computational overhead. Thus, LNIFT runs in near real-time on images with 1024 × 1024 pixels (only 0.32 s with 2500 keypoints). Four multimodal image datasets with a total of 4000 matching pairs are used for comprehensive evaluations, including synthetic aperture radar (SAR)-optical, infrared-optical, and depth-optical datasets. Experimental results show that LNIFT is far superior to RIFT in terms of efficiency (0.49 s versus 47.8 s on a 1024 × 1024 image), success rate (99.9% versus 79.85%), and number of correct matches (309 versus 119). The source code and datasets will be publicly available at https://ljy-rs.github.io/web .
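The local normalization idea admits a compact sketch: subtract a local mean and divide by a local standard deviation, then run an off-the-shelf ORB detector on the result. The window size, epsilon, and file names are illustrative assumptions, and the paper pairs the detector with its own HOG-style descriptor and ANMS rather than ORB's descriptor.

```python
# Sketch: local normalization filter before ORB keypoint detection.
import cv2
import numpy as np

def local_normalize(img, ksize=11, eps=1e-6):
    img = img.astype(np.float32)
    mean = cv2.boxFilter(img, -1, (ksize, ksize))
    sq_mean = cv2.boxFilter(img * img, -1, (ksize, ksize))
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    norm = (img - mean) / (std + eps)
    # rescale to 8-bit so OpenCV detectors can consume it
    return cv2.normalize(norm, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

img1 = local_normalize(cv2.imread("sar.png", cv2.IMREAD_GRAYSCALE))
img2 = local_normalize(cv2.imread("optical.png", cv2.IMREAD_GRAYSCALE))
orb = cv2.ORB_create(nfeatures=2500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)
```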
Feature matching, which finds reliable correspondences between two or more feature sets, is a fundamental technique in many consumer device applications. In this paper, we focus on mismatch removal from putative matches for consumer devices with a wide range of usage scenarios. Existing methods are limited by the complex transformations between images, which are hard to model in many real-world tasks. We introduce a novel mismatch removal algorithm, namely Local Similarity Measurement and Multi-granularity Matching (LSM-MGM), for feature matching under different application scenarios. An effective similarity measurement between match vectors within a small neighborhood is designed, and a local similarity representation is constructed to accumulate the measurement information of the neighborhood. Finally, a neighborhood-structure-preserving matching strategy is applied at multiple scales and angles. The experimental results demonstrate that the introduced algorithm exhibits superior stability and performance compared to several state-of-the-art methods in terms of mismatch removal on image datasets from multiple scenarios.
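In the spirit of this local similarity measurement, here is a minimal neighborhood-consistency sketch: each putative match's motion vector is compared with those of its k nearest neighbors, and matches whose vectors disagree with their neighborhood are discarded. The cosine similarity measure and threshold are our own simplified assumptions, not the LSM-MGM formulation.

```python
# Sketch: mismatch removal via local motion-vector consistency.
import numpy as np
from scipy.spatial import cKDTree

def filter_by_local_similarity(pts1, pts2, k=8, thresh=0.6):
    """pts1, pts2: (N, 2) matched keypoint coordinates."""
    motion = pts2 - pts1                       # match vectors
    tree = cKDTree(pts1)
    _, nbr = tree.query(pts1, k=k + 1)         # first neighbor is the point itself
    keep = np.zeros(len(pts1), dtype=bool)
    for i in range(len(pts1)):
        v = motion[i]
        nv = motion[nbr[i, 1:]]
        # cosine similarity between the match vector and its neighbors' vectors
        num = nv @ v
        den = np.linalg.norm(nv, axis=1) * np.linalg.norm(v) + 1e-8
        keep[i] = np.mean(num / den) > thresh
    return keep
```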
As a fundamental and critical task in various visual applications, image matching identifies and then corresponds the same or similar structure/content across two or more images. Over the past decades, a growing number and diversity of methods have been proposed for image matching, particularly with the development of deep learning techniques in recent years. However, open questions remain about which method is a suitable choice for specific applications with respect to different scenarios and task requirements, and about how to design better image matching methods with superior accuracy, robustness, and efficiency. This encourages us to conduct a comprehensive and systematic review and analysis of these classical and latest techniques. Following the feature-based image matching pipeline, we first introduce feature detection, description, and matching techniques, from handcrafted methods to trainable ones, and analyze the development of these methods in theory and practice. Second, we briefly introduce several typical image matching-based applications for a comprehensive understanding of the significance of image matching. In addition, we provide a comprehensive and objective comparison of these classical and latest techniques through extensive experiments on representative datasets. Finally, we conclude with the current status of image matching technologies and deliver insightful discussions and prospects for future works. This survey can serve as a reference for (but not limited to) researchers and engineers in image matching and related fields.
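The feature-based pipeline the survey follows (detection, description, matching) can be written in a few lines; the sketch below uses SIFT and Lowe's ratio test in OpenCV, with placeholder file names.

```python
# Sketch: the classic detect-describe-match pipeline with SIFT.
import cv2

img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                       # detection + description
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)           # matching
knn = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.75 * n.distance]
```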
Matching multi-modal remote sensing images (MRSI) is a challenging task. Due to significant nonlinear radiation differences (NRD), traditional image matching methods cannot achieve satisfactory results. Current research shows that structural information yields more robust matching results than texture information (i.e., gradient features). To better exploit the structural information of images, this paper proposes an MRSI matching method using structure saliency features, called the weighted structure saliency feature (WSSF). Two strategies are investigated and integrated in WSSF to improve the matching performance. First, the scale space is constructed using pointwise shape-adaptive texture scale filtering, which better retains structure features. Second, second-order Gaussian steerable filtering, an edge confidence map, and phase features are combined to establish a structural saliency map that is much more robust to NRD than the traditional gradient map. The performance of the proposed method was evaluated on a total of 120 image pairs from two MRSI datasets and compared with state-of-the-art matching methods, including the histogram of the orientation of weighted phase (HOWP), locally normalized image feature transform (LNIFT), co-occurrence filter space matching (CoFSM), radiation-variation insensitive feature transform (RIFT), local phase sharpness orientation (LPSO), and position-scale-orientation SIFT (PSO-SIFT). Experimental results indicate that WSSF obtains satisfactory and reliable results in terms of success rate and matching accuracy. Compared with the above six methods, the matching accuracy of WSSF is improved by more than 20.275%, and the success rate is improved by over 5.833%. The source code will be publicly available at https://github.com/WGY-RS/WSSF.
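To illustrate one ingredient of the saliency map, here is a minimal sketch of a structure response from second-order Gaussian steerable filtering: the directional second derivative at angle theta is steered from three Gaussian-derivative basis responses. The sigma, number of orientations, and max-magnitude pooling are illustrative assumptions rather than WSSF's exact construction.

```python
# Sketch: steered second-order Gaussian derivative structure response.
import numpy as np
from scipy.ndimage import gaussian_filter

def steered_second_order_response(img, sigma=2.0, n_orient=6):
    img = img.astype(np.float64)
    gxx = gaussian_filter(img, sigma, order=(0, 2))   # d2/dx2 (x = columns)
    gyy = gaussian_filter(img, sigma, order=(2, 0))   # d2/dy2 (y = rows)
    gxy = gaussian_filter(img, sigma, order=(1, 1))   # d2/dxdy
    responses = []
    for o in range(n_orient):
        t = o * np.pi / n_orient
        c, s = np.cos(t), np.sin(t)
        # directional second derivative along (cos t, sin t)
        responses.append(c * c * gxx + 2 * c * s * gxy + s * s * gyy)
    # keep the strongest magnitude across orientations as a saliency cue
    return np.max(np.abs(np.stack(responses)), axis=0)
```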
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
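The minimum reprojection loss has a very compact form: instead of averaging the photometric error over source frames, take the per-pixel minimum, which handles occlusions more robustly. The sketch below uses a plain L1 photometric term; the paper combines L1 with SSIM, which we omit for brevity.

```python
# Sketch: per-pixel minimum reprojection loss over warped source frames.
import torch

def min_reprojection_loss(target, warped_sources):
    """target: (B, 3, H, W); warped_sources: list of (B, 3, H, W)
    source frames warped into the target view."""
    errors = [(warped - target).abs().mean(dim=1, keepdim=True)
              for warped in warped_sources]           # per-source L1 maps
    stacked = torch.cat(errors, dim=1)                # (B, S, H, W)
    min_error, _ = torch.min(stacked, dim=1)          # per-pixel minimum
    return min_error.mean()
```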
Because the traditional SIFT algorithm is slow at feature extraction and matching, this article proposes an improved-RANSAC image matching method based on speeded up robust features (SURF). First, image features are detected and extracted with the SURF method, and initial matching of the feature points is performed with a matcher based on the fast library for approximate nearest neighbours (FLANN). The RANSAC algorithm is then improved to increase the probability that correct matching points are sampled. Experimental results show that the improved RANSAC algorithm has high matching accuracy, good robustness, and a short running time, laying the foundation for subsequent fast image stitching.
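A minimal sketch of the described pipeline follows: SURF detection, FLANN initial matching, then homography estimation with RANSAC. SURF is patented and lives in opencv-contrib (cv2.xfeatures2d), the file names are placeholders, and OpenCV's standard RANSAC stands in for the paper's improved sampling variant.

```python
# Sketch: SURF + FLANN matching + RANSAC homography estimation.
import cv2
import numpy as np

img1 = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.jpg", cv2.IMREAD_GRAYSCALE)

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)
kp2, des2 = surf.detectAndCompute(img2, None)

# FLANN with a KD-tree index (algorithm=1) for initial matching
flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
knn = flann.knnMatch(des1, des2, k=2)
good = [m for m, n in knn if m.distance < 0.7 * n.distance]

src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # inlier mask via RANSAC
```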