The past decade has witnessed significant progress in detecting objects in aerial images, which are often distributed with large scale variations and arbitrary orientations. However, most existing methods rely on heuristically defined anchors with different scales, angles, and aspect ratios, and usually suffer from severe misalignment between anchor boxes (ABs) and axis-aligned convolutional features, which leads to a common inconsistency between classification score and localization accuracy. To address this issue, we propose a single-shot alignment network (S²A-Net) consisting of two modules: a feature alignment module (FAM) and an oriented detection module (ODM). The FAM generates high-quality anchors with an anchor refinement network and adaptively aligns the convolutional features according to the ABs with a novel alignment convolution. The ODM first adopts active rotating filters to encode orientation information and then produces orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy. In addition, we further explore an approach to detecting objects in large-size images, which leads to a better trade-off between speed and accuracy. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two commonly used aerial object datasets (i.e., DOTA and HRSC2016) while keeping high efficiency.
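The central idea of the alignment convolution is to sample features inside the (rotated) anchor box rather than on the regular axis-aligned grid. A minimal sketch of that sampling geometry, assuming a single oriented anchor `(cx, cy, w, h, theta)` and a k×k kernel (the function name and interface are illustrative, not the paper's API):

```python
import math

def anchor_sampling_grid(cx, cy, w, h, theta, k=3):
    """Hypothetical sketch: compute the k x k sampling locations of an
    alignment convolution for one oriented anchor (cx, cy, w, h, theta).
    The regular conv grid is rescaled to the anchor's size and rotated by
    its angle, so features are sampled inside the rotated anchor box."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    grid = []
    for i in range(k):
        for j in range(k):
            # regular grid offsets, scaled by the anchor's width and height
            dx = (j - (k - 1) / 2) * w / k
            dy = (i - (k - 1) / 2) * h / k
            # rotate the offsets by the anchor angle, shift to the center
            x = cx + dx * cos_t - dy * sin_t
            y = cy + dx * sin_t + dy * cos_t
            grid.append((x, y))
    return grid
```

For an axis-aligned anchor (`theta=0`) this degenerates to a grid stretched to the anchor's aspect ratio, which is exactly the misalignment-free sampling the FAM aims for.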
One fundamental problem in Earth Vision is to accurately locate and identify the categories of the objects of interest in aerial images, for which oriented bounding boxes (OBBs) are usually employed to better depict objects appearing with arbitrary orientations. However, the regression of OBBs always suffers from ambiguity in the definition of the regression targets, which often reduces convergence efficiency and decreases detection accuracy. Although some methods, such as the binary segmentation map, can handle this problem, they introduce a new problem of ambiguous background pixels inside the OBBs. In this article, we propose to cast OBB regression as a center-probability-map (CenterMap) prediction problem, largely eliminating the ambiguities in both the target definitions and the background pixels. The predicted CenterMaps are then used to generate the OBBs. The CenterMap OBB representation is simple, yet effective. Furthermore, to better distinguish the objects of interest from the cluttered background, a weighted pseudosegmentation-guided attention network is adopted to provide object-level features for predicting the horizontal bounding boxes and the OBBs. The experimental results on three widely used data sets, i.e., DOTA, HRSC2016, and UCAS-AOD, demonstrate the effectiveness of our proposed method.
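The appeal of a center-probability map is that it replaces hard, ambiguous foreground/background labels with a soft score that peaks at the object center. A minimal sketch, assuming a Gaussian-shaped map (the exact probability model used in the paper may differ):

```python
import math

def center_map(h, w, cx, cy, sigma):
    """Hypothetical sketch of a center-probability map: each pixel gets a
    Gaussian score peaking at the object center (cx, cy) and decaying
    with distance, avoiding hard foreground/background label ambiguity."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(w)] for y in range(h)]
```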
In contrast with natural scenes, aerial scenes are often composed of many objects densely distributed on the ground in a bird's-eye view, the description of which usually demands more discriminative features as well as local semantics. However, when applied to scene classification, most existing convolutional neural networks (ConvNets) tend to depict the global semantics of images, and the loss of low- and mid-level features can hardly be avoided, especially as the model goes deeper. To tackle these challenges, in this paper we propose a multiple-instance densely-connected ConvNet (MIDC-Net) for aerial scene classification. It regards aerial scene classification as a multiple-instance learning problem so that local semantics can be further investigated. Our classification model consists of an instance-level classifier and a multiple-instance pooling layer, followed by a bag-level classification layer. In the instance-level classifier, we propose a simplified dense connection structure to effectively preserve features from different levels. The extracted convolutional features are further converted into instance feature vectors. Then, we propose a trainable attention-based multiple-instance pooling. It highlights the local semantics relevant to the scene label and outputs the bag-level probability directly. Finally, with our bag-level classification layer, this multiple-instance learning framework is under the direct supervision of bag labels. Experiments on three widely-used aerial scene benchmarks demonstrate that our proposed method outperforms many state-of-the-art methods by a large margin with far fewer parameters.
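Attention-based multiple-instance pooling can be sketched as a softmax-weighted sum of instance features, where the attention weights are learned. The following is a minimal illustration, assuming a simple dot-product attention with weight vector `attn_w` (the names and the exact attention form are assumptions, not the paper's definition):

```python
import math

def attention_mil_pool(instance_feats, attn_w):
    """Hypothetical sketch of attention-based multiple-instance pooling:
    each instance gets an attention score; the bag representation is the
    softmax-weighted sum of the instance feature vectors."""
    # attention logits: dot product of each instance with the weight vector
    logits = [sum(f * w for f, w in zip(feat, attn_w)) for feat in instance_feats]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    alphas = [e / z for e in exps]  # softmax attention over instances
    dim = len(instance_feats[0])
    bag = [sum(a * feat[d] for a, feat in zip(alphas, instance_feats))
           for d in range(dim)]
    return bag, alphas
```

Because the attention weights sum to one, instances irrelevant to the scene label can be suppressed while the bag-level prediction remains directly supervised by the bag label.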
As a new method of Earth observation, video satellites are capable of continuously monitoring specific events on the Earth's surface by providing high-temporal-resolution remote sensing images. These video observations enable a variety of new satellite applications such as object tracking and road traffic monitoring. In this article, we address the problem of fast object tracking in satellite videos by developing a novel tracking algorithm based on correlation filters embedded with motion estimation. Building on the kernelized correlation filter (KCF), the proposed algorithm provides the following improvements: 1) it proposes a novel motion estimation (ME) algorithm combining the Kalman filter and motion trajectory averaging, and uses it to mitigate the boundary effects of the KCF; and 2) it solves the problem of tracking failure when a moving object is partially or completely occluded. The experimental results demonstrate that our algorithm can track moving objects in satellite videos with 95% accuracy.
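The two ingredients of the motion estimation, a Kalman filter and trajectory averaging, can be sketched per coordinate as follows. This is a minimal constant-velocity illustration under assumed noise parameters `q` and `r`; the paper's actual state model and fusion rule may differ:

```python
def kalman_1d_step(x, v, p, z, q=1e-3, r=1e-1):
    """Hypothetical sketch of one constant-velocity Kalman step for one
    coordinate of the target position: predict with velocity v, then
    correct with the measured position z (e.g., the KCF response peak)."""
    # predict
    x_pred = x + v
    p_pred = p + q
    # update with measurement z
    k = p_pred / (p_pred + r)      # Kalman gain
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new

def trajectory_velocity(positions, n=3):
    """Average velocity over the last n displacements of the trajectory;
    usable as the predicted motion when the KCF response is unreliable
    (e.g., under occlusion)."""
    steps = [b - a for a, b in zip(positions[-n - 1:-1], positions[-n:])]
    return sum(steps) / len(steps)
```

When the target is occluded, the measurement `z` can be skipped and the trajectory-averaged velocity used to propagate the state, which is the intuition behind improvement 2).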
Urban water is important for the urban ecosystem. Accurate and efficient detection of urban water with remote sensing data is of great significance for urban management and planning. In this article, we propose a new method that combines Google Earth Engine (GEE) with a multiscale convolutional neural network (MSCNN) to extract urban water from Landsat images, which can be summarized as "offline training and online prediction" (OTOP). That is, the training of the MSCNN is completed offline, and urban water extraction is implemented on GEE with the trained parameters of the MSCNN. OTOP exploits the respective advantages of GEE and the convolutional neural network (CNN), and makes the use of deep learning methods in GEE more flexible. The proposed method can process the available satellite images with high performance, without data download and storage, and its overall urban water extraction performance in the test areas is also higher than that of the modified normalized difference water index (MNDWI) and a random forest classifier. The results of the extended validation in other major cities of China also showed that OTOP is robust and can be used to extract different types of urban water, which benefits from the structural design and training of the MSCNN. Therefore, OTOP is especially suitable for the study of large-scale and long-term urban water change detection against the background of urbanization.
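For context, the MNDWI baseline mentioned above is a simple band ratio, (Green − SWIR) / (Green + SWIR), thresholded to separate water from non-water. A minimal per-pixel sketch:

```python
def mndwi(green, swir):
    """Modified normalized difference water index (MNDWI), the baseline
    the method is compared against: (Green - SWIR) / (Green + SWIR).
    Values above a threshold (commonly around 0) indicate water."""
    if green + swir == 0:
        return 0.0
    return (green - swir) / (green + swir)
```

Water strongly reflects green and absorbs shortwave infrared, so water pixels yield positive MNDWI; the learned MSCNN replaces this fixed spectral rule with multiscale spatial-spectral features.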
In recent years, large amounts of high-spatial-resolution remote sensing (HRRS) images have become available for land-cover mapping. However, due to the complex information brought by the increased spatial resolution and the data disturbances caused by different image acquisition conditions, it is often difficult to find an efficient method for achieving accurate land-cover classification with high-resolution and heterogeneous remote sensing images. In this paper, we propose a scheme to apply a deep model trained on a labeled land-cover dataset to classify unlabeled HRRS images. The main idea is to rely on deep neural networks to represent the contextual information contained in different types of land cover, and to propose a pseudo-labeling and sample selection scheme for improving the transferability of deep models. More precisely, a deep convolutional neural network (CNN) is first pre-trained with a well-annotated land-cover dataset, referred to as the source data. Then, given a target image with no labels, the pre-trained CNN model is utilized to classify the image in a patch-wise manner. The patches with high confidence are assigned pseudo-labels and employed as queries to retrieve related samples from the source data. The pseudo-labels confirmed by the retrieved results are regarded as supervised information for fine-tuning the pre-trained deep model. To obtain a pixel-wise land-cover classification of the target image, we rely on the fine-tuned CNN and develop a hybrid classification by combining patch-wise classification and hierarchical segmentation. In addition, we create a large-scale land-cover dataset containing 150 Gaofen-2 satellite images for CNN pre-training. Experiments on multi-source HRRS images, including Gaofen-2, Gaofen-1, Jilin-1, Ziyuan-3, Sentinel-2A, and Google Earth platform data, show encouraging results and demonstrate the applicability of the proposed scheme to land-cover classification with multi-source HRRS images.
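The confidence-based pseudo-labeling step can be sketched as follows: keep only patches whose maximum class probability clears a threshold, and take the argmax class as the pseudo-label. The function and threshold value are illustrative assumptions, not the paper's exact selection rule (which additionally confirms pseudo-labels by retrieval against the source data):

```python
def select_pseudo_labels(patch_probs, threshold=0.9):
    """Hypothetical sketch of confidence-based pseudo-labeling: keep only
    patches whose maximum class probability exceeds a threshold, with the
    argmax class as the pseudo-label."""
    selected = []
    for idx, probs in enumerate(patch_probs):
        conf = max(probs)
        if conf >= threshold:
            label = probs.index(conf)
            selected.append((idx, label, conf))
    return selected
```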
• A method to learn a transferable deep model for 5-class land-cover (LC) classification.
• A labeled dataset consisting of 150 Gaofen-2 images for LC classification.
• It improves LC classification performance by about 20% using multi-source RS images.
• The method shows good transferability across different sensors and geolocations.
The past years have witnessed great progress in remote sensing (RS) image interpretation and its wide applications. With RS images becoming more accessible than ever before, there is an increasing demand for the automatic interpretation of these images. In this context, benchmark datasets serve as an essential prerequisite for developing and testing intelligent interpretation algorithms. After reviewing existing benchmark datasets in the RS image interpretation research community, this article discusses the problem of how to efficiently prepare a suitable benchmark dataset for RS image interpretation. Specifically, we first analyze the current challenges of developing intelligent algorithms for RS image interpretation with bibliometric investigations. We then present general guidance on creating benchmark datasets in an efficient manner. Following the presented guidance, we also provide an example of building an RS image dataset, i.e., the Million Aerial Image Dataset (Online. Available: https://captain-whu.github.io/DiRS/ ), a new large-scale benchmark dataset containing a million instances for RS image scene classification. Several challenges and perspectives in RS image annotation are finally discussed to facilitate research in benchmark dataset construction. We hope this article will provide the RS community with an overall perspective on constructing large-scale and practical image datasets for further research, especially data-driven research.
Unsupervised pre-training aims at learning transferable features that are beneficial for downstream tasks. However, most state-of-the-art unsupervised methods concentrate on learning global representations for image-level classification tasks instead of discriminative local region representations, which limits their transferability to region-level downstream tasks such as object detection. To improve the transferability of pre-trained features to object detection, we present Deeply Unsupervised Patch Re-ID (DUPR), a simple yet effective method for unsupervised visual representation learning. The patch Re-ID task treats each individual patch as a pseudo-identity and contrastively learns its correspondence across two views, enabling us to obtain discriminative local features for object detection. The proposed patch Re-ID is then performed in a deeply unsupervised manner, which suits object detection, a task that usually requires multi-level feature maps. Extensive experiments demonstrate that DUPR outperforms state-of-the-art unsupervised pre-training methods and even ImageNet supervised pre-training on various downstream tasks related to object detection.
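Contrastively learning patch correspondence typically reduces to an InfoNCE-style objective: a patch embedding from one view should score highest against its counterpart from the other view among a set of candidates. A minimal sketch, assuming dot-product similarity and a temperature `tau` (illustrative, not DUPR's exact loss):

```python
import math

def info_nce(query, keys, pos_idx, tau=0.1):
    """Hypothetical sketch of a contrastive (InfoNCE) objective for patch
    Re-ID: the query patch embedding should match its counterpart
    keys[pos_idx] from the other view against the remaining negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    logits = [dot(query, k) / tau for k in keys]
    m = max(logits)  # numerical stability
    z = sum(math.exp(l - m) for l in logits)
    # negative log-probability of the positive key
    return -(logits[pos_idx] - m - math.log(z))
```

The loss is lower when the positive key is the most similar candidate, which is the "pseudo-identity" matching the abstract describes; applying it at multiple feature levels gives the "deeply unsupervised" variant.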
Extracting building footprints from aerial images is essential for precise urban mapping with photogrammetric computer vision technologies. Existing approaches mainly assume that the roof and footprint of a building are well overlapped, which may not hold in off-nadir aerial images, where there is often a large offset between them. In this paper, we propose an offset vector learning scheme, which turns the building footprint extraction problem in off-nadir images into an instance-level joint prediction problem of the building roof and its corresponding "roof-to-footprint" offset vector. The footprint can thus be estimated by translating the predicted roof mask according to the predicted offset vector. We further propose a simple but effective feature-level offset augmentation module, which can significantly refine the offset vector prediction at little extra cost. Moreover, a new dataset, Buildings in Off-Nadir Aerial Images (BONAI), is created and released in this paper. It contains 268,958 building instances across 3,300 aerial images with fully annotated instance-level roof, footprint, and corresponding offset vector for each building. Experiments on the BONAI dataset demonstrate that our method achieves state-of-the-art results, outperforming other competitors by 3.37 to 7.39 points in F1-score. The code, datasets, and trained models are available at https://github.com/jwwangchn/BONAI.git .
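The core geometric idea, translating the predicted roof by the predicted offset to obtain the footprint, is simple enough to state directly. A minimal sketch with the roof represented as a polygon (the paper predicts masks; the polygon form here is for illustration):

```python
def footprint_from_roof(roof_polygon, offset):
    """Minimal sketch of the roof-to-footprint translation: the footprint
    is the predicted roof geometry shifted by the predicted offset vector
    (dx, dy), which accounts for the off-nadir viewing displacement."""
    dx, dy = offset
    return [(x + dx, y + dy) for x, y in roof_polygon]
```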
Matching local features between two overlapping images is a fundamental task in photogrammetry and remote sensing. However, images acquired by multiple sensors often differ substantially in their properties, posing a great challenge to the robustness and flexibility of feature matching methods. In this article, we propose a locally nonlinear affine verification (LAV) method for robust multisensor image matching. The main idea of LAV is a nonlinear regression formulation that practically models the nonlinear deviation of a real surface around a point from its tangent plane during affine verification. Specifically, we start by selecting a restricted set of reliable and well-distributed putative matches as matching seeds and assign them neighbors to construct search spaces. In each search space, the regression seeks the smoothest affine model consistent with the latent correct matches, thereby deriving a set of affine parameters to verify correspondence hypotheses for true matches. The verification can be extended to all nearest-neighbor matches to discover additional inlier matches. Evaluation on multisensor image datasets with different extents of variation in viewpoint, scale, illumination, and appearance shows that the proposed LAV consistently outperforms existing methods. LAV can achieve a considerable number of high-quality matches in cases where existing methods provide few or no correct matches.
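The verification step itself can be sketched as follows: map each left-image point through an estimated local affine model and accept the match if the reprojection error is small. This is a minimal illustration of affine verification in general, not LAV's nonlinear regression; the function names and tolerance are assumptions:

```python
def affine_inliers(matches, A, t, tol=2.0):
    """Hypothetical sketch of the affine verification step: map each
    left-image point through a local affine model (2x2 matrix A plus
    translation t) and accept the match if the reprojection error to the
    right-image point is within tol pixels."""
    inliers = []
    for (x, y), (u, v) in matches:
        px = A[0][0] * x + A[0][1] * y + t[0]
        py = A[1][0] * x + A[1][1] * y + t[1]
        if ((px - u) ** 2 + (py - v) ** 2) ** 0.5 <= tol:
            inliers.append(((x, y), (u, v)))
    return inliers
```

LAV's contribution is that the affine parameters are not fixed per region but regressed nonlinearly around each seed, so the tolerance can stay tight even on curved surfaces.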