Hyperspectral image (HSI) classification refers to identifying the land-cover categories of pixels based on the spectral signatures and spatial information of HSIs. In recent deep learning-based methods, an HSI patch is usually cropped from the original HSI as the input so as to exploit spatial information, and 3 × 3 convolution is utilized as a key component to capture spatial features for HSI classification. However, 3 × 3 convolution is sensitive to the spatial rotation of its inputs, which causes recent methods to perform worse on rotated HSIs. To alleviate this problem, a rotation-invariant attention network (RIAN) is proposed for HSI classification. First, a center spectral attention (CSpeA) module is designed to suppress redundant spectral bands while avoiding the influence of pixels from other categories. Then, a rectified spatial attention (RSpaA) module is proposed to replace 3 × 3 convolution for extracting rotation-invariant spectral-spatial features from HSI patches. The CSpeA module, 1 × 1 convolution, and the RSpaA module are combined to build the proposed RIAN for HSI classification. Experimental results demonstrate that RIAN is invariant to the spatial rotation of HSIs and has superior performance, e.g., achieving an overall accuracy of 86.53% (a 1.04% improvement) on the Houston database. The code of this work is available at https://github.com/spectralpublic/RIAN .
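As a rough illustration of the idea behind CSpeA, the PyTorch sketch below derives channel attention solely from the center pixel's spectrum, so that neighboring pixels of other categories cannot bias the band re-weighting. The module, parameter names, and sizes are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CenterSpectralAttention(nn.Module):
    """Re-weight spectral bands using only the center pixel of the patch,
    so pixels of other categories in the neighborhood cannot distort it."""
    def __init__(self, bands, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(bands, bands // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(bands // reduction, bands),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W) HSI patch
        b, c, h, w = x.shape
        center = x[:, :, h // 2, w // 2]       # (B, C) spectrum of the center pixel
        weights = self.fc(center).view(b, c, 1, 1)
        return x * weights                     # suppress redundant bands

patch = torch.randn(2, 103, 9, 9)              # e.g., a 103-band 9x9 patch
print(CenterSpectralAttention(103)(patch).shape)   # torch.Size([2, 103, 9, 9])
```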
In recent years, with the rapid development of deep learning, super-resolution methods based on convolutional neural networks (CNNs) have made great progress. However, the number of parameters and the computational resources these methods require keep increasing, to the point that such methods are difficult to deploy on devices with low computing power. To address this issue, we propose a lightweight single image super-resolution network with an expectation-maximization attention mechanism (EMASRN) for a better balance between performance and applicability. Specifically, a progressive multi-scale feature extraction block (PMSFE) is proposed to extract feature maps of different sizes. Furthermore, we propose an HR-size expectation-maximization attention block (HREMAB) that directly captures the long-range dependencies of HR-size feature maps. We also utilize a feedback network to feed the high-level features of each generation into the next generation's shallow network. Compared with existing lightweight single image super-resolution (SISR) methods, our EMASRN reduces the number of parameters by almost one-third. The experimental results demonstrate the superiority of our EMASRN over state-of-the-art lightweight SISR methods in terms of both quantitative metrics and visual quality. The source code can be downloaded at https://github.com/xyzhu1/EMASRN .
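For readers unfamiliar with expectation-maximization attention, the sketch below alternates an E-step (softmax responsibilities of pixels for a small set of bases) and an M-step (re-estimating the bases as responsibility-weighted means), then re-projects features onto the compact base set. This is a generic EM-attention pass in the spirit of what the abstract names; initialization, normalization details, and all sizes are simplified assumptions.

```python
import torch
import torch.nn.functional as F

def em_attention(x, bases, iters=3):
    """x: (B, N, C) flattened feature map; bases: (B, K, C) initial bases."""
    for _ in range(iters):
        # E-step: responsibility of each pixel for each base
        att = F.softmax(x @ bases.transpose(1, 2), dim=-1)   # (B, N, K)
        # M-step: re-estimate each base as a responsibility-weighted mean
        den = att.sum(dim=1).unsqueeze(-1) + 1e-6            # (B, K, 1)
        bases = (att.transpose(1, 2) @ x) / den              # (B, K, C)
    # Re-project pixels onto the compact base set (low-rank attention)
    return att @ bases                                       # (B, N, C)

feat = torch.randn(1, 64 * 64, 32)                 # HR-size map flattened to N = 4096
mu = F.normalize(torch.randn(1, 16, 32), dim=-1)   # K = 16 bases
print(em_attention(feat, mu).shape)                # torch.Size([1, 4096, 32])
```

Because attention is computed against K bases instead of all N pixels, the cost grows linearly in N, which is what makes such attention affordable on HR-size feature maps.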
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in a natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between the visual and linguistic modalities, but usually fail to exploit the informative words of the expression to align features from the two modalities well for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be referred to by the expression. Then, relational words are adopted to highlight the correct entity as well as suppress other irrelevant ones via multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels under the guidance of textual information. In this way, features from multiple levels can communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performance. Code is available at https://github.com/spyflying/CMPC-Refseg.
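As a loose illustration of text-guided multi-level fusion, the sketch below gates same-resolution visual feature maps with a sentence embedding before averaging them. This is a heavily simplified stand-in for the role described for TGFE, with hypothetical names and shapes, not the published module.

```python
import torch
import torch.nn as nn

class TextGuidedFusion(nn.Module):
    """Gate multi-level visual features with a sentence embedding, then fuse."""
    def __init__(self, vis_dim, txt_dim):
        super().__init__()
        self.gate = nn.Linear(txt_dim, vis_dim)

    def forward(self, feats, text):        # feats: list of (B, C, H, W); text: (B, T)
        g = torch.sigmoid(self.gate(text)).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return sum(g * f for f in feats) / len(feats)  # textually gated exchange

levels = [torch.randn(2, 256, 40, 40) for _ in range(3)]   # same-size levels for simplicity
sentence = torch.randn(2, 300)                             # e.g., pooled word embeddings
print(TextGuidedFusion(256, 300)(levels, sentence).shape)  # torch.Size([2, 256, 40, 40])
```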
A scene text image contains two levels of content: visual texture and semantic information. Although previous scene text recognition methods have made great progress over the past few years, research on mining semantic information to assist text recognition has attracted less attention; only RNN-like structures have been explored to implicitly model semantic information. However, we observe that RNN-based methods have some obvious shortcomings, such as the time-dependent decoding manner and the one-way serial transmission of semantic context, which greatly limit the usefulness of semantic information and the computational efficiency. To mitigate these limitations, we propose a novel end-to-end trainable framework named the semantic reasoning network (SRN) for accurate scene text recognition, where a global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission. State-of-the-art results on 7 public benchmarks, including regular text, irregular text, and non-Latin long text, verify the effectiveness and robustness of the proposed method. In addition, the speed of SRN has significant advantages over RNN-based methods, demonstrating its value in practical use.
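To illustrate what multi-way parallel transmission means in contrast to step-by-step RNN decoding, the sketch below refines an initial character hypothesis for all positions at once with a self-attention encoder. The module sizes, vocabulary, and two-stage structure are illustrative assumptions, not SRN's exact design.

```python
import torch
import torch.nn as nn

vocab, dim, max_len = 97, 256, 25
embed = nn.Embedding(vocab, dim)
reasoner = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(dim, vocab)

# Coarse visual prediction for all positions at once (no step-by-step decoding)
coarse_ids = torch.randint(0, vocab, (4, max_len))   # (B, L) argmax of a visual head
semantic = reasoner(embed(coarse_ids))               # global context, fully parallel
refined_logits = classifier(semantic)                # (B, L, vocab) refined prediction
print(refined_logits.shape)                          # torch.Size([4, 25, 97])
```

Since every position attends to every other position in one pass, semantic context flows in all directions simultaneously rather than through a one-way serial chain, which is also why such a module parallelizes well at inference time.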
The localized faults of rolling bearings can be diagnosed by extracting the impulsive feature. However, the approximately periodic impulses may be submerged in strong interferences generated by other components and in the background noise. To address this issue, this paper explores a new impulsive feature extraction method based on sparse representation. According to the vibration model of an impulse generated by a bearing fault, a novel impulsive wavelet is constructed that satisfies the admissibility condition. As a result, this family of model-based impulsive wavelets can form a Parseval frame. With the model-based impulsive wavelet basis and the Fourier basis, a convex optimization problem is formulated to extract the repetitive impulses. Based on the splitting idea, an iterative thresholding shrinkage algorithm with a fast convergence rate is proposed to solve this problem. Using a simulated signal and real vibration signals carrying bearing fault information, the performance of the proposed approach for repetitive impulsive feature extraction is validated and compared with the well-known spectral kurtosis method, an optimized spectral kurtosis method based on simulated annealing, and the resonance-based signal decomposition method. The results demonstrate its advantage and superiority in weak repetitive transient feature extraction.
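Splitting-based iterative thresholding shrinkage algorithms belong to the ISTA family; below is a generic ISTA sketch for the sparse synthesis model min_a 0.5*||y − Da||² + λ||a||₁, with a random normalized dictionary standing in for the paper's impulsive-wavelet-plus-Fourier dictionary. All parameters here are illustrative.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(y, D, lam=0.1, iters=200):
    """Iterative shrinkage-thresholding for min_a 0.5*||y - D a||^2 + lam*||a||_1."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ a - y)           # gradient step on the data-fit term
        a = soft_threshold(a - grad / L, lam / L)  # proximal step on the l1 term
    return a

rng = np.random.default_rng(0)
D = rng.standard_normal((128, 256))
D /= np.linalg.norm(D, axis=0)             # unit-norm atoms
a_true = np.zeros(256)
a_true[[10, 50, 200]] = [1.5, -2.0, 1.0]   # sparse code: three "impulses"
y = D @ a_true + 0.01 * rng.standard_normal(128)
a_hat = ista(y, D)
print(np.nonzero(np.abs(a_hat) > 0.5)[0])  # large coefficients match the support
```

The Parseval-frame property of the model-based wavelets matters here because it keeps the dictionary well conditioned, which is what gives such thresholding iterations their fast, stable convergence.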
Recent advances in adaptive object detection have achieved compelling results by virtue of adversarial feature adaptation that mitigates the distributional shifts along the detection pipeline. Whilst adversarial adaptation significantly enhances the transferability of feature representations, the feature discriminability of object detectors remains less investigated. Moreover, transferability and discriminability may conflict in adversarial adaptation, given the complex combinations of objects and the differing scene layouts between domains. In this paper, we propose a Hierarchical Transferability Calibration Network (HTCN) that hierarchically (local-region/image/instance) calibrates the transferability of feature representations to harmonize transferability and discriminability. The proposed model consists of three components: (1) Importance Weighted Adversarial Training with input Interpolation (IWAT-I), which strengthens the global discriminability by re-weighting the interpolated image-level features; (2) a Context-aware Instance-Level Alignment (CILA) module, which enhances the local discriminability by capturing the complementary effect between the instance-level features and the global context information for instance-level feature alignment; and (3) local feature masks that calibrate the local transferability to provide semantic guidance for the subsequent discriminative pattern alignment. Experimental results show that HTCN significantly outperforms state-of-the-art methods on the benchmark datasets.
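For context on the adversarial mechanism such detectors build on, here is a minimal gradient-reversal domain classifier with an entropy-style importance weight on image-level features. The weighting is a simplified stand-in for the spirit of IWAT-I, not the published formulation; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -grad            # reversed gradients push features to confuse the critic

domain_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))
feat = torch.randn(8, 256, requires_grad=True)        # image-level features
logit = domain_clf(GradReverse.apply(feat)).squeeze(1)
p = torch.sigmoid(logit).clamp(1e-6, 1 - 1e-6)
# Entropy-style weight: samples the critic is unsure about get emphasized
weight = 1.0 - (p * p.log() + (1 - p) * (1 - p).log())
domain_label = torch.zeros(8)                         # 0 = source, 1 = target
loss = (weight.detach() * nn.functional.binary_cross_entropy_with_logits(
        logit, domain_label, reduction='none')).mean()
loss.backward()
print(feat.grad.shape)          # torch.Size([8, 256])
```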
This paper looks at an intriguing question: are single images, with their details lost during deraining, reversible to their artifact-free status? We propose an end-to-end detail-recovery image deraining network (termed DRD-Net) to solve the problem. Unlike existing image deraining approaches that attempt to meet the conflicting goals of simultaneously deraining and preserving details in a unified framework, we propose to view rain removal and detail recovery as two separate tasks, so that each part can specialize rather than trade off between two conflicting goals. Specifically, we introduce two parallel sub-networks with a comprehensive loss function that synergize to derain and to recover the details lost during deraining. For complete rain removal, we present a rain residual network with the squeeze-and-excitation (SE) operation to remove rain streaks from rainy images. For detail recovery, we construct a specialized detail repair network consisting of well-designed blocks, named structure detail context aggregation blocks (SDCAB), to encourage the lost details to return and eliminate image degradations. Moreover, the detail recovery branch of our proposed detail repair framework is detachable and can be incorporated into existing deraining methods to boost their performance. DRD-Net has been validated on several well-known benchmark datasets in terms of deraining robustness and detail accuracy. Comparisons show clear visual and numerical improvements of our method over the state of the art.
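The squeeze-and-excitation operation mentioned above is a standard channel re-calibration block; a minimal PyTorch version follows with illustrative sizes. This is the generic SE block, not the authors' full rain residual network.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global spatial context
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # excitation: per-channel gates
        )

    def forward(self, x):                              # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                    # emphasize channels that respond to rain streaks

x = torch.randn(1, 64, 48, 48)
print(SEBlock(64)(x).shape)             # torch.Size([1, 64, 48, 48])
```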
The main challenge in building an automatic transliteration system for Balinese palm leaf manuscript (Lontar) collections is that recognition errors in even a small portion of Balinese-script glyphs can widely affect the transliteration results. This is due to the fundamental nature of Balinese script, which is a complex alphasyllabic script. This paper presents an initial proposal of a general scheme and model for suggesting several possible transliterations of Lontar collections using text generation and an LSTM. The Edit-Insert-Replace model was proposed to be applied to the existing word-collection dataset, and a bidirectional LSTM model with a specific feature extraction method was built for training the post-transliteration suggestion module. This module helps suggest several possible transliterations based on the initial transliteration from the previous system.
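As a hedged illustration of the suggestion module's core, the toy bidirectional LSTM below maps a recognized glyph sequence to per-position character scores from which top-k transliteration candidates could be drawn. The vocabulary sizes and feature scheme are made up, not those of the paper.

```python
import torch
import torch.nn as nn

class BiLSTMSuggester(nn.Module):
    def __init__(self, n_glyphs, n_chars, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(n_glyphs, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, glyph_ids):                 # (B, L) recognized glyph indices
        h, _ = self.lstm(self.emb(glyph_ids))     # context from both directions
        return self.out(h)                        # (B, L, n_chars) per-position scores

seq = torch.randint(0, 120, (2, 12))              # two 12-glyph words
logits = BiLSTMSuggester(120, 75)(seq)
topk = logits.topk(3, dim=-1).indices             # 3 candidate characters per position
print(topk.shape)                                 # torch.Size([2, 12, 3])
```

The bidirectionality matters for an alphasyllabic script: the correct transliteration of a glyph often depends on glyphs both before and after it, so context from both directions helps repair isolated recognition errors.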
A Visual Question Answering (VQA) system finds useful information in an image related to a question in order to answer the question correctly. It can be widely used in the fields of visual assistance, automated security surveillance, and intelligent interaction between robots and humans. However, the accuracy of VQA has not been ideal, and the main difficulty is that the image features cannot represent the scene and object information well and the text information cannot be fully represented. This paper applies multi-scale feature extraction and fusion methods to the image feature representation and the text representation parts of a VQA system, respectively, to improve its accuracy. First, for the image feature representation problem, a multi-scale feature extraction and fusion method was adopted: the image features output by different layers of a pre-trained deep neural network were extracted, and the optimal feature fusion scheme was found through experiments. Second, for the representation of sentences, a multi-scale feature method was introduced to characterize and fuse the word-level, phrase-level, and sentence-level features of sentences. Finally, the VQA model was improved using the multi-scale feature extraction and fusion method. The results show that adding multi-scale feature extraction and fusion improves the accuracy of the VQA model.
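A minimal sketch of the multi-scale image-feature idea: tap several stages of a pre-trained backbone with forward hooks and fuse the pooled features by concatenation. The backbone, layer choice, and fusion by concatenation are assumptions for illustration; the paper searches for the optimal fusion scheme experimentally.

```python
import torch
import torchvision.models as models

backbone = models.resnet18(weights=None)   # a pre-trained checkpoint would be used in practice
feats = {}
for name in ("layer2", "layer3", "layer4"):
    getattr(backbone, name).register_forward_hook(
        lambda m, i, o, n=name: feats.__setitem__(n, o))   # capture intermediate maps

img = torch.randn(1, 3, 224, 224)
backbone(img)                               # hooks fill `feats` during the forward pass
pooled = [torch.nn.functional.adaptive_avg_pool2d(f, 1).flatten(1)
          for f in feats.values()]          # one vector per scale (128, 256, 512 dims)
fused = torch.cat(pooled, dim=1)            # simple multi-scale fusion by concatenation
print(fused.shape)                          # torch.Size([1, 896])
```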
Objective: This paper investigates the hypothesis that focal seizures can be predicted using scalp electroencephalogram (EEG) data. Our first aim is to learn features that distinguish between the interictal and preictal regions. The second aim is to define a prediction horizon in which the prediction is as accurate and as early as possible, clearly two competing objectives. Methods: Convolutional filters on the wavelet transformation of the EEG signal are used to define and learn quantitative signatures for each period: interictal, preictal, and ictal. The optimal seizure prediction horizon is also learned from the data as opposed to making an a priori assumption. Results: Computational solutions to the optimization problem indicate a 10-min seizure prediction horizon. This result is verified by measuring the Kullback-Leibler divergence on the distributions of the automatically extracted features. Conclusion: The results on the EEG database of 204 recordings demonstrate that (i) the preictal phase transition occurs approximately ten minutes before seizure onset, and (ii) the prediction results on the test set are promising, with a sensitivity of 87.8% and a low false prediction rate of 0.142 FP/h. Our results significantly outperform a random predictor and other seizure prediction algorithms. Significance: We demonstrate that a robust set of features can be learned from scalp EEG that characterize the preictal state of focal seizures.
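The verification step above compares feature distributions with the Kullback-Leibler divergence; the snippet below is a toy version of such a check on histograms of a learned feature in interictal versus preictal windows. All data here is synthetic and the shift is made up for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two histograms, normalized to probability vectors."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(1)
interictal = rng.normal(0.0, 1.0, 5000)   # synthetic feature values, baseline state
preictal = rng.normal(0.8, 1.2, 5000)     # shifted distribution before seizure onset
bins = np.linspace(-5, 6, 60)
p, _ = np.histogram(interictal, bins=bins)
q, _ = np.histogram(preictal, bins=bins)
print(f"KL(interictal || preictal) = "
      f"{kl_divergence(p.astype(float), q.astype(float)):.3f}")
```

A clearly positive divergence between the two empirical distributions is what supports the claim that the learned features separate the preictal state from the interictal baseline.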