Spherical Zero-Shot Learning Shen, Jiayi; Xiao, Zehao; Zhen, Xiantong ...
IEEE transactions on circuits and systems for video technology,
02/2022, Volume: 32, Issue: 2
Journal Article
Peer reviewed
Zero-shot learning (ZSL), which aims to generalize from seen to unseen classes, is a highly non-trivial task. In this paper, we propose spherical zero-shot learning (SZSL) to address the major challenges in ZSL. By decoupling the similarity metric in the spherical embedding space into radius and angle, SZSL can map classes to hyperspherical surfaces of different radii, which greatly increases its flexibility. Specifically, we introduce spherical alignment on angles, which spreads classes as uniformly as possible to alleviate the hubness problem while preserving the inter-class semantic structure to make the alignment more reasonable. We also introduce spherical calibration with a minimum-entropy-based regularizer, which adopts a larger radius for unseen classes than for seen classes to reduce the prediction bias. Extensive experiments on five medium-scale benchmarks and the large-scale ImageNet dataset demonstrate that the proposed approach consistently achieves superior performance in both the traditional and generalized ZSL settings.
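To make the radius/angle decoupling concrete, here is a minimal NumPy sketch of how a per-class radius can rescale cosine similarities, together with the mean prediction entropy that a minimum-entropy regularizer would penalize. The function names, the fixed radii and the plain softmax classifier are illustrative assumptions, not the authors' implementation; the uniformity and semantic-structure terms of the spherical alignment are not shown.

```python
import numpy as np

def spherical_logits(features, class_embeddings, radii):
    """Decouple similarity into an angular part (cosine) and a per-class radius.

    features: (n, d) samples mapped into the embedding space
    class_embeddings: (c, d) class prototypes
    radii: (c,) hypothetical per-class radii (e.g. larger for unseen classes)
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    cos = f @ w.T                  # angle only, values in [-1, 1]
    return cos * radii[None, :]    # the radius rescales each class's logits

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mean_entropy(probs, eps=1e-12):
    """Average prediction entropy; a minimum-entropy regularizer adds this to the loss."""
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())
```

With two classes pointing in the same direction, the class with the larger radius receives the larger logit, which is the lever the calibration step uses to counter the bias toward seen classes.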
Few-shot learning, which addresses the fundamental yet challenging problem of learning to adapt to new tasks from limited data, has recently gained increasing popularity in machine learning. In this paper, we propose a new probabilistic framework that learns to adapt quickly with external memory. We model the classifier parameters as distributions that are inferred from the support set and directly applied to the query set for prediction, and we optimize the model by formulating learning as a variational inference problem. This probabilistic modeling better handles the prediction uncertainty caused by limited data. We impose a discriminative constraint on the feature representations by exploiting the class structure, which improves classification performance. We further introduce a memory unit that stores task-specific information extracted from the support set and uses it for the query set, achieving explicit adaptation to individual tasks. Through episodic training, the model acquires the ability to adapt to specific tasks, enabling it to perform well on new related tasks. We conduct extensive experiments on widely used few-shot recognition benchmarks, where our method achieves new state-of-the-art performance, surpassing previous methods by large margins. An ablation study further demonstrates the effectiveness of the proposed discriminative learning and memory unit.
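The idea of inferring classifier weights as distributions from the support set can be sketched as follows. This is an assumption-laden toy, not the authors' model: the Gaussian mean is simply the class prototype (the mean support feature), the standard deviation is a fixed hypothetical `sigma` rather than the output of a learned inference network, and the memory unit and discriminative constraint are omitted. Predictions average the softmax over weights sampled with the reparameterization trick.

```python
import numpy as np

def infer_weight_distribution(support_x, support_y, num_classes, sigma=0.1):
    """Hypothetical amortized inference: one diagonal Gaussian per class weight vector.

    Mean = class prototype (average support feature). The std is a fixed
    assumption here; a learned model would predict it from the support set.
    """
    d = support_x.shape[1]
    mu = np.zeros((num_classes, d))
    for c in range(num_classes):
        mu[c] = support_x[support_y == c].mean(axis=0)
    log_sigma = np.full_like(mu, np.log(sigma))
    return mu, log_sigma

def predict(query_x, mu, log_sigma, num_samples=20, seed=0):
    """Monte Carlo predictive distribution: average softmax over sampled weights."""
    rng = np.random.default_rng(seed)
    probs = np.zeros((query_x.shape[0], mu.shape[0]))
    for _ in range(num_samples):
        # reparameterization: w = mu + sigma * eps, eps ~ N(0, I)
        w = mu + np.exp(log_sigma) * rng.standard_normal(mu.shape)
        logits = query_x @ w.T
        logits -= logits.max(axis=1, keepdims=True)
        e = np.exp(logits)
        probs += e / e.sum(axis=1, keepdims=True)
    return probs / num_samples
```

Averaging over sampled weights, rather than using a single point estimate, is what lets the prediction reflect uncertainty about the classifier when the support set is small.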
Multi-Target Regression via Robust Low-Rank Learning Zhen, Xiantong; Yu, Mengyang; He, Xiaofei ...
IEEE transactions on pattern analysis and machine intelligence,
02/2018, Volume: 40, Issue: 2
Journal Article
Peer reviewed
Open access
Multi-target regression has recently regained great popularity due to its capability of simultaneously learning multiple relevant regression tasks and its wide applications in data mining, computer vision and medical image analysis; however, jointly handling inter-target correlations and input-output relationships poses great challenges. In this paper, we propose Multi-layer Multi-target Regression (MMR), which simultaneously models intrinsic inter-target correlations and nonlinear input-output relationships in a general framework via robust low-rank learning. Specifically, MMR can explicitly encode inter-target correlations in a structure matrix by matrix elastic nets (MEN); it can work in conjunction with the kernel trick to effectively disentangle highly complex nonlinear input-output relationships; and it can be efficiently solved by a new alternating optimization algorithm with guaranteed convergence. MMR leverages the strength of kernel methods for nonlinear feature learning and the structural advantage of multi-layer learning architectures for inter-target correlation modeling. More importantly, it offers a new multi-layer learning paradigm for multi-target regression that is endowed with high generality, flexibility and expressive ability. Extensive experimental evaluation on 18 diverse real-world datasets demonstrates that MMR consistently achieves high performance and outperforms representative state-of-the-art algorithms, which shows its effectiveness and generality for multivariate prediction.
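Matrix elastic nets combine a nuclear-norm (low-rank) penalty with a squared Frobenius-norm penalty. Assuming the regularizer takes the form `lam_nuclear * ||S||_* + (lam_fro / 2) * ||S||_F^2` (the exact weighting used in MMR may differ), its proximal operator has a closed form: soft-threshold the singular values at `lam_nuclear`, then divide by `1 + lam_fro`. A sketch of that single step, which an alternating or proximal-gradient solver could call on the structure matrix, is below; the full MMR optimization is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def matrix_elastic_net_prox(Y, lam_nuclear, lam_fro):
    """Proximal step for lam_nuclear*||S||_* + (lam_fro/2)*||S||_F^2.

    Solves argmin_S 0.5*||S - Y||_F^2 + lam_nuclear*||S||_* + (lam_fro/2)*||S||_F^2,
    whose solution is singular value soft-thresholding of Y at lam_nuclear,
    scaled by 1 / (1 + lam_fro).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - lam_nuclear, 0.0) / (1.0 + lam_fro)
    return (U * s_shrunk) @ Vt   # scale column j of U by s_shrunk[j]
```

The nuclear-norm part drives small singular values to zero (a low-rank structure matrix), while the Frobenius part shrinks the surviving ones uniformly, which stabilizes the solution when targets are strongly correlated.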
Image deraining is an important yet challenging image processing task. Although deterministic image deraining methods have achieved encouraging performance, they cannot learn flexible representations for probabilistic inference and diverse predictions. Moreover, rain intensity varies both across spatial locations and across color channels, which makes the task more difficult. In this paper, we propose a Conditional Variational Image Deraining (CVID) network for better deraining performance, which leverages the generative ability of the Conditional Variational Auto-Encoder (CVAE) to provide diverse predictions for a rainy image. To perform spatially adaptive deraining, we propose a spatial density estimation (SDE) module that estimates a rain density map for each image. Since rain density varies across color channels, we also propose a channel-wise (CW) deraining scheme. Experiments on synthesized and real-world datasets show that the proposed CVID network achieves much better deraining performance than previous deterministic methods. Extensive ablation studies validate the effectiveness of the proposed SDE module and CW scheme in our CVID network. The code is available at https://github.com/Yingjun-Du/VID .
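The diversity in a CVAE comes from its latent variable: a recognition network q(z | rainy, clean) and a prior network p(z | rainy) each output a diagonal Gaussian, training penalizes the KL divergence between them, and at test time different samples of z give different derained outputs. The sketch below shows only these two generic ingredients (closed-form Gaussian KL and reparameterized sampling); the encoder and decoder networks, the SDE module and the CW scheme of CVID are not modeled, and the function names are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p)))), summed over latent dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0, axis=-1)

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps so that gradients can flow through mu and logvar."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
```

Drawing several z from the prior for the same rainy image and decoding each one is what yields multiple plausible derained predictions.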
Local Feature Discriminant Projection Yu, Mengyang; Shao, Ling; Zhen, Xiantong ...
IEEE transactions on pattern analysis and machine intelligence,
09/2016, Volume: 38, Issue: 9
Journal Article
Peer reviewed
Open access
In this paper, we propose a novel subspace learning algorithm called Local Feature Discriminant Projection (LFDP) for supervised dimensionality reduction of local features. LFDP efficiently seeks a subspace that improves the discriminability of local features for classification. We make three novel contributions. First, LFDP is a general supervised subspace learning algorithm that provides an efficient way to reduce the dimensionality of large-scale local feature descriptors. Second, we introduce the Differential Scatter Discriminant Criterion (DSDC) to the subspace learning of local feature descriptors, which avoids the matrix singularity problem. Third, we propose a generalized orthogonalization method that is imposed on the projections, leading to a more compact and less redundant subspace. Extensive experimental validation on three benchmark datasets, UIUC-Sports, Scene-15 and MIT Indoor, demonstrates that LFDP outperforms other dimensionality reduction methods and achieves state-of-the-art performance for image classification.
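A differential scatter criterion in its commonly used form maximizes tr(W^T (S_b - c * S_w) W) subject to W^T W = I, where S_b and S_w are the between-class and within-class scatter matrices and c is a trade-off weight. Because it subtracts S_w instead of inverting it, the solution is simply the top eigenvectors of a symmetric matrix, which stays well defined when S_w is singular (for example, with fewer descriptors than dimensions). The sketch below assumes that form and a hypothetical function name; LFDP's generalized orthogonalization and its large-scale handling of local descriptors are not shown.

```python
import numpy as np

def differential_scatter_projection(X, y, dim, c=1.0):
    """Projection maximizing tr(W^T (S_b - c*S_w) W) subject to W^T W = I.

    X: (n, d) descriptors, y: (n,) labels. No inverse of S_w is needed,
    so the criterion stays well defined even when S_w is singular.
    """
    mean = X.mean(axis=0)
    d = X.shape[1]
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mc = Xc.mean(axis=0)
        diff = (mc - mean)[:, None]
        S_b += len(Xc) * diff @ diff.T      # between-class scatter
        centered = Xc - mc
        S_w += centered.T @ centered         # within-class scatter
    evals, evecs = np.linalg.eigh(S_b - c * S_w)   # eigenvalues in ascending order
    return evecs[:, ::-1][:, :dim]                 # top-`dim` eigenvectors (orthonormal columns)
```

Projected descriptors are then `X @ W`; the columns returned by `eigh` are already orthonormal, so this sketch satisfies the constraint without an extra orthogonalization step.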
High-level image representations have drawn increasing attention in visual recognition, e.g., scene classification, since the invention of the object bank. The object bank represents an image as a response map of a large number of pretrained object detectors and has achieved superior performance for visual recognition. In this paper, based on the object bank representation, we propose object-to-class (O2C) distances to model scene images. In particular, we present four variants of O2C distances; with them, images represented by the object bank can be mapped into lower-dimensional but more discriminative spaces, called distance spaces, which are spanned by the O2C distances. Because the O2C distances are computed explicitly from the object bank, the resulting representations carry more semantic meaning. To combine the discriminative ability of the O2C distances with respect to all scene classes, we further propose to kernelize the distance representation for the final classification. Extensive experiments on four benchmark datasets, UIUC-Sports, Scene-15, MIT Indoor and Caltech-101, demonstrate that the proposed approaches significantly improve on the original object bank approach and achieve state-of-the-art performance.
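The abstract does not spell out the four O2C variants, so the following is only a hypothetical nearest-neighbour variant for illustration: for each object detector and each scene class, take the distance from the image's response for that detector to the closest response of the same detector among the class's training images. The resulting objects-by-classes matrix, flattened, is much smaller than a raw object-bank response. The array layouts and function names are assumptions, and the kernelized classifier is omitted.

```python
import numpy as np

def o2c_distances(query, train, train_labels, num_classes):
    """Hypothetical nearest-neighbour O2C variant.

    query: (O, k) object-bank responses of one image (O detectors, k dims each)
    train: (n, O, k) responses of the training images; train_labels: (n,)
    Returns an (O, C) matrix: for each object, the distance to the closest
    same-object response among training images of class c.
    """
    dists = np.linalg.norm(train - query[None, :, :], axis=2)   # (n, O)
    out = np.empty((query.shape[0], num_classes))
    for c in range(num_classes):
        out[:, c] = dists[train_labels == c].min(axis=0)
    return out

def distance_space_feature(query, train, train_labels, num_classes):
    """Flatten the O2C matrix into a compact distance-space vector of length O*C."""
    return o2c_distances(query, train, train_labels, num_classes).ravel()
```

Each coordinate of the resulting vector ties one object detector to one scene class, which is why such distance-space features are easier to interpret than raw detector responses.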
Deep Ensemble Machine for Video Classification Zheng, Jiewan; Cao, Xianbin; Zhang, Baochang ...
IEEE transactions on neural networks and learning systems,
02/2019, Volume: 30, Issue: 2
Journal Article
Video classification has been extensively researched in computer vision due to its widespread applications. However, it remains a challenging task because of the great difficulty of effective spatio-temporal feature extraction and efficient classification with high-dimensional video representations. To address these challenges, in this paper, we propose an end-to-end learning framework called deep ensemble machine (DEM) for video classification. Specifically, to establish effective spatio-temporal features, we use two deep convolutional neural networks (CNNs), i.e., VGG (Visual Geometry Group) and C3D, to extract heterogeneous spatial and temporal features as complementary representations. To achieve efficient classification, we propose ensemble learning based on random projections, which transforms high-dimensional features into a set of lower-dimensional compact features in subspaces; an ensemble of classifiers is trained on the subspaces and combined with a weighting layer during backpropagation. To further enhance performance, we introduce rectified linear encoding (RLE), inspired by error-correcting output coding, to encode the initial outputs of the classifiers, followed by a softmax layer that produces the final classification results. DEM combines the strengths of deep CNNs and ensemble learning, establishing a new end-to-end learning architecture for more accurate and efficient video classification. We demonstrate the effectiveness of DEM through extensive experiments on four datasets for diverse video classification tasks, including action recognition and dynamic scene classification. Results show that DEM achieves high performance on all tasks, with an improvement of up to 13% on the CIFAR10 dataset over the baseline model.
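The random-projection ensemble step can be sketched with plain NumPy: each member projects features into a random Gaussian subspace, fits a ridge-regression classifier on one-hot targets there, and the members' scores are combined with weights. Assumptions to note: ridge classifiers stand in for DEM's classifiers, the combination weights are uniform rather than learned by a weighting layer through backpropagation, the inputs are generic feature vectors rather than VGG/C3D features, RLE is not modeled, and the class name is hypothetical.

```python
import numpy as np

class RandomProjectionEnsemble:
    """Sketch: ridge classifiers on random Gaussian subspaces, combined by weights."""

    def __init__(self, num_members=5, sub_dim=32, ridge=1e-2, seed=0):
        self.num_members, self.sub_dim, self.ridge = num_members, sub_dim, ridge
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y, num_classes):
        Y = np.eye(num_classes)[y]   # one-hot targets
        self.members = []
        for _ in range(self.num_members):
            # random projection to a low-dimensional subspace (scaled to roughly preserve norms)
            P = self.rng.standard_normal((X.shape[1], self.sub_dim)) / np.sqrt(self.sub_dim)
            Z = X @ P
            # closed-form ridge regression in the subspace
            W = np.linalg.solve(Z.T @ Z + self.ridge * np.eye(self.sub_dim), Z.T @ Y)
            self.members.append((P, W))
        # uniform combination weights (DEM learns these with a weighting layer)
        self.weights = np.full(self.num_members, 1.0 / self.num_members)
        return self

    def decision_function(self, X):
        return sum(w * (X @ P) @ W for w, (P, W) in zip(self.weights, self.members))

    def predict(self, X):
        return self.decision_function(X).argmax(axis=1)
```

Each member only ever solves a `sub_dim x sub_dim` system, which is where the efficiency gain over classifying the full high-dimensional representation comes from.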
Power lines are among the most hazardous obstacles for low-altitude aircraft. Because aircraft often encounter previously unseen scenes during flight, cross-scene power line detection is key to their flight safety. However, compared with regular object detection tasks, cross-scene power line detection is extremely challenging due to the weak visual appearance and widespread presence of power lines. In this letter, we propose a cross-scene power line detection method based on attentional information fusion networks. Specifically, we construct a fully convolutional network with attention and information fusion mechanisms for cross-scene detection. These two main modules make full use of semantic and location information, which enables the model to focus on power lines rather than on unfamiliar background scenes. To the best of our knowledge, our method establishes the first end-to-end convolutional architecture for pixelwise power line detection. Experimental results show that our method outperforms previous methods by large margins for cross-scene power line detection.
Abnormal crowd behavior detection has recently attracted increasing attention due to its wide applications in computer vision. However, it remains an extremely challenging task due to the great variability of abnormal behavior, coupled with the huge ambiguity and uncertainty of video content. To tackle these challenges, we propose a new probabilistic framework named variational abnormal behavior detection (VABD), which detects abnormal crowd behavior in video sequences. We make three major contributions: (1) we develop a new probabilistic latent variable model that combines the strengths of the U-Net and the conditional variational auto-encoder, which together form the backbone of our model; (2) we propose a motion loss based on an optical flow network to enforce motion consistency between generated and input video frames; (3) we embed a Wasserstein generative adversarial network at the end of the backbone network to enhance performance. VABD can accurately discriminate abnormal frames in video sequences. Experimental results on the UCSD, CUHK Avenue, IITB-Corridor and ShanghaiTech datasets show that VABD outperforms state-of-the-art algorithms for abnormal crowd behavior detection. Without data augmentation, VABD achieves an AUC of 72.24% on IITB-Corridor, surpassing state-of-the-art methods by nearly 5%.
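Two of the loss terms named above have simple forms once their inputs are available. The sketch assumes the motion loss is a mean absolute (L1) difference between the optical-flow fields of generated and input frames (the abstract does not state the distance used, and the flow network itself is not modeled), and uses the standard Wasserstein GAN critic and generator objectives; the Lipschitz constraint on the critic (weight clipping or a gradient penalty) is not shown. Function names are hypothetical.

```python
import numpy as np

def motion_loss(flow_generated, flow_input):
    """Mean absolute difference between two optical-flow fields of shape (H, W, 2).

    Assumption: the flows are precomputed by some optical flow network and
    passed in as arrays; L1 is an illustrative choice of distance.
    """
    return float(np.mean(np.abs(flow_generated - flow_input)))

def wgan_critic_loss(critic_real, critic_fake):
    """Critic minimizes E[D(fake)] - E[D(real)] (Wasserstein GAN)."""
    return float(np.mean(critic_fake) - np.mean(critic_real))

def wgan_generator_loss(critic_fake):
    """Generator minimizes -E[D(fake)]."""
    return float(-np.mean(critic_fake))
```

In a training loop these would be added, with weights, to the CVAE reconstruction and KL terms of the backbone.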
We present a novel descriptor, called spatio-temporal Laplacian pyramid coding (STLPC), for holistic representation of human actions. In contrast to sparse representations based on detected local interest points, STLPC treats a video sequence as a whole and extracts spatio-temporal features directly from it, which avoids the information loss of sparse representations. By decomposing each sequence into a set of band-pass-filtered components, the proposed pyramid model localizes features residing at different scales and is therefore able to effectively encode the motion information of actions. To make the features more invariant and robust to distortions and noise, a bank of 3-D Gabor filters is applied to each level of the Laplacian pyramid, followed by max pooling within filter bands and over spatio-temporal neighborhoods. Since convolution and pooling are performed spatio-temporally, the coding model captures structural and motion information simultaneously and provides an informative representation of actions. The proposed method achieves superb recognition rates on the KTH, multiview IXMAS, challenging UCF Sports and newly released HMDB51 datasets. It outperforms state-of-the-art methods, showing its great potential for action recognition.
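A spatio-temporal Laplacian pyramid treats the video as a 3-D volume (t, y, x) and repeatedly blurs and downsamples it along all three axes; each band-pass level is the difference between a level and the upsampled next-coarser one. The sketch below assumes a 5-tap binomial kernel and nearest-neighbour upsampling (both illustrative choices, not necessarily those used by STLPC), and the 3-D Gabor filtering and max pooling are not included.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # binomial approximation of a Gaussian

def blur(vol):
    """Separable smoothing along every axis (t, y and x) with edge padding."""
    out = vol.astype(float)
    for axis in range(out.ndim):
        pad = [(2, 2) if a == axis else (0, 0) for a in range(out.ndim)]
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(lambda v: np.convolve(v, KERNEL, mode="valid"), axis, padded)
    return out

def downsample(vol):
    return blur(vol)[::2, ::2, ::2]

def upsample(vol, shape):
    """Nearest-neighbour upsampling by 2 on each axis, cropped to `shape`."""
    up = vol
    for axis in range(vol.ndim):
        up = np.repeat(up, 2, axis=axis)
    return up[: shape[0], : shape[1], : shape[2]]

def laplacian_pyramid(video, levels=3):
    """Band-pass levels L_i = G_i - up(G_{i+1}); the last entry is the coarsest G."""
    pyramid, current = [], video.astype(float)
    for _ in range(levels - 1):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller, current.shape))
        current = smaller
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Invert the decomposition by upsampling and adding back each band."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = upsample(current, band.shape) + band
    return current
```

Because every band stores exactly what the coarser level misses, `reconstruct(laplacian_pyramid(video))` recovers the input up to floating-point error; each band isolates structure and motion at one spatio-temporal scale.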