Human faces in surveillance videos often suffer from severe image blur, dramatic pose variations, and occlusion. In this paper, we propose a comprehensive framework based on Convolutional Neural ...Networks (CNN) to overcome challenges in video-based face recognition (VFR). First, to learn blur-robust face representations, we artificially blur training data composed of clear still images to account for a shortfall in real-world video training data. Using training data composed of both still images and artificially blurred data, CNN is encouraged to learn blur-insensitive features automatically. Second, to enhance robustness of CNN features to pose variations and occlusion, we propose a Trunk-Branch Ensemble CNN model (TBE-CNN), which extracts complementary information from holistic face images and patches cropped around facial components. TBE-CNN is an end-to-end model that extracts features efficiently by sharing the low- and middle-level convolutional layers between the trunk and branch networks. Third, to further promote the discriminative power of the representations learnt by TBE-CNN, we propose an improved triplet loss function. Systematic experiments justify the effectiveness of the proposed techniques. Most impressively, TBE-CNN achieves state-of-the-art performance on three popular video face databases: PaSC, COX Face, and YouTube Faces. With the proposed techniques, we also obtain the first place in the BTAS 2016 Video Person Recognition Evaluation.
Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable ...performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representation using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. Then, the extracted features are concatenated to form a high-dimensional feature vector, whose dimension is compressed by SAE. All of the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, 98.43% verification rate is achieved on the LFW database. Benefitting from the complementary information contained in multimodal data, our small ensemble system achieves higher than 99.0% recognition rate on LFW using publicly available training set.
Knowledge Distillation: A Survey Gou, Jianping; Yu, Baosheng; Maybank, Stephen J. ...
International journal of computer vision,
06/2021, Volume:
129, Issue:
6
Journal Article
Peer reviewed
Open access
In recent years, deep neural networks have been successful in both industry and academia, especially for computer vision tasks. The great success of deep learning is mainly due to its scalability to ...encode large-scale data and to maneuver billions of model parameters. However, it is a challenge to deploy these cumbersome deep models on devices with limited resources, e.g., mobile phones and embedded devices, not only because of the high computational complexity but also the large storage requirements. To this end, a variety of model compression and acceleration techniques have been developed. As a representative type of model compression and acceleration, knowledge distillation effectively learns a small student model from a large teacher model. It has received rapid increasing attention from the community. This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher–student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.
The click feature of an image, defined as the user click frequency vector of the image on a predefined word vocabulary, is known to effectively reduce the semantic gap for fine-grained image ...recognition. Unfortunately, user click frequency data are usually absent in practice. It remains challenging to predict the click feature from the visual feature, because the user click frequency vector of an image is always noisy and sparse. In this paper, we devise a H ierarchical D eep W ord E mbedding (HDWE) model by integrating sparse constraints and an improved RELU operator to address click feature prediction from visual features. HDWE is a coarse-to-fine click feature predictor that is learned with the help of an auxiliary image dataset containing click information. It can therefore discover the hierarchy of word semantics. We evaluate HDWE on three dog and one bird image datasets, in which Clickture-Dog and Clickture-Bird are utilized as auxiliary datasets to provide click data, respectively. Our empirical studies show that HDWE has 1) higher recognition accuracy, 2) a larger compression ratio, and 3) good one-shot learning ability and scalability to unseen categories.
Traditional image representations are not suited to conventional classification methods such as the linear discriminant analysis (LDA) because of the undersample problem (USP): the dimensionality of ...the feature space is much higher than the number of training samples. Motivated by the successes of the two-dimensional LDA (2DLDA) for face recognition, we develop a general tensor discriminant analysis (GTDA) as a preprocessing step for LDA. The benefits of GTDA, compared with existing preprocessing methods such as the principal components analysis (PCA) and 2DLDA, include the following: 1) the USP is reduced in subsequent classification by, for example, LDA, 2) the discriminative information in the training tensors is preserved, and 3) GTDA provides stable recognition rates because the alternating projection optimization algorithm to obtain a solution of GTDA converges, whereas that of 2DLDA does not. We use human gait recognition to validate the proposed GTDA. The averaged gait images are utilized for gait representation. Given the popularity of Gabor-function-based image decompositions for image understanding and object recognition, we develop three different Gabor-function-based image representations: 1) GaborD is the sum of Gabor filter responses over directions, 2) GaborS is the sum of Gabor filter responses over scales, and 3) GaborSD is the sum of Gabor filter responses over scales and directions. The GaborD, GaborS, and GaborSD representations are applied to the problem of recognizing people from their averaged gait images. A large number of experiments were carried out to evaluate the effectiveness (recognition rate) of gait recognition based on first obtaining a Gabor, GaborD, GaborS, or GaborSD image representation, then using GDTA to extract features and, finally, using LDA for classification. The proposed methods achieved good performance for gait recognition based on image sequences from the University of South Florida (USF) HumanID Database. Experimental comparisons are made with nine state-of-the-art classification methods in gait recognition.
Geometric Mean for Subspace Selection Dacheng Tao; Xuelong Li; Xindong Wu ...
IEEE transactions on pattern analysis and machine intelligence,
02/2009, Volume:
31, Issue:
2
Journal Article
Peer reviewed
Subspace selection approaches are powerful tools in pattern classification and data visualization. One of the most important subspace approaches is the linear dimensionality reduction step in the ...Fisher's linear discriminant analysis (FLDA), which has been successfully employed in many fields such as biometrics, bioinformatics, and multimedia information management. However, the linear dimensionality reduction step in FLDA has a critical drawback: for a classification task with c classes, if the dimension of the projected subspace is strictly lower than c - 1, the projection to a subspace tends to merge those classes, which are close together in the original feature space. If separate classes are sampled from Gaussian distributions, all with identical covariance matrices, then the linear dimensionality reduction step in FLDA maximizes the mean value of the Kullback-Leibler (KL) divergences between different classes. Based on this viewpoint, the geometric mean for subspace selection is studied in this paper. Three criteria are analyzed: 1) maximization of the geometric mean of the KL divergences, 2) maximization of the geometric mean of the normalized KL divergences, and 3) the combination of 1 and 2. Preliminary experimental results based on synthetic data, UCI Machine Learning Repository, and handwriting digits show that the third criterion is a potential discriminative subspace selection method, which significantly reduces the class separation problem in comparing with the linear dimensionality reduction step in FLDA and its several representative extensions.
Hierarchical neural networks have been shown to be effective in learning representative image features and recognizing object classes. However, most existing networks combine the low/middle level ...cues for classification without accounting for any spatial structures. For applications such as understanding a scene, how the visual cues are spatially distributed in an image becomes essential for successful analysis. This paper extends the framework of deep neural networks by accounting for the structural cues in the visual signals. In particular, two kinds of neural networks have been proposed. First, we develop a multitask deep convolutional network, which simultaneously detects the presence of the target and the geometric attributes (location and orientation) of the target with respect to the region of interest. Second, a recurrent neuron layer is adopted for structured visual detection. The recurrent neurons can deal with the spatial distribution of visible cues belonging to an object whose shape or structure is difficult to explicitly define. Both the networks are demonstrated by the practical task of detecting lane boundaries in traffic scenes. The multitask convolutional neural network provides auxiliary geometric information to help the subsequent modeling of the given lane structures. The recurrent neural network automatically detects lane boundaries, including those areas containing no marks, without any explicit prior knowledge or secondary modeling.
Single image haze removal is a challenging ill-posed problem. Existing methods use various constraints/priors to get plausible dehazing solutions. The key to achieve haze removal is to estimate a ...medium transmission map for an input hazy image. In this paper, we propose a trainable end-to-end system called DehazeNet, for medium transmission estimation. DehazeNet takes a hazy image as input, and outputs its medium transmission map that is subsequently used to recover a haze-free image via atmospheric scattering model. DehazeNet adopts convolutional neural network-based deep architecture, whose layers are specially designed to embody the established assumptions/priors in image dehazing. Specifically, the layers of Maxout units are used for feature extraction, which can generate almost all haze-relevant features. We also propose a novel nonlinear activation function in DehazeNet, called bilateral rectified linear unit, which is able to improve the quality of recovered haze-free image. We establish connections between the components of the proposed DehazeNet and those used in existing methods. Experiments on benchmark images show that DehazeNet achieves superior performance over existing methods, yet keeps efficient and easy to use.
The rapid development of computer hardware and Internet technology makes large scale data dependent models computationally tractable, and opens a bright avenue for annotating images through ...innovative machine learning algorithms. Semisupervised learning (SSL) therefore received intensive attention in recent years and was successfully deployed in image annotation. One representative work in SSL is Laplacian regularization (LR), which smoothes the conditional distribution for classification along the manifold encoded in the graph Laplacian, however, it is observed that LR biases the classification function toward a constant function that possibly results in poor generalization. In addition, LR is developed to handle uniformly distributed data (or single-view data), although instances or objects, such as images and videos, are usually represented by multiview features, such as color, shape, and texture. In this paper, we present multiview Hessian regularization (mHR) to address the above two problems in LR-based image annotation. In particular, mHR optimally combines multiple HR, each of which is obtained from a particular view of instances, and steers the classification function that varies linearly along the data manifold. We apply mHR to kernel least squares and support vector machines as two examples for image annotation. Extensive experiments on the PASCAL VOC'07 dataset validate the effectiveness of mHR by comparing it with baseline algorithms, including LR and HR.
Human keypoint detection from a single image is very challenging due to occlusion, blur, illumination, and scale variance. In this paper, we address this problem from three aspects by devising an ...efficient network structure, proposing three effective training strategies, and exploiting four useful postprocessing techniques. First, we find that context information plays an important role in reasoning human body configuration and invisible keypoints. Inspired by this, we propose a cascaded context mixer (CCM), which efficiently integrates spatial and channel context information and progressively refines them. Then, to maximize CCM’s representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy by exploiting abundant unlabeled data. It enables CCM to learn discriminative features from massive diverse poses. Third, we present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy. Extensive experiments on the MS COCO keypoint detection benchmark demonstrate the superiority of the proposed method over representative state-of-the-art (SOTA) methods. Our single model achieves comparable performance with the winner of the 2018 COCO Keypoint Detection Challenge. The final ensemble model sets a new SOTA on this benchmark. The source code will be released at
https://github.com/chaimi2013/CCM
.