We present an algorithm for simultaneous face detection, landmark localization, pose estimation and gender recognition using deep convolutional neural networks (CNNs). The proposed method, called HyperFace, fuses the intermediate layers of a deep CNN using a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. It exploits the synergy among the tasks, which boosts their individual performances. Additionally, we propose two variants of HyperFace: (1) HyperFace-ResNet, which builds on the ResNet-101 model and achieves a significant improvement in performance, and (2) Fast-HyperFace, which uses a high-recall fast face detector to generate region proposals and thereby improves the speed of the algorithm. Extensive experiments show that the proposed models capture both global and local information in faces and perform significantly better than many competitive algorithms on each of these four tasks.
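The fusion-then-branch idea above can be sketched with plain NumPy: pool several intermediate feature maps to fixed-size vectors, concatenate them, and feed the fused vector to one head per task. All sizes, names, and (random) weights here are illustrative stand-ins, not the paper's trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for intermediate CNN feature maps (channels, H, W) at three depths.
shallow = rng.standard_normal((16, 8, 8))
mid     = rng.standard_normal((32, 4, 4))
deep    = rng.standard_normal((64, 2, 2))

def gap(fmap):
    """Global average pooling: map each feature map to a fixed-size vector."""
    return fmap.mean(axis=(1, 2))

# Fuse the pooled intermediate features into one vector (16 + 32 + 64 = 112 dims).
fused = np.concatenate([gap(shallow), gap(mid), gap(deep)])

# One linear head per task; in the paper these are trained jointly (multi-task).
heads = {task: rng.standard_normal((k, fused.size))
         for task, k in [("detect", 2), ("landmarks", 42),
                         ("pose", 3), ("gender", 2)]}
outputs = {task: W @ fused for task, W in heads.items()}
```

The point of the sketch is the data flow: every task head sees the same fused multi-depth representation, which is what lets the tasks share both local (shallow) and global (deep) information.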
Discriminative appearance features are effective for recognizing actions in a fixed view, but may not generalize well to a new view. In this paper, we present two effective approaches to learning dictionaries for robust action recognition across views. In the first approach, we learn a set of view-specific dictionaries, where each dictionary corresponds to one camera view. These dictionaries are learned simultaneously from sets of correspondence videos taken at different views, with the aim of encouraging each video in a set to have the same sparse representation. In the second approach, we additionally learn a common dictionary shared by the different views to model view-shared features. This approach represents the videos in each view using a view-specific dictionary together with the common dictionary. More importantly, it encourages the set of videos taken from different views of the same action to have similar sparse representations. The learned common dictionary not only has the capability to represent actions from unseen views, but also makes our approach effective in a semi-supervised setting where no correspondence videos exist and only a few labeled videos are available in the target view. Extensive experiments on three public datasets demonstrate that the proposed approach outperforms recently developed approaches for cross-view action recognition.
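The first approach above can be illustrated with a toy alternating scheme: one dictionary per view, updated jointly so that corresponding videos in every view share a single code matrix. The function below is a minimal sketch under simplifying assumptions (a ridge surrogate replaces the l1 sparse-coding step, and all names are hypothetical), not the authors' optimization.

```python
import numpy as np

def learn_view_dictionaries(X_views, n_atoms=8, n_iters=30, lam=0.1):
    """Toy sketch: learn one dictionary per view while forcing the
    corresponding videos in every view to share one code matrix Z.
    X_views: list of (d, n) feature matrices; column i in each view
    describes the same action instance."""
    rng = np.random.default_rng(0)
    d, n = X_views[0].shape
    D = [rng.standard_normal((d, n_atoms)) for _ in X_views]
    Z = np.zeros((n_atoms, n))
    for _ in range(n_iters):
        # Shared-code update: a ridge step pooling all views (soft
        # surrogate for the l1 sparse-coding step in the paper).
        A = sum(Di.T @ Di for Di in D) + lam * np.eye(n_atoms)
        B = sum(Di.T @ Xi for Di, Xi in zip(D, X_views))
        Z = np.linalg.solve(A, B)
        # View-specific dictionary updates (least squares per view).
        for v, Xi in enumerate(X_views):
            D[v] = Xi @ Z.T @ np.linalg.inv(Z @ Z.T + 1e-6 * np.eye(n_atoms))
    return D, Z
```

Because the code matrix Z is shared, a classifier trained on Z in one view can be applied to codes computed in another view, which is the mechanism behind the cross-view transfer described above.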
Visual tracking using multiple features has proved to be a robust approach because the features can complement each other. Since different types of variation, such as illumination, occlusion, and pose changes, may occur in a video sequence, especially in long sequences, how to properly select and fuse appropriate features has become one of the key problems in this approach. To address this issue, this paper proposes a new joint sparse representation model for robust feature-level fusion. The proposed method dynamically removes unreliable features from the fusion by exploiting the advantages of sparse representation. In order to capture the non-linear similarity of features, we extend the proposed method into a general kernelized framework, which is able to perform feature fusion in various kernel spaces. As a result, robust tracking performance is obtained. Both qualitative and quantitative experimental results on publicly available videos show that the proposed method outperforms both sparse representation-based and fusion-based trackers.
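The core joint-sparsity mechanism can be sketched as a proximal-gradient solver: each feature type has its own template dictionary, and an l2,1 penalty on the code matrix ties the feature types together so they select the same templates; a feature whose code row shrinks to zero is effectively dropped from the fusion. This is an illustrative simplification (dictionaries and solver are hypothetical, not the paper's).

```python
import numpy as np

def joint_sparse_codes(feats, dicts, lam=0.1, step=None, n_iters=200):
    """Proximal-gradient sketch of joint sparse representation.
    feats: list of K feature vectors; dicts: list of K (d_k, m) template
    dictionaries. Returns an (m, K) code matrix with row-wise coupling."""
    K = len(feats)
    m = dicts[0].shape[1]
    C = np.zeros((m, K))                     # column k codes feature type k
    if step is None:                         # 1 / Lipschitz constant
        step = 1.0 / max(np.linalg.norm(Dk, 2) ** 2 for Dk in dicts)
    for _ in range(n_iters):
        # Gradient of the per-feature least-squares terms.
        G = np.column_stack([Dk.T @ (Dk @ C[:, k] - x)
                             for k, (x, Dk) in enumerate(zip(feats, dicts))])
        C = C - step * G
        # Row-wise soft threshold (l2,1 prox): one row = one template
        # shared across all feature types.
        norms = np.linalg.norm(C, axis=1, keepdims=True)
        C = C * np.maximum(1 - step * lam / np.maximum(norms, 1e-12), 0)
    return C
```

The kernelized extension mentioned in the abstract would replace the inner products `Dk.T @ Dk` and `Dk.T @ x` with kernel evaluations; the row-coupled shrinkage step stays the same.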
We present a learning-based method to super-resolve face images using a kernel principal component analysis (KPCA)-based prior model. A prior probability is formulated based on the energy lying outside the span of the principal components identified in a higher-dimensional feature space. This prior is used to regularize the reconstruction of the high-resolution image. We demonstrate experimentally that including higher-order correlations results in significant improvements.
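The "energy outside the span of the principal components" can be made concrete in the linear case: project the (centered) candidate onto the leading principal directions and measure what is left over. The paper computes this quantity in a kernel-induced feature space; the sketch below illustrates only the linear analogue.

```python
import numpy as np

def pca_prior_energy(X_train, x, n_components=5):
    """Energy of x lying outside the span of the leading principal
    components of X_train (rows = training samples). A linear-PCA
    simplification of the KPCA prior described above."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    V = Vt[:n_components].T                  # principal directions (d, k)
    r = x - mu
    proj = V @ (V.T @ r)                     # component inside the span
    return float(np.dot(r - proj, r - proj)) # squared residual energy
```

Used as a prior, a candidate high-resolution face with low outside-span energy looks "face-like" under the training distribution, so minimizing this energy regularizes the super-resolution reconstruction toward the learned face subspace.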
Road scene analysis is a challenging problem that has applications in autonomous navigation of vehicles. An integral component of such a system is the robust detection and tracking of lane markings. It is a hard problem primarily due to the large appearance variations in lane markings caused by factors such as occlusion (traffic on the road), shadows (from objects such as trees), and changing lighting conditions of the scene (the transition from day to night). In this paper, we address these issues through a learning-based approach using visual inputs from a camera mounted in front of a vehicle. We propose the following: 1) a pixel-hierarchy feature descriptor to model the contextual information shared by lane markings with the surrounding road region; 2) a robust boosting algorithm to select relevant contextual features for detecting lane markings; and 3) particle filters to track the lane markings, without knowledge of vehicle speed, by assuming the lane markings to be static through the video sequence and then learning the possible road scene variations from the statistics of the tracked model parameters. We demonstrate the effectiveness of our algorithm on challenging daylight and night-time road video sequences.
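The third component above, particle-filter tracking under a static-marking assumption, can be sketched with a minimal scalar filter: the state is a single lane-model parameter (say, lateral offset), the motion model is a small random drift (since the markings are assumed near-static), and per-frame detections supply the measurements. This is a toy analogue under stated assumptions, not the paper's tracker.

```python
import numpy as np

def particle_filter_track(observations, n_particles=500, obs_noise=0.5,
                          drift=0.05, rng=None):
    """Minimal particle filter over one lane-model parameter.
    observations: per-frame noisy measurements of the parameter."""
    if rng is None:
        rng = np.random.default_rng(0)
    particles = rng.normal(0.0, 1.0, n_particles)   # initial belief
    estimates = []
    for z in observations:
        particles += rng.normal(0.0, drift, n_particles)    # predict (near-static)
        w = np.exp(-0.5 * ((z - particles) / obs_noise) ** 2)
        w /= w.sum()                                        # weight by likelihood
        estimates.append(float(np.dot(w, particles)))       # posterior mean
        idx = rng.choice(n_particles, n_particles, p=w)     # resample
        particles = particles[idx]
    return estimates
```

Because the motion model assumes static markings, apparent motion in the image is absorbed by the measurement sequence itself, which is why the method needs no knowledge of vehicle speed.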
Achieving the upper limits of face identification accuracy in forensic applications can minimize errors that have profound social and personal consequences. Although forensic examiners identify faces in these applications, systematic tests of their accuracy are rare. How can we achieve the most accurate face identification: using people and/or machines working alone or in collaboration? In a comprehensive comparison of face identification by humans and computers, we found that forensic facial examiners, facial reviewers, and superrecognizers were more accurate than fingerprint examiners and students on a challenging face identification test. Individual performance on the test varied widely. On the same test, four deep convolutional neural networks (DCNNs), developed between 2015 and 2017, identified faces within the range of human accuracy. Accuracy of the algorithms increased steadily over time, with the most recent DCNN scoring above the median of the forensic facial examiners. Using crowd-sourcing methods, we fused the judgments of multiple forensic facial examiners by averaging their rating-based identity judgments. Accuracy was substantially better for fused judgments than for individuals working alone. Fusion also served to stabilize performance, boosting the scores of lower-performing individuals and decreasing variability. A single forensic facial examiner fused with the best algorithm was more accurate than the combination of two examiners. Therefore, collaboration among humans and between humans and machines offers tangible benefits to face identification accuracy in important applications. These results offer an evidence-based roadmap for achieving the most accurate face identification possible.
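The fusion rule itself is simple: average the rating-based identity judgments across examiners (or across an examiner and an algorithm) for each face pair. A one-function sketch, with an illustrative rating scale (the exact scale used in the study is not reproduced here):

```python
import numpy as np

def fuse_ratings(ratings):
    """Fuse judges by averaging rating-based identity judgments.
    ratings: 2-D array-like, rows = judges, columns = face pairs;
    e.g. -3 ('sure different') .. +3 ('sure same')."""
    return np.asarray(ratings, dtype=float).mean(axis=0)
```

Averaging tends to cancel independent per-judge errors, which is consistent with the reported effects: higher fused accuracy and reduced variability across individuals.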
Complex visual data contain discriminative structures that are difficult to fully capture with any single feature descriptor. While recent work on domain adaptation focuses on adapting a single hand-crafted feature, it is important to perform adaptation of a hierarchy of features to exploit the richness of visual data. We propose a novel framework for domain adaptation using a sparse and hierarchical network (DASH-N). Our method jointly learns a hierarchy of features together with transformations that rectify the mismatch between different domains. The building block of DASH-N is the latent sparse representation. It employs a dimensionality-reduction step that prevents the data dimension from increasing too quickly as one traverses deeper into the hierarchy. Experimental results show that our method compares favorably with competing state-of-the-art methods. In addition, a multi-layer DASH-N is shown to perform better than a single-layer DASH-N.
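The role of the dimensionality-reduction step can be seen in a toy two-layer stack: each layer sparse-codes its input against a dictionary (the codes have more dimensions than the input), then a reduction step shrinks the output so the dimension does not blow up as layers stack. Everything below is a hypothetical simplification (random dictionary, thresholded correlations in place of latent sparse coding), not the paper's learning procedure.

```python
import numpy as np

def dash_n_layer(X, n_atoms, n_reduced, rng):
    """One illustrative layer: sparse-code (d, n) input X against a
    random dictionary, then PCA-reduce the codes to n_reduced dims."""
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    Z = D.T @ X                                        # (n_atoms, n) codes
    Z = np.sign(Z) * np.maximum(np.abs(Z) - 0.1, 0)    # soft-threshold: sparsify
    Zc = Z - Z.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Zc, full_matrices=False)
    return U[:, :n_reduced].T @ Z                      # reduced-dimension output

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 40))                      # 40 samples, 64-dim input
h1 = dash_n_layer(X, n_atoms=128, n_reduced=32, rng=rng)  # 128 codes -> 32 dims
h2 = dash_n_layer(h1, n_atoms=128, n_reduced=16, rng=rng) # stacks without growth
```

Without the reduction step, stacking would multiply the feature dimension by the dictionary size at every layer; with it, the hierarchy stays tractable however deep it goes.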
Keypoint detection is one of the most important pre-processing steps in tasks such as face modeling, recognition and verification. In this paper, we present an iterative method for Keypoint Estimation and Pose prediction of unconstrained faces by Learning Efficient H-CNN Regressors (KEPLER), addressing the unconstrained face alignment problem. Recent state-of-the-art methods have shown improvements in facial keypoint detection by employing Convolutional Neural Networks (CNNs). Although a simple feed-forward neural network can learn the mapping between input and output spaces, it does not learn the inherent structural dependencies well. We present a novel architecture called H-CNN (Heatmap-CNN), acting on an N-dimensional input image, which captures informative structured global and local features and thus favors accurate keypoint detection in in-the-wild face images. H-CNN is jointly trained on the visibility, fiducials and 3D pose of the face. As the iterations proceed, the error decreases, making the gradients small; efficient training of the deep networks is required to mitigate this. KEPLER performs global corrections in pose and fiducials for the first four iterations, followed by local corrections at a later stage. As a by-product, KEPLER also provides a robust estimate of the 3D pose (pitch, yaw and roll) of the face. We also show that, without using any 3D information, KEPLER outperforms recent state-of-the-art alignment methods on challenging datasets such as AFW and AFLW.
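The iterative-correction structure described above can be illustrated with a toy loop: each stage predicts a correction toward the ground truth, so the residual shrinks stage by stage. Here the learned regressor is simulated by a stage that recovers a fixed fraction of the remaining error; this is purely illustrative of the error/gradient dynamics, not KEPLER's trained networks.

```python
import numpy as np

def iterative_refine(x0, target, n_iters=5, gain=0.5):
    """Toy analogue of an iterative keypoint-refinement cascade:
    each stage applies a predicted correction of `gain` times the
    remaining error, so the residual shrinks geometrically."""
    x = np.asarray(x0, dtype=float)
    errors = []
    for _ in range(n_iters):
        x = x + gain * (target - x)            # stage's predicted correction
        errors.append(float(np.linalg.norm(target - x)))
    return x, errors
```

The geometrically shrinking residual is exactly why later stages see small errors (and hence small gradients), motivating the careful training regime mentioned in the abstract.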
•A cascade regression method for unconstrained face alignment is presented.
•A Channeled Inception Net is presented for training in a multi-task framework.
•Constrained training and local error correction significantly improve the performance.
•Impressive results are obtained on the AFLW, AFW and COFW datasets.
This paper presents an approach for viewpoint-invariant human action recognition, an area that has received scant attention so far relative to the overall body of work on human action recognition. It has been established previously that there exist no invariants for 3D-to-2D projection. However, there exists a wealth of techniques in 2D invariance that can be used to advantage in the 3D-to-2D setting. We exploit these techniques and model actions in terms of view-invariant canonical body poses and trajectories in 2D invariance space, leading to a simple and effective way to represent and recognize human actions from a general viewpoint. We first evaluate the approach theoretically and show why a straightforward application of the 2D invariance idea will not work. We describe strategies designed to overcome the inherent problems of the straightforward approach and outline the recognition algorithm. We then present results on 2D projections of publicly available human motion capture data as well as on manually segmented real image sequences. In addition to robustness to viewpoint change, the approach is robust enough to handle different people, minor variability in a given action, and changes in the speed of action (and hence frame rate), while encoding sufficient distinction among actions.
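A canonical example of the kind of 2D invariant such techniques build on is the cross-ratio of four collinear points, which is preserved under any projective transformation. The sketch below demonstrates this invariance numerically; it is a textbook illustration of 2D projective invariance, not the paper's specific representation.

```python
def cross_ratio(p):
    """Cross-ratio of four collinear points (given as scalar
    coordinates along the line): the classic projective invariant."""
    a, b, c, d = p
    return ((a - c) * (b - d)) / ((a - d) * (b - c))

# A projective (Moebius) map of the line preserves the cross-ratio.
pts = [0.0, 1.0, 2.0, 4.0]
mapped = [(2 * x + 1) / (x + 3) for x in pts]
```

Quantities like this stay fixed as the camera viewpoint changes, which is what makes it possible to match canonical poses and trajectories across views without recovering 3D structure.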