Human faces in surveillance videos often suffer from severe image blur, dramatic pose variations, and occlusion. In this paper, we propose a comprehensive framework based on Convolutional Neural Networks (CNN) to overcome challenges in video-based face recognition (VFR). First, to learn blur-robust face representations, we artificially blur training data composed of clear still images to compensate for the shortage of real-world video training data. Using training data composed of both still images and artificially blurred data, the CNN is encouraged to learn blur-insensitive features automatically. Second, to enhance the robustness of CNN features to pose variations and occlusion, we propose a Trunk-Branch Ensemble CNN model (TBE-CNN), which extracts complementary information from holistic face images and from patches cropped around facial components. TBE-CNN is an end-to-end model that extracts features efficiently by sharing the low- and middle-level convolutional layers between the trunk and branch networks. Third, to further promote the discriminative power of the representations learnt by TBE-CNN, we propose an improved triplet loss function. Systematic experiments justify the effectiveness of the proposed techniques. Most impressively, TBE-CNN achieves state-of-the-art performance on three popular video face databases: PaSC, COX Face, and YouTube Faces. With the proposed techniques, we also obtained first place in the BTAS 2016 Video Person Recognition Evaluation.
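The improved triplet loss is not specified in this abstract. As a point of reference only, the standard triplet loss it improves upon can be sketched in Python with NumPy. The margin of 0.2 and the use of squared Euclidean distances are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss (not the improved variant described above).

    Each argument is an (N, D) array of embeddings; row i of the three
    arrays forms one (anchor, positive, negative) triplet.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared distance to the same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared distance to another identity
    # Hinge: penalize triplets whose negative is not at least `margin` farther away.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])  # close to the anchor
n = np.array([[1.0, 0.0]])  # far from the anchor
easy = triplet_loss(a, p, n)  # negative already far enough: zero loss
hard = triplet_loss(a, n, p)  # roles swapped: positive penalty
```

The hinge makes already well-separated triplets contribute nothing, so training effort concentrates on the triplets that still violate the margin.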
Face images appearing in multimedia applications, e.g., social networks and digital entertainment, usually exhibit dramatic pose, illumination, and expression variations, resulting in considerable performance degradation for traditional face recognition algorithms. This paper proposes a comprehensive deep learning framework to jointly learn face representations using multimodal information. The proposed deep learning structure is composed of a set of elaborately designed convolutional neural networks (CNNs) and a three-layer stacked auto-encoder (SAE). The set of CNNs extracts complementary facial features from multimodal data. The extracted features are then concatenated to form a high-dimensional feature vector, whose dimension is compressed by the SAE. All of the CNNs are trained using a subset of 9,000 subjects from the publicly available CASIA-WebFace database, which ensures the reproducibility of this work. Using the proposed single CNN architecture and limited training data, a 98.43% verification rate is achieved on the LFW database. Benefiting from the complementary information contained in multimodal data, our small ensemble system achieves a recognition rate higher than 99.0% on LFW using a publicly available training set.
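The fusion step described above can be sketched as a forward pass: concatenate the CNN outputs, then pass them through the encoder half of a three-layer SAE. This is a minimal sketch. The feature sizes (three 512-d CNN outputs and layer widths of 1024, 512, and 256), the sigmoid activations, and the random untrained weights are all hypothetical. In practice, the SAE would first be trained with a reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sae_encode(features, weights, biases):
    """Forward pass through the encoder half of a stacked auto-encoder."""
    h = features
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)  # each layer maps to a smaller code
    return h

# Hypothetical sizes: three CNNs, each emitting a 512-d feature per face.
cnn_features = [rng.standard_normal((4, 512)) for _ in range(3)]  # batch of 4 faces
x = np.concatenate(cnn_features, axis=1)                          # shape (4, 1536)
dims = [x.shape[1], 1024, 512, 256]                               # three encoder layers
weights = [rng.standard_normal((d_in, d_out)) * 0.01 for d_in, d_out in zip(dims[:-1], dims[1:])]
biases = [np.zeros(d_out) for d_out in dims[1:]]
code = sae_encode(x, weights, biases)                             # shape (4, 256)
```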
Face images captured in unconstrained environments usually contain significant pose variation, which dramatically degrades the performance of algorithms designed to recognize frontal faces. This paper proposes a novel face identification framework capable of handling the full range of pose variations within ±90° of yaw. The proposed framework first transforms the original pose-invariant face recognition problem into a partial frontal face recognition problem. A robust patch-based face representation scheme is then developed to represent the synthesized partial frontal faces. For each patch, a transformation dictionary is learnt under the proposed multi-task learning scheme; this dictionary transforms the features of different poses into a discriminative subspace. Finally, face matching is performed at the patch level rather than at the holistic level. Extensive and systematic experiments on the FERET, CMU-PIE, and Multi-PIE databases show that the proposed method consistently outperforms single-task-based baselines as well as state-of-the-art methods for the pose problem. We further extend the proposed algorithm to the unconstrained face verification problem and achieve top-level performance on the challenging LFW data set.
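Patch-level matching can be illustrated with a minimal sketch. Each patch descriptor is projected with its own linear transform, and the per-patch cosine similarities are then averaged. Several choices here are assumptions made for illustration: identity matrices stand in for the learnt transformation dictionaries, and the scores are cosine similarity fused by the mean. The multi-task dictionary learning itself is not reproduced.

```python
import numpy as np

def patch_match_score(probe, gallery, transforms):
    """Average per-patch cosine similarity after a per-patch linear transform.

    probe, gallery: (P, D) arrays holding one D-dim descriptor per patch.
    transforms: P matrices of shape (K, D) standing in for the learnt
    per-patch transformation dictionaries (hypothetical here).
    """
    scores = []
    for p, g, T in zip(probe, gallery, transforms):
        a, b = T @ p, T @ g  # map both patches into the patch's subspace
        scores.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.mean(scores))  # fuse scores at patch level, not holistically

rng = np.random.default_rng(1)
probe = rng.standard_normal((5, 8))         # 5 patches, 8-d descriptors
transforms = [np.eye(8) for _ in range(5)]  # placeholder dictionaries
same = patch_match_score(probe, probe, transforms)       # identical faces
opposite = patch_match_score(probe, -probe, transforms)  # anti-correlated descriptors
```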
Human–Object Interaction (HOI) detection is important to human-centric scene understanding tasks. Existing works tend to assume that the same verb has similar visual characteristics in different HOI categories, an approach that ignores the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net) that decodes the visual polysemy of verbs for HOI detection in three distinct ways. First, we refine features for HOI detection to be polysemy-aware through the use of two novel modules, namely, Language Prior-guided Channel Attention (LPCA) and Language Prior-based Feature Augmentation (LPFA). LPCA highlights important elements in human and object appearance features for each HOI category to be identified; moreover, LPFA augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce intra-class variation for the same verb. Second, we introduce a novel Polysemy-Aware Modal Fusion module, which guides PD-Net to make decisions based on feature types deemed more important according to the language priors. Third, we propose to relieve the verb polysemy problem by sharing verb classifiers among semantically similar HOI categories. Furthermore, to expedite research on the verb polysemy problem, we build a new benchmark dataset named HOI-VerbPolysemy (HOI-VP), which includes common verbs (predicates) that have diverse semantic meanings in the real world. Finally, through deciphering the visual polysemy of verbs, our approach is demonstrated to outperform state-of-the-art methods by significant margins on the HICO-DET, V-COCO, and HOI-VP databases. Code and data in this paper are available at https://github.com/MuchHair/PD-Net.
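As a rough illustration of language prior-guided channel attention, the sketch below maps a word embedding of an HOI category to one sigmoid gate per appearance-feature channel. The function name, the 300-d embedding, the single linear layer, and the random parameters are hypothetical choices for illustration. They are not PD-Net's actual design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def language_prior_channel_attention(appearance, word_embedding, proj, bias):
    """Gate appearance-feature channels with a language prior (illustrative only).

    appearance: (C,) pooled human or object appearance feature.
    word_embedding: (E,) embedding of the HOI category text (hypothetical).
    proj, bias: (C, E) and (C,) parameters that would be learnt in practice.
    """
    attention = sigmoid(proj @ word_embedding + bias)  # one gate in (0, 1) per channel
    return appearance * attention, attention

rng = np.random.default_rng(2)
features = rng.random(256)            # non-negative, e.g. post-ReLU pooled features
embedding = rng.standard_normal(300)  # stand-in for a category word embedding
proj = rng.standard_normal((256, 300)) * 0.05
refined, att = language_prior_channel_attention(features, embedding, proj, np.zeros(256))
```

Because the gates depend on the category embedding, the same verb can emphasize different appearance channels in different HOI categories.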
Automatic facial age estimation can be used in a wide range of real-world applications. However, it is challenging due to the randomness and slowness of the aging process. Accordingly, in this paper, we propose a novel method aimed at overcoming the challenges associated with facial age estimation. First, we propose a novel age encoding method, referred to as 'Soft-ranking', which encodes two important properties of facial age, i.e., the ordinal property and the correlation between adjacent ages. Soft-ranking therefore provides a richer supervision signal for training deep models. Second, we carefully analyze existing evaluation protocols for age estimation, finding that the overlap in identity between the training and testing sets affects the relative performance of different age encoding methods. Finally, we achieve state-of-the-art performance on the four most popular age databases, i.e., Morph II, AgeDB, CLAP2015, and CLAP2016.
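The abstract names the two properties that Soft-ranking encodes but not its formula. One plausible way to obtain both properties, assumed here purely for illustration, starts from the classic ranking encoding, which has one binary "older than k" target per threshold k. A sigmoid with temperature `tau` then softens each target. The code stays monotone in k, which preserves the ordinal property, and adjacent ages receive similar codes, which captures their correlation. This may differ from the paper's actual definition. The range of 101 ages (0 to 100) is also an assumption.

```python
import numpy as np

def soft_ranking_encode(age, num_ages=101, tau=1.0):
    """Encode an age as K-1 soft 'older than k' targets (illustrative assumption).

    Element k approximates the binary ranking label [age > k]; a sigmoid
    replaces the hard step so that adjacent ages get similar codes.
    """
    k = np.arange(num_ages - 1)                          # thresholds 0 .. K-2
    return 1.0 / (1.0 + np.exp(-(age - k - 0.5) / tau))  # non-increasing in k

def soft_ranking_decode(code):
    """Estimate the age by summing the 'older than k' probabilities."""
    return float(np.sum(code))

code_40 = soft_ranking_encode(40)
code_41 = soft_ranking_encode(41)
```

With `tau = 1`, the codes for ages 40 and 41 differ by at most about 0.25 at any position, whereas hard ranking codes differ by 1 at a single position. Summing the code recovers the age away from the ends of the range.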
Pose-invariant face recognition (PIFR) refers to the ability to recognize face images with arbitrary pose variations. Among existing PIFR algorithms, pose normalization has proved to be an effective approach that preserves texture fidelity, but it usually depends on precise 3D face models or incurs high computational cost. In this paper, we propose a highly efficient PIFR algorithm that effectively handles the main challenges caused by pose variation. First, a dense grid of 3D facial landmarks is projected onto each 2D face image, which enables feature extraction in a pose-adaptive manner. Second, for the local patch around each landmark, an optimal warp is estimated based on homography to correct the texture deformation caused by pose variations. The reconstructed frontal-view patches are then used for face recognition with traditional face descriptors. The homography-based normalization is highly efficient, and the synthesized frontal face images are of high quality. Finally, we propose an effective approach for occlusion detection, which enables face recognition with visible patches only. The proposed algorithm therefore effectively handles the main challenges in PIFR. Experimental results on four popular face databases demonstrate that the proposed approach performs well in both constrained and unconstrained environments.
• We propose a highly efficient and accurate pose normalization approach for pose-invariant face recognition.
• This is the first time that homography is utilized for face synthesis.
• The proposed approach covers the full range of pose variations within ±90° of yaw.
• The proposed approach outperforms existing methods on four popular face databases.
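A per-patch homography can be estimated from point correspondences with the standard Direct Linear Transform (DLT). The sketch below fits a homography from four or more point pairs and maps points through it. It covers only this generic estimation step, not the paper's full normalization pipeline. The specific correspondences used (projected 3D landmarks and their frontal-view positions) and the pixel-level warping of each patch are left out.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: fit H with dst ~ H @ src from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # H is the null vector of A: the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the projective scale

def apply_homography(H, pts):
    """Map (N, 2) points through H using homogeneous coordinates."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

H_true = np.array([[1.1, 0.05, 3.0],
                   [0.02, 0.95, -2.0],
                   [0.001, 0.0005, 1.0]])
src = np.array([[0, 0], [40, 0], [40, 40], [0, 40], [20, 10]], dtype=float)
dst = apply_homography(H_true, src)
H_est = estimate_homography(src, dst)
```

A homography has eight degrees of freedom, so four non-collinear correspondences determine it exactly. Any additional pairs give a least-squares fit.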
Automated optical inspection (AOI) has been widely used in industrial Quality Assurance (QA) procedures. Multi-task inspection in high-speed AOI systems is becoming a significant design problem. In this paper, the design of an AOI system for E-shaped magnetic core elements is briefly described, and several novel algorithms are proposed to realize defect detection with this system. First, we propose a robust k-tSL-center clustering method to classify the interfaces of the element into normal and damaged areas. Second, a modified Active Shape Model (ASM) method is adopted to perform shape distortion detection in real time. Performance evaluations are carried out on an E-shaped Magnetic Core Image Database, in which all images are captured by the designed AOI system. Experimental results show that the proposed methods are more efficient, robust, and accurate than state-of-the-art methods in this application.
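The abstract does not define k-tSL-center clustering. Judging from its name, it is assumed here to be a robust variant of center-based clustering. The sketch below shows only the classic greedy (farthest-point) k-center algorithm with Euclidean distance, applied to hypothetical two-dimensional features of normal and damaged regions. The robust tSL distance is not reproduced.

```python
import numpy as np

def greedy_k_center(points, k, first=0):
    """Classic farthest-point (Gonzalez) k-center clustering.

    Returns the indices of the chosen centers and each point's cluster label.
    """
    centers = [first]
    dist = np.linalg.norm(points - points[first], axis=1)  # distance to nearest center so far
    for _ in range(1, k):
        nxt = int(np.argmax(dist))                          # farthest point becomes a new center
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    # Assign every point to its closest center.
    d_all = np.linalg.norm(points[:, None, :] - points[centers][None, :, :], axis=2)
    return centers, np.argmin(d_all, axis=1)

rng = np.random.default_rng(3)
normal = rng.normal(0.8, 0.02, size=(20, 2))  # hypothetical features of normal areas
damaged = rng.normal(0.2, 0.02, size=(5, 2))  # hypothetical features of damaged areas
pts = np.vstack([normal, damaged])
centers, labels = greedy_k_center(pts, k=2)
```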
Brain tumor segmentation from Magnetic Resonance Imaging scans is vital for both the diagnosis and treatment of brain cancers. It is widely accepted that accurate segmentation depends on multi-level information. However, existing deep architectures for brain tumor segmentation fail to explicitly encourage the models to learn high-quality hierarchical features. In this paper, we propose a series of approaches to enhance the quality of the learnt hierarchical features. Our contributions are fourfold. First, we extend the popular DeepMedic model to Multi-Level DeepMedic to make use of multi-level information for more accurate segmentation. Second, we propose a novel dual-force training scheme to promote the quality of multi-level features learnt from deep models. It is a general training scheme that can be applied to many existing architectures, e.g., DeepMedic and U-Net. Third, we design a label distribution-based loss function as an auxiliary classifier to encourage the high-level layers of deep models to learn more abstract information. Finally, we propose a novel Multi-Layer Perceptron-based post-processing approach to refine the prediction results of deep models. Extensive experiments are conducted on the two most recent brain tumor segmentation datasets, i.e., BRATS 2017 and BRATS 2015. Results on the two datasets indicate that the proposed approaches consistently improve the segmentation performance of the two popular deep models.
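One way to read the label distribution-based auxiliary loss, assumed here for illustration, compares two distributions with the KL divergence. The first is the class proportions inside a ground-truth patch. The second is a softmax prediction from an auxiliary head. The class count, the KL formulation, and the function names are hypothetical.

```python
import numpy as np

def label_distribution(label_patch, num_classes):
    """Fraction of voxels belonging to each class in a ground-truth patch."""
    counts = np.bincount(label_patch.ravel(), minlength=num_classes)
    return counts / counts.sum()

def label_distribution_loss(pred_logits, label_patch, eps=1e-12):
    """KL(target || prediction) between the patch's class proportions and a
    softmax over auxiliary-head logits (hypothetical formulation)."""
    target = label_distribution(label_patch, len(pred_logits))
    pred = np.exp(pred_logits - pred_logits.max())  # numerically stable softmax
    pred /= pred.sum()
    mask = target > 0  # classes absent from the patch contribute 0 * log 0 = 0
    return float(np.sum(target[mask] * np.log(target[mask] / (pred[mask] + eps))))

patch = np.array([[0, 0, 1],
                  [0, 2, 2],
                  [0, 0, 3]])  # toy 2-D "patch" with 4 classes
perfect = label_distribution_loss(np.log(label_distribution(patch, 4)), patch)
uniform = label_distribution_loss(np.zeros(4), patch)
```

Because the target summarizes a whole patch rather than individual voxels, such a head must capture more global, abstract information. That matches the stated goal for the high-level layers.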
Class imbalance has emerged as one of the major challenges for medical image segmentation. The model cascade (MC) strategy, a popular scheme, significantly alleviates the class imbalance issue by running a set of individual deep models for coarse-to-fine segmentation. Despite its outstanding performance, however, this method leads to undesired system complexity and also ignores the correlation among the models. To handle these flaws in the MC approach, we propose in this paper a lightweight deep model, the One-pass Multi-task Network (OM-Net), which solves class imbalance better than MC does while requiring only one-pass computation for brain tumor segmentation. First, OM-Net integrates the separate segmentation tasks into one deep model, which consists of shared parameters to learn joint features, as well as task-specific parameters to learn discriminative features. Second, to optimize OM-Net more effectively, we take advantage of the correlation among tasks to design both an online training data transfer strategy and a curriculum learning-based training strategy. Third, we further propose sharing prediction results between tasks, which enables us to design a cross-task guided attention (CGA) module. Following the guidance of the prediction results provided by the previous task, CGA can adaptively recalibrate channel-wise feature responses based on category-specific statistics. Finally, a simple yet effective post-processing method is introduced to refine the segmentation results of the proposed attention network. Extensive experiments are conducted to demonstrate the effectiveness of the proposed techniques. Most impressively, we achieve state-of-the-art performance on the BraTS 2015 testing set and the BraTS 2017 online validation set. Using these approaches, we also won joint third place in the BraTS 2018 challenge among 64 participating teams. The code is publicly available at https://github.com/chenhong-zhou/OM-Net.
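The cross-task guided attention step can be sketched, under several assumptions, as squeeze-and-excitation-style gating. Channel statistics are pooled separately over the region the previous task predicts as foreground and over its complement. The two masked means, the single linear layer, and the random parameters are hypothetical. Only the idea of recalibrating channels with category-specific statistics comes from the abstract.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_task_guided_attention(features, prev_prob, proj, bias, eps=1e-6):
    """Channel recalibration guided by the previous task's prediction (sketch).

    features: (C, H, W) feature maps of the current task.
    prev_prob: (H, W) foreground probability map from the previous task.
    proj, bias: (C, 2C) and (C,) parameters, learnt in practice.
    """
    fg_mean = (features * prev_prob).sum(axis=(1, 2)) / (prev_prob.sum() + eps)
    bg_mean = (features * (1 - prev_prob)).sum(axis=(1, 2)) / ((1 - prev_prob).sum() + eps)
    # One gate per channel, computed from category-specific statistics.
    gates = sigmoid(proj @ np.concatenate([fg_mean, bg_mean]) + bias)
    return features * gates[:, None, None], gates

rng = np.random.default_rng(4)
feats = rng.standard_normal((8, 16, 16))
prob = rng.random((16, 16))  # stand-in for the previous task's output
proj = rng.standard_normal((8, 16)) * 0.1
out, gates = cross_task_guided_attention(feats, prob, proj, np.zeros(8))
```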