Convolutional Neural Networks (ConvNets) have recently shown promising performance in many computer vision tasks, especially image-based recognition. How to effectively apply ConvNets to sequence-based data is still an open problem. This paper proposes an effective yet simple method that represents the spatio-temporal information carried in 3D skeleton sequences as three 2D images, referred to as Joint Trajectory Maps (JTMs), by encoding the joint trajectories and their dynamics into color distributions in the images, and adopts ConvNets to learn discriminative features for human action recognition. Such an image-based representation enables us to fine-tune existing ConvNet models for the classification of skeleton sequences without training the networks afresh. The three JTMs are generated in three orthogonal planes and provide complementary information to each other. The final recognition is further improved through multiplicative score fusion of the three JTMs. The proposed method was evaluated on four public benchmark datasets, namely the large NTU RGB+D Dataset, the MSRC-12 Kinect Gesture Dataset (MSRC-12), the G3D Dataset, and the UTD Multimodal Human Action Dataset (UTD-MHAD), and achieved state-of-the-art results.
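As a concrete illustration of the trajectory-to-image idea, the sketch below projects 3D joint positions onto one orthogonal plane and colors each drawn point by its normalized timestamp. The function name, color ramp, and array layout are our assumptions, not the authors' implementation:

```python
# Hypothetical JTM-style encoding: project 3D joints onto a 2D plane and
# color each trajectory point by its normalized frame index (assumes T > 1).
import numpy as np

def joint_trajectory_map(skeleton, plane=(0, 1), size=256):
    """skeleton: array of shape (T, J, 3) -- T frames, J joints, xyz coords.
    Returns an RGB image (size, size, 3) with time-colored trajectories."""
    T, J, _ = skeleton.shape
    pts = skeleton[:, :, plane]                       # orthogonal-plane projection
    lo, hi = pts.min(axis=(0, 1)), pts.max(axis=(0, 1))
    pix = ((pts - lo) / (hi - lo + 1e-8) * (size - 1)).astype(int)
    img = np.zeros((size, size, 3), dtype=np.float32)
    for t in range(T):
        # simple color ramp: early frames red, late frames blue
        c = np.array([1.0 - t / (T - 1), 0.2, t / (T - 1)])
        for j in range(J):
            x, y = pix[t, j]
            img[y, x] = c                             # draw joint at this frame
    return img
```

Generating the map in each of the three orthogonal planes (plane=(0,1), (0,2), (1,2)) would yield the three complementary JTMs described above.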
Utilizing deep neural networks to learn and extract valuable features from synthetic aperture radar (SAR) images is a feasible and promising approach to SAR automatic target recognition (ATR). However, it is difficult to effectively train deep neural networks with only a limited number of raw SAR images. In this paper, we propose a new multiview deep learning framework for SAR ATR. Based on the multiview SAR ATR pattern, we first present a flexible means of generating adequate multiview SAR data, which guarantees a large number of inputs for network training without requiring many raw SAR images. Then, a deep convolutional neural network with a parallel network topology and multiple inputs is adopted. The features of input SAR images from different views are learned by the proposed network layer by layer; meanwhile, the learned features from the distinct views are fused progressively in different layers. The proposed framework therefore achieves superior recognition performance while requiring only a small number of raw SAR images to generate network training samples. Experimental results on the Moving and Stationary Target Acquisition and Recognition data set demonstrate the superiority of the proposed framework.
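A minimal sketch of a parallel multiview topology with fused features might look as follows; the layer sizes, fusion-by-concatenation scheme, and names are illustrative assumptions rather than the paper's exact network:

```python
# Minimal sketch (not the authors' network) of a parallel multiview CNN:
# one convolutional stream per view, features fused by channel concatenation.
import torch
import torch.nn as nn

class MultiviewSARNet(nn.Module):
    def __init__(self, n_views=3, n_classes=10):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2))
            for _ in range(n_views)])
        # fusion layer operates on the concatenated per-view feature maps
        self.fuse = nn.Sequential(nn.Conv2d(16 * n_views, 32, 3, padding=1),
                                  nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, n_classes)

    def forward(self, views):                 # views: list of (B, 1, H, W) tensors
        feats = [s(v) for s, v in zip(self.streams, views)]
        fused = torch.cat(feats, dim=1)       # fuse learned features across views
        return self.head(self.fuse(fused).flatten(1))
```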
This paper describes novel event-based spatio-temporal features called time-surfaces and how they can be used to create a hierarchical event-based pattern recognition architecture. Unlike existing hierarchical architectures for pattern recognition, the presented model relies on a time-oriented approach to extract spatio-temporal features from the asynchronously acquired dynamics of a visual scene. These dynamics are acquired using biologically inspired frameless asynchronous event-driven vision sensors. Similarly to cortical structures, subsequent layers in our hierarchy extract increasingly abstract features using increasingly large spatio-temporal windows. The central concept is to use the rich temporal information provided by events to create contexts in the form of time-surfaces, which represent the recent temporal activity within a local spatial neighborhood. We demonstrate that this concept can be used robustly at all stages of an event-based hierarchical model. First-layer feature units operate on groups of pixels, while subsequent-layer feature units operate on the output of lower-level feature units. We report results on a previously published 36-class character recognition task and a four-class canonical dynamic card pip task, achieving near 100 percent accuracy on each. We introduce a new seven-class moving face recognition task, on which we achieve 79 percent accuracy.
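The core time-surface computation can be illustrated in a few lines of NumPy; the function name, decay constant, and patch-extraction details below are our assumptions:

```python
# Illustrative time-surface: the value at each neighboring pixel is an
# exponential decay of the time elapsed since that pixel last emitted an event.
import numpy as np

def time_surface(last_spike_times, t_now, center, radius=2, tau=50e-3):
    """last_spike_times: (H, W) array of most recent event timestamps per pixel.
    Returns a (2r+1, 2r+1) patch of exp(-(t_now - t_last)/tau) around `center`
    (assumes `center` lies at least `radius` pixels from the border)."""
    y, x = center
    patch = last_spike_times[y - radius:y + radius + 1, x - radius:x + radius + 1]
    return np.exp(-(t_now - patch) / tau)    # recent activity -> values near 1
```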
Traditional image representations are not suited to conventional classification methods such as linear discriminant analysis (LDA) because of the undersample problem (USP): the dimensionality of the feature space is much higher than the number of training samples. Motivated by the successes of two-dimensional LDA (2DLDA) for face recognition, we develop a general tensor discriminant analysis (GTDA) as a preprocessing step for LDA. The benefits of GTDA, compared with existing preprocessing methods such as principal components analysis (PCA) and 2DLDA, include the following: 1) the USP is reduced in subsequent classification by, for example, LDA; 2) the discriminative information in the training tensors is preserved; and 3) GTDA provides stable recognition rates because the alternating projection optimization algorithm used to obtain a solution of GTDA converges, whereas that of 2DLDA does not. We use human gait recognition to validate the proposed GTDA. Averaged gait images are utilized for gait representation. Given the popularity of Gabor-function-based image decompositions for image understanding and object recognition, we develop three different Gabor-function-based image representations: 1) GaborD is the sum of Gabor filter responses over directions, 2) GaborS is the sum of Gabor filter responses over scales, and 3) GaborSD is the sum of Gabor filter responses over scales and directions. The GaborD, GaborS, and GaborSD representations are applied to the problem of recognizing people from their averaged gait images. A large number of experiments were carried out to evaluate the effectiveness (recognition rate) of gait recognition based on first obtaining a Gabor, GaborD, GaborS, or GaborSD image representation, then using GTDA to extract features, and finally using LDA for classification. The proposed methods achieved good performance for gait recognition based on image sequences from the University of South Florida (USF) HumanID Database. Experimental comparisons are made with nine state-of-the-art classification methods in gait recognition.
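To make the alternating-projection idea concrete, the following toy sketch applies it to matrix-valued (second-order tensor) data, alternating between row and column discriminant projections. This is a simplified 2DLDA-style illustration under our own naming, not the full GTDA algorithm:

```python
# Toy alternating-projection discriminant analysis for (N, H, W) matrix data.
import numpy as np

def _top_eigvecs(Sw, Sb, d):
    # leading eigenvectors of pinv(Sw) @ Sb (discriminant directions)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:d]].real

def alternating_discriminant(X, y, d1=5, d2=5, iters=5):
    """X: (N, H, W) samples; y: (N,) labels. Returns row/column projections."""
    N, H, W = X.shape
    U, V = np.eye(H)[:, :d1], np.eye(W)[:, :d2]
    classes, mean_all = np.unique(y), X.mean(axis=0)
    for _ in range(iters):
        for mode in (0, 1):                      # 0: solve U, 1: solve V
            dim = H if mode == 0 else W
            Sb, Sw = np.zeros((dim, dim)), np.zeros((dim, dim))
            for c in classes:
                Xc = X[y == c]; mc = Xc.mean(axis=0)
                # project along the *other* mode before building scatters
                D = (mc - mean_all) @ V if mode == 0 else (mc - mean_all).T @ U
                Sb += len(Xc) * D @ D.T
                for x in Xc:
                    E = (x - mc) @ V if mode == 0 else (x - mc).T @ U
                    Sw += E @ E.T
            if mode == 0:
                U = _top_eigvecs(Sw, Sb, d1)
            else:
                V = _top_eigvecs(Sw, Sb, d2)
    return U, V
```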
Recently, still image-based human action recognition has become an active research topic in computer vision and pattern recognition. It focuses on identifying a person's action or behavior from a single image. Unlike traditional action recognition approaches, where videos or image sequences are used, a still image contains no temporal information for action characterization. Thus the prevailing spatiotemporal features for video-based action analysis are not appropriate for still image-based action recognition. It is more challenging to perform still image-based action recognition than video-based recognition, given the limited source of information as well as the cluttered backgrounds of images collected from the Internet. On the other hand, a large number of still images exist on the Internet. Therefore, robust and efficient methods for still image-based action recognition are in demand so that web images can be better understood for image retrieval or search. Building on the emerging research of recent years, it is time to review the existing approaches to still image-based action recognition and inspire more efforts to advance the field. We present a detailed overview of the state-of-the-art methods for still image-based action recognition, and categorize and describe various high-level cues and low-level features for action analysis in still images. All related databases are introduced in detail. Finally, we give our views and thoughts for future research.
• A comprehensive survey of research on still image-based action recognition is conducted for the first time.
• Existing approaches are organized into clearly defined categories.
• Thoughts on future research are presented to inspire new efforts in the field.
In unsupervised open set domain adaptation (UOSDA), the target domain contains unknown classes that are not observed in the source domain. Researchers in this area aim to train a classifier to accurately: 1) recognize unknown target data (data with unknown classes) and 2) classify other target data. To achieve this aim, a previous study proved an upper bound of the target-domain risk, in which the open set difference, an important term in the bound, is used to measure the risk on unknown target data. By minimizing the upper bound, a shallow classifier can be trained to achieve the aim. However, if the classifier is very flexible, e.g., a deep neural network (DNN), the open set difference converges to a negative value when the upper bound is minimized, which causes an issue where most target data are recognized as unknown data. To address this issue, we propose a new upper bound of the target-domain risk for UOSDA, which includes four terms: the source-domain risk, the $\epsilon$-open set difference ($\Delta_\epsilon$), the distributional discrepancy between domains, and a constant. Compared with the open set difference, $\Delta_\epsilon$ is more robust against this issue when it is being minimized, and thus we are able to use very flexible classifiers (i.e., DNNs). Then, we propose a new principle-guided deep UOSDA method that trains DNNs by minimizing the new upper bound. Specifically, the source-domain risk and $\Delta_\epsilon$ are minimized by gradient descent, and the distributional discrepancy is minimized via a novel open set conditional adversarial training strategy. Finally, compared with existing shallow and deep UOSDA methods, our method shows state-of-the-art performance on several benchmark datasets, including digit recognition (MNIST, SVHN, and USPS), object recognition (Office-31 and Office-Home), and face recognition (PIE).
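Schematically, the new bound has the form below; the notation is our paraphrase of the four terms named in the abstract, with precise definitions deferred to the paper:

\[
R_T(h) \;\le\; R_S(h) \;+\; \Delta_\epsilon(h) \;+\; d(\mathcal{D}_S, \mathcal{D}_T) \;+\; C,
\]

where $R_S(h)$ is the source-domain risk, $\Delta_\epsilon(h)$ the $\epsilon$-open set difference, $d(\mathcal{D}_S, \mathcal{D}_T)$ the distributional discrepancy between the source and target domains, and $C$ a constant.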
In this paper, we propose a novel deep learning framework, called the spatial-temporal recurrent neural network (STRNN), to integrate feature learning from both the spatial and temporal information of signal sources into a unified spatial-temporal dependency model. In STRNN, to capture the spatially co-occurrent variations of human emotions, a multidirectional recurrent neural network (RNN) layer is employed to capture long-range contextual cues by traversing the spatial regions of each temporal slice along different directions. A bidirectional temporal RNN layer is then used to learn discriminative features characterizing the temporal dependencies of the sequences produced by the spatial RNN layer. To further select the salient regions that are more discriminative for emotion recognition, we impose sparse projection onto the hidden states of the spatial and temporal domains to improve the model's discriminant ability. Consequently, the proposed two-layer RNN model provides an effective way to exploit both spatial and temporal dependencies of the input signals for emotion recognition. Experimental results on public electroencephalogram and facial expression emotion datasets demonstrate that the proposed STRNN method compares favorably with state-of-the-art methods.
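A minimal PyTorch sketch of the two-layer spatial-temporal idea is given below; for brevity it uses a single spatial scan direction instead of the paper's multidirectional traversal, and all dimensions and names are our assumptions:

```python
# Simplified two-layer spatial-temporal RNN: a spatial GRU scans the regions
# of each temporal slice, then a bidirectional GRU runs over the slice features.
import torch
import torch.nn as nn

class STRNNSketch(nn.Module):
    def __init__(self, in_dim=32, hid=64, n_classes=6):
        super().__init__()
        self.spatial = nn.GRU(in_dim, hid, batch_first=True)
        self.temporal = nn.GRU(hid, hid, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hid, n_classes)

    def forward(self, x):          # x: (B, T, R, F) -- T slices, R regions, F feats
        B, T, R, F = x.shape
        # spatial RNN traverses the R regions of each temporal slice
        _, h = self.spatial(x.reshape(B * T, R, F))
        slice_feats = h[-1].reshape(B, T, -1)
        # bidirectional temporal RNN over the sequence of slice features
        out, _ = self.temporal(slice_feats)
        return self.head(out[:, -1])
```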
Achieving the upper limits of face identification accuracy in forensic applications can minimize errors that have profound social and personal consequences. Although forensic examiners identify faces in these applications, systematic tests of their accuracy are rare. How can we achieve the most accurate face identification: using people and/or machines working alone or in collaboration? In a comprehensive comparison of face identification by humans and computers, we found that forensic facial examiners, facial reviewers, and superrecognizers were more accurate than fingerprint examiners and students on a challenging face identification test. Individual performance on the test varied widely. On the same test, four deep convolutional neural networks (DCNNs), developed between 2015 and 2017, identified faces within the range of human accuracy. Accuracy of the algorithms increased steadily over time, with the most recent DCNN scoring above the median of the forensic facial examiners. Using crowd-sourcing methods, we fused the judgments of multiple forensic facial examiners by averaging their rating-based identity judgments. Accuracy was substantially better for fused judgments than for individuals working alone. Fusion also served to stabilize performance, boosting the scores of lower-performing individuals and decreasing variability. A single forensic facial examiner fused with the best algorithm was more accurate than the combination of two examiners. Therefore, collaboration among humans and between humans and machines offers tangible benefits to face identification accuracy in important applications. These results offer an evidence-based roadmap for achieving the most accurate face identification possible.
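The fusion rule itself is simple averaging of rating-based judgments; a toy example with invented scores:

```python
# Toy illustration of rating fusion: average the identity ratings of several
# judges (human or algorithm) on each face pair. All numbers are invented.
import numpy as np

examiner_a = np.array([2.0, 1.5, -1.0])   # per-pair identity ratings
examiner_b = np.array([1.0, 2.5, -0.5])   # (positive = "same person")
algorithm  = np.array([1.8, 2.0, -1.2])

fused = np.mean([examiner_a, examiner_b, algorithm], axis=0)
print(fused)   # fused judgment per face pair
```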
Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning, and speech recognition. For the task of capturing temporal structure in video, however, numerous open research questions remain. Current research suggests using a simple temporal feature pooling strategy to take the temporal aspect of video into account. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative than in general video classification tasks. We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold: first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.
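A sketch of how temporal convolutions can be combined with bidirectional recurrence is shown below; the feature dimensions and the averaging of per-frame logits are our assumptions, not the paper's exact design:

```python
# Temporal 1D convolution (short-range motion) feeding a bidirectional LSTM
# (long-range temporal structure) over per-frame feature vectors.
import torch
import torch.nn as nn

class TempConvBiRNN(nn.Module):
    def __init__(self, feat_dim=128, n_classes=20):
        super().__init__()
        self.tconv = nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2)
        self.rnn = nn.LSTM(128, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                    # x: (B, T, feat_dim) frame features
        h = torch.relu(self.tconv(x.transpose(1, 2))).transpose(1, 2)
        out, _ = self.rnn(h)                 # bidirectional recurrence over time
        return self.head(out).mean(dim=1)    # per-frame logits, averaged
```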
With the increasing demand for information security and security regulations all over the world, biometric recognition technology has been widely used in everyday life. In this regard, multimodal biometrics technology has gained interest and become popular due to its ability to overcome a number of significant limitations of unimodal biometric systems. In this paper, a new multimodal biometric human identification system is proposed, based on a deep learning algorithm that recognizes humans using the biometric modalities of iris, face, and finger vein. The system is built on convolutional neural networks (CNNs), which extract features and classify images with a softmax classifier. To develop the system, three CNN models were combined: one for iris, one for face, and one for finger vein. Each CNN model was built on the well-known pretrained VGG-16 model, the Adam optimizer was applied, and categorical cross-entropy was used as the loss function. Techniques to avoid overfitting, such as image augmentation and dropout, were applied. To fuse the CNN models, different fusion approaches were employed to explore their influence on recognition performance; specifically, feature-level and score-level fusion approaches were applied. The performance of the proposed system was empirically evaluated through several experiments on the SDUMLA-HMT dataset, a multimodal biometrics dataset. The results demonstrate that using three biometric traits in a biometric identification system yields better results than using one or two traits. The results also show that our approach comfortably outperforms other state-of-the-art methods, achieving an accuracy of 99.39% with feature-level fusion and an accuracy of 100% with different methods of score-level fusion.
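Score-level fusion of the per-modality softmax outputs can be illustrated with standard sum and product rules; the scores below are invented for the example:

```python
# Score-level fusion of three per-modality softmax outputs (iris, face,
# finger vein) using the standard sum (average) and product rules.
import numpy as np

iris = np.array([0.70, 0.20, 0.10])   # softmax scores over 3 identities
face = np.array([0.60, 0.30, 0.10])
vein = np.array([0.50, 0.40, 0.10])

sum_fused = (iris + face + vein) / 3           # sum (average) rule
prod_fused = iris * face * vein
prod_fused /= prod_fused.sum()                 # product rule, renormalized

print(sum_fused.argmax(), prod_fused.argmax())  # predicted identity index
```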