Automated Facial Expression Recognition (FER) in the wild using deep neural networks is still challenging due to intra-class variations and inter-class similarities in facial images. Deep Metric Learning (DML) is among the widely used methods to deal with these issues by improving the discriminative power of the learned embedded features. This paper proposes an Adaptive Correlation (Ad-Corre) Loss to guide the network towards generating embedded feature vectors with high correlation for within-class samples and less correlation for between-class samples. Ad-Corre consists of three components called Feature Discriminator, Mean Discriminator, and Embedding Discriminator. We design the Feature Discriminator component to guide the network to create embedded feature vectors that are highly correlated if they belong to the same class and less correlated if they belong to different classes. In addition, the Mean Discriminator component leads the network to make the mean embedded feature vectors of different classes less similar to each other. We use the Xception network as the backbone of our model, and contrary to previous work, we propose an embedding feature space that contains k feature vectors. Then, the Embedding Discriminator component penalizes the network so that the embedded feature vectors it generates are dissimilar to one another. We trained our model using the combination of our proposed loss functions, called Ad-Corre Loss, jointly with the cross-entropy loss. We achieved a very promising recognition accuracy on AffectNet, RAF-DB, and FER-2013. Our extensive experiments and ablation study indicate the power of our method to cope well with challenging FER tasks in the wild. The code is available on GitHub.
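The within-class versus between-class correlation idea described above can be illustrated with a minimal sketch. It is not the Ad-Corre formulation from the paper: the function names `pearson` and `correlation_loss` are hypothetical, and the penalty (1 - r for same-class pairs, 1 + r for different-class pairs) is an assumption chosen only to show the direction in which such a loss pushes the embeddings.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    denom = math.sqrt(var_u * var_v)
    # A constant vector has no defined correlation; treat it as uncorrelated.
    return cov / denom if denom > 0 else 0.0

def correlation_loss(embeddings, labels):
    """Average pairwise penalty (assumed form, not the paper's loss).

    Same-class pairs are penalized by 1 - r (pushes r toward +1);
    different-class pairs are penalized by 1 + r (pushes r toward -1).
    """
    penalties = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            r = pearson(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                penalties.append(1.0 - r)
            else:
                penalties.append(1.0 + r)
    # With fewer than two samples there are no pairs to compare.
    return sum(penalties) / len(penalties) if penalties else 0.0

# Toy batch: the first two vectors are perfectly correlated,
# the third is perfectly anti-correlated with both.
batch = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
loss_good = correlation_loss(batch, [0, 0, 1])  # labels agree with the geometry
loss_bad = correlation_loss(batch, [0, 1, 0])   # labels contradict the geometry
```

In this toy batch the loss is lower when the labels match the correlation structure of the embeddings, which is the behavior a correlation-based metric loss is meant to reward.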
Facial landmark detection is a vital step for numerous facial image analysis applications. Although some deep learning-based methods have achieved good performance in this task, they are often not suitable for running on mobile devices. Such methods rely on networks with many parameters, which makes training and inference time-consuming. Training lightweight neural networks such as MobileNets is often challenging, and the resulting models might have low accuracy. Inspired by knowledge distillation (KD), this paper presents a novel loss function to train a lightweight Student network (e.g., MobileNetV2) for facial landmark detection. We use two Teacher networks, a Tolerant-Teacher and a Tough-Teacher, in conjunction with the Student network. The Tolerant-Teacher is trained using Soft-landmarks created by active shape models, while the Tough-Teacher is trained using the ground truth landmark points (aka Hard-landmarks). To utilize the facial landmark points predicted by the Teacher networks, we define an Assistive Loss (ALoss) for each Teacher network. Moreover, we define a loss function called KD-Loss that utilizes the facial landmark points predicted by the two pre-trained Teacher networks (EfficientNet-b3) to guide the lightweight Student network towards predicting the Hard-landmarks. Our experimental results on three challenging facial datasets show that the proposed architecture results in a better-trained Student network that can extract facial landmark points with high accuracy.
• Applying knowledge distillation for facial landmark points detection.
• Training lightweight convolutional neural networks for efficient face alignment.
• Using two Teachers (a tough network and a tolerant network) for training a Student network for face alignment.
This paper presents a deep learning method using Natural Language Processing (NLP) techniques to distinguish between Mild Cognitive Impairment (MCI) and Normal Cognitive (NC) conditions in older adults. We propose a framework that analyzes transcripts generated from video interviews collected within the I-CONECT study project, a randomized controlled trial aimed at improving cognitive functions through video chats. Our proposed NLP framework consists of two Transformer-based modules, namely Sentence Embedding (SE) and Sentence Cross Attention (SCA). First, the SE module captures contextual relationships between words within each sentence. Subsequently, the SCA module extracts temporal features from a sequence of sentences. These features are then used by a Multi-Layer Perceptron (MLP) to classify subjects as MCI or NC. To build a robust model, we propose a novel loss function, called InfoLoss, that considers the reduction in entropy obtained by observing each sequence of sentences to ultimately enhance the classification accuracy. The results of our comprehensive model evaluation using the I-CONECT dataset show that our framework can distinguish between MCI and NC with an average area under the curve of 84.75%.
• Introducing a novel deep learning method for cognitive impairment detection.
• Employing Natural Language Processing to analyze speech patterns.
• Distinguishing Mild Cognitive Impairment from Normal Cognitive conditions.
• Utilizing Transformer-based modules to capture contextual relationships.
• Extracting temporal features from video interview transcripts.
• Introducing InfoLoss to improve classification accuracy.
Sagittal cervical spine alignment measured on X-Ray is a key objective measure for clinicians caring for patients with a multitude of presenting symptoms. Despite its applications, no research is yet available in this field. This paper presents a framework for automatic detection of sagittal cervical spine landmark points. Inspired by UNet, we propose an encoder-decoder Convolutional Neural Network (CNN) called PoseNet. In developing our model, we first review the weaknesses of widely used regression loss functions such as the L1 and L2 losses. To address these issues, we propose a novel loss function specifically designed to improve the accuracy of the localization task under challenging situations (extreme neck pose, low or high brightness and illumination, X-Ray noise, etc.). We validate our model and loss function on a dataset of X-Ray images. The results show that our framework is capable of performing precise sagittal cervical spine landmark point detection even for challenging X-Ray images.
Background
Cognitive decline can affect speech, language, head pose, eye gaze, and facial expressions in older adults. Artificial Intelligence (AI) can assist in automated prediction and monitoring of the progress of cognitive decline. Our previous work shows that utilizing unimodal linguistic features or facial features separately can lead to the prediction of cognitive impairment. We hypothesize that utilizing multimodal features from both audio and video can lead to a more accurate AI algorithm in differentiating those with mild cognitive impairment (MCI) from those with normal cognition (NC).
Method
We utilize deep learning (DL) methods, specifically Transformers, to extract both linguistic and facial features to predict whether an older subject has MCI or NC. Data is collected through the Internet‐Based Conversational Engagement Clinical Trial (I‐CONECT) (NCT02871921), which aims to determine the effects of social interactions (video chats) on cognitive functions. Videos and the respective transcribed audio of three themes, Summertime (30 participants), Halloween (32 participants), and Self‐care (30 participants), are selected based on “Good” video quality. The dataset is balanced (half of the participants are diagnosed with MCI).
Facial features are extracted from video frames through a convolutional Autoencoder, and then a Bidirectional Encoder Representations from Transformers (BERT) model captures temporal facial information. Linguistic features are generated by applying the DistilBERT language model to the transcribed audio. We used a 10‐fold cross‐validation approach to train and test the models separately. For each modality, the probability scores given by the transformers are fused using a majority voting method.
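The score-level fusion step can be sketched as follows. This is a minimal illustration of majority voting over probability scores, not the study's implementation: the function name `majority_vote`, the 0.5 decision threshold, and the tie-breaking rule (falling back to the mean probability) are all assumptions introduced here.

```python
def majority_vote(mci_probabilities, threshold=0.5):
    """Fuse per-model MCI probabilities into one label by majority voting.

    Each score at or above the (assumed) threshold counts as one vote
    for "MCI"; every other score counts as one vote for "NC".
    """
    if not mci_probabilities:
        raise ValueError("at least one probability score is required")
    mci_votes = sum(1 for p in mci_probabilities if p >= threshold)
    nc_votes = len(mci_probabilities) - mci_votes
    if mci_votes != nc_votes:
        return "MCI" if mci_votes > nc_votes else "NC"
    # Tie-break (assumption): compare the mean probability to the threshold.
    mean_p = sum(mci_probabilities) / len(mci_probabilities)
    return "MCI" if mean_p >= threshold else "NC"

# Hypothetical scores from three models for one participant.
fused_a = majority_vote([0.9, 0.4, 0.7])   # two of three votes for MCI
fused_b = majority_vote([0.2, 0.6, 0.1])   # two of three votes for NC
fused_tie = majority_vote([0.9, 0.4])      # tie, resolved by the mean (0.65)
```

An odd number of voters avoids ties altogether; the mean-probability fallback is only one of several reasonable ways to resolve them.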
Result
Accuracy, F1 score, precision, recall and area under the curve (AUC) are the evaluation metrics in this study. Score‐level fusion outperforms previous non‐hybrid methods of MCI prediction. Audio‐visual fusion leads to promising results for Summertime, Halloween and Self‐care with accuracy of 84.8%, 85%, and 84.5% and AUC of 86%, 86.1%, and 85.7%, respectively, which are marked improvements from the previous findings (48% accuracy using linguistic features alone and 67.8% using facial features alone).
Conclusion
The results of our study show that the proposed multimodal approach improves the accuracy of the AI/ML model in distinguishing MCI from NC. That is, audio‐visual fusion could outperform methods that rely on a single modality.
Capsule Network is powerful at defining the positional relationship between features in deep neural networks for visual recognition tasks, but it is computationally expensive and not suitable for running on mobile devices. The bottleneck is in the computational complexity of the Dynamic Routing mechanism used between the capsules. On the other hand, XNOR-Net is fast and computationally efficient, though it suffers from low accuracy due to information loss in the binarization process. To address the computational burdens of the Dynamic Routing mechanism, this paper proposes new Fully Connected (FC) layers by xnorizing the linear projection outside or inside the Dynamic Routing within the CapsFC layer. Specifically, our proposed FC layers have two versions, XnODR (Xnorize the Linear Projection Outside Dynamic Routing) and XnIDR (Xnorize the Linear Projection Inside Dynamic Routing). To test the generalization of both XnODR and XnIDR, we insert them into two different networks, MobileNetV2 and ResNet-50. Our experiments on three datasets, MNIST, CIFAR-10, and MultiMNIST, validate their effectiveness. The results demonstrate that both XnODR and XnIDR help networks to have high accuracy with lower FLOPs and fewer parameters (e.g., 96.14% correctness with 2.99M parameters and 311.74M FLOPs on CIFAR-10).
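To make the binarization trade-off concrete, here is a minimal sketch of an XNOR-Net style approximation of a linear projection, in which both operands are replaced by their signs and rescaled by their mean absolute values. It does not show how XnODR or XnIDR place this step relative to Dynamic Routing; the helper names `binarize`, `xnor_dot`, and `xnor_linear` are hypothetical.

```python
def binarize(vec):
    """Return (signs, scale): elementwise signs in {-1, +1} and the mean |v|."""
    scale = sum(abs(v) for v in vec) / len(vec)
    signs = [1.0 if v >= 0 else -1.0 for v in vec]
    return signs, scale

def xnor_dot(x, w):
    """Approximate the dot product x . w from binarized operands.

    The sign-only dot product is what hardware can evaluate with XNOR and
    popcount; the two scales restore the overall magnitude.
    """
    bx, beta = binarize(x)
    bw, alpha = binarize(w)
    return alpha * beta * sum(a * b for a, b in zip(bx, bw))

def xnor_linear(x, weight_rows):
    """Binarized linear projection: one approximate dot product per row."""
    return [xnor_dot(x, row) for row in weight_rows]

# Uniform magnitudes: the approximation matches the exact dot product (4.0).
exact_case = xnor_dot([0.5, -0.5, 0.5, -0.5], [2.0, -2.0, 2.0, -2.0])

# Mixed magnitudes: the exact dot product is -4.0, but the signs cancel.
lossy_case = xnor_dot([1.0, -3.0], [2.0, 2.0])
```

The second case shows the information loss the abstract refers to: once magnitudes are discarded, inputs with very different values can produce the same binarized result.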
Although deep neural networks have achieved reasonable accuracy in solving face alignment, it is still a challenging task, especially when dealing with facial images under occlusion or extreme head poses. Heatmap-based Regression (HBR) and Coordinate-based Regression (CBR) are the two main methods used for face alignment. CBR methods require less computer memory, though they perform worse than HBR methods. In this paper, we propose an Adaptive Coordinate-based Regression (ACR) loss to improve the accuracy of CBR for face alignment. Inspired by the Active Shape Model (ASM), we generate Smooth-Face objects, a set of facial landmark points with fewer variations compared to the ground truth landmark points. We then introduce a method to estimate the level of difficulty in predicting each landmark point for the network by comparing the distribution of the ground truth landmark points and the corresponding Smooth-Face objects. Our proposed ACR Loss can adaptively modify its curvature and the influence of the loss based on the difficulty level of predicting each landmark point in a face. Accordingly, the ACR Loss guides the network to focus more on challenging points than on easier points, which improves the accuracy of the face alignment task. Our extensive evaluation shows the capabilities of the proposed ACR Loss in predicting facial landmark points in various facial images.
Active Shape Model (ASM) is a statistical model of object shapes that represents a target structure. ASM can guide machine learning algorithms to fit a set of points representing an object (e.g., face) onto an image. This paper presents a lightweight Convolutional Neural Network (CNN) architecture with an ASM-assisted loss function for face alignment and head pose estimation in the wild. We use ASM to first guide the network towards learning a smoother distribution of the facial landmark points. Inspired by transfer learning, during the training process, we gradually harden the regression problem and guide the network towards learning the original landmark points distribution. We define multiple tasks in our loss function that are responsible for detecting facial landmark points as well as estimating the face pose. Learning multiple correlated tasks simultaneously builds synergy and improves the performance of individual tasks. We compare the performance of our proposed model, called ASMNet, with MobileNetV2 (which is about 2 times bigger than ASMNet) in both the face alignment and pose estimation tasks. Experimental results on challenging datasets show that by using the proposed ASM-assisted loss function, the ASMNet performance is comparable with MobileNetV2 in the face alignment task. In addition, for face pose estimation, ASMNet performs much better than MobileNetV2. ASMNet achieves an acceptable performance for facial landmark points detection and pose estimation while having a significantly smaller number of parameters and floating-point operations compared to many CNN-based models.