Emotion recognition is attracting the attention of the research community due to the multiple areas where it can be applied, such as healthcare or road-safety systems. In this paper, we propose a multimodal emotion recognition system that relies on speech and facial information. For the speech-based modality, we evaluated several transfer-learning techniques, more specifically embedding extraction and fine-tuning. The best accuracy was achieved when we fine-tuned the CNN-14 of the PANNs framework, confirming that training was more robust when it did not start from scratch and the tasks were similar. Regarding the facial emotion recognizer, we propose a framework that consists of a Spatial Transformer Network pre-trained on saliency maps and facial images, followed by a bi-LSTM with an attention mechanism. The error analysis showed that frame-based systems can present problems when used directly to solve a video-based task, even after domain adaptation, which opens a new line of research into ways to correct this mismatch while exploiting the knowledge embedded in these pre-trained models. Finally, by combining these two modalities with a late fusion strategy, we achieved 80.08% accuracy on the RAVDESS dataset under a subject-wise 5-CV evaluation, classifying eight emotions. The results reveal that these modalities carry relevant information about the user's emotional state and that their combination improves system performance.
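To make the late-fusion step concrete, the sketch below averages the per-modality class posteriors; this is a minimal illustration, assuming each recognizer outputs a softmax vector over the eight RAVDESS emotions. The function name and the weight `w` are hypothetical, not taken from the paper.

```python
import numpy as np

def late_fusion(p_speech: np.ndarray, p_face: np.ndarray, w: float = 0.5) -> int:
    """Weighted average of per-modality class posteriors.

    p_speech, p_face: softmax probability vectors over the 8 RAVDESS emotions.
    w: weight given to the speech modality (hypothetical value; in practice
       it would be tuned on validation data).
    """
    fused = w * p_speech + (1.0 - w) * p_face
    return int(np.argmax(fused))

# Toy usage with made-up posteriors over 8 classes.
p_speech = np.array([0.05, 0.10, 0.40, 0.05, 0.10, 0.10, 0.10, 0.10])
p_face   = np.array([0.05, 0.05, 0.55, 0.05, 0.05, 0.10, 0.05, 0.10])
print(late_fusion(p_speech, p_face))  # index of the predicted emotion
```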
In this survey, a systematic literature review of the state of the art in emotion expression recognition from facial images is presented. The paper's main objective is to identify the most commonly used strategies for interpreting and recognizing facial emotion expressions published over the past few years. For this purpose, a total of 51 papers covering 94 distinct methods were analyzed, collected from well-established scientific databases (ACM Digital Library, IEEE Xplore, Science Direct, and Scopus), and the works were categorized according to their main construction concept. The analyzed works fall into two main trends: classical approaches and those designed around neural networks. The statistical analysis showed marginally better recognition precision for the classical approaches compared with their neural-network counterparts, but with a reduced capacity for generalization. Additionally, the present study surveyed the most popular datasets for facial expression and emotion recognition, showing the pros and cons of each and thereby demonstrating a real demand for reliable data sources covering both artificial and natural experimental environments.
Speech emotion recognition (SER) is a difficult task due to the complexity of emotions. SER performance depends heavily on the effectiveness of the emotional features extracted from speech. However, most emotional features are sensitive to emotionally irrelevant factors, such as the speaker, speaking style, and environment. In this letter, we assume that calculating the deltas and delta-deltas of personalized features not only preserves the effective emotional information but also reduces the influence of emotionally irrelevant factors, thereby reducing misclassification. In addition, SER often suffers from silent frames and emotionally irrelevant frames. Meanwhile, attention mechanisms have exhibited outstanding performance in learning relevant feature representations for specific tasks. Inspired by this, we propose a three-dimensional attention-based convolutional recurrent neural network to learn discriminative features for SER, where the Mel-spectrogram with deltas and delta-deltas is used as input. Experiments on the IEMOCAP and Emo-DB corpora demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance in terms of unweighted average recall.
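The input representation described in this letter, a log-Mel spectrogram stacked with its deltas and delta-deltas, can be sketched with librosa as below; the sampling rate and number of Mel bands are illustrative defaults, not the paper's exact configuration.

```python
import numpy as np
import librosa

def three_channel_logmel(path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    """Build a static / delta / delta-delta input of the kind described above.

    Returns an array of shape (3, n_mels, frames). sr and n_mels are
    assumed values, not taken from the paper.
    """
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)              # static log-Mel channel
    d1 = librosa.feature.delta(logmel, order=1)    # deltas
    d2 = librosa.feature.delta(logmel, order=2)    # delta-deltas
    return np.stack([logmel, d1, d2], axis=0)
```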
Speech emotion recognition involves analyzing the vocal changes caused by emotions through acoustic analysis and determining which features to use for emotion recognition. The number of features obtained from acoustic analysis can become very large, depending on the number of acoustic parameters used and the statistical variations of these parameters. Not all of these features are effective for emotion recognition; moreover, different emotions may affect different vocal features. For this reason, feature selection methods are used to increase emotion recognition success and reduce the workload with fewer features. Existing feature selection methods do not reliably improve emotion recognition success, and some of them increase the total workload. In this study, a new statistical feature selection method is proposed based on how emotions alter acoustic features. The success of the proposed method is compared with methods commonly used in the literature, in terms of both the number of features and emotion recognition success. According to the results obtained, the proposed method provides a significant reduction in the number of features while increasing classification success.
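The paper's specific statistic is not reproduced here, but the general workflow of statistical feature selection, scoring each acoustic feature, keeping the strongest, then classifying, can be sketched with an ANOVA F-test as a stand-in; the value of k and the synthetic data are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for a large acoustic feature matrix (utterances x features).
X, y = make_classification(n_samples=400, n_features=300, n_informative=30,
                           random_state=0)

# Keep the k features whose class-conditional means differ most (ANOVA F-test),
# then classify; k = 40 is a hypothetical value to be tuned on validation data.
clf = make_pipeline(SelectKBest(f_classif, k=40), SVC())
print(cross_val_score(clf, X, y, cv=5).mean())
```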
Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have paid limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the arousal, dominance, and valence dimensions of MSP-Podcast, additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without the use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Our investigations reveal that transformer-based architectures are more robust than a CNN-based baseline and fair with respect to gender groups, but not towards individual speakers. Finally, we show that their success on valence is based on implicit linguistic information, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. To make our findings reproducible, we release the best-performing model to the community.
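The reported metric, the concordance correlation coefficient, has a closed form, CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2), sketched below on toy data; the target and prediction arrays are made up for illustration.

```python
import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Concordance correlation coefficient, the metric reported above."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

# Toy check: predictions close to the targets give a CCC near 1.
rng = np.random.default_rng(0)
target = rng.uniform(-1, 1, 200)         # e.g. valence annotations
pred = target + rng.normal(0, 0.1, 200)  # hypothetical model output
print(round(ccc(pred, target), 3))
```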
Emotion recognition via gait analysis is an active and key area of research because of its significant academic and commercial potential. With recent developments in hardware technology, inertial sensors allow researchers to effectively capture human motion data for gait analysis. To this end, the aim of this paper is to identify emotions from the inertial signals of human gait recorded by body-mounted smartphones. We extracted a manually crafted set of features from the human inertial gait data, which is used to train predictors of human emotion. Specifically, we collected the inertial gait data of 40 volunteers using a smartphone's on-board inertial measurement units (3D accelerometer, 3D gyroscope), with the phone attached at the chest, under six basic emotions: sadness, happiness, anger, surprise, disgust, and fear. Using stride-based segmentation, the raw signals are first decomposed into individual strides. For each stride, a set of 296 spectro-temporal features is computed and fed into two supervised learning predictors, namely Support Vector Machines and Random Forests. The classification results obtained with the proposed methodology, validated with a k-fold cross-validation procedure, show a classification accuracy of 95% for binary emotions and 86% for all six emotion categories.
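A rough sketch of this pipeline, stride segmentation on the accelerometer magnitude followed by per-stride features and the two classifiers, is given below; the peak-detection parameters, the handful of features, and the random training data are placeholders for the paper's 296 features and real recordings.

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def segment_strides(acc_mag: np.ndarray, fs: int = 100) -> list:
    """Split an accelerometer-magnitude signal into strides at its peaks.
    The sampling rate fs and the minimum peak distance are assumed values."""
    peaks, _ = find_peaks(acc_mag, distance=fs // 2)
    return [acc_mag[a:b] for a, b in zip(peaks[:-1], peaks[1:])]

def stride_features(stride: np.ndarray) -> np.ndarray:
    """A few temporal and spectral descriptors per stride; the paper
    computes 296 such features, only five are sketched here."""
    spec = np.abs(np.fft.rfft(stride))
    return np.array([stride.mean(), stride.std(), stride.max() - stride.min(),
                     spec.argmax(), spec.mean()])

# Hypothetical training data: rows are per-stride feature vectors,
# y holds emotion labels collected alongside the recordings.
X = np.random.rand(120, 5)
y = np.random.randint(0, 6, 120)
for model in (SVC(), RandomForestClassifier()):
    model.fit(X, y)  # in practice, evaluate with k-fold cross-validation
```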
•A new Deep Fully Connected model for facial emotion recognition.
•The model is evaluated across multiple hyperparameter settings and real-world datasets.
•The model is applied to real-time events.
Humans use facial expressions to show their emotional states. However, facial expression recognition remains a challenging and interesting problem in computer vision. In this paper we present our approach, which extends our previous work on facial emotion recognition [1]. The aim of this work is to classify each image into one of six facial emotion classes. The proposed model is based on a single deep convolutional neural network (DNN) containing convolution layers and deep residual blocks. First, an emotion label is assigned to each face image for training; the images are then passed through the proposed DNN model. The model is trained on two datasets: the Extended Cohn–Kanade (CK+) dataset and the Japanese Female Facial Expression (JAFFE) database. The overall results show that the proposed DNN model outperforms recent state-of-the-art approaches for emotion recognition, and it also improves accuracy in comparison with our previous model.
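A minimal sketch of the kind of residual building block the abstract describes is shown below in PyTorch; the layer sizes, the input resolution, and the toy classifier head are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv + identity-shortcut block of the kind described above;
    the channel count and kernel sizes are illustrative."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)  # add the skip connection

# Toy stack: conv stem -> two residual blocks -> 6-way emotion classifier.
model = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    ResidualBlock(32), ResidualBlock(32),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6),
)
print(model(torch.randn(1, 1, 48, 48)).shape)  # torch.Size([1, 6])
```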
In this paper, a multichannel EEG emotion recognition method based on a novel dynamical graph convolutional neural network (DGCNN) is proposed. The basic idea of the proposed method is to use a graph to model the multichannel EEG features and then perform EEG emotion classification based on this model. Different from traditional graph convolutional neural network (GCNN) methods, the proposed DGCNN can dynamically learn the intrinsic relationships between different electroencephalogram (EEG) channels, represented by an adjacency matrix, by training a neural network, which benefits more discriminative EEG feature extraction. The learned adjacency matrix is then used to learn more discriminative features that improve EEG emotion recognition. We conduct extensive experiments on the SJTU emotion EEG dataset (SEED) and the DREAMER dataset. The experimental results demonstrate that the proposed method achieves better recognition performance than state-of-the-art methods: an average recognition accuracy of 90.4 percent is achieved in the subject-dependent experiment and 79.95 percent in the subject-independent cross-validation experiment on the SEED database, while average accuracies of 86.23, 84.54, and 85.02 percent are obtained for valence, arousal, and dominance classification, respectively, on the DREAMER database.
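The core DGCNN idea, learning the inter-channel adjacency matrix jointly with the network weights, can be sketched as below; this is a simplified illustration in PyTorch, where the layer shape, the ReLU on the adjacency, and the 62-channel, 5-feature dimensions (chosen to echo SEED) are assumptions.

```python
import torch
import torch.nn as nn

class DynamicGraphConv(nn.Module):
    """Graph convolution with a learnable adjacency matrix, in the spirit
    of the DGCNN idea above; a sketch, not the paper's exact layer."""
    def __init__(self, n_channels: int, in_dim: int, out_dim: int):
        super().__init__()
        # Inter-channel relations are trained jointly with the weights.
        self.adj = nn.Parameter(torch.eye(n_channels))
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x):            # x: (batch, n_channels, in_dim)
        a = torch.relu(self.adj)     # keep connection strengths non-negative
        return torch.relu(self.lin(a @ x))

# Toy usage: 62 EEG channels with 5-dimensional per-channel features.
layer = DynamicGraphConv(n_channels=62, in_dim=5, out_dim=16)
print(layer(torch.randn(8, 62, 5)).shape)  # torch.Size([8, 62, 16])
```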
•Current methodologies in Speech Emotion Recognition (SER) technologies are reviewed.
•The review includes all related areas of SER: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers.
•The review covers recent developments in classifiers, such as the usage of convolutional and recurrent neural networks.
•A list of existing challenges is also discussed.
Speech is the most natural way for humans to express themselves. It is only natural, then, to extend this communication medium to computer applications. We define speech emotion recognition (SER) systems as a collection of methodologies that process and classify speech signals to detect the embedded emotions. SER is not a new field; it has been around for over two decades and has regained attention thanks to recent advancements. These novel studies make use of advances across all fields of computing and technology, making an update on the current methodologies and techniques necessary. We have identified and discussed the distinct areas of SER, provided a detailed survey of the current literature on each, and listed the current challenges.