Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) using a microphone sensor. Quantifiable emotion recognition from the speech signals captured by these sensors is an emerging area of research in HCI, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the speaker's emotional state is determined from an individual's speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps for better performance. Local hidden patterns are learned in convolutional layers with special strides to down-sample the feature maps rather than by pooling layers, and global discriminative features are learned in fully connected layers. A softmax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, while reducing the model size by 34.5 MB. This proves the effectiveness and significance of the proposed SER technique and reveals its applicability in real-world applications.
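As a rough illustration of the strided down-sampling idea described above, the following Keras sketch replaces pooling layers with strided convolutions and ends in a softmax classifier. The input shape, layer counts, and filter sizes are assumptions for demonstration, not the paper's exact DSCNN configuration.

```python
# Minimal sketch of a strided "plain nets" CNN over spectrograms, assuming
# 128x128 single-channel inputs and 4 emotion classes (both are assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dscnn(input_shape=(128, 128, 1), num_classes=4):
    return models.Sequential([
        layers.Input(shape=input_shape),
        # Strided convolutions down-sample the feature maps in place of pooling.
        layers.Conv2D(32, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(64, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.Conv2D(128, (3, 3), strides=(2, 2), padding="same", activation="relu"),
        layers.Flatten(),
        # Fully connected layers learn global discriminative features.
        layers.Dense(256, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # softmax classifier
    ])

model = build_dscnn()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```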
• A lightweight model using a one-dimensional CNN for a real-time SER system is proposed.
• A multi-learning trick (MLT) is proposed for utilizing UFLBs and a stacked-GRU setup.
• The proposed model has the peculiar ability to learn spatial and temporal features in parallel.
• A 1D dilated CNN architecture is explored in order to enhance the usage of features.
• We evaluated our model on benchmark corpora and improved on the current baseline methods.
Speech is the most dominant source of communication among humans, and it is an efficient way for human–computer interaction (HCI) to exchange information. Nowadays, speech emotion recognition (SER) is an active research area that plays a crucial role in real-time applications, yet existing SER systems lack real-time speech processing capability. To address this problem, we propose an end-to-end real-time SER model that is based on a one-dimensional dilated convolutional neural network (DCNN). Our model uses a multi-learning strategy to extract salient spatial emotional features and learn long-term contextual dependencies from the speech signals in parallel. We use a residual blocks with skip connection (RBSC) module to find correlations among the emotional cues, and a sequence learning (Seq_L) module to learn the long-term contextual dependencies in the input features. Furthermore, we use a fusion layer to concatenate these learned features for the final emotion recognition task. Our model structure is quite simple, and it is capable of automatically learning salient discriminative features from the speech signals. We evaluated our model on the benchmark IEMOCAP and EMO-DB datasets and obtained high recognition accuracies of 73% and 90%, respectively. The experimental results indicate the significance and efficiency of our proposed model and its suitability for implementing a real-time SER system. Hence, our model is capable of processing original speech signals for emotion recognition using a lightweight dilated CNN architecture that implements the multi-learning trick (MLT) approach.
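A minimal sketch of the parallel multi-learning idea follows, assuming a raw 1-second, 16 kHz waveform input: a dilated 1D-CNN branch with a skip connection (RBSC-like) runs alongside a stacked-GRU sequence branch (Seq_L-like), and a fusion layer concatenates the two before classification. All sizes and the exact wiring are illustrative, not the published configuration.

```python
# Hedged sketch of a two-branch MLT-style model; widths, dilation rates,
# and the class count are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mlt_model(seq_len=16000, num_classes=4):
    inp = layers.Input(shape=(seq_len, 1))
    # Spatial branch: dilated 1D convolution with a residual skip connection.
    x = layers.Conv1D(32, 7, dilation_rate=2, padding="same", activation="relu")(inp)
    skip = layers.Conv1D(32, 1, padding="same")(inp)   # match channels for the skip
    x = layers.Add()([x, skip])
    x = layers.GlobalAveragePooling1D()(x)
    # Temporal branch: stacked GRUs learn long-term contextual dependencies.
    y = layers.MaxPooling1D(pool_size=100)(inp)        # coarse down-sampling first
    y = layers.GRU(64, return_sequences=True)(y)
    y = layers.GRU(64)(y)
    # Fusion layer concatenates the two learned representations.
    z = layers.Concatenate()([x, y])
    out = layers.Dense(num_classes, activation="softmax")(z)
    return models.Model(inp, out)

model = build_mlt_model()
```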
Artificial intelligence (AI) and machine learning (ML) are employed to make systems smarter. Today, the speech emotion recognition (SER) system evaluates the emotional state of the speaker by investigating his or her speech signal. Emotion recognition is a challenging task for a machine, and making it smart enough to recognize emotions efficiently through AI is equally challenging. The speech signal is quite hard to examine using signal processing methods because it consists of different frequencies and features that vary according to emotions such as anger, fear, sadness, happiness, boredom, disgust, and surprise. Even though different algorithms are being developed for SER, the success rates remain low and vary with the language, the emotions, and the database. In this paper, we propose a new lightweight and effective SER model with low computational complexity and high recognition accuracy. The suggested method uses a convolutional neural network (CNN) to learn deep frequency features using a plain rectangular filter with a modified pooling strategy that has more discriminative power for SER. The proposed CNN model was trained on the frequency features extracted from the speech data and was then tested to predict the emotions. The proposed SER model was evaluated on two benchmark speech datasets, the interactive emotional dyadic motion capture (IEMOCAP) and the Berlin emotional speech database (EMO-DB), and it obtained 77.01% and 92.02% recognition accuracy, respectively. The experimental results demonstrate that the proposed CNN-based SER system can achieve better recognition performance than state-of-the-art SER systems.
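To make the rectangular-filter-with-modified-pooling idea concrete, here is a hedged Keras sketch that applies a kernel that is tall in frequency and narrow in time, and pools only along the time axis so frequency resolution is preserved. The input shape (40 frequency bins by 200 frames), kernel sizes, and 7-class output are assumptions, not the paper's exact configuration.

```python
# Illustrative CNN with a plain rectangular kernel and time-only pooling.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(40, 200, 1)),          # (frequency bins, time frames, 1)
    # Rectangular kernel: tall in frequency, narrow in time.
    layers.Conv2D(32, (9, 3), padding="same", activation="relu"),
    # Modified pooling: pool along time only, keeping frequency resolution.
    layers.MaxPooling2D(pool_size=(1, 4)),
    layers.Conv2D(64, (9, 3), padding="same", activation="relu"),
    layers.MaxPooling2D(pool_size=(1, 4)),
    layers.Flatten(),
    layers.Dense(7, activation="softmax"),     # e.g., 7 EMO-DB emotion classes
])
```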
Recognizing the emotional state of a speaker is a difficult task for machine learning algorithms and plays an important role in the field of speech emotion recognition (SER). SER has significant applications in many real-time settings, such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers, for analyzing the emotional state of speakers. Previous research in this field has mostly focused on handcrafted features and traditional convolutional neural network (CNN) models that extract high-level features from speech spectrograms, which increases recognition accuracy but also overall model complexity. In contrast, we introduce a novel framework for SER using key sequence segment selection based on radial basis function network (RBFN) similarity measurement in clusters. The selected sequence is converted into a spectrogram using the STFT algorithm and passed to a CNN model to extract the discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bi-directional long short-term memory (BiLSTM) network to learn the temporal information for recognizing the final emotional state. In the proposed technique, we process key segments instead of the whole utterance to reduce the computational complexity of the overall model, and we normalize the CNN features before their actual processing so that the spatio-temporal information is easily recognized. The proposed system is evaluated on different standard datasets, including IEMOCAP, EMO-DB, and RAVDESS, to improve recognition accuracy and reduce the processing time of the model. The robustness and effectiveness of the suggested SER model are demonstrated by the experiments: compared to state-of-the-art SER methods, it achieves up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.
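The following sketch traces the main pipeline stages named above: STFT spectrogram of a selected key segment, CNN feature extraction, feature normalization, and a BiLSTM classifier. The RBFN-based segment selection itself is omitted, and all shapes and the normalization choice (layer normalization) are assumptions for illustration.

```python
# Hedged sketch: key segment -> STFT spectrogram -> CNN -> normalize -> BiLSTM.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

def segment_to_spectrogram(segment, sr=16000):
    # Log-magnitude STFT spectrogram of one selected key segment.
    stft = np.abs(librosa.stft(segment, n_fft=512, hop_length=256))
    return librosa.amplitude_to_db(stft)

def build_cnn_bilstm(num_classes=4):
    inp = layers.Input(shape=(257, 128, 1))            # (freq, time, 1)
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)                 # -> (128, 64, 32)
    x = layers.Permute((2, 1, 3))(x)                   # time-major for the RNN
    x = layers.Reshape((64, -1))(x)                    # (time steps, features)
    x = layers.LayerNormalization()(x)                 # normalize CNN features
    x = layers.Bidirectional(layers.LSTM(128))(x)      # temporal modeling
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```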
Artificial intelligence, deep learning, and machine learning are dominant tools for making a system smarter. Nowadays, the smart speech emotion recognition (SER) system is a basic necessity and an emerging research area of digital audio signal processing, and SER plays an important role in many applications related to human–computer interaction (HCI). Existing state-of-the-art SER systems have quite low prediction performance, which needs improvement to make them feasible for real-time commercial applications. The key reasons for the low accuracy and poor prediction rate are the scarcity of data and model configuration, which is the most challenging task in building a robust machine learning technique. In this paper, we address the limitations of existing SER systems and propose a unique artificial intelligence (AI) based system structure for SER that utilizes hierarchical blocks of convolutional long short-term memory (ConvLSTM) with sequence learning. We design four blocks of ConvLSTM, called local feature learning blocks (LFLBs), to extract the local emotional features in a hierarchical correlation. The ConvLSTM layers are adopted for input-to-state and state-to-state transitions to extract spatial cues through convolution operations. We place four LFLBs to extract the spatiotemporal cues in hierarchical correlational form from the speech signals using a residual learning strategy. Furthermore, we utilize a novel sequence learning strategy to extract global information and adaptively adjust the relevant global feature weights according to the correlation of the input features. Finally, we use the center loss function together with the softmax loss to produce the class probabilities. The center loss improves the final classification results, ensures accurate prediction, and plays a conspicuous role in the whole proposed SER scheme. We tested the proposed system on two standard speech corpora, the interactive emotional dyadic motion capture (IEMOCAP) and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and obtained 75% and 80% recognition rates, respectively.
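A hedged sketch of the stacked ConvLSTM "local feature learning block" idea follows, with four blocks applied to a sequence of 2D time-frequency patches. The center-loss term, residual connections, and sequence learning head of the paper are omitted here; the input shape and filter counts are assumptions.

```python
# Illustrative stack of four ConvLSTM LFLB-style blocks with a softmax head.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_convlstm_ser(num_classes=4):
    # Input: a sequence of 2-D patches, shape (steps, height, width, channels).
    inp = layers.Input(shape=(8, 32, 32, 1))
    x = inp
    for filters in (16, 32, 64, 128):                  # four LFLBs
        # ConvLSTM uses convolutions for input-to-state and state-to-state
        # transitions, capturing spatial cues across time steps.
        x = layers.ConvLSTM2D(filters, (3, 3), padding="same",
                              return_sequences=True)(x)
        x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling3D()(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```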
While the precise diagnosis of early-stage epiretinal membrane (ERM) at the time of cataract surgery and the evaluation of risk factors for development or progression of ERM after cataract surgery are increasingly important, only limited information is available. In the present study, we evaluated the risk factors for onset or progression of ERM on spectral domain optical coherence tomography (SD-OCT) after cataract surgery. The univariate analysis showed that eyes with partial posterior vitreous detachment (PVD; p < 0.001), hyper-reflective foci (HF) on the inner retinal surface (p < 0.001), vitreoschisis (p = 0.014), and a discrete margin of different retinal reflectivity (DMDRR; p = 0.007) on ultra-widefield fundus photography (UWF-FP) had a significant risk for the onset or progression of ERM after cataract surgery. The multivariate analysis showed that partial PVD (HR, 3.743; 95% confidence interval [CI], 1.956–7.162; p < 0.001), HF (HR, 2.330; 95% CI, 1.281–4.239; p = 0.006), and DMDRR on UWF-FP (HR, 3.392; 95% CI, 1.522–7.558; p = 0.003) were independent risk factors for the onset or progression of ERM after cataract surgery after adjustment for other confounding factors. Our study shows that the onset or progression of ERM after cataract surgery depends on an abnormal vitreoretinal interface (VRI), represented by partial PVD and HF on SD-OCT and DMDRR on UWF-FP, not on age, axial length, or the presence of ERM at the time of surgery. A meticulous funduscopic evaluation of the VRI would help to predict the ERM risk before cataract surgery.
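For readers unfamiliar with how hazard ratios like those above are obtained, the following is an illustrative sketch of a multivariate Cox proportional hazards fit using the lifelines package. The column names and the tiny follow-up table are hypothetical placeholders, not the study's data.

```python
# Hypothetical Cox regression sketch; exp(coef) in the summary gives the
# hazard ratios (HR) with 95% CIs, analogous to those reported above.
import pandas as pd
from lifelines import CoxPHFitter

# Placeholder follow-up table: months to ERM onset/progression, event
# indicator, and binary candidate risk factors (all values invented).
df = pd.DataFrame({
    "months":      [12, 24, 6, 36, 18, 30, 15, 28],
    "erm_event":   [1, 0, 1, 0, 1, 1, 0, 0],
    "partial_pvd": [1, 0, 1, 1, 0, 1, 0, 1],
    "hf":          [1, 1, 0, 0, 1, 1, 0, 0],
    "dmdrr":       [0, 0, 1, 0, 1, 1, 1, 0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="erm_event")
cph.print_summary()   # real analyses need far more data than this toy table
```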
The industrial and building sectors demand efficient smart energy strategies, optimization techniques, and efficient management to reduce the global energy consumption driven by the increasing world population. Nowadays, various artificial intelligence (AI) based methods, simulation tools, and engineering methods are utilized to perform optimal energy forecasting and predict future demand based on historical data. Nevertheless, nonlinear energy demand modeling is still immature as a solution for handling short-term and long-term dependencies and avoiding static behavior, because it is driven purely by historical data. In this paper, we propose an ensemble deep learning-based approach to predict and forecast energy demand and consumption using chronological dependencies. Our system initially preprocesses the data with cleaning, normalization, and transformation to ensure model performance. The preprocessed data are then fed to the proposed ensemble model to extract hybrid discriminative features using convolutional neural network (CNN), stacked, and bi-directional long short-term memory (LSTM) architectures. We trained the proposed system on historical data to forecast energy demand and consumption at different time intervals. In the proposed technique, we utilize the concept of active learning based on moving windows to ensure and improve the prediction performance of the system. The proposed system is applicable to energy consumption in the industrial and building sectors, demonstrating its significance and effectiveness. We evaluated the proposed system using the benchmark UCI residential dataset and a local Korean commercial building dataset. We conducted extensive experiments and used various evaluation metrics, which indicate the low error rate of the proposed system.
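As a simplified sketch of the hybrid idea described above, the code below builds moving windows over a univariate consumption series and feeds them to a Conv1D front end followed by stacked and bidirectional LSTM layers. The window length, layer widths, and single-step horizon are assumptions, not the paper's ensemble configuration.

```python
# Hedged sketch: moving windows -> CNN -> stacked LSTM -> BiLSTM -> forecast.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def make_windows(series, window=24, horizon=1):
    # Each sample is `window` past steps; the target is the value `horizon` ahead.
    X = np.stack([series[i:i + window]
                  for i in range(len(series) - window - horizon + 1)])
    y = series[window + horizon - 1:]
    return X[..., None], y                             # add a channel axis

def build_cnn_lstm(window=24):
    model = models.Sequential([
        layers.Input(shape=(window, 1)),
        layers.Conv1D(64, 3, padding="same", activation="relu"),  # local patterns
        layers.LSTM(64, return_sequences=True),                   # stacked LSTM
        layers.Bidirectional(layers.LSTM(32)),                    # BiLSTM
        layers.Dense(1),                                          # demand forecast
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```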
Cardiac disorders are critical and must be diagnosed at an early stage using routine auscultation examination with high precision. Cardiac auscultation is a technique to analyze and listen to heart sounds using an electronic stethoscope, a device that provides a digital recording of the heart sound called a phonocardiogram (PCG). The PCG signal carries useful information about the functionality and status of the heart, so several signal processing and machine learning techniques can be applied to study and diagnose heart disorders. Based on the PCG signal, heart sound signals can be classified into two main categories, i.e., normal and abnormal. We created a database of five categories of heart sound signals (PCG signals) from various sources, containing one normal and four abnormal categories. This study proposes an improved, automatic classification algorithm for cardiac disorders based on the heart sound signal. We extract features from the phonocardiogram signal and then process those features using machine learning techniques for classification. For feature extraction, we used Mel Frequency Cepstral Coefficient (MFCC) and Discrete Wavelet Transform (DWT) features from the heart sound signal, and for learning and classification we used a support vector machine (SVM), a deep neural network (DNN), and a centroid displacement-based k-nearest neighbor. To improve the results and classification accuracy, we combined the MFCC and DWT features for training and classification using the SVM and DNN. Our experiments make clear that results are greatly improved when the Mel Frequency Cepstral Coefficient and Discrete Wavelet Transform features are fused together and used for classification via the support vector machine, deep neural network, and k-nearest neighbor (KNN). The methodology discussed in this paper can be used to diagnose heart disorders in patients with up to 97% accuracy. The code and dataset can be accessed at "https://github.com/yaseen21khan/Classification-of-Heart-Sound-Signal-Using-Multiple-Features-/blob/master/README.md".
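A minimal sketch of the feature-fusion step follows: MFCCs (via librosa) and DWT coefficients (via pywt) extracted from one PCG recording are concatenated into a single vector and handed to an SVM. The sampling rate, wavelet choice, and summary statistics are assumptions, not the paper's exact settings.

```python
# Hedged sketch of MFCC + DWT feature fusion for PCG classification.
import numpy as np
import librosa
import pywt
from sklearn.svm import SVC

def pcg_features(path):
    signal, sr = librosa.load(path, sr=2000)           # PCG is low-bandwidth
    # 13 MFCCs averaged over frames -> compact spectral descriptor.
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).mean(axis=1)
    # 4-level DWT; summarize each sub-band by its mean absolute coefficient.
    coeffs = pywt.wavedec(signal, "db4", level=4)
    dwt = np.array([np.mean(np.abs(c)) for c in coeffs])
    return np.concatenate([mfcc, dwt])                 # fused feature vector

# Usage (paths and labels are placeholders):
# X = np.stack([pcg_features(p) for p in paths])
# clf = SVC(kernel="rbf").fit(X, labels)
```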
Speech signals are a primary input source in human-computer interaction (HCI) for developing applications such as automatic speech recognition (ASR), speech emotion recognition (SER), and gender and age recognition. Classifying speakers according to their age and gender is a challenging task in speech processing owing to the inability of current feature extraction methods and classification models to capture salient high-level speech features. To address these problems, we introduce a novel end-to-end age and gender recognition convolutional neural network (CNN) with a specially designed multi-attention module (MAM) operating on speech signals. Our proposed model uses the MAM to extract spatial and temporal salient features from the input data effectively. The MAM mechanism uses a rectangular-shaped filter as a kernel in its convolution layers and comprises two separate time and frequency attention mechanisms. The time attention branch learns to detect temporal cues, whereas the frequency attention module extracts the features most relevant to the target by focusing on the spatial frequency features. The two extracted spatial and temporal feature sets complement one another and provide high performance in age and gender classification. The proposed age and gender classification system was tested using the Common Voice and a locally developed Korean speech recognition dataset. Our suggested model achieved 96%, 73%, and 76% accuracy scores for gender, age, and age-gender classification, respectively, on the Common Voice dataset. The Korean speech recognition dataset results were 97%, 97%, and 90% for gender, age, and age-gender recognition, respectively. The prediction performance of our proposed model in these experiments demonstrates its superiority and robustness for age, gender, and age-gender recognition from speech signals.
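The sketch below illustrates the two-branch attention idea in Keras: one branch weights time frames, the other weights frequency bins, and the re-weighted feature maps are combined. The kernel shapes, pooling choices, and the way the branches are merged are illustrative assumptions, not the published MAM design.

```python
# Hedged sketch of a time/frequency dual-attention block on a spectrogram CNN.
import tensorflow as tf
from tensorflow.keras import layers, models

def multi_attention(feature_map):
    channels = feature_map.shape[-1]
    # Frequency attention: average over time, learn a weight per frequency bin.
    f = tf.reduce_mean(feature_map, axis=2, keepdims=True)      # (b, F, 1, C)
    f = layers.Conv2D(channels, (7, 1), padding="same", activation="sigmoid")(f)
    # Time attention: average over frequency, learn a weight per time frame.
    t = tf.reduce_mean(feature_map, axis=1, keepdims=True)      # (b, 1, T, C)
    t = layers.Conv2D(channels, (1, 7), padding="same", activation="sigmoid")(t)
    # Broadcast-multiply both attention maps back onto the features and combine.
    return feature_map * f + feature_map * t

inp = layers.Input(shape=(64, 128, 1))             # (freq, time, 1) spectrogram
x = layers.Conv2D(32, (3, 9), padding="same", activation="relu")(inp)  # rectangular kernel
x = multi_attention(x)
x = layers.GlobalAveragePooling2D()(x)
out = layers.Dense(2, activation="softmax")(x)     # e.g., gender classes
model = models.Model(inp, out)
```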
Emotion recognition from speech signals is an interesting research area with several applications, such as smart healthcare, autonomous voice response systems, assessing situational seriousness through caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square-shaped kernels and pooling operators at various layers, which are suited to 2D image data. In spectrograms, however, the information is encoded in a slightly different manner: time is represented along the x-axis and the frequency of the speech signal along the y-axis, whereas the amplitude is indicated by the intensity value at a particular position in the spectrogram. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling in rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on EMO-DB and a Korean speech dataset.
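To illustrate "rectangular kernels of varying shapes and sizes", the following hedged sketch applies several differently shaped rectangular kernels to the same spectrogram in parallel, each followed by max pooling over a rectangular neighborhood, and concatenates the results. The specific kernel and pooling shapes are assumptions for demonstration, not the paper's configuration.

```python
# Illustrative multi-branch CNN with varying rectangular kernels and
# rectangular max pooling over a (frequency, time) spectrogram.
import tensorflow as tf
from tensorflow.keras import layers, models

inp = layers.Input(shape=(129, 300, 1))            # (freq bins, time frames, 1)
branches = []
for kh, kw in [(12, 3), (6, 9), (3, 15)]:          # varying rectangular kernels
    b = layers.Conv2D(16, (kh, kw), padding="same", activation="relu")(inp)
    b = layers.MaxPooling2D(pool_size=(2, 6))(b)   # rectangular pooling window
    branches.append(b)
x = layers.Concatenate()(branches)                 # fuse all branch features
x = layers.GlobalAveragePooling2D()(x)
out = layers.Dense(7, activation="softmax")(x)     # e.g., 7 EMO-DB emotions
model = models.Model(inp, out)
```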