E-resources
Full text
Peer reviewed
  • MLT-DNet: Speech emotion re...
    Mustaqeem; Kwon, Soonil

    Expert systems with applications, 04/2021, Volume: 167
    Journal Article

    • A lightweight model using a one-dimensional CNN is proposed for a real-time SER system.
    • A multi-learning trick (MLT) is proposed for utilizing UFLBs and a stacked GRU setup.
    • The proposed model has a distinctive ability to learn spatial and temporal features in parallel.
    • A 1D dilated CNN architecture is explored in order to enhance the usage of features.
    • We evaluated our model on benchmark corpora and improved on the current baseline methods.

    Speech is the most dominant source of communication among humans, and it is an efficient way for human–computer interaction (HCI) to exchange information. Speech emotion recognition (SER) is an active research area that plays a crucial role in real-time applications, yet current SER systems lack real-time speech processing. To address this problem, we propose an end-to-end real-time SER model that is based on a one-dimensional dilated convolutional neural network (DCNN). Our model uses a multi-learning strategy to extract spatial salient emotional features and learn long-term contextual dependencies from the speech signals in parallel. We use a residual blocks with skip connection (RBSC) module to find correlations among the emotional cues, and a sequence learning (Seq_L) module to learn the long-term contextual dependencies in the input features. Furthermore, we use a fusion layer to concatenate these learned features for the final emotion recognition task. Our model structure is quite simple, and it is capable of automatically learning salient discriminative features from the speech signals. We evaluated our model on the benchmark IEMOCAP and EMO-DB datasets and obtained high recognition accuracies of 73% and 90%, respectively. The experimental results indicate the significance and efficiency of the proposed model, which substantially supports the implementation of a real-time SER system. Hence, our model is capable of processing original speech signals for emotion recognition using a lightweight dilated CNN architecture that implements the multi-learning trick (MLT) approach.
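
The building blocks named in the abstract (a 1D dilated convolution, a residual block with a skip connection, and a fusion layer that concatenates features) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the kernel values, the ReLU activation, and the zero padding are assumptions chosen for illustration, and the stacked GRU branch is left out entirely.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """1D dilated convolution (cross-correlation) with zero padding so the
    output has the same length as the input. Padding scheme is an assumption."""
    k = len(kernel)
    span = dilation * (k - 1)          # distance between first and last kernel tap
    left = span // 2                   # zeros added on the left side
    padded = [0.0] * left + list(x) + [0.0] * (span - left)
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ]


def residual_block(x: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Dilated convolution followed by ReLU, added back to the input
    through a skip connection. The choice of ReLU is an assumption."""
    y = dilated_conv1d(x, kernel, dilation)
    activated = [max(0.0, v) for v in y]
    return [a + b for a, b in zip(x, activated)]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input samples seen by one output sample after stacking
    dilated convolutions with the given dilation rates."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def fuse(spatial: Sequence[float], temporal: Sequence[float]) -> List[float]:
    """Fusion by concatenation of two feature vectors."""
    return list(spatial) + list(temporal)


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5]           # hypothetical toy input
    print(dilated_conv1d(signal, [1, 0, -1], dilation=2))
    print(residual_block(signal, [1, 0, -1], dilation=2))
    print(receptive_field(3, [1, 2, 4]))
```

Stacking three layers with kernel size 3 and dilation rates 1, 2, and 4 gives each output sample a view of 15 input samples, which is the usual motivation for dilation: a wider context without extra parameters.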