Speech emotion recognition (SER) is an essential field of artificial intelligence. Although the Mel spectrogram is commonly used in SER, it emphasizes low-frequency emotional components. In this paper, we propose the VMD-Teager-Mel (VTMel) spectrogram, which complements the Mel spectrogram by emphasizing high-frequency components. In addition, to reduce the redundancy of the acoustic features, we propose a convolutional neural network with a deep restricted Boltzmann machine (CNN-DBM) to obtain optimized deep features. Furthermore, a dual-channel complementary structure is proposed for SER. First, a CNN-DBM extracts optimized deep features from the Mel spectrogram, highlighting low-frequency components. Second, another CNN-DBM extracts optimized deep features from the VTMel spectrogram, highlighting high-frequency components. These features are concatenated and fed to a classifier. The experimental results on three public datasets (EMO-DB, SAVEE, and RAVDESS) show that the merged features achieve better performance, confirming the complementarity of the Mel and VTMel spectrograms. The recognition accuracy using CNN-DBM optimized deep features is superior to that using deep features from a CNN alone, demonstrating the superiority of the proposed method. Our experiments also show advantages of the proposed method over state-of-the-art methods reported in the literature.
• A VTMel spectrogram that supplements the Mel spectrogram is proposed, highlighting high-frequency components.
• Optimized deep features are extracted from the spectrograms using CNN-DBM networks.
• A dual-channel complementary structure is designed for speech emotion recognition and performs well.
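As a rough illustration of the dual-representation front end described above, the following sketch (using librosa) computes a standard Mel spectrogram alongside a Mel spectrogram of the Teager-Kaiser energy signal. The VMD stage of the actual VTMel pipeline, which would first decompose the signal into band-limited modes, is omitted here, so this is a simplified stand-in rather than the authors' implementation; all parameter values are assumptions.

```python
import numpy as np
import librosa

def teager_kaiser(x: np.ndarray) -> np.ndarray:
    """Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    y = np.empty_like(x)
    y[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    y[0], y[-1] = y[1], y[-2]  # replicate endpoints
    return y

def mel_and_teager_mel(path: str, sr: int = 16000, n_mels: int = 64):
    """Return the two channels: a plain Mel spectrogram and a Mel
    spectrogram of the Teager energy signal (VMD step omitted)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    tk_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=teager_kaiser(y), sr=sr, n_mels=n_mels))
    return mel, tk_mel
```

In the paper's dual-channel structure, each representation would feed its own CNN-DBM and the resulting deep features would be concatenated before classification.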
The classification of electrocardiogram (ECG) signals is very important for the automatic diagnosis of heart disease. Traditionally, it is divided into two steps: feature extraction and pattern classification. Owing to recent advances in artificial intelligence, it has been demonstrated that a deep neural network trained on a huge amount of data can carry out feature extraction directly from the data and recognize cardiac arrhythmias better than professional cardiologists. This paper proposes an ECG arrhythmia classification method using a two-dimensional (2D) deep convolutional neural network (CNN). The time-domain ECG signals, belonging to five heartbeat types including normal beat (NOR), left bundle branch block beat (LBB), right bundle branch block beat (RBB), premature ventricular contraction beat (PVC), and atrial premature contraction beat (APC), were first transformed into time-frequency spectrograms by the short-time Fourier transform. The spectrograms of the five arrhythmia types were then used as input to the 2D-CNN, which identified and classified the arrhythmia types. Using ECG recordings from the MIT-BIH arrhythmia database as training and testing data, the classification results show that the proposed 2D-CNN model reaches an average accuracy of 99.00%. In addition, model parameter optimization was investigated to achieve optimal classification performance: the classifier achieved the highest accuracy and the lowest loss with a learning rate of 0.001 and a batch size of 2500. We also compared the proposed 2D-CNN model with a conventional one-dimensional (1D) CNN model, which achieved an average accuracy of 90.93%. This validates that the proposed CNN classifier, using ECG spectrograms as input, achieves improved classification accuracy without additional manual pre-processing of the ECG signals.
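A minimal sketch of the described front end, assuming SciPy's STFT-based spectrogram and a small PyTorch CNN; the abstract does not specify the network layout, so the layers and sizes below are illustrative (MIT-BIH recordings are sampled at 360 Hz).

```python
import numpy as np
from scipy.signal import spectrogram
import torch
import torch.nn as nn

def beat_to_spectrogram(beat: np.ndarray, fs: int = 360) -> np.ndarray:
    """Short-time Fourier transform of one heartbeat segment, log-scaled."""
    _, _, sxx = spectrogram(beat, fs=fs, nperseg=64, noverlap=32)
    return np.log1p(sxx).astype(np.float32)

class Ecg2DCnn(nn.Module):
    """Small 2D CNN over (1, freq, time) spectrogram images for the five
    beat classes NOR, LBB, RBB, PVC, and APC."""
    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))
```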
• First, the spectrograms of heart cycles are scaled for comparison.
• Second, tensor decomposition is applied to the scaled spectrograms.
• Third, the intrinsic structure information of the scaled spectrograms is extracted.
• Fourth, more useful physiological and pathological information is preserved.
• Fifth, the extracted features are more discriminative.
Heart sound signal analysis is an effective and convenient method for the preliminary diagnosis of heart disease. However, automatic heart sound classification remains a challenging problem, mainly reflected in heart sound segmentation and feature extraction from the corresponding segmentation results. In order to extract more discriminative features for heart sound classification, a method based on scaled spectrograms and tensor decomposition was proposed in this study. In the proposed method, the spectrograms of the detected heart cycles are first scaled to a fixed size. Then a dimension reduction process is performed on the scaled spectrograms to extract the most discriminative features. During this process, the intrinsic structure of the scaled spectrograms, which contains important physiological and pathological information of the heart sound signals, is extracted using a tensor decomposition method. As a result, the extracted features are more discriminative. Finally, the classification task is completed by a support vector machine (SVM). The proposed method is evaluated on three public datasets offered by the PASCAL Classifying Heart Sounds Challenge and the 2016 PhysioNet Challenge. The results show that the proposed method is competitive.
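The pipeline above could be sketched as follows, taking tensorly's Tucker decomposition as one concrete tensor decomposition (the abstract does not name the exact variant) and scikit-learn's SVM; shapes and ranks are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom
import tensorly as tl
from tensorly.decomposition import tucker
from sklearn.svm import SVC

def scale_spectrogram(sxx: np.ndarray, shape=(64, 64)) -> np.ndarray:
    """Resize a heart-cycle spectrogram to a fixed size so that cycles
    of different durations become directly comparable."""
    return zoom(sxx, (shape[0] / sxx.shape[0], shape[1] / sxx.shape[1]))

def tensor_features(scaled: list) -> np.ndarray:
    """Stack cycles into a 3rd-order tensor (cycle x freq x time) and use
    the cycle-mode factor matrix of a Tucker decomposition as features,
    so the joint time-frequency structure shapes the reduction."""
    tensor = tl.tensor(np.stack(scaled))
    r0 = min(32, tensor.shape[0])
    core, factors = tucker(tensor, rank=[r0, 16, 16])
    return tl.to_numpy(factors[0])  # one feature row per heart cycle

# X = tensor_features(scaled_spectrograms); clf = SVC().fit(X, labels)
```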
In this study, an effective approach to environmental sound classification from spectral images using convolutional neural networks (CNNs) with meaningful data augmentation is proposed. The feature used in this approach is the Mel spectrogram: features are defined from audio clips in the form of spectrogram images. The CNN models used in this experiment are a 7-layer and a 9-layer CNN trained from scratch. In addition, various well-known deep learning architectures are considered with transfer learning, following the scheme of freezing the initial layers, training the model, unfreezing the layers, and training the model again with discriminative learning rates. Three datasets are considered: ESC-10, ESC-50, and Us8k. For the transfer learning methodology, 11 pre-trained deep learning architectures are used. Instead of using the available image data augmentation schemes, we propose meaningful data augmentation by applying variations to the audio clips directly. The results show the effectiveness, robustness, and high accuracy of the proposed approach. The meaningful data augmentation accomplishes the highest accuracy with a lower error rate on all datasets when using transfer learning models. Among the models used, ResNet-152 attained 99.04% on ESC-10 and 99.49% on Us8k, while DenseNet-161 reached 97.57% on ESC-50. To the best of our knowledge, these are the best results achieved on these datasets.
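As an illustration of augmenting in the audio domain rather than the image domain, a short librosa-based sketch follows; the transforms and parameter ranges are assumptions, not the paper's exact recipe.

```python
import numpy as np
import librosa

def augment_clip(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Apply small, label-preserving variations directly to the waveform."""
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2.0, 2.0))
    return y + 0.005 * rng.standard_normal(len(y))  # light additive noise

def mel_image(y: np.ndarray, sr: int, n_mels: int = 128) -> np.ndarray:
    """Mel spectrogram in dB, min-max scaled to [0, 1] as a CNN input image."""
    m = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    return (m - m.min()) / (m.max() - m.min() + 1e-8)
```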
Background: The manual detection, analysis and classification of animal vocalizations in acoustic recordings is laborious and requires expert knowledge. Hence, there is a need for objective, generalizable methods that detect underlying patterns in these data, categorize sounds into distinct groups and quantify similarities between them. Among all computational methods that have been proposed to accomplish this, neighbourhood-based dimensionality reduction of spectrograms to produce a latent space representation of calls stands out for its conceptual simplicity and effectiveness.
Goal of the study/what was done: Using a dataset of manually annotated meerkat Suricata suricatta vocalizations, we demonstrate how this method can be used to obtain meaningful latent space representations that reflect the established taxonomy of call types. We analyse strengths and weaknesses of the proposed approach, give recommendations for its usage and show application examples, such as the classification of ambiguous calls and the detection of mislabelled calls.
What this means: All analyses are accompanied by example code to help researchers realize the potential of this method for the study of animal vocalizations.
Complexity and a lack of training materials often create barriers to using computational methods to detect patterns in animal vocalizations. This Research Methods Guide provides a tutorial and example code for a simple yet effective computational method that researchers can apply to their own data.
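A minimal sketch of the latent-space step, assuming UMAP (via umap-learn) as the neighbourhood-based reducer; the guide's exact tool and parameters may differ, and the spectrograms are assumed to be cropped or padded to a common shape.

```python
import numpy as np
import umap  # umap-learn: one common neighbourhood-based reducer

def call_latent_space(spectrograms: list, n_neighbors: int = 15) -> np.ndarray:
    """Flatten equally sized call spectrograms and embed them in 2-D;
    calls of the same type should fall into nearby clusters, and
    outliers may indicate ambiguous or mislabelled calls."""
    X = np.stack([s.ravel() for s in spectrograms])
    reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=0.1, random_state=0)
    return reducer.fit_transform(X)  # (n_calls, 2)
```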
Music genre classification based on visual representations has been successfully explored in recent years, with increasing interest in applying convolutional neural networks (CNNs) to the task. However, most existing methods employ mature CNN structures proposed for image recognition without modification, which results in learned features that are not adequate for music genre classification. To address this issue, we fully exploit the low-level information in audio spectrograms and develop a novel CNN architecture in this paper. The proposed architecture takes multi-scale time-frequency information into consideration, providing more suitable semantic features for the decision-making layer to discriminate the genre of an unknown music clip. The experiments are evaluated on the benchmark GTZAN, Ballroom, and Extended Ballroom datasets. The experimental results show that the proposed method achieves classification accuracies of 93.9%, 96.7%, and 97.2% respectively, which, to the best of our knowledge, are the best results on these public datasets so far. Notably, the trained model is tiny, only 0.18M, and can therefore be deployed on mobile phones or other devices with limited computational resources. Code and models will be available at https://github.com/CaifengLiu/music-genre-classification.
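One plausible realization of a multi-scale time-frequency block is sketched below in PyTorch; the kernel shapes and channel counts are illustrative guesses, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Parallel convolutions with time-oriented, frequency-oriented, and
    square kernels over a spectrogram; branch outputs are concatenated
    along the channel axis."""
    def __init__(self, in_ch: int, branch_ch: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=p)
            for k, p in [((3, 7), (1, 3)),   # wide in time
                         ((7, 3), (3, 1)),   # wide in frequency
                         ((3, 3), (1, 1))]]) # local detail
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, freq, time) -> (batch, 3 * branch_ch, freq, time)
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))
```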
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
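The compact intermediate representation can be illustrated with torchaudio; the parameters below follow a widely used open-source Tacotron 2 configuration and are not guaranteed to match the paper exactly.

```python
import torch
import torchaudio

# 80-channel log-Mel analysis, as in common open-source Tacotron 2 setups.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80)

def mel_target(wav: torch.Tensor) -> torch.Tensor:
    """The representation the feature-prediction network is trained to
    predict and the WaveNet vocoder consumes as conditioning input."""
    return torch.log(torch.clamp(mel_transform(wav), min=1e-5))
```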
In recent years, speaker recognition has attracted wide attention for its extensive applications in fields such as speech communications, domestic services, and smart terminals. As a critical method, the Gaussian mixture model (GMM) achieves recognition capability close to human hearing ability on long speech, but it fails to recognize speakers from short utterances with high accuracy. To solve this problem, in this paper we propose a novel model to enhance the recognition accuracy of short-utterance speaker recognition systems. Unlike traditional models based on the GMM, we train a convolutional neural network to process spectrograms, which describe speakers better. Thus, the recognition system gains considerable accuracy as well as reasonable convergence speed. The experimental results show that our model helps to decrease the equal error rate of recognition from 4.9% to 2.5%.
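The reported metric, equal error rate (EER), can be estimated from verification scores with a standard ROC-based computation, independent of the model itself:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: the operating point where the false-acceptance rate equals
    the false-rejection rate (the abstract reports 4.9% -> 2.5%)."""
    fpr, tpr, _ = roc_curve(labels, scores)  # labels: 1 = same speaker
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)
```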
Recurrent sequence-to-sequence models using the encoder-decoder architecture have made great progress in speech recognition. However, they suffer from slow training because their internal recurrence limits training parallelization. In this paper, we present the Speech-Transformer, a recurrence-free sequence-to-sequence model that relies entirely on attention mechanisms to learn positional dependencies and can therefore be trained faster and more efficiently. We also propose a 2D-Attention mechanism, which jointly attends to the time and frequency axes of the 2-dimensional speech inputs, providing more expressive representations for the Speech-Transformer. Evaluated on the Wall Street Journal (WSJ) speech recognition dataset, our best model achieves a competitive word error rate (WER) of 10.9%, while the whole training process takes only 1.2 days on one GPU, significantly faster than published results for recurrent sequence-to-sequence models.
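A simplified reading of the 2D-Attention idea is sketched below: self-attention applied separately along the time and frequency axes and then merged. This is one interpretation for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TwoDimAttention(nn.Module):
    """Attention along both axes of a spectrogram-like input: one
    self-attention pass over time (per frequency bin) and one over
    frequency (per time step), concatenated and projected back."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, freq, time, dim)
        b, f, t, d = x.shape
        xt = x.reshape(b * f, t, d)                   # attend over time
        t_out, _ = self.time_attn(xt, xt, xt)
        xf = x.transpose(1, 2).reshape(b * t, f, d)   # attend over frequency
        f_out, _ = self.freq_attn(xf, xf, xf)
        t_out = t_out.reshape(b, f, t, d)
        f_out = f_out.reshape(b, t, f, d).transpose(1, 2)
        return self.proj(torch.cat([t_out, f_out], dim=-1))
```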
In this paper, a neural network named the sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At the training stage, a SCENT model is estimated by implicitly aligning the feature sequences of source and target speakers using an attention mechanism. At the conversion stage, the acoustic features and durations of source utterances are converted simultaneously using the unified acoustic model. Mel-scale spectrograms are adopted as acoustic features, which contain both excitation and vocal tract descriptions of the speech signals. Bottleneck features extracted from the source speech using an automatic speech recognition model are appended as an auxiliary input. A WaveNet vocoder conditioned on Mel spectrograms is built to reconstruct waveforms from the outputs of the SCENT model. It is worth noting that our proposed method achieves appropriate duration conversion, which is difficult for conventional methods. Experimental results show that our proposed method obtains better objective and subjective performance than baseline methods using Gaussian mixture models and deep neural networks as acoustic models. It also outperforms our previous work, which achieved the top rank in the Voice Conversion Challenge 2018. Ablation tests further confirm the effectiveness of several components of the proposed method.
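The implicit alignment can be illustrated with generic scaled dot-product attention between decoder states and source features; SCENT's actual attention module may differ, and the function below is purely illustrative.

```python
import torch
import torch.nn.functional as F

def soft_alignment(decoder_states: torch.Tensor, encoder_outputs: torch.Tensor):
    """Each target-side decoder state attends over the source feature
    sequence, yielding a soft source-target alignment; since the target
    length is decided by the decoder, durations are converted implicitly.
    decoder_states: (batch, T_tgt, dim); encoder_outputs: (batch, T_src, dim)."""
    scores = torch.bmm(decoder_states, encoder_outputs.transpose(1, 2))
    weights = F.softmax(scores / decoder_states.size(-1) ** 0.5, dim=-1)
    context = torch.bmm(weights, encoder_outputs)  # (batch, T_tgt, dim)
    return context, weights
```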