People express emotions through different modalities. Using both verbal and nonverbal communication channels allows a system to be created in which the emotional state is expressed more clearly and is therefore easier to understand. Expanding the focus to several forms of expression can facilitate research on emotion recognition as well as human–machine interaction. This article presents an analysis of audiovisual information for recognizing human emotions. A cross-corpus evaluation is performed using three different databases as the training set (SAVEE, eNTERFACE’05 and RML) and AFEW (a database simulating real-world conditions) as the testing set. Emotional speech is represented by commonly used audio and spectral features as well as MFCC coefficients. The SVM algorithm is used for classification. For facial expressions, faces in key frames are located using the Viola–Jones face detection algorithm, and facial emotion classification is done by a CNN (AlexNet). Multimodal emotion recognition is based on decision-level fusion. The performance of the emotion recognition algorithm is compared with that of human decision makers.
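As a rough illustration of the audio branch described in this abstract, the sketch below extracts per-utterance MFCC statistics and trains an SVM. The use of librosa and scikit-learn, the 16 kHz sampling rate, the 13-coefficient setting, and the mean/std summarization are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: MFCC statistics per utterance + SVM classifier.
# Library choices and parameter values are illustrative assumptions.
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path, n_mfcc=13):
    """Summarize one utterance by the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical training data: lists of file paths and emotion labels.
# X = np.stack([utterance_features(p) for p in paths])
# clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# clf.fit(X, labels)
```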
• This paper is the first comprehensive review of mechanical fault diagnosis based on audio signal analysis (MFDA).
• According to the model principle, the MFDA methods are classified and summarized in detail.
• This paper provides a clear definition of MFDA and introduces its development process and mechanism.
• This paper proposes that unknown fault diagnosis, compound fault diagnosis, high-noise fault diagnosis, and the development of a real-time MFDA system will be the focus of future research.
Mechanical fault diagnosis is one of the key technologies of the fourth industrial revolution. In recent years, mechanical fault diagnosis based on audio signal analysis (MFDA) has gradually become a research hotspot in the field, owing to its advantages of high detection accuracy, good generalization, non-embedded measurement, and low cost. However, to the best of our knowledge, there has been no comprehensive review of this field. To help colleagues quickly understand the relevant progress, this paper reviews and summarizes recent research work on MFDA. First, we provide a clear definition of MFDA and introduce its development process and mechanism. Then, the related research results are classified, and the advantages and disadvantages of each method are discussed. Finally, the development prospects and challenges of MFDA are summarized. We hope that this review can provide useful insights for researchers.
Sound event localization and detection (SELD) is a combined task that classifies acoustic events from audio signals, estimates their temporal boundaries, and identifies event locations. With the advancement of industries utilizing audio signals, SELD has been applied in various fields, and deep-learning-based research is being conducted for its effective application. However, current deep-learning-based SELD research focuses mainly on performance improvement in noise-free environments, which leads to performance degradation in noisy environments. To address this problem, this study proposes a robust SELD U-Net model that performs SELD in noisy environments. The proposed model combines a U-Net to remove noise with a SELDnet to perform SELD. The model was trained and evaluated using environmental data with noise of various levels. The results confirm that the proposed model outperforms existing deep-learning-based SELD models in environments with high levels of noise.
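A minimal sketch of how the described composition might look in PyTorch, assuming `unet` and `seldnet` are pre-built placeholder modules (the paper's exact architectures are not reproduced): the U-Net denoises the input spectrogram before the SELD back-end predicts event classes and locations.

```python
# Sketch of a denoise-then-SELD composition; `unet` and `seldnet` are
# placeholders for the paper's components, which are not reproduced here.
import torch.nn as nn

class DenoiseThenSELD(nn.Module):
    def __init__(self, unet: nn.Module, seldnet: nn.Module):
        super().__init__()
        self.unet = unet        # noisy spectrogram -> denoised spectrogram
        self.seldnet = seldnet  # spectrogram -> (event classes, locations)

    def forward(self, noisy_spec):
        denoised = self.unet(noisy_spec)
        sed, doa = self.seldnet(denoised)   # assumed two-headed output
        # Returning the denoised map also lets a reconstruction loss be applied.
        return sed, doa, denoised
```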
In this paper, the authors attempt to solve the convoluted problem of identifying musical instruments from their audio excerpts, using a deep convolutional neural network. Continuous wavelet transforms of the audio signals are computed with the Morse wavelet to form two-dimensional feature maps, which are then fed to a simple yet robust convolutional neural network. The outcome is appreciable in the sense that training the model on just 20% of the data and testing on the rest gives a classification accuracy of 85%.
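The sketch below shows one plausible way to build such scalogram feature maps in Python. PyWavelets does not ship the Morse wavelet used in the paper, so a Morlet wavelet stands in, and the sampling rate, clip duration, and scale count are illustrative.

```python
# Scalogram feature map via the continuous wavelet transform.
# Morlet ("morl") stands in for the paper's Morse wavelet.
import numpy as np
import pywt
import librosa

def scalogram(path, n_scales=64):
    y, sr = librosa.load(path, sr=8000, duration=1.0)
    scales = np.arange(1, n_scales + 1)
    coeffs, _ = pywt.cwt(y, scales, "morl")
    return np.abs(coeffs)  # 2-D (scale x time) map to feed the CNN
```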
We have shown above a simple yet robust convolutional neural network that can classify musical instruments with appreciable accuracy. In this proof-of-concept paper we argue that it is possible to identify musical instruments on the basis of scalograms, and to do so with this simple CNN model. This indicates that the overall framework can be realized on a portable user device with limited power, memory, or computational capability.
Environmental sound signals are multi-source, heterogeneous, and time-varying. Many systems have been proposed to process such signals for event detection in ambient assisted living applications. Typically, these systems use feature extraction, selection, and classification. However, despite major advances, several important questions remain unanswered, especially in real-world settings. This paper contributes to the body of knowledge in the field by addressing the following problems for ambient sounds recorded in various real-world kitchen environments: (1) which features and which classifiers are most suitable in the presence of background noise? (2) what is the effect of signal duration on recognition accuracy? (3) how do the signal-to-noise ratio and the distance between the microphone and the audio source affect the recognition accuracy in an environment in which the system was not trained? We show that for systems that use traditional classifiers, it is beneficial to combine gammatone frequency cepstral coefficients and discrete wavelet transform coefficients and to use a gradient boosting classifier. For systems based on deep learning, we consider 1D and 2D Convolutional Neural Networks (CNNs) using mel-spectrogram energies and mel-spectrogram images as inputs, respectively, and show that the 2D CNN outperforms the 1D CNN. We obtained competitive classification results for two such systems. The first, which uses a gradient boosting classifier, achieved an F1-score of 90.2% and a recognition accuracy of 91.7%. The second, which uses a 2D CNN with mel-spectrogram images, achieved an F1-score of 92.7% and a recognition accuracy of 96%.
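As a sketch of the traditional-classifier pipeline described here, the code below concatenates cepstral and DWT statistics and feeds them to a gradient boosting classifier. MFCCs stand in for the gammatone frequency cepstral coefficients (a gammatone front-end would be substituted in practice), and the wavelet, decomposition level, and summary statistics are assumptions.

```python
# Cepstral + DWT feature fusion with gradient boosting (illustrative sketch;
# MFCCs stand in for the GFCCs used in the paper).
import numpy as np
import pywt
import librosa
from sklearn.ensemble import GradientBoostingClassifier

def combined_features(y, sr):
    """Concatenate cepstral statistics and DWT band energies for one clip."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    cep = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    coeffs = pywt.wavedec(y, "db4", level=5)       # DWT decomposition
    dwt = np.array([np.sum(c ** 2) for c in coeffs])  # energy per level
    return np.concatenate([cep, np.log(dwt + 1e-12)])

# Hypothetical usage:
# X = np.stack([combined_features(y, sr) for y in clips])
# clf = GradientBoostingClassifier().fit(X, labels)
```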
The present work approaches intelligent traffic evaluation and congestion detection using sound sensors and machine learning. Two important problems are addressed: traffic condition assessment from audio data, and analysis of audio under uncontrolled environments. By modeling the traffic parameters and the sound generated by passing vehicles, and using the produced audio as a source of data for learning traffic audio patterns, we provide a solution that copes with the time, cost, and constraints inherent to traffic monitoring. External noise sources were introduced to produce more realistic acoustic scenes and to verify the robustness of the presented methods. Audio-based monitoring is a simple and low-cost option compared to other methods based on detector loops or GPS, and is as good as camera-based solutions without some of the common problems of image-based monitoring, such as occlusions and lighting conditions. The approach is evaluated with audio recordings of traffic from locations around the city of São José dos Campos, Brazil, and audio files from places around the world downloaded from YouTube. Its validation shows the feasibility of automatic audio-based traffic monitoring, as well as of using machine learning algorithms to recognize audio patterns in noisy environments.
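One concrete piece of such an evaluation is mixing recorded traffic audio with external noise at a controlled signal-to-noise ratio. The helper below is a generic sketch, not taken from the paper: it scales a noise clip so the mixture reaches a requested SNR in dB.

```python
# Generic helper for adding noise at a target SNR (not from the paper).
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power = 10**(snr_db/10)."""
    noise = noise[: len(signal)]
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + gain * noise
```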
A filter algorithm based on cochlear mechanics and the neuron filter mechanism is proposed from the viewpoint of vibration. It helps address the problem that non-linear amplification is rarely considered in studies of auditory filters. A cochlear mechanical transduction model is built to illustrate the audio-signal processing procedure in the cochlea, and the neuron filter mechanism is then modeled to indirectly obtain outputs with the cochlear properties of frequency tuning and non-linear amplification. The mathematical description of the proposed algorithm is derived from the two models. The parameter space, the parameter selection rules, and the error correction of the proposed algorithm are discussed. The unit impulse responses in the time domain and the frequency domain are simulated and compared to probe the characteristics of the proposed algorithm. A 24-channel filter bank is then built based on the proposed algorithm and applied to the enhancement of audio signals. The experiments and comparisons verify that the proposed algorithm can effectively divide audio signals into different frequency bands, significantly enhance the high-frequency parts, and improve speech enhancement performance in different noise environments, especially for babble noise and Volvo noise.
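The paper's filter derives from a cochlear/neuron model that is not reproduced here; as a loose structural stand-in, the sketch below builds a generic 24-channel bank of log-spaced Butterworth band-pass filters, so only the channel count and band-splitting role match the description.

```python
# Generic 24-channel band-pass bank (structural stand-in only; the paper's
# cochlear/neuron filter is not reproduced).
import numpy as np
from scipy.signal import butter, sosfiltfilt

def filter_bank(y, sr, n_channels=24, f_lo=80.0, f_hi=8000.0):
    """Split a signal into log-spaced bands; requires f_hi < sr / 2."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        bands.append(sosfiltfilt(sos, y))
    return np.stack(bands)  # shape: (n_channels, n_samples)
```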
Blind source separation (BSS) is a research hotspot in the field of signal processing. It is widely applied to separate a group of source signals from a given set of observations or mixed signals. In the present study, the Savitzky–Golay filter is applied to smooth the mixed signals, a simplified cost function based on the signal-to-noise ratio (SNR) is adopted, and the demixing matrix is obtained accordingly. To this end, the generalized eigenvalue problem is solved without conventional iterative methods. It is found that the proposed algorithm has a simple structure and can be easily implemented in diverse problems. The obtained results demonstrate the good performance of the proposed model for separating audio signals in cases with high signal-to-noise ratios.
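A minimal sketch of the described scheme, under the assumption that the SNR cost contrasts the Savitzky–Golay-smoothed part of each mixture (the "signal") with the residual (the "noise"); the demixing matrix then falls out of a closed-form generalized eigenvalue problem solved with scipy, with no iterative optimization.

```python
# SNR-based BSS via a generalized eigenvalue problem (sketch under the
# smoothed-vs-residual assumption stated above; window/order are illustrative).
import numpy as np
from scipy.signal import savgol_filter
from scipy.linalg import eigh

def snr_bss(X, window=51, polyorder=3):
    """Demix rows of X (mixtures x samples)."""
    S = savgol_filter(X, window, polyorder, axis=1)  # smoothed "signal" part
    N = X - S                                        # residual "noise" part
    C_s = S @ S.T / S.shape[1]
    C_n = N @ N.T / N.shape[1]
    # Closed-form solution of C_s w = lambda C_n w; eigenvectors in columns.
    _, W = eigh(C_s, C_n)
    return W.T @ X  # estimated sources (up to scale and permutation)
```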