Keyword spotting (KWS) is one of the speech recognition tasks most sensitive to the quality of the feature representation. However, research on KWS has traditionally focused on new model topologies, putting little emphasis on other aspects such as feature extraction. This paper investigates the use of the multitaper technique to create improved features for KWS. The experimental study is carried out for different test scenarios, windows and parameters, datasets, and neural networks commonly used in embedded KWS applications. Experimental results confirm the advantages of using the proposed improved features.
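The multitaper idea can be sketched briefly: instead of one windowed periodogram per frame, several orthogonal DPSS (Slepian) tapers are applied to the same frame and their eigenspectra are averaged, reducing estimator variance. The following is a minimal numpy/scipy sketch; the frame length, hop, time-bandwidth product, and taper count are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrogram(x, frame_len=256, hop=128, nw=2.5, n_tapers=4):
    """Average periodograms over DPSS tapers for each analysis frame."""
    tapers = dpss(frame_len, nw, Kmax=n_tapers)          # (n_tapers, frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.zeros((frame_len // 2 + 1, n_frames))
    for t in range(n_frames):
        frame = x[t * hop:t * hop + frame_len]
        # One eigenspectrum per taper, averaged to reduce variance.
        eig = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
        spec[:, t] = eig.mean(axis=0)
    return spec

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)      # 1 kHz test tone, one second
S = multitaper_spectrogram(x)         # (129, 124) time-frequency map
```

A log-compressed, mel-warped version of such a map would then replace the conventional single-window spectrogram in the KWS front end.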
Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network consisting of stacked one-dimensional dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications.
This study, therefore, represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
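The encoder/mask/decoder pipeline described above can be sketched in a few lines of numpy. In this sketch, random matrices stand in for the learned 1-D convolutional encoder and decoder, and random softmax weights stand in for the masks that the temporal convolutional network would predict; all dimensions are illustrative, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
L, hop, N = 16, 8, 64                  # filter length, stride, number of basis filters

def frames(x):
    n = 1 + (len(x) - L) // hop
    return np.stack([x[i * hop:i * hop + L] for i in range(n)])   # (n_frames, L)

def overlap_add(F, length):
    y = np.zeros(length)
    for i, f in enumerate(F):
        y[i * hop:i * hop + L] += f
    return y

encoder = rng.standard_normal((N, L))  # stand-in for the learned linear encoder
decoder = rng.standard_normal((L, N))  # stand-in for the learned linear decoder

mix = rng.standard_normal(1024)
w = np.maximum(frames(mix) @ encoder.T, 0.0)   # non-negative encoder output

# Stand-in masks (predicted by the TCN in the real model); softmax over
# speakers so the per-bin weights sum to one.
logits = rng.standard_normal((2,) + w.shape)
masks = np.exp(logits) / np.exp(logits).sum(axis=0)

sources = [overlap_add((m * w) @ decoder.T, len(mix)) for m in masks]
```

Because the masks sum to one and the decoder is linear, the two separated waveforms add up exactly to the decoded unmasked representation, which mirrors the mask-based formulation in the paper.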
Deep learning has made radar-based human activity recognition (HAR) attract increasing attention in the fields of intelligent security, traffic management, medical rehabilitation, and military operations, because it is able to automatically extract comprehensive features of human activity. This paper proposes a recognition method based on multiple spectrograms and a mixed convolutional neural network (MCNN). Specifically, three time-frequency analyses, namely the short-time Fourier transform (STFT), the reduced interference distribution with Hanning kernel (RIDHK), and the smoothed pseudo Wigner-Ville distribution (SPWVD), are performed on radar echo data to obtain time-frequency spectrograms with different feature expressions, and the spectrograms are then fed into the MCNN for recognition and classification. In the MCNN, three two-dimensional CNNs (2DCNNs) extract independent spatial features from the three types of spectrograms, and one three-dimensional CNN (3DCNN) with a unit convolution kernel focuses on extracting the correlation features between the three kinds of spectrograms. The behavioral features in the spectrograms are characterized comprehensively by fusing these two kinds of features, which improves the recognition accuracy. Experimental results show that the proposed method improves the average recognition accuracy on eight human activities by at least 2.16% compared with other methods.
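The multi-spectrogram input can be sketched as a channel-stacked time-frequency cube. As a simplification, the sketch below uses three STFTs with different analysis windows in place of the STFT/RIDHK/SPWVD triple (RIDHK and SPWVD have no one-line scipy implementation), since the point is only that three same-shaped maps are stacked into a (3, freq, time) array for the mixed CNN; the chirp signal and all parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft

fs = 1000
t = np.arange(2 * fs) / fs
# Toy "radar echo": a chirp whose frequency rises over time.
x = np.sin(2 * np.pi * (50 * t + 40 * t ** 2))

# Stand-in for the STFT / RIDHK / SPWVD analyses: three STFTs with
# different windows, so the maps share a shape and stack channel-wise.
chans = []
for win in ("hann", "hamming", "blackman"):
    f, tt, Z = stft(x, fs=fs, window=win, nperseg=128, noverlap=96)
    chans.append(np.abs(Z) ** 2)
cube = np.stack(chans)       # (3, n_freq, n_time): input to the mixed CNN
```

Each channel would feed one 2DCNN branch, while the 3DCNN with a unit kernel would operate across the leading channel axis.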
The CMT welding process has been widely used for aluminum alloy welding. The weld's penetration state is essential for evaluating welding quality. Arc sound signals contain a wealth of information related to the penetration state of the weld. This paper studies the correlation between the frequency-domain features of arc sound signals and the weld penetration state, as well as the correlation between Mel spectrograms, Gammatone spectrograms, and Bark spectrograms and the weld penetration state. Arc sound features fused with multiple spectrograms are constructed as inputs to a custom Inception CNN model, optimized based on GoogleNet, for CMT weld penetration state recognition. The experimental results show that the accuracy of the proposed method in identifying the penetration state of CMT welds in aluminum alloy plates is 97.7%, higher than the accuracy obtained with any single spectrogram as input. The recognition accuracy of the customized Inception CNN is 0.93% higher than that of GoogleNet, and it also outperforms AlexNet and ResNet.
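All three auditory spectrograms (Mel, Gammatone, Bark) share the same recipe: a perceptual filterbank applied to a linear power spectrogram. As one representative, here is a minimal numpy sketch of a Mel filterbank using the common HTK-style mel formula; the filter count, FFT size, and sampling rate are illustrative, not the paper's settings.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, fs):
    """Triangular filters spaced evenly on the mel scale (HTK formula)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(fs / 2), n_mels + 2))  # band edges, Hz
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)         # edges, FFT bins
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)  # rising edge
        fb[i, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)  # falling edge
    return fb

fb = mel_filterbank(n_mels=26, n_fft=512, fs=16000)
# A Mel spectrogram is then fb @ power_spectrogram, shape (n_mels, n_frames).
```

Gammatone and Bark spectrograms differ only in the filter shapes and the frequency warping used to place the band edges.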
Obstructive sleep apnea (OSA) is a severe sleep-associated respiratory disorder caused by periodic disruption of breathing during sleep. It may cause a number of serious cardiovascular complications, including stroke. Generally, OSA is detected by polysomnography (PSG), a costly procedure that may cause discomfort to the patient. Nowadays, electrocardiogram (ECG) signal-based detection techniques have been explored as an alternative to PSG for OSA detection. Conventional linear and nonlinear machine learning techniques mainly rely on handcrafted feature extraction and classification, which is time-consuming and may not be suitable for large datasets. Therefore, in this work, a deep learning model (DLM) using the smoothed Gabor spectrogram (SGS) of ECG signals is proposed for automated OSA detection. The proposed framework feeds the Gabor spectrogram and SGS of ECG signals as input to the pretrained Squeeze-Net and Res-Net50 models and to a newly developed DLM called the obstructive sleep apnea convolutional neural network (OSACN-Net). The proposed OSACN-Net achieved an average classification accuracy of 94.81% with SGS using a tenfold cross-validation strategy. Compared to Squeeze-Net and Res-Net50, the developed OSACN-Net is more accurate and lightweight, as it requires fewer learnable parameters, which makes it computationally fast and efficient. The comparison results show that the proposed framework outperforms all existing state-of-the-art methodologies.
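A Gabor spectrogram is an STFT taken with a Gaussian analysis window, and a smoothed version can be obtained by averaging over the time-frequency plane. The sketch below shows one plausible reading of that pipeline; the paper's exact smoothing operator is not specified here, so a k x k moving average stands in for it, and the window length, Gaussian width, and toy signal are illustrative.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import uniform_filter

def smoothed_gabor_spectrogram(x, fs, nperseg=256, std=32, k=3):
    # Gabor spectrogram: STFT with a Gaussian window (std in samples).
    _, _, Z = stft(x, fs=fs, window=("gaussian", std), nperseg=nperseg)
    S = np.abs(Z) ** 2
    # Smoothing: k x k moving average over the time-frequency plane.
    return uniform_filter(S, size=k)

fs = 250                                  # typical single-lead ECG rate
t = np.arange(10 * fs) / fs
ecg_like = np.sin(2 * np.pi * 1.2 * t)    # toy 1.2 Hz rhythm, not real ECG
S = smoothed_gabor_spectrogram(ecg_like, fs)
```

The resulting map would be resized to the input resolution expected by Squeeze-Net, Res-Net50, or OSACN-Net.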
We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, for arbitrary numbers and classes of sources, "class-based" methods are not suitable. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model performs surprisingly well on three-speaker mixtures.
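The test-time "decoding" step amounts to running K-means over the per-bin embedding vectors and turning the cluster labels into binary masks. A minimal numpy sketch follows; the embeddings are synthetic stand-ins for the network's output, the embedding dimension is arbitrary, and farthest-point seeding is used only to keep the toy example deterministic.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd K-means, used at test time to decode the segmentation."""
    # Farthest-point seeding keeps this toy example deterministic.
    C = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in C], axis=0)
        C.append(X[np.argmax(d)])
    C = np.stack(C)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.stack([X[labels == j].mean(0) for j in range(k)])
    return labels

# Toy embeddings: two well-separated clusters standing in for the vectors
# the network would assign to each time-frequency bin of a 2-speaker mix.
rng = np.random.default_rng(1)
emb = np.concatenate([rng.normal(0, 0.1, (200, 20)),
                      rng.normal(1, 0.1, (200, 20))])
labels = kmeans(emb, k=2)
# Binary masks per source follow by reshaping labels back to (time, freq).
```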
In recent years, deep learning algorithms have become increasingly prominent for their unparalleled ability to automatically learn discriminant features from large amounts of data. However, within the field of electromyography-based gesture recognition, deep learning algorithms are seldom employed, as they would require an unreasonable amount of effort from a single person to generate tens of thousands of examples. This paper's hypothesis is that general, informative features can be learned from the large amounts of data generated by aggregating the signals of multiple users, thus reducing the recording burden while enhancing gesture recognition. Consequently, this paper proposes applying transfer learning on aggregated data from multiple users while leveraging the capacity of deep learning algorithms to learn discriminant features from large datasets. Two datasets comprising 19 and 17 able-bodied participants, respectively (the first one is employed for pre-training), were recorded for this work using the Myo armband. A third Myo armband dataset was taken from the NinaPro database and comprises ten able-bodied participants. Three different deep learning networks employing three different modalities as input (raw EMG, spectrograms, and the continuous wavelet transform (CWT)) are tested on the second and third datasets. The proposed transfer learning scheme is shown to systematically and significantly enhance the performance of all three networks on the two datasets, achieving an offline accuracy of 98.31% for 7 gestures over 17 participants for the CWT-based ConvNet and 68.98% for 18 gestures over 10 participants for the raw-EMG-based ConvNet. Finally, a use-case study employing eight able-bodied participants suggests that real-time feedback allows users to adapt their muscle activation strategy, which reduces the degradation in accuracy normally experienced over time.
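Of the three input modalities, the CWT is the least standard, so here is a minimal numpy sketch of a complex Morlet CWT computed by direct convolution. The wavelet parameter, frequency grid, truncated support, and 40 Hz toy signal are all illustrative assumptions, not the paper's configuration (and the sketch omits per-scale normalization).

```python
import numpy as np

def morlet_cwt(x, fs, freqs, w0=6.0):
    """CWT magnitude with complex Morlet wavelets via direct convolution."""
    out = np.zeros((len(freqs), len(x)))
    t = np.arange(-0.25, 0.25, 1.0 / fs)          # truncated wavelet support
    for i, f in enumerate(freqs):
        s = w0 / (2 * np.pi * f)                  # scale for centre frequency f
        psi = np.exp(1j * w0 * t / s) * np.exp(-(t / s) ** 2 / 2)
        out[i] = np.abs(np.convolve(x, psi, mode="same"))
    return out

fs = 200                                  # Myo armband EMG sampling rate
t = np.arange(fs) / fs
emg_like = np.sin(2 * np.pi * 40 * t)     # toy 40 Hz burst, not real EMG
scalogram = morlet_cwt(emg_like, fs, freqs=np.array([10.0, 40.0, 80.0]))
```

In the transfer learning scheme, a ConvNet pre-trained on scalograms aggregated across many users would then be fine-tuned on a new user's data.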
Underwater acoustic target recognition (UATR) is usually difficult due to the complex and multipath underwater environment. Currently, deep-learning (DL)-based UATR methods have proved their effectiveness and have outperformed traditional methods by using powerful convolutional neural networks (CNNs) to extract discriminative features from acoustic spectrograms. However, CNNs often fail to capture the global information implicit in the spectrogram due to the use of small kernels, and thus encounter a performance bottleneck. To this end, we propose the UATR-transformer, based on a convolution-free architecture referred to as the transformer, which can perceive both global and local information in acoustic spectrograms and thus improve accuracy. Experiments on two real-world datasets demonstrate that our proposed model achieves results comparable to state-of-the-art CNNs and can therefore be applied to certain cases in UATR.
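Feeding a spectrogram to a transformer typically starts by cutting it into non-overlapping patches that become the token sequence, which is what lets attention relate distant time-frequency regions. A minimal ViT-style patchify sketch follows; the patch size and spectrogram shape are illustrative, not the paper's settings.

```python
import numpy as np

def patchify(spec, ph=16, pw=16):
    """Split a spectrogram into flattened non-overlapping patches."""
    F, T = spec.shape
    spec = spec[:F - F % ph, :T - T % pw]        # drop ragged edges
    patches = spec.reshape(F // ph, ph, T // pw, pw).swapaxes(1, 2)
    return patches.reshape(-1, ph * pw)          # (n_patches, patch_dim)

rng = np.random.default_rng(0)
spec = rng.random((128, 100))                    # toy acoustic spectrogram
tokens = patchify(spec)                          # 8 x 6 = 48 tokens of dim 256
# A learned linear projection plus positional encodings would then map
# these tokens into the transformer encoder's input sequence.
```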
We propose the use of deep convolutional neural networks (DCNNs) for human detection and activity classification based on Doppler radar. Previously proposed schemes for these problems remained within the conventional supervised learning paradigm, which relies on the design of handcrafted features. While these schemes attained high accuracy, the requirement for domain knowledge of each problem limits their scalability. In this letter, we present an alternative deep learning approach. We apply the DCNN, one of the most successful deep learning algorithms, directly to a raw micro-Doppler spectrogram for both the human detection and activity classification problems. The DCNN can jointly learn the necessary features and classification boundaries from the measured data without employing any explicit features of the micro-Doppler signals. We show that the DCNN can achieve accuracies of 97.6% for human detection and 90.9% for human activity classification.
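What "raw micro-Doppler spectrogram" means can be illustrated with a toy signal: a steady torso Doppler line plus a sinusoidally swinging limb component, turned into a spectrogram by an STFT with no further feature engineering. All frequencies and STFT parameters below are illustrative assumptions, not the letter's radar configuration.

```python
import numpy as np
from scipy.signal import stft

fs = 1000                                   # sampling rate of the radar return (Hz)
t = np.arange(2 * fs) / fs
# Toy micro-Doppler return: torso line at 100 Hz plus a limb whose Doppler
# swings +/- 60 Hz once per second (FM law integrated to phase).
phase = 2 * np.pi * 100 * t + 60 * (1 - np.cos(2 * np.pi * t))
x = np.cos(phase)

f, tt, Z = stft(x, fs=fs, nperseg=128, noverlap=112)
spectrogram = np.abs(Z) ** 2                # fed directly to the DCNN;
                                            # no handcrafted features needed
```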
In practice, radar measurements are hindered by unavoidable noise, which lowers the signal-to-noise ratio (SNR) and raises the problem of radar signal denoising. Thanks to the development of deep learning techniques, recently proposed denoisers are progressively capable of blind denoising. On the other hand, due to the great fitting capacity of deep neural networks, deep-learning-based denoising models tend to overfit on the training set, diminishing the generalization of a denoiser and impeding its use in broader situations. This article focuses on this "blind universal denoising" problem for the first time and introduces a novel generative-adversarial-network-based (GAN-based) denoiser for radar spectrograms. The core idea of the proposed model lies in minimizing the generalization error during training; to this end, our model incorporates a proposed identical dual learning (IDL) scheme and a reciprocal adversarial training (RAT) strategy to avoid the risk of overfitting in the denoiser's training. We perform radar simulation using a motion capture database and verify our model's effectiveness under three different setups of training and testing datasets. For each setup, the noise level in the training and testing sets is configured to be different so as to simulate unknown measurement situations. Eleven algorithms are selected for comparison, and the experimental results on two criteria illustrate that our method outperforms the others by a significant margin.
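The mismatched train/test noise setup can be made concrete with a small helper that corrupts a clean spectrogram with white Gaussian noise at a chosen SNR. This is a generic sketch of the evaluation protocol, not the article's simulator; the spectrogram, the 10 dB / 0 dB levels, and the Gaussian noise model are illustrative assumptions.

```python
import numpy as np

def add_noise_at_snr(spec, snr_db, rng):
    """Corrupt a clean spectrogram with white noise at a target SNR (dB)."""
    p_sig = np.mean(spec ** 2)
    p_noise = p_sig / (10 ** (snr_db / 10))      # noise power from SNR definition
    return spec + rng.normal(0.0, np.sqrt(p_noise), spec.shape)

rng = np.random.default_rng(0)
clean = rng.random((64, 64))                     # stand-in radar spectrogram
# Mismatched noise levels between training and testing, as in the blind
# universal denoising setup: e.g. train at 10 dB, test at 0 dB.
train_noisy = add_noise_at_snr(clean, 10.0, rng)
test_noisy = add_noise_at_snr(clean, 0.0, rng)
```

A denoiser fit only on the 10 dB condition and evaluated at 0 dB exposes exactly the generalization gap the IDL and RAT components are designed to close.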