In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent multitalker speech separation. Specifically, uPIT extends the recently proposed permutation invariant training (PIT) technique with an utterance-level cost function, thereby eliminating the need to solve an additional permutation problem during inference, which is otherwise required by frame-level PIT. We achieve this using recurrent neural networks (RNNs) that, during training, minimize the utterance-level separation error, forcing separated frames belonging to the same speaker to be aligned to the same output stream. In practice, this allows RNNs trained with uPIT to separate multitalker mixed speech without any prior knowledge of signal duration, number of speakers, speaker identity, or gender. We evaluated uPIT on the WSJ0 and Danish two- and three-talker mixed-speech separation tasks and found that uPIT outperforms techniques based on nonnegative matrix factorization and computational auditory scene analysis, and compares favorably with deep clustering and the deep attractor network. Furthermore, we found that models trained with uPIT generalize well to unseen speakers and languages. Finally, we found that a single model trained with uPIT can handle both two-speaker and three-speaker speech mixtures.
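The core of the utterance-level cost can be sketched as follows: evaluate every output-to-speaker assignment over the whole utterance and keep only the best one, so the assignment cannot flip from frame to frame. This is a minimal numpy sketch of that idea, not the paper's exact implementation; the MSE criterion and magnitude-spectrogram shapes are illustrative assumptions.

```python
import itertools
import numpy as np

def upit_mse_loss(estimates, targets):
    """Utterance-level PIT loss (illustrative sketch).

    estimates, targets: arrays of shape (num_speakers, num_frames, num_bins).
    Evaluates each output-to-speaker permutation over the WHOLE utterance
    and returns (min_loss, best_permutation), so frames belonging to the
    same speaker stay on the same output stream.
    """
    num_speakers = estimates.shape[0]
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(num_speakers)):
        # One scalar MSE per permutation, averaged over the full utterance.
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

During training the minimum loss is backpropagated; at inference no permutation search is needed, since each output stream already tracks one speaker.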
In this paper, we propose novel strategies for neutral vector variable decorrelation. Two fundamental invertible transformations, namely serial nonlinear transformation and parallel nonlinear transformation, are proposed to carry out the decorrelation. For a neutral vector variable, which is not multivariate-Gaussian distributed, conventional principal component analysis cannot yield mutually independent scalar variables. With the two proposed transformations, a highly negatively correlated neutral vector can be transformed into a set of mutually independent scalar variables with the same degrees of freedom. We also evaluate the decorrelation performance for vectors generated from a single Dirichlet distribution and from a mixture of Dirichlet distributions. Mutual independence is verified with the distance correlation measure. The advantages of the proposed decorrelation strategies are studied extensively and demonstrated with synthesized data and practical application evaluations.
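As a concrete illustration of serial nonlinear decorrelation on a Dirichlet sample (the canonical neutral vector), one can use the stick-breaking ratios u_k = x_k / (1 - x_1 - ... - x_{k-1}), which for a Dirichlet vector are mutually independent Beta variables. This is a hedged sketch of that known property, not necessarily the paper's exact transformation:

```python
import numpy as np

def serial_transform(x):
    """Serial nonlinear (stick-breaking) transformation, sketched for a
    batch of neutral vectors.

    x: array of shape (num_samples, K), e.g. the first K coordinates of
    Dirichlet samples.  Returns u with u_k = x_k / (1 - sum_{i<k} x_i);
    for Dirichlet input the columns of u are mutually independent.
    """
    x = np.asarray(x, dtype=float)
    # Mass remaining before each coordinate: 1, 1-x1, 1-x1-x2, ...
    remaining = 1.0 - np.concatenate(
        [np.zeros((x.shape[0], 1)), np.cumsum(x[:, :-1], axis=1)], axis=1)
    return x / remaining
```

On Dirichlet data the raw coordinates are strongly negatively correlated, while the transformed coordinates show correlation near zero, which is the decorrelation effect the abstract describes.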
Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems. We study how popular loss functions influence the performance of time-domain deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions can be advantageous over classical loss functions like MSE. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has generally been overlooked in the literature. We also found that waveform-matching performance metrics must be used with caution, as in certain situations they can fail completely. Finally, we show that a loss function based on the scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.
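SI-SDR, used negated as a training loss, can be computed as below. This follows the common zero-mean definition (project the estimate onto the target to obtain the scaled reference, then compare energies); it is a minimal sketch, not the exact code used in the study:

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (sketch).

    Both signals are mean-removed; the target is rescaled by the
    projection coefficient so the metric is invariant to the overall
    gain of the estimate.  For training, one minimizes -si_sdr.
    """
    estimate = estimate - np.mean(estimate)
    target = target - np.mean(target)
    # Optimal scaling of the target that best explains the estimate.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target
    e_noise = estimate - s_target
    return 10.0 * np.log10(
        np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))
```

A rescaled clean signal scores very high SI-SDR, while additive noise lowers it, which is exactly the behavior that makes it suitable as a gain-insensitive loss.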
With the development of speech synthesis technology, automatic speaker verification (ASV) systems have encountered the serious challenge of spoofing attacks. To improve the security of ASV systems, many antispoofing countermeasures have been developed. In the front-end domain, much research has been conducted on finding effective features that can distinguish spoofed speech from genuine speech, and published results show that dynamic acoustic features work more effectively than static ones. In the back-end domain, Gaussian mixture models (GMMs) and deep neural networks (DNNs) are the two most popular types of classifiers used for spoofing detection. Log-likelihood ratios (LLRs), computed as the difference between human and spoofing log-likelihoods, are used as spoofing detection scores. In this paper, we train a five-layer DNN spoofing detection classifier using dynamic acoustic features and propose a novel, simple scoring method that uses only human log-likelihoods (HLLs) for spoofing detection. We mathematically prove that the new HLL scoring method is more suitable for the spoofing detection task than the classical LLR scoring method, especially when the spoofed speech is very similar to human speech. We extensively investigate the performance of five different dynamic filter bank-based cepstral features and constant Q cepstral coefficients (CQCC) in conjunction with the DNN-HLL method. The experimental results show that, compared to the GMM-LLR method, the DNN-HLL method significantly improves spoofing detection accuracy. Compared with the CQCC-based GMM-LLR baseline, the proposed DNN-HLL model reduces the average equal error rate over all attack types to 0.045%, thus exceeding the performance of previously published approaches for the ASVspoof 2015 Challenge task. By fusing the CQCC-based DNN-HLL spoofing detection system with ASV systems, the false acceptance rate on spoofing attacks can be reduced significantly.
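The difference between the two scoring rules can be made concrete. In this illustrative sketch (not the paper's exact recipe), a DNN emits frame-level posteriors over {human, spoof}, and the two utterance-level scores are formed either from both classes (LLR) or from the human class alone (HLL); the shape and averaging convention are assumptions for illustration:

```python
import numpy as np

def utterance_scores(posteriors, eps=1e-10):
    """Form LLR and HLL spoofing-detection scores from DNN outputs.

    posteriors: array of shape (num_frames, 2), columns = (human, spoof),
    each row a softmax output.  Returns (llr_score, hll_score).
    """
    log_p = np.log(posteriors + eps)
    hll = log_p[:, 0].mean()        # human log-likelihood only (proposed)
    llr = hll - log_p[:, 1].mean()  # classical log-likelihood ratio
    return llr, hll
```

The abstract's argument is that when spoofed speech closely resembles human speech, the spoofing log-likelihood term in the LLR becomes unreliable, whereas the HLL score sidesteps it entirely.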
Metaplasticity, a higher order of synaptic plasticity and a key topic in neuroscience, is realized with artificial synapses based on a WO3 thin film, and the activity‐dependent metaplastic responses of the artificial synapses, such as spike‐timing‐dependent plasticity, are systematically investigated. This work has significant implications for neuromorphic computation.
We propose a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem. Different from the multi-class regression technique and the deep clustering (DPCL) technique, our novel approach minimizes the separation error directly. This strategy effectively solves the long-standing label permutation problem that has prevented progress on deep learning-based techniques for speech separation. We evaluated PIT on the WSJ0 and Danish mixed-speech separation tasks and found that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well to unseen speakers and languages. Since PIT is simple to implement and can easily be integrated and combined with other advanced techniques, we believe improvements built upon PIT can eventually solve the cocktail-party problem.
Artificial neurons with functions such as leaky integrate‐and‐fire (LIF) and spike output are essential for brain‐inspired computation with high efficiency. However, previously implemented artificial neurons, e.g., Hodgkin–Huxley (HH) neurons, integrate‐and‐fire (IF) neurons, and LIF neurons, only achieve partial functionality of a biological neuron. In this work, quasi‐HH neurons with leaky integrate‐and‐fire functions are physically demonstrated with a volatile memristive device, W/WO3/poly(3,4‐ethylenedioxythiophene):polystyrene sulfonate/Pt. The resistive switching behavior of the device can be attributed to the migration of protons, unlike the migration of oxygen ions normally involved in oxide‐based memristors. With multifunctions similar to their biological counterparts, quasi‐HH neurons are advantageous over the reported HH and LIF neurons, demonstrating their potential for neuromorphic computing applications.
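For readers unfamiliar with the LIF abstraction the device emulates, the textbook leaky integrate-and-fire model can be sketched in a few lines. This is the standard mathematical model, not a model of the memristive device itself; the time constant, threshold, and reset values are illustrative assumptions:

```python
def simulate_lif(current, dt=1.0, tau=20.0, v_rest=0.0,
                 v_reset=0.0, v_th=1.0):
    """Minimal leaky integrate-and-fire neuron (textbook sketch).

    The membrane potential leaks toward rest, integrates the input
    current, and emits a spike followed by a reset when it crosses
    the threshold.  Returns the list of spike times (step indices).
    """
    v = v_rest
    spikes = []
    for t, i_in in enumerate(current):
        # Forward-Euler step of: tau * dv/dt = -(v - v_rest) + i_in
        v += dt / tau * (-(v - v_rest) + i_in)
        if v >= v_th:
            spikes.append(t)
            v = v_reset
    return spikes
```

A constant suprathreshold input produces a regular spike train, while subthreshold input never fires; HH-type models add voltage-gated conductances on top of this leaky integration, which is the extra functionality the quasi-HH device targets.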
Quasi‐Hodgkin–Huxley (HH) neurons with leaky integrate‐and‐fire functions are physically demonstrated by W/WO3/poly(3,4‐ethylenedioxythiophene):polystyrene sulfonate/Pt memristive devices with a battery effect; in the device, proton migration plays a key role. With the help of a neuromorphic circuit, the neuron successfully emulates the multifunction of a biological neuron, being advantageous over previously reported HH and leaky integrate‐and‐fire neurons.
Spoken keyword spotting (KWS) deals with the identification of keywords in audio streams and has become a fast-growing technology thanks to the paradigm shift introduced by deep learning a few years ago. This has allowed the rapid embedding of deep KWS in a myriad of small electronic devices with different purposes, like the activation of voice assistants. Prospects suggest a sustained growth in terms of social use of this technology. Thus, it is not surprising that deep KWS has become a hot research topic among speech scientists, who constantly look for KWS performance improvement and computational complexity reduction. This context motivates this paper, in which we conduct a literature review into deep spoken KWS to assist practitioners and researchers who are interested in this technology. Specifically, this overview has a comprehensive nature by covering a thorough analysis of deep KWS systems (which includes speech features, acoustic modeling and posterior handling), robustness methods, applications, datasets, evaluation metrics, performance of deep KWS systems and audio-visual KWS. The analysis performed in this paper allows us to identify a number of directions for future research, including directions adopted from automatic speech recognition research and directions that are unique to the problem of spoken KWS.
This paper presents a rule-based adaptive protection scheme using machine-learning methodology for microgrids in extensive distribution automation (DA). The uncertain elements in a microgrid are first analysed quantitatively by Pearson correlation coefficients from data mining. Then, a hybrid artificial neural network and support vector machine (ANN-SVM) model is proposed for state recognition in microgrids, which utilises the massive, growing data streams in smart grids. Based on the state recognition in the algorithm, adaptive reconfigurations can be implemented with enhanced decision-making to modify the protective settings and the network topology to ensure the reliability of intelligent operation. The effectiveness of the proposed methods is demonstrated on a microgrid model in Aalborg, Denmark, and an IEEE 9-bus model, respectively.
In this paper, we study aspects of single-microphone speech enhancement (SE) based on deep neural networks (DNNs). Specifically, we explore the generalizability of state-of-the-art DNN-based SE systems with respect to the background noise type, the gender of the target speaker, and the signal-to-noise ratio (SNR). Furthermore, we investigate how specialized DNN-based SE systems, which have been trained to be either noise type specific, speaker specific or SNR specific, perform relative to DNN-based SE systems that have been trained to be noise type general, speaker general, and SNR general. Finally, we compare how a DNN-based SE system trained to be noise type general, speaker general, and SNR general performs relative to a state-of-the-art short-time spectral amplitude minimum mean square error (STSA-MMSE) based SE algorithm. We show that DNN-based SE systems, when trained specifically to handle certain speakers, noise types and SNRs, are capable of achieving large improvements in estimated speech quality (SQ) and speech intelligibility (SI) when tested in matched conditions. Furthermore, we show that improvements in estimated SQ and SI can be achieved by a DNN-based SE system when exposed to unseen speakers, genders and noise types, provided that a large number of speakers and noise types have been used in training the system. In addition, we show that a DNN-based SE system that has been trained using a large number of speakers and a wide range of noise types outperforms a state-of-the-art STSA-MMSE based SE method when tested using a range of unseen speakers and noise types. Finally, a listening test using several DNN-based SE systems tested in unseen speaker conditions shows that these systems can improve SI for some SNR and noise type configurations but degrade SI for others.