Although great progress has been made in automatic speech recognition, significant performance degradation still exists in noisy environments. Recently, very deep convolutional neural networks (CNNs) have been successfully applied to computer vision and speech recognition tasks. Based on our previous work on very deep CNNs, in this paper the architecture is further developed to improve recognition accuracy for noise-robust speech recognition. In the proposed very deep CNN architecture, we study the best configuration for the sizes of filters, pooling, and input feature maps: the filter and pooling sizes are reduced and the dimensions of the input features are extended to allow for adding more convolutional layers. Then the appropriate pooling, padding, and input feature map selection strategies are investigated and applied to the very deep CNN to make it more robust for speech recognition. In addition, an in-depth analysis of the architecture reveals key characteristics such as compact model scale, fast convergence speed, and noise robustness. The proposed new model is evaluated on two tasks: the Aurora4 task with multiple additive noise types and channel mismatch, and the AMI meeting transcription task with significant reverberation. Experiments on both tasks show that the proposed very deep CNNs can significantly reduce the word error rate (WER) for noise-robust speech recognition. The best architecture obtains a 10.0% relative reduction over the traditional CNN on AMI, competitive with the long short-term memory recurrent neural network (LSTM-RNN) acoustic model. On Aurora4, even without feature enhancement, model adaptation, or sequence training, it achieves a WER of 8.81%, a 17.0% relative improvement over the LSTM-RNN. To our knowledge, this is the best published result on Aurora4.
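The filter-size trade-off described in this abstract (smaller filters and poolings, plus extended input features, admit more convolutional layers) can be sketched with simple shape arithmetic. The specific numbers below (40 vs. 64 frequency bands, 9×9 vs. 3×3 filters, unpadded convolutions) are illustrative assumptions, not the paper's exact configuration:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output length of one convolution/pooling dimension."""
    return (size + 2 * pad - kernel) // stride + 1

def stack_depth(size, kernel, min_size=1):
    """How many unpadded conv layers fit before the feature map
    shrinks below min_size along this dimension."""
    depth = 0
    while conv_out(size, kernel) >= min_size:
        size = conv_out(size, kernel)
        depth += 1
    return depth

# Large filters exhaust a 40-band input quickly; small 3x3 filters
# allow far more layers, and extending the input to 64 bands more still.
print(stack_depth(40, 9))   # 4 layers
print(stack_depth(40, 3))   # 19 layers
print(stack_depth(64, 3))   # 31 layers
```

Padding and pooling strategies shift these numbers further, but the direction is the same: depth is bought by shrinking the per-layer reduction of the feature map.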
Recent advances in automatic speaker verification (ASV) have led to increased interest in securing these systems for real-world applications. Malicious spoofing attempts against ASV systems can lead to serious security breaches. A spoofing attack in the context of ASV is a condition in which a (potentially harmful) person successfully masquerades as another person already known to the ASV system by falsifying or manipulating data. While most previous work focuses on enhanced, spoof-aware features, end-to-end models are a potential alternative. In this paper, we investigate the training of raw waveform front-ends for deep convolutional, long short-term memory (LSTM), and vanilla neural networks, which are analyzed for their suitability for spoofing detection with regard to the influence of frame size, number of output neurons, and sequence length. A joint convolutional LSTM deep neural network (CLDNN) is proposed, which outperforms previous attempts on the BTAS2016 dataset (0.82% → 0.19% HTER), placing itself as the current state-of-the-art model for the dataset. We show that end-to-end approaches are appropriate for the important replay detection task and that the proposed model is capable of distinguishing device-invariant spoofing attempts. On the ASVspoof2015 dataset, the end-to-end solution achieves an equal error rate (EER) of 0.00% for the S1-S9 conditions. We show that the end-to-end approach based on a raw waveform input can outperform common cepstral features without the use of context-dependent frame extensions. In addition, a cross-database (domain mismatch) scenario is also evaluated, which shows that the proposed CLDNN model trained on the BTAS2016 dataset achieves an EER of 25.7% on the ASVspoof2015 dataset.
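A raw-waveform front-end of the kind studied above starts by slicing the signal into overlapping frames that the convolutional layers consume directly (no windowing or FFT). This minimal sketch assumes plain strided framing; the frame length and hop are exactly the kind of hyperparameters the abstract says are analyzed:

```python
import numpy as np

def frame_signal(wave, frame_len, hop):
    """Slice a 1-D waveform into overlapping frames of length frame_len,
    advancing by hop samples; trailing samples that do not fill a full
    frame are dropped."""
    n = 1 + max(0, (len(wave) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return wave[idx]

# e.g. 100 samples, 25-sample frames, 10-sample hop -> 8 frames of 25
frames = frame_signal(np.arange(100.0), frame_len=25, hop=10)
print(frames.shape)  # (8, 25)
```

Each row then feeds the learned convolutional feature extractor, so frame size directly controls the receptive field of the first layer.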
Speech recognition is a sequence prediction problem. Besides employing various deep learning approaches for frame-level classification, sequence-level discriminative training has proved indispensable for achieving state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR). However, keyword spotting (KWS), one of the most common speech recognition tasks, benefits almost exclusively from frame-level deep learning due to the difficulty of obtaining competing sequence hypotheses. The few studies on sequence discriminative training for KWS are limited to fixed-vocabulary or LVCSR-based methods and have not been compared to state-of-the-art deep learning based KWS approaches. In this paper, a sequence discriminative training framework is proposed for both fixed-vocabulary and unrestricted acoustic KWS. Sequence discriminative training for both sequence-level generative and discriminative models is systematically investigated. By introducing word-independent phone lattices or non-keyword blank symbols to construct competing hypotheses, feasible and efficient sequence discriminative training approaches are proposed for acoustic KWS. Experiments showed that the proposed approaches obtain consistent and significant improvements in both fixed-vocabulary and unrestricted KWS tasks compared to previous frame-level deep learning based acoustic KWS methods.
Short-duration text-independent speaker verification has remained an active research topic in recent years, and deep neural network based embeddings have shown impressive results in such conditions. Good speaker embeddings require both small intra-class variation and large inter-class difference, which is critical for discrimination and generalization. Current embedding learning strategies fall into two frameworks: "cascade embedding learning" with multiple stages and "direct embedding learning" from spectral features directly. We propose new approaches to achieve more discriminative speaker embeddings. Within the cascade framework, a neural network based deep discriminant analysis (DDA) is proposed to project i-vectors to more discriminative embeddings. Within the direct embedding framework, a deep model with the more advanced center loss and A-softmax loss is used, and the focal loss is also investigated in this framework. Moreover, the traditional i-vector and neural embeddings are finally combined with neural network based DDA to achieve further gains. The main experiments are carried out on a short-duration text-independent speaker verification dataset generated from the SRE corpus. The results show that the newly proposed methods are promising for short-duration text-independent speaker verification and are consistently better than the traditional i-vector and neural embedding baselines. The best embeddings achieve roughly 30% relative EER reduction compared to the i-vector baseline, which can be further enhanced when combined with the i-vector system.
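The center loss and focal loss mentioned above can be sketched as follows. The shapes and class setup are illustrative, and the A-softmax margin machinery is omitted; center loss penalizes distance to a per-class center (shrinking intra-class variation), while focal loss down-weights easy examples:

```python
import numpy as np

def center_loss(emb, labels, centers):
    """Center loss: 0.5 * mean squared distance from each embedding to
    its class center. Softmax/A-softmax handles inter-class separation;
    this term shrinks intra-class variation."""
    diff = emb - centers[labels]
    return 0.5 * np.mean(np.sum(diff ** 2, axis=1))

def focal_loss(p_true, gamma=2.0):
    """Focal loss on the probability assigned to the true class:
    -(1 - p)^gamma * log(p), so confident (easy) examples contribute
    little and training focuses on hard ones."""
    p = np.clip(p_true, 1e-12, 1.0)
    return np.mean(-((1.0 - p) ** gamma) * np.log(p))

centers = np.zeros((2, 3))
labels = np.array([0, 1])
emb = np.ones((2, 3))
print(center_loss(emb, labels, centers))  # 1.5
print(focal_loss(np.array([1.0])))        # 0.0 (perfectly confident)
```

In practice the centers are updated alongside the network and the center term is added, with a small weight, to the classification loss.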
End-to-end spoofing detection with raw waveform CLDNNS Dinkel, Heinrich; Nanxin Chen; Yanmin Qian ...
2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
2017-March
Conference Proceeding
Open access
Although recent progress in speaker verification has produced powerful models, malicious attacks in the form of spoofed speech are generally not coped with. Recent results in the ASVspoof2015 and BTAS2016 challenges indicate that spoof-aware features are a possible solution to this problem. Most successful methods in both challenges focus on spoof-aware features rather than on a powerful classifier. In this paper we present a novel raw waveform based deep model for spoofing detection, which jointly acts as a feature extractor and classifier, allowing it to classify speech signals directly. This approach can be considered an end-to-end classifier, which removes the need for any pre- or post-processing of the data, making training and evaluation a streamlined process that consumes less time than other neural-network based approaches. Experiments on the BTAS2016 dataset show that system performance is significantly improved by the proposed raw waveform convolutional long short-term memory deep neural network (CLDNN), from the previous best published 1.26% half total error rate (HTER) to the current 0.82% HTER. Moreover, the proposed system also performs well under the unknown (RE-PH2-PH3, RE-LPPH2-PH3) conditions.
Although great progress has been made in automatic speech recognition, significant performance degradation still exists in noisy environments. Our previous work has demonstrated the superior noise robustness of very deep convolutional neural networks (VDCNNs). Based on that work, this paper proposes a more advanced model referred to as the very deep convolutional residual network (VDCRN). This new model incorporates batch normalization and residual learning, showing more robustness than the previous VDCNNs. Then, to alleviate the mismatch between training and testing conditions, model adaptation and adaptive training are developed and compared for the new VDCRN. This paper focuses on factor-aware training (FAT) and cluster adaptive training (CAT). For FAT, a unified framework is explored. For CAT, two schemes are first explored to construct the bases in the canonical model; furthermore, a factorized version of CAT is designed to address multiple nonspeech variabilities in one model. Finally, a complete multipass system is proposed to achieve the best system performance in noisy scenarios. The proposed new approaches are evaluated on three different tasks: Aurora4 (simulated data with additive noise and channel distortion), CHiME4 (both simulated and real data with additive noise and reverberation), and the AMI meeting transcription task (real data with significant reverberation). The evaluation not only includes different noisy conditions, but also covers both simulated and real noisy data. The experiments show that the new VDCRN is more robust, and adaptation of this model can further significantly reduce the word error rate (WER). The proposed best architecture obtains consistent and very large improvements on all tasks compared to the baseline VDCNN or long short-term memory. In particular, on Aurora4 a new milestone of 5.67% WER is achieved by improving the acoustic model alone.
This paper describes the AISpeech-SJTU system for the accent identification track of the Interspeech-2020 Accented English Speech Recognition Challenge. In this challenge track, only 160 hours of accented English data collected from 8 countries and the auxiliary Librispeech dataset are provided for training. To build an accurate and robust accent identification system, we explore the whole system pipeline in detail. First, we introduce the ASR-based phone posteriorgram (PPG) feature to accent identification and verify its efficacy. Then, a novel TTS-based approach is carefully designed to augment the very limited accent training data for the first time. Finally, we propose test-time augmentation and embedding fusion schemes to further improve system performance. Our final system is ranked first in the challenge and outperforms all the other participants by a large margin. The submitted system achieves 83.63% average accuracy on the challenge evaluation data, ahead of the others by more than 10% in absolute terms.
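Generic versions of the test-time augmentation and embedding fusion steps can be sketched as follows. Averaging class posteriors over augmented copies and concatenating L2-normalized embeddings are assumed schemes for illustration; the abstract does not spell out the exact fusion used:

```python
import numpy as np

def tta_predict(posteriors):
    """Test-time augmentation: average class posteriors over several
    augmented copies (e.g. perturbed versions) of the same utterance."""
    return np.mean(np.stack(posteriors), axis=0)

def fuse_embeddings(embs):
    """One simple fusion scheme: L2-normalize each system's embedding
    so no system dominates by scale, then concatenate."""
    return np.concatenate([e / np.linalg.norm(e) for e in embs])

# Two augmented copies of one utterance, two accent classes:
print(tta_predict([[0.9, 0.1], [0.7, 0.3]]))  # [0.8 0.2]
```

The fused vector would then feed a back-end classifier trained on the combined representation.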
Spoofing detection for automatic speaker verification (ASV) aims to discriminate between genuine and spoofed speech. This topic has received increased attention recently due to safety concerns with deploying ASV systems. While the performance of spoofing detection has improved significantly in clean conditions in recent studies, it degrades dramatically in noisy conditions. To address this issue, in this paper we propose to extract robust and discriminative deep features using deep learning techniques for spoofing detection. In particular, we employ deep feedforward, recurrent, and convolutional neural networks to extract discriminative features. We also introduce multicondition training, noise-aware training, and annealed dropout training to make the neural networks more robust against noise and to avoid overfitting to specific spoofing attacks and noise types. The proposed neural networks and training techniques are combined into a single framework for spoofing detection. Experimental evaluation is carried out on a noisy version of the standard ASVspoof 2015 corpus, including both additive noise and reverberant scenarios. The experimental results confirm that the proposed system dramatically decreases the averaged equal error rates from 19.1% and 22.6% to 3.2% and 5.1% for seen and unseen noisy conditions, respectively.
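Multicondition training of the kind described above relies on synthesizing noisy copies of the training data at controlled signal-to-noise ratios. A minimal sketch of the additive-noise mixing step (the reverberant case would convolve with a room impulse response instead):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db,
    then add it to the clean signal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz (stand-in)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10)
```

Training on pools of such mixtures over several noise types and SNRs is what makes the learned deep features robust to conditions unseen at test time.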
Speech enhancement has been extensively studied and applied in the fields of automatic speech recognition (ASR), speaker recognition, etc. With the advances of deep learning, attempts to apply deep neural networks (DNNs) to speech enhancement have achieved remarkable results, and the quality of enhanced speech has been greatly improved. In this study, we propose a two-stage model for single-channel speech enhancement. The model has two DNNs with the same architecture. In the first stage, only the first DNN is trained. In the second stage, the second DNN is trained to refine the enhanced output of the first DNN, while the first DNN is frozen. A multi-frame filter is introduced to help the second DNN reduce the distortion of the enhanced speech. Experimental results on both synthetic and real datasets show that the proposed model outperforms other enhancement models not only in terms of speech enhancement evaluation metrics and word error rate (WER), but also in its superior generalization ability. The results of the ablation experiments also demonstrate that combining the two-stage model with the multi-frame filter yields better enhancement performance and less distortion.
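The multi-frame filter can be sketched as a weighted sum over a sliding context of frames. The context size and weights below are illustrative stand-ins, not the filter from the paper, and the sketch operates on a generic time-by-frequency feature matrix:

```python
import numpy as np

def multi_frame_filter(frames, weights):
    """Weighted sum over a sliding window of frames: output frame t is
    sum_k weights[k] * frames[t + k - ctx//2], with edge padding at the
    boundaries. Smoothing across frames is what helps the second-stage
    DNN suppress frame-local distortion."""
    ctx = len(weights)
    pad = ctx // 2
    padded = np.pad(frames, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(frames, dtype=float)
    for k, w in enumerate(weights):
        out += w * padded[k:k + len(frames)]
    return out

frames = np.arange(12.0).reshape(4, 3)       # 4 frames, 3 freq bins
identity = multi_frame_filter(frames, [0.0, 1.0, 0.0])  # passes through
smoothed = multi_frame_filter(frames, [1/3, 1/3, 1/3])  # 3-frame average
```

In the two-stage setup, the second DNN would see this filtered version of the first stage's output alongside (or instead of) the raw enhanced frames.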
Phone Synchronous Speech Recognition With CTC Lattices Chen, Zhehuai; Zhuang, Yimeng; Qian, Yanmin ...
IEEE/ACM transactions on audio, speech, and language processing,
2017-Jan., Volume: 25, Issue: 1
Journal Article
Peer reviewed
Connectionist temporal classification (CTC) has recently shown improved performance and efficiency in automatic speech recognition. One popular decoding implementation is to use a CTC model to predict the phone posteriors at each frame and then perform Viterbi beam search on a modified WFST network. This is still within the traditional frame-synchronous decoding framework. In this paper, the peaky posterior property of CTC is carefully investigated, and it is found that ignoring blank frames does not introduce additional search errors. Based on this phenomenon, a novel phone-synchronous decoding framework is proposed that removes the tremendous search redundancy due to blank frames, resulting in a significant search speed-up. The framework naturally leads to an extremely compact phone-level acoustic space representation: the CTC lattice. With CTC lattices, efficient and effective modular speech recognition approaches, second-pass rescoring for large vocabulary continuous speech recognition (LVCSR), and phone-based keyword spotting (KWS) are also proposed in this paper. Experiments showed that phone-synchronous decoding achieves a 3-4 times search speed-up without performance degradation compared to frame-synchronous decoding. Modular LVCSR with the CTC lattice achieves further WER improvement. KWS with the CTC lattice not only achieved significant equal error rate improvement, but also greatly reduced the KWS model size and increased the search speed.
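The blank-skipping idea behind phone-synchronous decoding can be sketched on a greedy 1-best path. The 0.5 blank threshold is an illustrative assumption, and a real system builds a lattice over the surviving frames and searches it with a WFST rather than taking a single argmax path:

```python
import numpy as np

def phone_synchronous_path(posteriors, blank=0, threshold=0.5):
    """Drop frames dominated by the blank symbol, take the argmax phone
    on each surviving frame, and collapse consecutive repeats. Because
    CTC posteriors are peaky, most frames are blank, so the search
    space shrinks to a short phone-level sequence."""
    keep = posteriors[:, blank] < threshold
    labels = posteriors[keep].argmax(axis=1)
    out = []
    for label in labels:
        if not out or out[-1] != label:
            out.append(int(label))
    return out

# 5 frames, symbols {0: blank, 1, 2}; frames 0 and 3 are blank-dominated.
post = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
    [0.95, 0.02, 0.03],
    [0.10, 0.80, 0.10],
])
print(phone_synchronous_path(post))  # [2, 1]
```

Note one simplification: with the blanks removed, this sketch cannot represent a genuine repeated phone separated by a blank, which a real CTC lattice preserves.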