Speaking skills generally receive little attention in traditional English as a Foreign Language (EFL) classrooms, and this is especially the case in secondary education in Indonesia. A vocabulary deficit and poor pronunciation skills hinder learners in their efforts to improve speaking proficiency. In the present study, we investigated the effects of using two language learning websites, I Love Indonesia (ILI) and NovoLearning (NOVO). These websites are equipped with Automatic Speech Recognition (ASR) technology, with each website providing different types of immediate feedback. We measured written receptive and productive vocabulary knowledge of 40 target words before and after the intervention, in which 146 students practiced with these two ASR-based websites, and compared it to that of a control group (n = 86). The ASR-based websites successfully helped students enhance their receptive vocabulary. Twenty-four students participated in a spoken pre- and post-test pronouncing the same 40 target words. We developed an approach to measure pronunciation skills, which showed that the treatment groups outperformed the control group. Our results indicate that this technology is successful in improving vocabulary and pronunciation skills.
The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase in attention in science and industry, which both drove and was driven by an equally significant improvement in recognition accuracy. Meanwhile, the technology has entered the consumer market, with digital home assistants featuring a spoken language interface as its most prominent application. Speech recorded at a distance is affected by various acoustic distortions, and consequently, quite different processing pipelines have emerged compared with ASR for close-talk speech. A signal enhancement front end for dereverberation, source separation, and acoustic beamforming is employed to clean up the speech, and the back-end ASR engine is robustified by multicondition training and adaptation. We also describe the so-called end-to-end approach to ASR, a promising new architecture that has recently been extended to the far-field scenario. This tutorial article gives an account of the algorithms used to enable accurate speech recognition from a distance, and it will be seen that, although deep learning has a significant share in the technological breakthroughs, a clever combination with traditional signal processing can lead to surprisingly effective solutions.
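To make the beamforming step in such a front end concrete, here is a minimal delay-and-sum beamformer in NumPy. It is only a sketch of the general technique; the microphone geometry, sampling rate, and look direction in the toy usage are illustrative assumptions, not values taken from the article.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, direction, fs, c=343.0):
    """Minimal delay-and-sum beamformer.

    signals:       (num_mics, num_samples) multichannel recording
    mic_positions: (num_mics, 3) microphone coordinates in metres
    direction:     unit vector pointing from the array toward the source
    fs:            sampling rate in Hz
    c:             speed of sound in m/s
    """
    num_mics, num_samples = signals.shape
    # Relative delay to apply to each channel so the wavefronts line up:
    # microphones closer to the source received the signal earlier.
    delays = mic_positions @ direction / c          # seconds, shape (num_mics,)
    delays -= delays.min()                          # keep all delays non-negative
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Delay each channel via a phase shift in the frequency domain...
    aligned = spectra * np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    # ...and average the aligned channels.
    return np.fft.irfft(aligned.mean(axis=0), n=num_samples)

# Toy usage: a 4-microphone linear array with 5 cm spacing, broadside look direction.
fs = 16000
mics = np.array([[i * 0.05, 0.0, 0.0] for i in range(4)])
look_dir = np.array([0.0, 1.0, 0.0])
noisy = np.random.randn(4, fs)   # stand-in for one second of multichannel audio
enhanced = delay_and_sum(noisy, mics, look_dir, fs)
```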
Speaker diarization is the task of labeling audio or video recordings with classes that correspond to speaker identity, or, in short, the task of identifying "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker-adaptive processing. Over time, these algorithms also gained value as a standalone application, providing speaker-specific metainformation for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practice across speech application domains, rapid advancements have been made in speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way toward jointly modeling these two components so that they complement each other. By consolidating recent developments in neural methods in light of these trends, we believe this survey offers a valuable contribution to the community and will facilitate further progress toward more efficient speaker diarization.
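As a minimal illustration of the classical modular pipeline such surveys cover (segmentation, speaker-embedding extraction, clustering), the Python sketch below clusters fixed-length windows with scikit-learn. Here `embed_segment` is a crude stand-in for a real speaker-embedding extractor (e.g., an x-vector network), and the window sizes and two-speaker assumption are for illustration only.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def embed_segment(samples):
    """Placeholder speaker-embedding extractor (a real system would use, e.g.,
    an x-vector network); a crude log-spectral summary keeps the sketch runnable."""
    spectrum = np.abs(np.fft.rfft(samples))
    return np.log1p(spectrum[:128])

def diarize(audio, fs, num_speakers, win_s=1.5, hop_s=0.75):
    """Label each window of `audio` with a speaker index: 'who spoke when'."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    starts = range(0, max(len(audio) - win, 1), hop)
    embeddings = np.stack([embed_segment(audio[s:s + win]) for s in starts])
    labels = AgglomerativeClustering(n_clusters=num_speakers).fit_predict(embeddings)
    # Return (start_time, end_time, speaker_label) triples.
    return [(s / fs, (s + win) / fs, int(lab)) for s, lab in zip(starts, labels)]

# Toy usage on one minute of synthetic audio with an assumed two speakers.
fs = 16000
audio = np.random.randn(60 * fs)
for start, end, spk in diarize(audio, fs, num_speakers=2)[:5]:
    print(f"{start:6.2f}-{end:6.2f}s  speaker {spk}")
```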
•The latest trends and approaches to speaker diarization as part of speech interaction applications.
•Overview of the development of speaker diarization in the era of deep learning.
•Review of diarization techniques belonging to the proposed taxonomy.
•Introduction of techniques used in traditional, modular speaker diarization systems.
•Recent advancements in joint training approaches and fully end-to-end models.
•A perspective on how speaker diarization has been investigated in the context of ASR.
•Review of the challenges, the future of speaker diarization, and its applications.
Speech processing for under-resourced languages is an active field of research, which has experienced significant progress during the past decade. In this paper, we propose a survey that focuses on automatic speech recognition (ASR) for these languages. Under-resourced languages are first defined, along with the challenges associated with them. The main part of the paper is a literature review of the recent (last eight years) contributions made in ASR for under-resourced languages. Examples of past projects and future trends when dealing with under-resourced languages are also presented. We believe that this paper will be a good starting point for anyone interested in initiating research in (or operational development of) ASR for one or several under-resourced languages. It should be clear, however, that many of the issues and approaches presented here apply to speech technology in general (text-to-speech synthesis, for instance).
Conventional automatic speech recognition (ASR) based on a hidden Markov model (HMM)/deep neural network (DNN) is a very complicated system consisting of various modules such as acoustic, lexicon, and language models. It also requires linguistic resources, such as a pronunciation dictionary, tokenization, and phonetic context-dependency trees. On the other hand, end-to-end ASR has become a popular alternative that greatly simplifies the model-building process of conventional ASR systems by representing complicated modules with a single deep network architecture, and by replacing the use of linguistic resources with a data-driven learning method. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, whereas connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ a multiobjective learning framework to improve robustness and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Experiments with English (WSJ and CHiME-4) tasks demonstrate the effectiveness of the proposed multiobjective learning over both the CTC and attention-based encoder-decoder baselines. Moreover, the proposed method is applied to two large-scale ASR benchmarks (spontaneous Japanese and Mandarin Chinese) and exhibits performance comparable to conventional DNN/HMM ASR systems, thanks to the advantages of both multiobjective learning and joint decoding, without relying on linguistic resources.
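The multiobjective training objective interpolates the CTC and attention losses with a weight λ. The PyTorch sketch below shows one way such a combined loss can be written; the weight of 0.3, the tensor shapes, and the toy inputs are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(ctc_log_probs, input_lengths,
                              decoder_logits, targets, target_lengths,
                              blank_id=0, ctc_weight=0.3):
    """Multiobjective loss: L = lambda * L_CTC + (1 - lambda) * L_attention.

    ctc_log_probs:  (T, B, V) log-probabilities from the encoder's CTC head
    input_lengths:  (B,) encoder output lengths
    decoder_logits: (B, L, V) attention decoder outputs (teacher forcing)
    targets:        (B, L) padded reference symbol ids
    target_lengths: (B,) true reference lengths
    """
    ctc_loss = F.ctc_loss(ctc_log_probs, targets, input_lengths, target_lengths,
                          blank=blank_id, zero_infinity=True)
    # Attention (cross-entropy) loss over the same references, ignoring padding.
    pad_mask = torch.arange(targets.size(1))[None, :] < target_lengths[:, None]
    att_targets = targets.masked_fill(~pad_mask, -100)   # -100 = ignore_index
    att_loss = F.cross_entropy(decoder_logits.transpose(1, 2), att_targets,
                               ignore_index=-100)
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss

# Toy shapes: batch of 2, 50 encoder frames, 10 target symbols, vocabulary of 30.
T, B, L, V = 50, 2, 10, 30
ctc_lp = torch.randn(T, B, V).log_softmax(dim=-1)
dec_logits = torch.randn(B, L, V)
tgt = torch.randint(1, V, (B, L))
loss = hybrid_ctc_attention_loss(ctc_lp, torch.full((B,), T), dec_logits,
                                 tgt, torch.tensor([10, 7]))
```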
Willingness to communicate (WTC) is considered to be an important factor contributing to successful foreign language learning, and many studies aim at finding effective tools for enhancing it. With the support of AI and Automatic Speech Recognition technology, intelligent personal assistants (IPAs) appear to have potential for improving foreign language learners' WTC. However, few empirical studies focus on the possible impact of IPAs on learners' WTC. This study investigated the potential of an IPA, Google Assistant, for developing adolescent EFL learners' WTC and their perceptions of IPAs for EFL learning. The study recruited 112 eighth-grade EFL learners who engaged in Google Assistant language-learning activities for two weeks. Two WTC questionnaires were administered at the beginning and end of the intervention. The results demonstrated that Google Assistant significantly promoted EFL learners' WTC, enhanced communicative confidence, and reduced speaking anxiety. Analyses of interviews revealed that participants enjoyed playing games with Google Assistant and talking to chatbots, which helped them feel less anxious and more motivated to use English for real and meaningful communication. The findings indicate that IPA-based interaction provided a less threatening environment, in which learners displayed higher levels of engagement, motivation, confidence, and, in turn, WTC in the target language.
Communication is an integral part of our day-to-day lives. People experiencing difficulty in speaking or hearing often feel neglected in our society. While automatic speech recognition systems have now progressed to the point of being commercially viable, sign language recognition systems are still in the early stages, and currently such interpretation is carried out by humans. Here, we present an ensemble architecture for the classification of sign language characters. The novel ensemble of InceptionV3 and ResNet101 achieved an accuracy of 97.24% on the ASL dataset.
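One straightforward way to ensemble the two backbones is to average their class probabilities. The hedged sketch below does this with torchvision's InceptionV3 and ResNet101 (torchvision 0.13+ API); the 29-class output size and the probability-averaging fusion rule are assumptions made for illustration, since the abstract does not specify the fusion scheme.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 29  # assumed: ASL letters plus a few control symbols

class SignEnsemble(nn.Module):
    """Average the class probabilities of InceptionV3 and ResNet101."""

    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.inception = models.inception_v3(weights=None)
        self.inception.fc = nn.Linear(self.inception.fc.in_features, num_classes)
        self.resnet = models.resnet101(weights=None)
        self.resnet.fc = nn.Linear(self.resnet.fc.in_features, num_classes)

    def forward(self, x):
        # InceptionV3 expects 299x299 inputs; in eval mode it returns plain logits.
        p1 = torch.softmax(self.inception(x), dim=1)
        p2 = torch.softmax(self.resnet(x), dim=1)
        return (p1 + p2) / 2.0

# Toy usage: classify one dummy RGB image.
model = SignEnsemble().eval()
with torch.no_grad():
    probs = model(torch.randn(1, 3, 299, 299))
predicted_class = probs.argmax(dim=1)
```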
This paper presents our latest investigations on different features for factored language models for Code-Switching speech and their effect on automatic speech recognition (ASR) performance. We focus on syntactic and semantic features which can be extracted from Code-Switching text data and integrate them into factored language models. Different possible factors, such as words, part-of-speech tags, Brown word clusters, open-class words, and clusters of open-class word embeddings, are explored. The experimental results reveal that Brown word clusters, part-of-speech tags, and open-class words are the most effective at reducing the perplexity of factored language models on the Mandarin-English Code-Switching corpus SEAME. In ASR experiments, the model containing Brown word clusters and part-of-speech tags and the model that also includes clusters of open-class word embeddings yield the best mixed error rate results. In summary, the best language model can significantly reduce the perplexity on the SEAME evaluation set by up to 10.8% relative and the mixed error rate by up to 3.4% relative.
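To illustrate what a factored representation buys, the toy sketch below treats each token as a bundle of factors (word, part-of-speech tag, Brown cluster id) and backs off from the word history to coarser factors when the word is unseen. It is a greatly simplified bigram-style illustration with invented example data, not the factored language models actually trained on SEAME.

```python
from collections import Counter, defaultdict

# Each token is a factor bundle: (word, POS tag, Brown cluster id).
# The sentences and cluster ids are invented toy data for illustration.
corpus = [
    [("I", "PRP", "c12"), ("want", "VB", "c7"), ("to", "TO", "c3"), ("makan", "VB", "c7")],
    [("I", "PRP", "c12"), ("like", "VB", "c7"), ("to", "TO", "c3"), ("belajar", "VB", "c7")],
]

# Count next-word occurrences conditioned on each factor of the previous token.
counts = {"word": defaultdict(Counter), "pos": defaultdict(Counter),
          "cluster": defaultdict(Counter)}
for sent in corpus:
    for (w, p, c), (nxt, _, _) in zip(sent, sent[1:]):
        counts["word"][w][nxt] += 1
        counts["pos"][p][nxt] += 1
        counts["cluster"][c][nxt] += 1

def factored_prob(prev, nxt):
    """P(nxt | prev) with backoff over factors: word -> cluster -> POS."""
    w, p, c = prev
    for factor, key in (("word", w), ("cluster", c), ("pos", p)):
        hist = counts[factor][key]
        if hist:  # back off to a coarser factor when the finer one is unseen
            return hist[nxt] / sum(hist.values())
    return 0.0  # a real model would smooth instead of returning zero

print(factored_prob(("want", "VB", "c7"), "to"))  # seen word history
print(factored_prob(("need", "VB", "c7"), "to"))  # unseen word, backs off to its cluster
```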
We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained on large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training, and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8-billion-parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve on SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset size, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representations of pre-trained networks to achieve SoTA results on non-ASR tasks.
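The self-training ingredient mentioned above can be outlined as a pseudo-labeling loop: the pre-trained model transcribes unlabeled audio, confident hypotheses are kept as new training targets, and the model is fine-tuned on the enlarged set. The sketch below is a framework-agnostic outline; `transcribe`, `fine_tune`, and the confidence threshold are placeholder hooks standing in for a real ASR training stack, not the paper's actual setup.

```python
def self_train(model, labeled, unlabeled, transcribe, fine_tune,
               rounds=3, confidence_threshold=0.9):
    """Pseudo-labeling loop for self-training an ASR model.

    transcribe(model, audio) -> (text, confidence) and
    fine_tune(model, pairs) -> model are caller-supplied hooks; both are
    placeholders for a real recognizer and trainer.
    """
    train_set = list(labeled)
    for _ in range(rounds):
        pseudo = []
        for audio in unlabeled:
            text, confidence = transcribe(model, audio)
            if confidence >= confidence_threshold:
                pseudo.append((audio, text))   # keep only confident hypotheses
        model = fine_tune(model, train_set + pseudo)
    return model

# Toy usage with trivial stand-ins, just to show how the pieces fit together.
dummy_model = {"steps": 0}
labeled = [("audio_0", "hello world")]
unlabeled = ["audio_1", "audio_2"]
transcribe = lambda m, a: ("hello", 0.95)                    # always confident
fine_tune = lambda m, data: {"steps": m["steps"] + len(data)}
trained = self_train(dummy_model, labeled, unlabeled, transcribe, fine_tune)
```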
Speech technology has not been appropriately explored, even though modern advances in the field, especially those driven by deep learning (DL), offer unprecedented opportunities for transforming the healthcare industry. In this paper, we focus on the enormous potential of speech technology for revolutionising the healthcare domain. More specifically, we review the state-of-the-art approaches in automatic speech recognition (ASR), speech synthesis or text to speech (TTS), and health detection and monitoring using speech signals. We also present a comprehensive overview of the various challenges hindering the growth of speech-based services in healthcare. To make speech-based healthcare solutions more prevalent, we discuss open issues and suggest some possible research directions aimed at fully leveraging the advantages of other technologies to make speech-based healthcare solutions more effective.