Oral narrative abilities are an important measure of children's language competency and have predictive value for children's later academic performance. Research and development underway in New Zealand is advancing an innovative online oral narrative task. This task uses audio recordings of children's story retells, speech-to-text software and language analysis to record, transcribe, analyse and present oral narrative and listening comprehension data back to class teachers. The task has been designed for class teachers' use, with the support of speech-language pathologist (SLP) or literacy specialists in data interpretation. Teachers are upskilled and supported to interpret these data and implement teaching practices for students through online professional learning and development modules, within the context of a broader evidence-based approach to early literacy instruction. This article describes the development of this innovative, culturally relevant, online tool for monitoring children's oral narrative ability and listening comprehension in their first year of school. Three phases of development are outlined, showing the progression of the tool from a researcher-administered task during controlled research trials to wide-scale implementation with thousands of students throughout New Zealand. The current iteration of the tool uses an automatic speech-recognition system with specifically trained transcription models, with support from research assistants to check transcriptions and then code and analyse the oral narratives. This reduces transcription and analysis time to ~7 min, with a word error rate of around 20%. Future development aims to increase the accuracy of automatic transcription and embed basic language analysis into the tool, with the goal of removing the need for support from research assistants.
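The ~20% word error rate quoted above is the standard ASR metric: the word-level edit distance between a reference transcript and the system's hypothesis, divided by the reference length. A minimal sketch of the computation (not the tool's actual implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution,      # substitute (or match)
                           dp[i - 1][j] + 1,  # delete a reference word
                           dp[i][j - 1] + 1)  # insert a hypothesis word
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion over 6 words
```

A WER of 0.20 thus means roughly one in five reference words was substituted, deleted, or required an insertion to match.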
Speech recognition and its potential applications in "talking devices" have become indispensable in today's world. Technological advances such as mobile phones, smart home assistants and tablets make extensive use of automatic speech recognition, which works well for adults but cannot always follow and understand children's speech. The primary goal of this paper is to bridge the communication gap between voice assistants and Indian children who speak English as a second language. The lack of children's speech corpora in English as a non-native language is addressed by creating a dataset of children aged 5-15 years who speak Hindi or Marathi as their mother tongue and English as their second language. The analysis and implementation of the proposed work show an accuracy of approximately 96%, with potential for further improvement by increasing the size of the dataset in the lower age groups. The key contributions of our work are (i) creating a speech dataset of Indian children whose mother tongue is Hindi or Marathi, (ii) employing and evaluating a hybrid Convolutional Neural Network (CNN) as an age classifier, (iii) language modelling to customize the children's vocabulary, and (iv) checking the accuracy and performance of the system.
This paper presents several acoustic analyses carried out on read speech collected from Italian children aged from 7 to 13 years and North American children aged from 5 to 17 years. These analyses aimed at achieving a better understanding of spectral and temporal changes in speech produced by children of various ages in view of the development of automatic speech recognition applications. The results of these analyses confirm and complement the results reported in the literature, showing that characteristics of children's speech change with age and that spectral and temporal variability decrease as age increases. In fact, younger children show a substantially higher intra- and inter-speaker variability with respect to older children and adults. We investigated the use of several methods for speaker adaptive acoustic modeling to cope with inter-speaker spectral variability and to improve recognition performance for children. These methods proved to be effective in recognition of read speech with a vocabulary of about 11k words.
This article studies the acoustic production of Central Catalan rhotics in a group of 90 boys and girls aged 5 to 7 years, in medial and final coda. An acoustic description is given by analyzing the number and type of components of both the closure and opening phases. Considering the combinations of different types of components, up to 78 different types of rhotics have been detected. The results show that most rhotic sounds have one or two components, and that rhotics with five or more components occur almost always at the end of a word. In terms of acoustic properties, there is variability, with more vocalic components in the medial coda and more fricative or occlusive components in the final coda.
• Perceiving natural poetic sentences with strong prosodic regularities requires a lower brain workload but wider functional networks with long-range connections compared to non-poetic speech.
• Long-reaching hubs are elicited by integrated sentence-level speech perception in the right hemisphere.
• Poetic speech promotes the auditory perception-to-production circuit in children, an effect that decreases with age.
• The positive correlation between neural sensitivity to poetic speech and behavioural speech ability indicates its role in facilitating early speech development.
Natural poetic speech (i.e., proverbs, nursery rhymes, and commercial ads) with strong prosodic regularities is easily memorized by children, and its harmonious acoustic patterns are suggested to facilitate integrated sentence processing. Do children have specific neural pathways for perceiving such poetic utterances, and does their speech development benefit from them? We recorded the task-induced hemodynamic changes of 94 children aged 2 to 12 years using functional near-infrared spectroscopy (fNIRS) while they listened to poetic and non-poetic natural sentences. Seventy-three adults were recruited as controls to investigate the developmental specificity of the child group. The results indicated that perceiving poetic sentences is a highly integrated process, featuring a lower brain workload in both groups. However, an early activated large-scale network was induced only in the child group, coordinated by hubs for connectivity diversity. Additionally, poetic speech evoked activation in the phonological encoding regions in the child group but not in adult controls, an effect that decreased with children's age. The neural responses to poetic speech were positively linked to children's speech communication performance, especially its fluency and semantic aspects. These results reveal children's neural sensitivity to integrated speech perception, which facilitates early speech development by strengthening more sophisticated language networks and the perception-production circuit.
Disfluency detection and classification on children's speech have great potential for teaching reading skills. Word-level assessment of children's speech can help teachers effectively gauge their students' progress. Hence, we propose a novel attention-based model to perform word-level disfluency detection and classification in a fully end-to-end (E2E) manner, making it fast and easy to use. We develop a word-level disfluency annotation scheme, which we use to annotate a dataset of children's read speech, the reading races dataset (READR). We also annotate disfluencies in the existing CMU Kids corpus. The proposed model significantly outperforms traditional cascaded baselines, which use forced alignments, on both datasets. To deal with the inevitable class imbalance in the datasets, we propose a novel technique called HiDeC (Hierarchical Detection and Classification), which yields detection improvements of 23% and 16% and classification improvements of 3.8% and 19.3% in relative F1-score on the READR and CMU Kids datasets, respectively.
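The class-imbalance rationale behind a hierarchical detection-and-classification scheme can be illustrated by its two-stage control flow: a binary detector first flags disfluent words, and only flagged words reach the type classifier, so the classifier never has to compete with the dominant fluent class. The sketch below uses toy rule-based scorers as stand-ins; the paper's actual HiDeC operates inside an attention-based end-to-end network, not on surface word forms.

```python
def hidec(words, detect, classify, threshold=0.5):
    """Two-stage word labelling: stage 1 flags disfluent words (binary
    detection); stage 2 assigns a disfluency class only to flagged words."""
    labels = []
    for w in words:
        if detect(w) >= threshold:      # stage 1: disfluency detection
            labels.append(classify(w))  # stage 2: type classification
        else:
            labels.append("fluent")
    return labels

# Toy scorers for illustration only: treat a trailing "-" as a repetition
# fragment and a "~" as a prolongation marker.
detect = lambda w: 1.0 if w.endswith("-") or "~" in w else 0.0
classify = lambda w: "repetition" if w.endswith("-") else "prolongation"

print(hidec(["the", "c-", "cat", "sa~t"], detect, classify))
# -> ['fluent', 'repetition', 'fluent', 'prolongation']
```

Because fluent words are filtered out at stage 1, the stage-2 classifier trains and predicts over a far more balanced label distribution.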
The primary motive of this study is to develop an automatic speech recognition (ASR) system using a limited amount of speech data such that it is least affected by speaker-dependent acoustic variations. The two factors contributing towards inter-speaker variability that are focused upon in this work are pitch and speaking-rate variations. In order to simulate such a limited data scenario, an ASR system is trained on adults' speech and tested using speech data from adult as well as child speakers. Compared to the adults' speech test case, the recognition rates are noted to be extremely degraded when the test speech is from child speakers. The observed degradation is due to large differences in pitch and speaking rate between adults' and children's speech, along with other factors leading to inter-speaker acoustic variations. To overcome the mismatch in pitch and speaking rate, two different approaches are proposed in this paper. In the first approach, the pitch and speaking rate of the children's speech test set are explicitly modified using a recently proposed prosody modification technique that exploits fuzzy classification of spectral bins. In the second approach, the pitch and speaking rate of the training data are modified to create new versions of the data. In order to capture greater acoustic variability, the original and the modified versions are then pooled together. The ASR system trained on the augmented data is noted to be more robust towards pitch and speaking-rate variations. Consequently, relative improvements of 17% and 31% over the baseline are obtained on decoding the adults' and children's speech test sets, respectively.
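The second approach, pooling original and prosody-modified training data, can be sketched as follows. Note that the paper's prosody modification alters pitch and speaking rate independently via fuzzy classification of spectral bins; the naive linear-interpolation time-scaling below is only a crude stand-in used to show the pooling scheme, and the scale factors are hypothetical.

```python
import numpy as np

def time_scale(signal: np.ndarray, factor: float) -> np.ndarray:
    """Naively resample a waveform to play `factor` times faster
    (crude stand-in for proper prosody modification)."""
    n_out = max(int(round(len(signal) / factor)), 1)
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_idx, np.arange(len(signal)), signal)

def augment_pool(train_set, factors=(0.85, 1.0, 1.15)):
    """Pool original utterances (factor 1.0) with modified versions,
    so the acoustic model sees greater pitch/rate variability."""
    return [time_scale(utt, f) for utt in train_set for f in factors]

# Two synthetic "utterances" stand in for adult training speech.
utts = [np.sin(np.linspace(0, 20, 1600)), np.sin(np.linspace(0, 35, 2400))]
pooled = augment_pool(utts)
print(len(pooled))  # 6 = 2 utterances x 3 factors
```

The key point is that the original data is retained in the pool: augmentation adds variability rather than replacing the matched training condition.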
Searching for words of interest in a speech sequence is referred to as keyword spotting (KWS). A myriad of techniques have been proposed over the years for effectively spotting keywords in adults' speech. However, not much work has been reported on KWS for children's speech. The speech data for adult and child speakers differ significantly due to physiological differences between the two groups of speakers. Consequently, the performance of a KWS system trained on adults' speech degrades severely when used by children due to the acoustic mismatch. In this paper, we present our efforts towards improving the performance of keyword spotting systems for children's speech under a limited data scenario. In this regard, we have explored prosody modification in order to reduce the acoustic mismatch resulting from the differences in pitch and speaking rate. The prosody modification technique explored in this paper is one based on glottal closure instant (GCI) events. The approach based on zero-frequency filtering (ZFF) is used to compute the GCI locations. Further, we have presented two different ways of effectively applying prosody modification. In the first case, prosody modification is applied to the children's speech test set prior to the decoding step in order to improve the recognition performance. Alternatively, we have also applied prosody modification to the training data from adult speakers. The original as well as the prosody-modified adults' speech data are then augmented together before learning the statistical parameters of the KWS system. The experimental evaluations presented in this paper show that significantly improved performance for children's speech is obtained by both of the aforementioned approaches to applying prosody modification. Prosody-modification-based data augmentation helps in improving the performance with respect to adults' speech as well.
Data augmentation is a technique that enhances the size and quality of training data such that deep learning or machine learning models can achieve better performance. This paper proposes a novel way of applying data augmentation for child speech recognition in the low data resource scenario. Data augmentation is achieved by modifying existing adult speech signals. The procedure consists of two main parts, resampling and time scaling. The experiment involves both speech from children aged from kindergarten to grade 10 and adults' speech. We test the proposed method using both a TDNN-HMM and a GMM-HMM acoustic model. The results show that the proposed data augmentation scheme achieves a relative 7.95% reduction in WER, compared with a 4.56% relative reduction when using a traditional bilinear frequency-warping approach.
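The relative reductions quoted here (7.95% vs. 4.56%) follow the usual ASR convention of measuring improvement as a fraction of the baseline error rate, not as an absolute difference in WER points. The baseline figure below is hypothetical, chosen only to illustrate the arithmetic:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative WER reduction, expressed as a fraction of the baseline WER."""
    return (baseline - improved) / baseline

# Hypothetical WERs for illustration (not the paper's actual numbers):
# a baseline WER of 40% dropping to 36.82% is a 7.95% *relative* reduction.
print(round(relative_reduction(0.40, 0.3682), 4))
```

So a "7.95% relative reduction" corresponds to a much smaller absolute drop in WER points, which is worth keeping in mind when comparing the two augmentation schemes.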
This paper presents a novel system which utilizes acoustic, phonological, morphosyntactic, and prosodic information for binary automatic dialect detection of African American English. We train this system using adult speech data and then evaluate it on both children's and adults' speech under unmatched training and testing scenarios. The proposed system combines novel and state-of-the-art architectures, including a multi-source transformer language model pre-trained on Twitter text data and fine-tuned on ASR transcripts, as well as an LSTM acoustic model trained on self-supervised learning representations, in order to learn a comprehensive view of dialect. We show robust, explainable performance across recording conditions for different features on adult speech, but fusing multiple features is important for good results on children's speech.