While we are capable of modeling the shape of humanoid robots, e.g. the face and arms, in a nearly natural or humanlike way, it is much more difficult to generate human-like facial or body movements and human-like behavior such as speaking and co-speech gesturing. In this paper, a developmental robotics approach to learning to speak is argued for. On the basis of the current literature, a blueprint of a brain model for this kind of robot is outlined, and preliminary scenarios for knowledge acquisition are described. Furthermore, it is illustrated that natural speech acquisition mainly results from learning during face-to-face communication, and it is argued that learning to speak should likewise be based on human-robot face-to-face communication, in which the human acts as a caretaker or teacher and the robot as a speech-acquiring toddler. This is a fruitful basic scenario not only for learning to speak, but also for learning to communicate in general, including the production of co-verbal manual gestures and co-verbal facial expressions.
This paper reviews interactive methods for improving the phonetic competence of subjects in second language learning as well as in speech therapy for subjects with hearing impairments or articulation disorders. As an example, our audiovisual feedback software “SpeechTrainer”, which improves the pronunciation quality of Standard German by visually highlighting acoustic and articulatory sound features, is introduced. Results from the literature on training methods, as well as the results concerning our own software, indicate that audiovisual tools for phonetic and articulatory visualization are beneficial in computer-aided pronunciation training environments.
Besides the recognition of audible speech, there is currently an increasing interest in the recognition of silent speech, which has a range of novel applications. A major obstacle to the widespread adoption of silent-speech technology is the lack of measurement methods for speech movements that are convenient, non-invasive, portable, and robust at the same time. Therefore, as an alternative to established methods, we examined to what extent different phonemes can be discriminated from the electromagnetic transmission and reflection properties of the vocal tract. To this end, we attached two Vivaldi antennas on the cheek and below the chin of two subjects. While the subjects produced 25 phonemes in multiple phonetic contexts each, we measured the electromagnetic transmission spectra from one antenna to the other, and the reflection spectra for each antenna (radar), in a frequency band from 2 to 12 GHz. Two classification methods (k-nearest neighbors and linear discriminant analysis) were trained to predict the phoneme identity from the spectral data. With linear discriminant analysis, cross-validated phoneme recognition rates of 93% and 85% were achieved for the two subjects. Although these results are speaker- and session-dependent, they suggest that electromagnetic transmission and reflection measurements of the vocal tract have great potential for future silent-speech interfaces.
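As a rough illustration of the classification step described above, the following sketch cross-validates the two named classifiers on spectral feature vectors. The file names, feature layout, number of folds, and the choice of k are assumptions for illustration, not details from the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: one row per phoneme production, columns = concatenated transmission
# and reflection spectra (2-12 GHz bins); y: phoneme labels (25 classes).
# Both file names are hypothetical.
X = np.load("spectra.npy")
y = np.load("phoneme_labels.npy")

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("kNN", KNeighborsClassifier(n_neighbors=5))]:
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=10)
    print(f"{name}: {scores.mean():.1%} cross-validated accuracy")
```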
• Computational simulation of vocal learning can be achieved without explicit speaker normalisation.
• A deep-learning-based speech recogniser provides better auditory guidance for learning articulatory targets than acoustic features.
• Perception-guided vocal practice, rather than phonetic imitation, is therefore the likely strategy of vocal learning.
• Coarticulatory dynamics are essential for learning CV syllables.
• Using open-vocabulary dictation to evaluate model performance sets a new standard for vocal learning modelling.
It has long been a mystery how children learn to speak without formal instruction. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has encountered difficulty in overcoming the speaker normalisation problem, namely, discrepancies between children’s vocalisations and those of adults due to age-related anatomical differences. Here we show that vocal learning can be successfully simulated via recognition-guided vocal exploration without explicit speaker normalisation. We trained an articulatory synthesiser with three-dimensional vocal tract models of one adult and two child configurations of different ages to learn monosyllabic English words consisting of CVC syllables, based on coarticulatory dynamics and two kinds of auditory feedback: (i) acoustic features to simulate universal phonetic perception (or direct imitation), and (ii) a deep-learning-based speech recogniser to simulate native-language phonological perception. Native listeners were invited to evaluate the learned synthetic speech with natural speech as a baseline reference. The results show that the English words trained with the speech recogniser were more intelligible than those trained with acoustic features, sometimes close to natural speech. The successful simulation of vocal learning in this study suggests that a combination of coarticulatory dynamics and native-language phonological perception may also be critical for real-life vocal production learning.
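The core of such recognition-guided exploration can be sketched as a simple loop that perturbs articulatory parameters and keeps whatever the recogniser scores as closer to the target word. The function names `synthesize` and `recognition_loss`, and all numeric settings, are placeholders standing in for the paper's synthesiser and recogniser, not its actual implementation.

```python
import numpy as np

def vocal_exploration(synthesize, recognition_loss, target_word,
                      n_params=30, iters=200, sigma=0.05, seed=0):
    """Random-search vocal exploration guided only by a speech
    recogniser, with no explicit speaker normalisation step."""
    rng = np.random.default_rng(seed)
    params = rng.uniform(-1.0, 1.0, n_params)   # initial articulation
    best = recognition_loss(synthesize(params), target_word)
    for _ in range(iters):
        candidate = params + sigma * rng.standard_normal(n_params)
        loss = recognition_loss(synthesize(candidate), target_word)
        if loss < best:   # keep perturbations the recogniser prefers
            params, best = candidate, loss
    return params, best
```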
In music analysis, one of the most fundamental tasks is note onset detection, i.e. detecting the beginning of new note events. Since onset detection is closely related to tasks such as beat tracking and tempo estimation, it forms the basis for these tasks. Furthermore, it can help to improve Automatic Music Transcription (AMT). Typically, different approaches to onset detection follow a similar outline: an audio signal is transformed into an Onset Detection Function (ODF), which should have rather low values (i.e. close to zero) most of the time, but pronounced peaks at onset times, which can then be extracted by applying peak-picking algorithms to the ODF. In recent years, several kinds of neural networks have been used successfully to compute the ODF from feature vectors. Currently, Convolutional Neural Networks (CNNs) define the state of the art. In this paper, we build on an alternative approach that obtains the ODF with Echo State Networks (ESNs), which have achieved results comparable to CNNs in several tasks, such as speech and image recognition. In contrast to the typical iterative training procedures of deep learning architectures, such as CNNs or networks consisting of Long Short-Term Memory cells (LSTMs), in ESNs only a very small part of the weights is easily trained in one shot using linear regression. By comparing the performance of several feature extraction methods and pre-processing steps, and by introducing a new way to stack ESNs, we extend our previous approach to achieve results that fall between those of a bidirectional LSTM network and a CNN, with relative improvements of 1.8% and −1.4%, respectively. For the evaluation, we used exactly the same 8-fold cross-validation setup as for the reference results.
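A minimal ESN sketch illustrating the one-shot training that distinguishes ESNs from iteratively trained networks: a fixed random reservoir is driven by the input feature frames, and only a linear readout mapping reservoir states to the ODF is fitted, here by ridge regression. Reservoir size, spectral radius, leak rate, and regularization strength are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_reservoir(features, n_res=500, rho=0.9, leak=0.3):
    """Drive a fixed random reservoir with feature frames (T x n_in)
    and collect the reservoir states (T x n_res)."""
    n_in = features.shape[1]
    W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))  # set spectral radius
    x = np.zeros(n_res)
    states = []
    for u in features:  # leaky-integrator update, one frame at a time
        x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
        states.append(x.copy())
    return np.array(states)

def train_readout(states, odf_target, ridge=1e-4):
    """One-shot readout training: ridge regression from reservoir
    states to the target ODF (1 at onset frames, 0 elsewhere)."""
    S = states
    return np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]),
                           S.T @ odf_target)
```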
Articulation-to-speech synthesis based solely on supraglottal articulation requires some sort of intonation control. This paper examines to what extent the f0 contour of an utterance can be predicted from such supraglottal articulation data. To that end, three groups of machine learning models (support vector machines, kernel ridge regression, and neural networks) were trained and evaluated on the mngu0 speech corpus, which contains synchronous articulatory and audio data. The best voiced/unvoiced/silence classification rates were achieved by a deep neural network with two hidden layers: 85.8% with no look-ahead (important for on-line applications) and 86% with a look-ahead of 50 ms. The best f0 prediction model without look-ahead achieved a root-mean-square error (RMSE) of 10.4 Hz relative to the original f0 contours, using a neural network with one hidden layer, while the best prediction with a look-ahead of 50 ms was attained by kernel ridge regression, with an RMSE of 10.3 Hz. The predicted f0 contours were also subjectively evaluated in a listening test by manipulating the f0 of the original speech files using Praat. The results are consistent with the objective evaluation.
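The kernel ridge regression variant of this setup can be sketched as follows; the file names, kernel, and hyperparameters are placeholders for illustration and would need tuning on the actual corpus.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error

# Articulatory feature frames (X) and reference f0 values in Hz (y),
# voiced frames only; the .npy file names are hypothetical.
X_train, y_train = np.load("ema_train.npy"), np.load("f0_train.npy")
X_test, y_test = np.load("ema_test.npy"), np.load("f0_test.npy")

krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1)
krr.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_test, krr.predict(X_test)))
print(f"f0 RMSE: {rmse:.1f} Hz")
```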
This paper presents the semi-automatically created Corpus of Aligned Read Speech Including Annotations (CARInA), a speech corpus based on the German Spoken Wikipedia Corpus (GSWC). CARInA tokenizes, consolidates and organizes the vast but rather unstructured material contained in GSWC. The contents are grouped by annotation completeness and extended by canonic, morphosyntactic and prosodic annotations. The annotations are provided in BPF and TextGrid format. The corpus contains 194 hours of speech material from 327 speakers, of which 124 hours are fully phonetically aligned and 30 hours are fully aligned at all annotation levels. CARInA is freely available, designed to grow and improve over time, and suitable for large-scale speech analyses or machine learning tasks, as illustrated by two examples in this paper.
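For readers who want to work with the TextGrid annotations, a small access sketch follows, assuming the third-party `textgrid` Python package; the file name and the phonetic tier name are assumptions, not documented properties of CARInA.

```python
import textgrid  # third-party package: pip install textgrid

tg = textgrid.TextGrid.fromFile("article_0001.TextGrid")  # hypothetical file
phones = tg.getFirst("MAU")  # phonetic tier; tier name is an assumption
for interval in phones:
    print(interval.minTime, interval.maxTime, interval.mark)
```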
Articulatory speech synthesis based on aero-acoustic simulations of the vocal tract is computationally expensive and therefore requires simple yet precise models. Modeling the one-dimensional vocal tract area function directly, instead of a higher-dimensional vocal tract model, is an efficient way to minimize the computational overhead of the simulations. In this paper, we propose a new parametric vocal tract model that is controlled by six points and capable of modeling a large variety of vocal tract shapes. We evaluated the model geometrically and perceptually on a set of 22 reference area functions corresponding to German vowels and consonants. The model was able to geometrically approximate the reference area functions with a minimum root-mean-square error of 0.302 cm², a maximum error of 1.142 cm², and a median error of 0.891 cm². After optimizations, a perceptual evaluation of the synthesis using our model in combination with a state-of-the-art aero-acoustic simulation achieved a vowel recognition rate of 90.7% and a consonant recognition rate of 73.2%.
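The idea of a six-point parametric area function can be sketched as follows: six (position, area) control points are interpolated to a full area function and compared to a reference by RMSE. The use of monotone cubic interpolation here is an assumption for illustration; the paper's model may construct the shape differently.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def area_function(control_points, tract_length_cm=17.5, n_sections=40):
    """Interpolate a (6, 2) array of (position_cm, area_cm2) control
    points to a full area function; positions must be increasing."""
    pos, area = control_points[:, 0], control_points[:, 1]
    x = np.linspace(0.0, tract_length_cm, n_sections)
    return PchipInterpolator(pos, area)(x)

def fit_error(control_points, reference_area):
    """RMSE (in cm^2) between the model and a reference area function."""
    model = area_function(control_points, n_sections=len(reference_area))
    return np.sqrt(np.mean((model - reference_area) ** 2))
```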
This study compared the f0 of 14 German vowels in monosyllabic words (/dVt/) embedded in carrier sentences, produced by 30 native speakers and 30 Mandarin Chinese learners. Appropriate techniques were employed to robustly measure f0 values and reliably analyze f0 profiles. The results showed that Mandarin learners produced the vowels bearing sentence stress with significantly larger f0 ranges and steeper f0 slopes, but comparable f0 mean and maximum, in comparison to German natives. Moreover, lax vowels produced by both groups showed narrower f0 ranges with faster f0 changes than tense vowels, an effect that was stronger for Mandarin learners.
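The f0 profile measures compared in such a study can be sketched with the Praat bindings in `parselmouth`; using this package, and fitting a straight line for the slope, are assumptions about one possible extraction pipeline, not the study's own.

```python
import numpy as np
import parselmouth  # Praat bindings: pip install praat-parselmouth

def f0_profile(wav_path, t_start, t_end):
    """Mean, maximum, range and linear slope of the f0 contour
    within a vowel interval [t_start, t_end] (seconds)."""
    snd = parselmouth.Sound(wav_path).extract_part(t_start, t_end)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    t = pitch.xs()
    voiced = f0 > 0          # Praat marks unvoiced frames with 0 Hz
    f0, t = f0[voiced], t[voiced]
    slope = np.polyfit(t, f0, 1)[0]   # Hz per second
    return {"mean": f0.mean(), "max": f0.max(),
            "range": f0.max() - f0.min(), "slope": slope}
```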
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. To this end, a novel approach to artificial vocal learning is presented that utilizes deep neural network-based phoneme recognition to calculate the speech acquisition objective function. This function guides a learning framework that involves the state-of-the-art articulatory speech synthesizer VocalTractLab as the motor-to-acoustic forward model. In this way, an extensive set of German phonemes, including most of the consonants and all stressed vowels, was produced successfully. The synthetic phonemes were rated as highly intelligible by human listeners. Furthermore, it is shown that visual speech information, such as lip and jaw movements, can be extracted from video recordings and incorporated into the learning framework as an additional loss component during the optimization process. It was observed that this visual loss did not increase the overall intelligibility of the phonemes. Instead, it acted as a regularization mechanism that facilitated the finding of more biologically plausible solutions in the articulatory domain.
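The structure of such a combined objective can be sketched as a weighted sum of a recognition loss and a visual loss; `synthesize`, `recognizer_loss`, `visual_loss`, and the weight `lam` are placeholder names standing in for the framework's components, not its actual API.

```python
def acquisition_objective(artic_params, target_phoneme, target_lip_jaw,
                          synthesize, recognizer_loss, visual_loss,
                          lam=0.1):
    """Recognition-driven objective with a visual regularization term."""
    audio, artic_trajectories = synthesize(artic_params)
    loss = recognizer_loss(audio, target_phoneme)  # drives intelligibility
    # The visual term does not raise intelligibility, but biases the
    # search toward biologically plausible lip/jaw movements.
    return loss + lam * visual_loss(artic_trajectories, target_lip_jaw)
```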