A central challenge for articulatory speech synthesis is the simulation of realistic articulatory movements, which is critical for the generation of highly natural and intelligible speech. This includes modeling coarticulation, i.e., the context-dependent variation of the articulatory and acoustic realization of phonemes, especially of consonants. Here we propose a method to simulate the context-sensitive articulation of consonants in consonant-vowel syllables. To achieve this, the vocal tract target shape of a consonant in the context of a given vowel is derived as the weighted average of three measured and acoustically optimized reference vocal tract shapes for that consonant in the context of the corner vowels /a/, /i/, and /u/. The weights are determined by mapping the target shape of the given context vowel into the vowel subspace spanned by the corner vowels. The model was applied to the synthesis of consonant-vowel syllables with the consonants /b/, /d/, /g/, /l/, /r/, /m/, and /n/ in all combinations with the eight long German vowels. In a perception test, the mean recognition rate for the consonants in the isolated syllables was 82.4%. This demonstrates the potential of the approach for highly intelligible articulatory speech synthesis.
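The weighting scheme described in the abstract can be sketched as follows. This is a minimal NumPy illustration: the two-parameter toy "shapes", their numeric values, and the function names are ours, not the paper's; a real vocal tract shape would be a much longer parameter vector.

```python
import numpy as np

def context_weights(v, corner_shapes):
    """Weights (summing to 1) that best express the context-vowel shape `v`
    in the subspace spanned by the corner-vowel shapes for /a/, /i/, /u/."""
    A = np.column_stack(corner_shapes)            # (n_params, 3)
    A_aug = np.vstack([A, np.ones((1, 3))])       # enforce w_a + w_i + w_u = 1
    v_aug = np.append(v, 1.0)
    w, *_ = np.linalg.lstsq(A_aug, v_aug, rcond=None)
    return w

def consonant_target(weights, consonant_refs):
    """Context-dependent consonant target: weighted average of the three
    measured reference shapes of that consonant in /a/, /i/, /u/ context."""
    return sum(w * c for w, c in zip(weights, consonant_refs))

# Toy two-parameter "shapes" (hypothetical numbers, for illustration only).
a, i, u = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.4, 0.3])
v = 0.2 * a + 0.3 * i + 0.5 * u                   # some context vowel, e.g. /e/
w = context_weights(v, [a, i, u])                 # -> approx. [0.2, 0.3, 0.5]

# Reference shapes of one consonant in /a/, /i/, /u/ context (again toy values).
c_a, c_i, c_u = np.array([2.0, 1.0]), np.array([1.0, 2.0]), np.array([1.5, 1.5])
target = consonant_target(w, [c_a, c_i, c_u])
```

The sum-to-one constraint (the appended row of ones) keeps the result an interpolation of the three references, so a context vowel close to a corner vowel yields a consonant target close to the corresponding measured reference.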
Full text
Available for:
DOBA, IZUM, KILJ, NUK, PILJ, PNG, SAZU, SIK, UILJ, UKNU, UL, UM, UPUK
In this study, 23 subjects produced cyclic transitions between rounded and unrounded vowels, as in /o-i-o-i-o-…/, at two specific speaking rates. Rounded vowels are typically produced with a lower larynx position than unrounded vowels. This contrast in vertical larynx position was further amplified by producing the unrounded vowels with a higher pitch than the rounded vowels. The vertical larynx movements of each subject were measured by means of object tracking in laryngeal ultrasound videos. The results indicate that larynx lowering was on average 26% faster than larynx raising, and that this velocity difference was more pronounced in women than in men. Possible reasons for this are discussed with a focus on specific biomechanical properties. The results can help to interpret vertical larynx movements with regard to underlying neural control and aerodynamic conditions, and to improve movement models for articulatory speech synthesis.
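A minimal sketch of how such a velocity asymmetry could be quantified from a tracked position trace (a hypothetical helper for illustration, not the ultrasound tracking pipeline of the study; the synthetic trace below simply has a faster descent than ascent):

```python
import numpy as np

def peak_vertical_velocities(position, fs):
    """Peak lowering and raising speed (position units per second) from a
    vertical larynx-position trace, where larger values mean a higher larynx."""
    v = np.gradient(position) * fs                # finite-difference velocity
    peak_lowering = -v.min() if v.min() < 0 else 0.0
    peak_raising = v.max() if v.max() > 0 else 0.0
    return peak_lowering, peak_raising

# Synthetic half cycle: fast descent (10 samples), slower ascent (20 samples),
# sampled at a 100 Hz frame rate.
trace = np.r_[np.linspace(5.0, 0.0, 10), np.linspace(0.0, 5.0, 20)]
lowering, raising = peak_vertical_velocities(trace, fs=100)
```

Averaging such per-cycle peak (or mean) speeds over many cycles and subjects would give the kind of lowering/raising comparison the abstract reports.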
Purpose: Psychoacoustical studies on transmission characteristics related to bone-conducted (BC) speech, perceived by speakers during vocalization, are important for further understanding the relationship between speech production and perception, especially auditory feedback. For exploring how the outer ear contributes to BC speech transmission, this article aims to measure the transmission characteristics of bone conduction, focusing on the vibration of the regio temporalis (RT) and sound radiation in the ear canal (EC) due to excitation in the oral cavity (OC). Method: While an excitation signal was presented through a loudspeaker located in the enclosed cavity below the hard palate, transmitted signals were measured on the RT and in the EC. The transfer functions of the RT vibration and EC sound pressure relative to OC sound pressure were determined from the measured signals using the sweep-sine method. Results: Our findings obtained from the measurements of five participants are as follows: (a) the transfer function of the RT vibration relative to the OC sound pressure attenuated the frequency components above 1 kHz and (b) the transfer function of the EC relative to the OC sound pressure emphasized the frequency components between 2 and 3 kHz. Conclusions: The vibration of the soft tissue or the skull bone acts as a low-pass filter, whereas the sound radiation in the EC acts as a 2-3 kHz band-pass filter. Considering the perceptual effect of low-pass filtering in BC speech, our findings suggest that the transmission to the outer ear may not be a dominant contributor to BC speech perception during vocalization.
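The transfer-function estimation can be sketched as a regularized spectral division of the two recorded sweep responses. This is a simplified stand-in for the paper's sweep-sine deconvolution, with made-up signals; function and variable names are ours.

```python
import numpy as np

def transfer_function(x_oc, y_meas, fs, eps=1e-12):
    """H(f) of a measured response (RT vibration or EC sound pressure)
    relative to the oral-cavity excitation, via regularized spectral
    division of the recorded sweeps."""
    X = np.fft.rfft(x_oc)
    Y = np.fft.rfft(y_meas)
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)   # ~ Y/X where X has energy
    f = np.fft.rfftfreq(len(x_oc), 1.0 / fs)
    return f, H

# 1 s linear sweep from 100 Hz to 4 kHz as a stand-in excitation,
# and a measured response that is simply the excitation scaled by 0.35.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * (100 * t + 1950 * t ** 2))
f_axis, H = transfer_function(x, 0.35 * x, fs)
```

In the swept band, |H| recovers the imposed gain of 0.35; outside it, the regularization term `eps` keeps the estimate bounded where the excitation has no energy.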
Recently, 3D printing has been increasingly used to create physical models of the vocal tract with geometries obtained from magnetic resonance imaging. These printed models allow measuring the vocal tract transfer function, which cannot be reliably measured in vivo. The transfer functions enable the detailed examination of the acoustic effects of specific articulatory strategies in speaking and singing, and the validation of acoustic plane-wave models for realistic vocal tract geometries in articulatory speech synthesis. To measure the acoustic transfer function of 3D-printed models, two techniques have been described: (1) excitation of the models with a broadband sound source at the glottis and measurement of the sound pressure radiated from the lips, and (2) excitation of the models with an external source in front of the lips and measurement of the sound pressure inside the models at the glottal end. The former method is more frequently used and more intuitive due to its similarity to speech production. However, the latter method avoids the intricate problem of constructing a suitable broadband glottal source and is therefore more effective. It has been shown to yield a transfer function similar, but not exactly equal, to the volume velocity transfer function between the glottis and the lips, which is usually used to characterize vocal tract acoustics. Here, we revisit this method and show, both theoretically and experimentally, how it can be extended to yield the precise volume velocity transfer function of the vocal tract.
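For context, the volume velocity transfer function mentioned above can be computed for a given area function with a standard lossless plane-wave (chain-matrix) concatenated-tube model. This is a textbook simplification for illustration, not the measurement method of the paper; parameter values are illustrative.

```python
import numpy as np

def volume_velocity_tf(areas, lengths, f, c=350.0, rho=1.14):
    """U_lips / U_glottis of a lossless concatenated-tube model at
    frequency f (Hz), with an ideal open end (p = 0) at the lips.
    areas in m^2, lengths in m."""
    k = 2 * np.pi * f / c                          # wavenumber
    M = np.eye(2, dtype=complex)                   # [p, U]_glottis = M @ [p, U]_lips
    for A, L in zip(areas, lengths):
        Z = rho * c / A                            # characteristic impedance
        M = M @ np.array([[np.cos(k * L), 1j * Z * np.sin(k * L)],
                          [1j * np.sin(k * L) / Z, np.cos(k * L)]])
    # With p_lips = 0: U_glottis = M[1, 1] * U_lips, so H = 1 / M[1, 1].
    return 1.0 / M[1, 1]

# Uniform 17.5 cm tube: quarter-wave resonances at odd multiples of c/4L = 500 Hz.
H_low = volume_velocity_tf([4e-4], [0.175], f=100.0)
H_res = volume_velocity_tf([4e-4], [0.175], f=499.0)
```

For the uniform tube, |H| stays near 1 at low frequencies and grows without bound (in this lossless idealization) as f approaches the 500 Hz resonance.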
Recovering speech in the absence of the acoustic speech signal itself, i.e., silent speech, holds great potential for restoring or enhancing oral communication in people who have lost it. Radar is a relatively unexplored silent speech sensing modality, even though it has the advantage of being fully non-invasive. We therefore built a custom stepped-frequency continuous-wave radar hardware to measure the changes in the transmission spectra during speech between three antennas, located on both cheeks and the chin, with a measurement update rate of 100 Hz. We then recorded a command word corpus of 40 phonetically balanced, two-syllable German words and the German digits zero to nine for two individual speakers and evaluated both the speaker-dependent multi-session and inter-session recognition accuracies on this 50-word corpus using a bidirectional long short-term memory network. We obtained recognition accuracies of 99.17% and 88.87% for the speaker-dependent multi-session and inter-session accuracy, respectively. These results show that the transmission spectra are very well suited to discriminate individual words from one another, even across different sessions, which is one of the key challenges for fully non-invasive silent speech interfaces.
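A bidirectional LSTM word classifier of the kind described can be sketched with a forward-only NumPy implementation. All dimensions, initializations, and names below are illustrative, not those of the study; a trained model would additionally need a loss and an optimizer.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _lstm(x, Wx, Wh, b, reverse=False):
    """One LSTM direction. Wx: (F, 4H), Wh: (H, 4H); gate order i, f, g, o.
    Returns the final hidden state."""
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    for t in (reversed(range(len(x))) if reverse else range(len(x))):
        i, f, g, o = np.split(x[t] @ Wx + h @ Wh + b, 4)
        c = _sigmoid(f) * c + _sigmoid(i) * np.tanh(g)
        h = _sigmoid(o) * np.tanh(c)
    return h

def bilstm_word_logits(x, fwd, bwd, Wout, bout):
    """Concatenate both directions' final states and project to one score
    per vocabulary word."""
    h = np.concatenate([_lstm(x, *fwd), _lstm(x, *bwd, reverse=True)])
    return h @ Wout + bout

rng = np.random.default_rng(0)
F, H, V = 16, 8, 50                 # spectral features, hidden units, 50 words

def init(F, H):
    return (0.1 * rng.standard_normal((F, 4 * H)),
            0.1 * rng.standard_normal((H, 4 * H)),
            np.zeros(4 * H))

x = rng.standard_normal((100, F))   # 1 s of frames at the 100 Hz update rate
logits = bilstm_word_logits(x, init(F, H), init(F, H),
                            0.1 * rng.standard_normal((2 * H, V)), np.zeros(V))
```

The predicted word would be `np.argmax(logits)`; in practice, one frame would hold the transmission-spectrum features of all three antenna pairs.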
Speech recognition based on articulatory movements instead of the acoustic signal is of growing interest in the community. In this work, we present the results of a study using a novel measurement technology called Electro-Optical Stomatography to capture speech movements and use the acquired data to recognize a number of command words. The performance of the recognition system was evaluated using two vocabularies (one with 30 and one with 10 words) and four speakers. The speaker-dependent results matched the state of the art, with average word accuracies of 97% to 99.5%, while the speaker-independent results exceeded it, with average word accuracies of approx. 56% to 62%.
Voice, as a secondary sexual characteristic, is known to affect the perceived attractiveness of human individuals. But the underlying mechanism of vocal attractiveness has remained unclear. Here, we presented human listeners with acoustically altered natural sentences and fully synthetic sentences with systematically manipulated pitch, formants and voice quality based on a principle of body size projection reported for animal calls and emotional human vocal expressions. The results show that male listeners preferred a female voice that signals a small body size, with relatively high pitch, wide formant dispersion and breathy voice, while female listeners preferred a male voice that signals a large body size with low pitch and narrow formant dispersion. Interestingly, however, male vocal attractiveness was also enhanced by breathiness, which presumably softened the aggressiveness associated with a large body size. These results, together with the additional finding that the same vocal dimensions also affect emotion judgment, indicate that humans still employ a vocal interaction strategy used in animal calls despite the development of complex language.
The human voice is a directional sound source. This property has been explored for more than 200 years, mainly using measurements of human participants. Some efforts have been made to understand the anatomical parameters that influence speech directivity, e.g., the mouth opening, diffraction and reflections due to the head and torso, the lips and the vocal tract. However, these parameters have mostly been studied separately, without being integrated into a complete model or replica. The aim of this work was to study the combined influence of the torso, the lips and the vocal tract geometry on speech directivity. For this purpose, a simplified head and torso simulator was built; this simulator made it possible to vary these parameters independently. It consisted of two spheres representing the head and the torso into which vocal tract replicas with or without lips could be inserted. The directivity patterns were measured in an anechoic room with a turntable and a microphone that could be placed at different angular positions. Different effects such as torso diffraction and reflections, the correlation of the mouth dimensions with directionality, the higher-order modes and the increase in directionality due to the lips were confirmed and further documented. Interactions between the different parameters were found. It was observed that torso diffraction and reflections were enhanced by the presence of the lips, that they could be modified or masked by the effect of higher-order modes and that the lips tend to attenuate the effect of higher-order modes.
Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bidirectional long short-term memory recurrent neural network based on a small set of representative seed samples. Using a seeding set from VocalTractLab, a larger training set was generated that provided richer contextual variations for the model to learn. The deep learning model for acoustic-to-target mapping was then trained to model the inverse relation of the articulation process. This method allows the trained model to map the given acoustic data onto the articulatory target parameters, which can then be used to identify the target distribution based on linguistic contexts. The model was evaluated in terms of its effectiveness in mapping acoustics to articulation, and in terms of the perceptual accuracy of speech resynthesized from the articulatory targets estimated from recordings of native Thai speakers. The model achieved more than 80% phoneme classification accuracy in the listening test conducted with 25 native Thai speakers. The results indicate that the model can accurately imitate speech with a high degree of phonemic precision.
Articulatory synthesis is based on modeling various physical phenomena of speech production, including sound radiation from the mouth. With regard to sound radiation, the most common approach is to approximate it in terms of a simple spherical source of strength equal to the mouth volume velocity. However, because this approximation is only valid at very low frequencies and does not account for the diffraction by the head and the torso, we simulated two alternative radiation characteristics that are potentially more realistic: the radiation from a vibrating piston in a spherical baffle, and the radiation from the mouth of a detailed model of the human head and torso. Using the articulatory speech synthesizer VocalTractLab, a corpus of 10 sentences was synthesized with the different radiation characteristics combined with three different phonation types. The synthesized sentences were acoustically compared with natural recordings of the same sentences in terms of their long-term average spectra (LTAS), and evaluated in terms of their naturalness and intelligibility. The intelligibility was not affected by the type of radiation characteristic. However, it was found that the more similar their LTAS was to real speech, the more natural the synthetic sentences were perceived to be. Hence, the naturalness was not directly determined by the realism of the radiation characteristic, but by the combined spectral effect of the radiation characteristic and the voice source. While the more realistic radiation models do not per se improve synthesis quality, they provide new insights into the study of speech production and articulatory synthesis.
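The LTAS comparison mentioned above can be sketched as follows (assumed framing parameters, not the paper's exact analysis settings):

```python
import numpy as np

def ltas_db(x, fs, win=1024, hop=512):
    """Long-term average spectrum: power spectra of Hann-windowed frames,
    averaged over the whole signal, in dB re an arbitrary reference."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    psd = np.mean([np.abs(np.fft.rfft(fr)) ** 2 for fr in frames], axis=0)
    return np.fft.rfftfreq(win, 1.0 / fs), 10 * np.log10(psd + 1e-12)

# Sanity check on a 1 kHz tone: the LTAS should peak at 1 kHz.
fs = 16000
t = np.arange(fs) / fs
f_axis, db = ltas_db(np.sin(2 * np.pi * 1000 * t), fs)
```

Two utterances (e.g. a synthetic sentence and its natural recording) could then be compared by, say, the mean absolute dB difference of their LTAS curves over the speech band.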