•We introduce a wavelet-based representation system for speech prosody.
•Emergent hierarchy from f0, intensity and duration.
•Prominences and boundaries are represented in one framework.
•System allows for efficient analysis and annotation of prosodic events.
•The unsupervised prosodic labelling scheme is comparable with supervised methods.
Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide the means to chunk the speech stream into linguistically relevant units by assigning them relative saliences and demarcating them within utterance structures. Prominences and boundaries have been widely used both in basic research on prosody and in text-to-speech synthesis. However, there are no representation schemes that allow both to be estimated and modelled in a unified fashion. Here we present an unsupervised, unified account for estimating and representing prosodic prominences and boundaries using a scale-space analysis based on the continuous wavelet transform. The methods are evaluated and compared to earlier work using the Boston University Radio News corpus. The results show that the proposed method is comparable with the best published supervised annotation methods.
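As a rough illustration of the scale-space idea, the following minimal sketch applies a continuous wavelet transform to a synthetic prosodic contour and reads prominence candidates off the peaks, and boundary candidates off the troughs, of the scale-summed response. It is not the authors' implementation: the Mexican-hat wavelet, the scale set, the 10% peak threshold, the helper names, and the synthetic contour are all assumptions made for this example.

```python
import numpy as np

def mexican_hat(n_points, scale):
    """Sampled Mexican-hat (Ricker) wavelet of n_points samples, centred."""
    t = (np.arange(n_points) - (n_points - 1) / 2.0) / scale
    return (1.0 - t**2) * np.exp(-t**2 / 2.0) / np.sqrt(scale)

def cwt(signal, scales):
    """Continuous wavelet transform by direct convolution, one row per scale."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for scale in scales:
        n_points = int(min(10 * scale, len(signal)))
        if n_points % 2 == 0:
            n_points -= 1  # odd length keeps the kernel centred
        kernel = mexican_hat(n_points, scale)
        rows.append(np.convolve(signal, kernel, mode="same"))
    return np.vstack(rows)

def local_extrema(x, kind="max"):
    """Indices of strict local maxima (or minima) of a 1-D array."""
    y = x if kind == "max" else -x
    inner = (y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])
    return np.flatnonzero(inner) + 1

# Hypothetical contour (assumption): two bumps standing in for a combined
# f0/energy/duration signal, a weaker one at frame 50, a stronger one at 150.
frames = np.arange(200)
contour = (0.6 * np.exp(-((frames - 50) ** 2) / (2 * 8.0**2))
           + 1.0 * np.exp(-((frames - 150) ** 2) / (2 * 8.0**2)))

scales = [4, 8, 16]                       # assumed scale set
summed = cwt(contour, scales).sum(axis=0)  # collapse the scale axis

peaks = local_extrema(summed, "max")
prominences = peaks[summed[peaks] > 0.1 * summed.max()]  # prominence candidates
troughs = local_extrema(summed, "min")                   # boundary candidates
```

Summing across scales is only one way to collapse the scalogram; a hierarchical analysis would instead track how maxima and minima persist from fine to coarse scales.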
Prosodic features are important in achieving intelligibility, comprehensibility, and fluency in a second or foreign language (L2). However, research on the assessment of prosody as part of oral proficiency remains scarce. Moreover, the acoustic analysis of L2 prosody has often focused on fluency-related temporal measures, neglecting language-dependent stress features that can be quantified in terms of syllable prominence. Introducing the evaluation of prominence-related measures can be of use in developing both teaching and assessment of L2 speaking skills. In this study we compare temporal measures and syllable prominence estimates as predictors of prosodic proficiency in non-native speakers of English with respect to the speaker’s native language (L1).
The predictive power of temporal and prominence measures was evaluated for utterance-sized samples produced by language learners from four different L1 backgrounds: Czech, Slovak, Polish, and Hungarian. Firstly, the speech samples were assessed using the revised Common European Framework of Reference scale for prosodic features. The assessed speech samples were then analyzed to derive articulation rate and three fluency measures. Syllable-level prominence was estimated by a continuous wavelet transform analysis using combinations of F0, energy, and syllable duration.
The results show that the temporal measures serve as reliable predictors of prosodic proficiency in the L2, with prominence measures providing a small but significant improvement to prosodic proficiency predictions. The predictive power of the individual measures varies both quantitatively and qualitatively depending on the L1 of the speaker. We conclude that the possible effects of the speaker’s L1 on the production of L2 prosody, in terms of temporal features as well as syllable prominence, deserve more attention in applied research and in the development of teaching and assessment methods for spoken L2.
Finnmark North Sámi is a variety of the North Sámi language, an indigenous, endangered minority language spoken in the northernmost parts of Norway and Finland. The speakers of this language are bilingual, and regularly speak the majority language (Finnish or Norwegian) as well as their own North Sámi variety. In this paper we investigate possible influences of these majority languages on prosodic characteristics of Finnmark North Sámi, and associate them with prosodic patterns prevalent in the majority languages. We present a novel methodology that: (a) automatically finds the portions of speech (words) where the prosodic differences based on majority languages are most robustly manifested; and (b) analyzes the nature of these differences in terms of intonational patterns. For the first step, we trained convolutional WaveNet speech synthesis models on North Sámi speech material, modified to contain purely prosodic information, and used conditioning embeddings to find words with the greatest differences between the varieties. The subsequent exploratory analysis suggests that the differences in intonational patterns between the two Finnmark North Sámi varieties are not manifested uniformly across word types (based on part-of-speech category). Instead, we argue that the differences reflect phrase-level prosodic characteristics of the majority languages.
•A new wavelet-based method for assessing prosodic proficiency in L2 is proposed.
•Signal-based syllable prominence is a reliable predictor of prosodic proficiency.
•Wavelet-based analysis of prominence correlates significantly with expert ratings.
•The method has potential for fully automatic L2 proficiency assessment.
Prosodic characteristics, such as lexical and phrasal stress, are among the most challenging features for second language (L2) speakers to learn. The ability to quantify language learners’ proficiency in terms of prosody can be of use to language teachers and improve the assessment of L2 speaking skills. Automatic assessment, however, requires reliable automatic analyses of prosodic features that allow for the comparison between the productions of L2 speech and reference samples. In this paper we investigate whether signal-based syllable prominence can be used to predict the prosodic competence of Finnish learners of Swedish. Syllable-level prominence was estimated for 180 L2 and 45 native (L1) utterances by a continuous wavelet transform analysis using combinations of f0, energy, and duration. The L2 utterances were graded by four expert assessors using the revised CEFR scale for prosodic features. Correlations of prominence estimates for L2 utterances with estimates for L1 utterances and linguistic stress patterns were used as a measure of prosodic proficiency of the L2 speakers. The results show that the level of agreement conceptualized in this way correlates significantly with the assessments of expert raters, providing strong support for the use of wavelet-based prominence estimation techniques in computer-assisted assessment of L2 speaking skills.
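The agreement measure described above can be sketched as a correlation between per-syllable prominence profiles. The snippet below is a minimal illustration only: the choice of Pearson correlation, the function name `prominence_agreement`, and all numeric values are assumptions for this example, not data or code from the study.

```python
import numpy as np

def prominence_agreement(l2_prominence, reference_prominence):
    """Pearson correlation between two per-syllable prominence profiles.

    Both profiles must describe the same utterance, syllable by syllable.
    """
    l2 = np.asarray(l2_prominence, dtype=float)
    ref = np.asarray(reference_prominence, dtype=float)
    if l2.shape != ref.shape:
        raise ValueError("profiles must have the same number of syllables")
    return float(np.corrcoef(l2, ref)[0, 1])

# Made-up prominence values for a four-syllable utterance (assumption).
native_reference = [0.1, 0.9, 0.2, 0.7]   # L1 speaker, stress on syllables 2 and 4
learner_close = [0.2, 1.0, 0.3, 0.8]      # same stress pattern as the reference
learner_shifted = [0.9, 0.2, 0.8, 0.1]    # stress placed on the wrong syllables

score_close = prominence_agreement(learner_close, native_reference)
score_shifted = prominence_agreement(learner_shifted, native_reference)
```

A learner whose prominence pattern follows the reference scores close to 1, while misplaced stress pushes the score towards or below zero; a binary linguistic stress pattern could be used as the reference in the same way.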
Over the last century, researchers have collected a considerable amount of data reflecting the properties of Lombard speech, i.e., speech in a noisy environment. The documented phenomena predominantly report effects on the speech signal produced in ambient noise. In comparison, relatively little is known about the underlying articulatory patterns of Lombard speech, in particular for lingual articulation. Here the authors present an analysis of articulatory recordings of speech material in babble noise of different intensity levels and in hypoarticulated speech and report quantitative differences in relative expansion of movement of different articulatory subsystems (the jaw, the lips and the tongue) as well as in relative expansion of utterance duration. The trajectory modifications for one articulator can be predicted relatively reliably from those for another, but the subsystems differ in the degree of continuity in trajectory expansion elicited across different noise levels. Regression analysis of articulatory modifications against durational expansion shows further qualitative differences between the subsystems, namely, the jaw and the tongue. The findings are discussed in terms of possible influences of a combination of prosodic, segmental, and physiological factors. In addition, the Lombard effect is put forward as a viable methodology for eliciting global articulatory variation in a controlled manner.
This paper shows that a highly simplified model of speech production based on the optimization of articulatory effort versus intelligibility can account for some observed articulatory consequences of signal-to-noise ratio. Simulations of static vowels in the presence of various background noise levels show that the model predicts articulatory and acoustic modifications of the type observed in Lombard speech. These features were obtained only when the constraint applied to articulatory effort decreased as the level of background noise increased. These results support the hypothesis that Lombard speech is listener oriented and that speakers adapt their articulation in noisy environments.
•An optimization-based model of suprasegmental speech timing is presented.
•Timing patterns are modeled by trading off economy and clarity-related demands.
•Global and local weights allow for simulating different prosodic conditions.
•Simulation experiments demonstrate replication of various timing phenomena.
We present a model of suprasegmental speech timing based on the assumption that speech patterns are shaped by global and local adjustments of trade-offs between conflicting demands of minimizing production effort and maximizing perceptual clarity. The model uses an optimization procedure to determine durations of suprasegmental constituents of simulated utterances by minimizing an independently motivated composite cost function. The cost is a function of the constituent durations and encompasses different components that represent independently derived measures of speaker-based production effort, listener-oriented perceptual clarity as well as time conceptualized as a resource shared between both parties, linked to transmission efficiency. The trade-offs between these influences can be globally and locally adjusted by weights assigned to individual cost components within the composite cost function. We show that this approach facilitates modeling a hierarchy of interacting prosodic features of utterances, such as different degrees of prominence or effects of speaking rate and overall requirements of clarity. We outline the theoretical foundations and the architecture of the model and present results of simulation experiments, demonstrating that the model correctly predicts a range of suprasegmental timing phenomena in stress-accent languages that have not been addressed by a unified model. Results underline the model’s capacity to account for several empirical observations regarding durational variation in speech.
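The weighted trade-off can be illustrated with a toy composite cost. In the sketch below, each constituent's duration minimizes the sum of an effort term, a time term, and a clarity term; the functional forms (quadratic, linear, and inverse in duration), the weight values, and the function names are all assumptions chosen for this example and are not the cost components of the model itself.

```python
def optimal_duration(w_effort, w_time, w_clarity, lo=1e-6, hi=1e3):
    """Duration d > 0 minimizing w_effort*d**2 + w_time*d + w_clarity/d.

    The cost is convex for d > 0, so its derivative
    2*w_effort*d + w_time - w_clarity/d**2 is increasing in d and has
    a single root, which is located here by bisection.
    """
    def slope(d):
        return 2.0 * w_effort * d + w_time - w_clarity / d**2

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid   # cost still decreasing: the minimum lies to the right
        else:
            hi = mid   # cost increasing: the minimum lies to the left
    return 0.5 * (lo + hi)

def utterance_durations(clarity_weights, w_effort=1.0, w_time=1.0):
    """Durations for a sequence of constituents with local clarity weights."""
    return [optimal_duration(w_effort, w_time, w) for w in clarity_weights]

# Local adjustment (assumption): the second syllable is prominent, so it
# receives a higher clarity weight than its neighbours.
local_clarity = [1.0, 3.0, 1.0]
normal_rate = utterance_durations(local_clarity)

# Global adjustment (assumption): a larger weight on time stands in for a
# faster speaking rate and affects every constituent.
fast_rate = utterance_durations(local_clarity, w_time=4.0)
```

With these toy terms, raising a local clarity weight lengthens that constituent, and raising the global time weight shortens all of them, mirroring the local and global adjustments the abstract describes. The toy cost is separable across constituents, so it cannot capture interactions between them.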
•Language background affects cABR responses to sound.
•Finnish (quantity language) speakers show higher cABR peak amplitude than German speakers.
•Changes in sound intensity and spectral band delay the cABR response.
The complex auditory brainstem response (cABR) can reflect language-based plasticity in subcortical stages of auditory processing. It is sensitive to differences between language groups as well as stimulus properties, e.g. intensity or frequency. It is also sensitive to the synchronicity of the neural population stimulated by sound, which results in increased amplitude of wave V.
Finnish is a full-fledged quantity language, in which word meaning is dependent upon duration of the vowels and consonants. Previous studies have shown that Finnish speakers have enhanced behavioural sound duration discrimination ability and larger cortical mismatch negativity (MMN) to duration change compared to German and French speakers.
The next step is to find out whether these enhanced duration discrimination abilities of quantity language speakers originate at the brainstem level. Since German has a complementary quantity contrast which restricts the possible patterns of short and long vowels and consonants, the current experiment compared cABR between nonmusician Finnish and German native speakers using seven short complex stimuli. Finnish speakers had a larger cABR peak amplitude than German speakers, while the peak onset latency was only affected by stimulus intensity and spectral band. The results suggest that early cABR responses are better synchronised for Finns, which could underpin the enhanced duration sensitivity of quantity language speakers.
Embodied Task Dynamics is a modeling platform combining a task-dynamical implementation of articulatory phonology with an optimization approach based on adjustable trade-offs between production efficiency and perception efficacy. Within this platform we model a consonantal quantity contrast in bilabial stops as emerging from local adjustment of demands on the relative prominence of the consonantal gesture, conceptualized in terms of closure duration. The contrast is manifested in the form of two distinct, stable inter-gestural coordination patterns characterized by quantitative differences in relative phasing between the consonant and the coproduced vocalic gesture. Furthermore, the model generates a set of qualitative predictions regarding the dependence of kinematic characteristics and inter-gestural coordination on consonant quantity and gestural context. To evaluate these predictions, we collected articulatory data for Finnish speakers uttering singletons and geminates in the same context as explored by the model. Statistical analysis of the data shows strong agreement with model predictions. This result provides support for the hypothesis that speech articulation is guided by efficiency principles that underlie many other types of embodied skilled action.
•We model consonant quantity contrast within an optimization-based dynamical platform.
•The contrast emerges as two discretely different solutions of the optimization task.
•A dependency of gestural timing on gemination and articulatory context is predicted.
•We collected Finnish articulatory data to evaluate the model predictions.
•Statistical analysis shows a strong agreement between the data and the predictions.