Human speech perception results from neural computations that transform external acoustic speech signals into internal representations of words. The superior temporal gyrus (STG) contains the nonprimary auditory cortex and is a critical locus for phonological processing. Here, we describe how speech sound representation in the STG relies on fundamentally nonlinear and dynamical processes, such as categorization, normalization, contextual restoration, and the extraction of temporal structure. A spatial mosaic of local cortical sites on the STG exhibits complex auditory encoding for distinct acoustic-phonetic and prosodic features. We propose that as a population ensemble, these distributed patterns of neural activity give rise to abstract, higher-order phonemic and syllabic representations that support speech perception. This review presents a multi-scale, recurrent model of phonological processing in the STG, highlighting the critical interface between auditory and language systems.
Speech analysis could provide an indicator of Alzheimer's disease and help develop clinical tools for automatically detecting and monitoring disease progression. While previous studies have employed acoustic (speech) features for characterisation of Alzheimer's dementia, these studies focused on a few common prosodic features, often in combination with lexical and syntactic features which require transcription. We present a detailed study of the predictive value of purely acoustic features automatically extracted from spontaneous speech for Alzheimer's dementia detection, from a computational paralinguistics perspective. The effectiveness of several state-of-the-art paralinguistic feature sets for Alzheimer's detection was assessed on a balanced sample of DementiaBank's Pitt spontaneous speech dataset, with patients matched by gender and age. The feature sets assessed were the extended Geneva minimalistic acoustic parameter set (eGeMAPS), the emobase feature set, the ComParE 2013 feature set, and new Multi-Resolution Cochleagram (MRCG) features. Furthermore, we introduce a new active data representation (ADR) method for feature extraction in Alzheimer's dementia recognition. Results show that classification models based solely on acoustic speech features extracted through our ADR method can achieve accuracy levels comparable to those achieved by models that employ higher-level language features. Analysis of the results suggests that all feature sets contribute information not captured by other feature sets. We show that while the eGeMAPS feature set provides slightly better accuracy than other feature sets individually (71.34%), "hard fusion" of feature sets improves accuracy to 78.70%.
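To make the pipeline concrete, here is a minimal sketch of per-recording acoustic feature extraction and decision-level ("hard") fusion, assuming the `opensmile` Python package and scikit-learn; the paths, labels, and majority-vote fusion rule are illustrative assumptions, not the authors' exact recipe.

```python
# Sketch: eGeMAPS functionals per recording plus decision-level ("hard")
# fusion by majority vote. Assumes the `opensmile` and scikit-learn
# packages; paths, labels, and the vote rule are illustrative.
import numpy as np
import opensmile
from sklearn.svm import SVC

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,      # 88 functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract(paths):
    # One fixed-length eGeMAPS vector per utterance.
    return np.vstack([smile.process_file(p).to_numpy() for p in paths])

def predict_one_set(train_paths, y_train, test_paths):
    # A single per-feature-set classifier; emobase, ComParE and MRCG
    # predictions would be produced analogously with their extractors.
    clf = SVC(kernel="linear")
    clf.fit(extract(train_paths), y_train)
    return clf.predict(extract(test_paths))

def hard_fusion(per_set_predictions):
    # Majority vote over binary (0/1 = control/AD) per-set decisions.
    votes = np.stack(per_set_predictions)
    return (votes.mean(axis=0) > 0.5).astype(int)
```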
•This is the first journal submission describing in considerable detail acoustic-prosodic entrainment (the tendency of speakers to speak like one another in successful conversations) in speakers of Mandarin Chinese.
•Findings include the following: in Mandarin, global (over a whole conversation) and local (turn-by-turn) entrainment are quite different in terms of the features speakers entrain on.
•Local entrainment is more prevalent than global.
•Entrainment over tone units is the most prominent form of local entrainment in Mandarin.
•Mandarin speakers also entrain globally and locally in terms of intensity (loudness).
•Durational and F0 (fundamental frequency) entrainment is also more prevalent in local entrainment.
•These findings are compared with our and our colleagues’ findings for Slovak and English.
•We also propose additional and more detailed research for Mandarin Chinese entrainment.
Previous research on acoustic entrainment has paid less attention to tones than to other prosodic features. This study establishes a hierarchical framework of three levels (conversations, turns, and tone units), investigates prosodic entrainment in Mandarin spontaneous dialogues at each level, and compares the three. We found that (1) global and local entrainment exist independently, and local entrainment is more evident than global; (2) prosodic features vary in their contribution to entrainment across the three levels, with amplitude features exhibiting more prominent entrainment at both the global and local levels, and speaking-rate and F0 features showing more prominence at the local levels; and (3) no convergence is found at the conversational level, at the turn level, or over tone units.
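The abstract does not give formulas, but entrainment in this literature is commonly quantified through proximity, synchrony, and convergence measures. The sketch below illustrates one such formulation under that assumption, with `a` and `b` holding per-turn values of a single prosodic feature (e.g., mean intensity or F0) for the two speakers of a conversation.

```python
# Sketch of common entrainment measures (an assumption, not necessarily
# the authors' exact definitions). `a` and `b` are aligned per-turn
# feature values for the two speakers of one conversation.
import numpy as np

def global_proximity(a, b):
    # Global entrainment: similarity of session-level means; typically
    # compared against the same measure for non-partner speaker pairs.
    return -abs(np.mean(a) - np.mean(b))

def local_synchrony(a, b):
    # Local (turn-by-turn) entrainment: correlation between a speaker's
    # turn and the partner's immediately following turn.
    return np.corrcoef(a[:-1], b[1:])[0, 1]

def convergence(a, b):
    # Convergence: does the absolute difference shrink over time?
    # A negative slope over turns indicates convergence.
    d = np.abs(np.asarray(a) - np.asarray(b))
    t = np.arange(len(d))
    return np.polyfit(t, d, 1)[0]
```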
While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is unsatisfying how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., to learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force respective nets to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing the explainability of deep learning for speech technologies.
•Literature explains speaker recognition in neural nets by modeling of voice dynamics.
•Diagnostic: We quantify how well deep learning models actually capture dynamics (one possible test is sketched below).
•Observation: State-of-the-art deep nets do not model speaker prosody but ignore it.
•Interpretation as “cheating”: Achieving high accuracy without putting in due effort.
•Outlook: Increasing task difficulty biases models towards prosody, but not enough.
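The highlights do not spell out the diagnostic, so the following is only one plausible instantiation: score a trained speaker-ID model on intact versus time-shuffled inputs. Shuffling frames destroys supra-segmental temporal structure while preserving the spectral feature distribution, so a model relying purely on spectral cues should lose little accuracy. The `model` interface and feature shapes are assumptions.

```python
# Sketch of one possible SST diagnostic (an assumption, not necessarily
# the authors' exact test): compare speaker-ID accuracy on intact vs.
# time-shuffled utterances.
import numpy as np
import torch

def shuffle_time(feats, rng):
    # feats: (time, mel_bins) spectrogram of one utterance; permuting
    # rows removes temporal order but keeps the frame distribution.
    return feats[rng.permutation(feats.shape[0])]

@torch.no_grad()
def accuracy(model, utterances, labels, shuffled=False, seed=0):
    rng = np.random.default_rng(seed)
    correct = 0
    for x, y in zip(utterances, labels):
        if shuffled:
            x = shuffle_time(x, rng)
        logits = model(torch.as_tensor(x[None], dtype=torch.float32))
        correct += int(logits.argmax(-1).item() == y)
    return correct / len(labels)

# An SST-reliance score: accuracy lost when temporal order is removed.
# sst_gap = accuracy(model, utts, labs) - accuracy(model, utts, labs, True)
```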
•Novel corpus of prosodic portrayals of speakers’ communicative intentions.
•Speakers use characteristic prosodic (acoustic) patterns to express their intentions.
•Listeners use these prosodic patterns to understand the speaker’s intention.
•Comprehension is not contingent on context, semantics or affective processing.
•Prosody serves as purposeful interactional instrument between speaker and listener.
Action-theoretic views of language posit that the recognition of others’ intentions is key to successful interpersonal communication. Yet, speakers do not always code their intentions literally, raising the question of which mechanisms enable interlocutors to exchange communicative intents. The present study investigated whether and how prosody, the vocal tone, contributes to the identification of “unspoken” intentions. Single (non-)words were spoken with six intonations representing different speech acts, as carriers of communicative intentions. This corpus was acoustically analyzed (Experiment 1) and behaviorally evaluated in two experiments (Experiments 2 and 3). The combined results show characteristic prosodic feature configurations for different intentions that were reliably recognized by listeners. Interestingly, identification of intentions was not contingent on context (single words), lexical information (non-words), or recognition of the speaker’s emotion (valence and arousal). Overall, the data demonstrate that speakers’ intentions are represented in the prosodic signal, which can thus determine the success of interpersonal communication.
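As an illustration of the kind of acoustic analysis run in Experiment 1, the sketch below extracts a per-utterance prosodic profile (duration, F0, intensity) using Praat via the `parselmouth` package; this mirrors the feature types described, not necessarily the authors' toolchain.

```python
# Sketch: per-utterance prosodic summary using Praat via `parselmouth`.
# Feature choices mirror the study's description; thresholds and the
# summary statistics are illustrative.
import numpy as np
import parselmouth

def prosodic_profile(wav_path):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                       # drop unvoiced frames
    intensity = snd.to_intensity().values.flatten()
    return {
        "duration_s": snd.get_total_duration(),
        "f0_mean_hz": float(np.mean(f0)) if f0.size else float("nan"),
        "f0_range_hz": float(np.ptp(f0)) if f0.size else float("nan"),
        "intensity_mean_db": float(np.mean(intensity)),
    }
```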
•Higher speech intensity and lower speech rates improve automatic speech recognition accuracy.
•Arab ESL teachers and students give more attention to pronunciation errors that do not affect intelligibility.
•Arabic-influenced ESL pronunciation errors affecting automatic speech recognition are related to consonant clusters and sounds with adjacent places of articulation that are likely to appear in minimal pairs (as in temper vs. temple).
The present study examines the impact of Arab speakers’ phonological and prosodic features on the accuracy of automatic speech recognition (ASR) of non-native English speech. The authors first investigated the perceptions of 30 Egyptian ESL teachers and 70 Egyptian university students towards the L1 (Arabic)-based errors affecting intelligibility and then carried out a data analysis of the ASR of the students’ English speech to find out whether the errors investigated resulted in intelligibility breakdowns in an ASR setting. In terms of the phonological features of non-native speech, the results showed that the teachers gave more weight to pronunciation features of accented speech that did not actually hinder recognition, that the students were mostly oblivious to the L2 errors they made and their impact on intelligibility, and that L2 errors which were not perceived as serious by either teachers or students had negative impacts on ASR accuracy. In regard to the prosodic features of non-native speech, it was found that lower speech rates resulted in more accurate speech recognition, higher speech intensity led to fewer deletion errors, and voice pitch did not seem to have any impact on ASR accuracy. The study accordingly recommends training ASR systems with more non-native data to increase their accuracy, as well as paying more attention to remedying non-native speakers’ L1-based errors that are more likely to impact non-native automatic speech recognition.
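A hedged sketch of how such prosody/accuracy relationships can be tested: compute per-utterance word error rate (WER) with the `jiwer` package and correlate it with speech rate and intensity. The input lists (`refs`, `hyps`, `rates`, `intensities`) are placeholders, e.g., reference and ASR transcripts plus syllables-per-second and mean-dB values.

```python
# Sketch: correlate per-utterance ASR word error rate with prosody.
# `refs`/`hyps` are reference and ASR transcripts; `rates`/`intensities`
# are placeholder prosodic measures (e.g., syllables/s and mean dB).
import jiwer
from scipy.stats import pearsonr

def wer_prosody_correlations(refs, hyps, rates, intensities):
    wers = [jiwer.wer(r, h) for r, h in zip(refs, hyps)]
    # Pattern reported in the study: faster speech and lower intensity
    # go with higher WER (i.e., lower recognition accuracy).
    return {
        "rate_vs_wer": pearsonr(rates, wers),
        "intensity_vs_wer": pearsonr(intensities, wers),
    }
```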
Speech is an effective medium for expressing emotions and attitudes through language. Finding the emotional content in a speech signal and identifying the emotions in speech utterances are important tasks for researchers. Speech emotion recognition has been considered an important research area over the last decade, and many researchers have been attracted to the automated analysis of human affective behaviour. As a result, a number of systems, algorithms, and classifiers have been developed and outlined for identifying the emotional content of a person’s speech. In this study, the available literature on various databases, features, and classifiers has been taken into consideration for speech emotion recognition across assorted languages.
Speech emotion recognition (SER) research has usually focused on the analysis of the native language of speakers, most commonly, targeting European and Asian languages. In the present study, a bilingual Arabic/English speech emotion database elicited from 16 male and 16 female Egyptian participants was created in order to investigate how the linguistic and prosodic features were affected by the anger, fear, happiness and sadness emotions across Arabic and English emotional speech. The results of the linguistic analysis indicated that the participants preferred to express their emotions indirectly, mainly using religious references, and that the female participants tended to use language that was more tentative and emotionally expressive, while the male participants tended to use language that was more assertive and independent. As for the prosodic analysis, statistical t-tests showed that the prosodic features of pitch, intensity and speech rate were more indicative of anger and happiness while less relevant to fear and scarcely significant for sadness. Furthermore, speech emotion recognition performed using linear support vector machine (SVM) with AdaBoost also supported these results. In regard to first and second language linguistic features, there was no significant difference in the choice of words and structures expressing the different emotions in the two languages, but in terms of prosodic features, the females' speech showed higher pitch in Arabic in all cases while both genders showed close intensity values in the two languages and faster speech rate in Arabic than in English.
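The abstract names linear SVM with AdaBoost; a minimal scikit-learn sketch follows (recent scikit-learn assumed; the `estimator` parameter was called `base_estimator` before v1.2). The feature matrix `X` (per-utterance pitch, intensity, and speech-rate statistics) and emotion labels `y` are placeholders.

```python
# Minimal sketch of "linear SVM with AdaBoost" for four-class SER with
# scikit-learn. X rows would be per-utterance prosodic vectors; y the
# emotion labels (anger, fear, happiness, sadness).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_ser_model():
    # SAMME is used because it only needs hard predictions plus
    # per-sample weights, both of which SVC supports.
    return make_pipeline(
        StandardScaler(),
        AdaBoostClassifier(
            estimator=SVC(kernel="linear"),
            algorithm="SAMME",
            n_estimators=25,
        ),
    )
# Usage (placeholders): build_ser_model().fit(X, y)
```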
This paper describes a novel end-to-end deep generative model-based speaker recognition system using prosodic features. The usefulness of variational autoencoders (VAE) in learning the speaker-specific prosody representations for the speaker recognition task is examined herein for the first time. The speech signal is first automatically segmented into syllable-like units using vowel onset points (VOP) and energy valleys. Prosodic features, such as the dynamics of duration, energy, and fundamental frequency (F0), are then extracted at the syllable level and used to train/adapt a speaker-dependent VAE from a universal VAE. The initial comparative studies on VAEs and traditional autoencoders (AE) suggest that the former can efficiently learn speaker representations. Investigations on the impact of gender information in speaker recognition also point out that gender-dependent impostor banks lead to higher accuracies. Finally, the evaluation on the NIST SRE 2010 dataset demonstrates the usefulness of the proposed approach for speaker recognition.
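As an illustration of the model class, here is a small PyTorch VAE over syllable-level prosodic vectors (duration, energy, F0 dynamics); all dimensions are illustrative assumptions. A "universal" VAE would be trained on all speakers and then fine-tuned per speaker, with test utterances scored against speaker and impostor models.

```python
# Sketch: a small VAE over syllable-level prosodic feature vectors.
# Dimensions are illustrative, not the paper's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyVAE(nn.Module):
    def __init__(self, in_dim=12, hidden=64, z_dim=8):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim)
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```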