Purpose: Some speech recognition data suggest that children rely less on voice pitch and harmonicity to support auditory scene analysis than adults. Two experiments evaluated development of speech-in-speech recognition using voiced speech and whispered speech, which lacks the harmonic structure of voiced speech. Method: Listeners were 5- to 7-year-olds and adults with normal hearing. Targets were monosyllabic words organized into three-word sets that differed in vowel content. Maskers were two-talker or one-talker streams of speech. Targets and maskers were recorded by different female talkers in both voiced and whispered speaking styles. For each masker, speech reception thresholds (SRTs) were measured in all four combinations of target and masker speech, including matched and mismatched speaking styles for the target and masker. Results: Children performed more poorly than adults overall. For the two-talker masker, this age effect was smaller for the whispered target and masker than for the other three conditions. Children's SRTs in this condition were predominantly positive, suggesting that they may have relied on a holistic listening strategy rather than segregating the target from the masker. For the one-talker masker, age effects were consistent across the four conditions. Reduced informational masking for the one-talker masker could be responsible for differences in age effects for the two maskers. A benefit of mismatching the target and masker speaking style was observed for both target styles in the two-talker masker and for the voiced targets in the one-talker masker. Conclusions: These results provide no compelling evidence that young school-age children and adults are differentially sensitive to the cues present in voiced and whispered speech. Both groups benefit from mismatches in speaking style under some conditions. These benefits could be due to a combination of reduced perceptual similarity, harmonic cancellation, and differences in energetic masking.
Purpose: Speech-evoked neurophysiological responses are often collected to answer clinically and theoretically driven questions concerning speech and language processing. Here, we highlight the practical application of machine learning (ML)-based approaches to analyzing speech-evoked neurophysiological responses. Method: Two categories of ML-based approaches are introduced: decoding models, which generate a speech stimulus output using the features from the neurophysiological responses, and encoding models, which use speech stimulus features to predict neurophysiological responses. In this review, we focus on (a) a decoding model classification approach, wherein speech-evoked neurophysiological responses are classified as belonging to one of a finite set of possible speech events (e.g., phonological categories), and (b) an encoding model temporal response function approach, which quantifies the transformation of a speech stimulus feature to continuous neural activity. Results: We illustrate the utility of the classification approach to analyze early electroencephalographic (EEG) responses to Mandarin lexical tone categories from a traditional experimental design, and to classify EEG responses to English phonemes evoked by natural continuous speech (i.e., an audiobook) into phonological categories (plosive, fricative, nasal, and vowel). We also demonstrate the utility of the temporal response function approach to predict EEG responses to natural continuous speech from acoustic features. Neural metrics from the three examples all exhibit statistically significant effects at the individual level. Conclusion: We propose that ML-based approaches can complement traditional analysis approaches to analyze neurophysiological responses to speech signals and provide a deeper understanding of natural speech and language processing using ecologically valid paradigms in both typical and clinical populations.
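As a concrete illustration of the encoding (temporal response function) idea described above, the following is a minimal Python sketch, assuming a single continuous stimulus feature (e.g., the acoustic envelope) and a single EEG channel; the ridge regression over time-lagged copies of the stimulus is a generic stand-in rather than the authors' specific pipeline, and all variable names and values are hypothetical.

```python
import numpy as np

def lagged_design_matrix(stimulus, min_lag, max_lag):
    """Build a design matrix of time-lagged copies of a 1-D stimulus feature
    (e.g., the acoustic envelope), one column per lag in samples."""
    lags = np.arange(min_lag, max_lag + 1)
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stimulus[:n - lag]
        else:
            X[:lag, j] = stimulus[-lag:]
    return X, lags

def fit_trf(stimulus, eeg, min_lag, max_lag, alpha=1.0):
    """Estimate a temporal response function by ridge regression:
    eeg(t) ~ sum over lags of w[lag] * stimulus(t - lag)."""
    X, lags = lagged_design_matrix(stimulus, min_lag, max_lag)
    # Closed-form ridge solution: w = (X'X + alpha*I)^-1 X'y
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ eeg)
    return w, lags

# Toy usage with simulated data (hypothetical 100 Hz sampling)
rng = np.random.default_rng(0)
envelope = rng.standard_normal(6000)            # speech envelope feature
true_kernel = np.hanning(20)                    # simulated neural response
eeg = np.convolve(envelope, true_kernel, mode="full")[:6000]
eeg += 0.5 * rng.standard_normal(6000)          # additive noise
w, lags = fit_trf(envelope, eeg, min_lag=0, max_lag=40)
```

Ridge regularization (alpha) is needed because adjacent lags of a smooth envelope are highly correlated; the estimated weights across lags form the TRF.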
Speech-language pathologists and audiologists are called to serve an increasingly diverse patient population in the United States. This increased diversity highlights the need for clinicians to be educated early in their careers about best practices to serve patients and clients from diverse backgrounds. In this clinical focus article, the authors present the development, implementation, and preliminary perceptions of a culturally responsive clinical experience for speech-language pathology graduate students designed to engage them early in their learning career.
The pilot program was based on pillars of experiential learning and community engagement. Graduate students attended trainings aligned with a model of culturally relevant care to prepare them to conduct speech and language screenings and small group language enrichment in English and Spanish.
Preliminary analyses of student reflections indicated themes of positive perceptions about the experience and provided preliminary support for students learning about working with culturally and linguistically diverse populations in an early, intentional, and focused experience.
Early personnel preparation in culturally responsive care is crucial to meet the needs of future caseloads. Further research into the effectiveness of this kind of program is necessary to identify which variables may have the most impact on a student's cultural sensitivity, awareness, knowledge, and skills.
Purpose: Patients with voice problems commonly report increased vocal effort, regardless of the underlying pathophysiology. Previous studies investigating vocal effort and voice production have used a range of methods to quantify vocal effort. The goals of the current study were to use the Borg CR100 effort scale to (a) demonstrate the relation between vocal intensity or vocal level (dB) and tasked vocal effort goals and (b) investigate the repeated-measures reliability of vocal level at tasked effort level goals. Method: Three types of speech (automatic, read, and structured spontaneous) were elicited at four vocal effort level goals on the Borg CR100 scale (2, 13, 25, and 50) from 20 participants (10 females and 10 males). Results: Participants' vocal level reliably changed by approximately 5 dB between the elicited effort level goals; this difference was statistically significant and repeatable. Biological females consistently produced lower vocal intensity than biological males at a given vocal effort level goal. Conclusions: The results indicate the utility of the Borg CR100 in tracking effort in voice production that is repeatable with respect to vocal level (dB). Future research will investigate other metrics of voice production with the goal of understanding the mechanisms underlying vocal effort and the external environmental influences on the perception of vocal effort.
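For readers unfamiliar with the vocal level metric, the following is a minimal sketch, assuming an uncalibrated recording, of how level in dB can be computed from a signal's RMS amplitude; absolute dB SPL would require a measured calibration reference, and the values shown are hypothetical rather than data from the study.

```python
import numpy as np

def vocal_level_db(signal, reference=1.0):
    """Root-mean-square level of a speech recording in dB re: an arbitrary
    reference amplitude (calibration is needed to express this in dB SPL)."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20.0 * np.log10(rms / reference)

# Hypothetical example: scaling a signal by 10**(5/20) (about 1.78x) raises
# its level by 5 dB, comparable to the step between adjacent effort goals.
rng = np.random.default_rng(1)
speech = rng.standard_normal(48000) * 0.05
print(vocal_level_db(speech), vocal_level_db(speech * 10 ** (5 / 20)))
```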
Natural speech builds on contextual relations that can prompt predictions of upcoming utterances. To study the neural underpinnings of such predictive processing, we asked 10 healthy adults to listen to a 1-h-long audiobook while their magnetoencephalographic (MEG) brain activity was recorded. We correlated the MEG signals with the acoustic speech envelope, as well as with estimates of Bayesian word probability with and without the contextual word sequence (N-gram and Unigram, respectively), with a focus on time lags. The MEG signals of auditory and sensorimotor cortices were strongly coupled to the speech envelope at the rates of syllables (4–8 Hz) and of prosody and intonation (0.5–2 Hz). The probability structure of word sequences, independently of the acoustical features, affected the ≤ 2-Hz signals extensively in auditory and rolandic regions, in precuneus, occipital cortices, and lateral and medial frontal regions. Fine-grained temporal progression patterns occurred across brain regions 100–1000 ms after word onsets. Although the acoustic effects were observed in both hemispheres, the contextual influences were statistically significantly lateralized to the left hemisphere. These results serve as a brain signature of the predictability of word sequences in listened continuous speech, confirming and extending previous results to demonstrate that deeply learned knowledge and recent contextual information are employed dynamically and in a left-hemisphere-dominant manner in predicting the forthcoming words in natural speech.
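The contrast between context-free (Unigram) and context-conditioned (N-gram) word probabilities can be illustrated with a minimal sketch; a bigram model with add-one smoothing stands in here for whatever larger N-gram model the study actually used, and the toy corpus is hypothetical.

```python
import math
from collections import Counter

def word_probabilities(corpus_tokens, text_tokens):
    """Unigram and bigram (context-conditioned) probabilities for each word
    of `text_tokens`, estimated from `corpus_tokens` with add-one smoothing."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)
    total = len(corpus_tokens)
    results = []
    for prev, word in zip(text_tokens, text_tokens[1:]):
        p_uni = (unigrams[word] + 1) / (total + vocab)
        p_bi = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
        # Surprisal (-log2 p) is the form typically entered into regressions
        results.append((word, -math.log2(p_uni), -math.log2(p_bi)))
    return results

corpus = "the cat sat on the mat and the cat slept".split()
for word, s_uni, s_bi in word_probabilities(corpus, "the cat sat".split()):
    print(f"{word}: unigram surprisal {s_uni:.2f}, bigram surprisal {s_bi:.2f}")
```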
•A lightweight model using a one-dimensional CNN for a real-time SER system is proposed.
•A multi-learning trick (MLT) is proposed for utilizing UFLBs and a stacked-GRU setup.
•The proposed model can learn spatial and temporal features in parallel.
•A 1D dilated CNN architecture is explored in order to enhance the usage of features.
•We evaluated our model on benchmark corpora and improved on current baseline methods.
Speech is the most dominant source of communication among humans, and it is an efficient way for human–computer interaction (HCI) to exchange information. Nowadays, speech emotion recognition (SER) is an active research area that plays a crucial role in real-time applications, yet existing SER systems still lack real-time speech processing. To address this problem, we propose an end-to-end real-time SER model based on a one-dimensional dilated convolutional neural network (DCNN). Our model uses a multi-learning strategy to extract salient spatial emotional features and learn long-term contextual dependencies from the speech signals in parallel. We use a residual blocks with skip connection (RBSC) module to find correlations and emotional cues, and a sequence learning (Seq_L) module to learn the long-term contextual dependencies in the input features. Furthermore, we use a fusion layer to concatenate these learned features for the final emotion recognition task. Our model structure is quite simple, and it is capable of automatically learning salient discriminative features from the speech signals. We evaluated our model using the benchmark IEMOCAP and EMO-DB datasets and obtained high recognition accuracies of 73% and 90%, respectively. The experimental results indicate that the proposed model is both effective and efficient and can support the implementation of a real-time SER system. Hence, our model is capable of processing raw speech signals for emotion recognition using a lightweight dilated CNN architecture that implements the multi-learning trick (MLT) approach.
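A minimal PyTorch sketch of the kind of dual-branch architecture described above (a 1-D dilated convolutional branch with a residual skip connection in parallel with a stacked-GRU branch, fused before classification) is given below; the layer sizes, feature dimensions, and pooling choices are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DilatedSERNet(nn.Module):
    """Sketch of a dual-branch SER model: a 1-D dilated CNN branch with a
    residual skip connection (spatial cues) and a stacked-GRU branch
    (temporal context), concatenated before the classifier."""
    def __init__(self, n_mels=40, n_classes=4):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, 64, kernel_size=3, dilation=1, padding=1)
        self.conv2 = nn.Conv1d(64, 64, kernel_size=3, dilation=2, padding=2)
        self.skip = nn.Conv1d(n_mels, 64, kernel_size=1)   # residual/skip path
        self.gru = nn.GRU(n_mels, 64, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(64 + 64, n_classes)

    def forward(self, x):                      # x: (batch, n_mels, time)
        spatial = torch.relu(self.conv1(x))
        spatial = torch.relu(self.conv2(spatial) + self.skip(x))  # residual add
        spatial = spatial.mean(dim=2)          # pool over time
        _, h = self.gru(x.transpose(1, 2))     # GRU expects (batch, time, feat)
        fused = torch.cat([spatial, h[-1]], dim=1)   # fusion layer input
        return self.classifier(fused)

# Toy usage: a batch of 8 clips, 40 mel bands, 300 frames
logits = DilatedSERNet()(torch.randn(8, 40, 300))
print(logits.shape)    # torch.Size([8, 4])
```

The two branches see the same input but summarize it differently (time-pooled convolutional features versus the final GRU hidden state), which is one simple way to realize the parallel spatial/temporal learning the abstract describes.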
Background
Electropalatography (EPG) records details of the location and timing of tongue contacts with the hard palate during speech. It has been effective in treating articulation disorders that have failed to respond to conventional therapy approaches but, until now, its use with children and adolescents with intellectual/learning disabilities and speech disorders has been limited.
Aims
To evaluate the usefulness of EPG in the treatment of speech production difficulties in children and adolescents with Down syndrome (DS) aged 8–18 years.
Methods & Procedures
A total of 27 children with DS were assessed on a range of cognitive and speech and language measures and underwent additional EPG assessment. Participants were randomly allocated to one of three age‐matched groups receiving either EPG therapy, EPG‐informed conventional therapy or ‘treatment as usual’ over a 12‐week period. The speech of all children was assessed before therapy using the Diagnostic Evaluation of Articulation and Phonology (DEAP) and reassessed immediately post‐ and 3 and 6 months post‐intervention to measure percentage consonants correct (PCC). EPG recordings were made of the DEAP assessment items at all time points. Per cent intelligibility was also calculated using the Children's Speech Intelligibility Measure (CSIM).
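As a reference for the outcome measure, the following is a simplified sketch of percentage consonants correct (PCC); the actual scoring follows the DEAP conventions for which consonants are targeted and how errors are counted, so the function below is only an assumed approximation with hypothetical transcriptions.

```python
def percent_consonants_correct(target_consonants, produced_consonants):
    """Simplified PCC: proportion of target consonants realized correctly,
    comparing aligned target and produced consonant lists, as a percentage."""
    correct = sum(t == p for t, p in zip(target_consonants, produced_consonants))
    return 100.0 * correct / len(target_consonants)

# Hypothetical example: target /k ae t s ih t/ -> consonants k, t, s, t
print(percent_consonants_correct(["k", "t", "s", "t"], ["t", "t", "s", "t"]))  # 75.0
```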
Outcomes & Results
Gains in accuracy of production immediately post‐therapy, as measured by PCC, were seen for all groups. Reassessment at 3 and 6 months post‐therapy revealed that those who had received therapy based directly on EPG visual feedback were more likely to maintain and improve on these gains compared with the other groups. Statistical testing showed significant differences between groups in DEAP scores across time points, although the majority did not survive post‐hoc evaluation. Intelligibility across time points, as measured by CSIM, was also highly variable within and between the three groups, but despite significant correlations between DEAP and CSIM at all time points, no statistically significant group differences emerged.
Conclusions & Implications
EPG was an effective intervention tool for improving speech production in many participants. This may be because it capitalizes on the relative strength of visual over auditory processing in this client group. The findings would seem to warrant an increased focus on addressing speech production difficulties in current therapy.
Purpose: This study examined the effects of lexical tone contour on the intelligibility of Mandarin sentences in quiet and in noise. Method: A text-to-speech synthesis engine was used to synthesize Mandarin sentences with each word carrying the original lexical tone, flat tone, or a tone randomly selected from the 4 Mandarin lexical tones. The synthesized speech signals were presented to 11 normal-hearing listeners for recognition in quiet and in speech-shaped noise at 0 dB signal-to-noise ratio. Results: Normal-hearing listeners nearly perfectly recognized the Mandarin sentences produced with modified tone contours in quiet; however, performance declined substantially in noise. Conclusions: Consistent with previous findings to some extent, the present findings suggest that lexical tones are relatively redundant cues for Mandarin sentence intelligibility in quiet and that other cues could compensate for the distorted lexical tone contour. However, in noise, the results provide direct evidence that lexical tone contour is important for the recognition of Mandarin sentences.
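The noise condition (speech-shaped noise at 0 dB signal-to-noise ratio) can be illustrated with a minimal mixing sketch; white noise stands in here for the speech-shaped masker, and the scaling simply matches the RMS ratio implied by the target SNR.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise RMS ratio equals `snr_db`,
    then add it to the speech (both assumed same-length 1-D arrays)."""
    rms = lambda x: np.sqrt(np.mean(np.square(x)))
    noise = noise * (rms(speech) / rms(noise)) / (10 ** (snr_db / 20))
    return speech + noise

# Toy usage: 0 dB SNR means speech and masker have equal RMS levels
rng = np.random.default_rng(2)
speech = rng.standard_normal(16000) * 0.1
mixed = mix_at_snr(speech, rng.standard_normal(16000), snr_db=0.0)
```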
Inner speech, also known as covert speech or verbal thinking, has been implicated in theories of cognitive development, speech monitoring, executive function, and psychopathology. Despite a growing body of knowledge on its phenomenology, development, and function, approaches to the scientific study of inner speech have remained diffuse and largely unintegrated. This review examines prominent theoretical approaches to inner speech and methodological challenges in its study, before reviewing current evidence on inner speech in children and adults from both typical and atypical populations. We conclude by considering prospects for an integrated cognitive science of inner speech, and present a multicomponent model of the phenomenon informed by developmental, cognitive, and psycholinguistic considerations. Despite its variability among individuals and across the life span, inner speech appears to perform significant functions in human cognition, which in some cases reflect its developmental origins and its sharing of resources with other cognitive processes.
Objective
To describe the results of two reliability studies and to assess the effect of training on interrater reliability scores.
Design
The first study (1) examined interrater and intrarater reliability scores (weighted and unweighted kappas) and (2) compared interrater reliability scores before and after training on the use of the Cleft Audit Protocol for Speech–Augmented (CAPS-A) with British English-speaking children. The second study examined interrater and intrarater reliability on a modified version of the CAPS-A (CAPS-A Americleft Modification) with American and Canadian English-speaking children. Finally, comparisons were made between the interrater and intrarater reliability scores obtained for Study 1 and Study 2.
Participants
The participants were speech-language pathologists from the Americleft Speech Project.
Results
In Study 1, interrater reliability scores improved for 6 of the 13 parameters following training on the CAPS-A protocol. Comparison of the reliability results for the two studies indicated lower scores for Study 2 compared with Study 1. However, this appeared to be an artifact of the kappa statistic that occurred due to insufficient variability in the reliability samples for Study 2. When percent agreement scores were also calculated, the ratings appeared similar across Study 1 and Study 2.
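The kappa artifact noted above can be illustrated with a small worked example: when one rater's scores show almost no variability, chance agreement is high and unweighted kappa collapses even though raw percent agreement remains high. The sketch below uses hypothetical ratings, not data from the studies.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa and raw percent agreement for two raters'
    categorical scores on the same samples."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_exp = sum((marg_a[c] / n) * (marg_b[c] / n) for c in categories)
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else float("nan")
    return kappa, 100.0 * p_obs

# Hypothetical ratings: 90% agreement, yet kappa is 0 because rater B used
# only one category, so expected (chance) agreement is already 90%.
a = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(cohens_kappa(a, b))    # (0.0, 90.0)
```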
Conclusion
The findings of this study suggested that improvements in interrater reliability could be obtained following a program of systematic training. However, improvements were not uniform across all parameters. Acceptable levels of reliability were achieved for those parameters most important for evaluation of velopharyngeal function.