Speech intelligibility (SI) measurement has attracted great attention in the speech communication community over the last decade. It is a critical consideration for speech enhancement, coding, and transmission, as well as for diagnostics. In particular, interest in nonintrusive SI measures, which can be applied in realistic settings because they require no reference speech signal, has grown rapidly. This paper reviews the methodology of nonintrusive SI measurement and aims to show how nonintrusive SI metrics perform relative to intrusive ones, as well as their potential in future speech communication applications. In addition, it provides a systematic classification of historical and recently introduced methods within a comprehensive framework, with critical comments and comparisons of their advantages and limitations. An extensive and up-to-date bibliography supplies background and an overview of recent advances. The current state of SI metric development is presented within an organized framework, together with analyses and examples of the utility of SI metrics in physiological research and clinical applications. Finally, the paper discusses important emerging and potential future research directions.
Intelligibility measures, which assess the number of words or phonemes a listener correctly transcribes or repeats, are commonly used metrics for speech perception research. While these measures have many benefits for researchers, they also come with a number of limitations. By pointing out the strengths and limitations of this approach, including how it fails to capture aspects of perception such as listening effort, this article argues that the role of intelligibility measures must be reconsidered in fields such as linguistics, communication disorders, and psychology. Recommendations for future work in this area are presented.
•Subjective speech intelligibility tests are conducted on different age groups with Chinese word lists, and the relationship between the Chinese speech intelligibility (CSI) score and the speech intelligibility index (SII) of the elderly is explored.
•An auralization technique is applied.
•There is a good correlation between the CSI score and SII in the elderly.
•The SII required by the elderly to obtain the same CSI score is higher than that required by the young, and it increases with age.
•Compared with the corresponding relationship between CSI and the speech transmission index (STI), the SII increment the elderly need to obtain the same CSI score as the young is smaller than the STI increment.
The speech intelligibility index (SII) and speech transmission index (STI) are widely accepted objective metrics for assessing speech intelligibility. In previous work, the relationship between STI and Chinese speech intelligibility (CSI) scores was studied. In this paper, the relationship between SII and the CSI scores of the elderly aged 60–69 and over 70 in rooms is investigated using an auralization method under different background noise levels (40 dBA and 55 dBA) and different reverberation times. The results show that SII correlates well with the CSI scores of the elderly. To obtain the same CSI score as young adults, the elderly require a larger SII value, and the required value increases with age. Because the hearing loss of the elderly is accounted for in the calculation of SII, the difference in required SII between elderly and young listeners is smaller than the corresponding difference in required STI at the same CSI score. This indicates that SII is a more consistent evaluation criterion across ages.
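The SII discussed above is, at its core, an importance-weighted sum of band audibilities. A minimal sketch of that idea (the four equal-importance bands and the linear (SNR + 15)/30 audibility mapping are illustrative simplifications; the standardized ANSI S3.5 procedure uses tabulated band-importance functions and also models hearing thresholds, which is how elderly listeners' hearing loss enters the calculation):

```python
# Toy band-audibility model loosely following the ANSI S3.5 idea:
# SII = sum_i importance_i * audibility_i. The band importances and
# the (snr + 15) / 30 audibility mapping are illustrative only.

def band_audibility(snr_db):
    """Map a band SNR (dB) to an audibility value in [0, 1]."""
    return min(1.0, max(0.0, (snr_db + 15.0) / 30.0))

def toy_sii(band_snrs_db, band_importances):
    """Importance-weighted average of band audibilities."""
    assert abs(sum(band_importances) - 1.0) < 1e-9
    return sum(w * band_audibility(snr)
               for w, snr in zip(band_importances, band_snrs_db))

# Four invented octave bands with equal importance.
importances = [0.25, 0.25, 0.25, 0.25]
quiet = toy_sii([15, 15, 15, 15], importances)  # fully audible speech
noisy = toy_sii([-15, -5, 0, 5], importances)   # partly masked speech
```

Under this toy mapping, fully audible speech yields an SII of 1.0, and partial masking reduces the index band by band.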
Most cues to speech intelligibility are within a narrow frequency range, with its upper limit not exceeding 4 kHz. It is still unclear whether speaker-related (indexical) information is available past this limit or how speaker characteristics are distributed at frequencies within and outside the intelligibility range. Using low-pass and high-pass filtering, we examined the perceptual salience of dialect and gender cues in both intelligible and unintelligible speech. Setting the upper frequency limit at 11 kHz, spontaneously produced unique utterances (n = 400) from 40 speakers were high-pass filtered with frequency cutoffs from 0.7 to 5.56 kHz and presented to listeners for dialect and gender identification and intelligibility evaluation. The same material and experimental procedures were used to probe perception of low-pass filtered and unmodified speech with cutoffs from 0.5 to 1.1 kHz. Applying statistical signal detection theory analyses, we found that cues to gender were well preserved at low and high frequencies and did not depend on intelligibility, and the redundancy of gender cues at higher frequencies reduced response bias. Cues to dialect were relatively strong at low and high frequencies; however, most were in intelligible speech, modulated by a differential intelligibility advantage of male and female speakers at low and high frequencies.
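The high-pass manipulation described above can be imitated with a toy brick-wall FFT filter (the actual study would have used properly designed filters with controlled roll-off; the 22.05 kHz sampling rate, giving a Nyquist limit near the study's 11 kHz cap, and the 0.7 kHz cutoff follow the abstract, while the test tones are invented):

```python
import numpy as np

def fft_highpass(x, fs, cutoff_hz):
    """Brick-wall high-pass by zeroing FFT bins below the cutoff
    (a toy; real studies use filters with controlled roll-off)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[freqs < cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

fs = 22050  # Nyquist ~11 kHz, matching the study's upper frequency limit
t = np.arange(fs) / fs
# Invented test signal: one tone below and one above the 0.7 kHz cutoff.
x = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 3000 * t)
y = fft_highpass(x, fs, 700.0)
```

After filtering, the 300 Hz component is removed while the 3 kHz component passes unchanged.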
Objective: A review of the development, evaluation, and application of the so-called 'matrix sentence test' for speech intelligibility testing in a multilingual society is provided. The format allows for repeated use with the same patient in her or his native language even if the experimenter does not understand the language. Design: Using a closed-set format, the syntactically fixed, semantically unpredictable sentences (e.g. 'Peter bought eight white ships') provide a vocabulary of 50 words (10 alternatives for each position in the sentence). The principles (i.e. construction, optimization, evaluation, and validation) for 14 different languages are reviewed. Studies of the influence of talker, language, noise, the training effect, open vs. closed conduct of the test, and the subjects' language proficiency are reported and application examples are discussed. Results: The optimization principles result in a steep intelligibility function and a high homogeneity of the speech materials presented and test lists employed, yielding a high efficiency and excellent comparability across languages. The characteristics of speakers generally dominate the differences across languages. Conclusion: The matrix test format with the principles outlined here is recommended for producing efficient, reliable, and comparable speech reception thresholds across different languages.
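The closed-set matrix format described above draws one word from each of five ten-alternative slots, giving the 50-word vocabulary. A sketch of the idea (only 'Peter bought eight white ships' comes from the abstract; the remaining words are invented stand-ins, not the optimized vocabulary of any real matrix test):

```python
import random

# Illustrative five-slot matrix in the style of "Peter bought eight
# white ships" (from the abstract). Apart from that sentence's words,
# these lists are invented stand-ins, not an optimized test vocabulary.
MATRIX = [
    ["Peter", "Anna", "Thomas", "Nina", "David",
     "Laura", "Kevin", "Sarah", "Oliver", "Emma"],      # name
    ["bought", "sees", "got", "wins", "keeps",
     "gives", "sold", "has", "likes", "takes"],         # verb
    ["two", "three", "four", "five", "six",
     "seven", "eight", "nine", "ten", "twelve"],        # numeral
    ["white", "big", "old", "red", "small",
     "green", "dark", "cheap", "new", "wet"],           # adjective
    ["ships", "cups", "rings", "books", "chairs",
     "shoes", "spoons", "bikes", "hats", "stones"],     # object
]

def matrix_sentence(rng):
    """Draw one syntactically fixed, semantically unpredictable sentence."""
    return " ".join(rng.choice(slot) for slot in MATRIX)

rng = random.Random(0)
sentence = matrix_sentence(rng)
```

Because every slot has 10 alternatives, the format supports 10^5 distinct sentences, which is what makes repeated testing with the same patient feasible.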
Purpose: To further understand the effect of Parkinson's disease (PD) on articulatory movements in speech and to expand our knowledge of therapeutic treatment strategies, this study examined movements of the jaw, tongue blade, and tongue dorsum during sentence production with respect to speech intelligibility and compared the effect of varying speaking styles on these articulatory movements. Method: Twenty-one speakers with PD and 20 healthy controls produced 3 sentences under normal, loud, clear, and slow speaking conditions. Speech intelligibility was rated for each speaker. A 3-dimensional electromagnetic articulograph tracked movements of the articulators. Measures included articulatory working spaces, ranges along the first principal component, average speeds, and sentence durations. Results: Speakers with PD demonstrated significantly smaller jaw movements as well as shorter than normal sentence durations. Between-speaker variation in movement size of the jaw, tongue blade, and tongue dorsum was associated with speech intelligibility. Analysis of speaking conditions revealed similar patterns of change in movement measures across groups and articulators: larger than normal movement sizes and faster speeds for loud speech, increased movement sizes for clear speech, and larger than normal movement sizes and slower speeds for slow speech. Conclusions: Sentence-level measures of articulatory movements are sensitive to both disease-related changes in PD and speaking-style manipulations.
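An articulatory "working space" of the kind measured above is commonly quantified as the area or volume of the convex hull of articulator positions. A minimal 2D sketch using Andrew's monotone chain and the shoelace formula (the study used 3D articulograph data; the sample points here are invented):

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_area(points):
    """Shoelace area of the 2D convex hull (a working-space proxy)."""
    h = convex_hull(points)
    n = len(h)
    s = sum(h[i][0] * h[(i + 1) % n][1] - h[(i + 1) % n][0] * h[i][1]
            for i in range(n))
    return 0.5 * abs(s)

# Invented articulator positions: a unit square plus an interior point.
pts = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
area = hull_area(pts)
```

Smaller jaw movements, as reported for the PD group, would shrink this hull and hence the computed working space.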
Speech intelligibility (SI) prediction models are a valuable tool for the development of speech processing algorithms for hearing aids or consumer electronics. For use in realistic environments, it is desirable that an SI model be non-intrusive (requiring no separate input of original and degraded speech, no transcripts, and no a priori knowledge about the signals) and process the audio signals binaurally. Most existing SI models do not fulfill all of these criteria. In this study, we propose an SI model based on phone probabilities obtained from a deep neural network. The model comprises a binaural enhancement stage for prediction of the speech recognition threshold (SRT) in realistic acoustic scenes. In the first part of the study, SRT predictions in different spatial configurations are compared to results from normal-hearing listeners. On average, our approach produces lower errors and higher correlations than three intrusive baseline models. In the second part, we explore whether measures relevant to spatial hearing, i.e., the intelligibility level difference (ILD) and the binaural ILD (BILD), can be predicted with our modeling approach. We also investigate whether a language mismatch between training and testing the model plays a role when predicting ILD and BILD. This point is especially important for low-resource languages, where thousands of hours of language material are not available for training. Binaural benefits are predicted by our model with an error of 1.5 dB. This is slightly higher than the error of a competitive baseline, MBSTOI (1.1 dB), which, however, requires separate input of original and degraded speech. We also find that good binaural predictions can be obtained with models that are not specifically trained on the target language.
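The SRT predicted above is, by definition, the SNR at which the psychometric (intelligibility-versus-SNR) function crosses 50% correct. A minimal sketch of reading an SRT off such a curve (the logistic form and all parameter values are illustrative, not taken from the model or the paper):

```python
import math

def psychometric(snr_db, srt_db, slope):
    """Logistic intelligibility function; crosses 0.5 at srt_db.
    Form and parameter values are illustrative only."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - srt_db)))

def find_srt(pfunc, lo=-30.0, hi=30.0, target=0.5, tol=1e-6):
    """Bisection for the SNR where a monotone intelligibility
    curve crosses `target` -- by definition, the SRT."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pfunc(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Recover the 50% point of an assumed curve.
srt = find_srt(lambda s: psychometric(s, srt_db=-7.1, slope=0.6))
```

A full SI model replaces the assumed curve with intelligibility estimates derived from the signal (here, from DNN phone probabilities) evaluated across SNRs.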
•Predicts binaural benefits for speech intelligibility of normal-hearing listeners.
•Non-intrusive speech intelligibility predictions.
•Accurate speech intelligibility predictions for languages excluded from training.
The coronavirus pandemic has resulted in the recommended or required use of face masks in public. Wearing a face mask compromises communication, especially in the presence of competing noise. It is crucial to measure the potential effects of face masks on speech intelligibility in noisy environments, where excessive background noise can create communication challenges. This study evaluated the effects of wearing transparent face masks and of using clear speech to facilitate better verbal communication. We evaluated listener word identification scores under four conditions: (1) mask type (no mask, transparent mask, and disposable face mask), (2) presentation mode (auditory only and audiovisual), (3) speaking style (conversational speech and clear speech), and (4) background noise type (speech-shaped noise and four-talker babble at a -5 dB signal-to-noise ratio). Results indicate that in the presence of noise, listeners performed less well when the speaker wore a disposable face mask or a transparent mask than when the speaker wore no mask. Listeners correctly identified more words in the audiovisual presentation when listening to clear speech. The results indicate that the combination of face masks and background noise negatively impacts speech intelligibility for listeners. Transparent masks facilitate understanding of target sentences by providing visual information. Clear speech was shown to alleviate challenging communication situations, compensating for a lack of visual cues and reduced acoustic signals.
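Mixing a target with masking noise at a fixed SNR, as in the -5 dB condition above, amounts to scaling the noise so the speech-to-noise power ratio hits the target. A sketch (the signals are arbitrary stand-ins for the actual test materials):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    snr_db, then add it to the speech. Toy version: real test
    materials are also calibrated for presentation level."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_p_noise / p_noise)
    return speech + scaled_noise, scaled_noise

# Arbitrary stand-ins for a target sentence and a masker.
rng = np.random.default_rng(0)
fs = 16000
speech = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
noise = rng.standard_normal(fs)
mixture, scaled_noise = mix_at_snr(speech, noise, -5.0)
```

At -5 dB SNR the noise carries more power than the speech, which is why word identification drops sharply in such conditions.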
Neural speech tracking has advanced our understanding of how our brains rapidly map an acoustic speech signal onto linguistic representations and ultimately meaning. It remains unclear, however, how speech intelligibility is related to the corresponding neural responses. Many studies addressing this question vary the level of intelligibility by manipulating the acoustic waveform, but this makes it difficult to cleanly disentangle the effects of intelligibility from underlying acoustical confounds. Here, using magnetoencephalography recordings, we study neural measures of speech intelligibility by manipulating intelligibility while keeping the acoustics strictly unchanged. Acoustically identical degraded speech stimuli (three-band noise-vocoded, ~20 s duration) are presented twice, but the second presentation is preceded by the original (nondegraded) version of the speech. This intermediate priming, which generates a "pop-out" percept, substantially improves the intelligibility of the second degraded speech passage. We investigate how intelligibility and acoustical structure affect acoustic and linguistic neural representations using multivariate temporal response functions (mTRFs). As expected, behavioral results confirm that perceived speech clarity is improved by priming. mTRF analysis reveals that auditory (speech envelope and envelope onset) neural representations are not affected by priming but only by the acoustics of the stimuli (bottom-up driven). Critically, our findings suggest that segmentation of sounds into words emerges with better speech intelligibility, and most strongly at the later (~400 ms latency) word processing stage, in prefrontal cortex, in line with engagement of top-down mechanisms associated with priming. Taken together, our results show that word representations may provide some objective measures of speech comprehension.
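A TRF of the kind used above is, in essence, a regularized linear mapping from time-lagged stimulus features to the neural response. A minimal single-feature, single-channel version with ridge regression (the study's mTRFs are multivariate, covering several speech features and many MEG channels; the synthetic delayed-copy "response" here is only a sanity check):

```python
import numpy as np

def lagged_design(stimulus, n_lags):
    """Design matrix whose columns are time-lagged copies of the stimulus."""
    n = len(stimulus)
    X = np.zeros((n, n_lags))
    for lag in range(n_lags):
        X[lag:, lag] = stimulus[:n - lag]
    return X

def fit_trf(stimulus, response, n_lags, ridge=1e-3):
    """Ridge-regularized TRF weights: w = (X'X + aI)^-1 X'y."""
    X = lagged_design(stimulus, n_lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(n_lags),
                           X.T @ response)

# Sanity check: a "response" that is the stimulus delayed by 3 samples
# should yield a TRF peaking at lag 3 with weight ~1.
rng = np.random.default_rng(1)
stim = rng.standard_normal(2000)
resp = np.roll(stim, 3)
resp[:3] = 0.0
w = fit_trf(stim, resp, n_lags=8)
```

With real data, the estimated weights trace out response components at interpretable latencies, which is how effects such as the ~400 ms word-processing stage are localized in time.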
Wearing face masks (alongside physical distancing) provides some protection against infection from COVID-19. Face masks can also change how people communicate and subsequently affect speech signal quality. This study investigated how three common face mask types (N95, surgical, and cloth) affected acoustic analysis of speech and perceived intelligibility in healthy subjects. Acoustic measures of timing, frequency, perturbation, and power spectral density were measured. Speech intelligibility and word and sentence accuracy were also examined using the Assessment of Intelligibility of Dysarthric Speech. Mask type impacted the power distribution in frequencies above 3 kHz for the N95 mask, and above 5 kHz for the surgical and cloth masks. Measures of timing and spectral tilt mainly differed with N95 mask use. Cepstral and harmonics-to-noise ratios remained unchanged across mask types. No differences were observed across conditions for word or sentence intelligibility measures; however, the accuracy of word and sentence translations was affected by all masks. Data presented in this study show that face masks change the speech signal, but some specific acoustic features remain largely unaffected (e.g., measures of voice quality) irrespective of mask type. Outcomes have bearing on how future speech studies are run when personal protective equipment is worn.
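Band-limited power effects like the above-3 kHz and above-5 kHz attenuation reported here can be quantified as the fraction of spectral power falling in a frequency band. A toy periodogram-based sketch (the study's PSD analysis will differ in windowing and averaging details; the test signal is invented):

```python
import numpy as np

def band_power_fraction(x, fs, f_lo, f_hi):
    """Fraction of total power in [f_lo, f_hi), from a simple
    periodogram (a toy; practical PSD estimates use windowing
    and averaging, e.g. Welch's method)."""
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = spectrum[(freqs >= f_lo) & (freqs < f_hi)].sum()
    return band / spectrum.sum()

fs = 16000
t = np.arange(fs) / fs
# Tones at 1 kHz and 4 kHz with equal amplitude: power above 3 kHz
# should be about half the total.
x = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 4000 * t)
frac_above_3k = band_power_fraction(x, fs, 3000.0, fs / 2.0)
```

Comparing this fraction between masked and unmasked recordings of the same utterance would expose the high-frequency attenuation a mask introduces.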