This study compared the combined effect of noise and reverberation on listening effort and speech intelligibility to predictions of the speech transmission index (STI). Listening effort was measured in normal-hearing subjects using a scaling procedure. Speech intelligibility scores were measured in the same subjects and conditions: (a) speech-shaped noise as the only interfering factor, (b) and (c) fixed signal-to-noise ratios (SNRs) of 0 or 7 dB combined with reverberation, and (d) reverberation as the only detrimental factor. In each condition, SNR and reverberation were combined to produce STI values of 0.17, 0.30, 0.43, 0.57, and 0.70. Listening effort always decreased with increasing STI, thus enabling a rough prediction, but a significant bias was observed: for one type of impulse response, listening effort was lower in reverberation only than in noise only at the same STI. Correspondingly, speech intelligibility increased with increasing STI and was significantly better in reverberation only than in noise only at the same STI. Further analyses showed that the broadband reverberation time is not always a good estimate of speech degradation in reverberation and that different speech materials may differ in their robustness to the detrimental effects of reverberation.
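For context, STI values like those used above follow from the modulation transfer function (MTF) of the transmission path. The sketch below, assuming the classic single-band Houtgast–Steeneken MTF for exponential reverberant decay plus stationary noise, shows how a given SNR/reverberation-time combination maps to an STI-like value; the standardized STI additionally averages over seven octave bands with band weightings, which is omitted here:

```python
import numpy as np

def sti_simple(snr_db, t60):
    """Simplified single-band STI from the Houtgast-Steeneken modulation
    transfer function: exponential reverberant decay with reverberation
    time t60 (s) combined with stationary noise at snr_db (dB)."""
    F = np.array([0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                  3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5])  # modulation frequencies (Hz)
    m_rev = 1.0 / np.sqrt(1.0 + (2.0 * np.pi * F * t60 / 13.8) ** 2)
    m_noise = 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    m = m_rev * m_noise                               # combined modulation reduction
    snr_app = 10.0 * np.log10(m / (1.0 - m))          # apparent SNR per modulation frequency
    snr_app = np.clip(snr_app, -15.0, 15.0)           # limit to +/-15 dB
    return float(np.mean((snr_app + 15.0) / 30.0))    # transmission index in [0, 1]
```

With t60 = 0 this reduces to a pure-noise STI (e.g., 0 dB SNR gives 0.5), while increasing t60 at fixed SNR lowers the value, which is the trade-off exploited to match noise-only and reverberation-only conditions at the same STI.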
In many speech communication applications, such as public address systems, speech is degraded by additive noise, leading to reduced speech intelligibility. In this paper a pre-processing algorithm is proposed that is capable of increasing speech intelligibility under an equal-power constraint. The proposed AdaptDRC algorithm comprises two time- and frequency-dependent stages, an amplification stage and a dynamic range compression stage, both controlled by the Speech Intelligibility Index (SII). Experiments using two objective measures, namely the extended SII and the short-time objective intelligibility measure (STOI), and a formal listening test were conducted to compare the AdaptDRC algorithm with a modified version of a recently proposed algorithm in three noise conditions (stationary car noise, speech-shaped noise, and non-stationary cafeteria noise). While the objective measures indicate similar performance for both algorithms, the formal listening test shows that both algorithms lead to statistically significant improvements in speech intelligibility in the two stationary noises, whereas in the non-stationary cafeteria noise only the proposed AdaptDRC algorithm leads to statistically significant improvements. A comparison of the two objective measures with the listening-test results shows high correlations, although, in general, the performance of both algorithms is overestimated.
Speech perception in complex sound fields can greatly benefit from different unmasking cues to segregate the target from interfering voices. This study investigated the role of three unmasking cues (spatial separation, gender differences, and masker time reversal) on speech intelligibility and perceived listening effort in normal-hearing listeners. Speech intelligibility and categorically scaled listening effort were measured for a female target talker masked by two competing talkers with no unmasking cues or one to three unmasking cues. In addition to natural stimuli, all measurements were also conducted with glimpsed speech, created by removing the time–frequency tiles of the speech mixture in which the maskers dominated, to estimate the relative amounts of informational and energetic masking as well as the effort associated with source segregation. The results showed that all unmasking cues as well as glimpsing improved intelligibility and reduced listening effort and that providing more than one cue was beneficial in overcoming informational masking. The reduction in listening effort due to glimpsing corresponded to increases in signal-to-noise ratio of 8 to 18 dB, indicating that a significant amount of listening effort was devoted to segregating the target from the maskers. Furthermore, the benefit in listening effort for all unmasking cues extended well into the range of positive signal-to-noise ratios at which speech intelligibility was at ceiling, suggesting that listening effort is a useful tool for evaluating speech-on-speech masking conditions at typical conversational levels.
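The glimpsed-speech manipulation described above is essentially an ideal-binary-mask operation on a time–frequency representation of the mixture. A minimal sketch, assuming non-overlapping rectangular FFT frames for simplicity (a real implementation would use an overlapping, windowed STFT, and the frame length and 0-dB threshold here are illustrative choices, not the study's parameters):

```python
import numpy as np

def glimpse(target, masker, frame_len=256, threshold_db=0.0):
    """'Glimpsed speech': keep only the time-frequency tiles of the mixture
    in which the target dominates the masker (ideal-binary-mask style)."""
    n = (len(target) // frame_len) * frame_len       # trim to whole frames
    T = np.fft.rfft(target[:n].reshape(-1, frame_len), axis=1)
    M = np.fft.rfft(masker[:n].reshape(-1, frame_len), axis=1)
    X = T + M                                        # mixture spectrum (linear mixing)
    local_snr = 20.0 * np.log10((np.abs(T) + 1e-12) / (np.abs(M) + 1e-12))
    mask = local_snr >= threshold_db                 # tile kept iff target dominates
    return np.fft.irfft(X * mask, n=frame_len, axis=1).ravel()
```

Keeping only target-dominated tiles removes the masker energy (energetic masking) while leaving the target glimpses intact, so the residual performance gap relative to natural stimuli can be attributed to informational masking and segregation effort.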
Speech recognition in rooms requires the temporal integration of reflections which arrive with a certain delay after the direct sound. It is commonly assumed that there is a temporal window of about 50–100 ms during which reflections can be integrated with the direct sound, while later reflections are detrimental to speech intelligibility. This concept was challenged in a recent study employing binaural room impulse responses (RIRs) with systematically varied interaural phase differences (IPDs) and amplitudes of the direct sound and a variable number of reflections delayed by up to 200 ms. When amplitude or IPD favored late RIR components, normal-hearing (NH) listeners appeared to be capable of focusing on these components rather than on the preceding direct sound, which contrasts with the common concept of considering early RIR components as useful and late components as detrimental. The present study investigated speech intelligibility in the same conditions in hearing-impaired (HI) listeners. The data indicate that HI listeners were generally less able to “ignore” the direct sound than NH listeners when the most useful information was confined to late RIR components. Some HI listeners showed a remarkable inability to integrate across multiple reflections and to optimally “shift” their temporal integration window, which was quite dissimilar to NH listeners. This effect was most pronounced in conditions requiring spatial and temporal integration and could provide new challenges for individual prediction models of binaural speech intelligibility.
Real-world sounds like speech or traffic noise typically exhibit spectro-temporal variability because the energy in different spectral regions evolves differently as a sound unfolds in time. However, it is currently not well understood how the energy in different spectral and temporal portions contributes to loudness. This study investigated how listeners weight different temporal and spectral components of a sound when judging its overall loudness. Spectral weights were measured for the combination of three loudness-matched narrowband noises with different center frequencies. To measure temporal weights, 1,020-ms stimuli were presented, which randomly changed in level every 100 ms. Temporal weights were measured for each narrowband noise separately, and for a broadband noise containing the combination of the three noise bands. Finally, spectro-temporal weights were measured with stimuli in which the levels of the three narrowband noises randomly and independently changed every 100 ms. The data consistently showed that (i) the first 300 ms of the sounds had a greater influence on overall loudness perception than later temporal portions (primacy effect) and (ii) the lowest noise band contributed significantly more to overall loudness than the higher bands. The temporal weights did not differ between the three frequency bands. Notably, the spectral and temporal weights estimated from the conditions with only spectral or only temporal variability were very similar to the corresponding weights estimated in the spectro-temporal condition. The results indicate that the temporal and the spectral weighting of the loudness of a time-varying sound are independent processes: the spectral weights remain constant across time, and the temporal weights do not change across frequency. The results are discussed in the context of current loudness models.
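Temporal and spectral weights of the kind reported above are commonly estimated with a molecular-psychophysics regression: the random per-segment level perturbations are regressed onto the listener's trial-by-trial judgments. The sketch below simulates this for a two-interval loudness comparison; the segment count, perturbation spread, and the simulated listener's "true" primacy weights are illustrative assumptions, not values from the study:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_segments = 5000, 10                  # ten 100-ms segments per stimulus
true_w = np.array([3.0, 3.0, 3.0] + [1.0] * 7)   # simulated primacy effect

# Independent per-segment level perturbations (dB) for the two intervals.
a = rng.normal(0.0, 2.0, (n_trials, n_segments))
b = rng.normal(0.0, 2.0, (n_trials, n_segments))
# Simulated listener: chooses the interval with the larger weighted level sum.
choice = (a @ true_w > b @ true_w).astype(float)

# Weight estimate: least-squares regression of choices on level differences.
d = a - b
coef, *_ = np.linalg.lstsq(d, choice - 0.5, rcond=None)
w_hat = coef / coef.sum() * true_w.sum()         # rescale for comparison
```

The normalized `w_hat` recovers the elevated weights of the first three segments, which is how a primacy effect is read off an empirical weighting curve.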
For speech intelligibility in rooms, the temporal integration of speech reflections is typically modeled by separating the room impulse response (RIR) into an early part (assumed beneficial for speech intelligibility) and a late part (assumed detrimental). This concept was challenged in this study by employing binaural RIRs with systematically varied interaural phase differences (IPDs) and amplitudes of the direct sound and a variable number of reflections delayed by up to 200 ms. Speech recognition thresholds in stationary noise were measured in normal-hearing listeners for 86 conditions. The data showed that the direct sound and one or several early speech reflections could be perfectly integrated when they had the same IPD. Early reflections with the same IPD as the noise (but not as the direct sound) could not be perfectly integrated with the direct sound. All conditions in which the dominant speech information was within the early RIR components could be well predicted by a binaural speech intelligibility model using classic early/late separation. In contrast, when amplitude or IPD favored late RIR components, listeners appeared to be capable of focusing on these components rather than on the preceding direct sound. This could not be modeled by an early/late separation window but required a temporal integration window that can be flexibly shifted along the RIR.
Masking noise and reverberation strongly influence speech intelligibility and decrease listening comfort. To optimize acoustics for a comfortable environment, it is crucial to understand the respective contributions of bottom-up signal-driven cues and top-down linguistic-semantic cues to speech recognition in noise and reverberation. Since the relevance of these cues differs across speech test materials and the training status of the listeners, we investigate the influence of speech material type on speech recognition in noise, reverberation, and combinations of noise and reverberation. We also examine the influence of training on performance for a subset of measurement conditions. Speech recognition is measured with an open-set, everyday Plomp-type sentence test and compared to the recognition scores for a closed-set Matrix-type test consisting of syntactically fixed and semantically unpredictable sentences (cf. data by Rennies et al., J. Acoust. Soc. Am., 2014, 136, 2642–2653). While both tests yield approximately the same recognition threshold in noise in trained normal-hearing listeners, their performance may differ as a result of cognitive factors, i.e., the closed-set test is more sensitive to training effects while the open-set test is more affected by language familiarity. All experimental data were obtained at a fixed signal-to-noise ratio (SNR) and/or reverberation time set to obtain speech transmission index (STI) values of 0.17, 0.30, and 0.43, respectively, thus linking the data to STI predictions as a measure of purely low-level acoustic effects. The results confirm the difference in robustness to reverberation between Matrix-type and Plomp-type sentences reported in the literature, especially for poor and medium speech intelligibility. This robustness of the closed-set Matrix-type sentences against reverberation disappeared when listeners had no a priori knowledge of the speech material (sentence structure and words used), demonstrating the influence of higher-level lexical-semantic cues in speech recognition. In addition, the consistent difference between reverberation- and noise-induced recognition scores of everyday sentences for medium and high STI conditions, and the differences between Matrix-type and Plomp-type sentence scores, clearly demonstrate the limited utility of the STI in predicting speech recognition in noise and reverberation.
We present a mobile apparatus for audio-visual experiments (MASAVE) that is easy to build on a low budget and can run listening tests, pupillometry, and eye-tracking, e.g., for measuring listening effort and fatigue. The design goal was to keep the MASAVE affordable and to enable shipping the preassembled system to subjects for self-setup in home environments. Two experiments were conducted to validate the proposed system. In the first experiment, we tested the reliability of speech perception data gathered with the MASAVE in a less controlled, rather noisy environment. Speech recognition thresholds (SRTs) were measured in a lobby versus a sound-attenuated booth. The results show that the data from the two sites did not differ significantly and that SRT measurements were possible even for speech levels as low as 40–45 dB SPL. The second experiment validated the usability of the preassembled system and the use of pupillometry measurements in darkness, which can be achieved by applying a textile cover over the MASAVE and the subject to block out light. The results suggest that the tested participants had no usability issues with setting up the system, that the temperature under the cover increased by several degrees only when the measurement duration was rather long, and that pupillometry measurements can be made with the proposed setup. Overall, the validations indicate that the MASAVE can serve as an alternative when lab testing is not possible, as a means to gather more data, or as a way to reach subject groups that are otherwise difficult to reach.
This study presents a method of adding to clean speech signals a controlled degree of “musical” noise distortions that mimic typical artefacts of speech enhancement systems. The resulting distorted speech signals were evaluated with respect to listening effort and sound quality in subjective listening tests and via model predictions. Both subjective ratings and model prediction outcomes covered the entire rating scale from “excellent”/“no effort” to “bad”/“extreme effort”, respectively, in a consistent way. The proposed method proved to be useful for systematic assessments of “musical” noise distortions for the conditions tested in this study.
We reanalyzed a study that investigated binaural and temporal integration of speech reflections with different amplitudes, delays, and interaural phase differences. We used a blind binaural speech intelligibility model (bBSIM), applying an equalization-cancellation process for modeling binaural release from masking. bBSIM is blind, as it requires only the mixed binaural speech and noise signals and no auxiliary information about the listening conditions. bBSIM was combined with two non-blind back-ends, the speech intelligibility index (SII) and the speech transmission index (STI), resulting in hybrid models. Furthermore, bBSIM was combined with the non-intrusive short-time objective intelligibility measure (NI-STOI), resulting in a fully blind model. The fully non-blind reference model used in the previous study achieved the best prediction accuracy (R² = 0.91 and RMSE = 1 dB). The fully blind model yielded a coefficient of determination (R² = 0.87) similar to that of the reference model but also the highest root-mean-square error of the models tested in this study (RMSE = 4.4 dB). By adjusting the binaural processing errors of bBSIM as done in the reference model, the RMSE could be decreased to 1.9 dB. Furthermore, in this study the dynamic range of the SII had to be adjusted to predict the low SRTs of the speech material used.
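For reference, the two figures of merit quoted above compare predicted with observed speech recognition thresholds; a minimal sketch, with illustrative SRT values only (not data from the study):

```python
import numpy as np

def r2_rmse(observed, predicted):
    """Coefficient of determination and root-mean-square error (dB)
    between observed and predicted speech recognition thresholds."""
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    resid = observed - predicted
    ss_res = np.sum(resid ** 2)                        # residual sum of squares
    ss_tot = np.sum((observed - observed.mean()) ** 2) # total sum of squares
    return 1.0 - ss_res / ss_tot, float(np.sqrt(np.mean(resid ** 2)))

# Illustrative SRT values (dB SNR), not the study's data.
obs = [-8.0, -6.5, -4.0, -1.0, 2.0]
pred = [-7.5, -6.0, -4.5, -1.5, 2.5]
r2, rmse = r2_rmse(obs, pred)
```

A high R² indicates that the model tracks the relative ordering of conditions, while the RMSE quantifies the absolute prediction error in dB, which is why the two measures can diverge as they do for the fully blind model.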