This paper examines two questions: what levels of speech can be perceived visually, and how is visual speech represented by the brain? Review of the literature leads to the conclusions that every level of psycholinguistic speech structure (i.e., phonetic features, phonemes, syllables, words, and prosody) can be perceived visually, although individuals differ in their abilities to do so; and that there are visual modality-specific representations of speech qua speech in higher-level vision brain areas. That is, the visual system represents the modal patterns of visual speech. The suggestion that the auditory speech pathway receives and represents visual speech is examined in light of neuroimaging evidence on the auditory speech pathways. We outline the generally agreed-upon organization of the visual ventral and dorsal pathways and examine several types of visual processing that might be related to speech through those pathways, specifically, face and body, orthography, and sign language processing. In this context, we examine the visual speech processing literature, which reveals widespread diverse patterns of activity in posterior temporal cortices in response to visual speech stimuli. We outline a model of the visual and auditory speech pathways and make several suggestions: (1) The visual perception of speech relies on visual pathway representations of speech qua speech. (2) A proposed site of these representations, the temporal visual speech area (TVSA), has been demonstrated in posterior temporal cortex, ventral and posterior to multisensory posterior superior temporal sulcus (pSTS). (3) Given that visual speech has dynamic and configural features, its representations in feedforward visual pathways are expected to integrate these features, possibly in TVSA.
Traditionally, speech perception training paradigms have not adequately taken into account the possibility that there may be modality-specific requirements for perceptual learning with auditory-only (AO) versus visual-only (VO) speech stimuli. The study reported here investigated the hypothesis that there are modality-specific differences in how prior information is used by normal-hearing participants during vocoded versus VO speech training. Two different experiments, one with vocoded AO speech (Experiment 1) and one with VO, lipread, speech (Experiment 2), investigated the effects of giving different types of information to trainees on each trial during training. The training was for four ~20 min sessions, during which participants learned to label novel visual images using novel spoken words. Participants were assigned to different types of prior information during training: Word Group trainees saw a printed version of each training word (e.g., "tethon"), and Consonant Group trainees saw only its consonants (e.g., "t_th_n"). Additional groups received no prior information (i.e., Experiment 1, AO Group; Experiment 2, VO Group) or a spoken version of the stimulus in a different modality from the training stimuli (Experiment 1, Lipread Group; Experiment 2, Vocoder Group). That is, in each experiment, there was a group that received prior information in the modality of the training stimuli from the other experiment. In both experiments, the Word Groups had difficulty retaining the novel words they attempted to learn during training. However, when the training stimuli were vocoded, the Word Group improved their phoneme identification. When the training stimuli were visual speech, the Consonant Group improved their phoneme identification and their open-set sentence lipreading. The results are considered in light of theoretical accounts of perceptual learning in relationship to perceptual modality.
It has been postulated that the brain is organized by "metamodal," sensory-independent cortical modules capable of performing tasks (e.g., word recognition) in both "standard" and novel sensory modalities. Still, this theory has primarily been tested in sensory-deprived individuals, with mixed evidence in neurotypical subjects, thereby limiting its support as a general principle of brain organization. Critically, current theories of metamodal processing do not specify requirements for successful metamodal processing at the level of neural representations. Specification at this level may be particularly important in neurotypical individuals, where novel sensory modalities must interface with existing representations for the standard sense. Here we hypothesized that effective metamodal engagement of a cortical area requires congruence between stimulus representations in the standard and novel sensory modalities in that region. To test this, we first used fMRI to identify bilateral auditory speech representations. We then trained 20 human participants (12 female) to recognize vibrotactile versions of auditory words using one of two auditory-to-vibrotactile algorithms. The vocoded algorithm attempted to match the encoding scheme of auditory speech while the token-based algorithm did not. Crucially, using fMRI, we found that only in the vocoded group did trained-vibrotactile stimuli recruit speech representations in the superior temporal gyrus and lead to increased coupling between them and somatosensory areas. Our results advance our understanding of brain organization by providing new insight into unlocking the metamodal potential of the brain, thereby benefitting the design of novel sensory substitution devices that aim to tap into existing processing streams in the brain.
It has been proposed that the brain is organized by "metamodal," sensory-independent modules specialized for performing certain tasks. This idea has inspired therapeutic applications, such as sensory substitution devices, for example, enabling blind individuals "to see" by transforming visual input into soundscapes. Yet, other studies have failed to demonstrate metamodal engagement. Here, we tested the hypothesis that metamodal engagement in neurotypical individuals requires matching the encoding schemes between stimuli from the novel and standard sensory modalities. We trained two groups of subjects to recognize words generated by one of two auditory-to-vibrotactile transformations. Critically, only vibrotactile stimuli that were matched to the neural encoding of auditory speech engaged auditory speech areas after training. This suggests that matching encoding schemes is critical to unlocking the brain's metamodal potential.
Group analysis of structure or function in cerebral cortex typically involves, as a first step, the alignment of cortices. A surface-based approach to this problem treats the cortex as a convoluted surface and coregisters across subjects so that cortical landmarks or features are aligned. This registration can be performed using curves representing sulcal fundi and gyral crowns to constrain the mapping. Alternatively, registration can be based on the alignment of curvature metrics computed over the entire cortical surface. The former approach typically involves some degree of user interaction in defining the sulcal and gyral landmarks while the latter methods can be completely automated. Here we introduce a cortical delineation protocol consisting of 26 consistent landmarks spanning the entire cortical surface. We then compare the performance of a landmark-based registration method that uses this protocol with that of two automatic methods implemented in the software packages FreeSurfer and BrainVoyager. We compare performance in terms of discrepancy maps between the different methods, the accuracy with which regions of interest are aligned, and the ability of the automated methods to correctly align standard cortical landmarks. Our results show similar performance for ROIs in the perisylvian region for the landmark-based method and FreeSurfer. However, the discrepancy maps showed larger variability between methods in occipital and frontal cortex and automated methods often produce misalignment of standard cortical landmarks. Consequently, selection of the registration approach should consider the importance of accurate sulcal alignment for the specific task for which coregistration is being performed. When automatic methods are used, the users should ensure that sulci in regions of interest in their studies are adequately aligned before proceeding with subsequent analysis.
When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses: auditory correct, visual correct, fusion (the so-called "McGurk effect"), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for the distribution of the four types of perceptual responses to 384 different stimuli from four talkers. The measures included mutual information, correlations, and acoustic measures, all representing audiovisual stimulus relationships. In Experiment 1, open-set perceptual responses were obtained for acoustic /bɑ/ or /lɑ/ dubbed to video /bɑ, dɑ, gɑ, vɑ, zɑ, lɑ, wɑ, ðɑ/. The talker, the video syllable, and the acoustic syllable significantly influenced the type of response. In Experiment 2, the best predictors of response category proportions were a subset of the physical stimulus measures, with the variance accounted for in the perceptual response category proportions between 17% and 52%. That audiovisual stimulus relationships can account for perceptual response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships. (Contains 1 footnote, 6 tables, and 6 figures.)
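Mutual information is named above as one of the stimulus measures, but the abstract does not say how it was computed. As a point of reference only, the following is a minimal sketch of the textbook definition for two discrete variables, computed from a table of joint counts. The function name `mutual_information` and the example tables are hypothetical and are not taken from the study.

```python
import math

def mutual_information(joint_counts):
    """Mutual information (in bits) of two discrete variables.

    joint_counts[i][j] is how often category i of the first variable
    co-occurred with category j of the second. This is the textbook
    definition, not the specific measure used in the study.
    """
    total = sum(sum(row) for row in joint_counts)
    row_totals = [sum(row) for row in joint_counts]
    col_totals = [sum(col) for col in zip(*joint_counts)]
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing to the sum
                p_joint = count / total
                # p(x, y) / (p(x) * p(y)) simplifies to count * total / (row * col)
                ratio = count * total / (row_totals[i] * col_totals[j])
                mi += p_joint * math.log2(ratio)
    return mi

# Hypothetical 2x2 tables: perfectly dependent vs. independent variables.
dependent = mutual_information([[5, 0], [0, 5]])
independent = mutual_information([[5, 5], [5, 5]])
```

With two equally likely categories that always co-occur, the measure is 1 bit; when the categories are independent, it is 0.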
The goal of this review article is to reinvigorate interest in lipreading and lipreading training for adults with acquired hearing loss. Most adults benefit from being able to see the talker when speech is degraded; however, the effect size is related to their lipreading ability, which is typically poor in adults who have experienced normal hearing through most of their lives. Lipreading training has been viewed as a possible avenue for rehabilitation of adults with an acquired hearing loss, but most training approaches have not been particularly successful. Here, we describe lipreading and theoretically motivated approaches to its training, as well as examples of successful training paradigms. We discuss some extensions to auditory-only (AO) and audiovisual (AV) speech recognition.
Visual speech perception and word recognition are described. Traditional and contemporary views of training and perceptual learning are outlined. We focus on the roles of external and internal feedback and the training task in perceptual learning, and we describe results of lipreading training experiments.
Lipreading is commonly characterized as limited to viseme perception. However, evidence demonstrates subvisemic perception of visual phonetic information. Lipreading words also relies on lexical constraints, not unlike auditory spoken word recognition. Lipreading has been shown to be difficult to improve through training, but under specific feedback and task conditions, training can be successful, and learning can generalize to untrained materials, including AV sentence stimuli in noise. The results on lipreading have implications for AO and AV training and for use of acoustically processed speech in face-to-face communication.
Given its importance for speech recognition with a hearing loss, we suggest that the research and clinical communities integrate lipreading in their efforts to improve speech recognition in adults with acquired hearing loss.
The ability to recognize words in connected speech under noisy listening conditions is critical to everyday communication. Many processing levels contribute to the individual listener's ability to recognize words correctly against background speech, and there is clinical need for measures of individual differences at different levels. Typical listening tests of speech recognition in noise require a list of items to obtain a single threshold score. Measures of diverse abilities could be obtained by mining the various open-set recognition errors made during multi-item tests. This study sought to demonstrate that an error mining approach using open-set responses from a clinical sentence-in-babble-noise test can be used to characterize abilities beyond signal-to-noise ratio (SNR) threshold. A stimulus-response phoneme-to-phoneme sequence alignment software system was used to achieve automatic, accurate quantitative error scores. The method was applied to a database of responses from normal-hearing (NH) adults. Relationships between two types of response errors and words correct scores were evaluated through use of mixed models regression.
Two hundred thirty-three NH adults completed three lists of the Quick Speech in Noise test. Their individual open-set speech recognition responses were automatically phonemically transcribed and submitted to a phoneme-to-phoneme stimulus-response sequence alignment system. The computed alignments were mined for a measure of acoustic phonetic perception, a measure of response text that could not be attributed to the stimulus, and a count of words correct. The mined data were statistically analyzed to determine whether the response errors were significant factors beyond stimulus SNR in accounting for the number of words correct per response from each participant. This study addressed two hypotheses: (1) Individuals whose perceptual errors are less severe recognize more words correctly under difficult listening conditions due to babble masking and (2) Listeners who are better able to exclude incorrect speech information such as from background babble and filling in recognize more stimulus words correctly.
Statistical analyses showed that acoustic phonetic accuracy and exclusion of babble background were significant factors, beyond the stimulus sentence SNR, in accounting for the number of words a participant recognized. There was also evidence that poorer acoustic phonetic accuracy could occur along with higher words correct scores. This paradoxical result came from a subset of listeners who had also performed subjective accuracy judgments. Their results suggested that they recognized more words while also misallocating acoustic cues from the background into the stimulus, without realizing their errors. Because the Quick Speech in Noise test stimuli are locked to their own babble sample, misallocations of whole words from babble into the responses could be investigated in detail. The high rate of common misallocation errors for some sentences supported the view that the functional stimulus was the combination of the target sentence and its babble.
Individual differences among NH listeners arise both in terms of words accurately identified and errors committed during open-set recognition of sentences in babble maskers. Error mining to characterize individual listeners can be done automatically at the levels of acoustic phonetic perception and the misallocation of background babble words into open-set responses. Error mining can increase test information and the efficiency and accuracy of characterizing individual listeners.
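The phoneme-to-phoneme stimulus-response alignment described above can be illustrated with a generic dynamic-programming alignment. This is only a sketch under stated assumptions: it uses uniform substitution and gap costs, whereas the abstract does not describe the cost scheme of the actual software system, and the name `align_phonemes` and the example transcriptions are hypothetical.

```python
def align_phonemes(stimulus, response, sub_cost=1, gap_cost=1):
    """Globally align two phoneme sequences (Needleman-Wunsch style).

    Returns (total_cost, pairs). Each pair is (stimulus_phoneme,
    response_phoneme); None marks a deletion or an insertion.
    Uniform costs are an assumption made for illustration only.
    """
    n, m = len(stimulus), len(response)
    # cost[i][j] = minimum cost of aligning stimulus[:i] with response[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_cost
    for j in range(1, m + 1):
        cost[0][j] = j * gap_cost

    def step(i, j):
        # Cost of pairing stimulus[i-1] with response[j-1]
        return 0 if stimulus[i - 1] == response[j - 1] else sub_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + step(i, j),  # match or substitution
                cost[i - 1][j] + gap_cost,        # stimulus phoneme deleted
                cost[i][j - 1] + gap_cost,        # response phoneme inserted
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + step(i, j):
            pairs.append((stimulus[i - 1], response[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap_cost:
            pairs.append((stimulus[i - 1], None))
            i -= 1
        else:
            pairs.append((None, response[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs

# Hypothetical transcriptions: one substitution, then one deletion.
sub_total, sub_pairs = align_phonemes(["k", "ae", "t"], ["b", "ae", "t"])
del_total, del_pairs = align_phonemes(["k", "ae", "t"], ["ae", "t"])
```

In an alignment of this kind, substituted pairs could feed a phonetic accuracy measure, and response phonemes paired with None correspond to response material not attributable to the stimulus. A perceptually weighted cost table would replace the uniform `sub_cost` if phonetic similarity is to be taken into account.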
This study investigated the effects of external feedback on perceptual learning of visual speech during lipreading training with sentence stimuli. The goal was to improve visual-only (VO) speech recognition and increase accuracy of audiovisual (AV) speech recognition in noise. The rationale was that spoken word recognition depends on the accuracy of sublexical (phonemic/phonetic) speech perception; effective feedback during training must support sublexical perceptual learning.
Normal-hearing (NH) adults were assigned to one of three types of feedback: Sentence feedback was the entire sentence printed after responding to the stimulus. Word feedback was the correct response words and perceptually near but incorrect response words. Consonant feedback was correct response words and consonants in incorrect but perceptually near response words. Six training sessions were given. Pre- and posttraining testing included an untrained control group. Test stimuli were disyllable nonsense words for forced-choice consonant identification, and isolated words and sentences for open-set identification. Words and sentences were VO, AV, and audio-only (AO) with the audio in speech-shaped noise.
Lipreading accuracy increased during training. Pre- and posttraining tests of consonant identification showed no improvement beyond test-retest increases obtained by untrained controls. Isolated word recognition with a talker not seen during training showed that the control group improved more than the sentence group. Tests of untrained sentences showed that the consonant group significantly improved in all of the stimulus conditions (VO, AO, and AV). Its mean words correct scores increased by 9.2 percentage points for VO, 3.4 percentage points for AO, and 9.8 percentage points for AV stimuli.
Consonant feedback during training with sentence stimuli significantly increased perceptual learning. The training generalized to untrained VO, AO, and AV sentence stimuli. Lipreading training has potential to significantly improve adults' face-to-face communication in noisy settings in which the talker can be seen.
Acoustic speech is easier to detect in noise when the talker can be seen. This finding could be explained by integration of multisensory inputs or refinement of auditory processing from visual guidance. In two experiments, we studied two‐interval forced‐choice detection of an auditory ‘ba’ in acoustic noise, paired with various visual and tactile stimuli that were identically presented in the two observation intervals. Detection thresholds were reduced under the multisensory conditions vs. the auditory‐only condition, even though the visual and/or tactile stimuli alone could not inform the correct response. Results were analysed relative to an ideal observer for which intrinsic (internal) noise and efficiency were independent contributors to detection sensitivity. Across experiments, intrinsic noise was unaffected by the multisensory stimuli, arguing against the merging (integrating) of multisensory inputs into a unitary speech signal, but sampling efficiency was increased to varying degrees, supporting refinement of knowledge about the auditory stimulus. The steepness of the psychometric functions decreased with increasing sampling efficiency, suggesting that the ‘task‐irrelevant’ visual and tactile stimuli reduced uncertainty about the acoustic signal. Visible speech was not superior for enhancing auditory speech detection. Our results reject multisensory neuronal integration and speech‐specific neural processing as explanations for the enhanced auditory speech detection under noisy conditions. Instead, they support a more rudimentary form of multisensory interaction: the otherwise task‐irrelevant sensory systems inform the auditory system about when to listen.
When acoustic speech is buried in noise, a task‐irrelevant visual and/or vibrotactile stimulus can enhance its detectability. Within an ideal observer model, enhancement is attributable to reduced noise intrinsic to the perceptual system and/or improved statistical sampling efficiency. Experiments here support only improved efficiency via uncertainty reduction and offer no evidence for change in internal noise. This pattern of results argues against enhancement due to multisensory integration.
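For readers unfamiliar with two-interval forced-choice (2IFC) detection, the standard signal detection relationship between sensitivity (d') and proportion correct can be sketched as follows. This assumes the usual equal-variance Gaussian model; it does not reproduce the ideal observer analysis of intrinsic noise and sampling efficiency used in the study, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

# Standard normal distribution used for the Gaussian model
_STD_NORMAL = NormalDist()

def two_ifc_proportion_correct(d_prime):
    """Expected 2IFC proportion correct for an unbiased observer.

    Equal-variance Gaussian model: Pc = Phi(d' / sqrt(2)).
    """
    return _STD_NORMAL.cdf(d_prime / math.sqrt(2))

def d_prime_from_two_ifc(proportion_correct):
    """Invert the relation above: d' = sqrt(2) * Phi^-1(Pc).

    proportion_correct must lie strictly between 0 and 1, otherwise
    NormalDist.inv_cdf raises StatisticsError.
    """
    return math.sqrt(2) * _STD_NORMAL.inv_cdf(proportion_correct)

# Chance performance corresponds to zero sensitivity.
chance = two_ifc_proportion_correct(0.0)
# Round trip: converting d' to Pc and back recovers the original value.
recovered = d_prime_from_two_ifc(two_ifc_proportion_correct(1.3))
```

Under this model, a lower detection threshold means that a given d' (and therefore a given proportion correct) is reached at a lower signal level.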
The cortical processing of auditory-alone, visual-alone, and audiovisual speech information is temporally and spatially distributed, and functional magnetic resonance imaging (fMRI) cannot adequately resolve its temporal dynamics. In order to investigate a hypothesized spatiotemporal organization for audiovisual speech processing circuits, event-related potentials (ERPs) were recorded using electroencephalography (EEG). Stimuli were congruent audiovisual /ba/, incongruent auditory /ba/ synchronized with visual /ga/, auditory-only /ba/, and visual-only /ba/ and /ga/. Current density reconstructions (CDRs) of the ERP data were computed across the latency interval of 50–250 ms. The CDRs demonstrated complex spatiotemporal activation patterns that differed across stimulus conditions. The hypothesized circuit that was investigated here comprised initial integration of audiovisual speech by the middle superior temporal sulcus (STS), followed by recruitment of the intraparietal sulcus (IPS), followed by activation of Broca’s area (Miller, L.M., d’Esposito, M., 2005. Perceptual fusion and stimulus coincidence in the cross-modal integration of speech. Journal of Neuroscience 25, 5884–5893). The importance of spatiotemporally sensitive measures in evaluating processing pathways was demonstrated. Results showed, strikingly, early (<100 ms) and simultaneous activations in areas of the supramarginal and angular gyrus (SMG/AG), the IPS, the inferior frontal gyrus, and the dorsolateral prefrontal cortex. Also, emergent left hemisphere SMG/AG activation, not predicted based on the unisensory stimulus conditions, was observed at approximately 160 to 220 ms. The STS was neither the earliest nor most prominent activation site, although it is frequently considered the sine qua non of audiovisual speech integration. As discussed here, the relatively late activity of the SMG/AG solely under audiovisual conditions is a possible candidate audiovisual speech integration response.