Vision perceptually restores auditory spectral dynamics in speech

John Plass*, David Brang, Satoru Suzuki, Marcia Grabowecky

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-reviewed

Abstract

Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time-frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.

Original language: English (US)
Pages (from-to): 16920-16927
Number of pages: 8
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 117
Issue number: 29
State: Published - Jul 21 2020

Keywords

  • Audiovisual speech
  • Multisensory
  • Spectrotemporal
  • Speech perception

ASJC Scopus subject areas

  • General
