Exploiting Visual Information in Automatic Speech Processing

Petar Aleksic, Gerasimos Potamianos, Aggelos K. Katsaggelos

Research output: Chapter in Book/Report/Conference proceeding › Chapter

9 Scopus citations


This chapter focuses on how the joint processing of visual and audio signals, both generated by a talking person, can provide valuable speech information that benefits a number of audiovisual speech processing applications crucial to human-computer interaction. The chapter first analyzes visual signals and then describes various ways of representing and extracting the speech information they contain. It shows that the resulting visual features can complement features extracted from the acoustic signal, and that the two modality representations can be fused to allow joint audiovisual speech processing. The general bimodal integration framework is subsequently applied to three problems: automatic speech recognition, talking face synthesis, and speaker identification and authentication. In all three cases, the chapter discusses issues specific to the particular application, reviews several relevant systems reported in the literature, and presents results from the implementations developed at IBM Research and Northwestern University.
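As an illustration of the feature-level fusion the abstract describes (not drawn from the chapter itself), the following is a minimal sketch in Python. It assumes audio features arrive at a higher frame rate than visual features (e.g., 100 Hz MFCC-like vectors vs. 25 Hz lip-shape parameters), synchronizes the two streams by nearest-neighbor upsampling of the visual track, and concatenates them into one joint observation per frame. All function names, feature dimensions, and rates here are hypothetical.

```python
import numpy as np

def upsample_nearest(feats: np.ndarray, target_len: int) -> np.ndarray:
    """Repeat visual frames so the visual track matches the audio frame rate."""
    idx = np.linspace(0, len(feats) - 1, target_len).round().astype(int)
    return feats[idx]

def fuse_early(audio: np.ndarray, visual: np.ndarray) -> np.ndarray:
    """Feature-level (early) fusion: concatenate synchronized audio and
    visual feature vectors into a single joint feature per frame."""
    visual_sync = upsample_nearest(visual, len(audio))
    return np.concatenate([audio, visual_sync], axis=1)

# Toy example: 100 audio frames of 39-dim features, 25 video frames of 12-dim features.
audio = np.random.randn(100, 39)
visual = np.random.randn(25, 12)
joint = fuse_early(audio, visual)
print(joint.shape)  # (100, 51): joint audiovisual observations
```

This concatenation strategy is only one of several fusion schemes surveyed in the audiovisual speech literature; decision-level fusion, which combines per-modality classifier scores instead of raw features, is a common alternative.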

Original language: English (US)
Title of host publication: Handbook of Image and Video Processing
Publisher: Elsevier Inc.
Number of pages: 27
ISBN (Print): 9780121197926
State: Published - 2005

ASJC Scopus subject areas

  • Computer Science (all)

