Audio-visual continuous speech recognition using MPEG-4 compliant visual features

Petar S. Aleksic*, Jay J. Williams, Zhilin Wu, Aggelos K. Katsaggelos

*Corresponding author for this work

Research output: Contribution to conference, Paper, peer-review

12 Scopus citations


In this paper we use Facial Animation Parameters (FAPs), which the MPEG-4 standard supports for the visual representation of speech, to significantly improve automatic speech recognition (ASR). We describe a robust and automatic algorithm for extracting FAPs from visual data that requires no hand labeling or extensive training procedures. Multi-stream Hidden Markov Models (HMMs) were used to integrate audio and visual information. ASR experiments were performed under both clean and noisy audio conditions using a relatively large vocabulary (approximately 1000 words). The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR WERs at various SNRs with additive white Gaussian noise, and by 19% relative to the audio-only ASR WER under clean audio conditions.
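
As a reading aid, the following minimal Python sketch illustrates two ideas the abstract mentions: combining audio and visual evidence in a multi-stream HMM state, and computing a relative WER reduction. It is not taken from the paper. The weighted log-likelihood form (stream weights that sum to one) is a common multi-stream formulation and is assumed here; the function names and all numeric values are hypothetical.

```python
def multistream_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Combine per-stream state log-likelihoods with stream weights.

    Assumed form: log b(o) = w_a * log b_a(o_a) + w_v * log b_v(o_v),
    with w_v = 1 - w_a. The paper's exact weighting is not given in
    the abstract, so this is an illustrative assumption.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return audio_weight * log_b_audio + visual_weight * log_b_visual


def relative_wer_reduction(wer_audio_only, wer_audio_visual):
    """Relative WER reduction of the audio-visual system over audio-only."""
    if wer_audio_only <= 0.0:
        raise ValueError("audio-only WER must be positive")
    return (wer_audio_only - wer_audio_visual) / wer_audio_only


if __name__ == "__main__":
    # Hypothetical log-likelihoods for one HMM state and one frame.
    print(multistream_log_likelihood(-10.0, -20.0, audio_weight=0.7))
    # Hypothetical WERs in percent (not results from the paper).
    print(relative_wer_reduction(50.0, 40.0))
```

In this sketch, a lower audio weight would shift trust toward the visual stream, which is the usual motivation for stream weighting when the audio is noisy. A relative reduction expresses the WER difference as a fraction of the audio-only WER, so it differs from an absolute difference in percentage points.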

Original language: English (US)
State: Published - 2002
Event: International Conference on Image Processing (ICIP'02) - Rochester, NY, United States
Duration: Sep 22 2002 to Sep 25 2002


Other: International Conference on Image Processing (ICIP'02)
Country/Territory: United States
City: Rochester, NY

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering

