Product HMMs for audio-visual continuous speech recognition using facial animation parameters

P. S. Aleksic*, Aggelos K Katsaggelos

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

9 Scopus citations

Abstract

Using visual information in addition to acoustic information can improve automatic speech recognition (ASR). In this paper we compare different approaches to audio-visual information integration and show how they affect ASR performance. As visual features we use facial animation parameters (FAPs), which the MPEG-4 standard supports for visual representation. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration, using a product HMM topology to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to the audio-only ASR (A-ASR) WER at various SNRs with additive white Gaussian noise.
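
The abstract gives no formulas, so the following is only an illustrative sketch of the ideas it names, not the authors' implementation. It assumes the common multi-stream convention in which per-stream emission log-likelihoods are combined with weights that sum to one, treats a product HMM's composite states for one phone as (audio state, visual state) pairs, and computes a relative WER reduction. All function names, the weight value, and the numbers in the usage lines are hypothetical.

```python
from itertools import product


def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Stream-weighted emission score for one state and one frame.

    Assumption: the visual weight is 1 - audio_weight, a common
    convention that the abstract does not itself specify.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return audio_weight * log_b_audio + visual_weight * log_b_visual


def product_states(n_audio_states, n_visual_states):
    """Composite (audio_state, visual_state) pairs for one phone model.

    Inside a phone the two streams may occupy different states; both
    streams enter and leave the phone model together.
    """
    return list(product(range(n_audio_states), range(n_visual_states)))


def relative_wer_reduction(wer_baseline, wer_system):
    """Fraction of the baseline WER removed by the compared system."""
    if wer_baseline <= 0:
        raise ValueError("wer_baseline must be positive")
    return (wer_baseline - wer_system) / wer_baseline


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the functions.
    print(combined_log_likelihood(-10.0, -20.0, audio_weight=0.7))
    print(len(product_states(3, 3)))
    print(relative_wer_reduction(50.0, 40.0))
```

In this sketch a lower audio weight would shift trust toward the visual stream, which is the usual motivation for tuning stream weights as the audio SNR drops.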

Original language: English (US)
Title of host publication: Proceedings - 2003 International Conference on Multimedia and Expo, ICME
Publisher: IEEE Computer Society
Pages: II481-II484
ISBN (Electronic): 0780379659
State: Published - Jan 1 2003
Event: 2003 International Conference on Multimedia and Expo, ICME 2003 - Baltimore, United States
Duration: Jul 6 2003 - Jul 9 2003

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Other

Other: 2003 International Conference on Multimedia and Expo, ICME 2003
Country/Territory: United States
City: Baltimore
Period: 7/6/03 - 7/9/03

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
