Product HMMs for audio-visual continuous speech recognition using facial animation parameters

P. S. Aleksic*, Aggelos K Katsaggelos

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

The use of visual information in addition to acoustic information can improve automatic speech recognition (ASR). In this paper we compare different approaches to audio-visual information integration and show how they affect ASR performance. As visual features we use facial animation parameters (FAPs), which the MPEG-4 standard supports for visual representation. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information, and we perform both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed best, reduces the word error rate (WER) by approximately 20% relative to the audio-only ASR (A-ASR) WER at various SNRs with additive white Gaussian noise.
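The abstract names the integration schemes but gives no formulas, so the following is only a minimal sketch of how multi-stream and product HMMs are commonly formulated in the audio-visual ASR literature, not the authors' implementation. All function names, the stream-weight convention (weights summing to 1), the asynchrony limit `max_async`, and the toy 3-state transition matrices are assumptions made for illustration. Under these assumptions, state-synchronous integration corresponds to `max_async=0`, while a product HMM lets the audio and visual streams occupy different states within a phone.

```python
import itertools
import numpy as np


def multistream_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Combine per-stream emission log-likelihoods with stream weights.

    Assumed convention: the weights sum to 1, so the visual weight is
    1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


def product_states(n_audio, n_visual, max_async=1):
    """Composite (audio_state, visual_state) pairs for one phone model.

    Pairs whose state indices differ by more than max_async are pruned
    (a hypothetical asynchrony limit, not taken from the paper).
    """
    return [
        (a, v)
        for a, v in itertools.product(range(n_audio), range(n_visual))
        if abs(a - v) <= max_async
    ]


def product_transitions(A_audio, A_visual, states):
    """Transition matrix over composite states.

    Assumes the two streams move independently inside a phone; rows are
    renormalised because pruning removes some probability mass.
    """
    n = len(states)
    A = np.zeros((n, n))
    for i, (a, v) in enumerate(states):
        for j, (a2, v2) in enumerate(states):
            A[i, j] = A_audio[a, a2] * A_visual[v, v2]
    row_sums = A.sum(axis=1, keepdims=True)
    return np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)


def relative_wer_reduction(wer_baseline, wer_system):
    """Relative WER reduction of a system with respect to a baseline."""
    return (wer_baseline - wer_system) / wer_baseline


# Toy left-to-right 3-state phone model (illustrative values only).
A_toy = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])
states = product_states(3, 3, max_async=1)
A_product = product_transitions(A_toy, A_toy, states)
```

With three states per stream and `max_async=1`, the composite model has 7 states instead of the full 9. The helper `relative_wer_reduction` only spells out the arithmetic behind a relative figure such as the one quoted in the abstract: a hypothetical drop from 50% to 40% WER is a 20% relative reduction.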

Original language: English (US)
Title of host publication: Proceedings - 2003 International Conference on Multimedia and Expo, ICME
Publisher: IEEE Computer Society
Pages: II481-II484
ISBN (Electronic): 0780379659
DOI: https://doi.org/10.1109/ICME.2003.1221658
State: Published - Jan 1 2003
Event: 2003 International Conference on Multimedia and Expo, ICME 2003 - Baltimore, United States
Duration: Jul 6 2003 to Jul 9 2003

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
Volume: 2
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Other

Other: 2003 International Conference on Multimedia and Expo, ICME 2003
Country: United States
City: Baltimore
Period: 7/6/03 to 7/9/03

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications

Cite this

Aleksic, P. S., & Katsaggelos, A. K. (2003). Product HMMs for audio-visual continuous speech recognition using facial animation parameters. In Proceedings - 2003 International Conference on Multimedia and Expo, ICME (pp. II481-II484). [1221658] (Proceedings - IEEE International Conference on Multimedia and Expo; Vol. 2). IEEE Computer Society. https://doi.org/10.1109/ICME.2003.1221658