Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance

Petar S. Aleksic*, Aggelos K Katsaggelos

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Scopus citations

Abstract

In this paper, we describe an audio-visual automatic speech recognition (AV-ASR) system that uses Facial Animation Parameters (FAPs), supported by the MPEG-4 standard, for the visual representation of speech. We describe the visual feature extraction algorithms used to extract the FAPs that control outer- and inner-lip movement. Principal component analysis (PCA) is performed on both the inner- and outer-lip FAP vectors in order to reduce their dimensionality and decorrelate them. The PCA-based projection weights of the extracted FAP vectors are used as visual features. Multi-stream Hidden Markov Models (HMMs) and a late integration approach are used to integrate audio and visual information and train a continuous AV-ASR system. We compare the performance of the developed AV-ASR system using outer- and inner-lip FAPs, individually and jointly. Experiments were performed for different dimensionalities of the visual features, at various SNRs (0-30 dB) with additive white Gaussian noise, on a relatively large-vocabulary (approximately 1000 words) database. The proposed system reduces the word error rate (WER) by 20% to 23% relative to the audio-only ASR WER. Conclusions are drawn on the individual and combined effectiveness of the inner- and outer-lip FAPs, and on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, including its influence on AV-ASR performance.
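The noise condition mentioned in the abstract (additive white Gaussian noise at SNRs from 0 to 30 dB) can be illustrated with a minimal sketch. The abstract does not say how the noise was generated or scaled, so everything here is an assumption for illustration: the function names `add_awgn` and `measured_snr_db` are hypothetical, the SNR is defined on average power over the whole signal, and a toy sine wave stands in for speech audio.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return `signal` plus white Gaussian noise scaled to a target SNR in dB.

    Assumption: SNR is the ratio of average signal power to average
    noise power over the whole signal (not segmental SNR).
    """
    rng = rng or random.Random()
    signal_power = sum(s * s for s in signal) / len(signal)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB actually achieved, given the clean reference."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


# Toy stand-in for speech: one second of a 440 Hz tone sampled at 16 kHz.
clean = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(16000)]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for target in range(0, 31, 10):
        noisy = add_awgn(clean, target, rng)
        print(f"target {target:2d} dB, measured {measured_snr_db(clean, noisy):.2f} dB")
```

The measured value differs slightly from the target because the noise is a finite random sample; the gap shrinks as the signal gets longer.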

Original language: English (US)
Title of host publication: IEEE International Conference on Image Processing 2005, ICIP 2005
Pages: 501-504
Number of pages: 4
State: Published - Dec 1 2005
Event: IEEE International Conference on Image Processing 2005, ICIP 2005 - Genova, Italy
Duration: Sep 11 2005 to Sep 14 2005

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
Volume: 3
ISSN (Print): 1522-4880

Other

Other: IEEE International Conference on Image Processing 2005, ICIP 2005
Country: Italy
City: Genova
Period: 9/11/05 to 9/14/05

ASJC Scopus subject areas

  • Engineering(all)
