Audio-visual continuous speech recognition using MPEG-4 compliant visual features

Petar S. Aleksic*, Jay J. Williams, Zhilin Wu, Aggelos K. Katsaggelos

*Corresponding author for this work

Research output: Contribution to conference (Paper)

10 Citations (Scopus)

Abstract

In this paper we utilize Facial Animation Parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, in order to significantly improve automatic speech recognition (ASR). We describe a robust and automatic algorithm for extraction of FAPs from visual data that requires no hand labeling or extensive training procedures. Multi-stream Hidden Markov Models (HMMs) were used to integrate audio and visual information. ASR experiments were performed under both clean and noisy audio conditions using a relatively large vocabulary (approximately 1000 words). The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR WERs at various SNRs with additive white Gaussian noise, and by 19% relative to the audio-only ASR WER under clean audio conditions.
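As a rough illustration of two quantities the abstract relies on, the sketch below computes a relative WER reduction and a weighted multi-stream log-likelihood for a single HMM state and frame. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the example numbers, and the convention that the audio and visual stream weights sum to 1 are all hypothetical and are not taken from the paper.

```python
import math


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Return the WER reduction as a fraction of the baseline (audio-only) WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - system_wer) / baseline_wer


def multistream_log_likelihood(
    log_b_audio: float, log_b_visual: float, audio_weight: float
) -> float:
    """Combine per-stream emission log-likelihoods for one state and one frame.

    Assumption: the stream weights are non-negative and sum to 1, so the
    visual weight is (1 - audio_weight). This is a common convention for
    multi-stream HMMs, not a detail confirmed by the abstract.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    visual_weight = 1.0 - audio_weight
    return audio_weight * log_b_audio + visual_weight * log_b_visual


if __name__ == "__main__":
    # Hypothetical WERs, chosen only to show the arithmetic.
    audio_only_wer = 0.50
    audio_visual_wer = 0.40
    reduction = relative_wer_reduction(audio_only_wer, audio_visual_wer)
    print(f"relative WER reduction: {reduction:.1%}")

    # Hypothetical per-stream log-likelihoods for one frame.
    combined = multistream_log_likelihood(-10.0, -20.0, audio_weight=0.7)
    print(f"combined log-likelihood: {combined:.2f}")
```

In a setup like this, the audio weight would typically be lowered as the SNR drops, so that the visual stream contributes more when the audio is less reliable; how the weights were actually chosen is not stated in the abstract.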

Original language: English (US)
State: Published - Jan 1 2002
Event: International Conference on Image Processing (ICIP'02) - Rochester, NY, United States
Duration: Sep 22 2002 to Sep 25 2002

Fingerprint

Continuous speech recognition
Speech recognition
Animation
Motion Picture Experts Group standards
Hidden Markov models
Labeling
Experiments

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering

Cite this

Aleksic, P. S., Williams, J. J., Wu, Z., & Katsaggelos, A. K. (2002). Audio-visual continuous speech recognition using MPEG-4 compliant visual features. Paper presented at International Conference on Image Processing (ICIP'02), Rochester, NY, United States.
@conference{441f68394d044f4082d5654106055668,
title = "Audio-visual continuous speech recognition using MPEG-4 compliant visual features",
author = "Aleksic, {Petar S.} and Williams, {Jay J.} and Wu, Zhilin and Katsaggelos, {Aggelos K.}",
year = "2002",
month = "1",
day = "1",
language = "English (US)",
note = "International Conference on Image Processing (ICIP'02) ; Conference date: 22-09-2002 Through 25-09-2002",

}

TY - CONF

T1 - Audio-visual continuous speech recognition using MPEG-4 compliant visual features

AU - Aleksic, Petar S.

AU - Williams, Jay J.

AU - Wu, Zhilin

AU - Katsaggelos, Aggelos K.

PY - 2002/1/1

Y1 - 2002/1/1

UR - http://www.scopus.com/inward/record.url?scp=0036447870&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0036447870&partnerID=8YFLogxK

M3 - Paper

ER -
