Abstract
In this paper we utilize Facial Animation Parameters (FAPs), supported by the MPEG-4 standard for the visual representation of speech, to significantly improve automatic speech recognition (ASR). We describe a robust, automatic algorithm for extracting FAPs from visual data that requires no hand labeling or extensive training procedures. Multi-stream Hidden Markov Models (HMMs) were used to integrate audio and visual information. ASR experiments were performed under both clean and noisy audio conditions using a relatively large vocabulary (approximately 1000 words). The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only ASR WERs at various SNRs with additive white Gaussian noise, and by 19% relative to the audio-only ASR WER under clean audio conditions.
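The multi-stream HMM mentioned in the abstract combines the audio and visual streams at the state level, typically as a weighted product of per-stream observation likelihoods (a weighted sum in the log domain). A minimal sketch of that combination step, with hypothetical likelihood values and an assumed stream weight (the paper's actual weights and tuning procedure are not given here):

```python
def multistream_log_likelihood(stream_log_likelihoods, stream_weights):
    """Combine per-stream observation log-likelihoods for one HMM state:
    log b(o) = sum_s  w_s * log b_s(o_s).
    With weights summing to 1, w_s interpolates the streams' influence."""
    return sum(w * ll for w, ll in zip(stream_weights, stream_log_likelihoods))

# Hypothetical per-stream log-likelihoods for one observation frame
log_b_audio = -12.3   # audio stream (e.g., MFCC-based Gaussian mixture)
log_b_visual = -8.7   # visual stream (e.g., FAP-based Gaussian mixture)

# Assumed audio stream weight; in practice tuned per noise condition
# (more weight shifts to the visual stream as audio SNR drops)
lam = 0.7
combined = multistream_log_likelihood([log_b_audio, log_b_visual],
                                      [lam, 1.0 - lam])
```

Under noisy audio, lowering `lam` lets the FAP-derived visual stream dominate, which is the mechanism behind the WER gains reported at low SNRs.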
| Original language | English (US) |
| --- | --- |
| State | Published - Jan 1 2002 |
| Event | International Conference on Image Processing (ICIP'02) - Rochester, NY, United States; Sep 22 2002 → Sep 25 2002 |
Other

| Other | International Conference on Image Processing (ICIP'02) |
| --- | --- |
| Country | United States |
| City | Rochester, NY |
| Period | 9/22/02 → 9/25/02 |
ASJC Scopus subject areas
- Hardware and Architecture
- Computer Vision and Pattern Recognition
- Electrical and Electronic Engineering