TY - GEN
T1 - Product HMMs for audio-visual continuous speech recognition using facial animation parameters
AU - Aleksic, P. S.
AU - Katsaggelos, A. K.
PY - 2003/1/1
Y1 - 2003/1/1
N2 - The use of visual information in addition to acoustic information can improve automatic speech recognition. In this paper we compare different approaches for audio-visual information integration and show how they affect automatic speech recognition performance. We utilize facial animation parameters (FAPs), supported by the MPEG-4 standard, as visual features. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed the best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) WERs, at various SNRs with additive white Gaussian noise.
AB - The use of visual information in addition to acoustic information can improve automatic speech recognition. In this paper we compare different approaches for audio-visual information integration and show how they affect automatic speech recognition performance. We utilize facial animation parameters (FAPs), supported by the MPEG-4 standard, as visual features. We use both single-stream and multi-stream hidden Markov models (HMMs) to integrate audio and visual information. We performed both state-synchronous and phone-synchronous multi-stream integration. A product HMM topology is used to model the phone-synchronous integration. ASR experiments were performed under noisy audio conditions using a relatively large-vocabulary (approximately 1000 words) audio-visual database. The proposed phone-synchronous system, which performed the best, reduces the word error rate (WER) by approximately 20% relative to audio-only ASR (A-ASR) WERs, at various SNRs with additive white Gaussian noise.
UR - http://www.scopus.com/inward/record.url?scp=34247584561&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34247584561&partnerID=8YFLogxK
U2 - 10.1109/ICME.2003.1221658
DO - 10.1109/ICME.2003.1221658
M3 - Conference contribution
AN - SCOPUS:34247584561
T3 - Proceedings - IEEE International Conference on Multimedia and Expo
SP - II481-II484
BT - Proceedings - 2003 International Conference on Multimedia and Expo, ICME
PB - IEEE Computer Society
T2 - 2003 International Conference on Multimedia and Expo, ICME 2003
Y2 - 6 July 2003 through 9 July 2003
ER -