An HMM-based speech-to-video synthesizer

Jay J. Williams*, Aggelos K. Katsaggelos

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

32 Scopus citations

Abstract

Emerging broadband communication systems promise a future of multimedia telephony. The addition of visual information during telephone conversations, for example, would be most beneficial to people with impaired hearing who are able to speechread. For the present, it is useful to consider the problem of generating the critical information needed for speechreading from existing narrowband communication systems used for speech. This paper focuses on the problem of synthesizing visual articulatory movements given the acoustic speech signal. In this application, the acoustic speech signal is analyzed and the corresponding articulatory movements are synthesized for speechreading. This paper describes a hidden Markov model (HMM)-based visual speech synthesizer designed to improve speech understanding. The key elements in the application of HMMs to this problem are the decomposition of the overall modeling task into key stages and the judicious determination of the observation vector's components for each stage. The main contribution of this paper is a novel correlation HMM that is able to integrate independently trained acoustic and visual HMMs for speech-to-visual synthesis. This model allows increased flexibility in choosing model topologies for the acoustic and visual HMMs. Moreover, the proposed model reduces the amount of training data required compared to early integration modeling techniques. Results from objective experiments show that the proposed approach can reduce time alignment errors by 37.4% compared to a conventional temporal scaling method. Furthermore, subjective results indicate that the proposed model can increase speech understanding.

Original language: English (US)
Pages (from-to): 900-915
Number of pages: 16
Journal: IEEE Transactions on Neural Networks
Volume: 13
Issue number: 4
State: Published - Jul 2002

Keywords

  • Audio-visual recognition
  • Hidden Markov model (HMM) modeling
  • Multimodal signal processing
  • Visual synthesis

ASJC Scopus subject areas

  • Software
  • Computer Science Applications
  • Computer Networks and Communications
  • Artificial Intelligence
