Abstract
Emerging broadband communication systems promise a future of multimedia telephony. The addition of visual information during telephone conversations, for example, would be most beneficial to people with impaired hearing who are able to speechread. For the present, it is useful to consider how the information critical to speechreading can be generated from the existing narrowband communication systems used for speech. This paper focuses on the problem of synthesizing visual articulatory movements from the acoustic speech signal. A Hidden Markov Model (HMM)-based visual speech synthesizer is designed to improve speech understanding. The key elements in applying HMMs to this problem are: (a) the decomposition of the overall modeling task into key stages; and (b) the judicious determination of the components of the observation vector for each stage. The main contribution of this paper is a novel correlation HMM that integrates independently trained acoustic and visual HMMs for speech-to-visual synthesis. This model allows greater flexibility in choosing model topologies for the acoustic and visual HMMs, and it requires less training data than early-integration modeling techniques. Results from objective and subjective analyses show that an HMM correlating model can significantly decrease audio-visual synchronization errors and increase speech understanding.
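The abstract's correlation HMM links the states of an independently trained acoustic HMM to those of an independently trained visual HMM. The sketch below is one plausible, minimal reading of that idea, not the paper's implementation: a hypothetical correlation matrix `C` (an estimate of P(visual state | acoustic state)) supplies per-frame likelihoods, while the visual HMM's own transition matrix temporally smooths the decoded visual state sequence. All model sizes, matrix values, and names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical model sizes -- illustrative, not values from the paper.
N_ACOUSTIC = 4   # states of the independently trained acoustic HMM
N_VISUAL = 3     # states of the independently trained visual HMM

# C[a, v] ~ P(visual state v | acoustic state a), assumed to be estimated
# from a small audio-visual corpus; each row sums to one.
C = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.1, 0.8],
])

# Transition matrix of the visual HMM, trained separately on video.
A_v = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])

def map_to_visual_states(acoustic_states):
    """Viterbi-style decoding of a visual state sequence: C supplies the
    per-frame likelihoods, A_v supplies the temporal smoothing."""
    T = len(acoustic_states)
    log_A = np.log(A_v + 1e-12)
    delta = np.log(C[acoustic_states[0]] + 1e-12)
    psi = np.zeros((T, N_VISUAL), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # (prev state, next state)
        psi[t] = scores.argmax(axis=0)           # best predecessor per state
        delta = scores.max(axis=0) + np.log(C[acoustic_states[t]] + 1e-12)
    # Backtrack the best visual state path.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Acoustic state sequence, as would be produced by Viterbi decoding of
# the incoming speech signal.
acoustic_states = [0, 0, 1, 1, 1, 2, 3, 3]
print(map_to_visual_states(acoustic_states))
```

Because the acoustic and visual models are trained independently and coupled only through `C`, either side's topology can be changed without retraining the other, which is consistent with the flexibility and reduced-training-data claims in the abstract.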
| Original language | English (US) |
|---|---|
| Pages (from-to) | 544-555 |
| Number of pages | 12 |
| Journal | Proceedings of SPIE - The International Society for Optical Engineering |
| Volume | 4299 |
| DOIs | |
| State | Published - 2001 |
| Event | Human Vision and Electronic Imaging VI, San Jose, CA, United States. Duration: Jan 22 2001 → Jan 25 2001 |
ASJC Scopus subject areas
- Electronic, Optical and Magnetic Materials
- Condensed Matter Physics
- Computer Science Applications
- Applied Mathematics
- Electrical and Electronic Engineering