This chapter focuses on how the joint processing of visual and audio signals, both generated by a talking person, can provide valuable speech information that benefits a number of audiovisual speech processing applications crucial to human-computer interaction. The visual signal is first analyzed, followed by a description of various ways of representing and extracting the speech information it contains. The chapter shows that the resulting visual features can complement features extracted from the acoustic signal, and that the two modality representations can be fused to allow joint audiovisual speech processing. The general bimodal integration framework is then applied to three problems: automatic speech recognition, talking face synthesis, and speaker identification and authentication. In all three cases, issues specific to each application are discussed, several relevant systems reported in the literature are reviewed, and results obtained using implementations developed at IBM Research and Northwestern University are presented.