There has been significant work investigating the relationship between articulatory movements, vocal tract shape, and speech acoustics (Fant, 1960; Flanagan, 1965; Narayanan & Alwan, 2000; Schroeter & Sondhi, 1994). It has been shown that a strong correlation exists between face motion, vocal tract shape, and speech acoustics (Grant & Braida, 1991; Massaro & Stork, 1998; Summerfield, 1979, 1987, 1992; Williams & Katsaggelos, 2002; Yehia, Rubin, & Vatikiotis-Bateson, 1998). In particular, dynamic lip information conveys not only correlated but also complementary information to the acoustic speech signal. Its integration into an automatic speech recognition (ASR) system, yielding an audio-visual (AV) system, can potentially improve recognition performance. Although visual speech information is usually used together with acoustic information, there are applications where visual-only (V-only) ASR systems can achieve high recognition rates. These include small-vocabulary ASR (digits, a small set of commands, etc.) and ASR under adverse acoustic conditions. The choice and accurate extraction of visual features strongly affect the performance of AV and V-only ASR systems. The establishment of lip features for speech recognition is a relatively new research topic. Although a number of approaches can be used for extracting and representing visual lip information, limited work exists in the literature comparing the relative performance of different features. In this chapter, the authors describe various approaches for extracting and representing important visual features, review existing systems, evaluate their relative performance in terms of speech and speaker recognition rates, and discuss future research and development directions in this area.