TY - GEN
T1 - Audio-visual anticipatory coarticulation modeling by human and machine
AU - Terry, Louis H.
AU - Livescu, Karen
AU - Pierrehumbert, Janet B.
AU - Katsaggelos, Aggelos K.
PY - 2010
Y1 - 2010
AB - The phenomenon of anticipatory coarticulation provides a basis for the observed asynchrony between the acoustic and visual onsets of phones in certain linguistic contexts. This type of asynchrony is typically not explicitly modeled in audio-visual speech models. In this work, we study within-word audio-visual asynchrony using manual labels of words in which theory suggests that audio-visual asynchrony should occur, and show that these hand labels confirm the theory. We then introduce a new statistical model of audio-visual speech, the asynchrony-dependent transition (ADT) model. This model allows asynchrony between audio and video states within word boundaries, where the audio and video state transitions depend not only on the state of that modality, but also on the instantaneous asynchrony. The ADT model outperforms a baseline synchronous model in mimicking the hand labels in a forced alignment task, and its behavior as parameters are changed conforms to our expectations about anticipatory coarticulation. The same model could be used for speech recognition, although here we consider it only for the task of forced alignment for linguistic analysis.
KW - Anticipatory coarticulation
KW - Audio-visual asynchrony
KW - Audio-visual speech recognition
KW - Dynamic Bayesian networks
UR - http://www.scopus.com/inward/record.url?scp=79959815300&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959815300&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:79959815300
T3 - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
SP - 2682
EP - 2685
BT - Proceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PB - International Speech Communication Association
ER -