Prosodic Hierarchy as an Organizing Framework for the Sources of Context in Phone-Based and Articulatory-Feature-Based Speech Recognition

Mark Hasegawa-Johnson, Jennifer Sandra Cole, Ken Chen, Partha Lal, Amit Juneja, Taejin Yoon, Sarah Borys, Xiaodan Zhuang

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Automatic speech recognition (ASR) is like solving a crossword puzzle. Context at every level is used to resolve ambiguity: the more context we can bring to bear, the higher will be the accuracy of the ASR. One of the ways in which ASR uses context is by defining context-dependent phonological units. This paper reviews and applies two types of phonological units that we find useful in ASR: “phones” (segmental units), and “articulatory features” (units roughly corresponding to bundles of articulatory gestures and/or quantized tract variables). Although the details of the phone or feature inventory vary from system to system, the requirements for a phone or feature inventory are easy to define: each phone (or each vector of articulatory features) must be both “acoustically compact” (the acoustic correlates of a phone or feature vector are predictable) and “phonologically compact” (the phone or feature correlates of a word, in context, are predictable). This paper proposes that the two goals of a phone inventory may be satisfied by defining phones that are sensitive to prosodic context, or alternatively, by using prosodic context to constrain the temporal evolution of recognized articulatory features. Example systems are described that incorporate contextual constraints from five different levels of the prosodic hierarchy, and from the prosodic disruptions caused by disfluency. Intonational-phrase-sensitive phones know whether they are final or nonfinal within an intonational phrase. Phrasal-prominence-sensitive phones know whether or not they have phrasal prominence. Word-level context is incorporated, for example, in audiovisual speech recognition models that represent pronunciation variability by way of within-word asynchrony among the targets achieved by the tongue, lips, and glottis/velum. Foot-sensitive phones represent the alternation among reduced, unreduced, and lexically stressed vowels. 
The syllable-sensitive phones described in this paper are, in fact, not phone models in the traditional sense at all; rather, they are better understood to be models of the consonant release and closure landmarks that initiate and terminate each syllable. Finally, two of the acoustic effects of disfluency have been represented: the unique acoustic characteristics of the phones in filled pauses, and the glottalization of phones in the final syllable of a reparandum. We report experimental results demonstrating that many of these context features may reduce the word error rate (WER) of a speech recognizer in at least one specified transcription task. Computational complexity limitations, and training data limitations, have thus far prevented the integration of all proposed context features into any single ASR application.
Original language: English (US)
Title of host publication: Linguistic Patterns of Spontaneous Speech
Editors: S. Tseng
Place of Publication: Taipei, Taiwan
Publisher: Academia Sinica
Pages: 101-128
Number of pages: 28
State: Published - 2009