Automatic speech recognition (ASR) is like solving a crossword puzzle. Context at every level is used to resolve ambiguity: the more context we can bring to bear, the higher the resulting accuracy. One of the ways in which ASR uses context is by defining context-dependent phonological units. This paper reviews and applies two types of phonological units that we find useful in ASR: “phones” (segmental units), and “articulatory features” (units roughly corresponding to bundles of articulatory gestures and/or quantized tract variables). Although the details of the phone or feature inventory vary from system to system, the requirements for a phone or feature inventory are easy to define: each phone (or each vector of articulatory features) must be both “acoustically compact” (the acoustic correlates of a phone or feature vector are predictable) and “phonologically compact” (the phone or feature correlates of a word, in context, are predictable). This paper proposes that these two goals of a phone inventory may be satisfied by defining phones that are sensitive to prosodic context, or alternatively, by using prosodic context to constrain the temporal evolution of recognized articulatory features. Example systems are described that incorporate contextual constraints from five different levels of the prosodic hierarchy, and from the prosodic disruptions caused by disfluency. Intonational-phrase-sensitive phones know whether they are final or nonfinal within an intonational phrase. Phrasal-prominence-sensitive phones know whether or not they have phrasal prominence. Word-level context is incorporated, for example, in audiovisual speech recognition models that represent pronunciation variability by way of within-word asynchrony among the targets achieved by the tongue, lips, and glottis/velum. Foot-sensitive phones represent the alternation among reduced, unreduced, and lexically stressed vowels.
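The prosody-sensitive phone inventories described above can be pictured as a base phone set crossed with prosodic context tags. The sketch below is purely illustrative (the phone names, tag names, and label format are assumptions, not the paper's actual notation); it shows how intonational-phrase position and phrasal prominence multiply the size of the inventory:

```python
# Illustrative sketch only: crossing a base phone set with prosodic
# context tags, in the spirit of the intonational-phrase- and
# prominence-sensitive phones described above. All names are hypothetical.

from itertools import product

BASE_PHONES = ["aa", "iy", "t", "s"]       # hypothetical base inventory
PHRASE_POSITION = ["final", "nonfinal"]    # position within intonational phrase
PROMINENCE = ["prom", "noprom"]            # phrasal prominence or its absence

def prosody_sensitive_inventory(phones, positions, prominences):
    """Cross each phone with every prosodic context, e.g. 'aa_final_prom'."""
    return [f"{p}_{pos}_{prm}"
            for p, pos, prm in product(phones, positions, prominences)]

inventory = prosody_sensitive_inventory(BASE_PHONES, PHRASE_POSITION, PROMINENCE)
print(len(inventory))   # 4 phones x 2 positions x 2 prominence values = 16
print(inventory[0])     # 'aa_final_prom'
```

The combinatorial growth visible here (every added context feature multiplies the unit count) is one reason the abstract notes that integrating all proposed context features into a single system remains computationally difficult.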
The syllable-sensitive phones described in this paper are, in fact, not phone models in the traditional sense at all; rather, they are better understood as models of the consonant release and closure landmarks that initiate and terminate each syllable. Finally, two of the acoustic effects of disfluency have been represented: the unique acoustic characteristics of the phones in filled pauses, and the glottalization of phones in the final syllable of a reparandum. We report experimental results demonstrating that many of these context features may reduce the word error rate (WER) of a speech recognizer in at least one specified transcription task. Computational complexity limitations, and training data limitations, have thus far prevented the integration of all proposed context features into any single ASR application.
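For readers unfamiliar with the evaluation metric cited above, WER is the standard measure: the word-level Levenshtein (edit) distance between the reference transcript and the recognizer's hypothesis, normalized by the reference length. A minimal sketch (generic textbook computation, not code from the paper):

```python
# Minimal word error rate (WER): Levenshtein distance between reference
# and hypothesis word sequences, normalized by the reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))   # 0.0
print(wer("the cat sat", "the bat sat"))   # one substitution in three words
```

A context feature "reduces WER" in the sense above when the recognizer trained with that feature yields a strictly lower value of this ratio on the evaluation set.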
Original language: English (US)
Title of host publication: Linguistic Patterns of Spontaneous Speech
Place of publication: Taipei, Taiwan
Number of pages: 28
State: Published - 2009