Will computers sound completely human-like in the future?
Words like okay, alright and right pose a particular challenge for automatic speech recognition because they serve a wide range of functions in conversation, which makes them ambiguous. In natural conversation, listeners work out the intended meaning from the word's position in the utterance and, importantly, from auditory and acoustic cues such as the intonation pattern. However, in computational systems such as TTS (Text-to-Speech), how can the correct intonation pattern be assigned to a word such as okay if the system cannot decide which function it is serving in the utterance?
Agustin Gravano, Julia Hirschberg and Štefan Beňuš analysed a group of words classified as Affirmative Cue Words (ACWs) for their acoustic and prosodic similarities and differences, to see how these properties could help with computational disambiguation. The functions of ACWs include showing agreement, showing interest and signalling the beginning or end of a topic. Gravano, Hirschberg and Beňuš found ten potentially different functions of ACWs, though only okay and alright are versatile enough to be used in all ten ways.
The data they used came from a recorded corpus known as the Columbia Games Corpus, which represents 13 speakers of Standard American English. They identified 5,456 ACWs, which represented 7.8% of the speech. Of these, the six most common were alright, okay, yeah, mm-hm, uh-huh and right. However, to make sure the data was not dominated by one speaker, which would skew the results, the data was levelled so that it represented all speakers as equally as possible.
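The summary does not say how the data was levelled, so the following is only a minimal sketch of one common approach: downsampling every speaker to the size of the smallest speaker's contribution. The function name `balance_by_speaker`, the (speaker, word) pair layout and the toy data are all assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict


def balance_by_speaker(tokens, seed=0):
    """Downsample so every speaker contributes the same number of tokens.

    `tokens` is a list of (speaker_id, word) pairs. Both this layout and the
    downsample-to-smallest strategy are assumptions, not the authors' method.
    """
    # Group the tokens by the speaker who produced them.
    by_speaker = defaultdict(list)
    for speaker, word in tokens:
        by_speaker[speaker].append((speaker, word))

    # The least productive speaker sets the cap for everyone else.
    cap = min(len(items) for items in by_speaker.values())

    # A seeded generator keeps the random sample reproducible.
    rng = random.Random(seed)
    balanced = []
    for speaker in sorted(by_speaker):
        balanced.extend(rng.sample(by_speaker[speaker], cap))
    return balanced


# Invented toy data: speaker s1 would otherwise dominate the sample.
toy = [("s1", "okay")] * 5 + [("s2", "right")] * 2 + [("s3", "yeah")] * 3
balanced = balance_by_speaker(toy)
```

In this toy case each of the three speakers ends up with two tokens. Downsampling discards data, so a real study might prefer weighting instead; that trade-off is not discussed in the summary.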
Looking at the position of the words, they found that alright and okay were used in similar positions in the utterance, and that mm-hm and uh-huh also showed similar distribution patterns (frequently with a pause on either side of the word, so that it stood alone), with their primary functions being either to act as backchannels (showing that the listener was following) or to show agreement. They suggest that this means that the members of each pair can be used interchangeably.
In addition to position and function, they looked at factors such as intonation pattern, intensity, duration, pitch and voice quality. In doing this, they could identify the acoustic and prosodic features which were more likely to indicate a particular function of a single ACW.
Using this data, the researchers conducted a number of experiments which tested the ability of computational systems to recognise and correctly classify the function of the ACWs in their data. They noted that their approach allowed for the inclusion of a wide range of available information, which led to greater accuracy in classification. However, of all the factors incorporated into the experiments, data related to the ACW’s position in the intonational phrase turned out to be the most important factor in disambiguating the function of these words. For example, right was the only ACW that could be used with a checking function (e.g. through its use in tags – “it’s there, right?”) and this was one of the few instances where an ACW could be found at the end of an intonational phrase.
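To make the idea of a positional feature concrete, here is a minimal sketch of how a word's position in an intonational phrase might be encoded before being passed to a classifier. It is not the researchers' system: the function names, the four position labels and the representation of a phrase as a list of words are hypothetical choices made for this example.

```python
def phrase_position(phrase, index):
    """Label where the word at `index` sits within an intonational phrase.

    `phrase` is a list of word strings. The four labels are hypothetical
    and are not the feature set used in the paper.
    """
    if not 0 <= index < len(phrase):
        raise IndexError("index outside phrase")
    if len(phrase) == 1:
        return "alone"      # the word forms a phrase by itself
    if index == 0:
        return "initial"
    if index == len(phrase) - 1:
        return "final"
    return "medial"


def position_features(phrase, index):
    """Bundle the word with simple positional information (assumed features)."""
    return {
        "word": phrase[index].lower(),
        "position": phrase_position(phrase, index),
        "phrase_length": len(phrase),
    }


# The tag example from the text: "it's there, right?"
tag = position_features(["it's", "there", "right"], 2)
# A stand-alone token, as is typical of backchannels such as mm-hm.
backchannel = position_features(["mm-hm"], 0)
```

Here `tag["position"]` is "final" and `backchannel["position"]` is "alone". In a fuller system, features like these would sit alongside the acoustic and prosodic measurements described above, rather than being used on their own.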
They conclude that, by analysing and incorporating a wide range of acoustic and prosodic characterisations in computational testing and subsequent programming, spoken dialogue systems (such as those which aim to both recognise and produce speech) will be able to improve their performance and take one step closer to emulating human speech.
Gravano, A., Hirschberg, J. and Beňuš, Š. (2012) Affirmative Cue Words in Task-Oriented Dialogue. Computational Linguistics 38(1): 1-39.
This summary was written by Jenny Amos