Will computers sound completely human-like in the future?
Words like okay, alright and right pose a particular challenge for automatic speech recognition because they serve a wide range of functions in conversation, which can make them ambiguous. In natural conversation, listeners work out the intended meaning from the word's position in the utterance and, importantly, from auditory and acoustic cues such as the intonation pattern. In computational systems such as Text-to-Speech (TTS), however, how can the correct intonation pattern be assigned to a word such as okay if the system cannot decide which function it is serving in the utterance?
Agustin Gravano, Julia Hirschberg and Štefan Beňuš analysed a group of words classified as Affirmative Cue Words (ACWs) for their acoustic and prosodic similarities and differences, to see how these properties could help with computational disambiguation. The functions of ACWs include showing agreement, showing interest and signalling the beginning or end of a topic. Gravano, Hirschberg and Beňuš found ten potentially different functions of ACWs, though only okay and alright are versatile enough to be used in all ten ways.
The data they used came from a recorded corpus known as the Columbia Games Corpus, which contains speech from 13 speakers of Standard American English. They identified 5,456 ACWs, which represented 7.8% of the speech. Of these, the six most common were alright, okay, yeah, mm-hm, uh-huh and right. However, to make sure the data was not dominated by one speaker, which would skew the results, the data was levelled so that it represented all speakers as equally as possible.
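To make the idea of levelling concrete, here is a minimal sketch of one way to stop a single talkative speaker from dominating a sample: downsample so that each speaker contributes the same number of tokens. This is only an illustration and not the procedure used in the study; the function name `balance_by_speaker`, the `(speaker, word)` pair format and the toy data are all assumptions made for the example.

```python
import random
from collections import defaultdict


def balance_by_speaker(tokens, seed=0):
    """Downsample so every speaker contributes the same number of tokens.

    `tokens` is a list of (speaker_id, word) pairs. This format is a
    hypothetical one chosen for the example, not the corpus format.
    """
    by_speaker = defaultdict(list)
    for speaker, word in tokens:
        by_speaker[speaker].append((speaker, word))

    # The speaker with the fewest tokens sets the quota for everyone.
    quota = min(len(items) for items in by_speaker.values())

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    balanced = []
    for speaker in sorted(by_speaker):
        balanced.extend(rng.sample(by_speaker[speaker], quota))
    return balanced


# Toy data: speaker A produces far more cue words than B or C.
tokens = [("A", "okay")] * 5 + [("B", "right")] * 2 + [("C", "yeah")] * 3
balanced = balance_by_speaker(tokens)
```

Downsampling throws data away, so with a real corpus one might prefer weighting instead; the sketch only shows what "representing all speakers equally" can mean in practice.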
Looking at the position of the words, they found that alright and okay were used in similar positions in the utterance, and that mm-hm and uh-huh also showed similar distribution patterns: they frequently stood alone, with a pause on either side, and their primary functions were either to act as backchannels (showing that the listener was following) or to show agreement. The researchers suggest that the members of each pair can therefore be used interchangeably.
In addition to position and function, they looked at factors such as intonation pattern, intensity, duration, pitch and voice quality. In doing this, they could identify the acoustic and prosodic features which were more likely to indicate a particular function of a single ACW.
Using this data, the researchers conducted a number of experiments which tested the ability of computational systems to recognise and correctly classify the function of the ACWs in their data. They noted that their approach allowed for the inclusion of a wide range of available information, which led to greater accuracy in classification. However, of all the factors incorporated into the experiments, the ACW’s position in the intonational phrase turned out to be the most important in disambiguating the function of these words. For example, right was the only ACW that could be used with a checking function (e.g. in tags such as “it’s there, right?”), and this was one of the few instances where an ACW could be found at the end of an intonational phrase.
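As a rough illustration of how position can help with disambiguation, the toy function below encodes just two of the patterns described in this post as hand-written rules. It is a sketch under stated assumptions, not the classifiers tested in the study: the function name `guess_function`, the position labels and the returned function labels are all hypothetical, and a real system would combine position with the acoustic and prosodic features mentioned above.

```python
def guess_function(word, position, is_standalone):
    """Guess the function of an affirmative cue word from its position.

    word          -- the cue word, e.g. "right" or "mm-hm"
    position      -- "initial", "medial" or "final" within the intonational
                     phrase (hypothetical labels chosen for this example)
    is_standalone -- True if the word has a pause on either side
    """
    # A tag-like "right" at the end of a longer phrase suggests checking.
    if word == "right" and position == "final" and not is_standalone:
        return "check"
    # mm-hm and uh-huh standing alone are typically backchannels or agreement.
    if word in {"mm-hm", "uh-huh"} and is_standalone:
        return "backchannel or agreement"
    # Everything else needs more evidence (intonation, duration and so on).
    return "undetermined"


tag_example = guess_function("right", "final", False)        # "it's there, right?"
listener_example = guess_function("uh-huh", "initial", True)
open_example = guess_function("okay", "initial", False)
```

The last call comes back as "undetermined", which is exactly the problem the study addresses: for versatile words like okay, position alone narrows the options but rarely settles them.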
They conclude that incorporating a wide range of acoustic and prosodic characteristics into computational testing and subsequent programming will allow spoken dialogue systems (such as those which aim to both recognise and produce speech) to improve their performance and take one step closer to emulating human speech.
__________________________________________________
Gravano, A., Hirschberg, J. and Beňuš, Š. (2012) Affirmative Cue Words in Task-Oriented Dialogue. Computational Linguistics 38: 1-39
doi:10.1162/COLI_a_00083