Appendix B: The speech stimuli

The stimuli in this module are designed to help you home in on areas of probable difficulty for the software. While the stimuli are divided into five fairly independent categories, the ensemble provides enough data to perform other analyses. For example, you will have enough tokens of the point vowels [i, u, a] to plot your own vowel space. The variety of stimuli gives you the flexibility to analyze the data as you see fit. The answers to the questions raised here are not necessarily known--outcomes may be as different as individual voices are.

Stimuli are presented in 5 main categories, with 10 examples (a.-j.) per category. The categories are: 1. Words in carrier phrases, 2. Function words, 3. Vowels in single words, 4. Liquids [r, l], and 5. Continuous speech.

Each of the stimulus types illustrates a general point. The words in carrier phrases appear with both neutral and informative contexts, particularly highlighting potential differences in recognizing the sounds "s" and "sh" when produced by males and females. The function words are spoken both in isolation and in context, with better performance expected for function words appearing in context, due to the addition of information provided by transitions from and to surrounding words. The vowel stimuli allow one to examine whether the duration of a vowel sound is important to identification. Testing with liquid sounds will demonstrate some of the difficulties with sounds with both low-frequency and low-amplitude energy. Finally, the continuous speech paragraph, which focuses on ambiguous word boundaries, will help illustrate some of the broad issues that arise in implementing speech recognition software, as well as providing baseline information on each individual's normative speech. Together, these stimuli are designed to draw attention to a large number of issues and problems in the area of speech recognition--so keep looking and thinking.

Interacting with the SR software

In general, the SR software will keep processing until you force it to stop. It analyzes on a sentence level, but you will not always be working on a sentence level. The best way to indicate breaking points to the software is to say "NEW PARAGRAPH" after each stimulus. This forces the software to produce an outcome quickly and displays the results line by line.

1. Words in carrier phrases

Phoneticians often ask subjects to read words in carrier phrases rather than in isolation so that the speech will be more natural. For the purposes of speech recognition, the transitions between words can help disambiguate words. With a very neutral carrier phrase, you can test the behavior of the software when the context offers no help with disambiguation. An informative carrier phrase might help the software correctly identify the words by giving additional information about the meaning or by cueing a common phrase. Here the question of interest is whether an informative carrier phrase necessarily improves word recognition.

Neutral carrier phrase
a. They say sip is in the dictionary.
b. They say ship is in the dictionary.
c. They say sheep is in the dictionary.
d. They say saw is in the dictionary.
e. They say gnaw is in the dictionary.
Informative carrier phrase
f. Take a sip of the tea.
g. The ship sails through the water.
h. Goats and sheep are good for milking.
i. I cut down the tree with a saw.
j. The teething baby wanted to gnaw on the biscuit.
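If you type the software's output into a list, the neutral and informative conditions can be compared with a simple tally. The following is a minimal sketch, not part of the original module: the function names and the example transcripts are hypothetical, and it assumes a target counts as recognized only when it appears as a whole word in the transcript.

```python
import re
from collections import defaultdict


def target_recognized(target, transcript):
    """Return True if the target appears as a whole word in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    return target.lower() in words


def accuracy_by_condition(trials):
    """Compute the proportion of recognized targets for each condition.

    trials: iterable of (condition, target, transcript) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, target, transcript in trials:
        totals[condition] += 1
        if target_recognized(target, transcript):
            hits[condition] += 1
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Hypothetical transcripts, invented for illustration only (not real results).
trials = [
    ("neutral", "sip", "They say ship is in the dictionary."),
    ("neutral", "ship", "They say ship is in the dictionary."),
    ("informative", "sip", "Take a sip of the tea."),
    ("informative", "ship", "The ship sails through the water."),
]

print(accuracy_by_condition(trials))
```

Whole-word matching matters here because "sip" would otherwise be found inside other words; a substring test would overstate accuracy.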

2. Function words

This set again consists of words that are presented either with or without an informative, semantically related sentence context. In this case, however, function words are involved, highlighting the potential role of semantic context for recognizing speech segments that have little meaning in and of themselves. Function words express the relationship between other words and phrases in a sentence but don't convey much meaning by themselves. Examples of function words are conjunctions like "and" and "or," prepositions like "in" and "of," articles like "a" or "the," and personal pronouns like "I" or "them." Function words are not usually responsible for introducing new information and are not often used contrastively. These words are often unstressed and are usually very short. Their meaning depends heavily on the surrounding content words, and they may actually have very low levels of acoustic energy due to some of the sounds that make them up. The overall result is that function words tend to be less distinctive than content words. Nonetheless, accurate recognition of function words is crucial to speech comprehension, as they are important in understanding sentence structure.

Isolated function words
a. the
b. of
c. my
d. on
e. this
Function words in context
f. It's the dumbest thing I've ever heard of.
g. On the top of the building, there's a pool.
h. Give me my umbrella please.
i. Put it on the table.
j. It's this stupid headache again.

3. Vowels in single words

For these words, there is little information for the program to rely on. Diphthongs and tense vowels are longer than lax vowels. Will a longer vowel lead to increased correct recognition?

Diphthongs
a. pie
b. toy
c. cow
Tense vowels
d. key
e. pa
f. two
Lax vowels
g. put
h. pet
i. caught
j. tip

4. Liquids [r, l]

Again, this set provides very little information for the program to rely on. The sounds [l] and [r] have different realizations depending on their position in a word. How does the software do with these sounds in various word positions?

Rhotic liquid [r]
a. rude
b. earring
c. strap
d. rose
e. ruse
Lateral liquid [l]
f. full
g. fool
h. file
i. vile
j. laws

5. Continuous speech

Here, fluent speech is tested, highlighting the challenges posed by interpreting a continuous signal, rather than differences between male and female talkers per se. The SR software's ability (or lack thereof) to disambiguate tricky word junctures in the absence of pauses is the focus. Target words are underlined.

Here the question of interest is whether the software will be able to use a rather tricky context to figure out what to transcribe. When scoring the outcomes for these stimuli, count performance only on the underlined words. Ignore the rest of the material for scoring purposes, though you might like to note what the software transcribes there, because it may help explain some of the mistakes. The bracketed words illustrate possible ambiguities that may trick the software.

  1. Dear Zoo Supporters, I am writing to let you, too [two], know about our efforts
  2. that include a whole dear [hole deer] family of zoo enthusiasts.
  3. An aim [a name] of ours is holding a benefit,
  4. not black tie, but white tie [why tie].
  5. This will be a seashore event, so why wreck a nice beach [why recognize speech]?
  6. Instead we will sail in an ark [arc] of shelled creatures--as many as possible.
  7. With the tools at your disposal, won't you help a snail aboard? [us nail a board]
  8. Benefactors give new clear [nuclear] power with the energy brought to our efforts.
  9. Please go if the future you value is theirs [there's].
  10. P.S. If you'll be flying to the airport in the dark, may you have a safe flight [light].
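Scoring only the target words can be sketched as a check for each target phrase, as a contiguous run of words, in the software's transcript. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, the target phrases below are assumed to be the words just before each bracketed alternative, and the transcripts are invented rather than real software output.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def contains_phrase(transcript, phrase):
    """Return True if the phrase occurs as a contiguous word sequence."""
    words = tokenize(transcript)
    target = tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def score_targets(items):
    """Count recognized targets.

    items: list of (target_phrase, transcript) pairs.
    Returns (hits, total).
    """
    hits = sum(contains_phrase(transcript, phrase) for phrase, transcript in items)
    return hits, len(items)


# Assumed targets paired with hypothetical transcripts (illustration only).
items = [
    ("wreck a nice beach", "so why recognize speech"),
    ("new clear", "Benefactors give new clear power"),
]

print(score_targets(items))
```

Material outside the target phrase does not affect the score, which matches the instruction to ignore the rest of each line when counting.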