Cognitive Studies/Psychology/Computer Science 201

Human-machine interaction:
The case of speech-recognition (SR) software


Michael J. Owren and Lisa M. Lavoie
Departments of Psychology and Linguistics, Cornell University



Recent improvements in speech-recognition (SR) software for personal computers are ushering in a new era in human-machine interaction. While automatic recognition of a very limited vocabulary has been possible for some time, it has proven very difficult to program computers to understand continuous spoken language as it would normally be produced--rapidly and with a large vocabulary. One reason it has been challenging to design such a system is that while human listeners perceive speech to be a well-behaved, orderly sequence of individual linguistic units, the acoustic signal itself is actually quite variable. For instance, when a given talker produces the words "tea" and "taught," a listener hears a "t" at the beginning of both words, but the acoustic properties of the two "t" sounds turn out to be rather different (Fischer-Jorgensen 1954, reprinted 1967; Olive, Greenwood and Coleman 1993 and references found within these works). The sounds of speech are highly susceptible to "coarticulation," which is what happens when the vocal tract configuration during one sound is a blend of that required for the current sound and for neighboring sounds. While you're pronouncing the "t" in "tea," your mouth is in the configuration for producing "ea," and that is quite a different configuration from the one for "aught" in "taught" (see the spectrograms below).


Spectrograms of "tea" (on left) and "taught" (on right)

Because speech scientists had expected to readily be able to identify specific consistent acoustic characteristics for any given speech sound (no matter who was talking and what the surrounding speech sounds were), they dubbed this observation the "lack of invariance problem." It is not that the acoustic attributes of a given speech sound are arbitrary, but rather that the relationship between the incoming physical signal and corresponding psychological interpretation is surprisingly ambiguous in many cases. While human speech recognition is not completely flawless, it is typically effortless. Since the perceptual process through which a human listener can interpret that ambiguous incoming signal is not well understood, it is especially difficult to program a computer to do it.

In this module, we will examine some of the basic attributes of speech in the context of a well-known SR program, Dragon NaturallySpeaking™. This software is currently being marketed as an automated dictation device and has received rave reviews. In order to use the software, however, the user must first train it by reading specific passages to the program. This training allows the software to represent important aspects of that particular talker's speech by mapping this acoustic input onto intended utterances that it already "knows." In addition, the program is designed to produce plausible interpretations of problematic words by making probabilistic computations based on the other words in the sentence. While this program (and other similar systems) is a vast improvement over previously available technology, it also frequently fails, and its performance can vary dramatically from talker to talker. The specific effect of interest for this module is that speech produced by males is easier for the software to recognize than speech produced by females. We will examine why that is so by first documenting that performance difference, and then attempting to improve the software's performance by specifically modifying certain speech characteristics during the dictation process.

The module consists of several components. In the first, you will test the "out-of-box" performance of Dragon NaturallySpeaking™ without the standard training period, using a standardized set of testing stimuli. At this point, you will also test the software by modifying your pitch to see if performance improves with different pitches. Second, you will examine some important aspects of speech production and perception using an interactive, tutorial course available on CD-ROM. This course will explain how sounds are represented in spectrographic form (showing energy distribution patterns and how they change over time), and how to extract important acoustic parameters of the speech signal. After that, you will record some speech and analyze it using Signalyze, an acoustic analysis program, and chart your individual vowel spaces. Then you will increase the distinctiveness of your vowels and check the program's performance again. Finally, you will carry out the extended training recommended by the manufacturer and test Dragon's performance again. In all cases, it will be of particular interest to compare the relative performance for male and female talkers.

In this module, you will use both PCs and Macintosh-compatible computers. Dragon NaturallySpeaking™ is implemented on the five PCs in the CSIC laboratory, which will also be used for the CD-ROM course. You will use the Macintosh-compatibles to enter the performance data into a spreadsheet and to carry out acoustic analyses of your speech with the Signalyze program. Instructions for using the various programs appear as you need them in the module, either in the exercises or in an appendix.

Week 1: Documenting baseline performance of SR software and modifying pitch

Reading assignment 1 (complete these readings in order before beginning the first component)

Eisenberg, A. (1998, April 22)
Computers are starting to listen, and understand. The New York Times. (read pp. G1, G8).
Speech technology timeline. PC Magazine Online. March 10, 1998.
Ladefoged, P. (1993)
A course in phonetics. 3rd edition. NY: Harcourt, Brace, Jovanovich. (read pp. 1-15, 27, and 31)
Lavoie, L. (1998)
Why wreck a nice beach? Why recognize speech? Some important concepts for the speech recognition module. Unpublished manuscript. Included here as Appendix A.

Exercise 1

In the first exercise, performance of Dragon NaturallySpeaking™ is tested using a standardized set of testing stimuli. These stimuli have been selected to highlight both the sorts of words that are challenging to SR software and the sorts that are likely to demonstrate performance differences between the speech of males and females. The stimulus set is described in detail in Appendix B, which you should read before proceeding to the tests themselves; data sheets for the tests appear in Appendix E. For the first test, the program will only be minimally trained. For the second test, you will alter your speaking pitch. This exercise involves use of one of the five PCs.

Start Dragon NaturallySpeaking™ from Windows

To begin, press the START button in the lower left of the screen and select PROGRAMS. Then select NaturallySpeaking Users. Finally, select the folder that has been set up under your name, and choose NaturallySpeaking Personal Edition to start the program.

Perform preliminary adjustment

Select the ADJUST VOLUME ONLY item, and follow the directions for adjusting the sound output level to a comfortable level. Follow the directions to test the sound input from the microphone by reading the paragraph that is presented--this test checks for input and then provides feedback on sound quality.

Continue to adjust the microphone position until sound quality is at least average. Try to get above average quality by redoing the test, but average is acceptable. To achieve the acceptable sound quality, you might need to put the microphone quite close to your mouth or speak slightly louder than usual. Remember that you will need to be able to recreate the same input quality when using the program later.

General training

Follow the directions for calibration, which involves reading several sentences in response to prompts. The system now calibrates itself and opens the input screen, which is a standard word-processing graphical interface. Do not move on to the longer training session at this point; instead, click TRAIN LATER. The Dragon NaturallySpeaking™ interface now opens automatically.

Before beginning the first test, practice dictating some sentences. Click on the microphone icon on the toolbar above the text input area to activate the input process. At the end of each word, phrase, or sentence, say "New-Paragraph."

Testing system performance before training ("Before" condition)

Proceed through each of the 5 tests at your own pace, reading each item in a natural speaking voice. When test words are presented in the context of a phrase or sentence, they are underlined. However, read the whole phrase or sentence as normally as possible, without unduly emphasizing the test word.

For each item, record whether or not the program transcribed the target word correctly. If you yourself make a mistake in reading, repeat that item. However, try not to have to repeat any items if you can avoid it, in order to achieve the most accurate and unbiased outcome. It is also a good idea to note exactly what the program transcribed for each of the items.

As you complete each test, tally the percent correct for that section on the datasheet. Once you have completed all 50 items, exit the program without saving the document (the program considers what it transcribes from your speech to be a document).
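The tally for one test section can be sketched in Python as follows. This is only an illustration and not part of the lab materials: the function name `percent_correct` and the sample outcomes are hypothetical, and the sketch assumes a section of 10 items.

```python
def percent_correct(outcomes):
    """Return the percentage of items transcribed correctly.

    `outcomes` holds one boolean per test item
    (True = the target word was transcribed correctly).
    """
    if not outcomes:
        raise ValueError("no items were scored")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical outcomes for one 10-item test section (7 of 10 correct).
section = [True, True, False, True, True, False, True, True, True, False]
print(percent_correct(section))
```

With 10 items per section, each score is automatically a multiple of 10, which is the form the datasheet expects.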

Entering results

We will be tabulating and analyzing results using the Statview program on a designated Macintosh-compatible machine (#32 in Uris 259). Enter the results in the document entitled "Speech Recognition Results." If the program is not already running with that document open, open it by selecting the "Speech Recognition Manual f" folder in the Apple Menu and releasing on "Speech Recognition Results" in that folder.

The datasheet is organized as rows and columns. The first column is marked "Subject," and you should enter your data in the row whose number matches your Subject number. Type your results into that row, being sure to enter the values in the appropriate columns. The columns are organized in the following order:

Variable Name                Description
Subject                      a unique whole number assigned to you (probably by last-name order)
Preliminary input quality    input quality evaluation (1 if average, 2 if above average)
Training input quality       same scoring as in Preliminary input quality
Sex                          enter Male or Female (StatView will complete this as soon as you type M or F)

For each of the variables listed below, enter % correct (this value should be a multiple of 10 between 0 and 100).

Carrier phrase testing
CP_neutral_before            %correct
CP_context_before            %correct
CP_overall_before            %correct
Function word testing
FW_isolated_before           %correct
FW_context_before            %correct
FW_overall_before            %correct
Vowels in single words
Dipth_before                 %correct
Tense_before                 %correct
Lax_before                   %correct
VSW_overall_before           %correct
Liq_R_before                 %correct
Liq_L_before                 %correct
Liq_overall_before           %correct
Continuous speech paragraph
CS_before                    %correct

Modifying your pitch

In the modified pitch test, female speakers will decrease their pitch (speak in a deep voice) but male speakers will increase their pitch (speak in a falsetto). These modifications, which should be fairly easy to achieve, test the software's performance in various pitch ranges. The data sheets for these tests appear in Appendix F. The modified pitch test uses fewer test items as you will see when you examine the sheets in Appendix F. Practice each modification first on both a few of the individual test words and a complete sentence. You are aiming for a one-third change in PITCH which you will later verify using Signalyze.
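Once you have pitch measurements, the one-third target can be checked with a short calculation such as the Python sketch below. It is an illustration only: the function names and the frequency values are hypothetical, and it assumes that the "one-third change" refers to the relative change in mean fundamental frequency measured in Hz.

```python
def relative_pitch_change(normal_hz, modified_hz):
    """Relative change in fundamental frequency (negative = lowered pitch)."""
    return (modified_hz - normal_hz) / normal_hz


def meets_target(normal_hz, modified_hz, target=1 / 3):
    """True if pitch moved by at least `target`, in either direction."""
    return abs(relative_pitch_change(normal_hz, modified_hz)) >= target


# Hypothetical talker lowering her pitch from 220 Hz to 145 Hz.
print(relative_pitch_change(220.0, 145.0), meets_target(220.0, 145.0))

# Hypothetical talker raising his pitch from 120 Hz to 150 Hz (only a 25% change).
print(relative_pitch_change(120.0, 150.0), meets_target(120.0, 150.0))
```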

This exercise is relevant to a language phenomenon known as "motherese," which has more recently been dubbed "infant-directed speech." As has now been widely noted, parents, older siblings, and other caretakers of infants and very young children often use a distinctive, "baby-talk" style of speaking when addressing these little ones. In fact, many pet owners use the same style of speech when addressing their beloved animals. When this style was first documented, the single most important finding was that infant-directed speech involves speaking with an unusually high fundamental frequency (pitch) and strongly exaggerating fundamental frequency changes. In other words, talkers use high-pitched voices with dramatic, wide sweeps that result in a unique, sing-song-like auditory quality (see Fernald 1992, for a recent review).

These findings have clear implications for the problem of automatic speech recognition. Specifically, we can hypothesize that the sorts of modifications exhibited by adult talkers when addressing offspring (and other beings!) that are not proficient language users can also improve the performance of SR software. Unlike the case of infant-directed speech, however, SR software appears to perform better with lowered, rather than raised vocal pitch.

Entering your results

Enter your data in the Statview database, as before. Again, the requisite variables have already been created in the program, and your particular results can be entered in the same row as in the earlier tests. Data analyses require that everyone complete the exercise.

Variable Name                Description
Carrier phrase testing
CP_neutral_pitch             %correct
CP_context_pitch             %correct
Function word testing
FW_context_pitch             %correct
Vowels in single words
Dipth_pitch                  %correct
Tense_pitch                  %correct
Lax_pitch                    %correct
VSW_pitch                    %correct

Week 2: Analysis of baseline SR software performance and training in speech production and analysis

Reading Assignment 2 (read this before beginning the second exercise)

Nygaard, L. C.; Sommers, M. S.; & Pisoni, D. B. (1994)
Speech perception as a talker-contingent process. Psychological Science, 5, 42-46.

Exercise 2

In this component, you will analyze the performance data from last week and complete an interactive tutorial on speech production and perception.

Analyzing the performance data

Data analysis is contingent on everyone in the class having completed both components of Exercise 1 and entered those data in the Statview datasheet. Those data will be checked by lab personnel and thereafter made available on several of the Macintosh-compatible computers. The two main issues to address at this point are, first, whether there was a difference in software performance for female versus male voices and, second, whether changing voice pitch had an effect on recognition accuracy.

The first question can be addressed in a number of ways, for instance by comparing software performance in each of the 9 testing conditions for which a mean value was calculated. To simplify the comparisons somewhat, test for differences by comparing results for females and males using the variables CP_overall_before, FW_overall_before, VSW_overall_before, Liq_overall_before, and CS_before. In other words, a given comparison should include each subject's mean value on the variable in question, and the t-test will compare a group composed of the females' mean values to a group composed of the males' mean values. Based on the pattern of outcomes, is there evidence of a consistent difference in software performance for female and male voices?
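The female-versus-male comparison can be sketched in Python as follows. This is only an illustration and not part of the lab materials: the lab itself uses StatView, the function name `two_sample_t` is made up here, the scores are hypothetical, and the pooled-variance (Student) form of the t-test is an assumption about which variant is intended. The sketch returns the t statistic and its degrees of freedom; a p-value would still have to be looked up in a t table or obtained from a statistics package.

```python
import math
import statistics


def two_sample_t(group_a, group_b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    # Weight each group's sample variance by its own degrees of freedom.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# Hypothetical CP_overall_before scores (% correct), for illustration only.
females = [50, 60, 40, 70, 50]
males = [70, 80, 60, 90, 70]

t, df = two_sample_t(females, males)
print(f"t = {t:.2f}, df = {df}")
```

A negative t here means that the first group (females) scored lower on average than the second group (males).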

The second question, concerning the effect of pitch changes, should be addressed separately for females and males. Here the variables in question are CP_overall, FW_context, and VSW. To make the comparison, the more powerful paired t-test can be used, as each subject is essentially being compared to themselves under the two testing conditions. To test the CP_overall variable, for instance, mean values for each of the female subjects on CP_overall_before should be compared as a group to the corresponding values on CP_overall_pitch. The other variables should be tested in the same way, and the outcomes for males should then be examined in a separate series of tests. Based on the pattern of outcomes, is there evidence of consistent differences in software performance in either or both cases?
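The paired comparison can be sketched in Python in the same spirit. Again this is an illustration rather than the lab's procedure: `paired_t` is a made-up name and the scores are hypothetical. The test works on each subject's difference between the two conditions, which is why it needs one score per subject in each condition.

```python
import math
import statistics


def paired_t(before, after):
    """Paired t statistic for matched scores, plus degrees of freedom."""
    if len(before) != len(after):
        raise ValueError("each subject needs a score in both conditions")
    # One difference per subject: modified-condition score minus baseline.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # Note: if every difference is identical, stdev is 0 and t is undefined.
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1


# Hypothetical scores (% correct) for five female subjects.
cp_before = [50, 60, 40, 70, 50]   # baseline condition
cp_pitch = [60, 60, 60, 80, 70]    # modified-pitch condition

t, df = paired_t(cp_before, cp_pitch)
print(f"t = {t:.2f}, df = {df}")
```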

Tutorial on speech production and perception

The self-paced tutorial course is implemented on a CD-ROM called Speech Production and Perception I (Sensimetrics Corporation). This CD-ROM is designed for PCs and can be run using any PC with multimedia capabilities. As there are only five PCs in the CSIC lab and there may be more students in the class, CD-ROMs will be available for you to check out and use in the public computer labs or at your own appropriately equipped PC if you need more time. We estimate that completing the tutorial will take about 2 hours.

If you use the CD-ROM in the CSIC lab, insert it into the CD-ROM drive. Click the START button on the desktop toolbar, go to the PROGRAMS menu and select CBCAP.

If you use the CD-ROM elsewhere, you will first need to install the CBCAP program, which simply coordinates access to the various modules on the CD-ROM. First insert the CD-ROM into the drive. Then use the START button on the toolbar to select the RUN window that is always used for installation on PCs. Type D:/SETUP in this window and hit return (if the CD-ROM drive has some letter other than D, enter that letter instead). The CBCAP program will now install itself and provide prompts if any further action is needed.

You can use the CD-ROM in PCs in any of the following CIT labs: MVR, Sibley, Upson, Noyes Community Center, RPCC and Uris Library. The precise locations and hours of these labs are listed on the web. You may install the CBCAP program on one of the lab PCs, although it might be deleted from the machine later. You will need to get headphones from the lab attendant to listen to the CD-ROM.

The Speech Production and Perception I tutorial consists of 3 components. Our focus will be on material found under the ACTIVITIES heading. However, you should also be aware that reference material is available in the LIBRARY, including a glossary of important terms, references to further readings, and a cross-index to textbooks that are commonly used in speech science. In addition, the LAB component provides acoustic analysis tools. Once you have completed the assigned ACTIVITIES, you will be familiar with these basic tools. Therefore, feel free to explore the LAB, for instance recording and examining the rate, pitch, and formant characteristics of your own and others' speech. These are the characteristics that will be important in Exercise 3, when you explicitly attempt to modify your speech in order to improve the SR software performance. Have fun exploring the CD-ROM. There are many excellent demonstrations and sounds, such as a chart of phonetic symbols that lets you hear each of the sounds by clicking. When you're in the ACTIVITIES section, there is a contents button on the right side of the screen that is helpful in navigating the CD-ROM. And the right arrow at the bottom right corner of the screen lets you move quickly when you are skipping portions of the CD-ROM.

Wherever you choose to use the CD-ROM, you should complete the following components of the ACTIVITIES section. Not all of the activities are listed; some are skipped. To skip a topic, click the right arrow at the bottom right corner of the screen.


Vowel acoustics

Consonant acoustics

Vowel perception

Week 3: Measuring your speech with Signalyze, modifying formants and testing the software

Reading Assignment 3 (read this before beginning the third exercise)

Kuhl, P. K.; Andruski, J. E.; Chistovich, I. A., Chistovich, L. A.; Kozhevnikova, E. V.; Ryskina, V. L.; Stolyarova, E. I.; Sundberg, U.; & Lacerda, F. (1997).
Cross-language analysis of phonetic units in language addressed to infants. Science, 277, 684-686.

Exercise 3

In this exercise, you will measure your pitch with Signalyze, practice modifying your formants, make formant measurements in Signalyze and then retest Dragon NaturallySpeaking™ with more extreme formants.

Measure your pitch range

Instructions for using Signalyze appear in Appendix C. Appendix G, the pitch range data sheet, provides further instructions on measuring your pitch.

Altering formants

In addition to being high-pitched, infant-directed speech is typically marked by having more carefully enunciated vowel sounds than is speech aimed at adults. Kuhl et al. (1997) propose that one function of infant-directed speech is to increase its intelligibility relative to adult-directed speech, making the language-learning task easier for preverbal infants and children. They highlight the possible importance of increasing the acoustic contrast among the various vowels occurring in a given language. This increased differentiation of the vowel sounds results in an enlarged overall vowel space, as measured by the frequency values of the first two formants. This effect was documented by Kuhl et al. when they compared the /a/, /i/, and /u/ vowels of mothers from three different cultures who were recorded while producing both infant-directed speech and adult-directed speech.

These findings have clear implications for the problem of automatic speech recognition. Specifically, we can hypothesize that the sorts of modifications exhibited by adult talkers when addressing offspring (and other beings!) that are not proficient language users can also improve the performance of SR software.

In this exercise, you will try to improve performance of the SR software by enlarging the vowel space used during speech production. Testing will occur using a subset of the standardized stimulus set, the results of which can then be compared to the data obtained earlier. Before doing the tests with the extreme formants, you will practice modifying the formants and check those modifications with Signalyze.

Measuring normal and extreme formants

Appendix H, the formant data sheet, gives instructions and specific items to measure. Appendix C tells you how to measure the formants with Signalyze.

The dimensions of interest are the frequencies of the first two formants (F1 and F2) of the talker's point vowels, /i/, /a/, and /u/ (hereafter referred to as "FORMANTS"). This final round therefore requires 2 sets of tests, each of which will be conducted as before, except with fewer items. All talkers should increase the distinctiveness of their vowel sounds by changing the formant values used. Practice each modification first on both a few of the individual test words and a complete sentence. Do so a few times, while listening to each of the two dimensions of the utterance. You might get the best results if you imagine yourself to be interacting with an infant or a foreigner who is not a native English speaker. However, try to avoid raising your pitch too much when you are increasing the distinctiveness of your formants. Have someone else listen to see if your vowels sound more distinctive.

Aim for at least a 10% change in FORMANTS (calculated as the mean change for both F1 and F2--whether a decrease or increase in the formants).
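The 10% target can be checked with a calculation like the Python sketch below. It is an illustration only: the function name and formant values are hypothetical, and it assumes that a decrease and an increase count equally, so the absolute percent change is averaged over F1 and F2.

```python
def mean_formant_change(normal, extreme):
    """Mean absolute percent change across F1 and F2.

    `normal` and `extreme` are (F1, F2) pairs in Hz for the same vowel.
    """
    changes = [abs(e - n) / n * 100 for n, e in zip(normal, extreme)]
    return sum(changes) / len(changes)


# Hypothetical /i/ measurements: F1 falls by 12%, F2 rises by 14%.
normal_i = (300, 2200)
extreme_i = (264, 2508)

change = mean_formant_change(normal_i, extreme_i)
print(f"mean change = {change:.1f}%  target met: {change >= 10}")
```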

Make all of the measurements asked for in Appendix H, calculate the means that are indicated, and plot these values on the formant chart included in Appendix I. First plot the formant values for your normal speech and then plot the values for the extreme formants in another color. Compare these plots to the plots for English given by Kuhl et al. (1997), which are shown below.

Formant frequency plots (F1, F2) for English (left), Russian (center), and Swedish (right)
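If you want a single number to go with your plots, one simple option is the area of the triangle whose corners are your /i/, /a/, and /u/ vowels in the F1-F2 plane. The Python sketch below computes that area with the standard triangle-area (shoelace) formula. It is an illustration only: the function name and all formant values are hypothetical, and using triangle area as the summary measure is an assumption rather than a requirement of this exercise.

```python
def vowel_triangle_area(i, a, u):
    """Area (in Hz squared) of the triangle with corners at the
    (F1, F2) points of the vowels /i/, /a/, and /u/."""
    (x1, y1), (x2, y2), (x3, y3) = i, a, u
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2


# Hypothetical mean (F1, F2) values in Hz for one talker.
normal = {"i": (300, 2300), "a": (750, 1200), "u": (350, 900)}
extreme = {"i": (270, 2500), "a": (850, 1250), "u": (300, 800)}

area_normal = vowel_triangle_area(normal["i"], normal["a"], normal["u"])
area_extreme = vowel_triangle_area(extreme["i"], extreme["a"], extreme["u"])
print(area_normal, area_extreme, area_extreme / area_normal)
```

A ratio above 1 indicates that the extreme-formant vowels span a larger vowel space than the normal ones.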

Conducting modified formant tests with Dragon NaturallySpeaking™

Once you are confident of modifying your formants, begin the actual tests which are given in Appendix J. This time, use the extreme formants for the entire sentence, not just the target word. Some of the early testers of this module found that although performance on the target words did not necessarily improve, there was often improvement in the surrounding sentence. It is again a good idea to record precisely what the software transcribes for each of the test items.

Entering your results

Enter your data in the Statview database, as before. The requisite variables have again already been created in the program, and your particular results can be entered in the same row as in the earlier tests. Data analyses require that everyone complete the exercise.

Variable Name                Description
Carrier phrase testing
CP_neutral_formant           %correct
CP_context_formant           %correct
Function word testing
FW_context_formant           %correct
Vowels in single words
Dipth_formant                %correct
Tense_formant                %correct
Lax_formant                  %correct
VSW_formant                  %correct

Analyzing the data

Data analysis here is identical to the analyses conducted earlier, except that here the formant characteristics have been modified instead of pitch. The issue to address, then, is whether or not changing the formants had an effect on software performance. Use the same approach as before, comparing the outcomes for the variables CP_overall, FW_context, and VSW, for females and males separately.

Week 4: Training the SR software and retesting

Reading Assignment 4 (complete these readings in order before beginning this exercise)

Hagiwara, R. (1995)
Acoustic realizations of American /r/ as produced by women and men. UCLA Working Papers in Phonetics. Los Angeles: Phonetics Laboratory, Department of Linguistics, UCLA. (read pp. 2, 18-40, and 52-59)
Sommers, M. S.; Nygaard, L. C.; & Pisoni, D. B. (1994)
Stimulus variability and spoken word recognition: Effects of variability in speaking rate and overall amplitude. Journal of the Acoustical Society of America, 96, 1314-1324.
Bradlow, A. R.; Torretta, G. M.; & Pisoni, D. B. (1996)
Intelligibility of normal speech I: Global and fine-grained acoustic-phonetic talker characteristics. Speech Communication, 20, 255-272.

Exercise 4

In this exercise, you finally train the software as the directions indicate. The other tests tried to improve recognition without the individualized training, but this time you will fully train the software on your voice.

Training the SR software

Open Dragon NaturallySpeaking™ again. After the preliminary adjustments and calibrations, go on to the longer training session. This session requires more than 30 minutes of reading in the same manner as in the preliminary training, so plan to spend some time on the process. You will be following prompts as before. Once you've completed the reading component, the system requires about 15 minutes for calibration to your voice. Leave the computer running and undisturbed during this period.

Testing performance of the trained system ("After" condition)

Test the system in exactly the same manner as in the Before condition. Data sheets for the After condition appear in Appendix K. Again, proceed through the 5 tests at your own pace, reading the items in a natural voice. Record the outcome for each test on the datasheet, and when all 50 items have been completed, exit the program without saving the document, and tabulate percent-correct for each test. Again, it is a good idea to record exactly what the software transcribes for each of the sentences. For the After condition, enter values in the spreadsheet exactly as you did for the Before condition (for the corresponding variables, shown below):

For each of the variables listed below, enter % correct (this value should be a multiple of 10 between 0 and 100).

Variable Name                Description
Carrier phrase testing
CP_neutral_after             %correct
CP_context_after             %correct
CP_overall_after             %correct
Function word testing
FW_isolated_after            %correct
FW_context_after             %correct
FW_overall_after             %correct
Vowels in single words
Dipth_after                  %correct
Tense_after                  %correct
Lax_after                    %correct
VSW_overall_after            %correct
Liq_R_after                  %correct
Liq_L_after                  %correct
Liq_overall_after            %correct
Continuous speech paragraph
CS_after                     %correct

Analyzing the data

Here, there are again two main issues to address. The first is whether or not training made a significant difference to software performance, and the second is whether any change that occurred was comparable for females and males. The first issue can be addressed using paired t-tests, as before. Data from females and males should be examined separately, for instance comparing outcomes in each of the nine testing conditions used in the statistical tests conducted in Week 2.

There are a number of ways in which to address the second issue. For instance, one can simply compare the number of statistically significant differences that occur in the Before and After conditions for females and males, respectively. Another, more in-depth approach would be to create difference scores for each variable and test whether these differences were on average greater for males or females. For instance, subtracting the CP_overall_before value for a given female subject from the same individual's CP_overall_after value produces a score on a new variable that might be called CP_overall_improvement. Calculating this score for all the females and all the males produces two sets of such scores, which can then be compared using a two-sample t-test.
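The difference-score approach can be sketched in Python as follows. This is an illustration only: the function name and all scores are hypothetical. The sketch builds the CP_overall_improvement scores for each group and prints the group means; the two resulting lists are what would be passed to a two-sample t-test.

```python
import statistics


def improvement_scores(before, after):
    """Per-subject improvement: After score minus Before score."""
    if len(before) != len(after):
        raise ValueError("each subject needs a Before and an After score")
    return [a - b for b, a in zip(before, after)]


# Hypothetical CP_overall scores (% correct), one entry per subject.
female_before = [50, 60, 40, 70, 50]
female_after = [80, 90, 70, 90, 80]
male_before = [70, 80, 60, 90, 70]
male_after = [90, 90, 80, 100, 90]

female_gain = improvement_scores(female_before, female_after)
male_gain = improvement_scores(male_before, male_after)

print("female improvement:", female_gain, "mean", statistics.mean(female_gain))
print("male improvement:  ", male_gain, "mean", statistics.mean(male_gain))
```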

As there are a number of possible approaches, you can make your own determination as to how to proceed. Think carefully both about which variables to test and about how to conduct those tests. The most important consideration is that you want to draw the most accurate conclusion possible and to present that conclusion in a convincing way in your final report.

Final report for the speech recognition module

After you have completed the exercises, you will be ready to prepare a final written report. This write-up should be in standard journal article format--select a journal in a discipline of interest to you and follow their instructions to contributors. If you have any doubts about what format to use, follow the guidelines of the American Psychological Association (APA), which is the standard for many disciplines, including psychology and linguistics. Lab personnel can direct you to sources that describe the APA style.

Be sure to address the results of the statistical comparisons that you have conducted. The discussion in Appendix B about the tokens may be helpful in deciding on particular issues that should be discussed in the lab report.

A very important aspect of the lab write-up is to draw conclusions that the data support. Do not be afraid to make conclusions based on the data that you and your colleagues in the class have collected. Discuss how you gathered the data so that the reader knows it was done carefully and methodically. Make your findings perfectly clear to the reader and explicitly draw the conclusions that are indicated by the empirical data.

As part of your final report, you should include the mean values for fundamental frequency and formant measurements. Appendix I includes forms for charting both your normal and modified formants.

In preparing your report, you will need to consider some specific questions. Everyone should answer the following three questions and choose three others to address.

  1. How do male and female voices compare in terms of correct recognition?
  2. Did the training of the software increase correct recognition?
  3. Did the voice modifications increase correct recognition?

For the three other questions, you may use some of these or think of your own.

  1. How was the recognition of final consonants?
  2. What was the effect of including a carrier phrase or an informative context?
  3. Was there an effect of variation in individual fundamental frequency within the male and female groups?
  4. How did the manipulations of vowel space impact recognition?
  5. How did your speech change when you produced infant-directed speech? For example, did your vowels get longer and did your pitch get higher?