Because speech scientists had expected to readily be able to identify specific consistent acoustic characteristics for any given speech sound (no matter who was talking and what the surrounding speech sounds were), they dubbed this observation the "lack of invariance problem." It is not that the acoustic attributes of a given speech sound are arbitrary, but rather that the relationship between the incoming physical signal and corresponding psychological interpretation is surprisingly ambiguous in many cases. While human speech recognition is not completely flawless, it is typically effortless. Since the perceptual process through which a human listener can interpret that ambiguous incoming signal is not well understood, it is especially difficult to program a computer to do it.
In this module, we will examine some of the basic attributes of speech in the context of a well-known SR program, Dragon NaturallySpeaking. This software is currently being marketed as an automated dictation device and has received rave reviews. In order to use the software, however, the user must first train it by reading specific passages to the program. This training allows the software to represent important aspects of that particular talker's speech by mapping this acoustic input onto intended utterances that it already "knows." In addition, the program is designed to produce plausible interpretations of problematic words by making probabilistic computations based on the other words in the sentence. While this program (and other similar systems) is a vast improvement over previously available technology, it also frequently fails, and its performance can vary dramatically from talker to talker. The specific effect of interest for this module is that speech produced by males is easier for the software than speech produced by females. We will examine why that is so by first documenting that performance difference, and then attempting to improve the software's performance by specifically modifying certain speech characteristics during the dictation process.
The module consists of several components. In the first, you will test the "out-of-box" performance of Dragon NaturallySpeaking without the standard training period, using a standardized set of testing stimuli. At this point, you will also test the software by modifying your pitch to see if performance improves with different pitches. Second, you will examine some important aspects of speech production and perception using an interactive, tutorial course available on CD-ROM. This course will explain how sounds are represented in spectrographic form (showing energy distribution patterns and how they change over time), and how to extract important acoustic parameters of the speech signal. After that, you will record some speech and analyze it using Signalyze, an acoustic analysis program, and chart your individual vowel spaces. Then you will increase the distinctiveness of your vowels and check the program's performance again. Finally, you will carry out the extended training recommended by the manufacturer and test Dragon's performance again. In all cases, it will be of particular interest to compare the relative performance for male and female talkers.
In this module, you will use both PCs and Macintosh-compatible computers. Dragon NaturallySpeaking is implemented on the five PCs in the CSIC laboratory, which will also be used for the CD-ROM course. You will use the Macintosh-compatibles to enter the performance data into a spreadsheet and to carry out acoustic analyses of your speech with the Signalyze program. Instructions for using the various programs appear as you need them in the module, either in the exercises or in an appendix.
In the first exercise, performance of Dragon NaturallySpeaking is tested using a standardized set of testing stimuli. These stimuli have been selected to highlight both the sorts of words that are challenging to SR software and that are likely to demonstrate performance differences between the speech of males and females. The stimulus set is described in detail in Appendix B, which you should read before proceeding to the tests themselves, data sheets for which appear in Appendix E. For the first test, the program will only be minimally trained. For the second test, you will alter your speaking pitch. This exercise involves use of one of the five PCs.
Continue to adjust the microphone position until sound quality is at least average. Try to get above average quality by redoing the test, but average is acceptable. To achieve the acceptable sound quality, you might need to put the microphone quite close to your mouth or speak slightly louder than usual. Remember that you will need to be able to recreate the same input quality when using the program later.
Before beginning the first test, practice dictating some sentences. Click on the microphone icon on the toolbar above the text input area to activate the input process. At the end of each word, phrase, or sentence, say "New-Paragraph."
For each item, record whether or not the program transcribed the target word correctly. If you yourself make a mistake in reading, repeat that item. However, try not to have to repeat any items if you can avoid it, in order to achieve the most accurate and unbiased outcome. It is also a good idea to note exactly what the program transcribed for each of the items.
As you complete each test, tally the percent correct for that section on the datasheet. Once you have completed all 50 items, exit the program without saving the document (the program considers what it transcribes from your speech to be a document).
The datasheet is organized as rows and columns. The first column is marked "Subject" and you should enter your data in the row whose number corresponds to the Subject number corresponding to you. Type your results into that row, being sure to enter the values in the appropriate column. The columns are organized in the following order:
Variable Name | Description |
---|---|
Subject | a unique whole number assigned to you (probably by last-name order) |
Preliminary input quality | input quality evaluation (1 if average, 2 if above average) |
Training input quality | same scoring as in Preliminary input quality |
Sex | enter Male or Female (StatView will complete this as soon as you type M or F) |
BEFORE CONDITION: for each of the listed variables, enter % correct (this value should be a multiple of 10 between 0 and 100) | |
Carrier phrase testing | |
CP_neutral_before | %correct |
CP_context_before | %correct |
CP_overall_before | %correct |
Function word testing | |
FW_isolated_before | %correct |
FW_context_before | %correct |
FW_overall_before | %correct |
Vowels in single words | |
Dipth_before | %correct |
Tense_before | %correct |
Lax_before | %correct |
VSW_overall_before | %correct |
Liquids | |
Liq_R_before | %correct |
Liq_L_before | %correct |
Liq_overall_before | %correct |
Continuous speech paragraph | |
CS_before | %correct |
This exercise is relevant to a language phenomenon known as "motherese," which has more recently been dubbed "infant-directed speech." As has now been widely noted, parents, older siblings, and other caretakers of infants and very young children often use a distinctive, "baby-talk" style of speaking when addressing these little ones. In fact, many pet owners use the same style of speech when addressing their beloved animals. When first documented, the single most important finding was that infant-directed speech involves speaking with an unusually high fundamental frequency (pitch) and strongly exaggerating fundamental frequency changes. In other words, talkers use high-pitched voices with dramatic, wide sweeps that result in a unique, sing-song-like auditory quality (see Fernald 1992, for a recent review).
These findings have clear implications for the problem of automatic speech recognition. Specifically, we can hypothesize that the sorts of modifications exhibited by adult talkers when addressing offspring (and other beings!) that are not proficient language users can also improve the performance of SR software. Unlike the case of infant-directed speech, however, SR software appears to perform better with lowered, rather than raised vocal pitch.
Variable Name | Description |
---|---|
Carrier phrase testing | |
CP_neutral_pitch | %correct |
CP_context_pitch | %correct |
CP_overall_pitch | %correct |
Function word testing | |
FW_context_pitch | %correct |
Vowels in single words | |
Dipth_pitch | %correct |
Tense_pitch | %correct |
Lax_pitch | %correct |
VSW_pitch | %correct |
In this component, you will analyze the performance data from last week and complete an interactive tutorial on speech production and perception.
The first question can be addressed in a number of ways, for instance by comparing software performance in each of the 9 testing conditions for which a mean value was calculated. To simplify the comparisons somewhat, test for differences by comparing results for females and males using the variables CP_overall_before, FW_overall_before, VSW_overall_before, Liq_overall_before, and CS_before. In other words, a given comparison should include each subject's mean value on the variable in question, and a two-sample t-test will compare a group composed of the females' mean values to the males' mean values. Based on the pattern of outcomes, is there evidence of a consistent difference in software performance for female and male voices?
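As a rough illustration of the female-versus-male comparison, the sketch below computes a two-sample t statistic by hand. It assumes the classic pooled-variance (Student's) form of the test; the statistics package used in the lab may apply a different variant. The function name `two_sample_t` and all of the scores are hypothetical and serve only to show the arithmetic.

```python
import math
from statistics import mean, variance

def two_sample_t(group_a, group_b):
    """Return Student's two-sample t statistic (pooled variance) and its df.

    Assumes the two groups are independent and have similar variances.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df

# Hypothetical CP_overall_before scores (% correct), one value per subject.
female_scores = [40, 50, 60, 50]
male_scores = [70, 60, 80, 70]

t_stat, df = two_sample_t(female_scores, male_scores)
print(f"t = {t_stat:.2f}, df = {df}")
```

The t value and degrees of freedom would then be checked against a t table (or a statistics package) to obtain a p-value. A negative t here simply means the first group's mean is lower than the second's.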
The second question, concerning the effect of pitch changes, should be addressed separately for females and males. Here the variables in question are CP_overall, FW_context, and VSW. To make the comparison, the more powerful paired t-test can be used, as each subject is essentially being compared to themselves under the two testing conditions. To test the CP_overall variable, for instance, mean values for each of the female subjects on CP_overall_before should be compared as a group to the corresponding values on CP_overall_pitch. The other variables should be tested in the same way, and the outcomes for males should then be examined in a separate series of tests. Based on the pattern of outcomes, is there evidence of consistent differences in software performance in either or both cases?
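The paired comparison can be sketched in the same spirit. The example below computes a paired t statistic from per-subject differences between the two testing conditions. The function name `paired_t` and the scores are hypothetical; real values would come from the class spreadsheet.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """Return the paired t statistic and df for matched before/after scores."""
    if len(before) != len(after):
        raise ValueError("each subject needs one score in each condition")
    # Each subject contributes one difference score (after minus before).
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    std_err = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / std_err, n - 1

# Hypothetical scores for four female subjects (% correct).
cp_overall_before = [40, 50, 60, 50]
cp_overall_pitch = [50, 70, 70, 50]

t_stat, df = paired_t(cp_overall_before, cp_overall_pitch)
print(f"t = {t_stat:.2f}, df = {df}")
```

Because the test works on within-subject differences, stable differences between talkers drop out of the comparison, which is why the paired test is the more powerful choice here.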
If you use the CD-ROM in the CSIC lab, insert it into the CD-ROM drive. Click the START button on the desktop toolbar, go to the PROGRAMS menu and select CBCAP.
If you use the CD-ROM elsewhere, you will first need to install the CBCAP program, which simply coordinates access to the various modules on the CD-ROM. First insert the CD-ROM into the drive. Then use the START button on the toolbar to select the RUN window that is always used for installation on PCs. Type D:/SETUP in this window and hit return (if the CD-ROM drive has some letter other than D, enter that letter instead). The CBCAP program will now install itself and provide prompts if any further action is needed.
You can use the CD-ROM in PCs in any of the following CIT labs: MVR, Sibley, Upson, Noyes Community Center, RPCC and Uris Library. The precise locations and hours of these labs are listed on the web at http://www.cit.cornell.edu/labs. You may install the CBCAP program on one of the lab PCs although it might be deleted from the machine later. You will need to get headphones from the lab attendant to listen to the CD-ROM.
The Speech Production and Perception I tutorial consists of 3 components. Our focus will be on material found under the ACTIVITIES heading. However, you should also be aware that reference material is available in the LIBRARY, including a glossary of important terms, references to further readings, and a cross-index to textbooks that are commonly used in speech science. In addition, the LAB component provides acoustic analysis tools. Once you have completed the assigned ACTIVITIES, you will be familiar with these basic tools. Therefore, feel free to explore the LAB, for instance recording and examining the rate, pitch, and formant characteristics of your own and others' speech. These are the characteristics that will be important in Exercise 3, when you explicitly attempt to modify your speech in order to improve the SR software performance. Have fun exploring the CD-ROM. There are many excellent demonstrations and sounds, such as a chart of phonetic symbols that lets you hear each of the sounds by clicking. When you're in the ACTIVITIES section, there is a contents button on the right side of the screen that is helpful in navigating the CD-ROM. And the right arrow at the bottom right corner of the screen lets you move quickly when you are skipping portions of the CD-ROM.
Wherever you choose to use the CD-ROM, you should complete the following components of the ACTIVITIES section. Not all of the activities are listed; some are skipped. To skip a topic, click the right arrow at the bottom right corner of the screen.
In this exercise, you will measure your pitch with Signalyze, practice modifying your formants, make formant measurements in Signalyze and then retest Dragon NaturallySpeaking with more extreme formants.
In this exercise, you will try to improve performance of the SR software by enlarging the vowel space used during speech production. Testing will occur using a subset of the standardized stimulus set, the results of which can then be compared to the data obtained earlier. Before doing the tests with the extreme formants, you will practice modifying the formants and check those modifications with Signalyze.
The dimensions of interest are the frequencies of the first two formants (F1 and F2) of the talker's point vowels, /i/, /a/, and /u/ (hereafter referred to as "FORMANTS"). This final round therefore requires 2 sets of tests, each of which will be conducted as before, except with fewer items. All talkers should increase the distinctiveness of their vowel sounds by changing the formant values used. Practice each modification first on both a few of the individual test words and a complete sentence. Do so a few times, while listening to each of the two dimensions of the utterance. You might get the best results if you imagine yourself to be interacting with an infant or a foreigner who is not a native English speaker. However, try to avoid raising your pitch too much when you are increasing the distinctiveness of your formants. Have someone else listen to see if your vowels sound more distinctive.
Aim for at least a 10% change in FORMANTS (calculated as the mean change for both F1 and F2--whether a decrease or increase in the formants).
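One way to check the 10% target is sketched below. It assumes that "mean change" means the average of the absolute percentage changes in F1 and F2 for a given vowel, which is an interpretation rather than something Appendix H is known to specify. The function names and the formant values (in Hz) are hypothetical.

```python
def percent_change(normal_hz, modified_hz):
    """Absolute change relative to the normal value, as a percentage."""
    return abs(modified_hz - normal_hz) / normal_hz * 100

def mean_formant_change(normal, modified):
    """Average the percentage change over F1 and F2.

    `normal` and `modified` are (F1, F2) pairs in Hz for the same vowel.
    """
    changes = [percent_change(n, m) for n, m in zip(normal, modified)]
    return sum(changes) / len(changes)

# Hypothetical (F1, F2) measurements for the vowel /i/.
normal_i = (300, 2300)
extreme_i = (270, 2600)

change = mean_formant_change(normal_i, extreme_i)
print(f"mean formant change: {change:.1f}%")
print("meets 10% target" if change >= 10 else "below 10% target")
```

Taking the absolute value treats a decrease and an increase alike, so a lowered F1 and a raised F2 both count toward the target.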
Make all of the measurements asked for in Appendix H, calculate the means that are indicated, and plot these values on the formant chart included in Appendix I. First plot the formant values for your normal speech and then plot the values for the extreme formants in another color. Compare these plots to the plots for English given by Kuhl et al. (1997), which are shown below.
Variable Name | Description |
---|---|
Carrier phrase testing | |
CP_neutral_formant | %correct |
CP_context_formant | %correct |
CP_overall_formant | %correct |
Function word testing | |
FW_context_formant | %correct |
Vowels in single words | |
Dipth_formant | %correct |
Tense_formant | %correct |
Lax_formant | %correct |
VSW_formant | %correct |
In this exercise, you finally train the software as the directions indicate. The other tests tried to improve recognition without the individualized training, but this time you will fully train the software on your voice.
Variable Name | Description |
---|---|
AFTER CONDITION: for each of the listed variables, enter % correct (this value should be a multiple of 10 between 0 and 100) | |
Carrier phrase testing | |
CP_neutral_after | %correct |
CP_context_after | %correct |
CP_overall_after | %correct |
Function word testing | |
FW_isolated_after | %correct |
FW_context_after | %correct |
FW_overall_after | %correct |
Vowels in single words | |
Dipth_after | %correct |
Tense_after | %correct |
Lax_after | %correct |
VSW_overall_after | %correct |
Liquids | |
Liq_R_after | %correct |
Liq_L_after | %correct |
Liq_overall_after | %correct |
Continuous speech paragraph | |
CS_after | %correct |
There are a number of ways in which to address the second issue. For instance, one can simply compare the number of statistically significant differences that occur in the Before and After conditions for females and males, respectively. Another, more in-depth approach would be to create difference scores for each variable and test whether these differences were on average greater for males or females. For instance, subtracting the CP_overall_before value for a given female subject from the same individual's CP_overall_after value produces a score on a new variable that might be called CP_overall_improvement. Calculating this score for all the females and all the males produces two sets of such scores, which can then be compared using a two-sample t-test.
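The difference-score approach might look like the following sketch. The record layout (one dictionary per subject), the helper name `improvement_scores`, and every value are hypothetical. Only the variable names CP_overall_before and CP_overall_after come from the datasheet.

```python
from statistics import mean

# Hypothetical spreadsheet rows, one per subject (% correct).
subjects = [
    {"sex": "F", "CP_overall_before": 40, "CP_overall_after": 70},
    {"sex": "F", "CP_overall_before": 50, "CP_overall_after": 70},
    {"sex": "M", "CP_overall_before": 70, "CP_overall_after": 80},
    {"sex": "M", "CP_overall_before": 60, "CP_overall_after": 80},
]

def improvement_scores(rows, sex, variable="CP_overall"):
    """After-minus-before difference scores for all subjects of one sex."""
    return [
        row[f"{variable}_after"] - row[f"{variable}_before"]
        for row in rows
        if row["sex"] == sex
    ]

female_improvement = improvement_scores(subjects, "F")
male_improvement = improvement_scores(subjects, "M")

print("female mean improvement:", mean(female_improvement))
print("male mean improvement:", mean(male_improvement))
```

The two lists of scores are independent groups, so they are the inputs a two-sample t-test would take.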
As there are a number of possible approaches, you can make your own determination as to how to proceed. Think carefully both about which variables to test and about how to conduct those tests. The most important consideration is that you want to draw the most accurate conclusion possible and to present that conclusion in a convincing way in your final report.
After you have completed the exercises, you will be ready to prepare a final written report. This write-up should be in standard journal article format--select a journal in a discipline of interest to you and follow their instructions to contributors. If you have any doubts about what format to use, follow the guidelines of the American Psychological Association (APA), which is the standard for many disciplines, including psychology and linguistics. Lab personnel can direct you to sources that describe the APA style.
Be sure to address the results of the statistical comparisons that you have conducted. The discussion in Appendix B about the tokens may be helpful in deciding on particular issues that should be discussed in the lab report.
A very important aspect of the lab write-up is to draw conclusions that the data support. Do not be afraid to make conclusions based on the data that you and your colleagues in the class have collected. Discuss how you gathered the data so that the reader knows it was done carefully and methodically. Make your findings perfectly clear to the reader and explicitly draw the conclusions that are indicated by the empirical data.
As part of your final report, you should include the mean values for fundamental frequency and formant measurements. Appendix I includes forms for charting both your normal and modified formants.
In preparing your report, you will need to consider some specific questions. Everyone should answer the following three questions and choose three others to address.
For the three other questions, you may use some of these or think of your own.