You can't avoid understanding your native language. In normal circumstances, when someone speaks to you in your mother tongue, it's almost impossible to hear the speech as a meaningless string of sounds. It's spoken communication: you hear words and you understand. Nearly everyone with normal hearing can understand spoken language; it's one of the most natural and fundamental tasks of your day. And it's amazing how quickly you get used to a new accent and even start to take it on.
Contrast the simplicity of understanding your native language with the steady stream of sound that you hear when you listen to a language that you don't speak. You can't tell where the words begin and end. Once in a while you can pick out a sentence boundary because of the intonation or because of a pause. Computers trying to recognize speech face the same steady stream of sound, which they must segment into words without the benefit of a brain or a native speaker's knowledge of the language.
This brief reading introduces you to some important concepts for your thinking about speech recognition in this module. Spelling, or orthography, as linguists like to call it, will serve as my organizing principle for these concepts.
Here are some hints if you haven't thought of much: What kinds of stage and actor directions are given in the script of a play? How many different ways could someone say "You're a fine one to talk"? It can be said straightforwardly, sarcastically, joyfully, angrily, skeptically, or fearfully, none of which is represented in our spelling system. Spelling doesn't represent emotion, speech rate, or personality, and it doesn't represent intonation or tone of voice.
Beware of heard, a dreadful word
That looks like beard and sounds like bird,
And dead: It's said like bed, not bead--
For goodness' sake, don't call it deed!
The more you scrutinize English spelling, the more inconsistencies you find. For example, there are no fewer than thirteen spellings for sh: show, sugar, issue, mansion, mission, national, suspicion, ocean, conscious, chaperon, schist, fuchsia, and pshaw (Finegan 1992). There are twelve different spellings for the syllable "see": see, senile, sea, seize, scenic, siege, ceiling, cedar, cease, juicy, glossy, sexy (Finegan 1992). And seven different spellings of the "ee" sound appear in this sentence: Did he believe that Caesar could see the people seize the seas? (Fromkin & Rodman 1988). The bumper sticker "Visualize whirled peas" plays on the ability to represent similar sounds with different combinations of letters.
To factor out all of the different spellings and indicate what a word really sounds like, we can use the International Phonetic Alphabet, or IPA for short. The symbol that represents "ee" in the IPA is [i]. Since [i] is one of the vowels pronounced with the most extreme tongue position, it is known as a point vowel. The other point vowels, also produced with extreme tongue positions, are the low vowel [a] as in father and the high back vowel [u] as in glue. If you know a language like Spanish, you will find that the IPA vowel symbols correspond more closely to Spanish vowels than to English ones.
Words that share a spelling pattern may have vastly different pronunciations, such as the "ough" in cough, tough, bough, through, though, thoroughfare (Finegan 1992). For these kinds of words, phonetic transcription lets you unambiguously represent the pronunciation.
For English-speaking grade schoolers, spelling is a subject--it has to be learned and sometimes with great difficulty. In countries like Spain and Indonesia, whose national languages have a highly consistent relationship between sound and spelling, the spelling bee does not exist.
English spelling was standardized before the language reached its current state, so it represents a system of pronunciation that has long since changed drastically. We spell "knight" the way we do because when spelling was standardized, the "k" and the "gh" were actually pronounced. Although we have long since lost these sounds from the word, we have not revamped the spelling.
What spelling represents that speech doesn't
Spelling is not entirely worthy of contempt, though, because it does represent some important information that we don't have in the speech stream. Sometimes spelling can disambiguate things that sound the same in spoken language. For example, when you see these phrases written, you have no doubt what they mean although when hearing them, you might not:
Don't you have any cents/sense?
I left my camera on the plain (grassland)/plane (airplane).
And this bumper sticker is disambiguated in writing:
No Jesus, no peace
Know Jesus, know peace
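The homophone problem described above can be illustrated with a small sketch. It assumes a hypothetical toy lexicon that maps each spelling to a rough IPA transcription (the entries and transcriptions are illustrative, not taken from any real pronunciation dictionary) and groups the spellings that share a pronunciation. A listener, or a recognizer, that has only the sound has to choose among the spellings in each group.

```python
from collections import defaultdict

# Hypothetical toy lexicon: spelling -> rough IPA transcription.
# The transcriptions are illustrative assumptions, not dictionary data.
LEXICON = {
    "plain": "pleɪn",
    "plane": "pleɪn",
    "no": "noʊ",
    "know": "noʊ",
    "sea": "si",
    "see": "si",
    "peace": "pis",
}


def homophone_sets(lexicon):
    """Group spellings by pronunciation; keep only groups with 2+ spellings."""
    by_sound = defaultdict(list)
    for spelling, sound in lexicon.items():
        by_sound[sound].append(spelling)
    return {
        sound: sorted(words)
        for sound, words in by_sound.items()
        if len(words) > 1
    }


HOMOPHONES = homophone_sets(LEXICON)
for sound, words in sorted(HOMOPHONES.items()):
    # Each line shows one pronunciation and the spellings it could stand for.
    print(f"[{sound}] -> {', '.join(words)}")
```

In this toy lexicon "peace" has no homophone and drops out of the result, while each remaining sound maps to two spellings that only writing (or context) can tell apart.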
Unlike speech, spelling represents word boundaries. In written English, spaces let you know where one word begins and another ends. In spoken language, however, there are no such spaces. The speech stream is usually continuous. In fact, the silences visible in a representation of the speech stream (a spectrogram) have nothing to do with word boundaries. Look at the spectrogram below to see how the silences (appearing as blank areas) do not match up with word boundaries:
Numerous words in English have changed because speakers mis-cut them, that is, they placed the word boundaries incorrectly. Have you ever heard someone say they needed a whole nother thing? An apron used to be a napron and a newt used to be an ewt. And what do Frosty's friends call him? An ice guy. This kind of mistake is precisely the kind that speech recognition software would be prone to making.
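The mis-cutting described above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical five-word lexicon and using an unspaced string of letters as a stand-in for the continuous speech stream (a real recognizer works on sounds, not letters). The function lists every way the stream can be cut into known words.

```python
def segmentations(stream, lexicon):
    """Return every way to split `stream` into words found in `lexicon`."""
    if not stream:
        return [[]]  # one way to segment nothing: the empty parse
    results = []
    for end in range(1, len(stream) + 1):
        word = stream[:end]
        if word in lexicon:
            # Try each segmentation of what follows this candidate word.
            for rest in segmentations(stream[end:], lexicon):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, chosen to reproduce the Frosty example.
LEXICON = {"a", "an", "ice", "nice", "guy"}

PARSES = segmentations("aniceguy", LEXICON)
for parse in PARSES:
    print(" ".join(parse))
```

With this lexicon the stream has two valid cuts, "a nice guy" and "an ice guy". Nothing in the stream itself says which one the speaker meant, which is why a recognizer needs extra knowledge, such as which word sequences are likely, to choose.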
Playing with words and their boundaries is a source of amusement. Knock knock jokes, for example, often depend on recutting a word.
Knock, knock. Who's there? Isabel. Isabel who? Isabel-necessary-on-a-bicycle?
Knock, knock. Who's there? Lettuce. Lettuce who? Lettuce-in.
Knock, knock. Who's there? Arthur. Arthur who? Arthur-any-more-biscuits?
National Public Radio's Car Talk always ends with a list of fictional credits that are a tribute to the ambiguous word boundary/juncture genre of humor. Most of us know the fictional law firm of Dewey, Cheetham & Howe. But how about these other staffers?
Statistician: Marge Innovera
Assertiveness Training Coach: Lois Steem
Director of Pollution Control: Maury Missions
Evasive-Driving Instructor: Vera Bruptly
Conservative Political Commentator: Eileen Tudor-Wright
Sexual Harassment Intervention Counselor: Pat McCann
Director of Staff Pay Increases: Xavier Breath
Corporate Spokesperson: Hugh Lyon Sack
Boston Traffic Director: Laura Biden
Coordinator, 12 Step Recovery Program: Cody Pendant
Director of Photography: Len Scapon
Fact Checker: Ella Fynoe
Former Peugeot Dealer: Eustace L. Emmons
There are also misunderstandings, usually displayed by school children, that have come to be known as "pullet surprises," in honor of how one child wrote "Pulitzer Prize." Speech recognition software is particularly prone to pullet surprises.
The following paragraphs are from Fromkin & Rodman (1988, p. 426).
"When we push air out of the lungs through the glottis (vocal cords/larynx), it causes the vocal cords to vibrate; this vibration in turn produces pulses of air, which escape through the mouth (and sometimes also the nose). These pulses are actually small variations in the air pressure, due to the wavelike motion of the air molecules.
"The sounds we produce can be described in terms of how fast the variations of the air pressure occur, which determines the fundamental frequency of the sounds and is perceived by the hearer as pitch. We can also describe the magnitude or intensity of the variations, which determines the loudness of the sound. The quality of the sound is determined by the shape of the vocal tract when the air is flowing through it. An important tool in acoustic research is the speech spectrograph. When you speak into a microphone connected to this machine, a "picture" is made of the speech signal. The patterns produced are called spectrograms or, more vividly, "visible speech." More recently these pictures have been referred to as voiceprints (though this is deceptive in that they do not offer the kind of indisputable identification of the speaker the way fingerprints do).
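The two measurements named in the passage above, fundamental frequency and intensity, can be sketched on a synthetic signal. This is a minimal illustration, not how a speech spectrograph works: it assumes pure sine tones in place of real speech, an 8000 Hz sample rate, root-mean-square amplitude as a simple stand-in for intensity, and autocorrelation peak picking as one common way to estimate fundamental frequency. All function names and parameter values are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # samples per second (an assumed value)


def sine_wave(freq_hz, amplitude, duration_s=0.1, rate=SAMPLE_RATE):
    """Synthesize a pure tone as a list of pressure-like sample values."""
    n = int(duration_s * rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate)
            for i in range(n)]


def rms(samples):
    """Root-mean-square amplitude: larger variations give a larger value."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_f0(samples, rate=SAMPLE_RATE, fmin=75, fmax=400):
    """Estimate fundamental frequency from the strongest autocorrelation lag.

    Only lags corresponding to fmin..fmax Hz are searched (assumed range).
    """
    best_lag, best_score = None, float("-inf")
    for lag in range(rate // fmax, rate // fmin + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return rate / best_lag


LOW_QUIET = sine_wave(100, 0.2)   # slow, small pressure variations
HIGH_LOUD = sine_wave(200, 0.8)   # fast, large pressure variations

for name, tone in (("low/quiet", LOW_QUIET), ("high/loud", HIGH_LOUD)):
    print(f"{name}: f0 ~ {estimate_f0(tone):.0f} Hz, rms = {rms(tone):.3f}")
```

The tone with faster variations yields the higher frequency estimate (heard as higher pitch), and the tone with larger variations yields the larger RMS value (heard as louder). Real speech is far less regular than a sine wave, so practical pitch trackers add windowing, normalization, and voicing decisions that this sketch leaves out.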