CHAPTER THREE:

SPEECH PERCEPTION



CHAPTER OUTLINE

INTRODUCTION

THE HISTORICAL ROOTS OF SPEECH PERCEPTION RESEARCH

MAJOR QUESTIONS IN SPEECH PERCEPTION

How Do We Identify and Label Phonetic Segments?

The Lack of Invariance Problem

How Is Speech Perceived Under Less Than Ideal Conditions?

THE SPEECH SIGNAL

How Speech Is Produced

Place of Articulation

Manner of Production

Distinctive Features

Acoustic Properties of Speech Sounds

Acoustic Properties of Consonants

PERCEPTION OF PHONETIC SEGMENTS

The Role of Speech Synthesis in Perceptual Research

Ways in Which Speech Perception Is Tested

Perception of Vowels

Steady State Vs. Formant Transitions in Vowel Identification

Perception of Consonants

Phoneme Identity Is Context Dependent

Voice-Onset-Time: An Important Acoustic Cue

Categorical Perception of Voicing Contrast

Other Categorical Perception Studies

Categorical Perception: Specific to Speech Perception?

Other Applications of Test Paradigms Used in Categorical Perception Studies

SPEECH PERCEPTION BEYOND A SINGLE SEGMENT

The Perceptual Outcome of Coarticulation

Perceptual Effects of Speaking Rate

Lexical and Syntactic Factors in Word Perception

MODELS OF SPEECH PERCEPTION

The Motor Theory of Speech Perception

Analysis-by-Synthesis

The Fuzzy Logical Model

Cohort Model

TRACE Model

SUMMARY



KEY CONCEPTS

I. Understanding a spoken message is predicated upon the ability to hear and differentiate the sounds that make up the words of the message. Although speech decoding occurs rapidly, it is a complex task that relies upon a number of distinct processes and is complicated by the fact that phonemes have varying acoustic characteristics depending on: 1) where they are found within a word, 2) which phonemes they accompany, and 3) the individual speaker. (Page 108)



II. Speech perception research has its roots in the communications and military industries that emerged just prior to and during the Second World War. Much of the pioneering work on speech analysis comes out of the development of equipment for speech synthesis. The first device to decode and recreate speech sounds was the vocoder. The principles used to design the vocoder advanced the development of the sound spectrograph, an instrument that analyzes and plots audio signals on a graph, giving a precise diagram of visual speech known as a spectrogram. (Page 109)



III. The acoustical properties of human speech are very complex, containing many kinds of information in any single moment. Understanding conversational speech requires the ability to process between 25 and 30 phonetic segments per second and to decode these segments into meaningful words. One of the greatest challenges for speech perception research is determining how we isolate and identify individual sounds in the complex speech signal. (Pages 110-111)



IV. Phonemes in a language do not display invariant acoustic characteristics. They change considerably depending on context effects such as coarticulation, on whether the sounds are produced by a man, a woman, or a child, and on the fact that speech sounds are rarely pronounced the same way twice. (Page 111)



V. The lack of invariance in the production of speech sounds makes conversational speech so complex and diverse that at present no machine has been created that is able to recognize and process speech the way that humans can. Some speaker-dependent machines have been able to process relatively large vocabularies, while speaker-independent machines can only process a very limited vocabulary, such as numbers. (Page 112)



VI. Not only do speech sounds vary considerably, but some speakers underarticulate words to such an extent that the words lose much of their identifying information. Other factors, such as lexical, syntactic, and contextual information, help listeners to understand such ambiguous speech signals. (Pages 112-113)



VII. Three major systems are involved in speech production:

A. The vocal tract, which is the area from the larynx to the lips, including the pharynx, the nasal cavity and the oral cavity.

B. The larynx, which contains the vocal folds and the glottis: the opening between the vocal folds where they vibrate to produce phonation.

C. The subglottal system, which includes the lungs, the muscles needed for inhalation and exhalation, and the trachea. (Page 113)



VIII. Speech sounds are divided into vowels and consonants. Consonants are produced with more articulatory movement and more constriction than vowels.

(Page 114)



IX. The various places where the vocal tract is constricted to produce the consonants are called places of articulation. Common places of articulation for English consonants are bilabial, labiodental, interdental, alveolar, and palatal.

(Pages 114-115)



X. The acoustic energy for speech sounds comes from modulation of the air stream through the vocal tract. Some sounds are produced by opening and closing the glottis, known as glottal pulsing. Voiced or phonated sounds are made with periodic vibrations in the vocal folds. The rate at which glottal pulsing occurs during sound generation is called the fundamental frequency (F0). The typical F0 for men is about 125 pulses per second, for women it is about 200 pulses per second, and for children it is about 300 pulses per second. (Page 116)



XI. A second type of sound source is the turbulence created when air is forced through a narrow constriction in the oral cavity. This aperiodic sound source is the basis for the production of fricatives and affricates. (Page 116)



XII. Stopping the airflow completely and then abruptly releasing it produces oral stop consonants. Other types of consonant speech sounds are liquids and glides. (Pages 116-117)



XIII. Vowels are produced when air flow from the lungs is unobstructed. Each vowel is produced with a different configuration of tongue and lip movements.

(Page 117)



XIV. Linguists describe speech sounds within the context of a system of distinctive features. All sounds can be characterized by such features as their place of articulation; the presence or absence of voicing; whether the air stream exits the nose or the mouth; and whether or not the sound is made with continuous air flow. (Page 117)



XV. The production of different vowels is determined by the resonant characteristics of the oral cavity or vocal tract during the production of the sound. The spectrum of the sound at the sound source (the glottis) includes the F0 and whole-number multiples of the F0, called harmonics. (Page 118)
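
The relation between the fundamental frequency and its harmonics can be sketched in a few lines of Python. This is only an illustration: it assumes that "multiples" means whole-number multiples (1x, 2x, 3x, ...), treats "pulses per second" as Hz, and uses the typical values from Key Concept X. The names TYPICAL_F0_HZ and harmonics are invented for the example.

```python
# Illustrative sketch: harmonics of the glottal source as whole-number
# multiples of the fundamental frequency. Names are hypothetical.
TYPICAL_F0_HZ = {"man": 125, "woman": 200, "child": 300}  # values from Key Concept X


def harmonics(f0_hz, count=5):
    """Return the first `count` harmonics (1*F0, 2*F0, 3*F0, ...) in Hz."""
    return [n * f0_hz for n in range(1, count + 1)]


for speaker, f0 in TYPICAL_F0_HZ.items():
    print(speaker, harmonics(f0))
```

The vocal tract then strengthens some of these source frequencies and weakens others, which is what gives each vowel its characteristic resonances.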



XVI. Bands of resonant frequency change in relation to the movement of articulators during speech. These bands are called formants. Vowel formants appear as broad horizontal bands on a sound spectrogram. Consonants appear as columns on the sound spectrogram. (Pages 119, 121)



XVII. Men, women and children have different absolute values for the formants of given vowels, but listeners are able to process these sounds by using a system of pattern recognition. This ability is called speaker normalization.

(Pages 119-121)



XVIII. Isolating the acoustic cues of the complex sound pattern requires machines that can analyze speech and machines that can synthesize speech according to precise specifications. In the 1950s, Cooper, Liberman and Delattre used the Pattern Playback Speech Synthesizer to produce speech by playing back formant patterns drawn on a sound spectrogram. The researchers found that intelligible phonemes can be produced from highly simplified drawings. Manipulations of the data fed into the machine allowed the researchers to discover what acoustic cues are necessary for the identity of particular phonemes. (Pages 123-124)



XIX. Many speech perception experiments focus on discrimination and identification. Research on vowel perception has shown that listeners most accurately identify isolated vowels when presented with the steady state of the vowel sound. In natural speech, however, identification of vowels (which occur between consonants) seems to rely most heavily on vowel duration and formant transitions.

(Pages 124-126)



XX. In both laboratory settings and in conversational speech, vowels are perceived more accurately than consonants, perhaps partly because vowel sounds are of longer duration. Consonants, particularly stop consonants, tend to be bound to vowels; the phonetic segment consisting of the consonant and vowel is said to be encoded. Information about the vowel and the consonant is transmitted simultaneously in what is referred to as parallel transmission. (Page 127)



XXI. The difference between voiced and unvoiced cognate consonants in the initial position appears to be highly dependent upon voice-onset-time (VOT). The VOT for voiced consonants appears to range from just before the burst of air is made to about 30 milliseconds afterwards. The VOT for unvoiced consonants ranges from 40 to 100 milliseconds after the burst of air. (Pages 128-129)
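
The VOT ranges above can be turned into a rough labeling rule. The Python sketch below is a simplification for study purposes, not a model from the text: the function name label_by_vot is invented, any VOT at or below 30 ms (including voicing that starts before the burst, written as a negative value) is treated as voiced, and values that fall outside both stated ranges are simply reported as unclear.

```python
def label_by_vot(vot_ms):
    """Label an initial stop from its voice-onset-time (ms after the burst).

    Thresholds follow the ranges stated in the key concept; everything else
    (the function name, the "unclear" label) is an assumption of this sketch.
    """
    if vot_ms <= 30:
        return "voiced"      # voicing starts before or shortly after the burst
    if 40 <= vot_ms <= 100:
        return "unvoiced"    # voicing lags well behind the burst
    return "unclear"         # between or beyond the stated ranges


for vot in (-5, 10, 35, 60):
    print(vot, label_by_vot(vot))
```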



XXII. Discrimination tasks using voiced and unvoiced initial stop consonants present the listener with a continuum of sounds going from a voiced to an unvoiced consonant and ask the listener to determine when the voiced consonant becomes an unvoiced one. Listeners tend to hear many different allophones of the same consonant and then, at a certain point, recognize the consonant as different. This is known as categorical perception. A mid-range stimulus that is sometimes perceived as voiced and sometimes as unvoiced is called the cross-over stimulus. In many perceptual domains, discrimination is better than identification, but the hallmark of categorical perception is that discrimination is no better than identification. Categorical perception appears to operate even when the stimulus is a non-speech sound.

(Pages 130-133)
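
One way to picture categorical perception is to model labeling as a steep S-shaped function of VOT and then ask how well two stimuli could be told apart if listeners relied only on their labels. The Python sketch below does this under several assumptions that are not from the text: the boundary (35 ms) and slope are made-up values, the logistic shape is a convenience, and the discrimination formula 0.5 + 0.5 * (p_a - p_b)^2 is a label-based prediction commonly used for ABX tasks, adopted here for illustration.

```python
import math

BOUNDARY_MS = 35.0  # hypothetical category boundary (the cross-over point)
SLOPE = 0.5         # hypothetical steepness of the labeling function


def p_unvoiced(vot_ms):
    """Probability of labeling a stimulus 'unvoiced' (logistic, assumed)."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))


def predicted_discrimination(vot_a, vot_b):
    """Proportion correct expected if listeners use labels only (0.5 = chance)."""
    diff = p_unvoiced(vot_a) - p_unvoiced(vot_b)
    return 0.5 + 0.5 * diff ** 2


# Three pairs, each 20 ms apart: within 'voiced', across the boundary,
# and within 'unvoiced'.
for a, b in [(0, 20), (25, 45), (60, 80)]:
    print(a, b, round(predicted_discrimination(a, b), 3))
```

With these settings the two within-category pairs come out near chance while the pair that straddles the boundary is discriminated well, even though all three pairs differ by the same physical amount.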



XXIII. In categorical perception studies using monolingual speakers of English and Thai, the perceived phonetic boundaries were greatly different. English speakers heard the sounds of their language; Thai speakers heard the sounds of theirs. Studies of bilingual speakers generally show that they have a single perceptual system that is situated at the midpoint between their two languages. (Page 135)

XXIV. Studies have shown that many aphasic listeners have unstable responses for several stimuli in the phonetic boundary area, implying that their phonetic boundaries are not clearly set. (Page 136)



XXV. The acoustic characteristics of sounds alter as the speaking rate increases. Changes from citation form are greater in vowels than in consonants. All phonetic segments are shortened, but shortening vowels also produces changes in formant frequencies. Listeners appear to expect these changes: in experiments in which the target stimulus is placed in fast and then slow carrier phrases, it may be identified differently. (Page 140)

XXVI. Perception of words in fluent speech is influenced by higher level knowledge of semantics and syntax: top-down processing combines with bottom-up processing (using only acoustic information) to allow listeners to decode fluent speech. (Page 140)



XXVII. Listening-for-mispronunciation tasks have shown that people tend to pay more attention to the initial part of the word than to its end. Sounds in a word are recognized sequentially: the listener accesses a word candidate and "fills in" the end of the word if it is missing or mispronounced. Phonemic restoration also occurs when phonetic segments are replaced by non-speech sounds.

(Pages 140-142)



XXVIII. The motor theory of speech perception posits that we perceive speech in terms of how we produce speech sounds. This theory was developed to deal with the absence of invariance between the acoustic signal and its phonemic representation. Speech is held to be a special type of auditory stimulus for humans. When we are exposed to it, we shift into a speech mode that enables us to link articulatory gestures involved in the production of a sound to the sound that we hear. Perceiving in the speech mode is held to be innate and species specific.

(Pages 143-144)



XXIX. The analysis-by-synthesis model of perception proposes that we analyze speech by implicitly generating (synthesizing) speech from what we have heard and then comparing this synthesized speech to what we have heard. Little direct empirical evidence has been found to support this model. (Pages 144-145)



XXX. The fuzzy logical model assumes three operations in speech perception: feature evaluation, integration and decision. Listeners are said to have prototypes of words in their heads which must be matched to the auditory stimulus. This model emphasizes a continuous rather than an all-or-nothing approach to feature decision: the degree of match is evaluated on fuzzy truth values. (Pages 145-146)
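
The three operations can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the feature values are invented, integration is done by multiplying the fuzzy truth values, and the decision step reports each prototype's share of the total support. The multiplication and relative-goodness rules follow common descriptions of the fuzzy logical model rather than anything stated in this outline, and the function names are hypothetical.

```python
def integrate(feature_values):
    """Integration: combine fuzzy truth values (0..1) for one prototype."""
    support = 1.0
    for value in feature_values:
        support *= value  # multiplicative combination (assumed)
    return support


def decide(evaluations):
    """Decision: each prototype's support relative to the total support."""
    support = {word: integrate(values) for word, values in evaluations.items()}
    total = sum(support.values())
    return {word: s / total for word, s in support.items()}


# Feature evaluation: hypothetical degrees of match between the stimulus and
# two word prototypes on two features (voicing of the first sound, place).
evaluations = {"bat": [0.9, 0.8], "pat": [0.3, 0.4]}
print(decide(evaluations))
```

Because every value is a matter of degree, the output is a graded preference for one prototype rather than an all-or-nothing match.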



XXXI. Cohort theory claims that in the first stage of word recognition the acoustic-phonetic information at the beginning of a word activates a cohort of possible words. In the second stage of analysis, all possible sources of information, including higher level processes, help to eliminate words that are not the target word. (Page 146)
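
The two stages can be illustrated with a toy Python example. It rests on simplifying assumptions: spelling stands in for acoustic-phonetic information, the six-word lexicon is invented, and "higher level processes" are reduced to a single yes/no test supplied by the caller. The names LEXICON, cohort and narrow are hypothetical.

```python
LEXICON = ["candle", "candy", "canoe", "captain", "cat", "dog"]  # toy lexicon


def cohort(heard_so_far, lexicon=LEXICON):
    """Stage 1: activate every word that begins with the input heard so far."""
    return [word for word in lexicon if word.startswith(heard_so_far)]


def narrow(candidates, fits_context):
    """Stage 2: drop candidates that other information sources rule out."""
    return [word for word in candidates if fits_context(word)]


initial = cohort("can")
print(initial)  # every toy-lexicon word that begins with "can"

# Hypothetical context: the sentence is about boats.
print(narrow(initial, lambda word: word in {"canoe", "boat"}))
```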



XXXII. TRACE theory is based on a system of highly interconnected processing units called nodes. Each node has a resting level, a threshold level and an activation level. Phoneme nodes may excite word nodes and word nodes may excite phoneme nodes. (Pages 146-147)
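
The idea of nodes that excite one another in both directions can be sketched with a very small network in Python. This is a toy illustration, not the TRACE model itself: it leaves out inhibition, decay and the time dimension, and the class name, link weight, threshold and input strength are all assumed values chosen for the example.

```python
class Node:
    """A processing unit with a resting level, a threshold and an activation."""

    def __init__(self, name, resting=0.0, threshold=0.25):
        self.name = name
        self.resting = resting
        self.threshold = threshold
        self.activation = resting  # starts at the resting level
        self.links = []            # nodes that this node can excite

    def excite(self, amount):
        self.activation = min(1.0, self.activation + amount)

    def spread(self, weight=0.2):
        """Pass excitation along links, but only once past threshold."""
        if self.activation >= self.threshold:
            for other in self.links:
                other.excite(weight * self.activation)


def connect(a, b):
    """Link two nodes in both directions (phoneme <-> word)."""
    a.links.append(b)
    b.links.append(a)


k, ae, t, p = Node("/k/"), Node("/ae/"), Node("/t/"), Node("/p/")
cat, cap = Node("cat"), Node("cap")
for phoneme in (k, ae, t):
    connect(phoneme, cat)
for phoneme in (k, ae, p):
    connect(phoneme, cap)

for phoneme in (k, ae, t):       # the input contains /k/, /ae/, /t/
    phoneme.excite(0.5)
for phoneme in (k, ae, t, p):    # bottom-up: phoneme nodes excite word nodes
    phoneme.spread()
for word in (cat, cap):          # top-down: word nodes excite phoneme nodes
    word.spread()

print(cat.activation, cap.activation, k.activation, p.activation)
```

In this run "cat" ends up more active than "cap" because it receives input from three phoneme nodes rather than two, and only "cat" passes its threshold and feeds excitation back to its phonemes.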



MULTIPLE CHOICE QUESTIONS

1. The phoneme /t/ differs from /k/ in:

a. voicing

b. place of articulation

c. manner of production

d. glottal pulsing (Page 116)



2. A sound spectrograph:

a. displays frequency, time and amplitude information

b. cannot analyze voice-onset-time

c. is a vocoder

d. produces synthetic speech (Page 109)



3. Which of the following does not contribute to the absence of invariance in the speech signal?

a. coarticulation

b. allophonic variation

c. physical differences between male, female and child speakers

d. speech synthesis (Pages 111-112)



4. Which of the following structures is not part of the vocal tract?

a. lungs

b. palate

c. nasal cavity

d. tongue, teeth and lips (Page 113)



5. Distinctive features are used to:

a. describe allophonic variation

b. govern speech synthesis

c. determine the resonant properties of the vocal tract

d. describe the specific attributes of the speech sounds of a language

(Page 117)

6. Stop consonants are identified acoustically:

a. by adjacent vowel formant transitions

b. by voice-onset-time

c. by characteristics of the burst

d. all of the above (Page 122)



7. If a sound is perceived categorically, then

a. discrimination will be better than identification

b. perceptual discontinuity results from continuous changes in the physical characteristics of the speech signal

c. it is normalized by the listener

d. it cannot be synthesized adequately (Pages 131-133)



8. Results based on listening-for-mispronunciation tasks suggest that:

a. we pay more attention to the beginnings of words than to the ends of words

b. we access possible lexical candidates only after we have heard all the sounds in a word

c. context does not affect speech perception

d. top-down models are inadequate to account for speech perception (Page 142)



9. Which of these is not a common place of articulation for English consonants?

a. palatal

b. bilabial

c. labiodental

d. uvular (Pages 114-115)



10. The motor theory of speech perception posits that the speech mode

a. is innate and species-specific

b. is a special mode of perception allowing us to link articulatory gestures involved in the production of a sound to sounds we hear

c. allows us to hear sounds phonetically rather than acoustically

d. all of the above (Pages 143-144)

11. Early research on speech perception made the discovery that

a. natural speech contains many redundant sounds

b. the ability to articulate a sound is necessary to perceiving it

c. humans are uniquely equipped to perceive and understand phonemic information

d. coarticulation of consonants and vowels is necessary for fluent speech (Page 123)

12. Using the Pattern Playback machine, researchers in the 1950s discovered

a. that intelligible speech can be synthesized using highly simplified spectrograms

b. that fluent speech has a highly regular spectrographic pattern

c. that vowel identification is possible even with a fairly simple speech perception machine

d. creating synthetic speech requires a complex pattern of formant transitions (Page 123)

13. The rate of glottal pulsing determines the fundamental frequency (F0) of an utterance:

a. men typically have F0s of around 300 pulses per second, while women and children have F0s of around 200

b. men typically have F0s of around 50 pulses per second, women of around 100 and children around 125

c. men typically have F0s of around 300 pulses per second, women of around 200 and children of around 125

d. men typically have F0s of around 125 pulses per second, women of around 200 and children of around 300

(Page 116)

14. Voice-onset-time is important in determining if

a. consonants are velar or palatal

b. consonants are perceived as voiced or not

c. harmonics are important in speech perception

d. the speaker is male or female (Pages 128-130)



15. If a sound is considered highly encoded, it means that:

a. its perception is unstable

b. it is not an example of parallel transmission of information

c. it is coarticulated

d. its identification is dependent upon information contained in neighboring segments (Page 127)



16. Although men, women and children tend to have different fundamental frequencies for specific speech sounds, listeners are able to understand all three different types of speaker because of

a. voice synthesis

b. speaker normalization

c. formant transitions

d. categorical perception (Page 121)

17. Studies of categorical perception have found that

a. it is specific to speech perception

b. all people, no matter what their native language, are able to perceive the difference between phonemes in most languages

c. bilinguals generally seem to have a single sound perception system

d. changes from citation form are more evident in vowels than in consonants (Page 135)



18. When sounds are produced carefully in isolation, this is called their

a. citation form

b. singular form

c. isolated form

d. solitary form (Page 137)



19. Many phonemes that are perceived as the same are actually slightly different allophones. The production of these slightly different sounds is often the result of

a. coarticulation

b. categorical production

c. parallel processing

d. misinterpretation of acoustic cues (Page 111)

20. When psycholinguists discuss the "lack of invariance" problem, they mean

a. that distinctive phonemes in a language are not associated with standard acoustic patterns

b. that speakers use different fundamental frequencies

c. that the pronunciation of distinctive sounds in normal speech fluctuates

d. all of the above (Page 111)



21. According to the fuzzy logical model of speech perception,

a. listeners understand words by matching them to prototypes of known words

b. listeners understand speech through the activation of a complex network of perception nodes

c. acoustic cues generate a cohort of possible words which must be narrowed down to the target word through logical deduction

d. listeners can rapidly process messily articulated speech by using top-down methods of perception (Page 145)

22. Which of these sounds is voiced?

a. the z in buzz

b. the s in bus

c. the f in fog

d. all of the above (Page 116)

23. Which of these words contains both a glide and a fricative?

a. yes

b. less

c. fun

d. wet (Pages 116-117)

24. Research on vowel perception has shown that

a. vowels are easier to recognize than consonants

b. consonants are easier to recognize than vowels

c. perception of isolated vowels and those found in continuous speech differ

d. steady state vowels are harder to identify than those that are coarticulated (Page 125)



25. The phenomenon of phonemic restoration suggests that

a. when we listen to a word, our expectations affect what we perceive

b. bottom-up processing is inadequate to account for word recognition

c. parallel processing is integral to both speech production and speech recognition

d. the speech mode allows us to recognize the difference between speech sounds and non-speech sounds (Page 141)

26. The analysis-by-synthesis model of speech perception posits that

a. speech perception is phonetic and different from auditory perception

b. listeners implicitly generate speech from what they have heard and compare it with the auditory stimulus

c. listeners have ideal models of words in their minds which they compare to what they have heard

d. listeners analyze speech by mentally rehearsing what they have heard (Page 144)

27. The place of articulation for the English consonants [t] and [d] is

a. interdental

b. velar

c. alveolar

d. labiodental (Page 115)

28. The cohort model of word recognition posits that

a. acoustic-phonetic information at the beginning of a word activates all words in memory that resemble it

b. listeners have ideal models of words in their minds to which they compare what they hear

c. acoustic-phonetic information is not sufficient to account for word recognition

d. words are grouped alphabetically in the mental lexicon

(Page 146)



29. Linguists describe speech sounds according to such things as whether or not they are voiced, their place of articulation and whether or not they require continuous air flow. These are called

a. distinguishing features

b. distinctive features

c. discriminant features

d. differentiating features (Page 117)

30. Sounds that differ in only one feature are called

a. cohorts

b. minimal pairs

c. cognates

d. contrasting pairs (Page 129)



SHORT ANSWER QUESTIONS



31. What do we call the portion of the speech production mechanism that provides the air support for speech? What does this system comprise?



32. How do we usually classify speech sounds in terms of articulation?



33. What is (are) the place(s) of articulation for the sounds [b] and [p]?



34. What is (are) the place(s) of articulation for [k] and [g]?



35. What is a stop consonant? Give an example.



36. What is a fricative? Give an example.



37. How would you recognize a vowel on a sound spectrogram?



38. Define the term speaker normalization.



39. What is a bottom-up model of speech perception?



40. What is a top-down model of speech perception?



41. Define the term parallel transmission as it refers to speech recognition.



42. Listening-for-mispronunciation tasks have been used to examine the way we pay attention to auditory stimuli. Give an example of a question investigators might be able to answer with this kind of experiment.





ESSAY QUESTIONS



43. What is the invariance problem in speech perception? Describe the factors that contribute to the absence of invariance.



44. Describe the three major systems for speech production.



45. Describe and give examples for the following test paradigms in speech perception research: discrimination and identification.



46. What does it mean when we say that phonetic segments are not like beads strung on a string? Explain.



47. Distinguish between top-down and bottom-up models of speech perception. Give examples of how each model might explain word recognition.



48. Describe the phonemic restoration phenomenon. What does it suggest about the nature of speech perception?



49. Describe, contrast, and evaluate two theories of speech perception. For each, delineate what aspect of the theory seems problematic to you.



50. Much information about speech perception has come out of attempts to create computers that can recognize and produce speech. Explain why it is so difficult to program a computer to recognize human language, and what factors make synthesized speech sound unnatural. What does this tell us about our own ability to perform these tasks so effortlessly?
