Chapter Three
Speech Perception
Outline
Introduction
The historical roots of speech perception research
Major questions in speech perception
- How Do We Identify and Label Phonetic Segments?
- The Lack of Invariance Problem
- How Is Speech Perceived under Less than Ideal Conditions?
The speech signal
- How Speech Is Produced
- Place of Articulation
- Manner of Production
- Distinctive Features
- Acoustic Properties of Speech Sounds
- Acoustic Properties of Consonants
Perception of phonetic segments
- The Role of Speech Synthesis in Perceptual Research
- Ways in Which Speech Perception Is Tested
- Perception of Vowels
- Steady State Vs. Formant Transitions in Vowel Identification
- Perception of Consonants
- Phoneme Identity Is Context Dependent
- Voice-onset-time: an Important Acoustic Cue
- Categorical Perception of Voicing Contrast
- Other Categorical Perception Studies
- Categorical Perception: Specific to Speech Perception?
- Other Applications of Test Paradigms Used in Categorical Perception Studies
Speech perception beyond a single segment
- The Perceptual Outcome of Coarticulation
- Perceptual Effects of Speaking Rate
- Lexical and Syntactic Factors in Word Perception
Models of speech perception
- The Motor Theory of Speech Perception
- Analysis-by-synthesis
- The Fuzzy Logical Model
- Cohort Model
- TRACE Model
Summary
Key Concepts From the Book
I. Understanding a spoken message is predicated upon the ability to hear and differentiate the sounds that comprise the words of the message. Although speech decoding occurs rapidly, it is a complex task that relies upon a number of distinct processes and is complicated by the fact that phonemes have varying acoustic characteristics depending on: 1) where they are found within a word, 2) which phonemes they accompany, and 3) the individual speaker. (Page 108)
II. Speech perception research has its roots in the communications and military industries that emerged just prior to and during the Second World War. Much of the pioneering work on speech analysis comes out of the development of equipment for speech synthesis. The first device to decode and recreate speech sounds was the vocoder. The principles used to design the vocoder advanced the development of the sound spectrograph, an instrument that analyzes and plots audio signals on a graph, giving a precise diagram of visual speech known as a spectrogram. (Page 109)
III. The acoustical properties of human speech are very complex, containing many kinds of information in any single moment. Understanding conversational speech requires the ability to process between 25 and 30 phonetic segments per second and to decode these segments into meaningful words. One of the greatest challenges for speech perception research is determining how we isolate and identify individual sounds in the complex speech signal. (Pages 110-111)
IV. Phonemes in a language do not display invariant acoustic characteristics. They change considerably depending on such context effects as coarticulation, on whether the sounds are being produced by a man, a woman, or a child, and on the fact that speech sounds are rarely pronounced the same way twice. (Page 111)
V. The lack of invariance in the production of speech sounds makes conversational speech so complex and diverse that at present no machine has been created that is able to recognize and process speech the way that humans can. Some speaker-dependent machines have been able to process relatively large vocabularies, while speaker-independent machines can only process a very limited vocabulary, such as numbers. (Page 112)
VI. Not only do speech sounds vary considerably, but some speakers also underarticulate words to such an extent that the words lose much of their identifying information. Other factors, such as lexical, syntactic, and contextual information, help listeners to understand such ambiguous speech signals. (Pages 112-113)
VII. Three major systems are involved in speech production:
A. the vocal tract, which is the area from the larynx to the lips, including the pharynx, the nasal cavity and the oral cavity.
B. the larynx, which contains the vocal folds and the glottis: the opening between the vocal folds, which vibrate to produce phonation.
C. the subglottal system, which includes the lungs, the muscles needed for inhalation and exhalation, and the trachea. (Page 113)
VIII. Speech sounds are divided into vowels and consonants. Consonants are produced with more articulatory movement and more constriction than vowels. (Page 114)
IX. The various places where the vocal tract is constricted to produce the consonants are called places of articulation. Common places of articulation for English consonants are bilabial, labiodental, interdental, alveolar, and palatal. (Pages 114-115)
X. The acoustic energy for speech sounds comes from modulation of the air stream through the vocal tract. Some sounds are produced by opening and closing the glottis, known as glottal pulsing. Voiced or phonated sounds are made with periodic vibrations in the vocal folds. The rate at which glottal pulsing occurs during sound generation is called the fundamental frequency (F0). The typical F0 for men is about 125 pulses per second, for women it is about 200 pulses per second, and for children it is about 300 pulses per second. (Page 116)
XI. A second type of sound source is the turbulence created when air is forced through a narrow constriction in the oral cavity. This aperiodic sound source is the basis for the production of fricatives and affricates. (Page 116)
XII. Stopping the airflow completely and then abruptly releasing it produces oral stop consonants. Other types of consonant speech sounds are liquids and glides. (Pages 116-117)
XIII. Vowels are produced when air flow from the lungs is unobstructed. Each vowel is produced with a different configuration of tongue and lip movements. (Page 117)
XIV. Linguists describe speech sounds within the context of a system of distinctive features. All sounds can be characterized by such features as their place of articulation; the presence or absence of voicing; whether the air stream exits the nose or the mouth; and whether or not the sound is made with continuous air flow. (Page 117)
XV. The production of different vowels is determined by the resonant characteristics of the oral cavity or vocal tract during the production of the sound. The spectrum of the sound at the sound source (the glottis) includes the F0 and whole-number multiples of the F0, called harmonics. (Page 118)
XVI. Bands of resonant frequency change in relation to the movement of articulators during speech. These bands are called formants. Vowel formants appear as broad horizontal bands on a sound spectrogram. Consonants appear as columns on the sound spectrogram. (Pages 119, 121)
XVII. Men, women and children have different absolute values for the formants of given vowels, but listeners are able to process these sounds by using a system of pattern recognition. This ability is called speaker normalization. (Pages 119-121)
XVIII. Isolating the acoustic cues in the complex sound pattern requires machines that can analyze speech, and machines that can synthesize it, according to precise specifications. In the 1950s, Cooper, Liberman and Delattre used the Pattern Playback speech synthesizer to produce speech by playing back formant patterns drawn on a sound spectrogram. The researchers found that intelligible phonemes can be produced from highly simplified drawings. Manipulating the patterns fed into the machine allowed the researchers to discover which acoustic cues are necessary for the identification of particular phonemes. (Pages 123-124)
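As a small worked example of the harmonic series described above, the sketch below lists the first few harmonics for the typical F0 values given in item X. It reads "multiples" as whole-number multiples (1, 2, 3, ... times F0); the function name `harmonics` and the dictionary `TYPICAL_F0_HZ` are illustrative names, not terms from the book.

```python
# Typical fundamental frequencies from item X, in pulses per second (Hz).
TYPICAL_F0_HZ = {"man": 125, "woman": 200, "child": 300}


def harmonics(f0_hz, count):
    """Return the first `count` whole-number multiples of f0_hz.

    The first entry is F0 itself (1 x F0), the second is 2 x F0, and so on.
    """
    return [f0_hz * n for n in range(1, count + 1)]


for speaker, f0 in TYPICAL_F0_HZ.items():
    print(speaker, harmonics(f0, 4))
```

Because the harmonics are spaced F0 apart, a higher-pitched voice has more widely spaced harmonics than a lower-pitched one.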
XIX. Many speech perception experiments focus on discrimination and identification. Research on vowel perception has shown that listeners most accurately identify isolated vowels when presented with the steady state of the vowel sound. In natural speech, however, vowel identification (which occurs between consonants) seems to rely most heavily on vowel duration and formant transitions. (Pages 124-126)
XX. In both laboratory settings and in conversational speech, vowels are perceived more accurately than consonants, perhaps partly because vowel sounds are of longer duration. Consonants, particularly stop consonants, tend to be bound to vowels; the phonetic segment consisting of the consonant and vowel is said to be encoded. Information about the vowel and the consonant is transmitted simultaneously in what is referred to as parallel transmission. (Page 127)
XXI. The difference between voiced and unvoiced cognate consonants in the initial position appears to be highly dependent upon voice-onset-time (VOT). The VOT for voiced consonants appears to range from just before the burst of air is made to about 30 milliseconds afterwards. The VOT for unvoiced consonants ranges from 40 to 100 milliseconds after the burst of air. (Pages 128-129)
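The VOT ranges above can be turned into a simple labeling rule. The sketch below is only an illustration of those ranges, not a model of how listeners actually decide: the function name `classify_vot` is made up, any negative VOT (voicing that starts before the burst) is treated as voiced, and values that fall outside both stated ranges are labeled "unclear".

```python
def classify_vot(vot_ms):
    """Label an initial stop consonant from its voice-onset-time in ms.

    Ranges follow the summary: voiced up to about 30 ms after the burst,
    unvoiced from 40 to 100 ms after the burst.
    Assumption: every negative VOT (voicing before the burst) counts as voiced.
    """
    if vot_ms <= 30:
        return "voiced"
    if 40 <= vot_ms <= 100:
        return "unvoiced"
    # Between 30 and 40 ms, or beyond 100 ms, the summary gives no label.
    return "unclear"


for vot in (-5, 10, 35, 60):
    print(vot, classify_vot(vot))
```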
XXII. Discrimination tasks using voiced and unvoiced initial stop consonants present the listener with a continuum of sounds going from a voiced to an unvoiced consonant and ask the listener to determine when the voiced consonant becomes an unvoiced one. Listeners tend to hear a range of acoustically different stimuli as instances of the same consonant and then, at a certain point, hear the consonant as a different one. This is known as categorical perception. A mid-range stimulus that is sometimes perceived as voiced and sometimes as unvoiced is called the cross-over stimulus. In many perceptual domains, discrimination is better than identification; the hallmark of categorical perception is that discrimination is no better than identification. Categorical perception appears to operate even when the stimulus is a nonspeech sound. (Pages 130-133)
XXIII. In categorical perception studies using monolingual speakers of English and Thai, the perceived phonetic boundaries were greatly different. English speakers heard the sounds of their language, and Thai speakers heard the sounds of theirs. Studies of bilingual speakers generally show that they have a single perceptual system that is situated at the midpoint between their two languages. (Page 135)
XXIV. Studies have shown that many aphasic listeners have unstable responses for several stimuli in the phonetic boundary area, implying that their phonetic boundaries are not clearly set. (Page 136)
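One way to picture categorical perception is an idealized listener who can tell two stimuli apart only when they receive different labels. The sketch below assumes such a listener; the 35 ms boundary, the 10 ms step size, and the function names are hypothetical choices for illustration, not values from the book.

```python
BOUNDARY_MS = 35  # hypothetical voiced/unvoiced boundary on the VOT continuum


def label(vot_ms):
    """Identification: assign each stimulus to one of two categories."""
    return "voiced" if vot_ms < BOUNDARY_MS else "unvoiced"


def can_discriminate(vot_a, vot_b):
    """Discrimination for an idealized categorical listener:
    two stimuli are told apart only if their labels differ."""
    return label(vot_a) != label(vot_b)


# A continuum from 0 to 70 ms VOT in equal 10 ms steps.
continuum = list(range(0, 71, 10))

for a, b in zip(continuum, continuum[1:]):
    print(f"{a:>2} vs {b:>2} ms: {label(a)} / {label(b)} ->",
          "different" if can_discriminate(a, b) else "same")
```

Every neighboring pair is the same 10 ms apart, yet only the pair that straddles the boundary (30 vs 40 ms) is heard as different. That is the sense in which discrimination is no better than identification.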
XXV. The acoustic characteristics of sounds alter as the speaking rate increases. Changes from citation form are greater in vowels than in consonants. All phonetic segments are shortened, but shortening vowels also produces changes in formant frequencies. Listeners appear to expect these changes: in experiments in which the target stimulus is placed in fast and then slow carrier phrases, it may be identified differently. (Page 140)
XXVI. Perception of words in fluent speech is influenced by higher level knowledge of semantics and syntax: top-down processing combines with bottom-up processing (using only acoustic information) to allow listeners to decode fluent speech. (Page 140)
XXVII. Listening-for-mispronunciation tasks have shown that people tend to pay more attention to the initial part of a word than to its end. Sounds in a word are recognized sequentially: the listener accesses a word candidate and "fills in" the end of the word if it is missing or mispronounced. Phonemic restoration also occurs when phonetic segments are replaced by nonspeech sounds. (Pages 140-142)
XXVIII. The motor theory of speech perception posits that we perceive speech in terms of how we produce speech sounds. This theory was developed to deal with the absence of invariance between the acoustic signal and its phonemic representation. Speech is held to be a special type of auditory stimulus for humans. When we are exposed to it, we shift into a speech mode that enables us to link articulatory gestures involved in the production of a sound to the sound that we hear. Perceiving in the speech mode is held to be innate and species specific. (Pages 143-144)
XXIX. The analysis-by-synthesis model of perception proposes that we analyze speech by implicitly generating (synthesizing) speech from what we have heard and then comparing this synthesized speech to what we have heard. Little direct empirical evidence has been found to support this model. (Pages 144-145)
XXX. The fuzzy logical model assumes three operations in speech perception: feature evaluation, integration and decision. Listeners are said to have prototypes of words in their heads which must be matched to the auditory stimulus. This model emphasizes a continuous rather than an all-or-nothing approach to feature decision: the degree of match is evaluated on fuzzy truth values. (Pages 145-146)
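The three operations can be sketched in a few lines. The summary does not say how integration and decision are computed; the sketch below assumes the form in which the model is commonly presented, where feature values are multiplied together and each prototype's product is divided by the sum over all prototypes. The prototype names and the fuzzy truth values are invented for illustration.

```python
import math


def flmp_decision(feature_support):
    """Return a response probability for each prototype.

    feature_support maps a prototype name to a list of fuzzy truth values
    in [0, 1], one per feature (the output of feature evaluation).
    Assumed integration rule: multiply the values for each prototype.
    Assumed decision rule: divide each product by the sum of all products.
    """
    goodness = {name: math.prod(values)
                for name, values in feature_support.items()}
    total = sum(goodness.values())
    if total == 0:
        raise ValueError("no prototype has any support")
    return {name: g / total for name, g in goodness.items()}


# Hypothetical evaluation of two features against two word prototypes.
support = {
    "bat": [0.9, 0.8],  # both features match fairly well
    "pat": [0.2, 0.6],  # the first feature matches poorly
}
probabilities = flmp_decision(support)
print(probabilities)
```

Neither prototype is accepted or rejected outright: the output is a graded preference for "bat", which is the continuous rather than all-or-nothing behavior the model emphasizes.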
XXXI. Cohort theory claims that in the first stage of word recognition the acoustic-phonetic information at the beginning of a word activates a cohort of possible words. In the second stage of analysis, all possible sources of information, including higher level processes, help to eliminate words that are not the target word. (Page 146)
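The two stages can be sketched as two filters over a word list. This is a toy illustration under loud assumptions: letters stand in for phonemes, the five-word lexicon is invented, and "higher level" information is reduced to a single yes/no function supplied by the caller.

```python
# Hypothetical mini-lexicon; spelling stands in for acoustic-phonetic form.
LEXICON = ["candle", "candy", "canteen", "camel", "table"]


def stage_one(lexicon, onset):
    """Stage 1: the word onset activates every word that begins with it."""
    return [word for word in lexicon if word.startswith(onset)]


def stage_two(cohort, heard_so_far, fits_context):
    """Stage 2: drop candidates that conflict with further input
    or with higher-level (here: contextual) information."""
    return [word for word in cohort
            if word.startswith(heard_so_far) and fits_context(word)]


cohort = stage_one(LEXICON, "can")
print(cohort)  # candle, candy, canteen

# Assumed context: the sentence so far calls for something edible.
EDIBLE = {"candy"}
remaining = stage_two(cohort, "cand", lambda word: word in EDIBLE)
print(remaining)  # candy
```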
XXXII. TRACE theory is based on a system of highly interconnected processing units called nodes. Each node has a resting level, a threshold level and an activation level. Phoneme nodes may excite word nodes and word nodes may excite phoneme nodes. (Pages 146-147)
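A node with a resting level, a threshold and an activation level can be sketched as a small class. The sketch covers only what the summary states: two-way excitation between phoneme nodes and word nodes. The update rule, the weights, and the example word are hypothetical, and this is not the published TRACE implementation.

```python
class Node:
    """A processing unit with a resting level, a threshold and an activation."""

    def __init__(self, name, resting=0.0, threshold=0.5):
        self.name = name
        self.resting = resting
        self.threshold = threshold
        self.activation = resting  # every node starts at its resting level
        self.neighbors = []

    def connect(self, other):
        """Create a two-way excitatory link (phoneme <-> word)."""
        self.neighbors.append(other)
        other.neighbors.append(self)

    def above_threshold(self):
        return self.activation >= self.threshold

    def spread(self, weight):
        """Excite every linked node, but only if this node is above threshold."""
        if self.above_threshold():
            for node in self.neighbors:
                node.activation += weight


# Three phoneme nodes feeding one word node (hypothetical example).
phonemes = [Node("k"), Node("a"), Node("t")]
word = Node("cat")
for p in phonemes:
    p.connect(word)

# Bottom-up input pushes each phoneme node above its threshold.
for p in phonemes:
    p.activation = 0.6

for p in phonemes:
    p.spread(weight=0.2)   # phoneme nodes excite the word node
word.spread(weight=0.2)    # the word node excites its phoneme nodes in turn

print(word.name, word.above_threshold())
print([(p.name, round(p.activation, 2)) for p in phonemes])
```

The last step shows the top-down direction: once the word node is active, it feeds activation back to the phoneme nodes that support it.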