Question 1

What is the source-filter theory?

Accepted Answer

Gunnar Fant's 1960 monograph, Acoustic Theory of Speech Production, formalized speech as the product of a sound source and a vocal-tract filter. The source has a broad-band spectrum: for voiced sounds it is the quasi-periodic vibration of the vocal folds at fundamental frequency F0, and for voiceless sounds it is turbulent noise. The vocal tract acts as a tube of varying cross-section whose resonances (formants) boost some frequencies and damp others. The output spectrum is the source spectrum multiplied by the filter's frequency response. The theory underlies much of modern speech synthesis, recognition, and acoustic analysis.
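The multiplication of source spectrum by filter response can be sketched in a few lines of Python. This is a toy illustration, not Fant's full model: the 1/k² source rolloff (about -12 dB per octave), the formant centers and bandwidths, and the function names are all assumptions chosen for the example.

```python
import math

def formant_gain(freq_hz, center_hz, bandwidth_hz):
    """Magnitude response of one resonance (a conjugate pole pair), equal to 1 at 0 Hz."""
    half_bw = bandwidth_hz / 2.0
    numerator = center_hz ** 2 + half_bw ** 2
    denominator = (math.hypot(freq_hz - center_hz, half_bw)
                   * math.hypot(freq_hz + center_hz, half_bw))
    return numerator / denominator

def filter_gain(freq_hz, formants):
    """Vocal-tract filter gain: the product of the gains of all formants."""
    gain = 1.0
    for center_hz, bandwidth_hz in formants:
        gain *= formant_gain(freq_hz, center_hz, bandwidth_hz)
    return gain

def output_spectrum(f0_hz, formants, n_harmonics):
    """Return (frequency, amplitude) pairs: source amplitude times filter gain."""
    spectrum = []
    for k in range(1, n_harmonics + 1):
        freq = k * f0_hz
        source_amp = 1.0 / k ** 2  # assumed glottal-source rolloff, not from the answer
        spectrum.append((freq, source_amp * filter_gain(freq, formants)))
    return spectrum

# Illustrative (center Hz, bandwidth Hz) pairs for an [a]-like vowel; assumed values.
FORMANTS_A = [(700.0, 80.0), (1100.0, 90.0), (2500.0, 120.0)]
spectrum = output_spectrum(120.0, FORMANTS_A, 30)
```

Harmonics that fall near a formant come out stronger than the source slope alone would predict, which is what produces the formant peaks in a measured spectrum.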

Question 2

What are formants?

Accepted Answer

Formants are resonant frequencies of the vocal tract, numbered from the lowest: F1, F2, F3, and so on. As a rough guide, F1 reflects pharyngeal cavity volume and correlates inversely with tongue height: high vowels [i, u] have low F1 (around 280 Hz), while the low vowel [a] has high F1 (around 700 Hz). F2 roughly reflects oral cavity length and correlates with tongue frontness: front [i] has high F2 (around 2300 Hz), while back [u] has low F2 (around 750 Hz). F3 contributes to rhoticity (English [ɹ] has a low F3) and to perceived vowel quality. Peter Ladefoged's Vowels and Consonants (2001) reports representative values across speakers.
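The idea of formants as cavity resonances can be illustrated with the textbook idealization of a neutral vocal tract as a uniform tube, closed at the glottis and open at the lips, which resonates at odd multiples of a quarter wavelength. The sketch below makes that assumption; the 17.5 cm length and the 350 m/s speed of sound are round illustrative figures, not values taken from the answer above.

```python
def tube_resonances(length_cm, n_formants=3, speed_of_sound_cm_s=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end: (2n - 1) * c / (4 * L)."""
    return [
        (2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
        for n in range(1, n_formants + 1)
    ]

# Assumed vocal-tract length of 17.5 cm (hypothetical round figure).
neutral = tube_resonances(17.5)
```

Real vowels depart from this evenly spaced pattern because tongue and lip constrictions change the cross-section along the tube, which is how [i], [a], and [u] end up with the different F1 and F2 values quoted above.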

Question 3

How does a spectrogram work?

Accepted Answer

A spectrogram displays acoustic energy across frequency (vertical axis) and time (horizontal axis), with intensity shown as darkness or color. The spectrograph was developed at Bell Labs in 1944 by Ralph Potter, George Kopp, and Harriet Green for wartime applications and published in Visible Speech (1947). Wide-band spectrograms (300 Hz analysis bandwidth) use a short analysis window, so they resolve events in time well and emphasize formants; narrow-band spectrograms (45 Hz) use a longer window, so they resolve frequency well and emphasize individual harmonics. Modern spectrograms are computed digitally with the short-time Fourier transform (STFT). Praat (Paul Boersma and David Weenink, Universiteit van Amsterdam, 1992 onward) is a free analysis and spectrogram tool widely used by phoneticians.
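A minimal STFT can be sketched with the standard library alone. This is a slow, direct-DFT illustration under assumed settings (Hann window, hypothetical function name and frame sizes); it is not how Praat or any particular tool implements its spectrogram.

```python
import cmath
import math

def stft_magnitudes(signal, frame_len, hop):
    """Return one list of DFT magnitudes (bins 0..frame_len // 2) per analysis frame."""
    # Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

# Example: a 1000 Hz tone sampled at 8000 Hz; 64-sample frames give 125 Hz bin spacing.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(256)]
frames = stft_magnitudes(tone, frame_len=64, hop=32)
peak_bins = [frame.index(max(frame)) for frame in frames]
```

Each inner list is one vertical slice of the spectrogram. Lengthening `frame_len` narrows the analysis bandwidth (toward a narrow-band display), while shortening it widens the bandwidth and sharpens time resolution (toward a wide-band display).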

Question 4

What is fundamental frequency F0?

Accepted Answer

F0 is the rate of vocal-fold vibration, perceived as pitch. Adult male F0 averages around 120 Hz, adult female around 220 Hz, and children's around 300 Hz. The fundamental period is the inverse of F0: about 8.3 ms for a 120 Hz male voice. F0 is controlled chiefly by the cricothyroid muscles, which stretch and tense the vocal folds, together with subglottal pressure. F0 carries linguistic information in tone languages (Mandarin, Yoruba, Cantonese), in intonation patterns (the English question rise), and in stress. Janet Pierrehumbert's autosegmental-metrical theory (1980 dissertation, MIT) modeled English intonation as a sequence of high (H) and low (L) tones aligned with prosodic structure.
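The relation between F0 and period, and one simple way to estimate F0 from a waveform, can be sketched as follows. The autocorrelation method shown here is a basic textbook approach chosen for illustration; the synthetic test signal, the 75-400 Hz search range, and the function names are assumptions, and production pitch trackers add many refinements.

```python
import math

def period_ms(f0_hz):
    """Fundamental period in milliseconds: the inverse of F0."""
    return 1000.0 / f0_hz

def synth_voice(f0_hz, sample_rate, n_samples, n_harmonics=5):
    """Toy voiced signal: harmonics of F0 with amplitudes falling as 1/k (assumed)."""
    return [
        sum(math.sin(2 * math.pi * k * f0_hz * n / sample_rate) / k
            for k in range(1, n_harmonics + 1))
        for n in range(n_samples)
    ]

def estimate_f0(signal, sample_rate, f0_min=75.0, f0_max=400.0):
    """Return the F0 (Hz) whose period gives the largest autocorrelation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = int(sample_rate / f0_min)
    n_terms = len(signal) - max_lag  # same number of products at every lag
    if n_terms <= 0:
        raise ValueError("signal too short for the requested F0 range")
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[n] * signal[n + lag] for n in range(n_terms))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

voice = synth_voice(120.0, 12000, 660)
estimated = estimate_f0(voice, 12000)
```

The search range matters: if it admitted lags of two full periods, the estimator could report half the true F0, a common failure known as an octave error.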

Question 5

How are speech sounds perceived?

Accepted Answer

Categorical perception, first reported by Alvin Liberman and colleagues at Haskins Laboratories (1957), shows that listeners sort continuous acoustic variation into phonemic categories. A continuum of voice onset times spanning [b] to [p] is heard not as a gradual change but as two categories separated by a sharp boundary. Sine-wave speech (Robert Remez, Philip Rubin, and colleagues, 1981) shows that listeners can perceive speech from highly impoverished signals once they recognize the input as speech. The motor theory (Liberman, Mattingly) proposed that listeners decode the gestures producing the sound; the direct realist theory (Carol Fowler) holds that the gestures themselves are perceived; analysis-by-synthesis (Kenneth Stevens) holds that listeners hypothesize candidate productions and verify them against the signal. The debate continues.
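The sharp category boundary can be pictured with an idealized identification curve. The logistic shape is a common way to summarize such data, but the 25 ms boundary and 3 ms slope below are illustrative assumptions, not measurements from the studies cited above.

```python
import math

def prob_voiceless(vot_ms, boundary_ms=25.0, slope_ms=3.0):
    """Idealized probability of labeling a stimulus [p] given its voice onset time."""
    return 1.0 / (1.0 + math.exp(-(vot_ms - boundary_ms) / slope_ms))

# A continuum in equal 10 ms steps from 0 to 60 ms of VOT.
continuum = list(range(0, 61, 10))
labels = ["p" if prob_voiceless(v) >= 0.5 else "b" for v in continuum]

# Equal acoustic steps, very unequal perceptual steps:
across_boundary = prob_voiceless(30) - prob_voiceless(20)  # straddles the boundary
within_category = prob_voiceless(50) - prob_voiceless(40)  # both clearly [p]
```

The same 10 ms step changes the response far more when it crosses the boundary than when both stimuli fall inside one category, which is the pattern the term "categorical" refers to.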

Question 6

What is voice quality?

Accepted Answer

Voice quality reflects how the vocal folds vibrate. Modal voice, with regular periodic vibration and full glottal closure, is the default. Breathy voice, with incomplete glottal closure and audible aspiration noise, appears contrastively in Mazatec and in English [h]. Creaky voice (vocal fry), with slow, irregular pulses, marks the ends of utterances in English and is phonemic in some Otomanguean languages. John Laver's The Phonetic Description of Voice Quality (1980) catalogued the parameters. Electroglottography (EGG) tracks vocal-fold contact through changes in electrical impedance across the larynx. Voice quality interacts with tone (Vietnamese register, Burmese phonation distinctions) and is often quantified with spectral measures such as H1-H2, the amplitude difference in dB between the first and second harmonics.
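H1-H2 can be sketched as the dB difference between the amplitudes of the first two harmonics. The version below is deliberately simplified: it assumes F0 is already known and that the analysis window spans a whole number of periods, and it applies no correction for formant influence. The function names and the synthetic signal are hypothetical.

```python
import cmath
import math

def harmonic_amplitude(signal, sample_rate, freq_hz):
    """Amplitude of the sinusoidal component at freq_hz, found by direct correlation."""
    acc = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * i / sample_rate)
        for i, x in enumerate(signal)
    )
    return 2.0 * abs(acc) / len(signal)

def h1_minus_h2_db(signal, sample_rate, f0_hz):
    """Difference in dB between the first (F0) and second (2 * F0) harmonics."""
    h1 = harmonic_amplitude(signal, sample_rate, f0_hz)
    h2 = harmonic_amplitude(signal, sample_rate, 2 * f0_hz)
    return 20.0 * math.log10(h1 / h2)

# Synthetic signal: first harmonic twice as strong as the second (assumed values).
SAMPLE_RATE = 12000
F0 = 120.0
signal = [
    1.0 * math.sin(2 * math.pi * F0 * n / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 2 * F0 * n / SAMPLE_RATE)
    for n in range(600)  # exactly 6 periods at 120 Hz
]
h1_h2 = h1_minus_h2_db(signal, SAMPLE_RATE, F0)
```

Larger H1-H2 values are generally associated with breathier phonation and smaller or negative values with creakier phonation, though the measure is only comparable across tokens when vowel quality is controlled.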

Question 7

How does speech bandwidth affect intelligibility?

Accepted Answer

Traditional narrowband telephony limits speech to roughly 300-3400 Hz, removing the F0 component of most voices (listeners still hear the pitch from the spacing of the remaining harmonics, the "missing fundamental" effect) as well as high-frequency content. Intelligibility remains high because most consonant cues lie below 4 kHz. Wide-band telephony (G.722, 50-7000 Hz) and the Opus codec in its fullband mode (up to 20000 Hz) recover natural quality. Sibilants ([s], [ʃ]) carry significant energy at 4-8 kHz, so bandlimited speech has to preserve their contrast through other cues. Hearing-aid fitting and cochlear-implant programming draw on frequency-importance functions, which weight each band by its contribution to intelligibility, to allocate spectral resources to the most informative bands.
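The effect of the telephone band on a voiced sound can be worked through with simple arithmetic. The sketch below lists which harmonics of a 120 Hz voice fall inside 300-3400 Hz and shows that their common spacing still equals F0. Using the greatest common divisor is only a toy stand-in for pitch perception, not a model of the auditory system, and the function names are hypothetical.

```python
from functools import reduce
from math import gcd

def surviving_harmonics(f0_hz, low_hz=300, high_hz=3400, max_hz=8000):
    """Harmonics of f0_hz (integer Hz) that fall inside the passband."""
    return [
        k * f0_hz
        for k in range(1, max_hz // f0_hz + 1)
        if low_hz <= k * f0_hz <= high_hz
    ]

def implied_fundamental(harmonics_hz):
    """Largest frequency of which every surviving harmonic is a multiple."""
    return reduce(gcd, harmonics_hz)

kept = surviving_harmonics(120)
pitch_hz = implied_fundamental(kept)
```

For a 120 Hz voice the fundamental and the second harmonic (240 Hz) are filtered out, yet the harmonics from 360 Hz to 3360 Hz remain 120 Hz apart, so the periodicity that signals the pitch is still present in the signal.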

Sound Waves of Speech

