Acoustic Phonetics
Sound Waves of Speech
Acoustic phonetics — how vocal-tract physics shapes the formants you hear as vowels
Speech propagates through air as longitudinal pressure waves, generated at the vocal folds and shaped by the vocal tract. Fourier analysis decomposes the complex waveform into a fundamental frequency (F0, perceived as pitch) and its harmonics; the resonances of the vocal tract (formants F1, F2, F3) largely determine vowel identity. Gunnar Fant set out the acoustic theory of speech production in his 1960 monograph Acoustic Theory of Speech Production, building on earlier work such as Chiba and Kajiyama's The Vowel: Its Nature and Structure (1941). Spectrograms, invented at Bell Labs in 1944 (Potter, Kopp, Green), made the formants visible. F1 correlates inversely with tongue height; F2 correlates with tongue frontness. The vowel triangle [i] [a] [u] in F1-F2 space recurs across most of the world's languages and serves as a reference point in acoustic phonetics, speech recognition, and linguistic typology.
- Source-filter theory: Gunnar Fant (1960); the vocal folds supply the source, the vocal tract acts as the filter
- Spectrogram: Bell Labs, 1944; "visible speech"
- Fundamental frequency: F0 ≈ 100-300 Hz (men 120, women 220, children 300)
- Vowel formants: F1 (300-900 Hz) tracks height; F2 (700-2500 Hz) tracks frontness
- Vowel triangle: [i] high front, [a] low central, [u] high back
- Speech bandwidth: 100 Hz - 8 kHz; telephone speech is bandlimited to 300-3400 Hz
Why speech acoustics matter
- Phonetic transcription. Acoustic measurements ground IPA categories empirically.
- Speech recognition. Formant tracking and spectral features feed acoustic models.
- Speech synthesis. Source-filter and parametric synthesis rely on formant control.
- Forensic phonetics. Speaker identification uses formant patterns and F0 distributions.
- Hearing aids and cochlear implants. Frequency-importance functions guide signal processing.
- Sociolinguistics. Vowel space measurements track dialect change (Northern Cities Shift, etc.).
- Language documentation. Praat-based fieldwork preserves under-studied languages.
Common misconceptions
- Vowels are pure tones. They are spectrally complex, defined by formant patterns.
- F0 equals pitch. F0 is acoustic and pitch is perceptual; listeners can hear the pitch of a fundamental that is absent from the signal by inferring it from the harmonics.
- Formants vary independently. They covary with articulation, and the vowel triangle bounds the combinations that occur.
- Speech is digital. The signal is acoustically continuous; categorical perception emerges from listener processing.
- Telephone speech is degraded beyond use. Most consonant cues survive bandlimiting.
- One spectrogram fits all analyses. Wide- vs. narrow-band trade time and frequency resolution.
Frequently asked questions
What is the source-filter theory?
Gunnar Fant's 1960 monograph, Acoustic Theory of Speech Production, formalized speech as the product of a sound source and a vocal-tract filter. The source has a broad-band spectrum: for voiced sounds it is the periodic vibration of the vocal folds at fundamental frequency F0, and for voiceless sounds it is turbulent noise. The vocal tract acts as a tube of varying cross-section whose resonances (formants) boost some frequencies and damp others. The output spectrum is the source spectrum multiplied by the filter response, with a further radiation effect at the lips. The theory underlies most modern speech synthesis, recognition, and acoustic analysis.
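The multiplication of source and filter can be sketched in a few lines. This is a minimal illustration rather than Fant's own formulation: the -12 dB per octave source tilt, the 80 Hz bandwidths, the F2 and F3 values, and all function names are assumptions chosen for the example, and the radiation effect at the lips is left out.

```python
import math

def source_amplitude(k, tilt_db_per_octave=-12.0):
    """Relative amplitude of harmonic k for an idealized voiced source."""
    return 10 ** (tilt_db_per_octave * math.log2(k) / 20)

def resonance_gain(f, center, bandwidth):
    """Magnitude of one formant resonance at f Hz, normalized to 1.0 at 0 Hz."""
    half_bw_sq = (bandwidth / 2) ** 2
    numerator = center ** 2 + half_bw_sq
    denominator = math.sqrt(((f - center) ** 2 + half_bw_sq) *
                            ((f + center) ** 2 + half_bw_sq))
    return numerator / denominator

def vocal_tract_gain(f, formants):
    """Filter response: product of the individual formant resonances."""
    gain = 1.0
    for center, bandwidth in formants:
        gain *= resonance_gain(f, center, bandwidth)
    return gain

def output_spectrum(f0, formants, n_harmonics=30):
    """Return (frequency, amplitude) pairs: source amplitude times filter gain."""
    spectrum = []
    for k in range(1, n_harmonics + 1):
        f = k * f0
        spectrum.append((f, source_amplitude(k) * vocal_tract_gain(f, formants)))
    return spectrum

# Hypothetical [a]-like filter: (center Hz, bandwidth Hz) for F1, F2, F3.
A_FORMANTS = [(700.0, 80.0), (1200.0, 80.0), (2500.0, 80.0)]

if __name__ == "__main__":
    for f, amp in output_spectrum(120.0, A_FORMANTS, n_harmonics=10):
        print(f"{f:7.1f} Hz  {20 * math.log10(amp):6.1f} dB")
```

Changing `f0` moves the harmonics but leaves the filter peaks in place, which is why the same vowel can be recognized at different pitches.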
What are formants?
Formants are resonant frequencies of the vocal tract, numbered from the lowest upward: F1, F2, F3, and so on. F1 is commonly linked to pharyngeal cavity volume and correlates inversely with tongue height: high vowels [i, u] have low F1 (around 280 Hz), while the low vowel [a] has high F1 (around 700 Hz). F2 is commonly linked to oral cavity length and correlates with tongue frontness: front [i] has high F2 (around 2300 Hz), while back [u] has low F2 (around 750 Hz). F3 contributes to rhoticity (English [ɹ] has a low F3) and to perceived vowel quality. Peter Ladefoged's Vowels and Consonants (2001) gives reference values across speakers.
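To make the F1-F2 mapping concrete, the sketch below labels a measured (F1, F2) pair with the nearest corner of the vowel triangle. The F1 and F2 values for [i] and [u] and the F1 of [a] come from the paragraph above; the F2 of [a] (1200 Hz), the log-frequency distance, and the function name are assumptions made for illustration. Real classifiers normalize for speaker differences, which this toy version ignores.

```python
import math

# (F1, F2) in Hz. The F2 of "a" is an assumed placeholder, not from the text.
PROTOTYPES = {
    "i": (280.0, 2300.0),
    "a": (700.0, 1200.0),
    "u": (280.0, 750.0),
}

def nearest_vowel(f1, f2, prototypes=PROTOTYPES):
    """Return the prototype vowel closest to (f1, f2) on a log-frequency scale."""
    def distance(proto):
        p1, p2 = proto
        return math.hypot(math.log(f1 / p1), math.log(f2 / p2))
    return min(prototypes, key=lambda vowel: distance(prototypes[vowel]))

if __name__ == "__main__":
    for f1, f2 in [(300, 2200), (650, 1100), (320, 800)]:
        print(f1, f2, "->", nearest_vowel(f1, f2))
```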
How does a spectrogram work?
A spectrogram displays acoustic energy across frequency (vertical axis) and time (horizontal axis), with intensity shown as darkness or color. The spectrograph was developed at Bell Labs in 1944 by Ralph Potter, George Kopp, and Harriet Green for wartime applications and published in Visible Speech (1947). Wide-band spectrograms (300 Hz analysis bandwidth) use short analysis windows, so they resolve individual glottal pulses and emphasize formants; narrow-band spectrograms (45 Hz) use longer windows and resolve individual harmonics. Modern spectrograms are computed digitally with the short-time Fourier transform (STFT). Praat (Paul Boersma and David Weenink, University of Amsterdam, 1992 onward) is a free analysis tool that has become a de facto standard among phoneticians.
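The STFT behind a digital spectrogram can be sketched with the standard library alone. This is a deliberately slow, direct DFT for illustration, not how Praat or any production tool computes it; the Hann window, the function names, and the parameter values are assumptions. The point to notice is that the spacing between frequency bins is `sample_rate / win_len`, so a shorter window gives finer time resolution and coarser frequency resolution.

```python
import cmath
import math

def sine_tone(freq, sample_rate, n_samples):
    """A pure tone, used here only as a test signal."""
    return [math.sin(2 * math.pi * freq * n / sample_rate) for n in range(n_samples)]

def stft_magnitudes(samples, win_len, hop):
    """Magnitude STFT: one list of win_len // 2 + 1 bin magnitudes per frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / win_len) for n in range(win_len)]
    frames = []
    for start in range(0, len(samples) - win_len + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(win_len)]
        frame = []
        for k in range(win_len // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / win_len)
                      for n in range(win_len))
            frame.append(abs(acc))
        frames.append(frame)
    return frames

def peak_frequency(frame, sample_rate, win_len):
    """Frequency in Hz of the strongest bin in one frame."""
    peak_bin = max(range(len(frame)), key=lambda k: frame[k])
    return peak_bin * sample_rate / win_len

if __name__ == "__main__":
    tone = sine_tone(1000.0, 8000, 512)
    for win_len in (64, 256):  # shorter window vs. longer window
        frames = stft_magnitudes(tone, win_len, win_len // 2)
        print(win_len, "samples:", len(frames), "frames,",
              8000 / win_len, "Hz per bin")
```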
What is fundamental frequency F0?
F0 is the rate of vocal-fold vibration, perceived as pitch. Adult male F0 averages around 120 Hz, adult female around 220 Hz, and children's around 300 Hz. The fundamental period is the reciprocal of F0: about 8.3 ms for a 120 Hz male voice. F0 is controlled mainly by the cricothyroid muscles, which tension the vocal folds. F0 carries linguistic information in tone languages (Mandarin, Yoruba, Cantonese), in intonation patterns (the English question rise), and in stress. Janet Pierrehumbert's 1980 MIT dissertation, a foundation of autosegmental-metrical theory, modeled English intonation as a sequence of high (H) and low (L) tones aligned with prosodic structure.
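One common way to measure F0 is to find the lag at which a signal best matches a shifted copy of itself, since that lag is the fundamental period. The sketch below is a bare autocorrelation estimator run on a synthetic tone; the search range of 50-400 Hz, the 12 kHz sample rate, and the function names are assumptions, and real pitch trackers add voicing decisions and octave-error handling that are omitted here.

```python
import math

def harmonic_tone(f0, sample_rate, n_samples, n_harmonics=5):
    """Sum of equal-amplitude harmonics of f0, a stand-in for a voiced sound."""
    return [sum(math.sin(2 * math.pi * k * f0 * n / sample_rate)
                for k in range(1, n_harmonics + 1))
            for n in range(n_samples)]

def estimate_f0(samples, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate F0 in Hz as sample_rate / lag at the autocorrelation peak."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[n] * samples[n + lag]
                    for n in range(len(samples) - lag))
        if score > best_score:  # strict '>' keeps the shortest matching period
            best_lag, best_score = lag, score
    return sample_rate / best_lag

if __name__ == "__main__":
    voice = harmonic_tone(120.0, 12000, 2400)
    f0 = estimate_f0(voice, 12000)
    print(f"F0 = {f0:.1f} Hz, period = {1000 / f0:.2f} ms")
```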
How are speech sounds perceived?
Categorical perception, first reported by Alvin Liberman, Katherine Harris, Howard Hoffman, and Belver Griffith at Haskins Laboratories (1957), shows that listeners sort continuous acoustic variation into discrete phonemic categories. A continuum of voice onset times spanning [b] to [p] is perceived not as a gradual change but as two categories separated by a sharp boundary. Sine-wave speech (Robert Remez, Philip Rubin, and colleagues, 1981) shows that listeners can perceive speech from highly impoverished signals once they recognize the input as speech. Three theoretical accounts compete: the motor theory (Liberman, Mattingly) proposes that listeners decode the gestures that produced the sound; the direct realist theory (Carol Fowler) holds that the gestures themselves are perceived; and analysis-by-synthesis (Stevens) holds that listeners generate hypotheses and verify them against the signal. The debate continues.
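The sharp boundary on a voice-onset-time continuum is often summarized with an S-shaped identification curve. The sketch below uses a logistic function as an illustrative stand-in; the 25 ms boundary, the slope, and the function names are assumptions rather than fitted values from any study. It shows why two tokens 10 ms apart are labeled almost identically within a category but very differently when they straddle the boundary.

```python
import math

def prob_voiceless(vot_ms, boundary_ms=25.0, slope_per_ms=0.5):
    """Logistic identification curve: probability of labeling a token as [p]."""
    return 1.0 / (1.0 + math.exp(-slope_per_ms * (vot_ms - boundary_ms)))

def labeling_shift(vot_a, vot_b, **params):
    """How much the [p] response rate differs between two tokens."""
    return abs(prob_voiceless(vot_b, **params) - prob_voiceless(vot_a, **params))

if __name__ == "__main__":
    continuum = [0, 10, 20, 30, 40, 50]  # VOT steps in ms
    for a, b in zip(continuum, continuum[1:]):
        print(f"{a:2d} vs {b:2d} ms: shift = {labeling_shift(a, b):.3f}")
```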
What is voice quality?
Voice quality reflects how the vocal folds vibrate. Modal voice, with full periodic vibration, is the default. Breathy voice, with incomplete glottal closure and audible aspiration, is contrastive in Mazatec and occurs in English intervocalic [h]. Creaky voice (vocal fry), with slow irregular pulses, marks the ends of utterances in English and is phonemic in some Otomanguean languages. John Laver's The Phonetic Description of Voice Quality (1980) catalogued the parameters. Electroglottography (EGG) tracks vocal-fold contact non-invasively through changes in electrical impedance across the larynx. Voice quality interacts with tone (Vietnamese register, Burmese phonation distinctions) and is commonly quantified with spectral measures such as H1-H2, the level difference between the first two harmonics.
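H1-H2 is the level of the first harmonic minus the level of the second, in dB; breathier phonation tends to give larger values and creakier phonation smaller ones. The sketch below measures it on a synthetic two-harmonic signal with a known F0. The function names and the test signal are assumptions, and the version here is uncorrected: published measures often compensate for the influence of formants on the harmonic levels, which this example does not attempt.

```python
import math

def two_harmonic_tone(f0, a1, a2, sample_rate, n_samples):
    """Synthetic signal with amplitude a1 at f0 and a2 at 2 * f0."""
    return [a1 * math.sin(2 * math.pi * f0 * n / sample_rate) +
            a2 * math.sin(2 * math.pi * 2 * f0 * n / sample_rate)
            for n in range(n_samples)]

def harmonic_amplitude(samples, sample_rate, freq):
    """Amplitude of the sinusoidal component at freq (single-frequency DFT)."""
    re = sum(s * math.cos(2 * math.pi * freq * n / sample_rate)
             for n, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq * n / sample_rate)
             for n, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)

def h1_h2_db(samples, sample_rate, f0):
    """H1-H2: level of the first harmonic minus the second, in dB."""
    h1 = harmonic_amplitude(samples, sample_rate, f0)
    h2 = harmonic_amplitude(samples, sample_rate, 2 * f0)
    return 20 * math.log10(h1 / h2)

if __name__ == "__main__":
    # 800 samples at 8 kHz is exactly 20 periods of a 200 Hz tone.
    signal = two_harmonic_tone(200.0, 1.0, 0.5, 8000, 800)
    print(f"H1-H2 = {h1_h2_db(signal, 8000, 200.0):.2f} dB")
```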
How does speech bandwidth affect intelligibility?
Telephones bandlimit speech to roughly 300-3400 Hz, removing F0 (which the listener reconstructs as the "missing fundamental" from the surviving harmonics) and high-frequency content. Intelligibility remains high because most consonant cues lie below 4 kHz. Wide-band telephony (G.722, 50-7000 Hz) and the Opus codec (up to 20000 Hz) restore a more natural quality. Sibilants ([s], [ʃ]) carry significant energy at 4-8 kHz, so narrow-band speech has to convey their contrasts largely through other cues. Hearing-aid fitting and cochlear-implant programming rely on frequency-importance functions, which weight each band by its contribution to intelligibility, to allocate spectral resources to the most informative bands.
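A quick calculation shows why the pitch of a voice survives the telephone band even though F0 itself does not. The sketch below lists which harmonics of a given F0 fall inside 300-3400 Hz and recovers their common spacing. It assumes integer frequencies in Hz and an ideal brick-wall filter, and the greatest common divisor is only an arithmetic stand-in for the periodicity that listeners exploit, not a model of hearing; the function names are hypothetical.

```python
from functools import reduce
from math import gcd

def harmonics_in_band(f0_hz, low_hz=300, high_hz=3400):
    """Integer harmonic frequencies of f0_hz that fall inside the passband."""
    return [k * f0_hz for k in range(1, high_hz // f0_hz + 1)
            if low_hz <= k * f0_hz <= high_hz]

def implied_fundamental(harmonics_hz):
    """Greatest common divisor of the surviving harmonics, in Hz."""
    return reduce(gcd, harmonics_hz)

if __name__ == "__main__":
    for f0 in (120, 220):
        kept = harmonics_in_band(f0)
        print(f"F0 {f0} Hz: lowest surviving harmonic {kept[0]} Hz, "
              f"implied fundamental {implied_fundamental(kept)} Hz")
```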