Acoustic Phonetics

Voice Onset Time

The millisecond gap between stop release and vocal-fold vibration that distinguishes [b] from [p]

Voice Onset Time (VOT) is the interval between the release of a stop consonant and the onset of vocal-fold vibration. English [pa] has a long positive VOT (around +60 ms: a puff of aspiration precedes voicing); English [ba] has a short positive VOT (around 0-15 ms); Spanish [ba] has a negative VOT (around -100 ms: voicing begins before release, which is called prevoicing). Leigh Lisker and Arthur Abramson, in their classic 1964 paper "A Cross-Language Study of Voicing in Initial Stops" (Word), measured VOT in eleven languages and showed that stops cluster into three regions of the VOT continuum: voiced (negative VOT, or voicing lead), voiceless unaspirated (short positive, or short lag), and voiceless aspirated (long positive, or long lag). Languages place their category boundaries at different points on this continuum, which helps explain both the language-specific categorical perception of stop consonants and the voicing confusions of L2 learners.

  • Definition: Time from stop release to onset of voicing
  • English categories: [b]: short lag (0-25 ms); [p]: long lag (40-100 ms)
  • Spanish [b]: Prevoiced (-100 ms typical); voicing precedes release
  • Three-way contrast: Thai (voiced, voiceless unaspirated, voiceless aspirated); Hindi adds a fourth, breathy-voiced series
  • Foundational study: Lisker and Abramson, Word 1964
  • Categorical perception: Sharp perceptual boundary in VOT continuum

Why VOT matters

  • Phonetics. VOT is the cleanest acoustic dimension distinguishing voiced from voiceless stops.
  • Cross-linguistic comparison. Lisker and Abramson's three-region model fits most stop systems.
  • Categorical perception. VOT continua produced the foundational evidence for sharp perceptual boundaries.
  • L2 learning. Adult learners struggle to acquire native VOT distributions, so pronunciation teaching often targets VOT directly.
  • Speech disorders. VOT abnormalities mark apraxia, dysarthria, and Parkinson's speech.
  • Forensic phonetics. VOT contributes to speaker comparison and L1 identification.
  • Speech recognition and synthesis. Realistic stop generation requires accurate VOT modeling.

Common misconceptions

  • Voiced stops always have voicing through closure. Many "voiced" English [b]s are voiceless during closure — short-lag VOT.
  • Aspiration equals voicelessness. Voiceless stops can be unaspirated (Spanish [p]); aspiration is a separate dimension.
  • VOT boundaries are universal. Languages select different boundaries; English [b]/[p] boundary differs from Spanish.
  • Categorical perception is purely innate. Language experience tunes boundaries; infants start broader.
  • VOT is binary. It is continuous; boundary placement creates phonemic categories.
  • VOT only matters in the lab. VOT contrasts are produced consistently in spontaneous speech, and they shape L2 acquisition.

Frequently asked questions

How is VOT measured?

VOT is measured from a wide-band spectrogram or waveform. The release of the stop is identified as a transient burst, a brief spike of energy as the closure opens. The onset of voicing is identified as the first appearance of regular vocal-fold vibration, visible as periodic pulses in the waveform or as a pattern of vertical striations in the spectrogram. The interval between these two events is the VOT. Positive VOT means voicing follows release; negative VOT (prevoicing) means voicing precedes release. Praat (developed by Paul Boersma and David Weenink since 1992) is the standard tool, and semi-automated VOT measurement scripts are in widespread use.
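The sign convention can be made concrete with a small sketch. It assumes the two landmarks (burst release and first glottal pulse) have already been annotated, by hand or by script, as times in seconds. The function name `vot_ms` and the example times are hypothetical, chosen only to mirror the typical values quoted earlier.

```python
def vot_ms(release_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds: voicing onset minus stop release.

    Positive values mean voicing lags the release; negative values
    mean voicing leads it (prevoicing).
    """
    return (voicing_onset_s - release_s) * 1000.0


# Hypothetical landmark times in seconds, not measurements from real recordings.
aspirated = vot_ms(release_s=0.250, voicing_onset_s=0.310)  # voicing lags release
prevoiced = vot_ms(release_s=0.250, voicing_onset_s=0.150)  # voicing leads release

print(round(aspirated), round(prevoiced))
```

Keeping the subtraction in one place avoids sign errors when prevoiced and aspirated tokens are mixed in the same data set.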

What languages have what VOT systems?

Lisker and Abramson's 1964 cross-language study identified three regions of the VOT continuum, and languages differ in which regions they use. Two-way "true voicing" systems (Spanish, French, Russian, Hungarian) contrast prevoiced [b] (negative VOT) with voiceless unaspirated [p] (short positive). Two-way aspirating systems (English, German) contrast short-lag [b] (close to zero) with long-lag [pʰ] (long positive, aspirated). Thai uses all three regions. Hindi has a four-way system that adds a breathy-voiced (voiced aspirated) series, which VOT alone does not separate from the plain voiced series. Korean has a three-way contrast among lenis, tense, and aspirated stops, all voiceless word-initially, so VOT alone does not fully distinguish them. Some languages also have implosives or ejectives, which use different airstream mechanisms and lie outside the VOT axis. Even so, VOT remains one of the most useful acoustic dimensions for cross-linguistic comparison.
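As a rough illustration of how the same VOT value can fall into different phonemic categories, the sketch below assigns a token to one of the three regions and then labels it under two language-specific boundaries. The 30 ms short-lag limit, the 0 ms Spanish boundary, and the names `vot_region`, `label_stop`, and `BOUNDARY_MS` are assumptions made for this example, not values from a particular study.

```python
# Illustrative two-way category boundaries in ms (assumed values).
BOUNDARY_MS = {
    "English": 25.0,  # short lag vs long lag
    "Spanish": 0.0,   # voicing lead vs short lag (assumed for illustration)
}


def vot_region(vot: float, short_lag_max: float = 30.0) -> str:
    """Place a VOT value (ms) in one of the three regions of the continuum."""
    if vot < 0:
        return "voicing lead"
    if vot <= short_lag_max:
        return "short lag"
    return "long lag"


def label_stop(vot: float, language: str) -> str:
    """Label a bilabial stop as 'b' or 'p' under a two-way system."""
    return "b" if vot < BOUNDARY_MS[language] else "p"


for vot in (-100.0, 10.0, 60.0):
    labels = {lang: label_stop(vot, lang) for lang in BOUNDARY_MS}
    print(vot, vot_region(vot), labels)
```

Under these assumed boundaries, a +10 ms token is labeled "b" for English and "p" for Spanish, which is the kind of mismatch behind many L2 voicing confusions.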

What is categorical perception of VOT?

Categorical perception was first demonstrated at Haskins Laboratories by Liberman, Harris, Hoffman, and Griffith (1957), using a synthetic place-of-articulation continuum. For voicing, Abramson and Lisker (1970) generated synthetic stops with VOT varying in small steps and asked listeners to identify and discriminate them. English-speaking listeners showed a sharp boundary around +25 ms for labial stops: stimuli below were heard as [b], stimuli above as [p], with little gradient response. Discrimination was likewise best for pairs straddling the boundary. Categorical perception was initially considered uniquely linguistic but was later reported in other domains (color, faces). Eimas, Siqueland, Jusczyk, and Vigorito (1971) showed that young infants also discriminate VOT differences better across the English boundary than within a category, and later studies reported similar sensitivity in infants from non-English-speaking backgrounds before language-specific tuning reshapes perception.
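The link between sharp labeling and a discrimination peak can be sketched numerically. The example below models identification as a logistic function of VOT and derives a predicted ABX score from the labels alone, using the simple rule 0.5 + 0.5 × (difference in labeling probability)². The logistic form, the 3 ms slope, and the function names are assumptions made for illustration; only the +25 ms boundary comes from the text.

```python
import math

BOUNDARY_MS = 25.0  # boundary quoted in the text above
SLOPE_MS = 3.0      # hypothetical steepness; smaller means sharper labeling


def p_voiceless(vot: float) -> float:
    """Probability of a [p] response, modeled as a logistic function of VOT."""
    return 1.0 / (1.0 + math.exp(-(vot - BOUNDARY_MS) / SLOPE_MS))


def predicted_abx(vot_a: float, vot_b: float) -> float:
    """Predicted proportion correct in ABX if listeners rely only on labels."""
    diff = p_voiceless(vot_a) - p_voiceless(vot_b)
    return 0.5 + 0.5 * diff ** 2


# Two pairs with the same 20 ms physical difference.
within = predicted_abx(40.0, 60.0)  # both on the [p] side of the boundary
across = predicted_abx(15.0, 35.0)  # straddles the boundary

print(f"within-category: {within:.2f}, across-boundary: {across:.2f}")
```

With these settings the within-category pair is predicted to sit near chance while the across-boundary pair is discriminated well, even though both pairs differ by the same number of milliseconds.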

How do bilinguals handle different VOT systems?

Bilinguals show partial separation of VOT distributions across their two languages, but rarely produce fully nativelike values in both. Work by Fred Genesee and James Flege shows that L2 learners produce intermediate VOTs, blending native and target patterns. A Spanish-English bilingual, for example, may produce English [p] with a shorter VOT than monolingual English speakers do. Code-switching often triggers VOT shifts mid-utterance. Forensic phonetics uses VOT as one feature in speaker identification: a speaker's VOT distributions are reasonably stable within a language but vary across registers.

How does VOT change developmentally?

Children's early productions show smaller VOT distinctions than adult productions. English-learning toddlers may produce both [b] and [p] with short-lag VOT, gradually extending the [p] distribution to long-lag values by age four. Motor control over the relative timing of laryngeal and oral gestures matures over several years. Susan Curtin and colleagues have documented the gradual VOT separation in longitudinal studies. By kindergarten, most children produce nativelike VOT distributions, though some delays persist into elementary school.

What are the articulatory mechanisms underlying VOT?

VOT depends on the relative timing of two events: release of the supralaryngeal closure (lips for labials, tongue tip for coronals, tongue back for velars) and initiation of vocal-fold vibration. Vocal-fold vibration requires (a) adduction (the folds are brought together with appropriate tension) and (b) airflow across them, which drives self-sustained oscillation through aerodynamic (Bernoulli) forces. Sustaining voicing during a stop closure requires active expansion of the supraglottal cavity (larynx lowering, pharynx widening) to keep air flowing through the glottis as intraoral pressure rises. Voiceless aspirated stops require glottal abduction at release. Languages select one of these articulatory configurations as the default for each phoneme.

How is VOT used in clinical and forensic phonetics?

VOT is a sensitive marker of motor control and is altered in speech disorders. Apraxia of speech often shows abnormally variable VOT. Parkinson's disease patients tend toward shortened VOT for voiceless stops (reduced aspiration). Stuttering involves VOT disruptions during disfluency. Speech-language pathologists measure VOT to track therapy progress. Forensic phoneticians use VOT distributions as one feature in speaker comparison. VOT distributions overlap too much across speakers to serve as a unique identifier, but they contribute to multivariate profiles. Bilingual VOT signatures can sometimes identify a speaker's L1.
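Because several of these uses depend on variability as well as on average VOT, a minimal sketch of the summary statistics may help. The token values below are invented for illustration and are not clinical data; `vot_summary` is a hypothetical helper, and the coefficient of variation is only one of several possible variability measures.

```python
from statistics import mean, stdev


def vot_summary(tokens_ms: list[float]) -> dict[str, float]:
    """Summarize VOT tokens (ms): mean, sample SD, coefficient of variation.

    The coefficient of variation is undefined when the mean is zero,
    so this sketch assumes a clearly positive-lag category such as [p].
    """
    m = mean(tokens_ms)
    sd = stdev(tokens_ms)
    return {"mean_ms": m, "sd_ms": sd, "cv": sd / abs(m)}


# Invented [p] tokens for illustration only; not clinical data.
steady = [58.0, 62.0, 60.0, 61.0, 59.0]
variable = [30.0, 85.0, 55.0, 95.0, 35.0]

steady_stats = vot_summary(steady)
variable_stats = vot_summary(variable)
print(steady_stats)
print(variable_stats)
```

The two invented sets share the same mean, so only the spread separates them. This is why reporting the mean alone can hide the kind of timing instability described above.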