Machine Learning

Connectionist Temporal Classification (CTC)

Learn the alignment you were never given

Connectionist Temporal Classification (CTC) trains sequence models on unsegmented data by summing the probability of every valid alignment between a length-T input and a shorter label sequence, computed in O(T·L) by a forward-backward dynamic program.

  • IntroducedGraves et al., ICML 2006
  • Loss computeO(T·L)
  • Alignments summedall valid paths
  • Extra symbolblank (ε)
  • ConstraintT ≥ L + repeats

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The labeling problem CTC solves

Imagine training a network to transcribe speech. The microphone hands you 16,000 samples per second; a one-second clip becomes maybe 100 frames after feature extraction. The transcript is "cat" — three characters. Which frames are the c? Which are the a? Nobody knows, and labeling every frame by hand is both impossible and meaningless — the boundaries are fuzzy and the audio is noisy.

Classical sequence training needed exactly that frame-level alignment. You either hand-aligned the data (expensive, error-prone) or ran a separate forced-alignment pass with a hidden Markov model before you could train anything. Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber removed the requirement entirely in a 2006 ICML paper titled Connectionist Temporal Classification. The trick: don't pick one alignment. Treat the alignment as a hidden variable and sum the probability over all alignments that could have produced the transcript.

So the network emits one probability distribution per frame over the alphabet plus a special blank symbol. A frame can say "this is an a" or "this is blank — nothing here." Many different frame-by-frame outputs collapse to the same word. CTC trains the network to make the total probability of all collapsing-to-"cat" paths high, and it never has to commit to a single segmentation.

The collapse rule and the alignment grid

CTC defines a many-to-one map, written B, from a length-T frame sequence to a label string. The rule has two steps:

  1. Merge runs of the same label into one. aaa → a.
  2. Delete all blanks. a ε a → aa.

Order matters: merge first, then delete. That is why the blank is indispensable. To output the double letter in "hello" you need l ε l — a blank wedged between the two ls — so the merge step doesn't fold them into one. Without a blank, CTC could never produce any word with a repeated character.

The probability of a transcript is the sum over its preimage — every frame sequence that collapses to it:

P(y | x) = Σ over π in B⁻¹(y) of  Π_{t=1..T} p_t(π_t | x)

where p_t(k | x) is the softmax probability the network assigns to symbol k at frame t. The number of valid paths is exponential in T, so you can't enumerate them. The escape is dynamic programming on an extended label sequence.

Take the target y of length L and interleave blanks: a blank at the start, between every pair of labels, and at the end. The result y' has length 2L + 1. For target cat you get ε c ε a ε t ε (length 7). Now lay a grid: one column per frame (T columns) and one row per extended-label position (2L+1 rows). A valid alignment is a monotonic path through this grid from the top-left region to the bottom-right region. At each step the path may stay on the same row, drop one row, or — only when moving between two different non-blank labels — skip the blank and drop two rows.

The forward-backward recurrence

Define the forward variable α_t(s) as the total probability of all paths that, by frame t, have consumed the first s positions of the extended label y' and currently sit on position s. The recurrence is:

base:   α_1(1) = p_1(ε)          // start on a blank
        α_1(2) = p_1(y'_2)       // or on the first real label
        α_1(s) = 0  for s > 2

step:   α_t(s) = ( α_{t-1}(s) + α_{t-1}(s-1) ) · p_t(y'_s)
        // PLUS α_{t-1}(s-2) ONLY IF y'_s != ε AND y'_s != y'_{s-2}

The third term — the "skip" — is what lets the path jump over a separating blank when moving to a genuinely new label, but is forbidden between two copies of the same letter (so the mandatory blank can't be skipped). The total sequence probability is read off the last column:

P(y | x) = α_T(2L+1) + α_T(2L)
loss     = -ln P(y | x)

The backward variable β_t(s) is the mirror image, filled right-to-left. Together α and β give the per-cell occupation probability α_t(s)·β_t(s), and the gradient of the loss with respect to each softmax output has a clean closed form: subtract the expected blank-and-label occupancy from the predicted probability. That is the whole magic — the exponential sum and its exact gradient both come out of one O(T·L) grid sweep.

Complexity, in numbers

  • Naive sum over alignments: exponential. With an alphabet of V symbols and T frames there are up to V^T frame sequences. For T = 100 and V = 30 that is 30^100 — astronomically infeasible.
  • Forward-backward: O(T·L) time and space. The grid has (2L+1)·T cells, each updated from at most three predecessors in constant time. A 1,000-frame utterance with a 50-character target is about (101)·1000 ≈ 101,000 cell updates — microseconds on a GPU, dominated by the network forward pass, not the loss.
  • Length constraint. A target of length L with r adjacent repeated characters needs at least T ≥ L + r frames. "bookkeeper" (10 letters, 3 doubled pairs) needs T ≥ 13. Shorter inputs make P(y|x) = 0 and the log-loss explode to infinity.
  • Numerical range. Multiplying hundreds of probabilities underflows float32 (the smallest normal is ≈1.2e-38) within ~50 frames. Production CTC runs entirely in log-space using a log-sum-exp accumulator, trading the multiply for an add and a cheap exp.

CTC vs other sequence-training methods

CTCAttention seq2seqRNN-Transducer (RNN-T)HMM-GMM
Needs frame-level labelsNoNoNoYes (forced alignment)
AlignmentMonotonic, frame-synchronousFree / reorderableMonotonic, label-synchronousMonotonic
Output dependenciesConditionally independent given xFull (autoregressive)Models prediction historyMarkov order-1
Training cost per utteranceO(T·L)O(T·L) attention + decodeO(T·L·V) latticeEM, multiple passes
Streaming-friendlyYesHard (needs full input)Yes (designed for it)Yes
Failure modeSpiky output, weak LMHallucinate / loop on long audioMemory-heavy latticeBrittle, hand-engineered
Typical useSpeech, OCR, handwritingTranslation, summarizationStreaming ASR (Google, Apple)Pre-2012 ASR

The headline trade-off is the conditional-independence assumption. CTC's frame outputs don't see each other, so it leans on an external language model at decode time and tends to produce "spiky" distributions — confident blanks separated by sharp label peaks. Attention models capture output dependencies but lose the monotonic guarantee, which is why they can skip or repeat whole phrases on long recordings. RNN-T keeps CTC's monotonic alignment while adding a prediction network that models output history — it is the workhorse behind most on-device streaming dictation today.

JavaScript implementation

A log-space forward pass that returns the CTC loss. logProbs[t][k] is the network's log-softmax output for symbol k at frame t; blank is the index of the blank symbol; labels is the integer-encoded target.

const NEG_INF = -Infinity;

function logSumExp(a, b) {
  if (a === NEG_INF) return b;
  if (b === NEG_INF) return a;
  const m = Math.max(a, b);
  return m + Math.log(Math.exp(a - m) + Math.exp(b - m));
}

function ctcLoss(logProbs, labels, blank = 0) {
  const T = logProbs.length;
  // Build the extended label: blank, l1, blank, l2, ..., blank
  const ext = [blank];
  for (const c of labels) { ext.push(c); ext.push(blank); }
  const S = ext.length;                 // S = 2L + 1
  if (T < labels.length) return Infinity; // no valid alignment: P = 0, NLL = +∞

  // alpha[s] for the current frame, log-space
  let alpha = new Array(S).fill(NEG_INF);
  alpha[0] = logProbs[0][ext[0]];                 // start on blank
  if (S > 1) alpha[1] = logProbs[0][ext[1]];      // or first real label

  for (let t = 1; t < T; t++) {
    const next = new Array(S).fill(NEG_INF);
    for (let s = 0; s < S; s++) {
      let v = alpha[s];                            // stay
      if (s > 0) v = logSumExp(v, alpha[s - 1]);   // come from previous position
      // skip the separating blank — only to a NEW non-blank label
      if (s > 1 && ext[s] !== blank && ext[s] !== ext[s - 2]) {
        v = logSumExp(v, alpha[s - 2]);
      }
      next[s] = v + logProbs[t][ext[s]];
    }
    alpha = next;
  }

  // Total probability ends on the final blank or the final real label
  const logP = logSumExp(alpha[S - 1], S > 1 ? alpha[S - 2] : NEG_INF);
  return -logP;                                    // negative log-likelihood
}

// Greedy (best-path) decode: argmax per frame, then collapse.
function greedyDecode(logProbs, blank = 0) {
  const out = [];
  let prev = -1;
  for (const frame of logProbs) {
    let best = 0;
    for (let k = 1; k < frame.length; k++) if (frame[k] > frame[best]) best = k;
    if (best !== blank && best !== prev) out.push(best); // merge repeats, drop blanks
    prev = best;
  }
  return out;
}

Two things to notice. The forward pass never multiplies probabilities — every product becomes a sum in log-space, and every sum-over-paths becomes a logSumExp. And the skip term is gated by ext[s] !== ext[s-2], the single condition that enforces the mandatory blank between repeated letters.

Python implementation

The same forward pass in NumPy, plus the trick everyone actually uses in production — call the framework op. PyTorch's nn.CTCLoss is a hand-tuned C++/CUDA forward-backward that also returns the exact gradient.

import numpy as np

def ctc_loss(log_probs, labels, blank=0):
    """log_probs: (T, V) log-softmax. labels: list[int]. Returns NLL."""
    T = log_probs.shape[0]
    ext = [blank]
    for c in labels:
        ext += [c, blank]               # blank, l1, blank, l2, ..., blank
    S = len(ext)                         # 2L + 1
    if T < len(labels):
        return np.inf                    # no valid alignment

    NEG = -np.inf
    alpha = np.full(S, NEG)
    alpha[0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[1] = log_probs[0, ext[1]]

    def lse(a, b):
        if a == NEG: return b
        if b == NEG: return a
        m = max(a, b)
        return m + np.log(np.exp(a - m) + np.exp(b - m))

    for t in range(1, T):
        nxt = np.full(S, NEG)
        for s in range(S):
            v = alpha[s]
            if s > 0:
                v = lse(v, alpha[s - 1])
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                v = lse(v, alpha[s - 2])
            nxt[s] = v + log_probs[t, ext[s]]
        alpha = nxt

    log_p = lse(alpha[S - 1], alpha[S - 2] if S > 1 else NEG)
    return -log_p

# Production: let the framework do the backward pass for you.
import torch
def torch_ctc(logits, targets, input_lens, target_lens):
    # logits: (T, N, V) raw scores; CTCLoss wants log-probs over dim=-1
    log_probs = torch.log_softmax(logits, dim=-1)
    loss_fn = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    return loss_fn(log_probs, targets, input_lens, target_lens)

Note zero_infinity=True: when an input is shorter than its target the loss is genuinely +∞, and this flag clamps it (and its gradient) to zero so a single bad batch element doesn't poison the whole update with NaNs.

Variants and decoders worth knowing

Prefix beam search. Greedy decoding ignores that many low-probability paths can pile onto the same prefix and collectively beat the single best path. Prefix beam search keeps the top-B prefixes and, at each frame, tracks two probabilities per prefix — ending in blank vs. ending in a non-blank — so it correctly merges repeats. It folds in an external language model with a tunable weight and a word-insertion bonus. This is the standard high-accuracy CTC decoder.

RNN-Transducer (RNN-T). Graves's 2012 follow-up adds a prediction network that conditions on the labels emitted so far, lifting CTC's conditional-independence assumption while preserving the monotonic, streaming-friendly alignment. It is the dominant architecture for on-device speech recognition.

Joint CTC/attention. Train one encoder with both a CTC head and an attention decoder, weighting the two losses (typically λ ≈ 0.3 for CTC). CTC's monotonic constraint regularizes the attention alignment and speeds convergence; this is the default recipe in toolkits like ESPnet.

Gram-CTC and wordpiece CTC. Instead of single characters, the output units are character n-grams or subword pieces, shrinking T/L mismatch and capturing common letter sequences in one emission.

Wav2Vec 2.0 fine-tuning. The most common modern use of CTC isn't a from-scratch RNN at all — it's a thin linear CTC head bolted onto a frozen or fine-tuned self-supervised transformer encoder, reaching strong word-error rates with an hour of labeled audio.

Common bugs and edge cases

  • Working in probability space, not log space. Products of hundreds of probabilities underflow float32 to exactly 0 within ~50 frames, making the loss −ln 0 = ∞. Always accumulate in log-space with log-sum-exp.
  • Input shorter than the target. If T < L + repeats, no path exists, P = 0, and the loss is +∞. Filter these out in your data loader or set zero_infinity=True.
  • Wrong blank index. Frameworks disagree: PyTorch defaults to blank=0, so your vocabulary must reserve index 0 for blank and shift real symbols up by one. A mismatch silently trains the network to emit the wrong symbol as "nothing."
  • Forgetting the same-label skip guard. Dropping the ext[s] != ext[s-2] condition lets the path skip the mandatory blank between repeated letters, so the model can never learn to produce doubled characters correctly.
  • Feeding raw logits where log-probs are expected. CTCLoss wants log_softmax output, not logits. Passing logits silently inflates the loss and slows learning.
  • The "peaky" / spiky output trap. CTC happily concentrates all probability on blanks except for one sharp spike per character. This is normal, but it means raw posterior probabilities are poorly calibrated — don't use them as confidence scores without smoothing, and always decode with a language model for accuracy.
  • Repeated labels in the target itself. A target like [a, a] is fine, but remember the network must insert a blank between them at inference; if your training data preprocessing collapses repeats in the labels, the model can never recover them.

Frequently asked questions

What problem does CTC actually solve?

It removes the need for per-frame labels. Audio is sampled thousands of times a second but the transcript has only a few dozen characters, and nobody knows which frame produced which letter. CTC lets a network output one label per frame, then sums over every alignment that collapses to the correct transcript, so the model learns the alignment on its own.

What is the blank token in CTC and why is it needed?

The blank (often written ε) is an extra output symbol meaning "no label here." It does two jobs: it lets the network emit nothing on frames between characters, and it separates intentional repeats. To output 'hello' the network must emit l, then a blank, then l again — otherwise the collapse rule would merge the two l's into one.

How fast is the CTC loss to compute?

O(T·L) time and space, where T is the number of input frames and L the label length. The naive sum over all alignments is exponential — roughly V^T paths — but the forward-backward dynamic program collapses it to a grid of (2L+1)·T cells, each filled in constant time. A 1,000-frame utterance with a 50-character target is about 100,000 cell updates, microseconds on a GPU.

What is the difference between CTC and an attention seq2seq model?

CTC assumes outputs are conditionally independent given the input and produces them frame-synchronously with a monotonic alignment — perfect for speech and handwriting. Attention encoder-decoders model output dependencies and can reorder freely, which helps translation but can hallucinate or skip on long audio. Modern ASR systems often combine both (e.g. RNN-T or joint CTC/attention).

Why does CTC require the input to be at least as long as the output?

Each output label must be emitted on at least one frame, and consecutive identical labels need a blank between them. So a target of length L with r adjacent repeats needs at least L + r frames. If T is smaller than that, no valid alignment exists, the path probability is zero, and the log-loss is infinite — a common cause of NaNs.

How do you decode the network output back into text?

The cheapest decoder is greedy (best-path): take the argmax label at each frame, then apply the collapse rule — merge runs of identical labels and drop blanks. For better accuracy use prefix beam search, which sums probabilities of alignments that share a prefix and can fold in a language model. Greedy is O(T·V); beam search is O(T·B·V) for beam width B.